Simon Willison’s Weblog

Subscribe

Items in Jun, 2022

Filters: Year: 2022 × Month: Jun × Sorted by date


s3-ocr: Extract text from PDF files stored in an S3 bucket

I’ve released s3-ocr, a new tool that runs Amazon’s Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.

[... 1493 words]

The general idea of an “Islands” architecture is deceptively simple: render HTML pages on the server, and inject placeholders or slots around highly dynamic regions. These placeholders/slots contain the server-rendered HTML output from their corresponding widget. They denote regions that can then be “hydrated” on the client into small self-contained widgets, reusing their server-rendered initial HTML.

Jason Miller # 28th June 2022, 3:01 pm

The Magic Interview Question (via) Jeff Gothelf explains why “Tell me about the last time you [did something]” is the most valuable question you can ask when interviewing a user or potential user. # 28th June 2022, 2:26 pm

First impressions of DALL-E, generating images from text

I made it off the DALL-E waiting list a few days ago and I’ve been having an enormous amount of fun experimenting with it. Here are some notes on what I’ve learned so far (and a bunch of example images too).

[... 2102 words]

How Imagen Actually Works. Imagen is Google’s new text-to-image model, similar to (but possibly even more effective than) DALL-E. This article is the clearest explanation I’ve seen of how Imagen works: it uses Google’s existing T5 text encoder to convert the input sentence into an encoding that captures the semantic meaning of the sentence (including things like items being described as being on top of other items), then uses a trained diffusion model to generate a 64x64 image. That image is passed through two super-res models to increase the resolution to the final 1024x1024 output. # 23rd June 2022, 6:05 pm

Joining CSV files in your browser using Datasette Lite

I added a new feature to Datasette Lite—my version of Datasette that runs entirely in your browser using WebAssembly (previously): you can now use it to load one or more CSV files by URL, and then run SQL queries against them—including joins across data from multiple files.

[... 546 words]

The State of WebAssembly 2022. Colin Eberhardt talks through the results of the State of WebAssembly 2022 survey. Rust continues to dominate as the most popular language for working to WebAssembly, but Python has a notable increase of interest. # 20th June 2022, 6:07 pm

WarcDB (via) Florents Tselai built this tool for loading web crawl data stored in WARC (Web ARChive) format into a SQLite database for smaller-scale analysis with SQL, on top of my sqlite-utils Python library. # 19th June 2022, 6:08 pm

Weeknotes: datasette-socrata, and the last 10%...

... takes 90% of the work. I continue to work towards a preview of the new Datasette Cloud, and keep finding new “just one more things” to delay inviting in users.

[... 1214 words]

Becoming a good engineer is about collecting experience. Each project, even small ones, is a chance to add new techniques and tools to your toolbox. Where this delivers even more value is when you can solve problems by pairing techniques learned on one project with tools learned working on another. It all adds up.

Addy Osmani # 18th June 2022, 9:21 pm

Making Code Faster. Tim Bray’s detailed guide to using the Go profiler. # 13th June 2022, 7:40 pm

Twenty years of my blog

I started this blog on June 12th 2002—twenty years ago today! To celebrate two decades of blogging, I decided to pull together some highlights and dive down a self-indulgent nostalgia hole.

[... 4455 words]

A tiny web app to create images from OpenStreetMap maps

Earlier today I found myself wanting to programmatically generate some images of maps.

[... 1388 words]

The End of Localhost. swyx makes the argument for cloud-based development environments, and points out that many large companies—including Google, Facebook, Shopify and GitHub—have made the move already. I was responsible for the team maintaining the local development environment experience at Eventbrite for a while, and my conclusion is that with a large enough engineering team someone will ALWAYS find a new way to break their local environment: the idea of being able to bootstrap a fresh, guaranteed-to-work environment in the cloud at the click of a button could save SO much time and money. # 8th June 2022, 6:09 pm

Announcing Pyston-lite: our Python JIT as an extension module (via) The Pyston JIT can now be installed in any Python 3.8 virtual environment by running “pip install pyston_lite_autoload”—which includes a hook to automatically inject the JIT. I just tried a very rough benchmark against Datasette (ab -n 1000 -c 10) and got 391.20 requests/second without the JIT compared to 404.10 request/second with it. # 8th June 2022, 5:58 pm

Weeknotes: Datasette Cloud ready to preview

I made an absolute ton of progress building Datasette Cloud on Fly this week, and also had a bunch of fun playing with GPT-3.

[... 370 words]

How to use the GPT-3 language model

I ran a Twitter poll the other day asking if people had tried GPT-3 and why or why not. The winning option, by quite a long way, was “No, I don’t know how to”. So here’s how to try it out, for free, without needing to write any code.

[... 838 words]