Simon Willison’s Weblog



Recent entries

Litestream backups for Datasette Cloud (and weeknotes) 25 minutes ago

My main focus this week has been adding robust backups to the forthcoming Datasette Cloud.

Datasette Cloud is a SaaS service for Datasette. It allows people to create a private Datasette instance where they can upload data, visualize and transform it and share it with other members of their team. You can join the waiting list to try it out using this form.

I’m building Datasette Cloud on Fly, specifically on Fly Machines.

Security is a big concern for Datasette Cloud. Teams should only be able to access their own data—bugs where users accidentally (or maliciously) access data for another team should be protected against as much as possible.

To help guarantee that, I’ve designed Datasette Cloud so that each team gets their own, dedicated instance, running in a Firecracker VM managed by Fly. Their data lives in a dedicated volume.

Fly volumes already implement snapshot backups, but I’m interested in defence in depth. This is where Litestream comes in (coincidentally now part of Fly, although it wasn’t when I first selected it as my backup strategy).

I’m using Litestream to constantly backup the data for each Datasette Cloud team to an S3 bucket. In the case of a complete failure of a volume, I can restore data from a backup that should be at most a few seconds out of date. Litestream also gives me point-in-time backups, such that I can recover a previous version of the data within a configurable retention window.

Keeping backups isolated

Litestream works by writing a constant stream of pages from SQLite’s WAL (Write-Ahead Log) up to an S3 bucket. It needs the ability to both read and write from S3.

This requires making S3 credentials available within the containers that run Datasette and Litestream for each team account.

Credentials in those containers are not visible to the users of the software, but I still wanted to be confident that if the credentials leaked in some way the isolation between teams would be maintained.

Initially I thought about having a separate S3 bucket for each team, but it turns out AWS has a default limit of 100 buckets per account, and a hard limit of 1,000. I aspire to have more than 1,000 customers, so this limit makes a bucket-per-team seem like the wrong solution.

I’ve learned an absolute ton about S3 and AWS permissions building my s3-credentials tool for creating credentials for accessing S3.

One of the tricks I’ve learned is that it’s possible to create temporary, time-limited credentials that only work for a prefix (effectively a folder) within an S3 bucket.

This means I can run Litestream with credentials that are specific to the team—that can read and write only from the team-ID/ prefix in the S3 bucket I am using to store the backups.

Obtaining temporary credentials

My s3-credentials tool can create credentials for a prefix within an S3 bucket like this:

s3-credentials create my-bucket-for-backups \
  --duration 12h \
  --prefix team-56/

This command uses the sts.assume_role() AWS method to create credentials that allow access to that bucket, attaching this generated JSON policy to it in order to restrict access to the provided prefix.
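For illustration, the prefix restriction boils down to an inline session policy attached to the assume_role() call. Here’s a sketch of what such a policy document can look like (the bucket and prefix names are examples, and this is not necessarily the exact policy s3-credentials generates):

```python
import json

def prefix_policy(bucket, prefix):
    # Allow listing only keys under the prefix, and reading/writing
    # only objects whose keys start with the prefix
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }

print(json.dumps(prefix_policy("my-bucket-for-backups", "team-56/"), indent=2))
```

A policy like this, passed as the Policy parameter to sts.assume_role(), means the returned temporary credentials can never reach outside the prefix, even if they leak.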

I extracted the relevant Python code from s3-credentials and used it to create a private API endpoint in my Datasette Cloud management server which could return the temporary credentials needed by the team container.

With the endpoint in place, my code for launching a team container can do this:

  • Create the volume and machine for that team (if they do not yet exist)
  • Generate a signed secret token that the machine container can exchange for its S3 credentials
  • Launch the machine container, passing it the secret token
  • On launch, the container runs a script which exchanges that secret token for its 12 hour S3 credentials, using the private API endpoint I created
  • Those credentials are used to populate the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables used by Litestream
  • Start Litestream, which then starts Datasette
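The last three steps can be sketched in Python. Everything here is illustrative: the endpoint URL, the token variable and the shape of the credentials response are my guesses, not the real Datasette Cloud internals:

```python
import json
import os
import subprocess
import urllib.request

def fetch_credentials(endpoint, secret_token):
    # Exchange the long-lived signed secret token for temporary S3 credentials
    req = urllib.request.Request(
        endpoint, headers={"Authorization": f"Bearer {secret_token}"}
    )
    with urllib.request.urlopen(req) as response:
        return json.load(response)

def litestream_env(creds):
    # Litestream reads its S3 credentials from these environment variables
    return dict(
        os.environ,
        AWS_ACCESS_KEY_ID=creds["access_key_id"],
        AWS_SECRET_ACCESS_KEY=creds["secret_access_key"],
        AWS_SESSION_TOKEN=creds["session_token"],
    )

# At container startup (hypothetical URL and command line):
# creds = fetch_credentials("https://management-server/credentials", os.environ["SECRET_TOKEN"])
# subprocess.run(["litestream", "replicate", "-exec", "datasette serve ..."], env=litestream_env(creds))
```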

Restarting every 12 hours

You may be wondering why I bothered with that initial secret token—why not just pass the temporary AWS credentials to the container when I launch it?

The reason for this is that I need to be able to obtain fresh credentials every 12 hours.

A really neat feature of Fly Machines is that they support scale-to-zero. You can stop them, and Fly will automatically restart them the next time they receive traffic.

All you need to do is call sys.exit(0) in your Python code (or the equivalent in any other language) and Fly will stop your container... and then restart it again with a couple of seconds of cold start time the next time an HTTP request for your container hits the Fly router.

So far I’m mainly using this to avoid the cost of running containers when they aren’t actually in use. But there’s a neat benefit when it comes to Litestream too.

I’m using S3 credentials which expire after 12 hours. This means I need to periodically refresh the credentials and restart Litestream or it will stop being able to write to the S3 bucket.

After considering a few ways of doing this, I selected the simplest to implement: have Datasette call sys.exit(0) after ten hours, and let Fly restart the container causing my startup script to fetch freshly generated 12 hour credentials and pass them to Litestream.

I implemented this by adding it as a new setting to my existing datasette-scale-to-zero plugin. You can now configure that with "max-age": "10h" and it will shut down Datasette once the server has been running for that long.
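The core of a max-age setting can be sketched in a few lines (an illustration of the idea, not the plugin’s actual code):

```python
import sys
import time

SERVER_STARTED = time.monotonic()

def parse_max_age(value):
    # "10h" -> 36000 seconds; supports s/m/h/d suffixes
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    return int(value[:-1]) * units[value[-1]]

def maybe_exit(max_age):
    # Checked periodically: once the server has been running longer than
    # max_age, exit cleanly so Fly can restart the machine on the next request
    if time.monotonic() - SERVER_STARTED > parse_max_age(max_age):
        sys.exit(0)
```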

Why does this require my own secret token system? Because when the container is restarted, it needs to make an authenticated call to my endpoint to retrieve fresh S3 credentials. Fly persists environment variable secrets in the container across restarts, so that secret can be long-lived even though it is only ever exchanged for short-term S3 credentials.

I only just put the new backup system in place, so I’m exercising it a bit before I open things up to trial users—but so far it’s looking like a very robust solution to the problem.

s3-ocr improvements

I released a few new versions of s3-ocr this week, as part of my ongoing project working with the San Francisco Microscopical Society team to release a searchable version of their scanned document archives.

The two main improvements are:

  • A new --dry-run option to s3-ocr start which shows you what the tool will do without making any changes to your S3 bucket, or triggering any OCR jobs. #22
  • s3-ocr start used to fail with an error if running it would create more than 100 (or 600 depending on your region) concurrent OCR jobs. The tool now knows how to identify that error and pause and retry starting the jobs instead. #21
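The pause-and-retry behaviour can be sketched like this. The exception class is a stand-in for the limit-exceeded error Textract actually raises, and the attempt and delay numbers are invented:

```python
import time

class TooManyJobsError(Exception):
    # Stand-in for the concurrent-jobs limit error returned by AWS Textract
    pass

def start_with_retry(start_job, max_attempts=10, delay=10):
    # Try to start an OCR job; if AWS reports too many concurrent jobs,
    # pause and try again instead of failing outright
    for attempt in range(max_attempts):
        try:
            return start_job()
        except TooManyJobsError:
            time.sleep(delay)
    raise RuntimeError(f"Gave up after {max_attempts} attempts")
```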

The fix that took the most time is this: installations of the tool no longer arbitrarily fail to work depending on the environment you install them into!

Solving this took me the best part of a day. The short version is this: Click 8.1.0 introduced a new feature that lets you use @cli.command as a decorator instead of @cli.command(). This meant that installing s3-ocr in an environment that already had a previous version of Click would result in silent errors.

The solution is simple: pin to click>=8.1.0 in the project dependencies if you plan to use this new syntax.
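The underlying pattern is a decorator that works both bare and called with arguments. A stdlib-only sketch of that dual-mode idea (this illustrates what Click 8.1.0 added, it is not Click’s actual implementation):

```python
def command(name=None):
    # Supports both @command and @command("name"):
    # if applied directly to a function, act as the decorator itself
    if callable(name):
        func = name
        func.command_name = func.__name__
        return func

    def decorator(func):
        func.command_name = name or func.__name__
        return func

    return decorator

@command
def start():
    "Bare form: only works if the decorator supports dual mode"

@command("status")
def check():
    "Called form: works with plain decorator factories too"

print(start.command_name, check.command_name)  # start status
```

On Click older than 8.1.0 the bare form silently registers a broken command, which is why the failure was so hard to track down.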

If I’d read the Click changelog more closely I would have saved myself a whole lot of time.

Issues #25 and #26 detail the many false turns I took trying to figure this out.

More fun with GPT-3 and DALL-E

This tweet scored over a million impressions on Twitter:

As this got retweeted outside of my usual circles it started confusing people who thought the “prototype” was a working game, as opposed to a fake screenshot and a paragraph of descriptive text! I wasn’t kidding when I said I spent 60 seconds on this.

I also figured out how to use GPT-3 to write jq one-liners. I love jq but I have to look up how to use it every time, so having GPT-3 do the work for me is a pretty neat time saver. More on that in this TIL: Using GPT-3 to figure out jq recipes

Releases this week

TIL this week

Weeknotes: Joining the board of the Python Software Foundation 11 days ago

A few weeks ago I was elected to the board of directors for the Python Software Foundation.

I put myself up for election partly because I’ve found myself saying “I wish the PSF would help with ...” a few times over the years, and realized that joining the board could be a more useful way to actively participate, rather than shouting from the sidelines.

I was quite surprised to win. I wrote up a short manifesto—you can see that here—but the voting system lets voters select as many candidates as they like, so it’s possible I got in more on broad name recognition among the voters than based on what I wrote. I don’t think there’s a way to tell one way or the other.

I had my first board meeting on Wednesday, where I formally joined the board and got to vote on my first resolutions. This is my first time as a board member for a non-profit and I have learned a bunch already, with a lot more to go!

Board terms last three years. I expect it will take me at least a few months to get fully up to speed on how everything works.

As a board member, my primary responsibilities are to show up to the meetings, vote on resolutions, act as an ambassador for the PSF to the Python community and beyond and help both set the direction for the PSF and ensure that the PSF meets its goals and holds true to its values.

I’m embarrassed to admit that I wrote my election manifesto without a deep understanding of how the PSF operates and how much is possible for it to get done. Here’s the section I wrote about my goals should I be elected:

I believe there are problems facing the Python community that require dedicated resources beyond volunteer labour. I’d like the PSF to invest funding in the following areas in particular:

  • Improve Python onboarding. In coaching new developers I’ve found that the initial steps to getting started with a Python development environment can be a difficult hurdle to cross. I’d like to help direct PSF resources to tackling this problem, with a goal of making the experience of starting to learn Python as smooth as possible, no matter what platform the learner is using.
  • Make Python a great platform for distributing software. In building my own application, Datasette, in Python I’ve seen how difficult it can be to package up a Python application so that it can be installed by end-users, who aren’t ready to install Python and learn pip in order to try out a new piece of software. I’ve researched solutions for this for my own software using Homebrew, Docker, an Electron app and WASM/Pyodide. I’d like the PSF to invest in initiatives and documentation to make this as easy as possible, so that one of the reasons to build with Python is that distributing an application to end-users is already a solved problem.

I still think these are good ideas, and I hope to make progress on them during my term as a director—but I’m not going to start arguing for new initiatives until I’ve learned the ropes and fully understood the PSF’s abilities, current challenges and existing goals.

In figuring out how the board works, one of the most useful pages I stumbled across was this list of resolutions voted on by the board, dating back to 2001. There are over 1,600 of them! Browsing through them gave me a much better idea of the kind of things the board has the authority to do.

Scraping data into Datasette Lite

Because everything looks like a nail when you have a good hammer, I explored the board resolutions by loading them into Datasette. I tried a new trick this time: I scraped data from that page into a CSV file, then loaded up that CSV file in Datasette Lite via a GitHub Gist.

My scraper isn’t perfect—it misses about 150 resolutions because they don’t exactly fit the format it expects, but it was good enough for a proof of concept. I wrote that in a Jupyter Notebook which you can see here.

Here’s the CSV in a Gist. The great thing about Gists is that GitHub serve those files with the access-control-allow-origin: * HTTP header, which means you can load them cross-domain.

Here’s what you get if you paste the URL to that CSV into Datasette Lite (using this new feature I added last month):

A screenshot of the psf-resolutions table in Datasette, showing 1,654 rows

And here’s a SQL query that shows the sum total dollar amount from every resolution that mentions “Nigeria”:

with filtered as (
  select * from "psf-resolutions"
  where
    "dollars" is not null
    and "text" like '%' || :search || '%'
)
select
  'Total: $' || printf('%,d', sum(dollars)) as text,
  null as date
from filtered
union all
select
  text, date
from filtered;

A screenshot of the results of that query, returning 132 rows the top of which says Total: $163,849

I’m using a new-to-me trick here: I use a CTE to filter down to just the rows I am interested in, then I create a first row that sums the dollar amounts as the text column and leaves the date column null, then unions that against the rows from the query.
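Here’s the same pattern run against a toy table using Python’s bundled sqlite3 module (the table and dollar amounts are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table grants (text text, date text, dollars integer);
insert into grants values
  ('Grant for PyCon Nigeria', '2020-01-01', 3000),
  ('Grant for PyLadies Nigeria', '2021-06-01', 2000),
  ('Unrelated grant', '2021-07-01', 500);
""")

# First row is a synthetic total; the remaining rows come from the CTE
sql = """
with filtered as (
  select * from grants
  where dollars is not null
    and text like '%' || :search || '%'
)
select 'Total: $' || printf('%,d', sum(dollars)) as text, null as date from filtered
union all
select text, date from filtered;
"""
for row in conn.execute(sql, {"search": "Nigeria"}):
    print(row)
```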

Important note: These numbers aren’t actually particularly useful. Just because the PSF board voted on a resolution does not mean that the money made it to the grantee—there are apparently situations where the approved grant may not be properly claimed and transferred. Also, my scraper logic isn’t perfect. Plus the PSF spends a whole lot of money in ways that don’t show up in these resolutions.

So this is a fun hack, and a neat way to start getting a general idea of how the PSF works, but any numbers it produces should not be taken as the absolute truth.

As a general pattern though, I really like this workflow of generating CSV files, saving them to a Gist and then opening them directly in Datasette Lite. It provides a way to use Datasette to share and explore data without needing to involve any server-side systems (other than GitHub Gists) at all!

Big-endian bugs in sqlite-fts4

sqlite-fts4 is a small Python library I wrote that adds SQLite functions for calculating relevance scoring for full-text search using the FTS4 module that comes bundled with SQLite. I described that project in detail in Exploring search relevance algorithms with SQLite.

It’s a dependency of sqlite-utils so it has a pretty big install base, despite being relatively obscure.

This week I had a fascinating bug report from Sarah Julia Kriesch: Test test_underlying_decode_matchinfo fails on PPC64 and s390x on openSUSE.

The s390x is an IBM mainframe architecture and it uses a big-endian byte order, unlike all of the machines I use which are little-endian.

This is the first time I’ve encountered a big-endian vs. little-endian bug in my entire career! I was excited to dig in.

Here’s the relevant code:

import struct

def decode_matchinfo(buf):
    # buf is a bytestring of unsigned integers, each 4 bytes long
    return struct.unpack("I" * (len(buf) // 4), buf)

SQLite FTS4 provides a matchinfo binary string which you need to decode in order to calculate the relevance score. This code uses the struct standard library module to unpack that binary string into a list of integers.
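For example, packing four integers with the machine’s native byte order and decoding them round-trips cleanly:

```python
import struct

def decode_matchinfo(buf):
    # buf is a bytestring of unsigned 32-bit integers, each 4 bytes long
    return struct.unpack("I" * (len(buf) // 4), buf)

# struct.pack with "I" uses native byte order, just like the decode side
buf = struct.pack("I" * 4, 1, 2, 2, 2)
print(decode_matchinfo(buf))  # (1, 2, 2, 2)
```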

My initial attempt at fixing this turned out to be entirely incorrect.

I didn’t have a big-endian machine available for testing, and I assumed that the problem was caused by Python interpreting the bytes as the current architecture’s byte order. So I applied this fix:

    return struct.unpack(">" + ("I" * (len(buf) // 4)), buf)

The > prefix there ensures that struct will always interpret the bytes as big-endian. I wrote up a TIL and shipped 1.0.2 with the fix.

Sarah promptly got back to me and reported some new failing tests.

It turns out my fix was entirely incorrect—in fact, I’d broken something that previously was working just fine.

The clue is in the SQLite documentation for matchinfo (which I really should have checked):

The matchinfo function returns a blob value. If it is used within a query that does not use the full-text index (a “query by rowid” or “linear scan”), then the blob is zero bytes in size. Otherwise, the blob consists of zero or more 32-bit unsigned integers in machine byte-order (emphasis mine).

Looking more closely at the original bug report, the test that failed was this one:

import pytest

@pytest.mark.parametrize(
    "buf,expected",
    [
        (
            b"\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00",
            (1, 2, 2, 2),
        )
    ],
)
def test_underlying_decode_matchinfo(buf, expected):
    assert expected == decode_matchinfo(buf)

That test hard-codes a little-endian binary string and checks the output of my decode_matchinfo function. This is obviously going to fail on a big-endian system.

So my original behaviour was actually correct: I was parsing the string using the byte order of the architecture, and SQLite was providing the string in the byte order of the architecture. The only bug was in my test.

I reverted my previous fix and fixed the test instead:

import sys

import pytest

@pytest.mark.parametrize(
    "buf,expected",
    [
        (
            b"\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00"
            if sys.byteorder == "little"
            else b"\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02",
            (1, 2, 2, 2),
        )
    ],
)
def test_underlying_decode_matchinfo(buf, expected):
    assert expected == decode_matchinfo(buf)

sys.byteorder reports the byte order of the host system, so this test now passes on both little-endian and big-endian systems.

There was one remaining challenge: how to test this? I wasn’t going to make the same mistake of shipping a fix that hadn’t actually been exercised on the target architecture a second time.

After quite a bit of research (mainly throwing the terms docker and s390x into the GitHub code search engine and seeing what I could find) I figured out a fix. It turns out you can use Docker and QEMU to run an emulated s390x system—both on a Mac laptop and in GitHub Actions.

Short version:

docker run --rm --privileged multiarch/qemu-user-static:register --reset
docker run -it multiarch/ubuntu-core:s390x-focal /bin/bash

For the longer version, check my TIL: Emulating a big-endian s390x with QEMU.

Releases this week

TIL this week

Weeknotes: Datasette, sqlite-utils, Datasette Desktop 21 days ago

A flurry of releases this week, including a new Datasette alpha and a fixed Datasette Desktop.

datasette 0.62a1

Work on Datasette Cloud continues—the last 10% of the work needed for the beta launch is trending towards taking 90% of the time. It’s been driving all sorts of fixes to the wider Datasette ecosystem, which is nice.

I ran into a bug which would have been easier to investigate using Sentry. The datasette-sentry plugin wasn’t working right, and it turned out I needed a new handle_exception() plugin hook to fix it. This was the impetus I needed to push out a new Datasette alpha—I like to get new hooks into an alpha as quickly as possible so I can depend on that Datasette version from alpha releases of plugins.

Here are some other highlights from the alpha’s release notes:

  • The render_cell() plugin hook is now also passed a row argument, representing the sqlite3.Row object that is being rendered. (#1300)

A neat thing about Pluggy is that you can add new arguments to existing plugin hooks without breaking plugins that already use them.

  • New --nolock option for ignoring file locks when opening read-only databases. (#1744)

Since the very start of the project Datasette has suggested trying the following command to start exploring your Google Chrome history, which is stored using SQLite:

datasette ~/Library/Application\ Support/Google/Chrome/Default/History

I’m not sure when this changed, but I tried running the command recently and got the following error:

sqlite3.OperationalError: database is locked

Since Datasette opens databases in read-only mode I didn’t see why a lock like this should be respected. It turns out SQLite can be told to ignore locks like so:


So I added a --nolock option to Datasette which does exactly that:

datasette ~/Library/Application\ Support/Google/Chrome/Default/History --nolock

Inspired by 6 More Things I Learned Building Snowpack to 20,000 Stars (Part 2) by Fred K. Schott I finally set up a chat community for Datasette, using Discord.

It’s attracted 88 members already! You can join it here. I wrote detailed notes on how I configured it in this issue.

  • Database file downloads now implement conditional GET using ETags. (#1739)

This is a change I made to support Datasette Lite—I noticed that the WASM version of Datasette was downloading a fresh database every time, so I added ETag support to encourage browsers to avoid a duplicate download and use a cached copy of the database file instead, provided it hasn’t changed.
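Conditional GET with ETags is a simple protocol: send an ETag header with the response, and return 304 Not Modified when the client’s If-None-Match header matches. A minimal sketch (not Datasette’s actual implementation):

```python
import hashlib

def handle_download(request_headers, database_bytes):
    # Derive an ETag from the file contents; if the client already has
    # that exact version, skip the body and return 304 Not Modified
    etag = '"%s"' % hashlib.md5(database_bytes).hexdigest()
    if request_headers.get("if-none-match") == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, database_bytes

status, headers, body = handle_download({}, b"sqlite bytes")
status2, _, body2 = handle_download({"if-none-match": headers["ETag"]}, b"sqlite bytes")
print(status, status2)  # 200 304
```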

Datasette Desktop

Datasette Desktop was hanging on launch. Paul Everitt figured out a fix, which it took me way too long to get around to applying.

I finally shipped that in Datasette Desktop 0.2.2, but I wanted to reduce the chances of this happening again as much as possible. Datasette Desktop’s Electron tests used the spectron test harness, but that’s marked as deprecated.

I’m a big fan of Playwright and I was optimistic to see that it has support for testing Electron apps. I figured out how to use that with Datasette Desktop and run the tests in GitHub Actions: I wrote up what I learned in a TIL, Testing Electron apps with Playwright and GitHub Actions.

sqlite-utils 3.28

Annotated release notes:

davidleejy suggested the table.duplicate() method and contributed an implementation. This was the impetus for pushing out a fresh release.

I added the CLI equivalent, sqlite-utils duplicate.

  • sqlite_utils.utils.rows_from_file() is now a documented API. It can be used to read a sequence of dictionaries from a file-like object containing CSV, TSV, JSON or newline-delimited JSON. It can be passed an explicit format or can attempt to detect the format automatically. (#443)
  • sqlite_utils.utils.TypeTracker is now a documented API for detecting the likely column types for a sequence of string rows, see Detecting column types using TypeTracker. (#445)
  • sqlite_utils.utils.chunks() is now a documented API for splitting an iterator into chunks. (#451)

I have a policy that any time I need to use an undocumented method from sqlite-utils in some other project I file an issue to add that to the documented API surface.

I had used rows_from_file() and TypeTracker in datasette-socrata.

  • sqlite-utils enable-fts now has a --replace option for replacing the existing FTS configuration for a table. (#450)
  • The create-index, add-column and duplicate commands all now take a --ignore option for ignoring errors should the database not be in the right state for them to operate. (#450)

This was inspired by my TIL Ignoring errors in a section of a Bash script—a trick I had to figure out because one of my scripts needed to add columns and enable FTS but only if those changes had not been previously applied.
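The Bash trick itself is tiny: temporarily turn off exit-on-error around the commands that are allowed to fail:

```shell
#!/bin/bash
set -e  # stop the script on any error

echo "setup steps that must succeed"

set +e  # errors in this section are tolerated
false   # this command fails, but the script carries on
set -e  # back to strict mode

echo "still running"
```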

In looking into that I spotted inconsistencies in the design of the sqlite-utils commands, so I fixed those as much as I could while still maintaining backwards compatibility with the 3.x releases.

Releases this week

TIL this week

sqlite-comprehend: run AWS entity extraction against content in a SQLite database one month ago

I built a new tool this week: sqlite-comprehend, which passes text from a SQLite database through the AWS Comprehend entity extraction service and stores the returned entities.

I created this as a complement to my s3-ocr tool, which uses AWS Textract service to perform OCR against every PDF file in an S3 bucket.

Short version: given a database table full of text, run the following:

% pip install sqlite-comprehend
% sqlite-comprehend entities myblog.db blog_entry body --strip-tags
  [###---------------------------------]    9%  00:01:02

This will churn through every piece of text in the body column of the blog_entry table in the myblog.db SQLite database, strip any HTML tags (the --strip-tags option), submit it to AWS Comprehend, and store the extracted entities in the following tables:

The above table names link to a live demo produced by running the tool against all of the content in my blog.

Here are 225 mentions that Comprehend classified as the organization called “Mozilla”.

The tool tracks which rows have been processed already (in the blog_entry_comprehend_entities_done table), so you can run it multiple times and it will only process newly added rows.

AWS Comprehend pricing starts at $0.0001 per hundred characters. sqlite-comprehend only submits the first 5,000 characters of each row.

How the demo works

My live demo for this tool runs on a new Datasette instance. It hosts a complete copy of the data from my blog—data that lives in a Django/PostgreSQL database on Heroku, but is now mirrored to a SQLite database hosted by Datasette.

The demo runs out of my simonwillisonblog-backup GitHub repository.

A couple of years ago I realized that I’m no longer happy having any content I care about not stored in a Git repository. I want to track my changes! I also want really robust backups: GitHub mirror their repos to three different regions around the world, and having data in a Git repository makes mirroring it somewhere else as easy as running a git pull.

So I created simonwillisonblog-backup using a couple of my other tools: db-to-sqlite, which converts a PostgreSQL database to a SQLite database, and sqlite-diffable, which dumps out a SQLite database as a “diffable” directory of newline-delimited JSON files.

Here’s the simplest version of that pattern:

db-to-sqlite \
    'postgresql+psycopg2://user:pass@hostname:5432/dbname' \
    simonwillisonblog.db --all

This connects to PostgreSQL, loops through all of the database tables and converts them all to SQLite tables stored in simonwillisonblog.db.

sqlite-diffable dump simonwillisonblog.db simonwillisonblog --all

This converts that SQLite database into a directory of JSON files. Each table gets two files: table.metadata.json containing the table’s name, columns and schema and table.ndjson containing a newline-separated list of JSON arrays representing every row in that table.
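A sketch of how those two files can be derived from a table, using only the standard library (this mimics the shape of sqlite-diffable’s output, not its actual code):

```python
import json
import sqlite3

def diffable_dump(conn, table):
    # table.metadata.json content: name, column names and full schema
    columns = [row[1] for row in conn.execute(f"pragma table_info({table})")]
    schema = conn.execute(
        "select sql from sqlite_master where name = ?", [table]
    ).fetchone()[0]
    metadata = {"name": table, "columns": columns, "schema": schema}
    # table.ndjson content: one JSON array per row, newline-separated
    ndjson = "\n".join(
        json.dumps(list(row)) for row in conn.execute(f"select * from {table}")
    )
    return metadata, ndjson
```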

You can see these files for my blog’s database in the simonwillisonblog folder.

(My actual script is a little more complex, because I backup only selected tables and then run extra code to redact some of the fields.)

Since I set this up it’s captured over 600 changes I’ve applied to my blog’s database, all made via the regular Django admin interface.

This morning I extended the script to run sqlite-comprehend against my blog entries and deploy the resulting data using Datasette.

The concise version of the new script looks like this:

wget -q

This retrieves the previous version of the database. I do this to avoid being charged by AWS Comprehend for running entity extraction against rows I have already processed.

sqlite-diffable load simonwillisonblog.db simonwillisonblog --replace

This creates the simonwillisonblog.db database by loading in the JSON from the simonwillisonblog/ folder. I do it this way mainly to exercise the new sqlite-diffable load command I just added to that tool.

The --replace option ensures that any tables that already exist are replaced by a fresh copy (while leaving my existing comprehend entity extraction data intact).

sqlite-comprehend entities simonwillisonblog.db blog_entry title body --strip-tags

This runs sqlite-comprehend against the blog entries that have not yet been processed.

set +e
sqlite-utils enable-fts simonwillisonblog.db blog_series title summary --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_tag tag --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_quotation quotation source --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_entry title body --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_blogmark link_title via_title commentary --create-triggers --tokenize porter 2>/dev/null
set -e

This configures SQLite full-text search against each of those tables, using this pattern to suppress any errors that occur if the FTS tables already exist.

Setting up FTS in this way means I can use the datasette-search-all plugin to run searches like this one for aws across all of those tables at once.

datasette publish cloudrun simonwillisonblog.db \
  -m metadata.yml \
  --service simonwillisonblog \
  --install datasette-block-robots \
  --install datasette-graphql \
  --install datasette-search-all

This uses the datasette publish command to deploy the site to Google Cloud Run.

I’m adding two more plugins here: datasette-block-robots to avoid search engine crawlers indexing a duplicate copy of my blog’s content, and datasette-graphql to enable GraphQL queries against my data.

Here’s an example GraphQL query that returns my most recent blog entries that are tagged with datasette.

Releases this week

TIL this week

Using GPT-3 to explain how code works one month ago

One of my favourite uses for the GPT-3 AI language model is generating explanations of how code works. It’s shockingly effective at this: its training set clearly includes a vast amount of source code.

(I initially thought this was related to GitHub Copilot being built on GPT-3, but actually that’s built on a GPT-3 descendant called OpenAI Codex.)

Here are a few recent examples.

Explaining some Python

Ned Batchelder shared this brilliant Python code snippet on Twitter this morning:

import re

TOKEN_RX = r"""(?xm)
    (?P<string> ".*?"|'.*?'             )| 
    (?P<float>  \d*(\d\.|\.\d)\d*       )|
    (?P<int>    \d+                     )|
    (?P<id>     [_a-zA-Z][_a-zA-Z0-9]*  )|
    (?P<punct>  [(){}:\[\]=.,+*/-]      )|
    (           \#.*$                   )|
    (           \s+                     )
    """

def tokens(text):
    for match in re.finditer(TOKEN_RX, text):
        if match.lastgroup:
            yield (match.lastgroup, match[0])

TEXT = """
    x = 123 + "hello #99"  # ignore me!
    print(hello.bye[0] + 3.14, 'single')
    """

for kind, text in tokens(TEXT):
    print(f"{kind:7}: {text=}")

I pasted it into GPT-3 (here’s how to use the playground interface) and started asking questions.

This uses the feature of GPT-3 where you can get a response, then add more text and run it again to get further details.

Anything in bold is text that GPT-3 responded with. The rest are my prompts.

Explain this code:

import re

TOKEN_RX = r"""(?xm)
    (?P<string> ".*?"|'.*?'             )| 
    (?P<float>  \d*(\d\.|\.\d)\d*       )|
    (?P<int>    \d+                     )|
    (?P<id>     [_a-zA-Z][_a-zA-Z0-9]*  )|
    (?P<punct>  [(){}:\[\]=.,+*/-]      )|
    (           \#.*$                   )|
    (           \s+                     )
    """

def tokens(text):
    for match in re.finditer(TOKEN_RX, text):
        if match.lastgroup:
            yield (match.lastgroup, match[0])

TEXT = """
    x = 123 + "hello #99"  # ignore me!
    print(hello.bye[0] + 3.14, 'single')
    """

for kind, text in tokens(TEXT):
    print(f"{kind:7}: {text=}")

The code is defining a regular expression that will match different types of tokens in a string of text. The different groups defined in the regular expression will match different types of tokens, such as strings, floats, ints, ids, punctuation, and comments. The code will then iterate over the matches and yield the type of token and the text that was matched.

Explain the regular expression:

The regular expression is made up of a number of different groups, each of which match a different type of token. The first group, “string”, matches any text enclosed in double or single quotes. The second group, “float”, matches any text that looks like a floating point number. The third group, “int”, matches any text that looks like an integer. The fourth group, “id”, matches any text that looks like a variable name. The fifth group, “punct”, matches any text that looks like punctuation. The sixth group, “#.*$”, matches any text that looks like a comment. The seventh group, “\s+”, matches any text that looks like whitespace.

What does (?xm) do?

The (?xm) notation is a modifier that allows the regular expression to span multiple lines and to use comments.

Why does it use match.lastgroup?

The match.lastgroup variable contains the name of the last group that was matched by the regular expression. This is used to determine the type of token that was matched.

This is a really good, really clear explanation of what’s going on.
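Those answers check out. As a sanity check of my own (this experiment is mine, not part of the GPT-3 transcript), here's a minimal demonstration of what `(?x)` verbose mode and `match.lastgroup` actually do:

```python
import re

# (?x) lets the pattern contain whitespace and # comments;
# (?m) makes $ match at the end of each line (used by the comment group above).
RX = r"""(?xm)
    (?P<word>   [a-z]+ )|   # runs of letters
    (?P<number> \d+    )    # runs of digits
"""

for m in re.finditer(RX, "abc 123"):
    # lastgroup is the name of the group that matched this token
    print(m.lastgroup, m[0])
    # prints: word abc
    #         number 123
```

Unmatched characters (the space here) are simply skipped over by `re.finditer`, which is why the tokenizer above needs its catch-all whitespace group.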

I already knew how that code works though. Let’s try something that I genuinely didn’t understand.

Explaining some JavaScript

Chris Haynes tweeted an ingenious demo showing how the new Web Component Declarative Shadow DOM <slot> mechanism (described here) can be used to create a page that streams content from a server, filling in different regions of the page as the content arrives—without using any JavaScript!

His demo only works for me in Chrome. For other browsers, he provides an alternative version that loads a tiny polyfill.

The polyfill code lives in enamel-lite.js and I could not figure out how it works. So I asked GPT-3.

Explain this code:

class Enamel extends HTMLElement {
  attemptPolyfillDSD() {
    const dsd = this.querySelector('template[shadowroot]');

    if (dsd?.content) {
      const mode = dsd.getAttribute('shadowroot');
      this.attachShadow({ mode });
      this.shadowRoot.appendChild(dsd.content);
      dsd.remove();

      return true;
    }
    return false;
  }

  connectedCallback() {
    if (
      !HTMLTemplateElement.prototype.hasOwnProperty('shadowRoot') &&
      !this.attemptPolyfillDSD()
    ) {
      const _observer = new MutationObserver(() => {
        if (this.attemptPolyfillDSD()) {
          _observer.disconnect();
        }
      });
      _observer.observe(this, {
        childList: true,
      });
    }
  }
}

export default Enamel;

This code polyfills the Shadow DOM API for older browsers that do not support it.

OK, that sounds convincing, but it’s not enough information for me. So I added another prompt:

How does it do that?

The code uses a MutationObserver to watch for changes to the element’s children. When a new child is added, the code checks to see if that child is a <template> element with a ’shadowroot’ attribute. If so, the code creates a Shadow Root for the element and appends the template’s content to it. Finally, the code removes the template element from the DOM.

That is a really good explanation. After reading that, I went back to look at the code again and I now completely understand how it works.

Building some SQL

I wanted to build a SQL query that would return all of the rows in one table that were not represented in another table... but where the rows used a compound primary key covering two columns. I didn’t know the right syntax for this off the top of my head.

First, I needed to create a table with a compound primary key. Rather than look that up I asked GPT-3 directly:

SQLite create a pages table that has a compound primary key of type, id and a content column.

CREATE TABLE pages ( type TEXT, id TEXT, content TEXT, PRIMARY KEY (type, id) );

It seems to know about SQLite SQL syntax, which is handy. Turns out you create a compound primary key using PRIMARY KEY (col1, col2) at the end of the table definition.

I added a second line with my own CREATE TABLE, then asked it to build me a query:

CREATE TABLE pages_done ( type TEXT, id TEXT, PRIMARY KEY (type, id) );

SQL query returning rows in the pages table that do not have corresponding records in the pages_done table:

SELECT type, id, content FROM pages WHERE (type, id) NOT IN (SELECT type, id FROM pages_done)

This is useful! I thought that a WHERE ... NOT IN (SELECT pattern would be good here, but I wasn’t sure of the syntax for looking up two columns at once.
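To convince myself the row-value `NOT IN` syntax really works in SQLite (rather than taking GPT-3's word for it), a quick throwaway experiment using Python's built-in sqlite3 module:

```python
import sqlite3

# Two tables with the compound (type, id) primary key from above,
# plus a couple of sample rows - one processed, one not.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (type TEXT, id TEXT, content TEXT, PRIMARY KEY (type, id));
CREATE TABLE pages_done (type TEXT, id TEXT, PRIMARY KEY (type, id));
INSERT INTO pages VALUES ('post', '1', 'first'), ('post', '2', 'second');
INSERT INTO pages_done VALUES ('post', '1');
""")
rows = conn.execute("""
    SELECT type, id, content FROM pages
    WHERE (type, id) NOT IN (SELECT type, id FROM pages_done)
""").fetchall()
print(rows)  # [('post', '2', 'second')] - only the unprocessed row
```

(Row values like `(type, id)` need SQLite 3.15 or later, which any recent Python ships with.)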

I’m pretty sure this could instead be solved with a join... so I asked that too:

That query but using a join:

SELECT p.type,, p.content FROM pages p LEFT JOIN pages_done pd ON p.type = pd.type AND = WHERE pd.type IS NULL;

But which is better? I posed two more questions:

Which of those two queries is more efficient and why?

The first query is more efficient because it uses a WHERE clause with a subquery. The second query is less efficient because it uses a LEFT JOIN.

Why is a subquery more efficient?

A subquery is more efficient because it only returns the rows that meet the conditions specified in the WHERE clause. A LEFT JOIN returns all rows from the left table, even if there is no match in the right table.

As with everything GPT-3, this answer sounds very convincing! But is this actually true? I would expect the SQL query optimizer to be smart enough to optimize the LEFT JOIN to the point where it would run efficiently. I think GPT-3’s answer to that question is actually very misleading.

And that’s not surprising, because GPT-3 doesn’t actually know anything about anything at all. It’s a huge pattern generator. You can’t trust anything it says, because all it does is group words together into convincing looking shapes based on text that it’s seen before.

Once again, I’m reminded that tools like GPT-3 should be classified in the “bicycles for the mind” category. You still have to know how to pedal!

They’re fantastic tools for thinking, but to actually use their output effectively requires VERY deep knowledge—both of the subject matter in question, and of the way that the AI tools themselves work.

Leading questions

Got this great tip from Riley Goodside:

When prompting like this, it’s important (and difficult) to avoid leading questions. If you ask which of two queries is more efficient and why, it’s extremely unlikely to say “It doesn’t matter,” no matter how true that is.

I hadn’t considered this at all. I’ll be keeping that in mind in my future experiments.

Bonus: how about mathematics?

I decided to try one more thing: could it explain mathematical equations?

GitHub recently added LaTeX rendering to their Markdown implementation. I grabbed a formula at random from this page of documentation:

Explain this:

$$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$

This is the Cauchy-Schwarz inequality.

OK, so you know what it’s called. But how does it work?

I added “Explain the formula syntax step by step, using $$ wrapped around the formula steps:” (the $$ bit is explained in a moment):

$$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$

This is the Cauchy-Schwarz inequality.

$$\left( \sum_{k=1}^n a_k b_k \right)^2$$

This is the sum of all products of corresponding elements in the vectors a and b, squared.

$$\leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$

This is saying that this sum is less than or equal to the product of the sums of the squares of each vector.

Here’s why I added that bit about $$ to the prompt: doing so caused it to produce its answer in valid GitHub Markdown syntax. I pasted that into a Gist and rendered it, and it came out like this:

That same thing but with rendered mathematical equations for the different step explanations

Explaining configuration files

Here’s another prompt pattern I stumbled across that seems to work really well: “Explain every line of this config file:”. I tried it against the Lit example from the Vite 3.0 starter templates:

Explain every line of this config file:

import { defineConfig } from 'vite'

export default defineConfig({
  build: {
    lib: {
      entry: 'src/my-element.js',
      formats: ['es']
    },
    rollupOptions: {
      external: /^lit/
    }
  }
})
This config file is written in JavaScript, using the ViteJS library. The first line imports the “defineConfig” function from the ViteJS library. This function allows the user to define a configuration object for the build process.

The next line exports the “defineConfig” function as the default export. This means that when this file is imported, the “defineConfig” function will be the only thing that is imported.

The “defineConfig” function takes one argument, which is an object. This object contains the configuration for the build process. The “build” property contains the configuration for the build tool. The “lib” property contains the configuration for the library that will be built. The “entry” property contains the path to the entry file for the library. The “formats” property contains the formats that the library will be built in. In this case, the library will only be built in the “es” format.

The “rollupOptions” property contains the configuration for the Rollup tool. The “external” property contains a regular expression that matches the names of the dependencies that should be excluded from the build. In this case, all dependencies whose names start with “lit” will be excluded from the build.

s3-ocr: Extract text from PDF files stored in an S3 bucket one month ago

I’ve released s3-ocr, a new tool that runs Amazon’s Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.

You can search through a demo of 697 pages of OCRd text.

Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It charges $1.50 per thousand pages processed.

Why I built this

My initial need for this is a collaboration I have running with the San Francisco Microscopy Society. They’ve been digitizing their archives—which stretch back to 1870!—and were looking for help turning the digital scans into something more useful.

The archives are full of hand-written and type-written notes, scanned and stored as PDFs.

I decided to wrap my work up as a tool because I’m sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.

Running Textract directly against large numbers of files is somewhat inconvenient (here’s my earlier TIL about it). s3-ocr is my attempt to make it easier.

Tutorial: How I built that demo

The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection on the Internet Archive:

I started by downloading PDFs of those three files.

Then I installed the two tools I needed:

pip install s3-ocr s3-credentials

I used my s3-credentials tool to create a new S3 bucket and credentials with the ability to write files to it, with the new --statement option (which I released today) to add textract permissions to the generated credentials:

s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' --create-bucket > ocr.json

(Note that you don’t need to use s3-credentials at all if you have AWS credentials configured on your machine with root access to your account—just leave off the -a ocr.json options in the following examples.)

s3-ocr-demo is now a bucket I can use for the demo. ocr.json contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.

I uploaded my three PDFs to the bucket:

s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json

(I often use Transmit as a GUI for this kind of operation.)

Then I kicked off OCR jobs against every PDF file in the bucket:

% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9

The --all option scans for any file with a .pdf extension. You can pass explicit file names instead if you just want to process one or two files at a time.

This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.

The job IDs can be used to inspect the progress of each task like so:

% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
  "DocumentMetadata": {
    "Pages": 244
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}

Once the job completed, I could preview the text extracted from the PDF like so:

% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf

Author of "Modern Magic," etc.

To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:

% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34

I then published the resulting pages.db SQLite database using Datasette—you can explore it here.
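In SQLite terms, what the index step sets up amounts to something like the following sketch: a table of per-page text plus an FTS5 full-text index over it. The column names here are illustrative, not s3-ocr's exact schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real tool writes pages.db
conn.executescript("""
CREATE TABLE pages (path TEXT, page INTEGER, text TEXT);
-- An external-content FTS5 index: the searchable text lives in pages
CREATE VIRTUAL TABLE pages_fts USING fts5(text, content='pages');
""")
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    ("latestmagicbeing00hoff.pdf", 1, "Author of Modern Magic"),
)
# Rebuild the FTS index from the content table
conn.execute("INSERT INTO pages_fts(pages_fts) VALUES ('rebuild')")
hits = conn.execute(
    "SELECT rowid FROM pages_fts WHERE pages_fts MATCH 'magic'"
).fetchall()
print(hits)  # [(1,)] - rowid of the matching page
```

Datasette then exposes that FTS index through its search UI.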

How s3-ocr works

s3-ocr works by calling Amazon’s S3 and Textract APIs.

Textract only works against PDF files in asynchronous mode: you call an API endpoint to tell it “start running OCR against this PDF file in this S3 bucket”, then wait for it to finish—which can take several minutes.

It defaults to storing the OCR results in its own storage, expiring after seven days. You can instead tell it to store them in your own S3 bucket—I use that option in s3-ocr.

A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also want to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.

I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called textract-output/JOB_ID/1-?—but there’s no indication as to which PDF file the results correspond to.

My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.

If you have a file called latestmagicbeing00hoff.pdf the start command will create a new file called latestmagicbeing00hoff.pdf.s3-ocr.json with the following content:

{
  "job_id": "f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402",
  "etag": "\"d79af487579dcbbef26c9b3be763eb5e-2\""
}
This associates the job ID with the PDF file. It also records the original ETag of the PDF file—this is so in the future I can implement a system that can re-run OCR if the PDF has been updated.

The existence of these files lets me do two things:

  • If you run s3-ocr start s3-ocr-demo --all it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the .s3-ocr.json file.
  • When you later ask for the results of the OCR it can use these files to associate the PDF with the results.
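The first of those checks boils down to a pure function over the bucket's keys. A sketch (simplified from what s3-ocr actually does, which lists the bucket via the S3 API):

```python
def pdfs_needing_ocr(keys):
    """Return the PDF keys that don't yet have a .s3-ocr.json marker."""
    key_set = set(keys)
    return [
        k for k in keys
        if k.lower().endswith(".pdf") and k + ".s3-ocr.json" not in key_set
    ]

keys = [
    "latestmagicbeing00hoff.pdf",
    "latestmagicbeing00hoff.pdf.s3-ocr.json",
    "practicalmagicia00harr.pdf",
]
print(pdfs_needing_ocr(keys))  # ['practicalmagicia00harr.pdf']
```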

Scattering .s3-ocr.json files all over the place feels a little messy, so I have an open issue considering moving them all to an s3-ocr/ prefix in the bucket instead.

Try it and let me know what you think

This is a brand new project, but I think it’s ready for other people to start trying it out.

I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopy Society archive and it seemed to work well!

If you try this out and it works (or it doesn’t work) please let me know via Twitter or GitHub.

A challenging example page

Here’s one of the more challenging pages I processed using Textract:

A very old page of difficult to read handwriting

Here’s the result:

In. In J a ... the Joe 14
Lalpa spinosa, Eggt bud development. of
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining & mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
forby, Presidents address
pongida, difficulties of classification
tage, american adjustable concentric
ttlese staining & mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, & havi cula crassinervis 265
falicylic acid u movorcopy
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
tap, circulation of the
Jars, structure of the genus Brisinga
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process

Releases this week

TIL this week

11th August 2022

  • datasette on Open Source Insights (via) Open Source Insights is “an experimental service developed and hosted by Google to help developers better understand the structure, security, and construction of open source software packages”. It calculates scores for packages using various automated heuristics. A JSON version of the resulting score card can be accessed using{package_name}/v/ #11th August 2022, 1:06 am
  • sethmlarson/pypi-data (via) Seth Michael Larson uses GitHub releases to publish a ~325MB (gzipped to ~95MB) SQLite database on a roughly monthly basis that contains records of 370,000+ PyPI packages plus their OpenSSF score card metrics. It’s a really interesting dataset, but also a neat way of packaging and distributing data—the scripts Seth uses to generate the database file are included in the repository. #11th August 2022, 1:02 am

10th August 2022

  • Let websites framebust out of native apps (via) Adrian Holovaty makes a compelling case that it is Not OK that we allow native mobile apps to embed our websites in their own browsers, including the ability for them to modify and intercept those pages (it turned out today that Instagram injects extra JavaScript into pages loaded within the Instagram in-app browser). He compares this to frame-busting on the regular web, and proposes that the X-Frame-Options: DENY header which browsers support to prevent a page from being framed should be upgraded to apply to native embedded browsers as well.

    I’m not convinced that reusing X-Frame-Options: DENY would be the best approach—I think it would break too many existing legitimate uses—but a similar option (or a similar header) specifically for native apps which causes pages to load in the native OS browser instead sounds like a fantastic idea to me. #10th August 2022, 10:29 pm
  • Introducing sqlite-http: A SQLite extension for making HTTP requests (via) Characteristically thoughtful SQLite extension from Alex, following his sqlite-html extension from a few days ago. sqlite-http lets you make HTTP requests from SQLite—both as a SQL function that returns a string, and as a table-valued SQL function that lets you independently access the body, headers and even the timing data for the request.

    This write-up is excellent: it provides interactive demos but also shows how additional SQLite extensions such as the new-to-me “define” extension can be combined with sqlite-http to create custom functions for parsing and processing HTML. #10th August 2022, 10:22 pm
  • How SQLite Helps You Do ACID (via) Ben Johnson’s series of posts explaining the internals of SQLite continues with a deep look at how the rollback journal works. I’m learning SO much from this series. #10th August 2022, 3:39 pm
  • curl-impersonate (via) “A special build of curl that can impersonate the four major browsers: Chrome, Edge, Safari & Firefox. curl-impersonate is able to perform TLS and HTTP handshakes that are identical to that of a real browser.”

    I hadn’t realized that it’s become increasingly common for sites to use fingerprinting of TLS and HTTP handshakes to block crawlers. curl-impersonate attempts to impersonate browsers much more accurately, using tricks like compiling with Firefox’s nss TLS library and Chrome’s BoringSSL. #10th August 2022, 3:34 pm

9th August 2022

  • sqlite-zstd: Transparent dictionary-based row-level compression for SQLite. Interesting SQLite extension from phiresky, the author of that amazing SQLite WASM hack from a while ago which could fetch subsets of a large SQLite database using the HTTP range header. This extension, written in Rust, implements row-level compression for a SQLite table by creating compression dictionaries for larger chunks of the table, providing better results than just running compression against each row value individually. #9th August 2022, 9:23 pm

6th August 2022

  • Microsoft® Open Source Software (OSS) Secure Supply Chain (SSC) Framework Simplified Requirements. This is really good: don’t get distracted by the acronyms, skip past the intro and head straight to the framework practices section, which talks about things like keeping copies of the packages you depend on, running scanners, tracking package updates and most importantly keeping an inventory of the open source packages you work with so you can quickly respond to things like log4j.

    I feel like I say this a lot these days, but if you had told teenage-me that Microsoft would be publishing genuinely useful non-FUD guides to open source supply chain security by 2022 I don’t think I would have believed you. #6th August 2022, 4:49 pm

4th August 2022

  • Your documentation is complete when someone can use your module without ever having to look at its code. This is very important. This makes it possible for you to separate your module’s documented interface from its internal implementation (guts). This is good because it means that you are free to change the module’s internals as long as the interface remains the same.

    Remember: the documentation, not the code, defines what a module does.

    Ken Williams # 4th August 2022, 3:50 pm

3rd August 2022

  • Introducing sqlite-html: query, parse, and generate HTML in SQLite (via) Another brilliant SQLite extension module from Alex Garcia, this time written in Go. sqlite-html adds a whole family of functions to SQLite for parsing and constructing HTML strings, built on the Go goquery and cascadia libraries. Once again, Alex uses an Observable notebook to describe the new features, with embedded interactive examples that are backed by a Datasette instance running in Fly. #3rd August 2022, 5:31 pm

2nd August 2022

  • How I Used DALL·E 2 to Generate The Logo for OctoSQL (via) Jacob Martin gives a blow-by-blow account of his attempts at creating a logo for his OctoSQL project using DALL-E, spending $30 of credits and making extensive use of both the “variations” feature and the tool that lets you request modifications to existing images by painting over parts you want to regenerate. Really interesting to read as an example of a “real world” DALL-E project. #2nd August 2022, 9:12 pm

1st August 2022

  • storysniffer (via) Ben Welsh built a small Python library that guesses if a URL points to an article on a news website, or if it’s more likely to be a category page or /about page or similar. I really like this as an example of what you can do with a tiny machine learning model: the model is bundled as a ~3MB pickle file as part of the package, and the repository includes the Jupyter notebook that was used to train it. #1st August 2022, 11:40 pm

31st July 2022

  • Cleaning data with sqlite-utils and Datasette (via) I wrote a new tutorial for the Datasette website, showing how to use sqlite-utils to import a CSV file, clean up the resulting schema, fix date formats and extract some of the columns into a separate table. It’s accompanied by a ten minute video originally recorded for the HYTRADBOI conference. #31st July 2022, 7:57 pm

30th July 2022

  • GAS-ICS-Sync (via) Google Calendar can subscribe to ICS calendar feeds... but polls for updates less than once every 24 hours (as far as I can tell) greatly limiting their usefulness. Derek Antrican wrote a script using Google App Script which fixes this by polling calendar URLs more often and writing them to your calendar via the write API. #30th July 2022, 11:47 pm
  • GPSJam (via) John Wiseman’s “Daily maps of GPS interference” —a beautiful interactive globe (powered by Mapbox GL) which you can use to see points of heaviest GPS interference over a 24 hour period, using data collected from commercial airline radios by ADS-B Exchange. “From what I can tell the most common reason for aircraft GPS systems to have degraded accuracy is jamming by military systems. At least, the vast majority of aircraft that I see with bad GPS accuracy are flying near conflict zones where GPS jamming is known to occur.” #30th July 2022, 7:51 pm
  • Introducing sqlite-lines - a SQLite extension for reading files line-by-line (via) Alex Garcia wrote a brilliant C module for SQLIte which adds functions (and a table-valued function) for efficiently reading newline-delimited text into SQLite. When combined with SQLite’s built-in JSON features this means you can read a huge newline-delimited JSON file into SQLite in a streaming fashion so it doesn’t exhaust memory for a large file. Alex also compiled the extension to WebAssembly, and his post here is an Observable notebook post that lets you exercise the code directly. #30th July 2022, 7:18 pm