Simon Willison’s Weblog

Weeknotes: Apache proxies in Docker containers, refactoring Datasette

Updates to six major projects this week, plus finally some concrete progress towards Datasette 1.0.

Fixing Datasette’s proxy bugs

Now that Datasette has had its fourth birthday I’ve decided to really push towards hitting the 1.0 milestone. The key property of that release will be a stable JSON API, stable plugin hooks and a stable, documented context for custom templates. There’s quite a lot of mostly unexciting work needed to get there.

As I work through the issues in that milestone I’m encountering some that I filed more than two years ago!

Two of those made it into the Datasette 0.59.3 bug fix release earlier this week.

The majority of the work in that release though related to Datasette’s base_url feature, designed to help people who run Datasette behind a proxy.

base_url lets you run Datasette like this:

datasette --setting base_url=/prefix/ fixtures.db

When you do this, Datasette will change its URLs to start with that prefix—so the homepage will live at /prefix/, the database index page at /prefix/fixtures/, tables at /prefix/fixtures/facetable etc.

The reason you would want this is if you are running a larger website, and you intend to proxy traffic to /prefix/ to a separate Datasette instance.

The Datasette documentation includes suggested nginx and Apache configurations for doing exactly that.
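The nginx version of that kind of configuration is roughly this shape—a sketch of the idea rather than the exact config from the docs, with the port and prefix as illustrative values:

```nginx
location /prefix/ {
  # Forward /prefix/ traffic to the Datasette instance; proxying to
  # /prefix/ on the backend matches base_url=/prefix/ so Datasette
  # sees the same paths the browser requested
  proxy_pass http://127.0.0.1:8001/prefix/;
  proxy_set_header Host $host;
}
```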

This feature has been a magnet for bugs over the years! People keep finding new parts of the Datasette interface that fail to link to the correct pages when run in this mode.

The principal cause of these bugs is that I don’t use Datasette in this way myself, so I wasn’t testing it nearly as thoroughly as it needed to be.

So the first step in finally solving these issues once and for all was to get my own instance of Datasette up and running behind an Apache proxy.

Since I like to deploy live demos to Cloud Run, I decided to try and run Apache and Datasette in the same container. This took a lot of figuring out. You can follow my progress on this in these two issue threads:

The short version: I got it working! My Docker implementation now lives in the demos/apache-proxy directory and the live demo itself is deployed to datasette-apache-proxy-demo.fly.dev/prefix/.

(I ended up deploying it to Fly after running into a bug when deployed to Cloud Run that I couldn’t replicate on my own laptop.)

My final implementation uses a Debian base container with Supervisord to manage the two processes.
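A supervisord configuration for that kind of two-process container looks roughly like this—the program names, port and file paths here are illustrative, not the exact demo setup:

```ini
[supervisord]
; Run in the foreground, as required for a container entrypoint
nodaemon=true

[program:apache]
; Apache terminates incoming traffic and proxies /prefix/ to Datasette
command=apache2ctl -D FOREGROUND

[program:datasette]
; Datasette listens on an internal port with the matching base_url
command=datasette -p 8001 --setting base_url=/prefix/ /app/fixtures.db
```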

With a working live environment, I was finally able to track down the root cause of the bugs. My notes on #1519: base_url is omitted in JSON and CSV views document how I found and solved them, and updated the associated test to hopefully avoid them ever coming back in the future.

The big Datasette table refactor

The single most complicated part of the Datasette codebase is the code behind the table view—the page that lets you browse, facet, search, filter and paginate through the contents of a table (this page here).

It’s got very thorough tests, but the actual implementation is mostly a 600-line class method.

It was already difficult to work with, but the changes I want to make for Datasette 1.0 have proven too much for it. I need to refactor.

Apart from making that view easier to change and maintain, a major goal I have is for it to support a much more flexible JSON syntax. I want the JSON version to default to just returning minimal information about the table, then allow ?_extra=x parameters to opt into additional information—like facets, suggested facets, full counts, SQL schema information and so on.

This means I want to break up that 600-line method into a bunch of separate methods, each of which can be opted into by the calling code.

The HTML interface should then build on top of the JSON, requesting the extras that it knows it will need and passing the resulting data through to the template. This helps solve the challenge of having a stable template context that I can document in advance of Datasette 1.0.
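The ?_extra= pattern can be sketched like this—a hypothetical illustration of the opt-in idea, not Datasette’s actual implementation (the extra names and functions are invented):

```python
import asyncio

# Each "extra" is an async producer that the caller can opt into by name
async def extra_count():
    return 1000

async def extra_suggested_facets():
    return ["state", "city"]

EXTRAS = {
    "count": extra_count,
    "suggested_facets": extra_suggested_facets,
}

async def table_json(requested_extras):
    # The default response is minimal...
    response = {"rows": ["..."]}
    # ...and only the requested extras run, in parallel
    names = [n for n in requested_extras if n in EXTRAS]
    values = await asyncio.gather(*(EXTRAS[n]() for n in names))
    response.update(dict(zip(names, values)))
    return response

result = asyncio.run(table_json(["count"]))
print(result)  # → {'rows': ['...'], 'count': 1000}
```

The HTML view then becomes just another caller that asks for the extras its template needs.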

I’ve been putting this off for over a year now, because it’s a lot of work. But no longer! This week I finally started to get stuck in.

I don’t know if I’ll stick with it, but my initial attempt at this is a little unconventional. Inspired by how pytest fixtures work I’m experimenting with a form of dependency injection, in a new (very alpha) library I’ve released called asyncinject.

The key idea behind asyncinject is to provide a way for class methods to indicate their dependencies as named parameters, in the same way as pytest fixtures do.

When you call a method, the code can spot which dependencies have not yet been resolved and execute them before executing the method.
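The core idea can be sketched in a few lines—this is my own illustrative toy, not asyncinject’s actual API, and unlike the real library it makes no attempt to cache shared dependencies:

```python
import asyncio
import inspect

class Resolver:
    """Toy dependency injection: a method's parameter names refer to
    other async methods on the same class, which run first."""

    async def resolve(self, name):
        method = getattr(self, name)
        # Dependencies are parameters that match other method names
        deps = [
            p for p in inspect.signature(method).parameters
            if hasattr(self, p)
        ]
        # Unresolved dependencies execute concurrently
        results = await asyncio.gather(*(self.resolve(d) for d in deps))
        return await method(**dict(zip(deps, results)))

class TableView(Resolver):
    async def rows(self):
        return ["row1", "row2"]

    async def count(self):
        return 2

    async def response(self, rows, count):
        # rows and count are injected, having run in parallel
        return {"rows": rows, "count": count}

result = asyncio.run(TableView().resolve("response"))
print(result)  # → {'rows': ['row1', 'row2'], 'count': 2}
```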

Crucially, since they are all async def methods they can be executed in parallel. I’m cautiously excited about this—Datasette has a bunch of opportunities for parallel queries—fetching a single page of table rows, calculating a count(*) for the entire table, executing requested facets and calculating suggested facets are all queries that could potentially run in parallel rather than in serial.

What about the GIL, you might ask? Datasette’s database queries are handled by the sqlite3 module, and that module releases the GIL once it gets into SQLite C code. So theoretically I should be able to use more than one core for all of this.
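A minimal sketch of that kind of parallelism, using the standard library and a throwaway database (the table and queries are illustrative stand-ins for a page of rows, a count and a facet):

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Build a throwaway database file to query against
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE facetable (id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO facetable VALUES (?, ?)",
    [(i, "CA" if i % 2 else "NY") for i in range(1000)],
)
conn.commit()
conn.close()

def run_query(sql):
    # Each thread gets its own connection; sqlite3 releases the GIL
    # while SQLite's C code executes the query
    with sqlite3.connect(path) as db:
        return db.execute(sql).fetchall()

queries = [
    "SELECT * FROM facetable LIMIT 10",                       # page of rows
    "SELECT count(*) FROM facetable",                         # full count
    "SELECT state, count(*) FROM facetable GROUP BY state",   # a facet
]
# The three queries run on separate threads rather than in serial
with ThreadPoolExecutor() as pool:
    rows, count, facet = pool.map(run_query, queries)

print(count)  # → [(1000,)]
```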

The asyncinject README has more details, including code examples. This may turn out to be a terrible idea! But it’s really fun to explore, and I’ll be able to tell for sure if this is a useful, maintainable and performant approach once I have Datasette’s table view running on top of it.

git-history and sqlite-utils

I made some big improvements to my git-history tool, which automates the process of turning a JSON (or other) file that has been version-tracked in a GitHub repository (see Git scraping) into a SQLite database that can be used to explore changes to it over time.

The biggest was a major change to the database schema. Previously, the tool used full Git SHA hashes as foreign keys in the largest table.

The problem here is that a SHA hash string is 40 characters long, and if they are being used as a foreign key that’s a LOT of extra weight added to the largest table.

sqlite-utils has a table.lookup() method which is designed to make creating “lookup” tables—where a string is stored in a unique column but an integer ID can be used for things like foreign keys—as easy as possible.
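The pattern that table.lookup() wraps can be shown with just the standard library sqlite3 module—this is a sketch of the concept, not sqlite-utils’ implementation:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Each distinct hash is stored once, with a small integer primary key
db.execute("CREATE TABLE commits (id INTEGER PRIMARY KEY, hash TEXT UNIQUE)")

def lookup(hash_):
    # Insert the string if it is new, then return its integer id -
    # that id is what goes in the foreign key column of the big table
    db.execute("INSERT OR IGNORE INTO commits (hash) VALUES (?)", (hash_,))
    return db.execute(
        "SELECT id FROM commits WHERE hash = ?", (hash_,)
    ).fetchone()[0]

first = lookup("a" * 40)
again = lookup("a" * 40)  # same row, no duplicate created
print(first, again)  # → 1 1
```

Storing a 40-character SHA once and referencing it by integer is what keeps the largest table small.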

That method was previously quite limited, but in sqlite-utils 3.18 and 3.19—both released this week—I expanded it to cover the more advanced needs of my git-history tool.

The great thing about building stuff on top of your own libraries is that you can discover new features that you need along the way—and then ship them promptly without them blocking your progress!

Some other highlights

  • s3-credentials 0.6 adds a --dry-run option that you can use to show what the tool would do without making any actual changes to your AWS account. I found myself wanting this while continuing to work on the ability to specify a folder prefix within S3 that the bucket credentials should be limited to.
  • datasette-publish-vercel 0.12 applies some pull requests from Romain Clement that I had left unreviewed for far too long, and adds the ability to customize the vercel.json file used for the deployment—useful for things like setting up additional custom redirects.
  • datasette-graphql 2.0 updates that plugin to Graphene 3.0, a major update to that library. I had to break backwards compatibility in very minor ways, hence the 2.0 version number.
  • csvs-to-sqlite 1.3 is the first release of that tool in just over a year. William Rowell contributed a new feature that allows you to populate “fixed” database columns on your imported records, see PR #81 for details.

TIL this week

Releases this week