Weeknotes: Apache proxies in Docker containers, refactoring Datasette
22nd November 2021
Updates to six major projects this week, plus finally some concrete progress towards Datasette 1.0.
Fixing Datasette’s proxy bugs
Now that Datasette has had its fourth birthday I’ve decided to really push towards hitting the 1.0 milestone. The key property of that release will be a stable JSON API, stable plugin hooks and a stable, documented context for custom templates. There’s quite a lot of mostly unexciting work needed to get there.
As I work through the issues in that milestone I’m encountering some that I filed more than two years ago!
Two of those made it into the Datasette 0.59.3 bug fix release earlier this week.
Most of the work in that release, though, related to Datasette’s base_url feature, designed to help people who run Datasette behind a proxy.
base_url lets you run Datasette like this:
datasette --setting base_url=/prefix/ fixtures.db
When you do this, Datasette will change its URLs to start with that prefix, so the homepage will live at /prefix/, the database index page at /prefix/fixtures/, and table pages beneath that.
The reason you would want this is if you are running a larger website, and you intend to proxy traffic to /prefix/ to a separate Datasette instance.
The Datasette documentation includes suggested nginx and Apache configurations for doing exactly that.
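To make the prefixing behaviour concrete, here is a minimal sketch of the idea in plain Python. The `prefixed()` helper is hypothetical and is not Datasette’s own code. It just shows why every link has to pass through one function that knows about the base_url setting.

```python
def prefixed(base_url: str, path: str) -> str:
    """Join a base_url such as "/prefix/" with an application path such as "/fixtures"."""
    if not (base_url.startswith("/") and base_url.endswith("/")):
        raise ValueError("base_url must start and end with /")
    # Drop the leading slash from the path so the result has no doubled slash
    return base_url + path.lstrip("/")


print(prefixed("/prefix/", "/"))
print(prefixed("/prefix/", "/fixtures"))
```

Any link that gets built by hand instead of going through a helper like this will point at the unprefixed path, and the proxy in front will not route it to the right place.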
This feature has been a magnet for bugs over the years! People keep finding new parts of the Datasette interface that fail to link to the correct pages when run in this mode.
The principal cause of these bugs is that I don’t use Datasette in this way myself, so I wasn’t testing it nearly as thoroughly as it needed.
So the first step in finally solving these issues once and for all was to get my own instance of Datasette up and running behind an Apache proxy.
Since I like to deploy live demos to Cloud Run, I decided to try and run Apache and Datasette in the same container. This took a lot of figuring out. You can follow my progress on this in these two issue threads:
- #1521: Docker configuration for exercising Datasette behind Apache mod_proxy
- #1522: Deploy a live instance of demos/apache-proxy
(I ended up deploying it to Fly after running into a bug when deployed to Cloud Run that I couldn’t replicate on my own laptop.)
My final implementation uses a Debian base container with Supervisord to manage the two processes.
With a working live environment, I was finally able to track down the root cause of the bugs. My notes on issue #1519 (base_url is omitted in JSON and CSV views) document how I found and solved them, and how I updated the associated test to hopefully avoid them ever coming back in the future.
The big Datasette table refactor
The single most complicated part of the Datasette codebase is the code behind the table view—the page that lets you browse, facet, search, filter and paginate through the contents of a table (this page here).
It’s got very thorough tests, but the actual implementation is mostly a 600 line class method.
It was already difficult to work with, but the changes I want to make for Datasette 1.0 have proven too much for it. I need to refactor.
Apart from making that view easier to change and maintain, a major goal I have is for it to support a much more flexible JSON syntax. I want the JSON version to default to just returning minimal information about the table, then allow ?_extra=x parameters to opt into additional information, such as facets, suggested facets, full counts, SQL schema information and so on.
This means I want to break up that 600 line method into a bunch of separate methods, each of which can be opted-in-to by the calling code.
The HTML interface should then build on top of the JSON, requesting the extras that it knows it will need and passing the resulting data through to the template. This helps solve the challenge of having a stable template context that I can document in advance of Datasette 1.0.
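A rough sketch of that opt-in shape, using only the standard library, might look like this. The extra names `count` and `columns`, the `EXTRAS` registry and the `table_json()` function are all invented for illustration. They are not the real Datasette API or its final set of extras.

```python
from urllib.parse import parse_qs


# Hypothetical extras: each is a separate function computing one optional block
def extra_count(table):
    return len(table["rows"])


def extra_columns(table):
    return list(table["rows"][0].keys()) if table["rows"] else []


EXTRAS = {"count": extra_count, "columns": extra_columns}


def table_json(table, query_string=""):
    """Return minimal data by default, plus any extras named in ?_extra=."""
    requested = parse_qs(query_string).get("_extra", [])
    data = {"rows": table["rows"]}
    for name in requested:
        if name in EXTRAS:
            # Only do the extra work when the caller asked for it
            data[name] = EXTRAS[name](table)
    return data


table = {"rows": [{"id": 1, "name": "one"}, {"id": 2, "name": "two"}]}
print(table_json(table))
print(table_json(table, "_extra=count&_extra=columns"))
```

An HTML view built this way would call the same function with the full list of extras its template needs, which is what makes the template context something that can be written down in advance.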
I’ve been putting this off for over a year now, because it’s a lot of work. But no longer! This week I finally started to get stuck in.
I don’t know if I’ll stick with it, but my initial attempt at this is a little unconventional. Inspired by how pytest fixtures work I’m experimenting with a form of dependency injection, in a new (very alpha) library I’ve released called asyncinject.
The key idea behind asyncinject is to provide a way for class methods to indicate their dependencies as named parameters, in the same way as pytest fixtures do.
When you call a method, the code can spot which dependencies have not yet been resolved and execute them before executing the method.
Crucially, since they are all async def methods they can be executed in parallel. I’m cautiously excited about this. Datasette has a bunch of opportunities for parallel queries: fetching a single page of table rows, calculating a count(*) for the entire table, executing requested facets and calculating suggested facets are all queries that could potentially run in parallel rather than in serial.
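The general technique can be sketched with nothing more than `inspect` and `asyncio`. This is not asyncinject’s API or implementation. The `TableView` class and the `resolve()` and `_call()` helpers are hypothetical names made up for this example, and the sleeps stand in for real queries.

```python
import asyncio
import inspect


class TableView:
    """Hypothetical view whose methods name their dependencies as parameters."""

    async def rows(self):
        await asyncio.sleep(0.01)  # stand-in for a SQL query
        return [{"id": 1}, {"id": 2}]

    async def count(self):
        await asyncio.sleep(0.01)  # stand-in for count(*)
        return 2

    async def summary(self, rows, count):
        # "rows" and "count" are the names of the methods this one depends on
        return f"showing {len(rows)} of {count}"


async def _call(obj, name, tasks):
    method = getattr(obj, name)
    deps = list(inspect.signature(method).parameters)
    # Resolve every dependency concurrently, then pass results in by name
    values = await asyncio.gather(*(resolve(obj, dep, tasks) for dep in deps))
    return await method(**dict(zip(deps, values)))


async def resolve(obj, name, tasks=None):
    """Run obj.<name>, making sure each dependency is started only once."""
    tasks = {} if tasks is None else tasks
    if name not in tasks:
        tasks[name] = asyncio.ensure_future(_call(obj, name, tasks))
    return await tasks[name]


print(asyncio.run(resolve(TableView(), "summary")))
```

Storing a task per method name means a dependency shared by two methods is only executed once, and `asyncio.gather()` is what lets independent dependencies overlap.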
What about the GIL, you might ask? Datasette’s database queries are handled by the sqlite3 module, and that module releases the GIL once it gets into SQLite C code. So theoretically I should be able to use more than one core for this all.
The asyncinject README has more details, including code examples. This may turn out to be a terrible idea! But it’s really fun to explore, and I’ll be able to tell for sure if this is a useful, maintainable and performant approach once I have Datasette’s table view running on top of it.
git-history and sqlite-utils
I made some big improvements to my git-history tool, which automates the process of turning a JSON (or other) file that has been version-tracked in a GitHub repository (see Git scraping) into a SQLite database that can be used to explore changes to it over time.
The biggest was a major change to the database schema. Previously, the tool used full Git SHA hashes as foreign keys in the largest table.
The problem here is that a SHA hash string is 40 characters long, and if they are being used as a foreign key that’s a LOT of extra weight added to the largest table.
sqlite-utils has a table.lookup() method which is designed to make creating “lookup” tables—where a string is stored in a unique column but an integer ID can be used for things like foreign keys—as easy as possible.
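The lookup table pattern itself can be sketched with the standard library sqlite3 module. This is not how sqlite-utils implements table.lookup(). The `lookup()` function, the `commits` table name and the placeholder hashes below are assumptions made for the example.

```python
import sqlite3


def lookup(conn, table, value):
    """Return the integer id for value, inserting a row if it is not there yet."""
    conn.execute(
        f"create table if not exists [{table}] "
        "(id integer primary key, value text unique)"
    )
    # The unique constraint makes this a no-op when the value already exists
    conn.execute(f"insert or ignore into [{table}] (value) values (?)", [value])
    return conn.execute(
        f"select id from [{table}] where value = ?", [value]
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
sha_one = "a" * 40  # placeholders standing in for 40 character commit hashes
sha_two = "b" * 40
first = lookup(conn, "commits", sha_one)
again = lookup(conn, "commits", sha_one)
other = lookup(conn, "commits", sha_two)
print(first, again, other)
```

The large table can then store the small integer id as its foreign key, and the 40 character string is stored only once per commit.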
The great thing about building stuff on top of your own libraries is that you can discover new features that you need along the way—and then ship them promptly without them blocking your progress!
Some other highlights
- s3-credentials 0.6 adds a --dry-run option that you can use to show what the tool would do without making any actual changes to your AWS account. I found myself wanting this while continuing to work on the ability to specify a folder prefix within S3 that the bucket credentials should be limited to.
- datasette-publish-vercel 0.12 applies some pull requests from Romain Clement that I had left unreviewed for far too long, and adds the ability to customize the vercel.json file used for the deployment, which is useful for things like setting up additional custom redirects.
- datasette-graphql 2.0 updates that plugin to Graphene 3.0, a major update to that library. I had to break backwards compatibility in very minor ways, hence the 2.0 version number.
- csvs-to-sqlite 1.3 is the first release of that tool in just over a year. William Rowell contributed a new feature that allows you to populate “fixed” database columns on your imported records, see PR #81 for details.
TIL this week
- Planning parallel downloads with TopologicalSorter
- Using cog to update --help in a Markdown README file
- Using build-arg variables with Cloud Run deployments
- Assigning a custom subdomain to a Fly app
Releases this week
datasette-publish-vercel: 0.12—(18 releases total)—2021-11-22
Datasette plugin for publishing data using Vercel
git-history: 0.4—(6 releases total)—2021-11-21
Tools for analyzing Git history using SQLite
sqlite-utils: 3.19—(90 releases total)—2021-11-21
Python CLI utility and library for manipulating SQLite databases
datasette: 0.59.3—(101 releases total)—2021-11-20
An open source multi-tool for exploring and publishing data
Datasette plugin that redirects all non-https requests to https
s3-credentials: 0.6—(6 releases total)—2021-11-18
A tool for creating credentials for accessing S3 buckets
csvs-to-sqlite: 1.3—(13 releases total)—2021-11-18
Convert CSV files into a SQLite database
datasette-graphql: 2.0—(32 releases total)—2021-11-17
Datasette plugin providing an automatic GraphQL API for your SQLite databases
asyncinject: 0.2a0—(2 releases total)—2021-11-17
Run async workflows using pytest-fixtures-style dependency injection