Simon Willison’s Weblog

Weeknotes: Datasette Writes

26th February 2020

As discussed previously, the biggest hole in Datasette’s feature set at the moment involves writing to the database.

Datasette was born as a hack to abuse serverless, stateless hosting by bundling a static, immutable database as part of the deployment. The key idea was that for some use-cases—such as data journalism—you don’t need to be able to continually update your data. It’s just the facts that support the story you are trying to tell.

I also believed the conventional wisdom that SQLite is fine for reads but shouldn’t be trusted to handle web application writes. I no longer believe this to be the case: SQLite is great at handling writes, as millions of iPhone and Android apps will attest.

Meanwhile, the biggest blocker for people trying out Datasette is that they first need to convert their data to SQLite. I’ve been building a family of CLI tools for this, but those require users both to be familiar with the command-line and to install software on their computers.

So: Datasette needs to grow web-based tools for loading data into the database.

Datasette’s plugin system is the ideal space for experimenting with ways of doing this, without needing to try out crazy new features on Datasette’s own core.

There’s just one big problem: SQLite may be great at fast, reliable writes, but it doesn’t like concurrent writes. It’s important to only ever have one connection writing to a SQLite database at a time.

I’ve been mulling over the best way to handle this for the best part of a year... and then a couple of days ago I had a breakthrough: with a dedicated write thread for a database file, I could use a Python queue to ensure only one write operation touches the database at a time.
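To make that concrete, here’s a minimal sketch of the pattern (not Datasette’s actual implementation): a single thread owns the only write connection, and every other thread submits writes through a `queue.Queue` and blocks on a reply.

```python
import queue
import sqlite3
import threading


class WriteThread(threading.Thread):
    """Owns the only write connection to a SQLite file; writes are
    funnelled through a queue so they execute strictly one at a time."""

    def __init__(self, path):
        super().__init__(daemon=True)
        self.path = path
        self.queue = queue.Queue()

    def run(self):
        # The connection is created inside the thread that will use it
        conn = sqlite3.connect(self.path)
        while True:
            sql, params, reply = self.queue.get()
            try:
                with conn:  # commits on success, rolls back on error
                    cursor = conn.execute(sql, params or [])
                reply.put(cursor.lastrowid)
            except Exception as ex:
                reply.put(ex)

    def execute_write(self, sql, params=None):
        # Safe to call from any thread; blocks until the write completes
        reply = queue.Queue()
        self.queue.put((sql, params, reply))
        result = reply.get()
        if isinstance(result, Exception):
            raise result
        return result


writer = WriteThread("/tmp/demo.db")
writer.start()
writer.execute_write("create table if not exists logs (message text)")
writer.execute_write("insert into logs (message) values (?)", ["hello"])
```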

There’s prior art for this: SQLite wizard Charles Leifer released code plus a beautiful explanation of how to queue writes to SQLite back in 2017. I’m not sure why I didn’t settle on his approach sooner.

So... Datasette 0.37, released this evening, has a new capability exposed to plugins: they can now request that an operation (either a SQL statement or a full custom Python function) be queued up to execute inside a thread that possesses an exclusive write connection to a SQLite database.

I’ve documented how plugins can use this in the new plugin internals documentation: execute_write() and execute_write_fn().
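Here’s a rough sketch of what a plugin calling execute_write() could look like, using the asgi_wrapper hook to log each incoming request. The request_log table and the way the Database object is obtained are illustrative, not copied from those docs:

```python
from datasette import hookimpl

# Rough sketch, not a tested plugin: logs the path of every HTTP
# request to a request_log table via the new execute_write() method.


@hookimpl
def asgi_wrapper(datasette):
    def wrap(app):
        async def log_requests(scope, receive, send):
            if scope["type"] == "http":
                db = datasette.get_database()  # the default database
                await db.execute_write(
                    "create table if not exists request_log (path text)",
                    block=True,
                )
                await db.execute_write(
                    "insert into request_log (path) values (?)",
                    [scope["path"]],
                    block=True,
                )
            await app(scope, receive, send)

        return log_requests

    return wrap
```

Passing block=True makes the awaited call wait until the write thread has actually executed the statement, rather than fire-and-forget.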

So far there’s only one public plugin that takes advantage of this: datasette-upload-csvs, which previously used a dirty hack but has now been upgraded to use the new execute_write_fn() method.

I’m really excited about the potential plugins this unlocks. While I was building the hooks I experimented with a logging plugin and a plugin for deleting tables (full implementations of both are posted as comments in the pull request).
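For the custom-function variant, the shape is roughly this — a hypothetical sketch standing in for the table-deletion experiment, where the function you pass is handed the exclusive write connection:

```python
# Hypothetical sketch of execute_write_fn(): the function you pass is
# executed inside the write thread with the exclusive write connection.


async def delete_request_log(db):
    # db is a Datasette Database object, obtained as in the previous sketch

    def drop(conn):
        # conn is this database's dedicated write connection
        conn.execute("drop table if exists request_log")

    # block=True waits until the write thread has run the function
    await db.execute_write_fn(drop, block=True)
```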

Other use-cases I’m interested in exploring include:

  • Plugins that import data from other APIs or services. Imagine web UI interfaces to some of my Dogsheep tools for example.
  • Plugins that periodically update data—pulling the latest CSV updates from government open data portals (like San Francisco’s trees).
  • Tools for enhancing tables with additional data derived from their values—geocoding or reverse geocoding columns, resolving identifiers and so on.
  • Now that plugins have a tool for maintaining their own state, they could use SQLite tables to track things like which saved searches have been executed.
  • A plugin that lets you attach annotations to rows and columns in other tables, storing those annotations in its own SQLite database.