Simon Willison’s Weblog

Weeknotes: sqlite-dump, datasette-backup, talks

11th September 2020

I spent some time this week digging into Python’s sqlite3 internals. I also gave two talks and recorded a third, due to air at PyGotham in October.

sqlite-dump and datasette-backup

I’m running an increasing number of Datasette instances with mutable database files—databases that are updated through a variety of different mechanisms. So I need to start thinking about backups.

Prior to this, most of my database files had been relatively disposable: they were built from other sources of data (often by scheduled GitHub Actions), so backups weren’t necessary since I could always rebuild them from their point of truth.

Creating a straight copy of a SQLite database file isn’t enough for robust backups, because the file may be accepting writes while you are creating the copy.

SQLite has various mechanisms for backups. There’s an online backup API and more recent SQLite versions support a VACUUM INTO command which also optimizes the backed up database.
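
Here’s a rough sketch of both approaches using Python’s sqlite3 module (the filenames are just illustrative, and VACUUM INTO needs the bundled SQLite to be 3.27 or newer):

```python
import sqlite3

# Online backup API: copies the database page by page, and is safe to
# run while the source database is receiving writes.
src = sqlite3.connect("data.db")
dest = sqlite3.connect("data-backup.db")
with dest:
    src.backup(dest)  # Connection.backup() is available in Python 3.7+
dest.close()

# VACUUM INTO: writes an optimized copy of the database to a new file.
src.execute("VACUUM INTO 'data-vacuumed.db'")
src.close()
```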

I figured it would be useful to expose this functionality via a Datasette plugin—one that could allow automated backups to be directly fetched from Datasette over HTTPS. So I started work on datasette-backup.

For the first backup mode, I decided to take advantage of the connection.iterdump() method that’s built into Python’s sqlite3 module. This method is an iterator that outputs plain text SQL that can recreate a database. Crucially, it’s a streaming-compatible mechanism—unlike VACUUM INTO, which would require me to create a temporary file the same size as the database I was backing up.
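
Here’s roughly what that looks like: iterdump() hands you one SQL statement at a time, so nothing needs to be buffered in memory or written to a temporary copy of the database (dumping to a local file here just for illustration):

```python
import sqlite3

conn = sqlite3.connect("data.db")

# Each iteration produces a single SQL statement, so the output could
# just as easily be streamed into an HTTP response as written to a file.
with open("dump.sql", "w") as f:
    for statement in conn.iterdump():
        f.write(statement + "\n")
```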

I started experimenting with it, and ran into a big problem. I make extensive use of SQLite full-text search, but the .sql dumps generated by .iterdump() break with constraint errors if they include any FTS tables.

After a bit of digging I came across a 13-year-old comment about this in the CPython source code itself!

The implementation for .iterdump() turns out to be entirely in Python, and way less complicated than I had expected. So I decided to see if I could get FTS table exports working.

In a classic case of yak shaving, I decided to create a Python library called sqlite-dump to solve this problem. And since my existing cookiecutter templates only cover Datasette plugins or Click apps, I first needed to create a new python-lib template in order to create the library I needed for my plugin.

I got it working! Install the datasette-backup plugin on any Datasette instance to get a /-/backup/name-of-database.sql URL that will produce a streaming SQL dump of any attached database.
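
As a quick usage sketch, assuming a local instance started with something like datasette data.db --port 8001 (the URL and filenames here are illustrative):

```python
import urllib.request

# Stream the SQL dump to disk in chunks rather than loading it all into memory.
url = "http://localhost:8001/-/backup/data.sql"
with urllib.request.urlopen(url) as response, open("data-backup.sql", "wb") as f:
    while chunk := response.read(8192):
        f.write(chunk)
```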

A weird bug with SQLite FTS and triggers

While working on datasette-backup I noticed a weird issue with some of my SQLite full-text search enabled databases: they kept getting bigger. Way bigger than I would expect them to.

I eventually noticed that the licenses_fts table in my github-to-sqlite demo database had 7 rows in it, but the accompanying licenses_fts_docsize table had 9,141 rows. I would expect it to have only 7 as well.
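
The mismatch is easy to spot with a couple of count queries; here’s a sketch against a local copy of that demo database (the filename is assumed):

```python
import sqlite3

conn = sqlite3.connect("github.db")  # local copy of the demo database (name assumed)
fts_rows = conn.execute("SELECT count(*) FROM licenses_fts").fetchone()[0]
docsize_rows = conn.execute("SELECT count(*) FROM licenses_fts_docsize").fetchone()[0]
print(fts_rows, docsize_rows)  # these should match; here they were 7 and 9,141
```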

I was stumped as to what was going on, so I turned to the official SQLite forum. I only recently discovered how useful this is as a resource. Dan Kennedy, one of the three core SQLite maintainers, replied within an hour and gave me some useful hints. The root cause turned out to be the way SQLite triggers work: by default, SQLite runs in recursive_triggers=off mode (for backwards compatibility with older databases). This means that an INSERT OR REPLACE update to a table that is backed by full-text search may not correctly trigger the updates needed on the FTS table itself.
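
If you’re managing the connection yourself, the workaround is to flip that pragma on before writing; here’s a minimal sketch (the licenses table and its columns are just an example):

```python
import sqlite3

conn = sqlite3.connect("data.db")
# recursive_triggers defaults to off for backwards compatibility. Turning it
# on lets INSERT OR REPLACE fire the FTS delete triggers for the old row
# before the replacement row is inserted.
conn.execute("PRAGMA recursive_triggers=on")
conn.execute(
    "INSERT OR REPLACE INTO licenses (key, name) VALUES (?, ?)",
    ("mit", "MIT License"),
)
conn.commit()
```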

Since there doesn’t appear to be any disadvantage to running with recursive_triggers=on, I’ve now set that as the default for sqlite-utils, as of version 2.17.

I then added a sqlite-utils rebuild-fts data.db command in version 2.18 which can rebuild the FTS tables in a database and fix the _fts_docsize problem.
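
Under the hood this is roughly the FTS “rebuild” command, which repopulates the index and its shadow tables (including _fts_docsize) from the content table; a sketch for a single table:

```python
import sqlite3

conn = sqlite3.connect("data.db")
# FTS5 (and FTS4) tables support a special 'rebuild' command that rebuilds
# the index and shadow tables, such as licenses_fts_docsize, from scratch.
conn.execute("INSERT INTO licenses_fts(licenses_fts) VALUES('rebuild')")
conn.commit()
```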

Talks

I presented Build your own data warehouse for personal analytics with SQLite and Datasette at PyCon AU last week. The video is here and includes my first public demo of Dogsheep Beta, my new combined search engine for personal analytics data imported using my Dogsheep family of tools. I took questions in this Google Doc, and filled out more detailed answers after the talk.

I gave a talk at PyRVA a couple of days ago called Rapid data analysis with SQLite and Datasette. Here’s the video and Google Doc for that one.

I also pre-recorded my talk for PyGotham: Datasette—an ecosystem of tools for working with Small Data. The conference is in the first week of October and I’ll be hanging out there during the talk answering questions and chatting about the project, safe from the stress of also having to present it live!

TIL this week

Releases this week