Simon Willison’s Weblog

Fun with binary data and SQLite

This week I’ve been mainly experimenting with binary data storage in SQLite. sqlite-utils can now insert data from binary files, and datasette-media can serve content over HTTP that originated as binary BLOBs in a database file.

Paul Ford piqued my interest in this when he tweeted about loading thousands of PDF documents into a SQLite database:

The SQLite documentation claims that serving smaller binary files from BLOB columns can be 35% faster than the filesystem. I’ve done a little bit of work with binary files in SQLite—the datasette-render-binary and datasette-render-images both help display BLOB data—but I’d never really dug into it in much detail.

sqlite-utils insert-files

The first step was to make it easier to build database files that include binary data.

sqlite-utils is my combination Python library and CLI tool for building SQLite databases. I’ve been steadily evolving it for a couple of years now, and it’s the engine behind my Dogsheep collection of tools for personal analytics.

The new insert-files command can be used to insert content from binary files into a SQLite database, along with file metadata.

The most basic usage looks like this:

sqlite-utils insert-files gifs.db images *.gif

By default, this creates a database table like so:

CREATE TABLE [images] (
    [path] TEXT PRIMARY KEY,
    [content] BLOB,
    [size] INTEGER
);

You can customize this table to include other file metadata using the -c (short for --column) option:

sqlite-utils insert-files gifs.db images *.gif \
    -c path -c md5 -c last_modified:mtime -c size --pk=path

This creates a table with the following schema:

CREATE TABLE [images] (
    [path] TEXT PRIMARY KEY,
    [md5] TEXT,
    [last_modified] FLOAT,
    [size] INTEGER
);

If you pass a directory instead of a file name the command will recursively add every file in that directory.

I also improved sqlite-utils with respect to outputting binary data. The new --raw option dumps the binary contents of a column directly to standard out, so you can read an image back out of one of the above tables like so:

sqlite-utils photos.db \
    "select content from images where path=:path" \
    -p path 'myphoto.jpg' \
    --raw > myphoto.jpg

This example also demonstrates the new support for :parameters passed using the new -p option, see #124.

sqlite-utils usually communicate in JSON, but JSON doesn’t have the ability to represent binary values. Datasette outputs binary values like so:

"data": {
  "$base64": true,
  "encoded": "iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAY..."
}

I added support for the same format to sqlite-utils—so you can now query binary columns and get out that nested object, or pipe JSON with that nested structure in to sqlite-utils insert and have it stored as a binary BLOB in the database.

datasette-media

datasette-media is a plugin for serving binary content directly from Datasette on a special URL. I originally built it while working on Dogsheep Photos—given a SQLite file full of Apple Photos metadata I wanted to be able to serve thumbnails of the actual images via my Datasette web server.

Those photos were still stored on disk—the plugin lets you configure a SQL query like this which will cause hits to /-/media/photos/$UUID to serve that file from disk:

{
    "plugins": {
        "datasette-media": {
            "photo": {
                "sql": "select filepath from apple_photos where uuid=:key"
            }
        }
    }
}

Issue #14 added support for BLOB columns as well. You can now configure the plugin like this to serve binary content that was stored in the database:

{
    "plugins": {
        "datasette-media": {
            "thumb": {
                "sql": "select content from thumbnails where uuid=:key"
            }
        }
    }
}

This would serve content from a BLOB column in a thumbnails table from the URL /-/media/thumb/$UUID.

I really like this pattern of configuring plugins using SQL queries, where the returned column names have special meaning that is interpreted by the plugin. datasette-atom and datasette-ics use a similar trick.

I expanded datasette-media with a few other related features:

  • Return a content_url column and it will proxy content from that URL
  • Set "enable_transform": true for a media bucket to enable ?w= and ?h= and ?format= parameters for transforming the image before it is served to the user
  • Return a content_filename column to set a download file name (in a content-disposition HTTP header) prompting the user’s browser to download the file

See the README or release notes for more details.

Also this week

I renamed datasette-insert-api to just datasette-insert, reflecting my plans to add non-API features to that plugin in the future.

In doing so I had to figure out how to rename a PyPI package such that dependent projects would continue to work. I ended up building a pypi-rename cookiecutter template encoding what I learned.

I enabled PostgreSQL full-text search for my blog’s Django Admin interface, and wrote a TIL on how I did it.

I added compound primary key support to db-to-sqlite, so now it can convert PostgreSQL or MySQL databases to SQLite if they use compound primary keys.

TIL this week

Releases this week