Simon Willison’s Weblog

Fun with binary data and SQLite

This week I’ve been mainly experimenting with binary data storage in SQLite. sqlite-utils can now insert data from binary files, and datasette-media can serve content over HTTP that originated as binary BLOBs in a database file.

Paul Ford piqued my interest in this when he tweeted about loading thousands of PDF documents into a SQLite database:

The SQLite documentation claims that serving smaller binary files from BLOB columns can be 35% faster than the filesystem. I’ve done a little bit of work with binary files in SQLite—the datasette-render-binary and datasette-render-images both help display BLOB data—but I’d never really dug into it in much detail.

sqlite-utils insert-files

The first step was to make it easier to build database files that include binary data.

sqlite-utils is my combination Python library and CLI tool for building SQLite databases. I’ve been steadily evolving it for a couple of years now, and it’s the engine behind my Dogsheep collection of tools for personal analytics.

The new insert-files command can be used to insert content from binary files into a SQLite database, along with file metadata.

The most basic usage looks like this:

sqlite-utils insert-files gifs.db images *.gif

By default, this creates a database table like so:

CREATE TABLE [images] (
    [path] TEXT PRIMARY KEY,
    [content] BLOB,
    [size] INTEGER
);

You can customize this table to include other file metadata using the -c (short for --column) option:

sqlite-utils insert-files gifs.db images *.gif \
    -c path -c md5 -c last_modified:mtime -c size --pk=path

This creates a table with the following schema:

CREATE TABLE [images] (
    [path] TEXT PRIMARY KEY,
    [md5] TEXT,
    [last_modified] FLOAT,
    [size] INTEGER
);

If you pass a directory instead of a file name the command will recursively add every file in that directory.

I also improved sqlite-utils with respect to outputting binary data. The new --raw option dumps the binary contents of a column directly to standard out, so you can read an image back out of one of the above tables like so:

sqlite-utils photos.db \
    "select content from images where path=:path" \
    -p path 'myphoto.jpg' \
    --raw > myphoto.jpg

This example also demonstrates the new support for :parameters passed using the new -p option, see #124.

sqlite-utils usually communicate in JSON, but JSON doesn’t have the ability to represent binary values. Datasette outputs binary values like so:

"data": {
  "$base64": true,
  "encoded": "iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAY..."
}

I added support for the same format to sqlite-utils—so you can now query binary columns and get out that nested object, or pipe JSON with that nested structure in to sqlite-utils insert and have it stored as a binary BLOB in the database.

datasette-media

datasette-media is a plugin for serving binary content directly from Datasette on a special URL. I originally built it while working on Dogsheep Photos—given a SQLite file full of Apple Photos metadata I wanted to be able to serve thumbnails of the actual images via my Datasette web server.

Those photos were still stored on disk—the plugin lets you configure a SQL query like this which will cause hits to /-/media/photos/$UUID to serve that file from disk:

{
    "plugins": {
        "datasette-media": {
            "photo": {
                "sql": "select filepath from apple_photos where uuid=:key"
            }
        }
    }
}

Issue #14 added support for BLOB columns as well. You can now configure the plugin like this to serve binary content that was stored in the database:

{
    "plugins": {
        "datasette-media": {
            "thumb": {
                "sql": "select content from thumbnails where uuid=:key"
            }
        }
    }
}

This would serve content from a BLOB column in a thumbnails table from the URL /-/media/thumb/$UUID.

I really like this pattern of configuring plugins using SQL queries, where the returned column names have special meaning that is interpreted by the plugin. datasette-atom and datasette-ics use a similar trick.

I expanded datasette-media with a few other related features:

  • Return a content_url column and it will proxy content from that URL
  • Set "enable_transform": true for a media bucket to enable ?w= and ?h= and ?format= parameters for transforming the image before it is served to the user
  • Return a content_filename column to set a download file name (in a content-disposition HTTP header) prompting the user’s browser to download the file

See the README or release notes for more details.

Also this week

I renamed datasette-insert-api to just datasette-insert, reflecting my plans to add non-API features to that plugin in the future.

In doing so I had to figure out how to rename a PyPI package such that dependent projects would continue to work. I ended up building a pypi-rename cookiecutter template encoding what I learned.

I enabled PostgreSQL full-text search for my blog’s Django Admin interface, and wrote a TIL on how I did it.

I added compound primary key support to db-to-sqlite, so now it can convert PostgreSQL or MySQL databases to SQLite if they use compound primary keys.

TIL this week

Releases this week

This is Fun with binary data and SQLite by Simon Willison, posted on 30th July 2020.

Tagged , , , , ,

Previous: Weeknotes: datasette-copyable, datasette-insert-api