Simon Willison’s Weblog

sqlite-utils: a Python library and CLI tool for building SQLite databases

sqlite-utils is a combination Python library and command-line tool I’ve been building over the past six months which aims to make creating new SQLite databases as quick and easy as possible.

It’s part of the ecosystem of tools I’m building around my Datasette project.

I spent the weekend adding all kinds of exciting command-line options to it, so I’m ready to describe it to the world.

A Python library for quickly creating databases

A core idea behind Datasette is that SQLite is the ideal format for publishing all kinds of interesting structured data. Datasette takes any SQLite database and adds a browsable web interface, a JSON API and the ability to export tables and queries as CSV.

The other half of the equation then is tools to create SQLite databases. csvs-to-sqlite was my first CLI attempt at this. sqlite-utils takes a much more flexible and comprehensive approach.

I started working on sqlite-utils last year as part of my project to Analyze US Election Russian Facebook Ads. The initial aim was to build a library that made constructing new SQLite databases inside of a Jupyter notebook as productive as possible.

The core idea behind the library is that you can give it a list of Python dictionaries (equivalent to JSON objects) and it will automatically create a SQLite table with the correct schema, then insert those items into the new table.

To illustrate, let’s create a database using this JSON file of meteorite landings released by NASA (discovered via awesome-json-datasets curated by Justin Dorfman).

Here’s the quickest way in code to turn that into a database:

import requests
import sqlite_utils

db = sqlite_utils.Database("meteorites.db")
db["meteorites"].insert_all(
    requests.get(
        "https://data.nasa.gov/resource/y77d-th95.json"
    ).json(),
    pk="id"
)

This two lines of code creates a new SQLite database on disk called meteorites.db, creates a table in that file called meteorites, detects the necessary columns based on the incoming data, inserts all of the rows and sets the id column up as the primary key.

To see the resulting database, run datasette meteorites.db and browse to http://127.0.0.1:8001/.

You can do a lot more with the library. You can create tables, insert and upsert data in bulk, configure foreign key relationships, configure SQLite full-text search and much more. I encourage you to consult the documentation for all of the details.

The sqlite-utils command-line tool

This is the new stuff built over the past few days, and I think it’s really fun.

First install the tool from PyPI, using pip3 install sqlite-utils.

Let’s start by grabbing a copy of the russian-ads.db database I created in Analyzing US Election Russian Facebook Ads (4MB):

$ cd /tmp
$ wget https://static.simonwillison.net/static/2019/russian-ads.db

We can see a list of tables in the database and their counts using the tables command:

$ sqlite-utils tables russian-ads.db --counts
[{"table": "ads", "count": 3498},
 {"table": "targets", "count": 1665},
 {"table": "ad_targets", "count": 36559},
 {"table": "ads_fts", "count": 3498},
 {"table": "ads_fts_segments", "count": 120},
 {"table": "ads_fts_segdir", "count": 1},
 {"table": "ads_fts_docsize", "count": 3498},
 {"table": "ads_fts_stat", "count": 1}]

By default, sqlite-utils outputs data as neatly formatted JSON. You can get CSV instead using the --csv option:

$ sqlite-utils tables russian-ads.db --counts --csv
table,count
ads,3498
targets,1665
ad_targets,36559
ads_fts,3498
ads_fts_segments,120
ads_fts_segdir,1
ads_fts_docsize,3498
ads_fts_stat,1

Or if you want a pretty ASCII-art table, use --table (or the shortcut, -t):

$ sqlite-utils tables russian-ads.db --counts -t
table               count
----------------  -------
ads                  3498
targets              1665
ad_targets          36559
ads_fts              3498
ads_fts_segments      120
ads_fts_segdir          1
ads_fts_docsize      3498
ads_fts_stat            1

The table view is built on top of tabulate, which offers dozens of table variations. Run sqlite-utils tables --help for the full list—try --table -fmt=rst for output that can be pasted directly into a reStructuredText document (handy for writing documentation).

So far we’ve just looked at a list of tables. Lets run a SQL query:

$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category"
[{"category": "accessing_facebook_on", "count(*)": 1},
 {"category": "age", "count(*)": 82},
 {"category": "and_must_also_match", "count(*)": 228},
 {"category": "army_reserve_industry", "count(*)": 3},
 {"category": "behaviors", "count(*)": 16},
 ...

Again, this can be output as CSV using --csv, or a table with --table.

The default JSON output is objects wrapped in an array. Use --arrays to get an array of arrays instead. More interestingly: --nl causes the data to be output as newline-delimited JSON, like this:

$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category" --nl
{"category": "accessing_facebook_on", "count(*)": 1}
{"category": "age", "count(*)": 82}
{"category": "and_must_also_match", "count(*)": 228}
{"category": "army_reserve_industry", "count(*)": 3}
{"category": "behaviors", "count(*)": 16}
...

This is a really interesting format for piping to other tools.

Creating databases from JSON on the command-line

The sqlite-utils insert command can be used to create new tables by piping JSON or CSV directly into the tool. It’s the command-line equivalent of the .insert_all() Python function I demonstrated earlier.

Here’s how to create that meteorite database directly from the command-line:

$ curl "https://data.nasa.gov/resource/y77d-th95.json" | \
    sqlite-utils insert meteorites.db meteorites - --pk=id

This will use a SQLite database file called meteorites.db (creating one if it does not yet exist), create or use a table called meteorites and read the data from standard in (hence the pipe). You can pass a filename instead of a - here to read data from a file on disk.

The insert command accepts multiple formats—it defaults to expecting a JSON array of objects, but you can use --nl to accept newline-delimited JSON and --csv to accept CSV.

This means you can combine the tools! Let’s create a brand new database by exporting data from the old one, using newline-delimited JSON as the intermediary format:

$ sqlite-utils russian-ads.db \
    "select * from ads where text like '%veterans%'" --nl | \
    sqlite-utils insert veterans.db ads - --nl

This creates a new file called veterans.db containing an ads table with just the ads that mentioned veterans somewhere in their body text.

Since we’re working with JSON, we can introduce other command-line tools into the mix.

jq is a neat little tool for extracting data from a JSON file using its own mini domain-specific language.

The Nobel Prize API offers a JSON file listing all of the Nobel laureates—but they are contained as an array in a top level "laureates" key. sqlite-utils needs a flat array—so we can use jq to get exactly that:

$ curl "http://api.nobelprize.org/v1/laureate.json" | \
  jq ".laureates" | \
  sqlite-utils insert nobel.db laureates -

Now we have a file called nobel.db containing all of the Nobel laureates.

Since Datasette recently grew the ability to export newline-delimited JSON, we can also use this ability to directly consume data from Datasette. Lets grab every episode of the Joy of Painting in which Bob Ross painted a beach:

$ curl "https://fivethirtyeight.datasettes.com/fivethirtyeight-aa93d24/bob-ross%2Felements-by-episode.json?_facet=BEACH&BEACH=1&_shape=array&_nl=on" \
| sqlite-utils insert bob.db beach_episodes - --nl

Plenty more features

As with the Python API, the sqlite-utils CLI tool has dozens of other options and extensive documentation.

I’ve been really enjoying growing an ecosystem of tools around Datasette. sqlite-utils is the keystone here: it’s fundamental to other tools I’m building, such as db-to-sqlite (which can export any SQLAlchemy-supported database directly to a SQLite file on disk).

I’ve found myself increasingly turning to SQLite first for all kinds of ad-hoc analysis, and I’m excited to try out these new command-line abilities of sqlite-utils for real-world data spelunking tasks.

This is sqlite-utils: a Python library and CLI tool for building SQLite databases by Simon Willison, posted on 25th February 2019.

Tagged , , ,

Next: I commissioned an oil painting of Barbra Streisand’s cloned dogs

Previous: Exploring search relevance algorithms with SQLite