Simon Willison’s Weblog

On webperformance, management, python, postgresql, opensource, ...

 

Recent entries

Datasette 0.28—and why master should always be releasable seven days ago

It’s been quite a while since the last substantial release of Datasette. Datasette 0.27 came out all the way back in January.

This isn’t because development has slowed down. In fact, the project has had 131 commits since then, covering a bewildering array of new functionality and with some significant contributions from developers who aren’t me—Russ Garrett and Romain Primet deserve special recognition here.

The problem has been one of discipline. I’m a big fan of the idea of keeping master shippable at all times in my professional work, but I hadn’t quite adopted this policy for my open-source side projects. A couple of months ago I found myself in a situation where I had two major refactorings (of faceting and of Datasette’s treatment of immutable files) going on in master at the same time, and untangling them turned out to take way longer than I had expected.

So I’ve updated Datasette’s contribution guidelines to specify that master should always be releasable, almost entirely as a reminder to myself.

All of that said, I’m finally back out of the weeds and I’m excited to announce today’s release of Datasette 0.28. It brings a salmagundi of new features! I’m replicating the release notes below.

Supporting databases that change

From the beginning of the project, Datasette has been designed with read-only databases in mind. If a database is guaranteed not to change it opens up all kinds of interesting opportunities—from taking advantage of SQLite immutable mode and HTTP caching to bundling static copies of the database directly in a Docker container. The interesting ideas in Datasette explores this idea in detail.

As my goals for the project have developed, I realized that read-only databases are no longer the right default. SQLite actually supports concurrent access very well provided only one thread attempts to write to a database at a time, and I keep encountering sensible use-cases for running Datasette on top of a database that is processing inserts and updates.

So, as-of version 0.28 Datasette no longer assumes that a database file will not change. It is now safe to point Datasette at a SQLite database which is being updated by another process.

Making this change was a lot of work—see tracking tickets #418, #419 and #420. It required new thinking around how Datasette should calculate table counts (an expensive operation against a large, changing database) and also meant reconsidering the “content hash” URLs Datasette has used in the past to optimize the performance of HTTP caches.

Datasette can still run against immutable files and gains numerous performance benefits from doing so, but this is no longer the default behaviour. Take a look at the new Performance and caching documentation section for details on how to make the most of Datasette against data that you know will be staying read-only and immutable.
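
If you know a file will stay read-only, the underlying SQLite trick is its immutable mode, which you can also experiment with directly from Python. A minimal sketch using the standard sqlite3 module (the filename here is hypothetical):

import sqlite3

# Open a database in SQLite immutable mode: locking and change detection are
# skipped entirely, which is only safe if the file genuinely never changes.
conn = sqlite3.connect("file:fixtures.db?immutable=1", uri=True)
print(conn.execute("select count(*) from sqlite_master").fetchone()[0])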

Faceting improvements, and faceting plugins

Datasette Facets provide an intuitive way to quickly summarize and interact with data. Previously the only supported faceting technique was column faceting, but 0.28 introduces two powerful new capabilities: facet-by-JSON-array and the ability to define further facet types using plugins.

Facet by array (#359) is only available if your SQLite installation provides the json1 extension. Datasette will automatically detect columns that contain JSON arrays of values and offer a faceting interface against those columns—useful for modelling things like tags without needing to break them out into a new table. See Facet by JSON array for more.
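
Under the hood this style of faceting leans on json1’s json_each() table-valued function, which expands a JSON array into rows. A rough sketch of the kind of query involved, using an invented posts table (and assuming your SQLite build includes json1):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table posts (id integer primary key, tags text)")
conn.execute("""
    insert into posts values
        (1, '["python", "sqlite"]'),
        (2, '["sqlite"]')
""")

# json_each() turns each JSON array into rows, so a group by gives facet counts
sql = """
    select j.value as tag, count(*) as n
    from posts, json_each(posts.tags) as j
    group by j.value
    order by n desc
"""
print(conn.execute(sql).fetchall())  # [('sqlite', 2), ('python', 1)]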

The new register_facet_classes() plugin hook (#445) can be used to register additional custom facet classes. Each facet class should provide two methods: suggest() which suggests facet selections that might be appropriate for a provided SQL query, and facet_results() which executes a facet operation and returns results. Datasette’s own faceting implementations have been refactored to use the same API as these plugins.
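
The full interface is specified in the Datasette plugin documentation; as a rough, hypothetical sketch (the attributes and return shapes below are my assumptions, not copied from the docs), a facet plugin might look something like this:

from datasette import hookimpl


class ExampleFacet:
    # Illustrative only: the real base class, constructor and attributes are
    # defined by Datasette itself.
    type = "example"

    async def suggest(self):
        # Inspect the current SQL query and return a list of facets that
        # might be worth offering to the user.
        return []

    async def facet_results(self):
        # Execute the facet operation and return the computed results,
        # plus any facets that timed out.
        return [], []


@hookimpl
def register_facet_classes():
    # Assumed registration shape: return the classes you want Datasette to use.
    return [ExampleFacet]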

datasette publish cloudrun

Google Cloud Run is a brand new serverless hosting platform from Google, which allows you to build a Docker container which will run only when HTTP traffic is received and will shut down (and hence cost you nothing) the rest of the time. It’s similar to Zeit’s Now v1 Docker hosting platform, which sadly is no longer accepting signups from new users.

The new datasette publish cloudrun command was contributed by Romain Primet (#434) and publishes selected databases to a new Datasette instance running on Google Cloud Run.

See Publishing to Google Cloud Run for full documentation.

register_output_renderer plugins

Russ Garrett implemented a new Datasette plugin hook called register_output_renderer (#441) which allows plugins to create additional output renderers in addition to Datasette’s default .json and .csv.

Russ’s in-development datasette-geo plugin includes an example of this hook being used to output .geojson automatically converted from SpatiaLite.
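
The exact hook signature is documented in Datasette, but the general shape is a function that maps a new extension onto a callback that builds the response. A hypothetical sketch (the argument names and return keys here are assumptions for illustration):

import json

from datasette import hookimpl


def render_demo(args, data, view_name):
    # `data` is assumed to carry the rows and metadata Datasette has already
    # fetched for this view; here we just report a summary of it.
    body = json.dumps({"view": view_name, "rows": len(data.get("rows", []))})
    return {"body": body, "content_type": "application/json; charset=utf-8"}


@hookimpl
def register_output_renderer(datasette):
    # Assumed shape: tell Datasette which extension triggers the callback,
    # e.g. /db/table.demo
    return {"extension": "demo", "callback": render_demo}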

Medium changes

  • Datasette now conforms to the Black coding style (#449)—and has a unit test to enforce this in the future
  • New Special table arguments:
    • ?columnname__in=value1,value2,value3 filter for executing SQL IN queries against a table, see Table arguments (#433)
    • ?columnname__date=yyyy-mm-dd filter which returns rows where the specified datetime column falls on the specified date (583b22a)
    • ?tags__arraycontains=tag filter which acts against a JSON array contained in a column (78e45ea)
    • ?_where=sql-fragment filter for the table view (#429)
    • ?_fts_table=mytable and ?_fts_pk=mycolumn querystring options can be used to specify which FTS table to use for a search query—see Configuring full-text search for a table or view (#428)
  • You can now pass the same table filter multiple times—for example, ?content__not=world&content__not=hello will return all rows where the content column is neither hello nor world, as shown in the sketch after this list (#288)
  • You can now specify about and about_url metadata (in addition to source and license) linking to further information about a project—see Source, license and about
  • New ?_trace=1 parameter now adds debug information showing every SQL query that was executed while constructing the page (#435)
  • datasette inspect now just calculates table counts, and does not introspect other database metadata (#462)
  • Removed /-/inspect page entirely—this will be replaced by something similar in the future, see #465
  • Datasette can now run against an in-memory SQLite database. You can do this by starting it without passing any files or by using the new --memory option to datasette serve. This can be useful for experimenting with SQLite queries that do not access any data, such as SELECT 1+1 or SELECT sqlite_version().
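
To get a feel for these table arguments, here is a hedged sketch that hits the JSON API of a locally running Datasette instance. The host, database and table names are placeholders; the querystring parameters are the ones described above.

import requests

# Placeholder: assumes `datasette mydb.db` is running locally with a posts table
url = "http://127.0.0.1:8001/mydb/posts.json"

# Passing a list makes requests repeat the parameter, producing
# ?content__not=hello&content__not=world as in the release notes above
params = {"content__not": ["hello", "world"], "_shape": "array"}
rows = requests.get(url, params=params).json()
print(len(rows), "rows matched")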

Small changes

  • We now show the size of the database file next to the download link (#172)
  • New /-/databases introspection page shows currently connected databases (#470)
  • Binary data is no longer displayed on the table and row pages (#442—thanks, Russ Garrett)
  • New show/hide SQL links on custom query pages (#415)
  • The extra_body_script plugin hook now accepts an optional view_name argument (#443—thanks, Russ Garrett)
  • Bumped Jinja2 dependency to 2.10.1 (#426)
  • All table filters are now documented, and documentation is enforced via unit tests (2c19a27)
  • New project guideline: master should stay shippable at all times! (31f36e1)
  • Fixed a bug where sqlite_timelimit() occasionally failed to clean up after itself (bac4e01)
  • We no longer load additional plugins when executing pytest (#438)
  • Homepage now links to database views if there are fewer than five tables in a database (#373)
  • The --cors option is now respected by error pages (#453)
  • datasette publish heroku now uses the --include-vcs-ignore option, which means it works under Travis CI (#407)
  • datasette publish heroku now publishes using Python 3.6.8 (666c374)
  • Renamed datasette publish now to datasette publish nowv1 (#472)
  • datasette publish nowv1 now accepts multiple --alias parameters (09ef305)
  • Removed the datasette skeleton command (#476)
  • The documentation on how to build the documentation now recommends sphinx-autobuild

Running Datasette on Glitch one month ago

The worst part of any software project is setting up a development environment. It’s by far the biggest barrier for anyone trying to get started learning to code. I’ve been a developer for more than twenty years and I still feel the pain any time I want to do something new.

Glitch is the most promising attempt I’ve ever seen at tackling this problem. It provides an entirely browser-based development environment that allows you to edit code, see the results instantly and view and remix the source code of other people’s projects.

It’s developed into a really fun, super-creative community and a fantastic resource for people looking to get started in the ever-evolving world of software development.

This evening I decided to get Datasette running on it. I’m really impressed with how well it works, and I think Glitch provides an excellent environment for experimenting with Datasette and related tools.

TLDR version: visit https://glitch.com/edit/#!/remix/datasette-csvs right now, drag-and-drop in a CSV file and watch it get served by Datasette on Glitch just a few seconds later.

Running Python on Glitch

The Glitch documentation is all about Node.js and JavaScript, but they actually have very solid Python support as well.

Every Glitch project runs in a container that includes Python 2.7.12 and Python 3.5.2, and you can use pip install --user or pip3 install --user to install Python dependencies.

The key to running non-JavaScript projects on Glitch is the glitch.json file format. You can use this to specify an install script, which sets up your container, and a start script, which starts your application running. Glitch will route HTTP traffic to port 3000, so your application server needs to listen on that port.

This means the most basic Glitch project to run Datasette looks like this:

https://datasette-basic.glitch.me/ (view source)

It contains a single glitch.json file:

{
    "install": "pip3 install --user datasette",
    "start": "datasette -p 3000"
}

This installs Datasette using pip3, then runs it on port 3000.

Since there’s no actual data to serve, this is a pretty boring demo. The most interesting page is this one, which shows the installed versions of the software:

https://datasette-basic.glitch.me/-/versions

Something more interesting: datasette-csvs

Let’s build one with some actual data.

My csvs-to-sqlite tool converts CSV files into a SQLite database. Since it’s also written in Python we can run it against CSV files as part of the Glitch install script.

Glitch provides a special directory called .data/ which can be used as a persistent file storage space that won’t be cleared in between restarts. The following "install" script installs datasette and csvs-to-sqlite, then runs the latter to create a SQLite database from all available CSV files:

{
    "install":  "pip3 install --user datasette csvs-to-sqlite && csvs-to-sqlite *.csv .data/csv-data.db",
    "start": "datasette .data/csv-data.db -p 3000"
}

Now we can simply drag and drop CSV files into the root of the Glitch project and they will be automatically converted into a SQLite database and served using Datasette!

We need a couple of extra details. Firstly, we want Datasette to automatically re-build the database file any time a new CSV file is added or an existing CSV file is changed. We can do that by adding a "watch" block to glitch.json:

"watch": {
    "install": {
        "include": [
            "\\.csv$"
        ]
    }
}

This ensures that our "install" script will run again any time a CSV file changes.

Let’s tone down the rate at which the scripts execute, by using throttle to set the polling interval to once a second:

"throttle": 1000

The above almost worked, but I started seeing errors if I changed the number of columns in a CSV file, since doing so clashed with the schema that had already been created in the database.

My solution was to add code to the install script that would delete the SQLite database file before attempting to recreate it—using the rm ... || true idiom to prevent Glitch from failing the installation if the file it attempted to remove did not already exist.

My final glitch.json file looks like this:

{
  "install": "pip3 install --user datasette csvs-to-sqlite && rm .data/csv-data.db || true && csvs-to-sqlite *.csv .data/csv-data.db",
  "start": "datasette .data/csv-data.db -p 3000 -m metadata.json",
  "watch": {
    "install": {
      "include": [
        "\\.csv$"
      ]
    },
    "restart": {
      "include": [
        "^metadata.json$"
      ]
    },
    "throttle": 1000
  }
}

I also set it up to use Datasette’s metadata.json format, and automatically restart the server any time the contents of that file change.

https://datasette-csvs.glitch.me/ (view source) shows the results, running against a simple example.csv file I created.

Remixing!

Here’s where things get really fun: Glitch projects support “remixing”, whereby anyone can click a link to create their own editable copy of a project.

Remixing works even if you aren’t logged in to Glitch! Anonymous projects expire after five days, so be sure to sign in with GitHub or Facebook if you want to keep yours around.

Try it out now: Visit https://glitch.com/edit/#!/remix/datasette-csvs to create your own remix of my project. Then drag a new CSV file directly into the editor and within a few seconds Datasette on Glitch will be up and running against a converted copy of your file!

Limitations

The Glitch help center article What technical restrictions are in place? describes their limits. Most importantly, projects are limited to 4,000 requests an hour—and there’s currently no way to increase that limit. They also limit projects to 200MB of disk space—easily enough to get started exploring some interesting CSV files with Datasette.

Next steps

I’m delighted at how easy this was to set up, and how much power the ability to remix these Datasette demos provides. I’m tempted to start creating remixable Glitch demos that illustrate other aspects of Datasette’s functionality such as plugins or full-text search.

Glitch is an exceptionally cool piece of software. I look forward to seeing their Python support continue to evolve.

Generating a commit log for San Francisco’s official list of trees two months ago

San Francisco has a neat open data portal (as do an increasingly large number of cities these days). For a few years my favourite file on there has been Street Tree List, a list of all 190,000 trees in the city maintained by the Department of Public Works.

I’ve been using that file for Datasette demos for a while now, but last week I noticed something intriguing: the file had been recently updated. On closer inspection it turns out it’s updated on a regular basis! I had assumed it was a static snapshot of trees at a certain point in time, but I was wrong: Street_Tree_List.csv is a living document.

Back in September 2017 I built a scraping project relating to hurricane Irma. The idea was to take data sources like FEMA’s list of open shelters and track them over time, by scraping them into a git repository and committing after every fetch.

I’ve been meaning to spend more time with this idea, and building a commit log for San Francisco’s trees looked like an ideal opportunity to do so.

sf-tree-history

Here’s the result: sf-tree-history, a git repository dedicated to recording the history of changes made to the official list of San Francisco’s trees. The repo contains three things: the latest copy of Street_Tree_List.csv, a README, and a Circle CI configuration that grabs a new copy of the file every night and, if it has changed, commits it to git and pushes the result to GitHub.

The most interesting part of the repo is the commit history itself. I’ve only been running the script for just over a week, but I already have some useful illustrative commits:

  • 7ab432cdcb8d7914cfea4a5b59803f38cade532b from March 6th records three new trees added to the file: two Monterey Pines and a Blackwood Acacia.
  • d6b258959af9546909b2eee836f0156ed88cd45d from March 12th shows four changes made to existing records. Of particular interest: TreeID 235981 (a Cherry Plum) had its address updated from 412 Webster St to 410 Webster St and its latitude and longitude tweaked a little bit as well.
  • ca66d9a5fdd632549301d249c487004a5b68abf2 lists 2151 rows changed, 1280 rows added! I found an old copy of Street_Tree_List.csv on my laptop from April 2018, so for fun I loaded it into the repository and used git commit --amend to back-date the commit to almost a year ago. I generated a commit message comparing that file with the version from 9 days ago, which came in at around 10,000 lines of text. Git handled that just fine, but GitHub’s web view sadly truncates it.

csv-diff

One of the things I learned from my hurricane Irma project was the importance of human-readable commit messages that summarize the detected changes. I initially wrote some code to generate those by hand, but then realized that this could be extracted into a reusable tool.

The result is csv-diff, a tiny Python CLI tool which can generate a human (or machine) readable version of the differences between two CSV files.

Using it looks like this:

$ csv-diff one.csv two.csv --key=id
1 row added, 1 row removed, 1 row changed

1 row added

  {"id": "3", "name": "Bailey", "age": "1"}

1 row removed

  {"id": "2", "name": "Pancakes", "age": "2"}

1 row changed

  Row 1
    age: "4" => "5"

The csv-diff README has further details on the tool.

Circle CI

My favourite thing about the sf-tree-history project is that it costs me nothing to run—either in hosting costs or (hopefully) in terms of ongoing maintenance.

The git repository is hosted for free on GitHub. Because it’s a public project, Circle CI will run tasks against it for free.

My .circleci/config.yml does the rest. It uses Circle’s cron syntax to schedule a task that runs every night. The task then runs this script (embedded in the YAML configuration):

cp Street_Tree_List.csv Street_Tree_List-old.csv
curl -o Street_Tree_List.csv "https://data.sfgov.org/api/views/tkzw-k3nq/rows.csv?accessType=DOWNLOAD"
git add Street_Tree_List.csv
git config --global user.email "treebot@example.com"
git config --global user.name "Treebot"
sudo pip install csv-diff
csv-diff Street_Tree_List-old.csv Street_Tree_List.csv --key=TreeID > message.txt
git commit -F message.txt && \
  git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master \
  || true

This script does all of the work.

  1. First it backs up the existing Street_Tree_List.csv as Street_Tree_List-old.csv, in order to be able to run a comparison later.
  2. It downloads the latest copy of Street_Tree_List.csv from the San Francisco data portal
  3. It adds the file to the git index and sets itself an identity for use in the commit
  4. It installs my csv-diff utility from PyPI
  5. It uses csv-diff to create a diff of the two files, and writes that diff to a new file called message.txt
  6. Finally, it attempts to create a new commit using message.txt as the commit message, then pushes the result to GitHub

The last line is the most complex. Circle CI will mark a build as failed if any of the commands in the run block return a non-0 exit code. git commit returns a non-0 exit code if you attempt to run it but none of the files have changed.

git commit ... && git push ... || true ensures that if git commit succeeds the git push command will be run, BUT if it fails the || true will still return a 0 exit code for the overall line—so Circle CI will not mark the build as failed.

There’s one last trick here: I’m using git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master to push my changes to GitHub. This takes advantage of Circle CI environment variables, which are the recommended way to configure secrets such that they cannot be viewed by anyone browsing your Circle CI builds. I created a personal GitHub auth token for this project, which I’m using to allow Circle CI to push commits to GitHub on my behalf.

Next steps

I’m really excited about this pattern of using GitHub in combination with Circle CI to track changes to any file that is being posted on the internet. I’m opening up the code (and my csv-diff utility) in the hope that other people will use them to set up their own tracking projects. Who knows, maybe there’s a file out there that’s even more exciting than San Francisco’s official list of trees!

I commissioned an oil painting of Barbra Streisand’s cloned dogs two months ago

Two dogs in a stroller looking at a gravestone, as an oil painting
Two identical puffs of white fur, gazing at the tombstone of the dog they are

Last year, Barbra Streisand cloned her dog, Sammie.

The story is fascinating, as is the background reading on dog cloning from Vanity Fair. But the thing that really stuck with me was the photograph that accompanied “Barbra Streisand Explains: Why I Cloned My Dog” in the New York Times:

Two dogs in a stroller looking at a gravestone

David Ewing Duncan in Vanity Fair described the scenario like this: “Barbra Streisand, visiting the grave of her beloved Sammie, with Miss Violet and Miss Scarlett perched next to her in their stroller—two identical puffs of white fur, gazing at the tombstone of the dog they are.”

This photograph says so much about the age that we live in. I couldn’t get it out of my head.

I’ve long been fascinated by Dafen, the town in China that was once responsible for 60% of the world’s oil paintings—mostly replicas, but today increasingly original artwork.

I always wanted to commission an oil painting from China, but I never quite found the right subject... until now.

There’s something deliciously appropriate about using a painting cloning service to clone a photograph of some cloned dogs.

So I uploaded a copy of the photo to Instapainting and entered a few extra instructions:

Please paint the stroller closer to the gravestone—adjust the composition so that it fits the 12x16 dimensions while maintaining the two key elements of the image: the stroller with the two dogs in it and the gravestone that they are looking at

A couple of months later, a tube arrived in the mail. I opened it up... and they had nailed it. If anything the painting is an improvement on the original photograph: the composition is tighter, the stroller no longer has its wheel cut off, some dead plants in the background (which I had not even noticed) are given a bit more prominence, and the little doggy faces have exactly the right expressions of mild existential dread.

So thank you Alice Wu at Xiamen Dearchic Arts—I could not be happier.

With a painting this good, obviously it needed to be framed. I took it to Underglass Framing in San Francisco’s Hayes Valley and told them I was looking for something with an air of existential dread. “I think we can do that,” they said.

Two dogs in a stroller looking at a gravestone, as an oil painting in an intimidating frame

Natalie says I can keep it in the guest bathroom.

sqlite-utils: a Python library and CLI tool for building SQLite databases three months ago

sqlite-utils is a combination Python library and command-line tool I’ve been building over the past six months which aims to make creating new SQLite databases as quick and easy as possible.

It’s part of the ecosystem of tools I’m building around my Datasette project.

I spent the weekend adding all kinds of exciting command-line options to it, so I’m ready to describe it to the world.

A Python library for quickly creating databases

A core idea behind Datasette is that SQLite is the ideal format for publishing all kinds of interesting structured data. Datasette takes any SQLite database and adds a browsable web interface, a JSON API and the ability to export tables and queries as CSV.

The other half of the equation then is tools to create SQLite databases. csvs-to-sqlite was my first CLI attempt at this. sqlite-utils takes a much more flexible and comprehensive approach.

I started working on sqlite-utils last year as part of my project to Analyze US Election Russian Facebook Ads. The initial aim was to build a library that made constructing new SQLite databases inside of a Jupyter notebook as productive as possible.

The core idea behind the library is that you can give it a list of Python dictionaries (equivalent to JSON objects) and it will automatically create a SQLite table with the correct schema, then insert those items into the new table.

To illustrate, let’s create a database using this JSON file of meteorite landings released by NASA (discovered via awesome-json-datasets curated by Justin Dorfman).

Here’s the quickest way in code to turn that into a database:

import requests
import sqlite_utils

db = sqlite_utils.Database("meteorites.db")
db["meteorites"].insert_all(
    requests.get(
        "https://data.nasa.gov/resource/y77d-th95.json"
    ).json(),
    pk="id"
)

These few lines of code create a new SQLite database on disk called meteorites.db, create a table in that file called meteorites, detect the necessary columns based on the incoming data, insert all of the rows and set the id column up as the primary key.

To see the resulting database, run datasette meteorites.db and browse to http://127.0.0.1:8001/.

You can do a lot more with the library. You can create tables, insert and upsert data in bulk, configure foreign key relationships, configure SQLite full-text search and much more. I encourage you to consult the documentation for all of the details.
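
As a small taste of those other features, here is a minimal sketch (table and column names invented) showing bulk upserts and full-text search configuration with the Python library:

import sqlite_utils

db = sqlite_utils.Database("demo.db")

# upsert_all() inserts new rows and updates existing ones, matched on the primary key
db["dogs"].upsert_all(
    [{"id": 1, "name": "Cleo"}, {"id": 2, "name": "Pancakes"}],
    pk="id",
)

# Configure SQLite full-text search against the name column
db["dogs"].enable_fts(["name"])

print(db["dogs"].count)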

The sqlite-utils command-line tool

This is the new stuff built over the past few days, and I think it’s really fun.

First install the tool from PyPI, using pip3 install sqlite-utils.

Let’s start by grabbing a copy of the russian-ads.db database I created in Analyzing US Election Russian Facebook Ads (4MB):

$ cd /tmp
$ wget https://static.simonwillison.net/static/2019/russian-ads.db

We can see a list of tables in the database and their counts using the tables command:

$ sqlite-utils tables russian-ads.db --counts
[{"table": "ads", "count": 3498},
 {"table": "targets", "count": 1665},
 {"table": "ad_targets", "count": 36559},
 {"table": "ads_fts", "count": 3498},
 {"table": "ads_fts_segments", "count": 120},
 {"table": "ads_fts_segdir", "count": 1},
 {"table": "ads_fts_docsize", "count": 3498},
 {"table": "ads_fts_stat", "count": 1}]

By default, sqlite-utils outputs data as neatly formatted JSON. You can get CSV instead using the --csv option:

$ sqlite-utils tables russian-ads.db --counts --csv
table,count
ads,3498
targets,1665
ad_targets,36559
ads_fts,3498
ads_fts_segments,120
ads_fts_segdir,1
ads_fts_docsize,3498
ads_fts_stat,1

Or if you want a pretty ASCII-art table, use --table (or the shortcut, -t):

$ sqlite-utils tables russian-ads.db --counts -t
table               count
----------------  -------
ads                  3498
targets              1665
ad_targets          36559
ads_fts              3498
ads_fts_segments      120
ads_fts_segdir          1
ads_fts_docsize      3498
ads_fts_stat            1

The table view is built on top of tabulate, which offers dozens of table variations. Run sqlite-utils tables --help for the full list—try --table --fmt=rst for output that can be pasted directly into a reStructuredText document (handy for writing documentation).

So far we’ve just looked at a list of tables. Let’s run a SQL query:

$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category"
[{"category": "accessing_facebook_on", "count(*)": 1},
 {"category": "age", "count(*)": 82},
 {"category": "and_must_also_match", "count(*)": 228},
 {"category": "army_reserve_industry", "count(*)": 3},
 {"category": "behaviors", "count(*)": 16},
 ...

Again, this can be output as CSV using --csv, or a table with --table.

The default JSON output is objects wrapped in an array. Use --arrays to get an array of arrays instead. More interestingly: --nl causes the data to be output as newline-delimited JSON, like this:

$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category" --nl
{"category": "accessing_facebook_on", "count(*)": 1}
{"category": "age", "count(*)": 82}
{"category": "and_must_also_match", "count(*)": 228}
{"category": "army_reserve_industry", "count(*)": 3}
{"category": "behaviors", "count(*)": 16}
...

This is a really interesting format for piping to other tools.
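
Each line is a complete JSON document, so a downstream script can process results one at a time without loading everything into memory. A minimal consumer for the query above might look like this (the count(*) key comes straight from the SQL):

import json
import sys

# Read newline-delimited JSON from stdin, one object per line
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    print(record["category"], record["count(*)"])

You could pipe the --nl output from the previous command straight into a script like this.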

Creating databases from JSON on the command-line

The sqlite-utils insert command can be used to create new tables by piping JSON or CSV directly into the tool. It’s the command-line equivalent of the .insert_all() Python function I demonstrated earlier.

Here’s how to create that meteorite database directly from the command-line:

$ curl "https://data.nasa.gov/resource/y77d-th95.json" | \
    sqlite-utils insert meteorites.db meteorites - --pk=id

This will use a SQLite database file called meteorites.db (creating one if it does not yet exist), create or use a table called meteorites and read the data from standard input (hence the pipe). You can pass a filename instead of a - here to read data from a file on disk.

The insert command accepts multiple formats—it defaults to expecting a JSON array of objects, but you can use --nl to accept newline-delimited JSON and --csv to accept CSV.

This means you can combine the tools! Let’s create a brand new database by exporting data from the old one, using newline-delimited JSON as the intermediary format:

$ sqlite-utils russian-ads.db \
    "select * from ads where text like '%veterans%'" --nl | \
    sqlite-utils insert veterans.db ads - --nl

This creates a new file called veterans.db containing an ads table with just the ads that mentioned veterans somewhere in their body text.

Since we’re working with JSON, we can introduce other command-line tools into the mix.

jq is a neat little tool for extracting data from a JSON file using its own mini domain-specific language.

The Nobel Prize API offers a JSON file listing all of the Nobel laureates—but they are contained as an array in a top level "laureates" key. sqlite-utils needs a flat array—so we can use jq to get exactly that:

$ curl "http://api.nobelprize.org/v1/laureate.json" | \
  jq ".laureates" | \
  sqlite-utils insert nobel.db laureates -

Now we have a file called nobel.db containing all of the Nobel laureates.

Since Datasette recently grew the ability to export newline-delimited JSON, we can also use this ability to directly consume data from Datasette. Let’s grab every episode of the Joy of Painting in which Bob Ross painted a beach:

$ curl "https://fivethirtyeight.datasettes.com/fivethirtyeight-aa93d24/bob-ross%2Felements-by-episode.json?_facet=BEACH&BEACH=1&_shape=array&_nl=on" \
| sqlite-utils insert bob.db beach_episodes - --nl

Plenty more features

As with the Python API, the sqlite-utils CLI tool has dozens of other options and extensive documentation.

I’ve been really enjoying growing an ecosystem of tools around Datasette. sqlite-utils is the keystone here: it’s fundamental to other tools I’m building, such as db-to-sqlite (which can export any SQLAlchemy-supported database directly to a SQLite file on disk).

I’ve found myself increasingly turning to SQLite first for all kinds of ad-hoc analysis, and I’m excited to try out these new command-line abilities of sqlite-utils for real-world data spelunking tasks.

Exploring search relevance algorithms with SQLite four months ago

SQLite isn’t just a fast, high quality embedded database: it also incorporates a powerful full-text search engine in the form of the FTS4 and FTS5 extensions. You’ve probably used these a bunch of times already: many iOS, Android and desktop applications use SQLite under-the-hood and use it to implement their built-in search.

I’ve been using these capabilities for basic search in Datasette for over a year now, but I’ve recently started digging into some of their more advanced features. It turns out hacking around with SQLite is a great way to learn more about how fundamental information retrieval algorithms work under the hood.

Today I’m releasing sqlite-fts4—a Python package that provides a collection of custom SQL functions for working with SQLite’s FTS4 module. It includes some neat tools for introspecting how relevancy ranking algorithms actually work.

Why not just use FTS5?

If it’s available to you, FTS5 is usually the best option: it has a good ranking algorithm built in. I described how to use it to build fast autocomplete search for your website in my article for the 2018 24 ways advent calendar. You can join directly against a virtual table and order by a pre-calculated relevance score accessible through that table.
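
To make that concrete, here is a tiny standalone sketch of the FTS5 approach, where ordering by the built-in rank column gives you bm25-based relevance without writing any scoring code (assuming your SQLite build includes FTS5):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create virtual table docs using fts5(title, body)")
conn.executemany(
    "insert into docs (title, body) values (?, ?)",
    [
        ("About search", "search engines rank documents by relevance"),
        ("Unrelated", "nothing to see here"),
        ("Search tips", "search search search"),
    ],
)

# FTS5 exposes a rank column; ordering by it returns the best matches first
for title, rank in conn.execute(
    "select title, rank from docs where docs match ? order by rank", ("search",)
):
    print(title, rank)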

What makes FTS4 interesting is that it doesn’t include a scoring mechanism: it instead exposes raw statistical data to you in a way that lets you build your own ranking functions.

You probably don’t need to do this—unless you are stuck on an older SQLite version that doesn’t support the latest features. But… if you’re interested in understanding more about how search actually works, the need to implement a ranking function is an excellent learning opportunity.

I’ll be demonstrating these functions using a hosted Datasette instance running at datasette-sqlite-fts4.datasette.io (with the data from my 24 ways article). You can try them out there, or if you want to use your own Datasette instance you can enable these custom SQL functions by pip installing my new datasette-sqlite-fts4 plugin.

Raw FTS4 matchinfo() data

When using FTS4, the only scoring help SQLite gives you is the built-in matchinfo() function. For each document in your search result set, this function will expose raw statistical data that can be used to calculate a score.

Let’s try it out using the following query:

select
    *, matchinfo(articles_fts, "pcx")
from
    articles_fts
where
    articles_fts match :search

Run matchinfo() in Datasette

The pcx here is called the format string—it lets SQLite know what information about the match you would like to see.

The results are returned as a binary string! For the first matching document, we get back the following:

\x02\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\xa3\x00\x00\x00\x1f\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\\\x00\x00\x00\x15\x00\x00\x00

SQLite’s C heritage is showing through here.

decode_matchinfo() to decode the binary

The first step in working with matchinfo is to decode that binary string. It’s actually a sequence of unsigned 32 bit integers. We can turn it into a Python list of integers using the following:

import struct
struct.unpack("I" * (len(matchinfo) // 4), matchinfo)

sqlite-fts4 exposes a SQL function called decode_matchinfo() which does exactly this. Let’s expand our example to use it:

select
    title, author,
    decode_matchinfo(matchinfo(articles_fts, "pcx")),
    matchinfo(articles_fts, "pcx")
from
    articles_fts
where
    articles_fts match :search

Run decode_matchinfo() in Datasette

The matchinfo for our first matching document now looks like this:

[2, 3, 0, 2, 2, 0, 0, 0, 1, 163, 31, 0, 2, 2, 0, 0, 0, 2, 92, 21]

Better, but still obscure. What does it mean?

The answer lies in the SQLite matchinfo documentation. In our format string, we requested p, c and x:

  • p requests a single integer representing the number of search terms we are matching. Since our search query is jquery maps, this is 2—it’s the first integer in the list.
  • c requests the number of searchable columns in our table. We created articles_fts with 3 columns, so it’s 3. That’s the second integer in the list.
  • x is much more interesting: it returns 3 integer values for each term/column combination. Since we have 2 terms and 3 columns, that means 6 phrase/column combinations and 6 * 3 = 18 integers. If you count the items in the array above you’ll see there are 18 left after you remove the first two. Each triple represents the number of times the term appears in the current column, the number of times it appears in this column across every row, and the total number of documents that match the term in this column at least once.

Search relevancy scores are usually calculated against exactly this kind of collection of statistics: we rank based on how rare the matching terms are across the rest of the corpus.
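
As a rough illustration of how those three numbers turn into a score, here is a simplified TF/IDF calculation over matchinfo-style triples. This is not the exact formula used by rank_score(), and the total document count is an assumed number; the two triples are taken from the decoded example above.

import math

# (hits_this_column_this_row, hits_this_column_all_rows, docs_with_hits)
# for each phrase/column combination, from the decoded matchinfo above
triples = [(1, 163, 31), (0, 2, 2)]
total_docs = 100  # assumed corpus size, purely for illustration

score = 0.0
for hits_this_row, hits_all_rows, docs_with_hits in triples:
    if hits_this_row == 0 or docs_with_hits == 0:
        continue
    idf = math.log(total_docs / docs_with_hits)  # rarer terms are worth more
    score += hits_this_row * idf

print(score)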

annotate_matchinfo() to annotate the integers

Having a list of integers made things easier, but still not easy enough. That’s where annotate_matchinfo() comes in. This custom SQL function expands the matchinfo list of integers into a giant JSON object describing exactly what each of the results means.

We’ll try it out like this:

select
    title, author,
    decode_matchinfo(matchinfo(articles_fts, "pcx")),
    json_object("pre", annotate_matchinfo(matchinfo(articles_fts, "pcx"), "pcx"))
from
    articles_fts
where
    articles_fts match :search

Run annotate_matchinfo() in Datasette

Note that we have to provide the format string twice, so that annotate_matchinfo() knows the requested order of the binary matchinfo data.

This returns a JSON object that looks like this:

{
  "p": {
    "title": "Number of matchable phrases in the query",
    "value": 2,
    "idx": 0
  },
  "c": {
    "title": "Number of user defined columns in the FTS table",
    "value": 3,
    "idx": 1
  },
  "x": {
    "title": "Details for each phrase/column combination",
    "value": [
      ...
      {
        "phrase_index": 0,
        "column_index": 2,
        "hits_this_column_this_row": 1,
        "hits_this_column_all_rows": 163,
        "docs_with_hits": 31,
        "idxs": [8, 9, 10]
      },
      {
        "phrase_index": 1,
        "column_index": 0,
        "hits_this_column_this_row": 0,
        "hits_this_column_all_rows": 2,
        "docs_with_hits": 2,
        "idxs": [11, 12, 13]
      }...
    ]
  }
}

Try it out with pcxnalyb to see the complete set of format string options.

You may be wondering why I wrapped that function call in json_object("pre", ...). This is a Datasette trick: I recently added the ability to pretty-print JSON to my datasette-html-json plugin—see that package’s README for details.

Building ranking functions

These statistics are everything we need to calculate relevance scores. sqlite-fts4 implements two such functions: rank_score() is a simple TF/IDF function. rank_bm25() is much more interesting—it’s an implementation of the Okapi BM25 ranking function, inspired by the one that ships with the peewee ORM.

Let’s try them both out:

select
    title, author,
    rank_score(matchinfo(articles_fts, "pcx")) as score,
    rank_bm25(matchinfo(articles_fts, "pcnalx")) as bm25,
    json_object("pre", annotate_matchinfo(matchinfo(articles_fts, "pcxnalyb"), "pcxnalyb"))
from
    articles_fts
where
    articles_fts match :search
order by bm25

Try rank_score() and rank_bm25() in Datasette

You can switch the order by clause between bm25 and score to compare the two.

bm25() is definitely a better option. It’s the default algorithm used these days by Elasticsearch, and they wrote up an excellent explanation of how it works on their blog.

Take a look at the source code for the ranking functions to see how they are implemented. They work against the data structure returned by annotate_matchinfo() to try and make it clear what is going on.
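
For reference, the textbook Okapi BM25 score for a single term is built from exactly those ingredients. A compact, hedged sketch of the standard formula (not the package’s exact implementation, and with made-up numbers):

import math


def bm25_term(term_freq, doc_length, avg_doc_length,
              docs_with_term, total_docs, k1=1.2, b=0.75):
    # Classic Okapi BM25 contribution of one term to one document's score
    idf = math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    tf = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * doc_length / avg_doc_length)
    )
    return idf * tf


# A term that appears twice in a slightly longer-than-average document
print(bm25_term(term_freq=2, doc_length=120, avg_doc_length=100,
                docs_with_term=3, total_docs=31))

The k1 and b constants here correspond to the tuning parameters mentioned below: they control how quickly repeated terms saturate and how much document length is penalized.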

Building the rank_bm25() function took me longer than I expected: I was comparing my results against bm25() from peewee to ensure I was getting them right, but I couldn’t get them to match. After some furious debugging I finally figured out the problem: peewee had a rare bug! I reported it to Charles Leifer and he analyzed it and turned around a fix in a matter of hours—it turns out the C library that peewee had ported to Python had the same problem.

Next steps

I’m really impressed with the flexibility that FTS4 provides—it turns out FTS5 isn’t the only worthwhile option for search in SQLite.

I’m thinking about ways to expose some of the bm25 tuning parameters (in particular the magic B and K1 constants explained by the Elasticsearch article) and I plan to upgrade Datasette’s search functionality to make ranking available as a first-class feature on the searchable table view.

I’m also generally excited about SQLite as a learning tool for exploring different search ranking mechanisms. Once you’ve decoded that binary matchinfo string it’s impressive how much you can get done with the underlying data.

Elsewhere

Today

  • Using dependabot to bump Django on my blog from 2.2 to 2.2.1 (via) GitHub recently acquired dependabot and made it free, and I decided to try it out on my blog. It’s a really neat piece of automation: it scans your requirements.txt (plus a number of other packaging definitions across several different languages), checks for updates to your dependencies and opens pull requests against any that it finds. Combine it with a CI service such as Circle CI and your tests will run automatically against the pull request, letting you know if it’s safe to merge. dependabot constantly rebases other changes against the pull request to try and ensure it will merge as cleanly as possible. #

25th May 2019

  • sqlite-utils 1.0. I just released sqlite-utils 1.0, with a couple of handy new features over 0.14: it can now automatically add columns to a database table if you attempt to insert data which doesn’t quite fit (using alter=True in the Python API or the --alter option to the “sqlite-utils insert” command). It also has the ability to output nested JSON column values on the command-line using the new --json-cols option. This is the first project I’ve marked as a 1.0 release in a very long time—I’ll be sticking to semver for this project from now on, bumping the major version only in the case of a backwards incompatible change. #

22nd May 2019

  • WebAssembly at eBay: A Real-World Use Case (via) eBay used WebAssembly to run a C++ barcode reading library inside a web worker, passing images from the camera in order to provide a barcode scanning interface as part of their mobile web “add listing” page (a feature that had already proved successful in their native mobile apps). This is a great write-up, with lots of detail about how they compiled the library. They ended up running three barcode solutions in parallel web workers—two using WebAssembly, one in pure JavaScript—because their testing showed that racing between three implementations greatly increased the chance of a match due to how the different libraries handled poor quality or out-of-focus images. #

21st May 2019

  • Terrarium by Fastly Labs. Fastly have been investing heavily in WebAssembly, which makes sense as it provides an excellent option for a sandboxed environment for executing server-side code at the edge of their CDN offering. Terrarium is their “playground for experimenting with edge-side WebAssembly”—it lets you write a program in Rust, C, TypeScript or Wat (WebAssembly text format), compile it to WebAssembly and deploy it to a URL with a single button-click. It’s just a demo for the moment so deployments only persist for 15 minutes, but it’s a fascinating sandbox to play around with. #
  • Monaco Editor. VS Code is MIT licensed and built on top of Electron. I thought “huh, I wonder if I could run the editor component embedded in a web app”—and it turns out Microsoft have already extracted out the code editor component into an open source JavaScript package called Monaco. Looks very slick, though sadly it’s not supported in mobile browsers. #
  • Public Data Release of Stack Overflow’s 2019 Developer Survey. Here’s the Stack Overflow announcement of their developer survey public data release, which discusses the Glitch partnership and mentions Datasette. #
  • Discover Insights in Developer Survey Results. Stack Overflow partnered with Glitch and used Datasette to host the full data set from Stack Overflow’s 2019 Developer Survey! #

15th May 2019

  • django-lifecycle (via) Interesting alternative to Django signals by Robert Singer. It provides a model mixin class which overrides the Django ORM’s save() method, tracking which model attributes have been changed. Then it lets you add methods to your model with a @hook annotation allowing you to specify things like “run this method before saving if the status changed” or “run this after an object has been deleted”. #
  • Why I (Still) Love Tech: In Defense of a Difficult Industry (via) If you only read one longform piece this week, make it this one. Utterly delightful prose and a bunch of different messages that resonated with me deeply. #
  • Imagine if you were really into the group Swervedriver in the mid-’90s but by 2019 someone was on CNBC telling you that Swervedriver represented, I don’t know, 10 percent of global economic growth, outpacing returns in oil and lumber. That’s the tech industry.

    Paul Ford #

14th May 2019

  • quicktype code generator for Python. Really interesting tool: give it an example JSON document and it will code-generate the equivalent set of Python classes (with type annotations) instantly in your browser. It also accepts input in JSON Schema or TypeScript and can generate code in 18 different languages. #
  • Amazon’s Away Teams laid bare: How AWS's hivemind of engineers develop and maintain their internal tech (via) Some interesting insights into how Amazon structure their engineering organization to maximize team productivity in a service-oriented environment. Two things that stood out to me: each service is owned by a “home team”, but sometimes features that are needed by other teams can be built by forming an “away team” to build out that functionality. Secondly, Amazon has a concept of “bar raisers” who are engineers across the organization who help approve key design and architectural decisions. It’s possible to go against the recommendation of a bar raiser but “such a move is noted and made visible to higher levels of management”. #

10th May 2019

  • ... the overall conclusion I reach is that we have so much to gain from making Django async-capable that it is worth the large amount of work it will take. I also believe, crucially, that we can undertake this change in an iterative, community-driven way that does not rely solely on one or two long-time contributors burning themselves out.

    Andrew Godwin #

8th May 2019

  • We don’t like limits on discrimination and lending, so we’re gonna use machine learning, which is a form of money laundering for bias, a way to blame mathematical algorithms for desires to simply avoid rules that everybody else has to play by in this industry.

    Maciej Ceglowski #

7th May 2019

  • asgi-cors (via) I’ve been trying out the new ASGI 3.0 spec and I just released my first piece of ASGI middleware: asgi-cors, which lets you wrap an ASGI application with Access-Control-Allow-Origin CORS headers (either “*” or dynamic headers based on an origin whitelist). #

2nd May 2019