Simon Willison’s Weblog

On talks, opensource, webassembly, rust, datajournalism, ...


Recent entries

Running Datasette on Glitch two days ago

The worst part of any software project is setting up a development environment. It’s by far the biggest barrier for anyone trying to get started learning to code. I’ve been a developer for more than twenty years and I still feel the pain any time I want to do something new.

Glitch is the most promising attempt I’ve ever seen at tackling this problem. It provides an entirely browser-based development environment that allows you to edit code, see the results instantly and view and remix the source code of other people’s projects.

It’s developed into a really fun, super-creative community and a fantastic resource for people looking to get started in the ever-evolving world of software development.

This evening I decided to get Datasette running on it. I’m really impressed with how well it works, and I think Glitch provides an excellent environment for experimenting with Datasette and related tools.

TLDR version: visit !/remix/datasette-csvs right now, drag-and-drop in a CSV file and watch it get served by Datasette on Glitch just a few seconds later.

Running Python on Glitch

The Glitch documentation is all about Node.js and JavaScript, but they actually have very solid Python support as well.

Every Glitch project runs in a container that includes Python 2.7.12 and Python 3.5.2, and you can use pip install --user or pip3 install --user to install Python dependencies.

The key to running non-JavaScript projects on Glitch is the glitch.json file format. You can use this to specify an install script, which sets up your container, and a start script, which starts your application running. Glitch will route HTTP traffic to port 3000, so your application server needs to listen on that port.

This means the most basic Glitch project to run Datasette looks like this: (view source)

It contains a single glitch.json file:

    "install": "pip3 install --user datasette",
    "start": "datasette -p 3000"

This installs Datasette using pip3, then runs it on port 3000.

Since there’s no actual data to serve, this is a pretty boring demo. The most interesting page is this one, which shows the installed versions of the software:

Something more interesting: datasette-csvs

Let’s build one with some actual data.

My csvs-to-sqlite tool converts CSV files into a SQLite database. Since it’s also written in Python we can run it against CSV files as part of the Glitch install script.

Glitch provides a special directory called .data/ which can be used as a persistent file storage space that won’t be cleared in between restarts. The following "install" script installs datasette and csvs-to-sqlite, then runs the latter to create a SQLite database from all available CSV files:

    "install":  "pip3 install --user datasette csvs-to-sqlite && csvs-to-sqlite *.csv .data/csv-data.db",
    "start": "datasette .data/csv-data.db -p 3000"

Now we can simply drag and drop CSV files into the root of the Glitch project and they will be automatically converted into a SQLite database and served using Datasette!

We need a couple of extra details. Firstly, we want Datasette to automatically re-build the database file any time a new CSV file is added or an existing CSV file is changed. We can do that by adding a "watch" block to glitch.json:

"watch": {
    "install": {
        "include": [

This ensures that our "install" script will run again any time a CSV file changes.

Let’s tone down the rate at which the scripts execute, by using throttle to set the polling interval to once a second:

"throttle": 1000

The above almost worked, but I started seeing errors if I changed the number of columns in a CSV file, since doing so clashed with the schema that had already been created in the database.

My solution was to add code to the install script that would delete the SQLite database file before attempting to recreate it—using the rm ... || true idiom to prevent Glitch from failing the installation if the file it attempted to remove did not already exist.

My final glitch.json file looks like this:

  "install": "pip3 install --user datasette csvs-to-sqlite && rm .data/csv-data.db || true && csvs-to-sqlite *.csv .data/csv-data.db",
  "start": "datasette .data/csv-data.db -p 3000 -m metadata.json",
  "watch": {
    "install": {
      "include": [
    "restart": {
      "include": [
    "throttle": 1000

I also set it up to use Datasette’s metadata.json format, and to automatically restart the server any time the contents of that file change. (view source) shows the results, running against a simple example.csv file I created.


Here’s where things get really fun: Glitch projects support “remixing”, whereby anyone can click a link to create their own editable copy of a project.

Remixing works even if you aren’t logged in to Glitch! Anonymous projects expire after five days, so be sure to sign in with GitHub or Facebook if you want to keep yours around.

Try it out now: Visit !/remix/datasette-csvs to create your own remix of my project. Then drag a new CSV file directly into the editor and within a few seconds Datasette on Glitch will be up and running against a converted copy of your file!


The Glitch help center article What technical restrictions are in place? describes their limits. Most importantly, projects are limited to 4,000 requests an hour—and there’s currently no way to increase that limit. They also limit projects to 200MB of disk space—easily enough to get started exploring some interesting CSV files with Datasette.

Next steps

I’m delighted at how easy this was to set up, and how much power the ability to remix these Datasette demos provides. I’m tempted to start creating remixable Glitch demos that illustrate other aspects of Datasette’s functionality such as plugins or full-text search.

Glitch is an exceptionally cool piece of software. I look forward to seeing their Python support continue to evolve.

Generating a commit log for San Francisco’s official list of trees one month ago

San Francisco has a neat open data portal (as do an increasingly large number of cities these days). For a few years my favourite file on there has been Street Tree List, a list of all 190,000 trees in the city maintained by the Department of Public Works.

I’ve been using that file for Datasette demos for a while now, but last week I noticed something intriguing: the file had been recently updated. On closer inspection it turns out it’s updated on a regular basis! I had assumed it was a static snapshot of trees at a certain point in time, but I was wrong: Street_Tree_List.csv is a living document.

Back in September 2017 I built a scraping project relating to hurricane Irma. The idea was to take data sources like FEMA’s list of open shelters and track them over time, by scraping them into a git repository and committing after every fetch.

I’ve been meaning to spend more time with this idea, and building a commit log for San Francisco’s trees looked like an ideal opportunity to do so.


Here’s the result: sf-tree-history, a git repository dedicated to recording the history of changes made to the official list of San Francisco’s trees. The repo contains three things: the latest copy of Street_Tree_List.csv, a README, and a Circle CI configuration that grabs a new copy of the file every night and, if it has changed, commits it to git and pushes the result to GitHub.

The most interesting part of the repo is the commit history itself. I’ve only been running the script for just over a week, but I already have some useful illustrative commits:

  • 7ab432cdcb8d7914cfea4a5b59803f38cade532b from March 6th records three new trees added to the file: two Monterey Pines and a Blackwood Acacia.
  • d6b258959af9546909b2eee836f0156ed88cd45d from March 12th shows four changes made to existing records. Of particular interest: TreeID 235981 (a Cherry Plum) had its address updated from 412 Webster St to 410 Webster St and its latitude and longitude tweaked a little bit as well.
  • ca66d9a5fdd632549301d249c487004a5b68abf2 lists 2151 rows changed, 1280 rows added! I found an old copy of Street_Tree_List.csv on my laptop from April 2018, so for fun I loaded it into the repository and used git commit --amend to back-date the commit to almost a year ago. I generated a commit message comparing that file against the version from 9 days ago, which came in at around 10,000 lines of text. Git handled that just fine, but GitHub’s web view sadly truncates it.


One of the things I learned from my hurricane Irma project was the importance of human-readable commit messages that summarize the detected changes. I initially wrote some code to generate those by hand, but then realized that this could be extracted into a reusable tool.

The result is csv-diff, a tiny Python CLI tool which can generate a human (or machine) readable version of the differences between two CSV files.

Using it looks like this:

$ csv-diff one.csv two.csv --key=id
1 row added, 1 row removed, 1 row changed

1 row added

  {"id": "3", "name": "Bailey", "age": "1"}

1 row removed

  {"id": "2", "name": "Pancakes", "age": "2"}

1 row changed

  Row 1
    age: "4" => "5"

The csv-diff README has further details on the tool.
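The core idea behind a keyed CSV diff is simple enough to sketch in a few lines of Python. This is an illustrative simplification (the function name and structure here are my own, not csv-diff's actual implementation): load both files into dictionaries keyed on the key column, then classify rows as added, removed or changed by comparing the key sets.

```python
import csv
import io

def diff_keyed_csv(old_text, new_text, key):
    """Sketch of a keyed CSV diff: classify rows as added, removed or
    changed by comparing dictionaries keyed on the `key` column."""
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_text))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_text))}
    added = [new[k] for k in new.keys() - old.keys()]
    removed = [old[k] for k in old.keys() - new.keys()]
    # For rows present in both, record each column whose value differs
    changed = {
        k: {col: (old[k][col], new[k][col])
            for col in new[k] if old[k][col] != new[k][col]}
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return added, removed, changed
```

The real tool layers human-readable message formatting on top of this kind of classification.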

Circle CI

My favourite thing about the sf-tree-history project is that it costs me nothing to run—either in hosting costs or (hopefully) in terms of ongoing maintenance.

The git repository is hosted for free on GitHub. Because it’s a public project, Circle CI will run tasks against it for free.

My .circleci/config.yml does the rest. It uses Circle’s cron syntax to schedule a task that runs every night. The task then runs this script (embedded in the YAML configuration):

cp Street_Tree_List.csv Street_Tree_List-old.csv
curl -o Street_Tree_List.csv ""
git add Street_Tree_List.csv
git config --global ""
git config --global "Treebot"
sudo pip install csv-diff
csv-diff Street_Tree_List-old.csv Street_Tree_List.csv --key=TreeID > message.txt
git commit -F message.txt && \
  git push -q https://${GITHUB_PERSONAL_TOKEN} master \
  || true

This script does all of the work.

  1. First it backs up the existing Street_Tree_List.csv as Street_Tree_List-old.csv, in order to be able to run a comparison later.
  2. It downloads the latest copy of Street_Tree_List.csv from the San Francisco data portal
  3. It adds the file to the git index and sets itself an identity for use in the commit
  4. It installs my csv-diff utility from PyPI
  5. It uses csv-diff to create a diff of the two files, and writes that diff to a new file called message.txt
  6. Finally, it attempts to create a new commit using message.txt as the commit message, then pushes the result to GitHub

The last line is the most complex. Circle CI will mark a build as failed if any of the commands in the run block return a non-0 exit code. git commit returns a non-0 exit code if you attempt to run it but none of the files have changed.

git commit ... && git push ... || true ensures that if git commit succeeds the git push command will be run, BUT if it fails the || true will still return a 0 exit code for the overall line—so Circle CI will not mark the build as failed.

There’s one last trick here: I’m using git push -q https://${GITHUB_PERSONAL_TOKEN} master to push my changes to GitHub. This takes advantage of Circle CI environment variables, which are the recommended way to configure secrets such that they cannot be viewed by anyone browsing your Circle CI builds. I created a personal GitHub auth token for this project, which I’m using to allow Circle CI to push commits to GitHub on my behalf.

Next steps

I’m really excited about this pattern of using GitHub in combination with Circle CI to track changes to any file that is being posted on the internet. I’m opening up the code (and my csv-diff utility) in the hope that other people will use them to set up their own tracking projects. Who knows, maybe there’s a file out there that’s even more exciting than San Francisco’s official list of trees!

I commissioned an oil painting of Barbra Streisand’s cloned dogs one month ago

Two dogs in a stroller looking at a gravestone, as an oil painting
Two identical puffs of white fur, gazing at the tombstone of the dog they are

Last year, Barbra Streisand cloned her dog, Sammie.

The story is fascinating, as is the background reading on dog cloning from Vanity Fair. But the thing that really stuck with me was the photograph that accompanied “Barbra Streisand Explains: Why I Cloned My Dog” in the New York Times:

Two dogs in a stroller looking at a gravestone

David Ewing Duncan in Vanity Fair described the scenario like this: Barbra Streisand, visiting the grave of her beloved Sammie, with Miss Violet and Miss Scarlett perched next to her in their stroller—two identical puffs of white fur, gazing at the tombstone of the dog they are.

This photograph says so much about the age that we live in. I couldn’t get it out of my head.

I’ve long been fascinated by Dafen, the town in China that was once responsible for 60% of the world’s oil paintings—mostly replicas, but today increasingly original artwork.

I always wanted to commission an oil painting from China, but I never quite found the right subject... until now.

There’s something deliciously appropriate about using a painting cloning service to clone a photograph of some cloned dogs.

So I uploaded a copy of the photo to Instapainting and entered a few extra instructions:

Please paint the stroller closer to the gravestone—adjust the composition so that it fits the 12x16 dimensions while maintaining the two key elements of the image: the stroller with the two dogs in it and the gravestone that they are looking at

A couple of months later, a tube arrived in the mail. I opened it up... and they had nailed it. If anything the painting is an improvement on the original photograph: the composition is tighter, the stroller no longer has its wheel cut off, some dead plants in the background (which I had not even noticed) are given a bit more prominence, and the little doggy faces have exactly the right expressions of mild existential dread.

So thank you Alice Wu at Xiamen Dearchic Arts—I could not be happier.

With a painting this good, obviously it needed to be framed. I took it to Underglass Framing in San Francisco’s Hayes Valley and told them I was looking for something with an air of existential dread. “I think we can do that” they said.

Two dogs in a stroller looking at a gravestone, as an oil painting in an intimidating frame

Natalie says I can keep it in the guest bathroom.

sqlite-utils: a Python library and CLI tool for building SQLite databases one month ago

sqlite-utils is a combination Python library and command-line tool I’ve been building over the past six months which aims to make creating new SQLite databases as quick and easy as possible.

It’s part of the ecosystem of tools I’m building around my Datasette project.

I spent the weekend adding all kinds of exciting command-line options to it, so I’m ready to describe it to the world.

A Python library for quickly creating databases

A core idea behind Datasette is that SQLite is the ideal format for publishing all kinds of interesting structured data. Datasette takes any SQLite database and adds a browsable web interface, a JSON API and the ability to export tables and queries as CSV.

The other half of the equation then is tools to create SQLite databases. csvs-to-sqlite was my first CLI attempt at this. sqlite-utils takes a much more flexible and comprehensive approach.

I started working on sqlite-utils last year as part of my project to Analyze US Election Russian Facebook Ads. The initial aim was to build a library that made constructing new SQLite databases inside of a Jupyter notebook as productive as possible.

The core idea behind the library is that you can give it a list of Python dictionaries (equivalent to JSON objects) and it will automatically create a SQLite table with the correct schema, then insert those items into the new table.

To illustrate, let’s create a database using this JSON file of meteorite landings released by NASA (discovered via awesome-json-datasets curated by Justin Dorfman).

Here’s the quickest way in code to turn that into a database:

import requests
import sqlite_utils

url = "..."  # the NASA meteorite landings JSON file linked above

db = sqlite_utils.Database("meteorites.db")
db["meteorites"].insert_all(requests.get(url).json(), pk="id")

These few lines of code create a new SQLite database on disk called meteorites.db, create a table in that file called meteorites, detect the necessary columns based on the incoming data, insert all of the rows and set the id column up as the primary key.

To see the resulting database, run datasette meteorites.db and browse to http://localhost:8001/

You can do a lot more with the library. You can create tables, insert and upsert data in bulk, configure foreign key relationships, configure SQLite full-text search and much more. I encourage you to consult the documentation for all of the details.
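To demystify what the library is doing under the hood, here's a rough stdlib-only sketch of the insert_all idea—derive a schema from the dictionary keys, create the table, bulk-insert. This is my own simplified illustration, not sqlite-utils' actual implementation, which also handles type detection, upserts, foreign keys and more:

```python
import sqlite3

def insert_all(conn, table, rows, pk=None):
    """Create a table from the first row's keys and bulk-insert the rows.
    Simplified sketch: everything is stored as TEXT."""
    cols = list(rows[0].keys())
    defs = ", ".join(
        f"[{c}] TEXT PRIMARY KEY" if c == pk else f"[{c}] TEXT" for c in cols
    )
    conn.execute(f"CREATE TABLE IF NOT EXISTS [{table}] ({defs})")
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany(
        f"INSERT INTO [{table}] VALUES ({placeholders})",
        [[row[c] for c in cols] for row in rows],
    )
```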

The sqlite-utils command-line tool

This is the new stuff built over the past few days, and I think it’s really fun.

First install the tool from PyPI, using pip3 install sqlite-utils.

Let’s start by grabbing a copy of the russian-ads.db database I created in Analyzing US Election Russian Facebook Ads (4MB):

$ cd /tmp
$ wget

We can see a list of tables in the database and their counts using the tables command:

$ sqlite-utils tables russian-ads.db --counts
[{"table": "ads", "count": 3498},
 {"table": "targets", "count": 1665},
 {"table": "ad_targets", "count": 36559},
 {"table": "ads_fts", "count": 3498},
 {"table": "ads_fts_segments", "count": 120},
 {"table": "ads_fts_segdir", "count": 1},
 {"table": "ads_fts_docsize", "count": 3498},
 {"table": "ads_fts_stat", "count": 1}]

By default, sqlite-utils outputs data as neatly formatted JSON. You can get CSV instead using the --csv option:

$ sqlite-utils tables russian-ads.db --counts --csv

Or if you want a pretty ASCII-art table, use --table (or the shortcut, -t):

$ sqlite-utils tables russian-ads.db --counts -t
table               count
----------------  -------
ads                  3498
targets              1665
ad_targets          36559
ads_fts              3498
ads_fts_segments      120
ads_fts_segdir          1
ads_fts_docsize      3498
ads_fts_stat            1

The table view is built on top of tabulate, which offers dozens of table variations. Run sqlite-utils tables --help for the full list—try --table --fmt=rst for output that can be pasted directly into a reStructuredText document (handy for writing documentation).

So far we’ve just looked at a list of tables. Let’s run a SQL query:

$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category"
[{"category": "accessing_facebook_on", "count(*)": 1},
 {"category": "age", "count(*)": 82},
 {"category": "and_must_also_match", "count(*)": 228},
 {"category": "army_reserve_industry", "count(*)": 3},
 {"category": "behaviors", "count(*)": 16},

Again, this can be output as CSV using --csv, or a table with --table.

The default JSON output is objects wrapped in an array. Use --arrays to get an array of arrays instead. More interestingly: --nl causes the data to be output as newline-delimited JSON, like this:

$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category" --nl
{"category": "accessing_facebook_on", "count(*)": 1}
{"category": "age", "count(*)": 82}
{"category": "and_must_also_match", "count(*)": 228}
{"category": "army_reserve_industry", "count(*)": 3}
{"category": "behaviors", "count(*)": 16}

This is a really interesting format for piping to other tools.
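Newline-delimited JSON is also trivial to consume from Python, since each line is a complete JSON document that can be parsed independently—you can stream results without loading everything into memory at once. A generic sketch (my own helper, not part of sqlite-utils):

```python
import json

def iter_ndjson(lines):
    """Yield one parsed object per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```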

Creating databases from JSON on the command-line

The sqlite-utils insert command can be used to create new tables by piping JSON or CSV directly into the tool. It’s the command-line equivalent of the .insert_all() Python function I demonstrated earlier.

Here’s how to create that meteorite database directly from the command-line:

$ curl "" | \
    sqlite-utils insert meteorites.db meteorites - --pk=id

This will use a SQLite database file called meteorites.db (creating one if it does not yet exist), create or use a table called meteorites and read the data from standard in (hence the pipe). You can pass a filename instead of a - here to read data from a file on disk.

The insert command accepts multiple formats—it defaults to expecting a JSON array of objects, but you can use --nl to accept newline-delimited JSON and --csv to accept CSV.

This means you can combine the tools! Let’s create a brand new database by exporting data from the old one, using newline-delimited JSON as the intermediary format:

$ sqlite-utils russian-ads.db \
    "select * from ads where text like '%veterans%'" --nl | \
    sqlite-utils insert veterans.db ads - --nl

This creates a new file called veterans.db containing an ads table with just the ads that mentioned veterans somewhere in their body text.

Since we’re working with JSON, we can introduce other command-line tools into the mix.

jq is a neat little tool for extracting data from a JSON file using its own mini domain-specific language.

The Nobel Prize API offers a JSON file listing all of the Nobel laureates—but they are contained as an array in a top level "laureates" key. sqlite-utils needs a flat array—so we can use jq to get exactly that:

$ curl "" | \
  jq ".laureates" | \
  sqlite-utils insert nobel.db laureates -

Now we have a file called nobel.db containing all of the Nobel laureates.

Since Datasette recently grew the ability to export newline-delimited JSON, we can also use this ability to directly consume data from Datasette. Let’s grab every episode of the Joy of Painting in which Bob Ross painted a beach:

$ curl "" \
| sqlite-utils insert bob.db beach_episodes - --nl

Plenty more features

As with the Python API, the sqlite-utils CLI tool has dozens of other options and extensive documentation.

I’ve been really enjoying growing an ecosystem of tools around Datasette. sqlite-utils is the keystone here: it’s fundamental to other tools I’m building, such as db-to-sqlite (which can export any SQLAlchemy-supported database directly to a SQLite file on disk).

I’ve found myself increasingly turning to SQLite first for all kinds of ad-hoc analysis, and I’m excited to try out these new command-line abilities of sqlite-utils for real-world data spelunking tasks.

Exploring search relevance algorithms with SQLite three months ago

SQLite isn’t just a fast, high quality embedded database: it also incorporates a powerful full-text search engine in the form of the FTS4 and FTS5 extensions. You’ve probably used these a bunch of times already: many iOS, Android and desktop applications use SQLite under-the-hood and use it to implement their built-in search.

I’ve been using these capabilities for basic search in Datasette for over a year now, but I’ve recently started digging into some of their more advanced features. It turns out hacking around with SQLite is a great way to learn more about how fundamental information retrieval algorithms work under the hood.

Today I’m releasing sqlite-fts4—a Python package that provides a collection of custom SQL functions for working with SQLite’s FTS4 module. It includes some neat tools for introspecting how relevancy ranking algorithms actually work.

Why not just use FTS5?

If it’s available to you FTS5 is usually the best option: it has a good ranking algorithm built in. I described how to use it to build fast autocomplete search for your website for the 2018 24 ways advent calendar. You can join directly against a virtual table and order by a pre-calculated relevance score accessible through that table.

What makes FTS4 interesting is that it doesn’t include a scoring mechanism: it instead exposes raw statistical data to you in a way that lets you build your own ranking functions.

You probably don’t need to do this—unless you are stuck on an older SQLite version that doesn’t support the latest features. But… if you’re interested in understanding more about how search actually works, the need to implement a ranking function is an excellent learning opportunity.

I’ll be demonstrating these functions using a hosted Datasette instance running at (with the data from my 24 ways article). You can play with them out there, or if you want to use your own Datasette instance you can enable these custom SQL functions by pip installing my new datasette-sqlite-fts4 plugin.

Raw FTS4 matchinfo() data

When using FTS4, the only scoring help SQLite gives you is the built-in matchinfo() function. For each document in your search result set, this function will expose raw statistical data that can be used to calculate a score.

Let’s try it out using the following query:

    *, matchinfo(articles_fts, "pcx")
    articles_fts match :search

Run matchinfo() in Datasette

The pcx here is called the format string—it lets SQLite know what information about the match you would like to see.

The results are returned as a binary string! For the first matching document, we get back the following:


SQLite’s C heritage is showing through here.

decode_matchinfo() to decode the binary

The first step in working with matchinfo is to decode that binary string. It’s actually a sequence of unsigned 32 bit integers. We can turn it into a Python list of integers using the following:

struct.unpack("I" * (len(matchinfo) // 4), matchinfo)

sqlite-fts4 exposes a SQL function called decode_matchinfo() which does exactly this. Let’s expand our example to use it:

    select
        title, author,
        decode_matchinfo(matchinfo(articles_fts, "pcx")),
        matchinfo(articles_fts, "pcx")
    from articles_fts
    where articles_fts match :search

Run decode_matchinfo() in Datasette

The matchinfo for our first matching document now looks like this:

[2, 3, 0, 2, 2, 0, 0, 0, 1, 163, 31, 0, 2, 2, 0, 0, 0, 2, 92, 21]

Better, but still obscure. What does it mean?

The answer lies in the SQLite matchinfo documentation. In our format string, we requested p, c and x:

  • p requests a single integer representing the number of search terms we are matching. Since our search query is jquery maps this is 2—it’s the first integer in the list.
  • c requests the number of searchable columns in our table. We created articles_fts with 3 columns, so it’s 3. That’s the second integer in the list.
  • x is much more interesting: it returns 3 integer values for each term/column combination. Since we have 2 terms and 3 columns that means we get back 6 * 3 = 18 integers. If you count the items in the array above you’ll see there are 18 left after you remove the first two. Each triple represents the number of times the term appears in the current column, the number of times it appears in this column across every row and the number of total documents that match the term in this column at least once.

Search relevancy scores are usually calculated against exactly this kind of collection of statistics: we rank based on how rare the matching terms are across the rest of the corpus.
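As a rough illustration of how a score can be assembled from those per-term/per-column triples, here's a simplified TF/IDF-style sketch. This is my own toy formula to show the shape of the calculation, not the exact code in sqlite-fts4's rank_score() (note it also needs the total document count, which the n format character provides):

```python
import math

def simple_tfidf_score(matches, total_docs):
    """matches: list of (hits_this_column_this_row, hits_this_column_all_rows,
    docs_with_hits) triples, one per term/column combination."""
    score = 0.0
    for tf, _global_hits, docs_with_hits in matches:
        if tf and docs_with_hits:
            # Rarer terms (small docs_with_hits) contribute more to the score
            score += tf * math.log(total_docs / docs_with_hits)
    return score
```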

annotate_matchinfo() to annotate the integers

Having a list of integers made things easier, but still not easy enough. That’s where annotate_matchinfo() comes in. This custom SQL function expands the matchinfo list of integers into a giant JSON object describing exactly what each of the results means.

We’ll try it out like this:

    select
        title, author,
        decode_matchinfo(matchinfo(articles_fts, "pcx")),
        json_object("pre", annotate_matchinfo(matchinfo(articles_fts, "pcx"), "pcx"))
    from articles_fts
    where articles_fts match :search

Run annotate_matchinfo() in Datasette

Note that we have to provide the format string twice, so that annotate_matchinfo() knows the requested order of the binary matchinfo data.

This returns a JSON object that looks like this:

  "p": {
    "title": "Number of matchable phrases in the query",
    "value": 2,
    "idx": 0
  "c": {
    "title": "Number of user defined columns in the FTS table",
    "value": 3,
    "idx": 1
  "x": {
    "title": "Details for each phrase/column combination"
    "value": [
        "phrase_index": 0,
        "column_index": 2,
        "hits_this_column_this_row": 1,
        "hits_this_column_all_rows": 163,
        "docs_with_hits": 31,
        "idxs": [8, 9, 10]
        "phrase_index": 1,
        "column_index": 0,
        "hits_this_column_this_row": 0,
        "hits_this_column_all_rows": 2,
        "docs_with_hits": 2,
        "idxs": [11, 12, 13]

Try it out with pcxnalyb to see the complete set of format string options.

You may be wondering why I wrapped that function call in json_object("pre", ...). This is a Datasette trick: I recently added the ability to pretty-print JSON to my datasette-html-json plugin—see that package’s README for details.

Building ranking functions

These statistics are everything we need to calculate relevance scores. sqlite-fts4 implements two such functions: rank_score() is a simple TF/IDF function. rank_bm25() is much more interesting—it’s an implementation of the Okapi BM25, inspired by the one that ships with the peewee ORM.

Let’s try them both out:

    select
        title, author,
        rank_score(matchinfo(articles_fts, "pcx")) as score,
        rank_bm25(matchinfo(articles_fts, "pcnalx")) as bm25,
        json_object("pre", annotate_matchinfo(matchinfo(articles_fts, "pcxnalyb"), "pcxnalyb"))
    from articles_fts
    where articles_fts match :search
    order by bm25

Try rank_score() and rank_bm25() in Datasette

You can switch the order by clause between bm25 and score to compare the two.

bm25() is definitely a better option. It’s the default algorithm used these days by Elasticsearch, and they wrote up an excellent explanation of how it works on their blog.
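The heart of BM25 is a per-term formula that damps term frequency and normalizes for document length. Here's a textbook rendering with the conventional constants (k1=1.2, b=0.75)—a sketch of the standard Okapi formula, not the exact code in sqlite-fts4:

```python
import math

def bm25_term_score(tf, docs_with_term, total_docs, doc_len, avg_doc_len,
                    k1=1.2, b=0.75):
    """Okapi BM25 contribution of one term to one document's score."""
    # IDF: rarer terms score higher; can go negative for very common
    # terms, which implementations often clamp at zero
    idf = math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency saturates as tf grows; b penalizes longer documents
    tf_component = (tf * (k1 + 1)) / (
        tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    )
    return idf * tf_component
```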

Take a look at the source code for the ranking functions to see how they are implemented. They work against the data structure returned by annotate_matchinfo() to try and make it clear what is going on.

Building the rank_bm25() function took me longer than I expected: I was comparing my results against bm25() from peewee to ensure I was getting them right, but I couldn’t get them to match. After some furious debugging I finally figured out the problem: peewee had a rare bug! I reported it to Charles Leifer and he analyzed it and turned around a fix in a matter of hours—it turns out the C library that peewee had ported to Python had the same problem.

Next steps

I’m really impressed with the flexibility that FTS4 provides—it turns out FTS5 isn’t the only worthwhile option for search in SQLite.

I’m thinking about ways to expose some of the bm25 tuning parameters (in particular the magic B and K1 constants explained by the Elasticsearch article) and I plan to upgrade Datasette’s search functionality to make ranking available as a first-class feature on the searchable table view.

I’m also generally excited about SQLite as a learning tool for exploring different search ranking mechanisms. Once you’ve decoded that binary matchinfo string it’s impressive how much you can get done with the underlying data.

Fast Autocomplete Search for Your Website four months ago

Every website deserves a great search engine—but building a search engine can be a lot of work, and hosting it can quickly get expensive.

I’m going to build a search engine for 24 ways that’s fast enough to support autocomplete (a.k.a. typeahead) search queries and can be hosted for free. I’ll be using wget, Python, SQLite, Jupyter, sqlite-utils and my open source Datasette tool to build the API backend, and a few dozen lines of modern vanilla JavaScript to build the interface.

Animated demo of autocomplete search against 24 ways

Try it out here, then read on to see how I built it.

First step: crawling the data

The first step in building a search engine is to grab a copy of the data that you plan to make searchable.

There are plenty of potential ways to do this: you might be able to pull it directly from a database, or extract it using an API. If you don’t have access to the raw data, you can imitate Google and write a crawler to extract the data that you need.

I’m going to do exactly that against 24 ways: I’ll build a simple crawler using wget, a command-line tool that features a powerful “recursive” mode that’s ideal for scraping websites.

We’ll start at the /archives/ page, which links to an archived index for every year that 24 ways has been running.

Then we’ll tell wget to recursively crawl the website, using the --recursive flag.

We don’t want to fetch every single page on the site—we’re only interested in the actual articles. Luckily, 24 ways has nicely designed URLs, so we can tell wget that we only care about pages that start with one of the years it has been running, using the -I argument like this: -I /2005,/2006,/2007,/2008,/2009,/2010,/2011,/2012,/2013,/2014,/2015,/2016,/2017

We want to be polite, so let’s wait for 2 seconds between each request rather than hammering the site as fast as we can: --wait 2

The first time I ran this, I accidentally downloaded the comments pages as well. We don’t want those, so let’s exclude them from the crawl using -X "/*/*/comments".

Finally, it’s useful to be able to run the command multiple times without downloading pages that we have already fetched. We can use the --no-clobber option for this.

Tie all of those options together and we get this command:

wget --recursive --wait 2 --no-clobber \
  -I /2005,/2006,/2007,/2008,/2009,/2010,/2011,/2012,/2013,/2014,/2015,/2016,/2017 \
  -X "/*/*/comments" 

If you leave this running for a few minutes, you’ll end up with a folder structure something like this:

$ find

As a quick sanity check, let’s count the number of HTML pages we have retrieved:

$ find | grep index.html | wc -l

There’s one last step! We got everything up to 2017, but we need to fetch the articles for 2018 (so far) as well. They aren’t linked in the /archives/ yet so we need to point our crawler at the site’s front page instead:

wget --recursive --wait 2 --no-clobber \
  -I /2018 \
  -X "/*/*/comments"

Thanks to --no-clobber, this is safe to run every day in December to pick up any new content.

We now have a folder on our computer containing an HTML file for every article that has ever been published on the site! Let’s use them to build ourselves a search index.

Building a search index using SQLite

There are many tools out there that can be used to build a search engine. You can use an open-source search server like Elasticsearch or Solr, a hosted option like Algolia or Amazon CloudSearch or you can tap into the built-in search features of relational databases like MySQL or PostgreSQL.

I’m going to use something that’s less commonly used for web applications but makes for a powerful and extremely inexpensive alternative: SQLite.

SQLite is the world’s most widely deployed database, even though many people have never even heard of it. That’s because it’s designed to be used as an embedded database: it’s commonly used by native mobile applications and even runs as part of the default set of apps on the Apple Watch!

SQLite has one major limitation: unlike databases like MySQL and PostgreSQL, it isn’t really designed to handle large numbers of concurrent writes. For this reason, most people avoid it for building web applications.

This doesn’t matter nearly so much if you are building a search engine for infrequently updated content—say one for a site that only publishes new content on 24 days every year.

It turns out SQLite has very powerful full-text search functionality built into the core database—the FTS5 extension.

I’ve been doing a lot of work with SQLite recently, and as part of that, I’ve been building a Python utility library to make building new SQLite databases as easy as possible, called sqlite-utils. It’s designed to be used within a Jupyter notebook—an enormously productive way of interacting with Python code that’s similar to the Observable notebooks Natalie described on 24 ways yesterday.

If you haven’t used Jupyter before, here’s the fastest way to get up and running with it—assuming you have Python 3 installed on your machine. We can use a Python virtual environment to ensure the software we are installing doesn’t clash with any other installed packages:

$ python3 -m venv ./jupyter-venv
$ ./jupyter-venv/bin/pip install jupyter
# ... lots of installer output
# Now lets install some extra packages we will need later
$ ./jupyter-venv/bin/pip install beautifulsoup4 sqlite-utils html5lib
# And start the notebook web application
$ ./jupyter-venv/bin/jupyter-notebook
# This will open your browser to Jupyter at http://localhost:8888/

You should now be in the Jupyter web application. Click New -> Python 3 to start a new notebook.

A neat thing about Jupyter notebooks is that if you publish them to GitHub (either in a regular repository or as a Gist), it will render them as HTML. This makes them a very powerful way to share annotated code. I’ve published the notebook I used to build the search index on my GitHub account.

Juptyer notebook with my scraping code

Here’s the Python code I used to scrape the relevant data from the downloaded HTML files. Check out the notebook for a line-by-line explanation of what’s going on.

from pathlib import Path
from bs4 import BeautifulSoup as Soup

base = Path("/Users/simonw/Dropbox/Development/24ways-search")
articles = list(base.glob("*/*/*/*.html"))
# articles is now a list of paths that look like this:
# PosixPath('...24ways-search/')
docs = []
for path in articles:
    year = str(path.relative_to(base)).split("/")[1]
    url = 'https://' + str(path.relative_to(base).parent) + '/'
    soup = Soup(, "html5lib")
    author = soup.select_one(".c-continue")["title"].split(
        "More information about"
    )[1].strip()
    author_slug = soup.select_one(".c-continue")["href"].split(
        "/"
    )[-2]
    published = soup.select_one(".c-meta time")["datetime"]
    contents = soup.select_one(".e-content").text.strip()
    title = soup.find("title").text.split(" ◆")[0]
    try:
        topic = soup.select_one(
            '.c-meta a[href^="/topics/"]'
        )["href"].split("/")[-2]
    except TypeError:
        topic = None
    docs.append({
        "title": title,
        "contents": contents,
        "year": year,
        "author": author,
        "author_slug": author_slug,
        "published": published,
        "url": url,
        "topic": topic,
    })
After running this code, I have a list of Python dictionaries representing each of the documents that I want to add to the index. The list looks something like this:

[
    {
        "title": "Why Bother with Accessibility?",
        "contents": "Web accessibility (known in other fields as inclus...",
        "year": "2013",
        "author": "Laura Kalbag",
        "author_slug": "laurakalbag",
        "published": "2013-12-10T00:00:00+00:00",
        "url": "",
        "topic": "design"
    },
    {
        "title": "Levelling Up",
        "contents": "Hello, 24 ways. I’m Ashley and I sell property ins...",
        "year": "2013",
        "author": "Ashley Baxter",
        "author_slug": "ashleybaxter",
        "published": "2013-12-06T00:00:00+00:00",
        "url": "",
        "topic": "business"
    },
    ...
]

My sqlite-utils library has the ability to take a list of objects like this and automatically create a SQLite database table with the right schema to store the data. Here’s how to do that using this list of dictionaries.

import sqlite_utils
db = sqlite_utils.Database("/tmp/24ways.db")
db["articles"].insert_all(docs)

That’s all there is to it! The library will create a new database and add a table to it called articles with the necessary columns, then insert all of the documents into that table.

(I put the database in /tmp/ for the moment—you can move it to a more sensible location later on.)

You can inspect the table using the sqlite3 command-line utility (which comes with OS X) like this:

$ sqlite3 /tmp/24ways.db
sqlite> .headers on
sqlite> .mode column
sqlite> select title, author, year from articles;
title                           author        year      
------------------------------  ------------  ----------
Why Bother with Accessibility?  Laura Kalbag  2013      
Levelling Up                    Ashley Baxte  2013      
Project Hubs: A Home Base for   Brad Frost    2013      
Credits and Recognition         Geri Coady    2013      
Managing a Mind                 Christopher   2013      
Run Ragged                      Mark Boulton  2013      
Get Started With GitHub Pages   Anna Debenha  2013      
Coding Towards Accessibility    Charlie Perr  2013      
<Ctrl+D to quit>

There’s one last step to take in our notebook. We know we want to use SQLite’s full-text search feature, and sqlite-utils has a simple convenience method for enabling it for a specified set of columns in a table. We want to be able to search by the title, author and contents fields, so we call the enable_fts() method like this:

db["articles"].enable_fts(["title", "author", "contents"])
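Under the hood, enable_fts() creates a separate articles_fts virtual table (FTS4 or FTS5 depending on your SQLite build) indexing just those columns, and keeps it in sync with the articles table. Here’s a rough stdlib-only sketch of the equivalent, with made-up rows, so you can see what queries against that table look like:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Simulate the virtual table that enable_fts() would create
db.executescript("""
    CREATE VIRTUAL TABLE articles_fts USING fts4(title, author, contents);
    INSERT INTO articles_fts VALUES
        ('Why Bother with Accessibility?', 'Laura Kalbag',
         'Web accessibility, known in other fields as inclusive design'),
        ('Levelling Up', 'Ashley Baxter', 'I sell property insurance');
""")
# MATCH against the table name searches across all indexed columns
rows = db.execute(
    "SELECT title FROM articles_fts WHERE articles_fts MATCH 'accessibility'"
).fetchall()
print(rows)  # [('Why Bother with Accessibility?',)]
```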

Introducing Datasette

Datasette is the open-source tool I’ve been building that makes it easy to both explore SQLite databases and publish them to the internet.

We’ve been exploring our new SQLite database using the sqlite3 command-line tool. Wouldn’t it be nice if we could use a more human-friendly interface for that?

If you don’t want to install Datasette right now, you can visit to try it out against the 24 ways search index data. I’ll show you how to deploy Datasette to Heroku like this later in the article.

If you want to install Datasette locally, you can reuse the virtual environment we created to play with Jupyter:

./jupyter-venv/bin/pip install datasette

This will install Datasette in the ./jupyter-venv/bin/ folder. You can also install it system-wide using regular pip install datasette.

Now you can run Datasette against the 24ways.db file we created earlier like so:

./jupyter-venv/bin/datasette /tmp/24ways.db

This will start a local webserver running. Visit http://localhost:8001/ to start interacting with the Datasette web application.

If you want to try out Datasette without creating your own 24ways.db file you can download the one I created directly from

Publishing the database to the internet

One of the goals of the Datasette project is to make deploying data-backed APIs to the internet as easy as possible. Datasette has a built-in command for this, datasette publish. If you have an account with Heroku or Zeit Now, you can deploy a database to the internet with a single command. Here’s how I deployed (running on Heroku’s free tier) using datasette publish:

$ ./jupyter-venv/bin/datasette publish heroku /tmp/24ways.db --name search-24ways
-----> Python app detected
-----> Installing requirements with pip

-----> Running post-compile hook
-----> Discovering process types
       Procfile declares types -> web

-----> Compressing...
       Done: 47.1M
-----> Launching...
       Released v8 deployed to Heroku

If you try this out, you’ll need to pick a different --name, since I’ve already taken search-24ways.

You can run this command as many times as you like to deploy updated versions of the underlying database.

Searching and faceting

Datasette can detect tables with SQLite full-text search configured, and will add a search box directly to the page. Take a look at to see this in action.

Datasette faceted browse

SQLite search supports wildcards, so if you want autocomplete-style search where you don’t need to enter full words to start getting results you can add a * to the end of your search term. Here’s a search for access* which returns articles on accessibility:

A neat feature of Datasette is the ability to calculate facets against your data. Here’s a page showing search results for svg with facet counts calculated against both the year and the topic columns:

Every page visible via Datasette has a corresponding JSON API, which can be accessed using the JSON link on the page—or by adding a .json extension to the URL:

Better search using custom SQL

The search results we get back from ../articles?_search=svg are OK, but the order they are returned in is not ideal—they’re actually being returned in the order they were inserted into the database! You can see why this is happening by clicking the View and edit SQL link on that search results page.

This exposes the underlying SQL query, which looks like this:

select rowid, * from articles where rowid in (
  select rowid from articles_fts where articles_fts match :search
) order by rowid limit 101

We can do better than this by constructing a custom SQL query. Here’s the query we will use instead:

select
  snippet(articles_fts, -1, 'b4de2a49c8', '8c94a2ed4b', '...', 100) as snippet,
  articles_fts.rank, articles.title, articles.url,, articles.year
from articles
  join articles_fts on articles.rowid = articles_fts.rowid
where articles_fts match :search || "*"
  order by rank limit 10;

You can try this query out directly—since Datasette opens the underlying SQLite database in read-only mode and enforces a one second time limit on queries, it’s safe to allow users to provide arbitrary SQL select queries for Datasette to execute.

There’s a lot going on here! Let’s break the SQL down line-by-line:

  snippet(articles_fts, -1, 'b4de2a49c8', '8c94a2ed4b', '...', 100) as snippet,

We’re using snippet(), a built-in SQLite function, to generate a snippet highlighting the words that matched the query. We use two unique strings that I made up to mark the beginning and end of each match—you’ll see why in the JavaScript later on.

  articles_fts.rank, articles.title, articles.url,, articles.year

These are the other fields we need back—most of them are from the articles table but we retrieve the rank (representing the strength of the search match) from the magical articles_fts table.

from articles
  join articles_fts on articles.rowid = articles_fts.rowid

articles is the table containing our data. articles_fts is a magic SQLite virtual table which implements full-text search—we need to join against it to be able to query it.

where articles_fts match :search || "*"
  order by rank limit 10;

:search || "*" takes the ?search= argument from the page querystring and adds a * to the end of it, giving us the wildcard search that we want for autocomplete. We then match that against the articles_fts table using the match operator. Finally, we order by rank so that the best matching results are returned at the top—and limit to the first 10 results.
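You can see snippet() and rank in action with a tiny self-contained FTS5 example (using made-up rows rather than the real 24 ways data, and brackets instead of the unique marker strings):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE VIRTUAL TABLE articles_fts USING fts5(title, contents);
    INSERT INTO articles_fts VALUES
        ('Why Bother with Accessibility?', 'Web accessibility matters on every site'),
        ('Levelling Up', 'Selling property insurance online');
""")
# snippet(table, column, before, after, ellipsis, max_tokens): -1 means
# "whichever column matched"; ordering by rank puts best matches first
row = db.execute("""
    select
      snippet(articles_fts, -1, '[', ']', '...', 8) as snippet,
      rank, title
    from articles_fts
    where articles_fts match 'access' || '*'
    order by rank limit 10
""").fetchone()
print(row[2])  # Why Bother with Accessibility?
```

The snippet comes back with the matched word wrapped in the marker strings, which is what makes the highlighting step in the JavaScript possible.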

How do we turn this into an API? As before, the secret is to add the .json extension. Datasette actually supports multiple shapes of JSON—we’re going to use ?_shape=array to get back a plain array of objects:

JSON API call to search for articles matching SVG

The HTML version of that page shows the time taken to execute the SQL in the footer. Hitting refresh a few times, I get response times between 2 and 5ms—easily fast enough to power a responsive autocomplete feature.
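If you want to call this API from something other than a browser, the URL is just the SQL plus the two parameters, properly encoded. A sketch using only the Python standard library (the base URL here is a placeholder; substitute your own Datasette instance):

```python
from urllib.parse import urlencode

sql = """select
  snippet(articles_fts, -1, 'b4de2a49c8', '8c94a2ed4b', '...', 100) as snippet,
  articles_fts.rank, articles.title, articles.url,, articles.year
from articles
  join articles_fts on articles.rowid = articles_fts.rowid
where articles_fts match :search || "*"
  order by rank limit 10"""

# Placeholder host: point this at your own deployed Datasette database
base = "https://your-datasette.example.com/24ways.json"
url = base + "?" + urlencode({"sql": sql, "search": "svg", "_shape": "array"})
print(url[:60])
```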

A simple JavaScript autocomplete search interface

I considered building this using React or Svelte or another of the myriad of JavaScript framework options available today, but then I remembered that vanilla JavaScript in 2018 is a very productive environment all on its own.

We need a few small utility functions: first, a classic debounce function adapted from this one by David Walsh:

function debounce(func, wait, immediate) {
  let timeout;
  return function() {
    let context = this, args = arguments;
    let later = () => {
      timeout = null;
      if (!immediate) func.apply(context, args);
    };
    let callNow = immediate && !timeout;
    clearTimeout(timeout);
    timeout = setTimeout(later, wait);
    if (callNow) func.apply(context, args);
  };
}
We’ll use this to only send fetch() requests a maximum of once every 100ms while the user is typing.

Since we’re rendering data that might include HTML tags (24 ways is a site about web development after all), we need an HTML escaping function. I’m amazed that browsers still don’t bundle a default one of these:

const htmlEscape = (s) => s.replace(
  /&/g, '&amp;'
).replace(
  />/g, '&gt;'
).replace(
  /</g, '&lt;'
).replace(
  /"/g, '&quot;'
).replace(
  /'/g, '&#039;'
);
We need some HTML for the search form, and a div in which to render the results:

<h1>Autocomplete search</h1>
<form>
  <p><input id="searchbox" type="search" placeholder="Search 24ways" style="width: 60%"></p>
</form>
<div id="results"></div>

And now the autocomplete implementation itself, as a glorious, messy stream-of-consciousness of JavaScript:

// Embed the SQL query in a multi-line backtick string:
const sql = `select
  snippet(articles_fts, -1, 'b4de2a49c8', '8c94a2ed4b', '...', 100) as snippet,
  articles_fts.rank, articles.title, articles.url,, articles.year
from articles
  join articles_fts on articles.rowid = articles_fts.rowid
where articles_fts match :search || "*"
  order by rank limit 10`;

// Grab a reference to the <input type="search">
const searchbox = document.getElementById("searchbox");

// Used to avoid race-conditions:
let requestInFlight = null;

searchbox.onkeyup = debounce(() => {
  const q = searchbox.value;
  // Construct the API URL, using encodeURIComponent() for the parameters
  const url = (
    "" +
    encodeURIComponent(sql) +
    "&search=" + encodeURIComponent(q) +
    "&_shape=array"
  );
  // Unique object used just for race-condition comparison
  let currentRequest = {};
  requestInFlight = currentRequest;
  fetch(url).then(r => r.json()).then(d => {
    if (requestInFlight !== currentRequest) {
      // Avoid race conditions where a slow request returns
      // after a faster one.
      return;
    }
    let results = => `
      <div class="result">
        <h3><a href="${r.url}">${htmlEscape(r.title)}</a></h3>
        <p><small>${htmlEscape(} - ${r.year}</small></p>
        <p>${highlight(r.snippet)}</p>
      </div>
    `).join("");
    document.getElementById("results").innerHTML = results;
  });
}, 100); // debounce every 100ms

There’s just one more utility function, used to help construct the HTML results:

const highlight = (s) => htmlEscape(s).replace(
  /b4de2a49c8/g, '<b>'
).replace(
  /8c94a2ed4b/g, '</b>'
);
This is what those unique strings passed to the snippet() function were for.

Avoiding race conditions in autocomplete

One trick in this code that you may not have seen before is the way race-conditions are handled. Any time you build an autocomplete feature, you have to consider the following case:

  • User types acces
  • Browser sends request A—querying documents matching acces*
  • User continues to type accessibility
  • Browser sends request B—querying documents matching accessibility*
  • Request B returns. It was fast, because there are fewer documents matching the full term
  • The results interface updates with the documents from request B, matching accessibility*
  • Request A returns results (this was the slower of the two requests)
  • The results interface updates with the documents from request A—results matching access*

This is a terrible user experience: the user saw their desired results for a brief second, and then had them snatched away and replaced with those results from earlier on.

Thankfully there’s an easy way to avoid this. I set up a variable in the outer scope called requestInFlight, initially set to null.

Any time I start a new fetch() request, I create a new currentRequest = {} object and assign it to the outer requestInFlight as well.

When the fetch() completes, I use requestInFlight !== currentRequest to sanity check that the currentRequest object is strictly identical to the one that was in flight. If a new request has been triggered since we started the current request we can detect that and avoid updating the results.
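The same token-comparison trick works in any language with concurrent requests. Here’s a hypothetical Python asyncio version of it, with a deliberately slow first request and a fast second one standing in for the two fetch() calls:

```python
import asyncio

request_in_flight = None
rendered = []

async def fake_fetch(query, delay):
    # Stand-in for fetch(): broader queries take longer to return
    await asyncio.sleep(delay)
    return f"results for {query}"

async def search(query, delay):
    global request_in_flight
    current = object()  # unique token, like the {} object in the JavaScript
    request_in_flight = current
    data = await fake_fetch(query, delay)
    if request_in_flight is not current:
        return  # a newer request superseded this one; drop the results
    rendered.append(data)

async def main():
    # Request A ("acces") is slow; request B ("accessibility") is fast
    await asyncio.gather(
        search("acces", 0.2),
        search("accessibility", 0.05),
    )

asyncio.run(main())
print(rendered)  # ['results for accessibility']
```

Only the most recent request’s results survive: the slow response for acces* comes back last but is discarded because its token no longer matches.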

It’s not a lot of code, really

And that’s the whole thing! The code is pretty ugly, but when the entire implementation clocks in at fewer than 70 lines of JavaScript, I honestly don’t think it matters. You’re welcome to refactor it as much as you like.

How good is this search implementation? I’ve been building search engines for a long time using a wide variety of technologies and I’m happy to report that using SQLite in this way is genuinely a really solid option. It scales happily up to hundreds of MBs (or even GBs) of data, and the fact that it’s based on SQL makes it easy and flexible to work with.

A surprisingly large number of desktop and mobile applications you use every day implement their search feature on top of SQLite.

More importantly though, I hope that this demonstrates that using Datasette for an API means you can build relatively sophisticated API-backed applications with very little backend programming effort. If you’re working with a small-to-medium amount of data that changes infrequently, you may not need a more expensive database. Datasette-powered applications easily fit within the free tier of both Heroku and Zeit Now.

For more of my writing on Datasette, check out the datasette tag on my blog. And if you do build something fun with it, please let me know on Twitter.

This article originally appeared on 24ways.



  • Story Structure 104: The Juicy Details. Dan Harmon (Community, Rick and Morty) wrote a fascinating series of essays on story structure for his Channel 101 film festival project. It’s worth reading the whole series, but this chapter is where things get really detailed. #

23rd April 2019

  • Language support on Glitch: a list (via) This is really useful: it’s essentially “Glitch: the missing manual” for running languages other than JavaScript. The Glitch community forums are a gold mine of useful information like this. #

21st April 2019

  • Lots of people calling for more aggressive moderation seem to imagine that if they yell enough the companies have a thoughtful, unbiased and nuance-understanding HAL 9000 they can deploy. It’s really more like the Censorship DMV.

    Alex Stamos #

19th April 2019

18th April 2019

  • Using the HTML lang attribute (via) TIL the HTML lang attribute is used by screen readers to understand how to provide the correct accent and pronunciation. #
  • How Zoom’s web client avoids using WebRTC (via) It turns out video conferencing app Zoom uses their own WebAssembly compiled video and audio codecs and transmits H264 over WebSockets. #
  • An Intro to Threading in Python (via) Real Python consistently produces really comprehensive, high quality articles and tutorials. This is an excellent introduction to threading in Python, covering threads, locks, queues, ThreadPoolExecutor and more. #

17th April 2019

  • Pyodide: Bringing the scientific Python stack to the browser (via) More fun with WebAssembly: Pyodide attempts (and mostly succeeds) to bring the full Python data stack to the browser: CPython, NumPy, Pandas, Scipy, and Matplotlib. Also includes interesting bridge tools for e.g. driving a canvas element from Python. Really interesting project from the Firefox Data Platform team. #

16th April 2019

  • Wasmer: a Python library for executing WebAssembly binaries. This is a really interesting new tool: “pip install wasmer” and now you can load code that has been compiled to WebAssembly and call those functions directly from Python. It’s built on top of the wasmer universal WebAssembly runtime, written over just the past year in Rust by a team led by Syrus Akbary, the author of the Graphene GraphQL library for Python. #
  • ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} (via) Andrew Gallant’s post from September 2016 introducing ripgrep, the command-line grep tool he wrote using Rust (on top of the Rust regular expression library also written by Andrew). ripgrep is a beautifully designed CLI interface and is crazy fast, and this post describes how it gets its performance in a huge amount of detail, right down to comparing the different algorithmic approaches used by other similar tools. I recently learned that ripgrep ships as part of VS Code, which is why VS Code’s search-across-project feature is so fast. In fact, if you dig around in the OS X package you can find the rg binary already installed on your Mac: find /Applications/Visual* | grep bin/rg #

13th April 2019

  • Datasette: ?_where=sql-fragment parameter for table views. I just shipped a tiny but really useful new feature to Datasette master: you can now add ?_where=sql-fragment on to the URL of any table view to inject additional SQL directly into the underlying WHERE clause. This tiny feature actually has some really interesting applications: I created this because I wanted to be able to run more complex custom SQL queries without losing access to the conveniences of Datasette’s table view, in particular the built-in faceting support. The feature actually fits in well with Datasette’s philosophy of allowing arbitrary SQL to be executed against a read-only database: you can turn this ability off using the allow_sql config flag. #

11th April 2019

  • How to Create an Index in Django Without Downtime (via) Excellent advanced tutorial on Django migrations, which uses a desire to create indexes in PostgreSQL without locking the table (with CREATE INDEX CONCURRENTLY) to explain the SeparateDatabaseAndState and atomic features of Django’s migration framework. #
  • Using 6 Page and 2 Page Documents To Make Organizational Decisions (via) I’ve been thinking a lot recently about the challenges of efficiently getting to consensus within a larger organization spread across multiple locations and time zones. This model described by Ian Nowland based on his experience at AWS seems very promising. The goal is to achieve a decision or “disagree and commit” consensus using a max 6 page document and a one hour meeting. The first fifteen minutes of the meeting are dedicated to silently reading the document—if you’ve read it already you are given the option of arriving fifteen minutes late. #

9th April 2019

  • Ministry of Silly Runtimes: Vintage Python on Cloud Run (via) Cloud Run is an exciting new hosting service from Google that lets you define a container using a Dockerfile and then run that container in a “scale to zero” environment, so you only pay for time spent serving traffic. It’s similar to the now-deprecated Zeit Now 1.0 which inspired me to create Datasette. Here Dustin Ingram demonstrates how powerful Docker can be as the underlying abstraction by deploying a web app using a 25 year old version of Python 1.x. #
  • Generator Tricks for Systems Programmers (via) David Beazley’s definitive generators tutorial from 2008, updated for Python 3.7 in October 2018. #
  • In the five years since the shark was erected, no other examples have occurred … any system of control must make some small place for the dynamic, the unexpected, the downright quirky. I therefore recommend that the Headington Shark be allowed to remain.

    Peter Macdonald #