Simon Willison’s Weblog

On scaling, postgresql, facebook, aws, react, ...

 

Recent entries

Generating a commit log for San Francisco’s official list of trees 11 days ago

San Francisco has a neat open data portal (as do an increasingly large number of cities these days). For a few years my favourite file on there has been Street Tree List, a list of all 190,000 trees in the city maintained by the Department of Public Works.

I’ve been using that file for Datasette demos for a while now, but last week I noticed something intriguing: the file had been recently updated. On closer inspection it turns out it’s updated on a regular basis! I had assumed it was a static snapshot of trees at a certain point in time, but I was wrong: Street_Tree_List.csv is a living document.

Back in September 2017 I built a scraping project relating to hurricane Irma. The idea was to take data sources like FEMA’s list of open shelters and track them over time, by scraping them into a git repository and committing after every fetch.

I’ve been meaning to spend more time with this idea, and building a commit log for San Francisco’s trees looked like an ideal opportunity to do so.

sf-tree-history

Here’s the result: sf-tree-history, a git repository dedicated to recording the history of changes made to the official list of San Francisco’s trees. The repo contains three things: the latest copy of Street_Tree_List.csv, a README, and a Circle CI configuration that grabs a new copy of the file every night and, if it has changed, commits it to git and pushes the result to GitHub.

The most interesting part of the repo is the commit history itself. I’ve only been running the script for just over a week, but I already have some useful illustrative commits:

  • 7ab432cdcb8d7914cfea4a5b59803f38cade532b from March 6th records three new trees added to the file: two Monterey Pines and a Blackwood Acacia.
  • d6b258959af9546909b2eee836f0156ed88cd45d from March 12th shows four changes made to existing records. Of particular interest: TreeID 235981 (a Cherry Plum) had its address updated from 412 Webster St to 410 Webster St and its latitude and longitude tweaked a little bit as well.
  • ca66d9a5fdd632549301d249c487004a5b68abf2 lists 2151 rows changed, 1280 rows added! I found an old copy of Street_Tree_List.csv on my laptop from April 2018, so for fun I loaded it into the repository and used git commit amend to back-date the commit to almost a year ago. I generated a commit message between that file and the version from 9 days ago which came in at around 10,000 lines of text. Git handled that just fine, but GitHub’s web view sadly truncates it.

csv-diff

One of the things I learned from my hurricane Irma project was the importance of human-readable commit messages that summarize the detected changes. I initially wrote some code to generate those by hand, but then realized that this could be extracted into a reusable tool.

The result is csv-diff, a tiny Python CLI tool which can generate a human (or machine) readable version of the differences between two CSV files.

Using it looks like this:

$ csv-diff one.csv two.csv --key=id
1 row added, 1 row removed, 1 row changed

1 row added

  {"id": "3", "name": "Bailey", "age": "1"}

1 row removed

  {"id": "2", "name": "Pancakes", "age": "2"}

1 row changed

  Row 1
    age: "4" => "5"

The csv-diff README has further details on the tool.

Circle CI

My favourite thing about the sf-tree-history project is that it costs me nothing to run—either in hosting costs or (hopefully) in terms of ongoing maintenance.

The git repository is hosted for free on GitHub. Because it’s a public project, Circle CI will run tasks against it for free.

My .circleci/config.yml does the rest. It uses Circle’s cron syntax to schedule a task that runs every night. The task then runs this script (embedded in the YAML configuration):

cp Street_Tree_List.csv Street_Tree_List-old.csv
curl -o Street_Tree_List.csv "https://data.sfgov.org/api/views/tkzw-k3nq/rows.csv?accessType=DOWNLOAD"
git add Street_Tree_List.csv
git config --global user.email "treebot@example.com"
git config --global user.name "Treebot"
sudo pip install csv-diff
csv-diff Street_Tree_List-old.csv Street_Tree_List.csv --key=TreeID > message.txt
git commit -F message.txt && \
  git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master \
  || true

This script does all of the work.

  1. First it backs up the existing Street_Tree_list.csv as Street_Tree_List-old.csv, in order to be able to run a comparison later.
  2. It downloads the latest copy of Street_Tree_List.csv from the San Francisco data portal
  3. It adds the file to the git index and sets itself an identity for use in the commit
  4. It installs my csv-diff utility from PyPI
  5. It uses csv-diff to create a diff of the two files, and writes that diff to a new file called message.txt
  6. Finally, it attempts to create a new commit using message.txt as the commit message, then pushes the result to GitHub

The last line is the most complex. Circle CI will mark a build as failed if any of the commands in the run block return a non-0 exit code. git commit returns a non-0 exit code if you attempt to run it but none of the files have changed.

git commit ... && git push ... || true ensures that if git commit succeeds the git push command will be run, BUT if it fails the || true will still return a 0 exit code for the overall line—so Circle CI will not mark the build as failed.

There’s one last trick here: I’m using git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master to push my changes to GitHub. This takes advantage of Circle CI environment variables, which are the recommended way to configure secrets such that they cannot be viewed by anyone browsing your Circle CI builds. I created a personal GitHub auth token for this project, which I’m using to allow Circle CI to push commits to GitHub on my behalf.

Next steps

I’m really excited about this pattern of using GitHub in combination with Circle CI to track changes to any file that is being posted on the internet. I’m opening up the code (and my csv-diff utility) in the hope that other people will use them to set up their own tracking projects. Who knows, maybe there’s a file out there that’s even more exciting than San Francisco’s official list of trees!

I commissioned an oil painting of Barbra Streisand’s cloned dogs 17 days ago

Two dogs in a stroller looking at a gravestone, as an oil painting
Two identical puffs of white fur, gazing at the tombstone of the dog they are

Last year, Barbra Streisand cloned her dog, Sammie.

The story is fascinating, as is the background reading on dog cloning from Vanity Fair. But the thing that really stuck with me was the photograph that accompanied “Barbra Streisand Explains: Why I Cloned My Dog” in the New York Times:

Two dogs in a stroller looking at a gravestone

David Ewing Duncan in Vanity Fair described the scenario like this: Barbra Streisand, visiting the grave of her beloved Sammie, with Miss Violet and Miss Scarlett perched next to her in their stroller—two identical puffs of white fur, gazing at the tombstone of the dog they are.

This photograph says so much about the age that we live in. I couldn’t get it out of my head.

I’ve long been fascinated by Dafen, the town in a China that was once responsible for 60% of the world’s oil paintings—mostly replicas, but today increasingly original artwork.

I always wanted to commission an oil painting from China, but I never quite found the right subject... until now.

There’s something deliciously appropriate about using a painting cloning service to clone a photograph of some cloned dogs.

So I uploaded a copy of the photo to Instapainting and entered a few extra instructions:

Please paint the stroller closer to the gravestone—adjust the composition so that it fits the 12x16 dimensions while maintaining the two key elements of the image: the stroller with the two dogs in it and the gravestone that they are looking at

A couple of months later, a tube arrived in the mail. I opened it up... and they had nailed it. If anything the painting is an improvement on the original photograph: the composition is tighter, the stroller no longer has its wheel cut off, some dead plants in the background (which I had not even noticed) are given a bit more prominence, and the little doggy faces have exactly the right expressions of mild existential dread.

So thank you Alice Wu at Xiamen Dearchic Arts—I could not be happier.

With a painting this good, obviously it needed to be framed. I took it to Underglass Framing in San Francisco’s Hayes Valley and told them I was looking for something with an air of existential dread. “I think we can do that” they said.

Two dogs in a stroller looking at a gravestone, as an oil painting in an intimidating frame

Natalie says I can keep it in the guest bathroom.

sqlite-utils: a Python library and CLI tool for building SQLite databases 28 days ago

sqlite-utils is a combination Python library and command-line tool I’ve been building over the past six months which aims to make creating new SQLite databases as quick and easy as possible.

It’s part of the ecosystem of tools I’m building around my Datasette project.

I spent the weekend adding all kinds of exciting command-line options to it, so I’m ready to describe it to the world.

A Python library for quickly creating databases

A core idea behind Datasette is that SQLite is the ideal format for publishing all kinds of interesting structured data. Datasette takes any SQLite database and adds a browsable web interface, a JSON API and the ability to export tables and queries as CSV.

The other half of the equation then is tools to create SQLite databases. csvs-to-sqlite was my first CLI attempt at this. sqlite-utils takes a much more flexible and comprehensive approach.

I started working on sqlite-utils last year as part of my project to Analyze US Election Russian Facebook Ads. The initial aim was to build a library that made constructing new SQLite databases inside of a Jupyter notebook as productive as possible.

The core idea behind the library is that you can give it a list of Python dictionaries (equivalent to JSON objects) and it will automatically create a SQLite table with the correct schema, then insert those items into the new table.

To illustrate, let’s create a database using this JSON file of meteorite landings released by NASA (discovered via awesome-json-datasets curated by Justin Dorfman).

Here’s the quickest way in code to turn that into a database:

import requests
import sqlite_utils

db = sqlite_utils.Database("meteorites.db")
db["meteorites"].insert_all(
    requests.get(
        "https://data.nasa.gov/resource/y77d-th95.json"
    ).json(),
    pk="id"
)

This two lines of code creates a new SQLite database on disk called meteorites.db, creates a table in that file called meteorites, detects the necessary columns based on the incoming data, inserts all of the rows and sets the id column up as the primary key.

To see the resulting database, run datasette meteorites.db and browse to http://127.0.0.1:8001/.

You can do a lot more with the library. You can create tables, insert and upsert data in bulk, configure foreign key relationships, configure SQLite full-text search and much more. I encourage you to consult the documentation for all of the details.

The sqlite-utils command-line tool

This is the new stuff built over the past few days, and I think it’s really fun.

First install the tool from PyPI, using pip3 install sqlite-utils.

Let’s start by grabbing a copy of the russian-ads.db database I created in Analyzing US Election Russian Facebook Ads (4MB):

$ cd /tmp
$ wget https://static.simonwillison.net/static/2019/russian-ads.db

We can see a list of tables in the database and their counts using the tables command:

$ sqlite-utils tables russian-ads.db --counts
[{"table": "ads", "count": 3498},
 {"table": "targets", "count": 1665},
 {"table": "ad_targets", "count": 36559},
 {"table": "ads_fts", "count": 3498},
 {"table": "ads_fts_segments", "count": 120},
 {"table": "ads_fts_segdir", "count": 1},
 {"table": "ads_fts_docsize", "count": 3498},
 {"table": "ads_fts_stat", "count": 1}]

By default, sqlite-utils outputs data as neatly formatted JSON. You can get CSV instead using the --csv option:

$ sqlite-utils tables russian-ads.db --counts --csv
table,count
ads,3498
targets,1665
ad_targets,36559
ads_fts,3498
ads_fts_segments,120
ads_fts_segdir,1
ads_fts_docsize,3498
ads_fts_stat,1

Or if you want a pretty ASCII-art table, use --table (or the shortcut, -t):

$ sqlite-utils tables russian-ads.db --counts -t
table               count
----------------  -------
ads                  3498
targets              1665
ad_targets          36559
ads_fts              3498
ads_fts_segments      120
ads_fts_segdir          1
ads_fts_docsize      3498
ads_fts_stat            1

The table view is built on top of tabulate, which offers dozens of table variations. Run sqlite-utils tables --help for the full list—try --table -fmt=rst for output that can be pasted directly into a reStructuredText document (handy for writing documentation).

So far we’ve just looked at a list of tables. Lets run a SQL query:

$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category"
[{"category": "accessing_facebook_on", "count(*)": 1},
 {"category": "age", "count(*)": 82},
 {"category": "and_must_also_match", "count(*)": 228},
 {"category": "army_reserve_industry", "count(*)": 3},
 {"category": "behaviors", "count(*)": 16},
 ...

Again, this can be output as CSV using --csv, or a table with --table.

The default JSON output is objects wrapped in an array. Use --arrays to get an array of arrays instead. More interestingly: --nl causes the data to be output as newline-delimited JSON, like this:

$ sqlite-utils russian-ads.db "select category, count(*) from targets group by category" --nl
{"category": "accessing_facebook_on", "count(*)": 1}
{"category": "age", "count(*)": 82}
{"category": "and_must_also_match", "count(*)": 228}
{"category": "army_reserve_industry", "count(*)": 3}
{"category": "behaviors", "count(*)": 16}
...

This is a really interesting format for piping to other tools.

Creating databases from JSON on the command-line

The sqlite-utils insert command can be used to create new tables by piping JSON or CSV directly into the tool. It’s the command-line equivalent of the .insert_all() Python function I demonstrated earlier.

Here’s how to create that meteorite database directly from the command-line:

$ curl "https://data.nasa.gov/resource/y77d-th95.json" | \
    sqlite-utils insert meteorites.db meteorites - --pk=id

This will use a SQLite database file called meteorites.db (creating one if it does not yet exist), create or use a table called meteorites and read the data from standard in (hence the pipe). You can pass a filename instead of a - here to read data from a file on disk.

The insert command accepts multiple formats—it defaults to expecting a JSON array of objects, but you can use --nl to accept newline-delimited JSON and --csv to accept CSV.

This means you can combine the tools! Let’s create a brand new database by exporting data from the old one, using newline-delimited JSON as the intermediary format:

$ sqlite-utils russian-ads.db \
    "select * from ads where text like '%veterans%'" --nl | \
    sqlite-utils insert veterans.db ads - --nl

This creates a new file called veterans.db containing an ads table with just the ads that mentioned veterans somewhere in their body text.

Since we’re working with JSON, we can introduce other command-line tools into the mix.

jq is a neat little tool for extracting data from a JSON file using its own mini domain-specific language.

The Nobel Prize API offers a JSON file listing all of the Nobel laureates—but they are contained as an array in a top level "laureates" key. sqlite-utils needs a flat array—so we can use jq to get exactly that:

$ curl "http://api.nobelprize.org/v1/laureate.json" | \
  jq ".laureates" | \
  sqlite-utils insert nobel.db laureates -

Now we have a file called nobel.db containing all of the Nobel laureates.

Since Datasette recently grew the ability to export newline-delimited JSON, we can also use this ability to directly consume data from Datasette. Lets grab every episode of the Joy of Painting in which Bob Ross painted a beach:

$ curl "https://fivethirtyeight.datasettes.com/fivethirtyeight-aa93d24/bob-ross%2Felements-by-episode.json?_facet=BEACH&BEACH=1&_shape=array&_nl=on" \
| sqlite-utils insert bob.db beach_episodes - --nl

Plenty more features

As with the Python API, the sqlite-utils CLI tool has dozens of other options and extensive documentation.

I’ve been really enjoying growing an ecosystem of tools around Datasette. sqlite-utils is the keystone here: it’s fundamental to other tools I’m building, such as db-to-sqlite (which can export any SQLAlchemy-supported database directly to a SQLite file on disk).

I’ve found myself increasingly turning to SQLite first for all kinds of ad-hoc analysis, and I’m excited to try out these new command-line abilities of sqlite-utils for real-world data spelunking tasks.

Exploring search relevance algorithms with SQLite two months ago

SQLite isn’t just a fast, high quality embedded database: it also incorporates a powerful full-text search engine in the form of the FTS4 and FTS5 extensions. You’ve probably used these a bunch of times already: many iOS, Android and desktop applications use SQLite under-the-hood and use it to implement their built-in search.

I’ve been using these capabilities for basic search in Datasette for over a year now, but I’ve recently started digging into some of their more advanced features. It turns out hacking around with SQLite is a great way to learn more about how fundamental information retrieval algorithms work under the hood.

Today I’m releasing sqlite-fts4—a Python package that provides a collection of custom SQL functions for working with SQLite’s FTS4 module. It includes some neat tools for introspecting how relevancy ranking algorithms actually work.

Why not just use FTS5?

If it’s available to you FTS5 is usually the best option: it has a good ranking algorithm built in. I described how to use it to build fast autocomplete search for your website for the 2018 24 ways advent calendar. You can join directly against a virtual table and order by a pre-calculated relevance score accessible through that table.

What makes FTS4 interesting is that it doesn’t include a scoring mechanism: it instead exposes raw statistical data to you in a way that lets you build your own ranking functions.

You probably don’t need to do this—unless you are stuck on an older SQLite version that doesn’t support the latest features. But… if you’re interested in understanding more about how search actually works, the need to implement a ranking function is an excellent learning learning opportunity.

I’ll be demonstrating these functions using a hosted Datasette instance running at datasette-sqlite-fts4.datasette.io (with the data from my 24 ways article). You can play with them out there, or if you want to use your own Datasette instance you can enable these custom SQL functions by pip installing my new datasette-sqlite-fts4 plugin.

Raw FTS4 matchinfo() data

When using FTS4, the only scoring help SQLite gives you is the bulit-in matchinfo() function. For each document in your search result set, this function will expose raw statistical data that can be used to calculate a score.

Let’s try it out using the following query:

select
    *, matchinfo(articles_fts, "pcx")
from
    articles_fts
where
    articles_fts match :search

Run matchinfo() in Datasette

The pcx here is called the format string—it lets SQLite know what information about the match you would like to see.

The results are returned as a binary string! For the first matching document, we get back the following:

\x02\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\xa3\x00\x00\x00\x1f\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\\\x00\x00\x00\x15\x00\x00\x00

SQLite’s C heritage is showing through here.

decode_matchinfo() to decode the binary

The first step in working with matchinfo is to decode that binary string. It’s actually a sequence of unsigned 32 bit integers. We can turn it into a Python list of integers using the following:

struct.unpack("I" * (len(matchinfo) // 4), matchinfo)

sqlite-fts4 exposes a SQL function called decode_matchinfo() which does exactly this. Let’s expand our example to use it:

select
    title, author,
    decode_matchinfo(matchinfo(articles_fts, "pcx")),
    matchinfo(articles_fts, "pcx")
from
    articles_fts
where
    articles_fts match :search

Run decode_matchinfo() in Datasette

The matchinfo for our first matching document now looks like this:

[2, 3, 0, 2, 2, 0, 0, 0, 1, 163, 31, 0, 2, 2, 0, 0, 0, 2, 92, 21]

Better, but still obscure. What does it mean?

The anwser lies in the SQLite matchinfo documentation. In our format string, we requested p, c and x:

  • p requests a single integer reprenting the number of search terms we are matching. Since our search query is jquery maps this is 2—it’s the first integer in the list.
  • c requests the number of searchable columns in our table. We created articles_fts with 3 columns, so it’s 3. That’s the second integer in the list.
  • x is much more interesting: it returns 3 integer values for each term/column combination. Since we have 2 terms and 3 columns that means we get back 6 * 3 = 18 integers. If you count the items in the array above you’ll see there are 18 left after you remove the first two. Each triple represents the number of times the term appears in the current column, the number of times it appears in this column across every row and the number of total documents that match the term in this column at least once.

Search relevancy scores are usually calculated against exactly this kind of collection of statistics: we rank based on how rare the matching terms are across the rest of the corpus.

annotate_matchinfo() to annotate the integers

Having a list of integers made things easier, but still not easy enough. That’s where annotate_matchinfo() comes in. This custom SQL function expands the matchinfo list of integers into a giant JSON object describing exactly what each of the results means.

We’ll try it out like this:

select
    title, author,
    decode_matchinfo(matchinfo(articles_fts, "pcx")),
    json_object("pre", annotate_matchinfo(matchinfo(articles_fts, "pcx"), "pcx"))
from
    articles_fts
where
    articles_fts match :search

Run annotate_matchinfo() in Datasette

Note that we have to provide the format string twice, so that annotate_matchinfo() knows the requested order of the binary matchinfo data.

This returns a JSON object that looks like this:

{
  "p": {
    "title": "Number of matchable phrases in the query",
    "value": 2,
    "idx": 0
  },
  "c": {
    "title": "Number of user defined columns in the FTS table",
    "value": 3,
    "idx": 1
  },
  "x": {
    "title": "Details for each phrase/column combination"
    "value": [
      ...
      {
        "phrase_index": 0,
        "column_index": 2,
        "hits_this_column_this_row": 1,
        "hits_this_column_all_rows": 163,
        "docs_with_hits": 31,
        "idxs": [8, 9, 10]
      },
      {
        "phrase_index": 1,
        "column_index": 0,
        "hits_this_column_this_row": 0,
        "hits_this_column_all_rows": 2,
        "docs_with_hits": 2,
        "idxs": [11, 12, 13]
      }...
    ],
  }
}

Try it out with pcxnalyb to see the complete set of format string options.

You may be wondering why I wrapped that function call in json_object("pre", ...). This is a Datasette trick: I recently added the ability to pretty-print JSON to my datasette-html-json plugin—see that package’s README for details.

Building ranking functions

These statistics are everything we need to calculate relevance scores. sqlite-fts4 implements two such functions: rank_score() is a simple TF/IDF function. rank_bm25() is much more interesting—it’s an implementation of the Okapi BM25, inspired by the one that ships with the peewee ORM.

Let’s try them both out:

select
    title, author,
    rank_score(matchinfo(articles_fts, "pcx")) as score,
    rank_bm25(matchinfo(articles_fts, "pcnalx")) as bm25,
    json_object("pre", annotate_matchinfo(matchinfo(articles_fts, "pcxnalyb"), "pcxnalyb"))
from
    articles_fts
where
    articles_fts match :search
order by bm25

Try rank_score() and rank_bm25() in Datasette

You can switch the order by clause between bm25 and score to compare the two.

bm25() is definitely a better option. It’s the default algorithm used these days by Elasticsearch, and they wrote up an excellent explanation of how it works on their blog.

Take a look at the source code for the ranking functions to see how they are implemented. They work against the data structure returned by annotate_matchinfo() to try and make it clear what is going on.

Building the rank_bm25() function took me longer than I expected: I was comparing my results against bm25() from peewee to ensure I was getting them right, but I couldn’t get them to match. After some furious debugging I finally figured out the problem: peewee had a rare bug! I reported it to Charles Leifer and he analyzed it and turned around a fix in a matter of hours—it turns out the C library that peewee had ported to Python had the same problem.

Next steps

I’m really impressed with the flexibility that FTS4 provides—it turns out FTS5 isn’t the only worthwhile option for search in SQLite

I’m thinking about ways to expose some of the bm25 tuning parameters (in particular the magic B and K1 constants explained by the Elasticsearch article) and I plan to upgrade Datasette’s search functionality to make ranking available as a first-class feature on the searchable table view.

I’m also generally excited about SQLite as a learning tool for exploring different search ranking mechanisms. Once you’ve decoded that binary matchinfo string it’s impressive how much you can get done with the underlying data.

Zeit 2.0, and building smaller Python Docker images four months ago

Changes are afoot at Zeit Now, my preferred hosting provider for the past year (see previous posts). They have announced Now 2.0, an intriguing new approach to providing auto-scaling immutable deployments. It’s built on top of lambdas, and comes with a whole host of new constraints: code needs to fit into a 5MB bundle for example (though it looks like this restriction will soon be relaxed a littleupdate November 19th you can now bump this up to 50MB).

Unfortunately, they have also announced their intent to deprecate the existing Now v1 Docker-based solution.

“We will only start thinking about deprecation plans once we are able to accommodate the most common and critical use cases of v1 on v2”—Matheus Fernandes

“When we reach feature parity, we still intend to give customers plenty of time to upgrade (we are thinking at the very least 6 months from the time we announce it)”—Guillermo Rauch

This is pretty disastrous news for many of my projects, most crucially Datasette and Datasette Publish.

Datasette should be fine—it supports Heroku as an alternative to Zeit Now out of the box, and the publish_subcommand plugin hook makes it easy to add further providers (I’m exploring several new options at the moment).

Datasette Publish is a bigger problem. The whole point of that project is to make it easy for less-technical users to deploy their data as an interactive API to a Zeit Now account that they own themselves. Talking these users through what they need to do to upgrade should v1 be shut down in the future is not an exciting prospect.

So I’m going to start hunting for an alternative backend for Datasette Publish, but in the meantime I’ve had to make some changes to how it works in order to handle a new size limit of 100MB for Docker images deployed by free users.

Building smaller Docker images

Zeit appear to have introduced a new limit for free users of their Now v1 platform: Docker images need to be no larger than 100MB.

Datasette Publish was creating final image sizes of around 350MB, blowing way past that limit. I spent some time today figuring out how to get it to produce images within the new limit, and learned a lot about Docker image optimization in the process.

I ended up using Docker’s multi-stage build feature, which allows you to create temporary images during a build, use them to compile dependencies, then copy just the compiled assets into the final image.

An example of the previous Datasette Publish generated Dockerfile can be seen here. Here’s a rough outline of what it does:

  • Start with the python:3.6-slim-stretch image
  • apt-installs python3-dev and gcc so it can compile Python libraries with binary dependencies (pandas and uvloop for example)
  • Use pip to install csvs-to-sqlite and datasette
  • Add the uploaded CSV files, then run csvs-to-sqlite to convert them into a SQLite database
  • Run datasette inspect to cache a JSON file with information about the different tables
  • Run datasette serve to serve the resulting web application

There’s a lot of scope for improvement here. The final image has all sorts of cruft that’s not actually needed for serving the image: it has csvs-to-sqlite and all of its dependencies, plus the original uploaded CSV files.

Here’s the workflow I used to build a Dockerfile and check the size of the resulting image. My work-in-progress can be found in the datasette-small repo.

# Build the Dockerfile in the current directory and tag as datasette-small
$ docker build . -t datasette-small
# Inspect the size of the resulting image
$ docker images | grep datasette-small
# Start the container running
$ docker run -d -p 8006:8006 datasette-small
654d3fc4d3343c6b73414c6fb4b2933afc56fbba1f282dde9f515ac6cdbc5339
# Now visit http://localhost:8006/ to see it running

Alpine Linux

When you start looking for ways to build smaller Dockerfiles, the first thing you will encounter is Alpine Linux. Alpine is a Linux distribution that’s perfect for containers: it builds on top of BusyBox to strip down to the smallest possible image that can still do useful things.

The python:3.6-alpine container should be perfect: it gives you the smallest possible container that can run Python 3.6 applications (including the ability to pip install additional dependencies).

There’s just one problem: in order to install C-based dependencies like pandas (used by csvs-to-sqlite) and Sanic (used by Datasette) you need a compiler toolchain. Alpine doesn’t have this out-of-the-box, but you can install one using Alpine’s apk package manager. Of course, now you’re bloating your container with a bunch of compilation tools that you don’t need to serve the final image.

This is what makes multi-stage builds so useful! We can spin up an Alpine image with the compilers installed, build our modules, then copy the resulting binary blobs into a fresh container.

Here’s the basic recipe for doing that:

FROM python:3.6-alpine as builder

# Install and compile Datasette + its dependencies
RUN apk add --no-cache gcc python3-dev musl-dev alpine-sdk
RUN pip install datasette

# Now build a fresh container, copying across the compiled pieces
FROM python:3.6-alpine

COPY --from=builder /usr/local/lib/python3.6 /usr/local/lib/python3.6
COPY --from=builder /usr/local/bin/datasette /usr/local/bin/datasette

This pattern works really well, and produces delightfully slim images. My first attempt at this wasn’t quite slim enough to fit the 100MB limit though, so I had to break out some Docker tools to figure out exactly what was going on.

Inspecting docker image layers

Part of the magic of Docker is the concept of layers. When Docker builds a container it uses a layered filesystem (UnionFS) and creates a new layer for every executable line in the Dockerfile. This dramatically speeds up future builds (since layers can be reused if they have already been built) and also provides a powerful tool for inspecting different stages of the build.

When you run docker build part of the output is IDs of the different image layers as they are constructed:

datasette-small $ docker build . -t datasette-small
Sending build context to Docker daemon  2.023MB
Step 1/21 : FROM python:3.6-slim-stretch as csvbuilder
 ---> 971a5d5dad01
Step 2/21 : RUN apt-get update && apt-get install -y python3-dev gcc wget
 ---> Running in f81485df62dd

Given a layer ID, like 971a5d5dad01, it’s possible to spin up a new container that exposes the exact state of that layer (thanks, Stack Overflow). Here’s how do to that:

docker run -it --rm 971a5d5dad01 sh

The -it argument attaches standard input to the container (-i) and allocates a pseudo-TTY (-t). The -rm option means that the container will be removed when you Ctrl+D back out of it. sh is the command we want to run in the container—using a shell lets us start interacting with it.

Now that we have a shell against that layer, we can use regular unix commands to start exploring it. du -m (m for MB) is particularly useful here, as it will show us the largest directories in the filesystem. I pipe it through sort like so:

$ docker run -it --rm abc63755616b sh
# du -m | sort -n
...
58  ./usr/local/lib/python3.6
70  ./usr/local/lib
71  ./usr/local
76  ./usr/lib/python3.5
188 ./usr/lib
306 ./usr
350 .

Straight away we can start seeing where the space is being taken up in our image.

Deleting unnecessary files

I spent quite a while inspecting different stages of my builds to try and figure out where the space was going. The alpine copy recipe worked neatly, but I was still a little over the limit. When I started to dig around in my final image I spotted some interesting patterns—in particular, the /usr/local/lib/python3.6/site-packages/uvloop directory was 17MB!

# du -m /usr/local | sort -n -r | head -n 5
96  /usr/local
95  /usr/local/lib
83  /usr/local/lib/python3.6
36  /usr/local/lib/python3.6/site-packages
17  /usr/local/lib/python3.6/site-packages/uvloop

That seems like a lot of disk space for a compiled C module, so I dug in further…

It turned out the uvloop folder still contained a bunch of files that were used as part of the compilation, including a 6.7MB loop.c file and a bunch of .pxd and .pyd files that are compiled by Cython. None of these files are needed after the extension has been compiled, but they were there, taking up a bunch of precious space.

So I added the following to my Dockerfile:

RUN find /usr/local/lib/python3.6 -name '*.c' -delete
RUN find /usr/local/lib/python3.6 -name '*.pxd' -delete
RUN find /usr/local/lib/python3.6 -name '*.pyd' -delete

Then I noticed that there were __pycache__ files that weren’t needed either, so I added this as well:

RUN find /usr/local/lib/python3.6 -name '__pycache__' | xargs rm -r

(The -delete flag didn’t work correctly for that one, so I used xargs instead.)

This shaved off around 15MB, putting me safely under the limit.

Running csvs-to-sqlite in its own stage

The above tricks had got me the smallest Alpine Linux image I could create that would still run Datasette… but Datasette Publish also needs to run csvs-to-sqlite in order to convert the user’s uploaded CSV files to SQLite.

csvs-to-sqlite has some pretty heavy dependencies of its own in the form of Pandas and NumPy. Even with the build chain installed I was having trouble installing these under Alpine, especially since building numpy for Alpine is notoriously slow.

Then I realized that thanks to multi-stage builds there’s no need for me to use Alpine at all for this step. I switched back to python:3.6-slim-stretch and used it to install csvs-to-sqlite and compile the CSV files into a SQLite database. I also ran datasette inspect there for good measure.

Then in my final Alpine container I could use the following to copy in just those compiled assets:

COPY --from=csvbuilder inspect-data.json inspect-data.json
COPY --from=csvbuilder data.db data.db

Tying it all together

Here’s an example of a full Dockerfile generated by Datasette Publish that combines all of these tricks. To summarize, here’s what it does:

  • Spin up a python:3.6-slim-stretch—call it csvbuilder
    • apt-get install -y python3-dev gcc so we can install compiled dependencies
    • pip install csvs-to-sqlite datasette
    • Copy in the uploaded CSV files
    • Run csvs-to-sqlite to convert them into a SQLite database
    • Run datasette inspect data.db to generate an inspect-data.json file with statistics about the tables. This can later be used to reduce startup time for datasette serve.
  • Spin up a python:3.6-alpine—call it buildit
    • We need a build chain to compile a copy of datasette for Alpine Linux…
    • apk add --no-cache gcc python3-dev musl-dev alpine-sdk
    • Now we can pip install datasette, plus any requested plugins
    • Reduce the final image size by deleting any __pycache__ or *.c, *.pyd and *.pxd files.
  • Spin up a fresh python:3.6-alpine for our final image
    • Copy in data.db and inspect-data.json from csvbuilder
    • Copy across /usr/local/lib/python3.6 and /usr/local/bin/datasette from bulidit
    • … and we’re done! Expose port 8006 and set datasette serve to run when the container is started

Now that I’ve finally learned how to take advantage of multi-stage builds I expect I’ll be using them for all sorts of interesting things in the future.

Automatically playing science communication games with transfer learning and fastai four months ago

This weekend was the 9th annual Science Hack Day San Francisco, which was also the 100th Science Hack Day held worldwide.

Natalie and I decided to combine our interests and build something fun.

I’m currently enrolled in Jeremy Howard’s Deep Learning course so I figured this was a great opportunity to try out some computer vision.

Natalie runs the SciComm Games calendar and accompanying @SciCommGames bot to promote and catalogue science communication hashtag games on Twitter.

Hashtag games? Natalie explains them here—essentially they are games run by scientists on Twitter to foster public engagement around an animal or topic by challenging people to identify if a photo is a #cougarOrNot or participate in a #TrickyBirdID or identify #CrowOrNo or many others.

Combining the two… we decided to build a bot that automatically plays these games using computer vision. So far it’s just trying #cougarOrNot—you can see the bot in action at @critter_vision.

Training data from iNaturalist

In order to build a machine learning model, you need to start out with some training data.

I’m a big fan of iNaturalist, a citizen science project that encourages users to upload photographs of wildlife (and plants) they have seen and have their observations verified by a community. Natalie and I used it to build owlsnearme.com earlier this year—the API in particular is fantastic.

iNaturalist has over 5,000 verified sightings of felines (cougars, bobcats, domestic cats and more) in the USA.

The raw data is available as a paginated JSON API. The medium sized photos are just the right size for training a neural network.

I started by grabbing 5,000 images and saving them to disk with a filename that reflected their identified species:

Bobcat_9005106.jpg
Domestic-Cat_10068710.jpg
Bobcat_15713672.jpg
Domestic-Cat_6755280.jpg
Mountain-Lion_9075705.jpg

Building a model

I’m only one week into the fast.ai course so this really isn’t particularly sophisticated yet, but it was just about good enough to power our hack.

The main technique we are learning in the course is called transfer learning, and it really is shockingly effective. Instead of training a model from scratch you start out with a pre-trained model and use some extra labelled images to train a small number of extra layers.

The initial model we are using is ResNet-34, a 34-layer neural network trained on 1,000 labelled categories in the ImageNet corpus.

In class, we learned to use this technique to get 94% accuracy against the Oxford-IIIT Pet Dataset—around 7,000 images covering 12 cat breeds and 25 dog breeds. In 2012 the researchers at Oxford were able to get 59.21% using a sophisticated model—it 2018 we can get 94% with transfer learning and just a few lines of code.

I started with an example provided in class, which loads and trains images from files on disk using a regular expression that extracts the labels from the filenames.

My full Jupyter notebook is inaturalist-cats.ipynb—the key training code is as follows:

from fastai import *
from fastai.vision import *
cat_images_path = Path('/home/jupyter/.fastai/data/inaturalist-usa-cats/images')
cat_fnames = get_image_files(cat_images_path)
cat_data = ImageDataBunch.from_name_re(
    cat_images_path,
    cat_fnames,
    r'/([^/]+)_\d+.jpg$',
    ds_tfms=get_transforms(),
    size=224
)
cat_data.normalize(imagenet_stats)
cat_learn = ConvLearner(cat_data, models.resnet34, metrics=error_rate)
cat_learn.fit_one_cycle(4)
# Save the generated model to disk
cat_learn.save("usa-inaturalist-cats")

Calling cat_learn.save("usa-inaturalist-cats") created an 84MB file on disk at /home/jupyter/.fastai/data/inaturalist-usa-cats/images/models/usa-inaturalist-cats.pth—I used scp to copy that model down to my laptop.

This model gave me a 24% error rate which is pretty terrible—others on the course have been getting error rates less than 10% for all kinds of interesting problems. My focus was to get a model deployed as an API though so I haven’t spent any additional time fine-tuning things yet.

Deploying the model as an API

The fastai library strongly encourages training against a GPU, using pytorch and PyCUDA. I’ve been using n1-highmem-8 Google Cloud Platform instance with an attached Tesla P4, then running everything in a Jupyter notebook there. This costs around $0.38 an hour—fine for a few hours of training, but way too expensive to permanently host a model.

Thankfully, while a GPU is essential for productively training models it’s not nearly as important for evaluating them against new data. pytorch can run in CPU mode for that just fine on standard hardware, and the fastai README includes instructions on installing it for a CPU using pip.

I started out by ensuring I could execute my generated model on my own laptop (since pytorch doesn’t yet work with the GPU built into the Macbook Pro). Once I had that working, I used the resulting code to write a tiny Starlette-powered API server. The code for that can be found in in cougar.py.

fastai is under very heavy development and the latest version doesn’t quite have a clean way of loading a model from disk without also including the initial training images, so I had to hack around quite a bit to get this working using clues from the fastai forums. I expect this to get much easier over the next few weeks as the library continues to evolve based on feedback from the current course.

To deploy the API I wrote a Dockerfile and shipped it to Zeit Now. Now remains my go-to choice for this kind of project, though unfortunately their new (and brilliant) v2 platform imposes a 100MB image size limit—not nearly enough when the model file itself weights in at 83 MB. Thankfully it’s still possible to specify their v1 cloud which is more forgiving for larger applications.

Here’s the result: an API which can accept either the URL to an image or an uploaded image file: https://cougar-or-not.now.sh/—try it out with a cougar and a bobcat.

The Twitter Bot

Natalie built the Twitter bot. It runs as a scheduled task on Heroku and works by checking for new #cougarOrNot tweets from Dr. Michelle LaRue, extracting any images, passing them to my API and replying with a tweet that summarizes the results. Take a look at its recent replies to get a feel for how well it is doing.

Amusingly, Dr. LaRue frequently tweets memes to promote upcoming competitions and marks them with the same hashtag. The bot appears to think that most of the memes are bobcats! I should definitely spend some time tuning that model.

Science Hack Day was great fun. A big thanks to the organizing team, and congrats to all of the other participants. I’m really looking forward to the next one.

Plus… we won a medal!

Elsewhere

18th March 2019

  • VisiData (via) Intriguing tool by Saul Pwanson: VisiData is a command-line “textpunk utility” for browsing and manipulating tabular data. “pip3 install visidata” and then “vd myfile.csv” (or .json or .xls or SQLite orothers) and get an interactive terminal UI for quickly searching through the data, conducting frequency analysis of columns, manipulating it and much more besides. Two tips for if you start playing with it: hit “gq” to exit, and hit “Ctrl+H” to view the help screen. #

17th March 2019

  • The Cloud and Open Source Powder Keg (via) Stephen O’Grady’s analysis of the Elastic v.s. AWS situation, where Elastic started mixing their open source and non-open source code together and Amazon responded by releasing their own forked “open distribution for Elasticsearch”. World War One analogies included! #
  • What the Hell is Going On? (via) David Perell discusses how the shift from information scarcity to information abundance is reshaping commerce, education, and politics. Long but worthwhile. #
  • Don’t worry about people stealing your ideas. If your ideas are any good, you’ll have to ram them down people’s throats.

    Howard H. Aiken #

15th March 2019

  • Client-side instrumentation for under $1 per month. No servers necessary. (via) Rolling your own analytics used to be too complex and expensive to be worth the effort. Thanks to cloud technologies like Cloudfront, Athena, S3 and Lambda you can now inexpensively implement client-side analytics (via requests to a tracking pixel) that stores detailed logs on S3, then use Amazon Athena to run queries against those logs ($5/TB scanned) to get detailed reporting. This post also introduced me to Snowplow, an open source JavaScript analytics script (released by a commercial analytics platform) which looks very neat—it’s based on piwik.js, the tracker from the open-source Piwik analytics tool. #

10th March 2019

  • D3 Projection Comparison (via) Fun Observable notebook that lets you compare any two out of D3’s 96 (!) geographical projections of the world. #

9th March 2019

  • datasette-jellyfish. I learned about a handy Python library called Jellyfish which implements approximate and phonetic matching of strings—soundex, metaphone, porter stemming, levenshtein distance and more. I’ve built a simple Datasette plugin which wraps the library and makes each of those algorithms available as a SQL function. #
  • Publish the data behind your stories with SQLite and Datasette. I presented a workshop on Datasette at the IRE and NICAR CAR 2019 data journalism conference yesterday. Here’s the worksheet I prepared for the tutorial. #

4th March 2019

  • MySQL: How to get the top N rows for each group. MySQL doesn’t support the row_number() window function that’s available in PostgreSQL (and recent SQLite), which means it can’t easily answer questions like “for each of these authors, give me the most recent three blog entries they have written” in a single query. Only it turns out it can, if you abuse MySQL session variables in a devious way. This isn’t a new feature: MySQL has had this for over a decade, and in my rough testing it works quickly even on tables with millions of rows. #
  • List of Physical Visualizations (via) “A chronological list of physical visualizations and related artifacts, maintained by Pierre Dragicevic and Yvonne Jansen”—327 and counting! #

26th February 2019

25th February 2019

  • Metrics are lossily compressed logs. Traces are logs with parent child relationships between entries. The only reason we have three terms is because getting value from them has required different compromises to make them cost effective.

    Clint Sharp #

  • huey. Charles Leifer’s “little task queue for Python”. Similar to Celery, but it’s designed to work with Redis, SQLite or in the parent process using background greenlets. Worth checking out for the really neat design. The project is new to me, but it’s been under active development since 2011 and has a very healthy looking rate of releases. #
  • My Twitter thread collecting behind the scenes content about Spider-Man: Into the Spider-Verse. I absolutely loved Spider-Verse, and I’ve been delighted to discover that many of the artists who created the movie are active on Twitter and have been posting all kinds of fascinating material about their creative process. I’ve been collecting examples in this Twitter thread for a couple of months now. They definitely deserved that Oscar. #
  • In January, Facebook distributes a policy update stating that moderators should take into account recent romantic upheaval when evaluating posts that express hatred toward a gender. “I hate all men” has always violated the policy. But “I just broke up with my boyfriend, and I hate all men” no longer does.

    Casey Newton #

22nd February 2019

  • Seeking the Productive Life: Some Details of My Personal Infrastructure (via) Stephen Wolfram’s 15,000 word epic about his personal approach to productivity, developed over the past thirty years. This is a fascinating document—I found myself thinking “surely there can’t be more information than this” and then spotting that the scrollbar wasn’t even a third done yet. Very hard to summarize: it turns out if you’re the work-from-home CEO of your own privately held 800 person company you can construct some very opinionated habits. #
  • String length - Rosetta Code (via) Calculating the length of a string is surprisingly difficult once Unicode is involved. Here’s a fascinating illustration of how that problem can be attached dozens of different programming languages. From that page: the string “J̲o̲s̲é̲” (“J\x{332}o\x{332}s\x{332}e\x{301}\x{332}”) has 4 user-visible graphemes, 9 characters (code points), and 14 bytes when encoded in UTF-8. #

19th February 2019

  • Lessons from 6 software rewrite stories (via) Herb Caudill takes on the classic idea that rewriting from scratch is “the single worst strategic mistake that any software company can make” and investigates it through the lens of six well-chosen examples: Netscape 6, Basecamp Classic/2/3, Visual Studio/VS Code, Gmail/Inbox, FogBugz/Wasabi/Trello, and finally FreshBooks/BillSpring. Each story has details I had never heard before, and the lessons and conclusions are deeply insightful. #
  • parameterized. I love the @parametrize decorator in pytest, which lets you run the same test multiple times against multiple parameters. The only catch is that the decorator in pytest doesn’t work for old-style unittest TestCase tests, which means you can’t easily add it to test suites that were built using the older model. I just found out about parameterized which works with unittest tests whether or not you are running them using the pytest test runner. #
  • The Eleven Laws of Showrunning (via) Fascinating essay on how to run a modern TV show by Javier Grillo-Marxuach. Being a showrunner basically involves running a 100+ person startup with a 7 digit budget, almost immovable deadlines, high maintenance activist investors and you’re still expected to write some of the scripts! So many useful lessons here about management, creativity and delegation: almost everything in here is relevant to product management, startup founding and engineering management as well. #