Simon Willison’s Weblog

On zeitnow, security, django, webperformance, observable, ...

 

Recent entries

Exploring search relevance algorithms with SQLite one month ago

SQLite isn’t just a fast, high quality embedded database: it also incorporates a powerful full-text search engine in the form of the FTS4 and FTS5 extensions. You’ve probably used these a bunch of times already: many iOS, Android and desktop applications use SQLite under-the-hood and use it to implement their built-in search.

I’ve been using these capabilities for basic search in Datasette for over a year now, but I’ve recently started digging into some of their more advanced features. It turns out hacking around with SQLite is a great way to learn more about how fundamental information retrieval algorithms work under the hood.

Today I’m releasing sqlite-fts4—a Python package that provides a collection of custom SQL functions for working with SQLite’s FTS4 module. It includes some neat tools for introspecting how relevancy ranking algorithms actually work.

Why not just use FTS5?

If it’s available to you FTS5 is usually the best option: it has a good ranking algorithm built in. I described how to use it to build fast autocomplete search for your website for the 2018 24 ways advent calendar. You can join directly against a virtual table and order by a pre-calculated relevance score accessible through that table.
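
Here's roughly what that looks like from Python; a minimal sketch that assumes your SQLite build includes FTS5 (the table and rows are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table with two searchable columns (requires an FTS5-enabled build)
conn.execute("CREATE VIRTUAL TABLE articles_fts USING fts5(title, body)")
conn.executemany(
    "INSERT INTO articles_fts (title, body) VALUES (?, ?)",
    [
        ("Fast autocomplete search", "Building search with SQLite FTS5 and its bm25 ranking"),
        ("Exploring FTS4", "FTS4 exposes raw search statistics instead of a relevance score"),
    ],
)
# FTS5 exposes a pre-calculated relevance score through the hidden rank column
for title, rank in conn.execute(
    "SELECT title, rank FROM articles_fts WHERE articles_fts MATCH ? ORDER BY rank",
    ("search",),
):
    print(title, rank)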

What makes FTS4 interesting is that it doesn’t include a scoring mechanism: it instead exposes raw statistical data to you in a way that lets you build your own ranking functions.

You probably don’t need to do this—unless you are stuck on an older SQLite version that doesn’t support the latest features. But… if you’re interested in understanding more about how search actually works, the need to implement a ranking function is an excellent learning opportunity.

I’ll be demonstrating these functions using a hosted Datasette instance running at datasette-sqlite-fts4.datasette.io (with the data from my 24 ways article). You can try them out there, or if you want to use your own Datasette instance you can enable these custom SQL functions by pip installing my new datasette-sqlite-fts4 plugin.

Raw FTS4 matchinfo() data

When using FTS4, the only scoring help SQLite gives you is the built-in matchinfo() function. For each document in your search result set, this function will expose raw statistical data that can be used to calculate a score.

Let’s try it out using the following query:

select
    *, matchinfo(articles_fts, "pcx")
from
    articles_fts
where
    articles_fts match :search

Run matchinfo() in Datasette

The pcx here is called the format string—it lets SQLite know what information about the match you would like to see.

The results are returned as a binary string! For the first matching document, we get back the following:

\x02\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\xa3\x00\x00\x00\x1f\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\\\x00\x00\x00\x15\x00\x00\x00

SQLite’s C heritage is showing through here.

decode_matchinfo() to decode the binary

The first step in working with matchinfo is to decode that binary string. It’s actually a sequence of unsigned 32 bit integers. We can turn it into a Python list of integers using the following:

struct.unpack("I" * (len(matchinfo) // 4), matchinfo)
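
If you want to experiment with this outside of the plugin, here's a minimal sketch of registering such a function yourself using Python's sqlite3 module (my own illustration, not the package's exact code; it assumes a database containing an articles_fts FTS4 table like the demo above):

import json
import sqlite3
import struct

def decode_matchinfo(buf):
    # matchinfo() returns a binary blob of unsigned 32 bit integers
    return json.dumps(struct.unpack("I" * (len(buf) // 4), buf))

conn = sqlite3.connect("24ways.db")  # placeholder filename
# Register it as a one-argument SQL function so it can wrap matchinfo() in queries
conn.create_function("decode_matchinfo", 1, decode_matchinfo)
rows = conn.execute(
    "select decode_matchinfo(matchinfo(articles_fts, 'pcx')) "
    "from articles_fts where articles_fts match ?",
    ("jquery maps",),
).fetchall()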

sqlite-fts4 exposes a SQL function called decode_matchinfo() which does exactly this. Let’s expand our example to use it:

select
    title, author,
    decode_matchinfo(matchinfo(articles_fts, "pcx")),
    matchinfo(articles_fts, "pcx")
from
    articles_fts
where
    articles_fts match :search

Run decode_matchinfo() in Datasette

The matchinfo for our first matching document now looks like this:

[2, 3, 0, 2, 2, 0, 0, 0, 1, 163, 31, 0, 2, 2, 0, 0, 0, 2, 92, 21]

Better, but still obscure. What does it mean?

The answer lies in the SQLite matchinfo documentation. In our format string, we requested p, c and x:

  • p requests a single integer representing the number of search terms we are matching. Since our search query is jquery maps this is 2—it’s the first integer in the list.
  • c requests the number of searchable columns in our table. We created articles_fts with 3 columns, so it’s 3. That’s the second integer in the list.
  • x is much more interesting: it returns 3 integer values for each term/column combination. Since we have 2 terms and 3 columns that means we get back 2 * 3 = 6 combinations, or 18 integers in total. If you count the items in the array above you’ll see there are 18 left after you remove the first two. Each triple represents the number of times the term appears in the current column, the number of times it appears in this column across every row and the number of total documents that match the term in this column at least once.

Search relevancy scores are usually calculated against exactly this kind of collection of statistics: we rank based on how rare the matching terms are across the rest of the corpus.
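
To make that layout concrete, here's a short sketch that slices the decoded list above into its p, c and x components (my own illustration of the layout, not code from sqlite-fts4):

# Decoded matchinfo for the "pcx" format string, as shown above
matchinfo = [2, 3, 0, 2, 2, 0, 0, 0, 1, 163, 31, 0, 2, 2, 0, 0, 0, 2, 92, 21]

phrase_count = matchinfo[0]   # p: number of matchable phrases in the query
column_count = matchinfo[1]   # c: number of columns in the FTS table
x_values = matchinfo[2:]      # x: 3 integers per phrase/column combination

for phrase in range(phrase_count):
    for column in range(column_count):
        offset = 3 * (phrase * column_count + column)
        hits_this_row, hits_all_rows, docs_with_hits = x_values[offset:offset + 3]
        print(phrase, column, hits_this_row, hits_all_rows, docs_with_hits)

Running this against the list above prints, for example, 1 163 31 for phrase 0 in column 2: the phrase appears once in that column of this row, 163 times in that column across all rows, and in 31 distinct documents.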

annotate_matchinfo() to annotate the integers

Having a list of integers made things easier, but still not easy enough. That’s where annotate_matchinfo() comes in. This custom SQL function expands the matchinfo list of integers into a giant JSON object describing exactly what each of the results means.

We’ll try it out like this:

select
    title, author,
    decode_matchinfo(matchinfo(articles_fts, "pcx")),
    json_object("pre", annotate_matchinfo(matchinfo(articles_fts, "pcx"), "pcx"))
from
    articles_fts
where
    articles_fts match :search

Run annotate_matchinfo() in Datasette

Note that we have to provide the format string twice, so that annotate_matchinfo() knows the requested order of the binary matchinfo data.

This returns a JSON object that looks like this:

{
  "p": {
    "title": "Number of matchable phrases in the query",
    "value": 2,
    "idx": 0
  },
  "c": {
    "title": "Number of user defined columns in the FTS table",
    "value": 3,
    "idx": 1
  },
  "x": {
    "title": "Details for each phrase/column combination",
    "value": [
      ...
      {
        "phrase_index": 0,
        "column_index": 2,
        "hits_this_column_this_row": 1,
        "hits_this_column_all_rows": 163,
        "docs_with_hits": 31,
        "idxs": [8, 9, 10]
      },
      {
        "phrase_index": 1,
        "column_index": 0,
        "hits_this_column_this_row": 0,
        "hits_this_column_all_rows": 2,
        "docs_with_hits": 2,
        "idxs": [11, 12, 13]
      }...
    ]
  }
}

Try it out with pcxnalyb to see the complete set of format string options.

You may be wondering why I wrapped that function call in json_object("pre", ...). This is a Datasette trick: I recently added the ability to pretty-print JSON to my datasette-html-json plugin—see that package’s README for details.

Building ranking functions

These statistics are everything we need to calculate relevance scores. sqlite-fts4 implements two such functions: rank_score() is a simple TF/IDF function, while rank_bm25() is much more interesting—it’s an implementation of the Okapi BM25 ranking algorithm, inspired by the one that ships with the peewee ORM.

Let’s try them both out:

select
    title, author,
    rank_score(matchinfo(articles_fts, "pcx")) as score,
    rank_bm25(matchinfo(articles_fts, "pcnalx")) as bm25,
    json_object("pre", annotate_matchinfo(matchinfo(articles_fts, "pcxnalyb"), "pcxnalyb"))
from
    articles_fts
where
    articles_fts match :search
order by bm25

Try rank_score() and rank_bm25() in Datasette

You can switch the order by clause between bm25 and score to compare the two.

bm25() is definitely a better option. It’s the default algorithm used these days by Elasticsearch, and they wrote up an excellent explanation of how it works on their blog.

Take a look at the source code for the ranking functions to see how they are implemented. They work against the data structure returned by annotate_matchinfo() to try and make it clear what is going on.
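
To give a flavour of the approach, here's a deliberately simplified TF/IDF-style scorer over those same statistics; a sketch of the general idea, not the actual code in sqlite-fts4:

import math

def simple_rank(phrase_count, column_count, x_values, total_docs):
    # phrase_count, column_count and x_values are the decoded p, c and x data;
    # total_docs is the table's row count, available via the "n" format character
    score = 0.0
    for phrase in range(phrase_count):
        for column in range(column_count):
            offset = 3 * (phrase * column_count + column)
            hits_this_row, _hits_all_rows, docs_with_hits = x_values[offset:offset + 3]
            if hits_this_row and docs_with_hits:
                # Phrases that appear in fewer documents contribute more to the score
                score += hits_this_row * math.log(total_docs / docs_with_hits)
    return -score  # negative, so ordering by the score ascending puts the best match first

A real BM25 implementation also needs the total row count, average column length and the current row's column length, which is why rank_bm25() in the query above requests the longer pcnalx format string.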

Building the rank_bm25() function took me longer than I expected: I was comparing my results against bm25() from peewee to ensure I was getting them right, but I couldn’t get them to match. After some furious debugging I finally figured out the problem: peewee had a rare bug! I reported it to Charles Leifer and he analyzed it and turned around a fix in a matter of hours—it turns out the C library that peewee had ported to Python had the same problem.

Next steps

I’m really impressed with the flexibility that FTS4 provides—it turns out FTS5 isn’t the only worthwhile option for search in SQLite.

I’m thinking about ways to expose some of the bm25 tuning parameters (in particular the magic B and K1 constants explained by the Elasticsearch article) and I plan to upgrade Datasette’s search functionality to make ranking available as a first-class feature on the searchable table view.

I’m also generally excited about SQLite as a learning tool for exploring different search ranking mechanisms. Once you’ve decoded that binary matchinfo string it’s impressive how much you can get done with the underlying data.

Zeit 2.0, and building smaller Python Docker images three months ago

Changes are afoot at Zeit Now, my preferred hosting provider for the past year (see previous posts). They have announced Now 2.0, an intriguing new approach to providing auto-scaling immutable deployments. It’s built on top of lambdas, and comes with a whole host of new constraints: code needs to fit into a 5MB bundle for example (though it looks like this restriction will soon be relaxed a little; update November 19th: you can now bump this up to 50MB).

Unfortunately, they have also announced their intent to deprecate the existing Now v1 Docker-based solution.

“We will only start thinking about deprecation plans once we are able to accommodate the most common and critical use cases of v1 on v2”—Matheus Fernandes

“When we reach feature parity, we still intend to give customers plenty of time to upgrade (we are thinking at the very least 6 months from the time we announce it)”—Guillermo Rauch

This is pretty disastrous news for many of my projects, most crucially Datasette and Datasette Publish.

Datasette should be fine—it supports Heroku as an alternative to Zeit Now out of the box, and the publish_subcommand plugin hook makes it easy to add further providers (I’m exploring several new options at the moment).

Datasette Publish is a bigger problem. The whole point of that project is to make it easy for less-technical users to deploy their data as an interactive API to a Zeit Now account that they own themselves. Talking these users through what they need to do to upgrade should v1 be shut down in the future is not an exciting prospect.

So I’m going to start hunting for an alternative backend for Datasette Publish, but in the meantime I’ve had to make some changes to how it works in order to handle a new size limit of 100MB for Docker images deployed by free users.

Building smaller Docker images

Zeit appear to have introduced a new limit for free users of their Now v1 platform: Docker images need to be no larger than 100MB.

Datasette Publish was creating final image sizes of around 350MB, blowing way past that limit. I spent some time today figuring out how to get it to produce images within the new limit, and learned a lot about Docker image optimization in the process.

I ended up using Docker’s multi-stage build feature, which allows you to create temporary images during a build, use them to compile dependencies, then copy just the compiled assets into the final image.

An example of the previous Datasette Publish generated Dockerfile can be seen here. Here’s a rough outline of what it does:

  • Start with the python:3.6-slim-stretch image
  • apt-installs python3-dev and gcc so it can compile Python libraries with binary dependencies (pandas and uvloop for example)
  • Use pip to install csvs-to-sqlite and datasette
  • Add the uploaded CSV files, then run csvs-to-sqlite to convert them into a SQLite database
  • Run datasette inspect to cache a JSON file with information about the different tables
  • Run datasette serve to serve the resulting web application

There’s a lot of scope for improvement here. The final image has all sorts of cruft that’s not actually needed for serving the image: it has csvs-to-sqlite and all of its dependencies, plus the original uploaded CSV files.

Here’s the workflow I used to build a Dockerfile and check the size of the resulting image. My work-in-progress can be found in the datasette-small repo.

# Build the Dockerfile in the current directory and tag as datasette-small
$ docker build . -t datasette-small
# Inspect the size of the resulting image
$ docker images | grep datasette-small
# Start the container running
$ docker run -d -p 8006:8006 datasette-small
654d3fc4d3343c6b73414c6fb4b2933afc56fbba1f282dde9f515ac6cdbc5339
# Now visit http://localhost:8006/ to see it running

Alpine Linux

When you start looking for ways to build smaller Dockerfiles, the first thing you will encounter is Alpine Linux. Alpine is a Linux distribution that’s perfect for containers: it builds on top of BusyBox to strip down to the smallest possible image that can still do useful things.

The python:3.6-alpine container should be perfect: it gives you the smallest possible container that can run Python 3.6 applications (including the ability to pip install additional dependencies).

There’s just one problem: in order to install C-based dependencies like pandas (used by csvs-to-sqlite) and Sanic (used by Datasette) you need a compiler toolchain. Alpine doesn’t have this out-of-the-box, but you can install one using Alpine’s apk package manager. Of course, now you’re bloating your container with a bunch of compilation tools that you don’t need to serve the final image.

This is what makes multi-stage builds so useful! We can spin up an Alpine image with the compilers installed, build our modules, then copy the resulting binary blobs into a fresh container.

Here’s the basic recipe for doing that:

FROM python:3.6-alpine as builder

# Install and compile Datasette + its dependencies
RUN apk add --no-cache gcc python3-dev musl-dev alpine-sdk
RUN pip install datasette

# Now build a fresh container, copying across the compiled pieces
FROM python:3.6-alpine

COPY --from=builder /usr/local/lib/python3.6 /usr/local/lib/python3.6
COPY --from=builder /usr/local/bin/datasette /usr/local/bin/datasette

This pattern works really well, and produces delightfully slim images. My first attempt at this wasn’t quite slim enough to fit the 100MB limit though, so I had to break out some Docker tools to figure out exactly what was going on.

Inspecting docker image layers

Part of the magic of Docker is the concept of layers. When Docker builds a container it uses a layered filesystem (UnionFS) and creates a new layer for every executable line in the Dockerfile. This dramatically speeds up future builds (since layers can be reused if they have already been built) and also provides a powerful tool for inspecting different stages of the build.

When you run docker build part of the output is IDs of the different image layers as they are constructed:

datasette-small $ docker build . -t datasette-small
Sending build context to Docker daemon  2.023MB
Step 1/21 : FROM python:3.6-slim-stretch as csvbuilder
 ---> 971a5d5dad01
Step 2/21 : RUN apt-get update && apt-get install -y python3-dev gcc wget
 ---> Running in f81485df62dd

Given a layer ID, like 971a5d5dad01, it’s possible to spin up a new container that exposes the exact state of that layer (thanks, Stack Overflow). Here’s how to do that:

docker run -it --rm 971a5d5dad01 sh

The -it argument attaches standard input to the container (-i) and allocates a pseudo-TTY (-t). The --rm option means that the container will be removed when you Ctrl+D back out of it. sh is the command we want to run in the container—using a shell lets us start interacting with it.

Now that we have a shell against that layer, we can use regular unix commands to start exploring it. du -m (m for MB) is particularly useful here, as it will show us the largest directories in the filesystem. I pipe it through sort like so:

$ docker run -it --rm abc63755616b sh
# du -m | sort -n
...
58  ./usr/local/lib/python3.6
70  ./usr/local/lib
71  ./usr/local
76  ./usr/lib/python3.5
188 ./usr/lib
306 ./usr
350 .

Straight away we can start seeing where the space is being taken up in our image.

Deleting unnecessary files

I spent quite a while inspecting different stages of my builds to try and figure out where the space was going. The alpine copy recipe worked neatly, but I was still a little over the limit. When I started to dig around in my final image I spotted some interesting patterns—in particular, the /usr/local/lib/python3.6/site-packages/uvloop directory was 17MB!

# du -m /usr/local | sort -n -r | head -n 5
96  /usr/local
95  /usr/local/lib
83  /usr/local/lib/python3.6
36  /usr/local/lib/python3.6/site-packages
17  /usr/local/lib/python3.6/site-packages/uvloop

That seems like a lot of disk space for a compiled C module, so I dug in further…

It turned out the uvloop folder still contained a bunch of files that were used as part of the compilation, including a 6.7MB loop.c file and a bunch of .pxd and .pyd files that are compiled by Cython. None of these files are needed after the extension has been compiled, but they were there, taking up a bunch of precious space.

So I added the following to my Dockerfile:

RUN find /usr/local/lib/python3.6 -name '*.c' -delete
RUN find /usr/local/lib/python3.6 -name '*.pxd' -delete
RUN find /usr/local/lib/python3.6 -name '*.pyd' -delete

Then I noticed that there were __pycache__ files that weren’t needed either, so I added this as well:

RUN find /usr/local/lib/python3.6 -name '__pycache__' | xargs rm -r

(The -delete flag didn’t work correctly for that one, so I used xargs instead.)

This shaved off around 15MB, putting me safely under the limit.

Running csvs-to-sqlite in its own stage

The above tricks had got me the smallest Alpine Linux image I could create that would still run Datasette… but Datasette Publish also needs to run csvs-to-sqlite in order to convert the user’s uploaded CSV files to SQLite.

csvs-to-sqlite has some pretty heavy dependencies of its own in the form of Pandas and NumPy. Even with the build chain installed I was having trouble installing these under Alpine, especially since building numpy for Alpine is notoriously slow.

Then I realized that thanks to multi-stage builds there’s no need for me to use Alpine at all for this step. I switched back to python:3.6-slim-stretch and used it to install csvs-to-sqlite and compile the CSV files into a SQLite database. I also ran datasette inspect there for good measure.

Then in my final Alpine container I could use the following to copy in just those compiled assets:

COPY --from=csvbuilder inspect-data.json inspect-data.json
COPY --from=csvbuilder data.db data.db

Tying it all together

Here’s an example of a full Dockerfile generated by Datasette Publish that combines all of these tricks. To summarize, here’s what it does:

  • Spin up a python:3.6-slim-stretch—call it csvbuilder
    • apt-get install -y python3-dev gcc so we can install compiled dependencies
    • pip install csvs-to-sqlite datasette
    • Copy in the uploaded CSV files
    • Run csvs-to-sqlite to convert them into a SQLite database
    • Run datasette inspect data.db to generate an inspect-data.json file with statistics about the tables. This can later be used to reduce startup time for datasette serve.
  • Spin up a python:3.6-alpine—call it buildit
    • We need a build chain to compile a copy of datasette for Alpine Linux…
    • apk add --no-cache gcc python3-dev musl-dev alpine-sdk
    • Now we can pip install datasette, plus any requested plugins
    • Reduce the final image size by deleting any __pycache__ or *.c, *.pyd and *.pxd files.
  • Spin up a fresh python:3.6-alpine for our final image
    • Copy in data.db and inspect-data.json from csvbuilder
    • Copy across /usr/local/lib/python3.6 and /usr/local/bin/datasette from buildit
    • … and we’re done! Expose port 8006 and set datasette serve to run when the container is started

Now that I’ve finally learned how to take advantage of multi-stage builds I expect I’ll be using them for all sorts of interesting things in the future.

Automatically playing science communication games with transfer learning and fastai three months ago

This weekend was the 9th annual Science Hack Day San Francisco, which was also the 100th Science Hack Day held worldwide.

Natalie and I decided to combine our interests and build something fun.

I’m currently enrolled in Jeremy Howard’s Deep Learning course so I figured this was a great opportunity to try out some computer vision.

Natalie runs the SciComm Games calendar and accompanying @SciCommGames bot to promote and catalogue science communication hashtag games on Twitter.

Hashtag games? Natalie explains them here—essentially they are games run by scientists on Twitter to foster public engagement around an animal or topic by challenging people to identify if a photo is a #cougarOrNot or participate in a #TrickyBirdID or identify #CrowOrNo or many others.

Combining the two… we decided to build a bot that automatically plays these games using computer vision. So far it’s just trying #cougarOrNot—you can see the bot in action at @critter_vision.

Training data from iNaturalist

In order to build a machine learning model, you need to start out with some training data.

I’m a big fan of iNaturalist, a citizen science project that encourages users to upload photographs of wildlife (and plants) they have seen and have their observations verified by a community. Natalie and I used it to build owlsnearme.com earlier this year—the API in particular is fantastic.

iNaturalist has over 5,000 verified sightings of felines (cougars, bobcats, domestic cats and more) in the USA.

The raw data is available as a paginated JSON API. The medium-sized photos are just the right size for training a neural network.

I started by grabbing 5,000 images and saving them to disk with a filename that reflected their identified species:

Bobcat_9005106.jpg
Domestic-Cat_10068710.jpg
Bobcat_15713672.jpg
Domestic-Cat_6755280.jpg
Mountain-Lion_9075705.jpg
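
The download script itself follows a simple pattern. Here's a rough sketch; the endpoint is real, but the query parameters, response fields and taxon_id value shown here are assumptions for illustration rather than a precise transcript of the iNaturalist API:

import requests

for page in range(1, 26):  # 25 pages of 200 observations is roughly 5,000 images
    response = requests.get(
        "https://api.inaturalist.org/v1/observations",
        params={
            "taxon_id": 41660,        # placeholder value standing in for "felines"
            "quality_grade": "research",
            "per_page": 200,
            "page": page,
        },
    )
    for observation in response.json().get("results", []):
        label = observation["taxon"]["preferred_common_name"].replace(" ", "-")
        for photo in observation["photos"]:
            image = requests.get(photo["medium_url"])
            # Bake the species label into the filename so it can be pulled back
            # out later with a regular expression
            with open(f"{label}_{photo['id']}.jpg", "wb") as fp:
                fp.write(image.content)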

Building a model

I’m only one week into the fast.ai course so this really isn’t particularly sophisticated yet, but it was just about good enough to power our hack.

The main technique we are learning in the course is called transfer learning, and it really is shockingly effective. Instead of training a model from scratch you start out with a pre-trained model and use some extra labelled images to train a small number of extra layers.

The initial model we are using is ResNet-34, a 34-layer neural network trained on 1,000 labelled categories in the ImageNet corpus.

In class, we learned to use this technique to get 94% accuracy against the Oxford-IIIT Pet Dataset—around 7,000 images covering 12 cat breeds and 25 dog breeds. In 2012 the researchers at Oxford were able to get 59.21% using a sophisticated model—in 2018 we can get 94% with transfer learning and just a few lines of code.

I started with an example provided in class, which loads and trains images from files on disk using a regular expression that extracts the labels from the filenames.

My full Jupyter notebook is inaturalist-cats.ipynb—the key training code is as follows:

from fastai import *
from fastai.vision import *
cat_images_path = Path('/home/jupyter/.fastai/data/inaturalist-usa-cats/images')
cat_fnames = get_image_files(cat_images_path)
cat_data = ImageDataBunch.from_name_re(
    cat_images_path,
    cat_fnames,
    r'/([^/]+)_\d+.jpg$',
    ds_tfms=get_transforms(),
    size=224
)
cat_data.normalize(imagenet_stats)
cat_learn = ConvLearner(cat_data, models.resnet34, metrics=error_rate)
cat_learn.fit_one_cycle(4)
# Save the generated model to disk
cat_learn.save("usa-inaturalist-cats")

Calling cat_learn.save("usa-inaturalist-cats") created an 84MB file on disk at /home/jupyter/.fastai/data/inaturalist-usa-cats/images/models/usa-inaturalist-cats.pth—I used scp to copy that model down to my laptop.

This model gave me a 24% error rate which is pretty terrible—others on the course have been getting error rates less than 10% for all kinds of interesting problems. My focus was to get a model deployed as an API though so I haven’t spent any additional time fine-tuning things yet.

Deploying the model as an API

The fastai library strongly encourages training against a GPU, using pytorch and PyCUDA. I’ve been using an n1-highmem-8 Google Cloud Platform instance with an attached Tesla P4, then running everything in a Jupyter notebook there. This costs around $0.38 an hour—fine for a few hours of training, but way too expensive to permanently host a model.

Thankfully, while a GPU is essential for productively training models it’s not nearly as important for evaluating them against new data. pytorch can run in CPU mode for that just fine on standard hardware, and the fastai README includes instructions on installing it for a CPU using pip.

I started out by ensuring I could execute my generated model on my own laptop (since pytorch doesn’t yet work with the GPU built into the Macbook Pro). Once I had that working, I used the resulting code to write a tiny Starlette-powered API server. The code for that can be found in cougar.py.

fastai is under very heavy development and the latest version doesn’t quite have a clean way of loading a model from disk without also including the initial training images, so I had to hack around quite a bit to get this working using clues from the fastai forums. I expect this to get much easier over the next few weeks as the library continues to evolve based on feedback from the current course.
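
Setting the model loading details aside, the web layer for this kind of API can be very small. Here's a heavily simplified sketch of a Starlette upload endpoint; predict_cougar() is a hypothetical stand-in for the fastai prediction call, and the real cougar.py is structured differently:

from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route

def predict_cougar(image_bytes):
    # Hypothetical stand-in: in the real app this runs the fastai model against
    # the uploaded image and returns the predicted labels and probabilities
    return {"predictions": []}

async def upload(request):
    # Parsing multipart form data requires the python-multipart package
    form = await request.form()
    image_bytes = await form["file"].read()
    return JSONResponse(predict_cougar(image_bytes))

app = Starlette(routes=[Route("/upload", upload, methods=["POST"])])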

To deploy the API I wrote a Dockerfile and shipped it to Zeit Now. Now remains my go-to choice for this kind of project, though unfortunately their new (and brilliant) v2 platform imposes a 100MB image size limit—not nearly enough when the model file itself weighs in at 83 MB. Thankfully it’s still possible to specify their v1 cloud which is more forgiving for larger applications.

Here’s the result: an API which can accept either the URL to an image or an uploaded image file: https://cougar-or-not.now.sh/—try it out with a cougar and a bobcat.

The Twitter Bot

Natalie built the Twitter bot. It runs as a scheduled task on Heroku and works by checking for new #cougarOrNot tweets from Dr. Michelle LaRue, extracting any images, passing them to my API and replying with a tweet that summarizes the results. Take a look at its recent replies to get a feel for how well it is doing.

Amusingly, Dr. LaRue frequently tweets memes to promote upcoming competitions and marks them with the same hashtag. The bot appears to think that most of the memes are bobcats! I should definitely spend some time tuning that model.

Science Hack Day was great fun. A big thanks to the organizing team, and congrats to all of the other participants. I’m really looking forward to the next one.

Plus… we won a medal!

How to Instantly Publish Data to the Internet with Datasette three months ago

I spoke about my Datasette project at PyBay in August and they’ve just posted the video of my talk.

I’ve also published the annotated slides from the talk, using a similar format to my Redis tutorial from back in 2010.

How I moderated the State of Django panel at DjangoCon US. three months ago

On Wednesday last week I moderated the State of Django panel as the closing session for DjangoCon US 2018.

I think it went well, so I’m writing some notes on exactly how we did it. In my experience it’s worth doing this for things like public speaking: in six months time I might moderate another panel and I’ll be desperately trying to remember what went right last time.

Panels are hard. Bad panels are way too common, to the point that some good conferences actively avoid having panels at all.

In my opinion, a good panel has a number of important attributes:

  • The panel needs a coherent theme. It shouldn’t just be several independent speakers that happen to be sharing the same time slot.
  • Panels need to be balanced. Having just one or two of the speakers monopolize the conversation is bad for the audience and bad for the panelists themselves.
  • The moderator is there to facilitate the conversation, not to be the center of attention. I love public speaking so I feel the need to be particularly cautious here.
  • Panelists need to have diverse perspectives on the topics under discussion. A panel where everyone agrees with each other and makes the same points is a very boring panel indeed.

I originally pitched the panel to the DjangoCon US organizing committee as a “State of Django” conversation where core maintainers would talk about the current state of the project.

They countered with a much better idea: a panel that encompassed both the state of the Django framework and the community and ecosystem that it exists within. Since DjangoCon is primarily about bringing that community together this was a much better fit for the conference, and would make for a much more interesting and relevant discussion.

I worked together with the conference organizers to find our panelists. Nicholle James in particular was the driving force behind assembling the panelists and ensuring everything was in place for the panel to succeed.

We ended up with a panel representing a comprehensive cross-section of the organizations that make the Django community work:

  • Andrew Godwin, representing Django Core
  • Anna Makarudze, representing the DSF, Django Girls and Python Africa
  • Frank Wiles, President of the DSF
  • Jeff Triplett, President of DEFNA, member of the Board of Directors for the PSF
  • Josue Balandrano Coronel, representing DEFNA
  • Katherine Michel, representing DEFNA and DjangoCon US Website Chair
  • Kojo Idrissa, DEFNA North American Ambassador
  • Rachell Calhoun, representing Django Girls and PyLadies

As it was the closing session for the conference, I wanted the panel to both celebrate the progress of the community and project and to explore what we need to do next: what should we be working on to improve Django and its community in the future?

I had some initial thoughts on topics, but since the panel was scheduled for the last session of the conference I decided to spend the conference itself firming up the topics that would be discussed. This was a really good call: we got to create an agenda for the panel that was almost entirely informed by the other conference sessions combined with hot topics from the hallway track. We also asked conference attendees for their suggestions via an online form, and used those suggestions to further inform the topics that were discussed.

I made sure to have a 10-15 minute conversation one-on-one with each of the panelists during the conference. We then got together for an hour at lunch before the panel to sync up with the topics and themes we would be discussing.

Topic and themes

Our pre-panel conversations highlighted a powerful theme for the panel itself, which I ended up summarizing as “What can the Django project learn from the Django community?” This formed the framework for the other themes of the panel.

The themes themselves were:

One of the hardest parts for me was figuring out the order in which we would tackle these themes. I ended up settling on the above order about half an hour before the panel started.

Opening and closing

With eight panelists, ensuring that introductions didn’t drag on too long was particularly important. I asked each panelist to introduce themselves with a couple of sentences that highlighted the organizations they were affiliated with that were relevant to the panel. For our chosen group of panelists this was definitely the right framing.

I then asked each panelist to be prepared to close the panel with a call to action: something an audience member could actively do that would support the community going forward. This worked really well: it provided a great, actionable note to end both the panel and the conference.

Preparing and running the panel

We used our panel lunch together to check that no one would have calls to action that overlapped too much, and to provide a rough indication of who had things to say about each topic we planned to discuss.

This turned out to be essential: I’ve been on smaller panels where the panelists have been able to riff easily on each other’s points, but with eight panelists it turned out not everyone could even see each other, so as panel moderator it fell on me to direct questions to individuals and then prompt others for follow-ups. Thankfully the panel lunch combined with the one-to-one conversations gave me the information I needed for this.

I had written down a selection of questions for each of the themes. Having a selection turned out to be crucial: a few times the panelists talked about material that I had planned to cover in a later section, so I had to adapt as we went along. In the future I’ll spend more time on this: these written ideas were a crucial component in keeping the panel flowing in the right direction.

With everything in place the panel itself was a case of concentrating on what everyone was saying and using the selection of the next questions (plus careful ad-libbing) to guide the conversation along the preselected themes. I also tried to keep mental track of who had done the most speaking so I could ensure the conversation stayed as balanced as possible by inviting other panelists into the conversation.

The video of the panel should be out in around three weeks time, at which point you can evaluate for yourself if we managed to do a good job of it. I’m really happy with the feedback we got after the panel, and I plan to use a similar process for panels I’m involved with in the future.

The interesting ideas in Datasette four months ago

Datasette (previously) is my open source tool for exploring and publishing structured data. There are a lot of ideas embedded in Datasette. I realized that I haven’t put many of them into writing.

Publishing read-only data
Bundling the data with the code
SQLite as the underlying data engine
Far-future cache expiration
Publishing as a core feature
License and source metadata
Facet everything
Respect for CSV
SQL as an API language
Optimistic query execution with time limits
Keyset pagination
Interactive demos based on the unit tests
Documentation unit tests

Publishing read-only data

Datasette provides a read-only API to your data. It makes no attempt to deal with writes. Avoiding writes entirely is fundamental to a plethora of interesting properties, many of which are expanded on further below. In brief:

  • Hosting web applications with no read/write persistence requirements is incredibly cheap in 2018—often free (both Zeit Now and Heroku have generous free tiers). This is a big deal: even having to pay a few dollars a month is enough to disincentivise sharing data, since now you have to figure out who will pay and ensure the payments don’t expire in the future.
  • Being read-only makes it trivial to scale: just add more instances, each with their own copy of the data. All of the hard problems in scaling web applications that relate to writable data stores can be skipped entirely.
  • Since the database file is opened using SQLite’s immutable mode, we can accept arbitrary SQL queries with no risk of them corrupting the data (the sketch just after this list shows what immutable mode looks like from Python).
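
For anyone curious, here's a minimal sketch of opening a database in immutable mode from Python (the filename is a placeholder):

import sqlite3

# immutable=1 in a SQLite URI tells SQLite the file will never change: it skips
# locking entirely and treats the database as strictly read-only
conn = sqlite3.connect("file:mydatabase.db?immutable=1", uri=True)
print(conn.execute("select count(*) from sqlite_master").fetchone())
# Any INSERT or UPDATE against this connection raises sqlite3.OperationalError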

Any time your data changes, you need to publish a brand new copy of the whole database. With the right hosting this is easy: deploy a brand new copy of your data and application in parallel to your existing live deployment, then switch over incoming HTTP traffic to your API at the load balancer level. Heroku and Zeit Now both support this strategy out of the box.

Bundling the data with the code

Since the data is read-only and is encapsulated in a single binary SQLite database file, we can bundle the data as part of the app. This means we can trivially create and publish Docker images that provide both the data and the API and UI for accessing it. We can also publish to any hosting provider that will allow us to run a Python application, without also needing to provision a mutable database.

The datasette package command takes one or more SQLite databases and bundles them together with the Datasette application in a single Docker image, ready to be deployed anywhere that can run Docker containers.

SQLite as the underlying data engine

Datasette encourages people to use SQLite as a standard format for publishing data.

Relational databases are great: once you know how to use them, you can represent any data you can imagine using a carefully designed schema.

What about data that’s too unstructured to fit a relational schema? SQLite includes excellent support for JSON data—so if you can’t shape your data to fit a table schema you can instead store it as text blobs of JSON—and use SQLite’s JSON functions to filter by or extract specific fields.
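
For example, here's a minimal sketch of that pattern from Python (it assumes a SQLite build with the JSON1 extension enabled, which most recent builds include):

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table events (id integer primary key, body text)")
conn.execute(
    "insert into events (body) values (?)",
    (json.dumps({"type": "observation", "tags": ["cougar", "bobcat"]}),),
)
# JSON1 functions can filter on and extract fields from the stored text blobs
row = conn.execute(
    "select json_extract(body, '$.tags[0]') from events"
    " where json_extract(body, '$.type') = 'observation'"
).fetchone()
print(row)  # ('cougar',)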

What about binary data? Even that’s covered: SQLite will happily store binary blobs. My datasette-render-images plugin (live demo here) is one example of a tool that works with binary image data stored in SQLite blobs.

What if my data is too big? Datasette is not a “big data” tool, but if your definition of big data is something that won’t fit in RAM that threshold is growing all the time (2TB of RAM on a single AWS instance now costs less than $4/hour).

I’ve personally had great results from multiple GB SQLite databases and Datasette. The theoretical maximum size of a single SQLite database is around 140TB.

SQLite also has built-in support for surprisingly good full-text search, and thanks to being extensible via modules has excellent geospatial functionality in the form of the SpatiaLite extension. Datasette benefits enormously from this wider ecosystem.

The reason most developers avoid SQLite for production web applications is that it doesn’t deal brilliantly with large volumes of concurrent writes. Since Datasette is read-only we can entirely ignore this limitation.

Far-future cache expiration

Since the data in a Datasette instance never changes, why not cache calls to it forever?

Datasette sends a far future HTTP cache expiry header with every API response. This means that browsers will only ever fetch data the first time a specific URL is accessed, and if you host Datasette behind a CDN such as Fastly or Cloudflare each unique API call will hit Datasette just once and then be cached essentially forever by the CDN.

This means it’s safe to deploy a JavaScript app using an inexpensively hosted Datasette-backed API to the front page of even a high traffic site—the CDN will easily take the load.

Zeit added Cloudflare to every deployment (even their free tier) back in July, so if you are hosted there you get this CDN benefit for free.

What if you re-publish an updated copy of your data? Datasette has that covered too. You may have noticed that every Datasette database gets a hashed suffix automatically when it is deployed:

https://fivethirtyeight.datasettes.com/fivethirtyeight-c9e67c4

This suffix is based on the SHA256 hash of the entire database file contents—so any change to the data will result in new URLs. If you query a previous suffix Datasette will notice and redirect you to the new one.
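
Computing that kind of suffix is straightforward. Here's a sketch; the seven character truncation matches the example URL above, but treat the exact length as an implementation detail:

import hashlib

def db_hash_suffix(path, length=7):
    # Hash the entire database file; any change to the data produces a
    # different suffix and therefore a brand new set of URLs
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]

print(db_hash_suffix("fivethirtyeight.db"))  # something like "c9e67c4"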

If you know you’ll be changing your data, you can build your application against the non-suffixed URL. This will not be cached and will always 302 redirect to the correct version (and these redirects are extremely fast).

https://fivethirtyeight.datasettes.com/fivethirtyeight/alcohol-consumption%2Fdrinks.json

The redirect sends an HTTP/2 push header such that if you are running behind a CDN that understands push (such as Cloudflare) your browser won’t have to make two requests to follow the redirect. You can use the Chrome DevTools to see this in action:

Chrome DevTools showing a redirect initiated by an HTTP/2 push

And finally, if you need to opt out of HTTP caching for some reason you can disable it on a per-request basis by including ?_ttl=0 in the URL query string—for example, if you want to return a random member of the Avengers it doesn’t make sense to cache the response:

https://fivethirtyeight.datasettes.com/fivethirtyeight?sql=select+*+from+[avengers%2Favengers]+order+by+random()+limit+1&_ttl=0

Publishing as a core feature

Datasette aims to reduce the friction for publishing interesting data online as much as possible.

To this end, Datasette includes a “publish” subcommand:

# deploy to Heroku
datasette publish heroku mydatabase.db
# Or deploy to Zeit Now
datasette publish now mydatabase.db

These commands take one or more SQLite databases, upload them to a hosting provider, configure a Datasette instance to serve them and return the public URL of the newly deployed application.

Out of the box, Datasette can publish to either Heroku or to Zeit Now. The publish_subcommand plugin hook means other providers can be supported by writing plugins.

License and source metadata

Datasette believes that data should be accompanied by source information and a license, whenever possible. The metadata.json file that can be bundled with your data supports these. You can also provide source and license information when you run datasette publish:

datasette publish fivethirtyeight.db \
    --source="FiveThirtyEight" \
    --source_url="https://github.com/fivethirtyeight/data" \
    --license="CC BY 4.0" \
    --license_url="https://creativecommons.org/licenses/by/4.0/"

When you use these options Datasette will create the corresponding metadata.json file for you as part of the deployment.

Facet everything

I really love faceted search: it’s the first tool I turn to whenever I want to start understanding a collection of data. I’ve built faceted search engines on top of Solr, Elasticsearch and PostgreSQL and many of my favourite tools (like Splunk and Datadog) have it as a core feature.

Datasette automatically attempts to calculate facets against every table. You can read more about the Datasette Facets feature here—as a huge faceted search fan it’s one of my all-time favourite features of the project. Now I can add SQLite to the list of technologies I’ve used to build faceted search!

Respect for CSV

CSV is by far the most common format for sharing and publishing data online. Almost every useful data tool has the ability to export to it, and it remains the lingua franca of spreadsheet import and export.

It has many flaws: it can’t easily represent nested data structures, escaping rules for values containing commas are inconsistently implemented and it doesn’t have a standard way of representing character encoding.

Datasette aims to promote SQLite as a much better default format for publishing data. I would much rather download a .db file full of pre-structured data than download a .csv and then have to re-structure it as a separate piece of work.

But interacting well with the enormous CSV ecosystem is essential. Datasette has deep CSV export functionality: any data you can see, you can export—including the results of arbitrary SQL queries. If your query can be paginated Datasette can stream down every page in a single CSV file for you.

Datasette’s sister-tool csvs-to-sqlite handles the other side of the equation: importing data from CSV into SQLite tables. And the Datasette Publish web application allows users to upload their CSVs and have them deployed directly to their own fresh Datasette instance—no command line required.

SQL as an API language

A lot of people these days are excited about GraphQL, because it allows API clients to request exactly the data they need, including traversing into related objects in a single query.

Guess what? SQL has been able to do that since the 1970s!

There are a number of reasons most APIs don’t allow people to pass them arbitrary SQL queries:

  • Security: we don’t want people messing up our data
  • Performance: what if someone sends an accidental (or deliberate) expensive query that exhausts our resources?
  • Hiding implementation details: if people write SQL against our API we can never change the structure of our database tables

Datasette has answers to all three.

On security: the data is read-only, using SQLite’s immutable mode. You can’t damage it with a query—INSERT and UPDATEs will simply throw harmless errors.

On performance: SQLite has a mechanism for canceling queries that take longer than a certain threshold. Datasette sets this to one second by default, though you can alter that configuration if you need to (I often bump it up to ten seconds when exploring multi-GB data on my laptop).

On hidden implementation details: since we are publishing static data rather than maintaining an evolving API, we can mostly ignore this issue. If you are really worried about it you can take advantage of canned queries and SQL view definitions to expose a carefully selected forward-compatible view into your data.

Optimistic query execution with time limits

I mentioned Datasette’s SQL time limits above. These aren’t just there to avoid malicious queries: the idea of “optimistic SQL evaluation” is baked into some of Datasette’s core features.

Consider suggested facets—where Datasette inspects any table you view and tries to suggest columns that are worth faceting against.

The way this works is Datasette loops over every column in the table and runs a query to see if there are less than 20 unique values for that column. On a large table this could take a prohibitive amount of time, so Datasette sets an aggressive timeout on those queries: just 50ms. If the query fails to run in that time it is silently dropped and the column is not listed as a suggested facet.
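
One way to enforce that kind of budget from Python is SQLite's progress handler. Here's a sketch of the pattern rather than Datasette's exact implementation; the table and column names are placeholders:

import sqlite3
import time

def execute_with_time_limit(conn, sql, params=None, ms=50):
    # SQLite calls the handler every N virtual machine instructions; returning
    # a truthy value aborts the running query with an OperationalError
    deadline = time.monotonic() + ms / 1000
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        return conn.execute(sql, params or []).fetchall()
    except sqlite3.OperationalError:
        return None  # over budget: silently drop the suggestion
    finally:
        conn.set_progress_handler(None, 1000)

conn = sqlite3.connect("mydatabase.db")
# Suggest "state" as a facet only if it has fewer than 20 distinct values and
# the check completes within the 50ms budget
distinct = execute_with_time_limit(
    conn, "select distinct state from mytable limit 21", ms=50
)
suggest_facet = distinct is not None and len(distinct) < 20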

Datasette’s JSON API provides a mechanism for JavaScript applications to use that same pattern. If you add ?_timelimit=20 to any Datasette API call, the underlying query will only get 20ms to run. If it goes over you’ll get a very fast error response from the API. This means you can design your own features that attempt to optimistically run expensive queries without damaging the performance of your app.

Keyset pagination

SQL pagination using OFFSET/LIMIT has a fatal flaw: if you request page number 300 at 20 per page the underlying SQL engine needs to calculate and sort all 6,000 preceding rows before it can return the 20 you have requested.

This does not scale at all well.

Keyset pagination (often known by other names, including cursor-based pagination) is a far more efficient way to paginate through data. It works against ordered data. Each page is returned with a token representing the last record you saw, then when you request the next page the engine merely has to filter for records that are greater than that tokenized value and scan through the next 20 of them.

(Actually, it scans through 21. By requesting one more record than you intend to display you can detect if another page of results exists—if you ask for 21 but get back 20 or less you know you are on the last page.)

Datasette’s table view includes a sophisticated implementation of keyset pagination.

Datasette defaults to sorting by primary key (or SQLite rowid). This is perfect for efficient pagination: running a select against the primary key column for values greater than X is one of the fastest range scan queries any database can support. This allows users to paginate as deep as they like without paying the offset/limit performance penalty.
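
Here's the shape of that query from Python, as a sketch with placeholder table names (Datasette's own implementation also handles arbitrary sort orders and compound keys):

import sqlite3

def fetch_page(conn, after_rowid=0, page_size=20):
    # Keyset pagination: filter on the primary key instead of using OFFSET, and
    # fetch one extra row so we can tell whether another page exists
    rows = conn.execute(
        "select rowid, * from mytable where rowid > ? order by rowid limit ?",
        [after_rowid, page_size + 1],
    ).fetchall()
    has_next = len(rows) > page_size
    rows = rows[:page_size]
    next_token = rows[-1][0] if has_next else None  # rowid of the last row returned
    return rows, next_token

conn = sqlite3.connect("mydatabase.db")
page, token = fetch_page(conn)
if token is not None:
    next_page, token = fetch_page(conn, after_rowid=token)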

This is also how the “export all rows as CSV” option works: when you select that option, Datasette opens a stream to your browser and internally starts keyset-pagination over the entire table. This keeps resource usage in check even while streaming back millions of rows.

Here’s where Datasette gets fancy: it handles keyset pagination for any other sort order as well. If you sort by any column and click “next” you’ll be requesting the next set of rows after the last value you saw. And this even works for columns containing duplicate values: If you sort by such a column, Datasette actually sorts by that column combined with the primary key. The “next” pagination token it generates encodes both the sorted value and the primary key, allowing it to correctly serve you the next page when you click the link.

Try clicking “next” on this page to see keyset pagination against a sorted column in action.

Interactive demos based on the unit tests

I love interactive demos. I decided it would be useful if every single release of Datasette had a permanent interactive demo illustrating its features.

Thanks to Zeit Now, this was pretty easy to set up. I’ve actually taken it a step further: every successful push to master on GitHub is also deployed to a permanent URL.

Some examples:

The database that is used for this demo is the exact same database that is created by Datasette’s unit test fixtures. The unit tests are already designed to exercise every feature, so reusing them for a live demo makes a lot of sense.

You can view this test database on your own machine by checking out the full Datasette repository from GitHub and running the following:

python tests/fixtures.py fixtures.db metadata.json
datasette fixtures.db -m metadata.json

Here’s the code in the Datasette Travis CI configuration that deploys a live demo for every commit and every released tag.

Documentation unit tests

I wrote about the Documentation unit tests pattern back in July.

Datasette’s unit tests include some assertions that ensure that every plugin hook, configuration setting and underlying view class is mentioned in the documentation. A commit or pull request that adds or modifies these without also updating the documentation (or at least ensuring there is a corresponding heading in the docs) will fail its tests.
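
The pattern itself is easy to borrow. Here's a minimal sketch of such a test; the hard-coded names and docs path are placeholders, since a real project would introspect them from its plugin manager or configuration system:

from pathlib import Path

import pytest

DOCUMENTED_NAMES = ["publish_subcommand", "render_cell", "extra_css_urls"]  # placeholders

def all_documentation():
    # Concatenate every reStructuredText file in the docs/ directory
    return "\n".join(path.read_text() for path in Path("docs").glob("**/*.rst"))

@pytest.mark.parametrize("name", DOCUMENTED_NAMES)
def test_name_is_documented(name):
    # Fails the build if something is added without a corresponding mention in the docs
    assert name in all_documentation()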

Learning more

Datasette’s documentation is in pretty good shape now, and the changelog provides a detailed overview of new features that I’ve added to the project. I presented Datasette at the PyBay conference in August and I’ve published my annotated slides from that talk. I was interviewed about Datasette for the Changelog podcast in May and my notes from that conversation include some of my favourite demos.

Datasette now has an official Twitter account—you can follow @datasetteproj there for updates about the project.

Elsewhere

15th February 2019

  • If you want the fastest website despite implementation difficulty, the answer is: SSR behind a CDN with assets in best compression formats (webp, Brotli, woff2) served over http2 (or 3) from same origin with JS as enhancement only

    Mike Sherov #

  • Data science is different now (via) Detailed examination of the current state of the job market for data science. Boot camps and university courses have produced a growing volume of junior data scientists seeking work, but the job market is much more competitive than many expected—especially for those without prior experience. Meanwhile the job itself is much more about data cleanup and software engineering skills: machine learning models and applied statistics end up being a small portion of the actual work. #

14th February 2019

  • Vitess (via) I remember looking at Vitess when it was first released by YouTube in 2012. The idea of a proven horizontally scalable sharding mechanism for MySQL was exciting, but I was put off by the need for a custom Go or Java client library. Apparently that changed with Vitess 2.1 in April 2017, the first version to introduce a MySQL protocol compatible proxy which can be connected to by existing code written in any language. Vitess 3.0 came out last December so now the MySQL proxy layer is much more stable. Vitess is used in production by a bunch of other companies now (including Slack and Square) so it’s definitely worth a closer look. #
  • Operations engineering does not consist of firefighting your shitty software, it is the science of delivering value to users.

    Charity Majors #

13th February 2019

  • django-zombodb (via) The hardest part of working with an external search engine like Elasticsearch is always keeping that index synchronized with your relational database. ZomboDB is a PostgreSQL extension which lets you create a new type of index backed by an external Elasticsearch cluster. Updated rows will be pushed to the index automatically, and custom SQL syntax can then be used to execute searches. django-zombodb is a brand new library by Flávio Juvenal which integrates ZomboDB directly into the Django ORM, letting you add Elasticsearch-backed functionality with just a few lines of extra configuration. It even includes custom Django migrations for enabling the extension in PostgreSQL! #

12th February 2019

  • Private blockchains are completely uninteresting. (By this, I mean systems that use the blockchain data structure but don’t have the above three elements.) In general, they have some external limitation on who can interact with the blockchain and its features. These are not anything new; they’re distributed append-only data structures with a list of individuals authorized to add to it. Consensus protocols have been studied in distributed systems for more than 60 years. Append-only data structures have been similarly well covered. They’re blockchains in name only, and -- as far as I can tell -- the only reason to operate one is to ride on the blockchain hype.

    Bruce Schneier #

8th February 2019

  • socrata2sql (via) Phenomenal new open source tool released by Andrew Chavez at the Dallas Morning News. Socrata is the open data portal software used by huge numbers of local governments worldwide. socrata2sql is a tool that interacts with the standard Socrata API and can use it to suck down a dataset and save it as a SQLite, PostgreSQL, MySQL or other SQLAlchemy-supported database. I just tried this and it took a single command to create a SQLite database of every police arrest in Dallas in the past five years. #
  • db-to-sqlite (via) I just released version 0.2 of a tiny CLI utility I’ve been working on. It builds on top of SQLAlchemy and lets you connect to any SQLAlchemy-supported database and convert the data from it to a local SQLite database file. The new --all option will mirror all available tables (including foreign key relationships), or you can use --sql to save the results of custom SQL queries. #

6th February 2019

  • Questions for a new technology. Kellan poses 8 questions which should be asked of any technology that is being proposed for inclusion in an existing tech stack. I’m particularly fond of “Will this solution kill and eat the solution that it replaces?”. My rule of thumb these days is that new technology either needs to make something possible that isn’t possible at all with the existing stack, or it needs to represent at least a 3X productivity improvement in order to compensate for the switching and retraining costs across a large team. #

1st February 2019

  • The Datasette Ecosystem. I’ve written a page of documentation that introduces the wider Datasette Ecosystem: csvs-to-sqlite, sqlite-utils, db-to-sqlite, dbf-to-sqlite, markdown-to-sqlite and a full collection of Datasette plugins. #
  • Datasette 0.27 (via) The latest release of Datasette introduces an option to output tables and SQL query results as newline-delimited JSON—plus a new “datasette plugins” command for listing available plugins. #

30th January 2019

  • Everyone is angry about CSS again. I’m not even going to try to summarize the arguments. However it always seems to boil down to the fact that CSS is simultaneously too easy to bother with, yet so hard it needs to be wrapped up in a ball of JavaScript in case it scares the horses.

    Rachel Andrew #

28th January 2019

  • Since Mozilla moved on from Firefox OS, its derivatives have shipped on an order of magnitude more devices than during its entire time under Mozilla’s leadership and it has gone on to form the basis of the third largest and fastest growing mobile operating system in the world.

    Ben Francis #