Simon Willison’s Weblog

On pypi, projects, webperformance, weeknotes, deno, ...

 

Recent entries

Weeknotes: Datasette Cloud and zero downtime deployments four days ago

Yesterday’s piece on Deploying a data API using GitHub Actions and Cloud Run was originally intended to be my weeknotes, but ended up getting a bit too involved.

Aside from playing with GitHub Actions and Cloud Run, my focus over the past week has been working on Datasette Cloud. Datasette Cloud is the current name I’m using for my hosted Datasette product—the idea being that I’ll find it a lot easier to get feedback on Datasette from journalists if they can use it without having to install anything!

My MVP for Datasette Cloud is that I can use it to instantly provision a new, private Datasette instance for a journalist (or team of journalists) that they can then sign into, start playing with and start uploading their data to (initially as CSV files).

I have to solve quite a few problems to get there:

  • Secure, isolated instances of Datasette. A team or user should only be able to see their own files. I plan to solve this using Docker containers that are mounted such that they can only see their own dedicated volumes.
  • The ability to provision new instances as easily as possible—and give each one its own HTTPS subdomain.
  • Authentication: users need to be able to register and sign in to accounts. I could use datasette-auth-github for this but I’d like to be able to support regular email/password accounts too.
  • Users need to be able to upload CSV files and have them converted into a SQLite database compatible with Datasette.

Zero downtime deployments

I have a stretch goal which I’m taking pretty seriously: I want to have a mechanism in place for zero-downtime deployments of new versions of the software.

Arguably this is an unnecessary complication for an MVP. I may not fully implement it, but I do want to at least know that the path I’ve taken is compatible with zero downtime deployments.

Why do zero downtime deployments matter so much to me? Because they are desirable for rapid iteration, and crucial for setting up continuous deployment. Even a couple of seconds of downtime during a deployment creates a psychological incentive not to deploy too often. I’ve seen the productivity boost that deploying fearlessly multiple times a day brings, and I want it.

So I’ve been doing a bunch of research into zero downtime deployment options (thanks to some great help on Twitter) and I think I have something that’s going to work for me.

The first ingredient is Traefik—a new-to-me edge router (similar to nginx) which has a delightful focus on runtime configuration based on automatic discovery.

It works with a bunch of different technology stacks, but I’m going to be using it with regular Docker. Traefik watches for new Docker containers, reads their labels and uses those labels to route traffic to them.

So I can launch a new Docker container, apply the Docker label "traefik.frontend.rule": "Host:subdomain.mydomain.com" and Traefik will start proxying traffic to that subdomain directly to that container.

Traefik also has extremely robust built-in support for Let’s Encrypt to issue certificates. I managed to issue a wildcard TLS certificate for my entire domain, so new subdomains are encrypted straight away. This did require me to give Traefik API access to modify DNS entries—I’m running DNS for this project on Digital Ocean and thankfully Traefik knows how to do this by talking to their API.

That solves provisioning: when I create a new account I can call the Docker API (from Python) to start up a new, labelled container on a subdomain protected by a TLS certificate.
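
Something like this should work for that step—a minimal sketch using the Docker SDK for Python, where the image, command, volume layout, network name and domain are all assumptions rather than the real Datasette Cloud configuration:

import docker

client = docker.from_env()


def provision_instance(subdomain, volume_name):
    # Start a dedicated Datasette container that can only see its own volume,
    # labelled so that Traefik routes the subdomain's traffic to it.
    return client.containers.run(
        "datasetteproject/datasette",  # assumption: the official Datasette image
        command="datasette -h 0.0.0.0 -p 8001 /mnt/data/data.db",  # hypothetical path
        detach=True,
        labels={"traefik.frontend.rule": "Host:{}.mydomain.com".format(subdomain)},
        volumes={volume_name: {"bind": "/mnt/data", "mode": "rw"}},
        network="web",  # assumption: a shared network that Traefik is watching
    )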

I still needed a way to run a zero-downtime deployment of a new container (for example when I release a new version of Datasette and want to upgrade everyone). After quite a bit of research (during which I discovered you can’t modify the labels on a Docker container without restarting it) I settled on the approach described in this article.

Essentially you configure Traefik to retry failed requests, start a new, updated container with the same routing information as the existing one (causing Traefik to load balance HTTP requests across both), then shut down the old container and trust Traefik to retry in-flight requests against the one that’s still running.
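
Here’s a rough sketch of that sequence, again using the Docker SDK for Python—the container name, network and sleep-based health check are hypothetical simplifications:

import time

import docker

client = docker.from_env()


def rolling_upgrade(old_container_name, image, labels):
    # Start the replacement with the same Traefik routing labels, so Traefik
    # temporarily load balances across both the old and the new container.
    new = client.containers.run(image, detach=True, labels=labels, network="web")
    time.sleep(10)  # crude: give the new container time to start serving
    old = client.containers.get(old_container_name)
    old.stop()  # Traefik's retry sends in-flight requests to the new container
    old.remove()
    return new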

Rudimentary testing with ab suggested that this is working as desired.

One remaining problem: if Traefik is running in a Docker container and proxying all of my traffic, how can I upgrade Traefik itself without any downtime?

Consensus on Twitter seems to be that Docker on its own doesn’t have a great mechanism for this (I was hoping I could re-route port 80 traffic to the host to a different container in an atomic way). But... iptables has mechanisms that can re-route traffic from one port to another—so I should be able to run a new Traefik container on a different port and re-route to it at the operating system level.

That’s quite enough yak shaving around zero downtime deployments for now!

datasette-upload-csvs

A big problem I’m seeing with the current Datasette ecosystem is that while Datasette offers a web-based user interface for querying and accessing data, the tools I’ve written for actually creating those databases are decidedly command-line only.

Telling journalists they have to learn to install and run software on the command-line is way too high a barrier to entry.

I’ve always intended to have Datasette plugins that can handle uploading and converting data. It’s time to actually build one!

datasette-upload-csvs is what I’ve got so far. It has a big warning not to use it in the README—it’s very alpha software at the moment—but it does prove that the concept can work.

It uses the asgi_wrapper plugin hook to intercept requests to the path /-/upload-csv and forward them on to another ASGI app, written using Starlette, which provides a basic upload form and then handles the upload.

Uploaded CSVs are converted to SQLite using sqlite-utils and written to the first mutable database attached to Datasette.
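
The conversion step itself only needs a few lines. Here’s a simplified sketch (not the plugin’s actual code—the function name and arguments are illustrative):

import csv
import io

import sqlite_utils


def load_csv(db_path, table_name, csv_bytes):
    # Parse the uploaded bytes as CSV and write the rows into SQLite,
    # creating the table (or adding missing columns) as needed.
    db = sqlite_utils.Database(db_path)
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    db[table_name].insert_all(reader, alter=True)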

It needs a bunch more work (and tests) before I’m comfortable telling people to use it, but it does at least exist as a proof of concept for me to iterate on.

datasette-auth-django-cookies

No code for this yet, but I’m beginning to flesh it out as a concept.

I don’t particularly want to implement user registration and authentication and cookies and password hashing. I know how to do it, which means I know it’s not something you should re-roll for every project.

Django has a really well designed, robust authentication system. Can’t I just use that?

Since all of my applications will be running on subdomains of a single domain, my current plan is to have a regular Django application which handles registration and logins. Each subdomain will then run a custom piece of Datasette ASGI middleware which knows how to read and validate the Django authentication cookie.
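
I haven’t written any of this yet, so here’s only a rough sketch of what that cookie check might look like. It assumes the Datasette containers can use Django’s own session machinery—which means sharing the SECRET_KEY and session database—and the settings module name is made up:

import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "authsite.settings")  # hypothetical
import django

django.setup()

from http.cookies import SimpleCookie

from django.contrib.sessions.backends.db import SessionStore


def django_user_id(scope):
    # Return the signed-in Django user's id for this ASGI request, or None.
    headers = dict(scope.get("headers") or [])
    cookies = SimpleCookie()
    cookies.load(headers.get(b"cookie", b"").decode("latin-1"))
    if "sessionid" not in cookies:
        return None
    session = SessionStore(session_key=cookies["sessionid"].value)
    return session.get("_auth_user_id")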

This should give me single sign-on with a single, audited codebase for registration and login with (hopefully) the least amount of work needed to integrate it with Datasette.

Code for this will hopefully follow over the next week.

Niche Museums—now publishing weekly

I hit a milestone with my Niche Museums project: the site now lists details of 100 museums!

For the 100th entry I decided to celebrate with by far the most rewarding (and exclusive) niche museum experience I’ve ever had: Ray Bandar’s Bone Palace.

You should read the entry. The short version is that Ray Bandar collected 7,000 animal skulls over a sixty-year period, and Natalie managed to score us a tour of his incredible basement mere weeks before the collection was donated to the California Academy of Sciences.

The basement full of skulls

Posting one museum a day was taking up more and more of my time, as I had to delve into the depths of my museums-I-have-visited backlog and do increasing amounts of research. Now that I’ve hit 100 I’m going to switch to publishing one a week, which should also help me visit new ones quickly enough to keep the backlog full!

So I only posted four this week:

I also built a simple JavaScript image gallery to better display the 54 photos I published from our trip to Ray Bandar’s basement.

Deploying a data API using GitHub Actions and Cloud Run five days ago

I’m using the combination of GitHub Actions and Google Cloud Run to retrieve data from the U.S. Department of Justice FARA website and deploy it as a queryable API using Datasette.

FARA background

The Foreign Agents Registration Act (FARA) is a law that requires “certain agents of foreign principals who are engaged in political activities or other activities specified under the statute to make periodic public disclosure of their relationship with the foreign principal, as well as activities, receipts and disbursements in support of those activities”.

The law was introduced in 1938 in response to the large number of German propaganda agents that were operating in the U.S. prior to the war.

Basically, if you are in the United States as a lobbyist for a foreign government you need to register under FARA. It was used in 23 criminal cases during World War II, but hasn’t had much use since it was amended in 1966. Although... if you consult the list of recent cases you’ll see some very interesting recent activity involving Russia and Ukraine.

It’s also for spies! Quoting the FARA FAQ:

Finally, 50 U.S.C. § 851, requires registration of persons who have knowledge of or have received instruction or assignment in espionage, counterespionage or sabotage service or tactics of a foreign country or political party.

I imagine most spies operate in violation of this particular law and don’t take steps to register themselves.

It’s all still pretty fascinating though, in part because it gets updated. A lot. Almost every business day in fact.

Tracking FARA history

I know this because seven months ago I set up a scraper for it. Every twelve hours I have code which downloads the four bulk CSVs published by the Justice department and saves them to a git repository. It’s the same trick I’ve been using to track San Francisco’s database of trees and PG&E’s outage map.

I’ve been running the scraper using Circle CI, but this weekend I decided to switch it over to GitHub Actions to get a better idea of how they work.

Deploying it as an API

I also wanted to upgrade my script to deploy a fresh Datasette instance of the data using Google Cloud Run. I wrote a script to do this on a manual basis last year, but I never combined it with the daily scraper. Combining the two means I can offer a Datasette-powered API directly against the latest data.

https://fara.datasettes.com is that API—it now updates twice a day, assuming there are some changes to the underlying data.

Putting it all together

The final GitHub Actions workflow can be seen here. I’m going to present an annotated version below.

on:
  repository_dispatch:
  schedule:
    - cron:  '0 0,12 * * *'

This sets when the workflow should be triggered. I’m running it twice a day—at midnight and noon UTC (the 0,12 cron syntax).

The repository_dispatch key means I can also trigger it manually by running the following curl command—useful for testing:

curl -XPOST https://api.github.com/repos/simonw/fara-history/dispatches \
    -H 'Authorization: token MY_PERSONAL_TOKEN_HERE' \
    -d '{"event_type": "trigger_action"}' \
    -H 'Accept: application/vnd.github.everest-preview+json'

Next comes the job itself, which I called scheduled and set to run on the latest Ubuntu:

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:

Next come the steps. Each step runs in turn, in its own process, but with access to the shared working directory.

- uses: actions/checkout@v2
  name: Check out repo
- name: Set up Python
  uses: actions/setup-python@v1
  with:
    python-version: 3.8

The first two steps check out the fara-history repository and install Python 3.8.

- uses: actions/cache@v1
  name: Configure pip caching
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-

This step should set up a cache so that pip doesn’t have to download fresh dependencies on every run. Unfortunately it doesn’t seem to actually work—it only works for push and pull_request events, but my workflow is triggered by schedule and repository_dispatch. There’s an open issue about this.

- name: Install Python dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt

This step installs my dependencies from requirements.txt.

- name: Fetch, update and commit FARA data
  run: . update_and_commit_all.sh
- name: Build fara.db database
  run: python build_database.py

Now we’re getting to the fun stuff. My update_and_commit_all.sh script downloads the four zip files from the FARA.gov site, unzips them, sorts them, diffs them against the previously stored files and commits the new copy to GitHub if they have changed. See my explanation of csv-diff for more on this—though sadly only one of the files has a reliable row identifier so I can’t generate great commit messages for most of them.

My build_database.py script uses sqlite-utils to convert the CSV files into a SQLite database.

Now that we’ve got a SQLite database, we can deploy it to Google Cloud Run using Datasette.

But should we run a deploy at all? If the database hasn’t changed, there’s no point in deploying it. How can we tell if the database file has changed from the last one that was published?

Datasette has a mechanism for deriving a content hash of a database, part of a performance optimization which is no longer turned on by default and may be removed in the future.

You can generate JSON that includes that hash using the datasette inspect command. The jq tool can then be used to extract just the hash:

$ datasette inspect fara.db | jq '.fara.hash' -r
fbc9cbaca6de1e232fc14494faa06cc8d4cb9f379d0d568e4711e9a218800906

The -r option to jq causes it to return just the raw string, without quote marks.

Datasette’s /-/databases.json introspection URL reveals the hashes of the currently deployed database. Here’s how to pull the currently deployed hash:

$ curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r
a6c0ab26589bde0d225c5a45044e0adbfa3840b95fbb263d01fd8fb0d2460ed5

If those two hashes differ then we should deploy the new database.

GitHub Actions have a slightly bizarre mechanism for defining “output variables” for steps, which can then be used to conditionally run further steps.

Here’s the step that sets those variables, followed by the step that conditionally installs the Google Cloud CLI tools using their official action:

- name: Set variables to decide if we should deploy
  id: decide_variables
  run: |-
    echo "##[set-output name=latest;]$(datasette inspect fara.db | jq '.fara.hash' -r)"
    echo "##[set-output name=deployed;]$(curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r)"
- name: Set up Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  uses: GoogleCloudPlatform/github-actions/setup-gcloud@master
  with:
    version: '275.0.0'
    service_account_email: ${{ secrets.GCP_SA_EMAIL }}
    service_account_key: ${{ secrets.GCP_SA_KEY }}

Having installed the Google Cloud tools, I can deploy my database using Datasette:

- name: Deploy to Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  run: |-
    gcloud components install beta
    gcloud config set run/region us-central1
    gcloud config set project datasette-222320
    datasette publish cloudrun fara.db --service fara-history -m metadata.json

This was by far the hardest part to figure out.

First, I needed to create a Google Cloud service account with an accompanying service key.

I tried and failed to do this using the CLI, so I switched to their web console following these and then these instructions.

Having downloaded the key JSON file, I converted it to base64 and pasted it into a GitHub Actions secret (hidden away in the repository settings area) called GCP_SA_KEY.

cat ~/Downloads/datasette-222320-2ad02afe6d82.json \
    | base64 | pbcopy

The service account needed permissions in order to run a build through Cloud Build and then deploy the result through Cloud Run. I spent a bunch of time trying out different combinations and eventually gave up and gave the account “Editor” permissions across my entire project. This is bad. I am hoping someone can help me understand what the correct narrow set of permissions are, and how to apply them.

It also took me a while to figure out that I needed to run these three commands before I could deploy to my project. The first one installs the Cloud Run tools, the other two set up some required configuration:

gcloud components install beta
gcloud config set run/region us-central1
gcloud config set project datasette-222320

But... having done all of the above, the following command run from an action successfully deploys the site!

datasette publish cloudrun fara.db \
    --service fara-history -m metadata.json

DNS

Google Cloud Run deployments come with extremely ugly default URLs. For this project, that URL is https://fara-history-j7hipcg4aq-uc.a.run.app/.

I wanted something nicer. I own datasettes.com and manage the DNS via Cloudflare, which means I can point subdomains at Cloud Run instances.

This is a two-step process:

  1. I set fara.datasettes.com as a DNS-only (no proxying) CNAME for ghs.googlehosted.com.
  2. In the Google Cloud Console I used Cloud Run -> Manage Custom Domains (a button in the header) -> Add Mapping to specify that fara.datasettes.com should map to my fara-history service (the --service argument from datasette publish earlier).

I had previously verified my domain ownership—I forget quite how I did it. Domains purchased through Google Domains get to skip this step.

Next steps

This was a lot of fiddling around. I’m hoping that by writing this up in detail I’ll be able to get this working much faster next time.

I think this model—GitHub Actions that pull data, build a database and deploy to Cloud Run using datasette publish—is incredibly promising. The end result should be an API that costs cents-to-dollars a month to operate thanks to Cloud Run’s scale-to-zero architecture. And hopefully by publishing this all on GitHub it will be as easy as possible for other people to duplicate it for their own projects.

Weeknotes: Improv at Stanford, planning Datasette Cloud 12 days ago

Last week was the first week of the quarter at Stanford—which is called “shopping week” here because students are expected to try different classes to see which ones they are going to stick with.

I’ve settled on three classes this quarter: Beginning Improvising, Designing Machine Learning and Entrepreneurship: Formation of New Ventures.

Beginning Improvising is the Stanford improv theater course. It’s a big time commitment: three two-hour sessions a week for ten weeks is nearly 60 hours of improv!

It’s already proving to be really interesting though: it turns out the course is a thinly disguised applied psychology course.

Improv is about creating a creative space for other people to shine. The applications to professional teamwork are obvious and fascinating to me. I’ll probably write more about this as the course continues.

Designing Machine Learning is a class at the Stanford d.School taught by Michelle Carney and Emily Callaghan. It focuses on multidisciplinary applications of machine learning, mixing together students from many different disciplines around Stanford.

I took a fast.ai deep learning course last year which gave me a basic understanding of the code side of neural networks, but I’m much more interested in figuring out applications, so this seems like a much more interesting option than a more code-focused course.

The class started out building some initial models using Google’s Teachable Machine tool, which is fascinating. It lets you train transfer learning models for image, audio and pose recognition entirely in your browser—no data is transferred to Google’s servers at all. You can then export those models and use them with a variety of different libraries—I’ve got them to work with both JavaScript and Python already.

I’m taking Entrepreneurship: Formation of New Ventures because of the rave reviews I heard from other JSK fellows who took it last quarter. It’s a classic case-study business school class: each session features a guest speaker who is a successful entrepreneur, and the class discusses their case for the first two thirds of the session while they listen in—then finds out how well the discussion matched what actually happened.

Planning Datasette Cloud

Shopping week kept me pretty busy so I’ve not done much actual development over the past week, but I have started planning out and researching my next major project, which I’m currently calling Datasette Cloud.

Datasette Cloud will be an invite-only hosted SaaS version of Datasette. It’s designed to help get news organizations on board with the software without having to talk them through figuring out their own hosting, so I can help them solve real problems and learn more about how the ecosystem should evolve to support them.

I’d love to be able to run this on serverless hosting platforms like Google Cloud Run or Heroku, but sadly those tools aren’t an option for me due to a key problem: I’m trying to build a stateful service (SQLite databases need to live on a local disk) in 2020.

I posed this challenge on Twitter back in October:

I’ve been exploring my options since then, and I think I’ve settled on a decidedly 2010-era way of doing this: I’m going to run my own instances! So I’ve been exploring hosting Datasette on both AWS Lightsail and Digital Ocean Droplets over the past few months.

My current plan is to have each Datasette Cloud account run as a Datasette instance in its own Docker container, primarily to ensure filesystem isolation: different accounts must not be able to see each other’s database files.

I started another discussion about this on Twitter and had several recommendations for Traefik as a load balancer for assigning hostnames to different Docker containers, which is exactly what I need to do.

So this afternoon I made my way through Digital Ocean’s outstanding tutorial How To Use Traefik as a Reverse Proxy for Docker Containers on Ubuntu 18.04 and I think I’ve convinced myself that this is a smart way forward.

So, mostly a research week but I’ve got a solid plan for my next steps.

This week’s Niche Museums

I also finally got around to implementing a map.

Building a sitemap.xml with a one-off Datasette plugin 19 days ago

One of the fun things about launching a new website is re-learning what it takes to promote a website from scratch on the modern web. I’ve been thoroughly enjoying using Niche Museums as an excuse to explore 2020-era SEO.

I used to use Google Webmaster Tools for this, but apparently that got rebranded as Google Search Console back in May 2015. It’s really useful. It shows which search terms got impressions, which ones got clicks and lets you review which of your pages are indexed and which have returned errors.

Niche Museums has been live since October 24th, but it was a SPA for the first month. I switched it to server-side rendering (with separate pages for each museum) on November 25th. The Google Search Console shows it first appeared in search results on 2nd December.

So far, I’ve had 35 clicks! Not exactly earth-shattering, but every site has to start somewhere.

Screenshot of the Google Search Console.

In a bid to increase the number of indexed pages, I decided to build a sitemap.xml. This probably isn’t necessary—Google advise that you might not need one if your site is “small”, defined as 500 pages or less (Niche Museums lists 88 museums, though it’s still increasing by one every day). It’s nice to be able to view that sitemap and confirm that those pages have all been indexed inside the Search Console though.

Since Niche Museums is entirely powered by a customized Datasette instance, I needed to figure out how best to build that sitemap.

One-off plugins

Datasette’s most powerful customization options are provided by the plugins mechanism. Back in June I ported Datasette to ASGI, and the subsequent release of Datasette 0.29 introduced a new asgi_wrapper plugin hook. This hook makes it possible to intercept requests and implement an entirely custom response—ideal for serving up a /sitemap.xml page.

I considered building and releasing a generic datasette-sitemap plugin that could be used anywhere, but that felt like overkill for this particular problem. Instead, I decided to take advantage of the --plugins-dir= Datasette option to build a one-off custom plugin for the site.

The Datasette instance that runs Niche Museums starts up like this:

$ datasette browse.db about.db \
    --template-dir=templates/ \
    --plugins-dir=plugins/ \
    --static css:static/ \
    -m metadata.json

This serves the two SQLite database files, loads custom templates from the templates/ directory, serves files from the static/ directory at www.niche-museums.com/css/ (such as museums.css) and loads metadata settings from metadata.json. All of these files are on GitHub.

It also tells Datasette to look for any Python files in the plugins/ directory and load those up as plugins.

I currently have four Python files in that directory—you can see them here. The sitemap.xml is implemented using the new sitemap.py plugin file.

Here’s the first part of that file, which wraps the Datasette ASGI app with middleware that checks for the URL /robots.txt or /sitemap.xml and returns custom content for either of them:

from datasette import hookimpl
from datasette.utils.asgi import asgi_send


@hookimpl
def asgi_wrapper(datasette):
    def wrap_with_robots_and_sitemap(app):
        async def robots_and_sitemap(scope, receive, send):
            if scope["path"] == "/robots.txt":
                await asgi_send(
                    send, "Sitemap: https://www.niche-museums.com/sitemap.xml", 200
                )
            elif scope["path"] == "/sitemap.xml":
                await send_sitemap(send, datasette)
            else:
                await app(scope, receive, send)

        return robots_and_sitemap

    return wrap_with_robots_and_sitemap

The boilerplate here is a little convoluted, but this does the job. I’m considering adding alternative plugin hooks for custom pages that could simplify this in the future.

The asgi_wrapper(datasette) plugin function is expected to return a function which will be used to wrap the Datasette ASGI application. In this case that wrapper function is called wrap_with_robots_and_sitemap(app). Here’s the Datasette core code that builds the ASGI app and applies the wrappers:

asgi = AsgiLifespan(
    AsgiTracer(DatasetteRouter(self, routes)), on_startup=setup_db
)
for wrapper in pm.hook.asgi_wrapper(datasette=self):
    asgi = wrapper(asgi)

So this plugin will be executed as:

asgi = wrap_with_robots_and_sitemap(asgi)

The wrap_with_robots_and_sitemap(app) function then returns another, asynchronous function. This function follows the ASGI protocol specification, and has the following signature and body:

async def robots_and_sitemap(scope, receive, send):
    if scope["path"] == "/robots.txt":
        await asgi_send(
            send, "Sitemap: https://www.niche-museums.com/sitemap.xml", 200
        )
    elif scope["path"] == "/sitemap.xml":
        await send_sitemap(send, datasette)
    else:
        await app(scope, receive, send)

If the incoming URL path is /robots.txt, the function directly returns a reference to the sitemap, as seen at www.niche-museums.com/robots.txt.

If the path is /sitemap.xml, it calls the send_sitemap(...) function.

For any other path, it proxies the call to the original ASGI app function that was passed to the wrapper function: await app(scope, receive, send).

The most interesting part of the implementation is that send_sitemap() function. This is the function which constructs the sitemap.xml returned by www.niche-museums.com/sitemap.xml.

Here’s what that function looks like:

async def send_sitemap(send, datasette):
    content = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for db in datasette.databases.values():
        hidden = await db.hidden_table_names()
        tables = await db.table_names()
        for table in tables:
            if table in hidden:
                continue
            for row in await db.execute("select id from [{}]".format(table)):
                content.append(
                    "<url><loc>https://www.niche-museums.com/browse/{}/{}</loc></url>".format(
                        table, row["id"]
                    )
                )
    content.append("</urlset>")
    await asgi_send(send, "\n".join(content), 200, content_type="application/xml")

The key trick here is to use the datasette instance object which was passed to the asgi_wrapper() plugin hook.

The code uses that instance to introspect the attached SQLite databases. It loops through them listing all of their tables, and filtering out any hidden tables (which in this case are tables used by the SQLite FTS indexing mechanism). Then for each of those tables it runs select id from [tablename] and uses the results to build the URLs that are listed in the sitemap.

Finally, the resulting XML is concatenated together and sent back to the client with an application/xml content type.

For the moment, Niche Museums only has one table that needs including in the sitemap—the museums table.

I have a longer-term goal to provide detailed documentation for the datasette object here: since it’s exposed to plugins it’s become part of the API interface for Datasette itself. I want to stabilize this before I release Datasette 1.0.

This week’s new museums

I had a lot of fun writing up the Griffith Observatory: it turns out founding donor Griffith J. Griffith was a truly terrible individual.

sqlite-utils 2.0: real upserts 27 days ago

I just released version 2.0 of my sqlite-utils library/CLI tool to PyPI.

It’s 2.0 because this is my first backwards compatibility breaking version of the library. I was hoping to avoid this for as long as possible: given semantic versioning, my opinion is that the highest quality libraries are the 1.x versions with the highest sub-version. Version 1.253 means a library managed to ship 253 new features without once breaking compatibility!

For sqlite-utils I needed the clean break, because I made a big mistake in my definition of “upsert”.

An upsert is a database operation which says “Update this record with these new values. If the record does not exist, create it.”

This is actually a documented feature of SQLite version 3.24.0 and higher. I’ve seen the term used for other databases such as PostgreSQL as well.
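
Here’s what the real thing looks like, demonstrated through Python’s sqlite3 module (this only works if the underlying SQLite is 3.24.0 or newer):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table dogs (id integer primary key, name text, age integer)")
conn.execute("insert into dogs (id, name, age) values (1, 'Cleo', 4)")
# A true upsert: only the columns mentioned in the update clause are changed.
conn.execute(
    "insert into dogs (id, age) values (1, 5) "
    "on conflict (id) do update set age = excluded.age"
)
print(conn.execute("select id, name, age from dogs").fetchall())
# [(1, 'Cleo', 5)] -- the name column is left alone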

The problem is: when I first implemented the .upsert() method for sqlite-utils I incorrectly assumed that upsert was an alias for INSERT OR REPLACE.

So in sqlite-utils 1.x, the following code:

db["dogs"].upsert({
    "id": 1,
    "name": "Cleo",
    "age": 4
}, pk="id")

Executes the following SQL:

INSERT OR REPLACE INTO dogs
    ("id", "name", "age")
VALUES
    (1, "Cleo", 4);

It turns out this isn’t how upsert is expected to work. This query will entirely replace any existing record. But... an upsert should update the record in place. Crucially, it should leave any columns not referenced in the upsert alone. My implementation overwrote the entire row, deleting any missing values.

So if I run this Python code:

db["dogs"].upsert({
    "id": 1,
    "age": 5
}, pk="id")

The existing record for id=1 should have its age updated, but the name should stay the same.
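
Here’s that sequence as a quick demonstration against an in-memory database, showing the result you should see with the corrected 2.0 semantics:

import sqlite_utils

db = sqlite_utils.Database(memory=True)
db["dogs"].upsert({"id": 1, "name": "Cleo", "age": 4}, pk="id")
db["dogs"].upsert({"id": 1, "age": 5}, pk="id")
print(list(db["dogs"].rows))
# Expected with 2.0: [{'id': 1, 'name': 'Cleo', 'age': 5}]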

I thought hard about whether I could keep the existing, incorrect .upsert() behavior and introduce a different method name or argument that did things right—maintaining backwards compatibility at the price of a less intuitive API—but this felt like too big a discrepancy. And this is what major version number changes are for! You can follow my deliberations about this in this issue.

While this is technically a breaking change, when I look at the code I have already written that uses sqlite-utils it turns out all of my uses of .upsert() are compatible with the new implementation anyway—they all send the full record object, so they will override all of the fields whether they are running against 1.x or 2.x.

If I’m lucky, other users of sqlite-utils will also be unaffected by this behavior change. So hopefully it will be a painless upgrade.

If you do need to keep the old behavior, I’ve made it available as a replace=True argument to .insert() and .insert_all(), like so:

db["dogs"].insert({
    "id": 1,
    "name": "Cleo",
    "age": 4
}, pk="id", replace=True)

I did think about introducing a new piece of terminology for this—.inplace(), for insert-or-replace—but after getting in trouble redefining upsert by mistake I decided inventing new language wasn’t worth the risk.

Using this for Niche Museums

The project that made me realize I needed this upsert change was Niche Museums.

Niche Museums works by building a SQLite database file from a museums.yaml file. The YAML file contains the data for all 81 museums on the site. To add or edit a listing, I edit that file—sometimes in Visual Studio Code on my Mac, sometimes via the GitHub web interface and often by using the Working Copy git client on my iPhone.

Creating the database is done using my yaml-to-sqlite tool:

$ yaml-to-sqlite browse.db museums museums.yaml --pk=id

The build process then runs a few extra scripts to annotate the records with extra data.

annotate_timestamps.py iterates through the full git history and uses it to figure out when each record was first created or last updated.

annotate_nominatim.py uses the OpenStreetMap Nominatim API to reverse geocode the latitude and longitude of each museum, adding extra columns for the city, state, country and all kinds of other interesting geographical details.

I’m not doing much with this OSM data yet—mainly using it to include the OSM country and city name in the full-text search index—but I hope to use it for faceting and other navigation improvements in the future (and maybe location-based breadcrumbs).

Here’s the problem though: every time I deploy the site, I run those scripts. Nominatim asks you not to hit their API more than once a second, so I have a sleep(1) in the code. This means the more museums I add, the longer that build step takes to reverse geocode them all.

It would be great to avoid geocoding museums that have already been geocoded. A properly implemented upsert lets me do exactly that.

I haven’t made these changes yet, but my plan is to update the build script so that instead of starting from a blank database each time, it downloads the previously built version of the database and runs an upsert against it with the data from the YAML. This will overwrite all of the data that I might have updated in that file, but leave the osm_* columns that were created by the annotate_nominatim.py script in place.

That script can then just run against records for which osm_country is null. As such, it should only geocode newly created records.
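
A sketch of what that incremental pass could look like—the latitude, longitude and id column names are assumptions based on the description above, and osm_country is the column the annotation script writes:

import time

import httpx
import sqlite_utils

db = sqlite_utils.Database("browse.db")

for museum in db["museums"].rows_where("osm_country is null"):
    # Only reverse geocode museums that previous builds haven't annotated yet.
    response = httpx.get(
        "https://nominatim.openstreetmap.org/reverse",
        params={
            "format": "json",
            "lat": museum["latitude"],
            "lon": museum["longitude"],
        },
        headers={"User-Agent": "niche-museums-build"},
    )
    address = response.json().get("address", {})
    db["museums"].update(
        museum["id"], {"osm_country": address.get("country")}, alter=True
    )
    time.sleep(1)  # Nominatim asks for no more than one request per second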

Other improvements this week

I added a few new features to Niche Museums this week.

First, I added simple photo galleries to the listings pages, using CSS columns. This means I can add more than one photo to a museum! I’ve done this to a few now, including going back to the listing for the Ilfracombe Museum to add a whole bunch of extra photos.

I added photos and press links to the Atom feed for the site. Since the Atom feed is defined using a custom SQL query and datasette-atom, achieving this involved building a truly diabolical SQL query (making extensive use of the SQLite json1 extension).

I fixed an issue with the search engine where unexpected punctuation could cause an FTS syntax error in SQLite. I’ve extracted that into a fix for a Datasette issue as well.

The site search engine also now attempts to geocode the user’s query and offers a link to the corresponding lat/lon radius search for that point. Try a search for new orleans to see what that looks like—the implementation (entirely client-side, calling the Nominatim API) is here.

Here are the museums I added this week:

Weeknotes: Datasette 0.33 one month ago

I released Datasette 0.33 yesterday. The release represents an accumulation of small changes and features since Datasette 0.32 back in November. Duplicating the release notes:

  • rowid is now included in dropdown menus for filtering tables (#636)
  • Columns are now only suggested for faceting if they have at least one value with more than one record (#638)
  • Queries with no results now display “0 results” (#637)
  • Improved documentation for the --static option (#641)
  • asyncio task information is now included on the /-/threads debug page
  • Bumped Uvicorn dependency to 0.11
  • You can now use --port 0 to listen on an available port
  • New template_debug setting for debugging templates, e.g. https://latest.datasette.io/fixtures/roadsideattractions?context=1 (#654)

The last three items deserve extra explanation.

Port 0

I run a lot of Datasette instances on my laptop—I often have 5 or 6 running representing different projects or experiments.

Eventually this means I start to hit port conflicts. I’ll run datasette mydb.db -p 8005 and get an error because port 8005 is already in use by something else.

I asked on Twitter if there were any downsides to having Datasette automatically pick an unused port... and @davids and @dominicrodger pointed out that port 0 is a Unix convention for “pick a port for me”.

It turned out Datasette already supported this! If you ran datasette --port=0 the underlying Uvicorn server would indeed pick an unused port and start serving on it.

There was just one catch: the URL output to the console would show port 0, so it wasn’t clear which port to actually visit. This was a bug in Uvicorn.

Open source to the rescue... I filed an issue with Uvicorn and then constructed a pull request. Tom Christie merged it five hours later and shipped a release. Datasette 0.33 depends on that updated version of Uvicorn, so --port=0 now works as expected!

template_debug

I’m always looking for new ways to add debugging and development tools to Datasette. I added support for ?_trace=1 to see executed SQL queries back in May for example.

While thinking through a refactoring of Datasette’s template rendering to better support plugins, I realized there was an opportunity for better debugging around templates.

Datasette supports custom templates, which means template authors need a clear way to see what variables are available to their template logic.

I tried adding a ?_context=1 parameter to dump out the template context, and it instantly helped me better understand what was going on.

There was just one problem: what about secret information? By default Datasette doesn’t include any secrets (signing keys, API tokens etc) in its template context—but Datasette plugins are allowed to add things to the context, and I can’t guarantee that a future plugin won’t add something that shouldn’t be exposed by the ?_context=1 debugging parameter.

The solution was to have context debugging only available if the new template_debug configuration setting is turned on:

$ datasette --memory --config template_debug:1

I’ve set up the latest.datasette.io demo to run with this setting, so you can now view the context for any page on that site, like so:

Niche Museums

Natalie and I are in the UK this week. We spent a few days in London and managed to get in a few niche museums, including the first four on this week’s new listings on Niche Museums:

I also made a small fix to the site’s Atom feed: it now respects newlines in the entries, by rendering the description using Markdown.

Here’s the implementation. Since the Atom feed is defined by a SQL query (using datasette-atom) I took advantage of Datasette’s ability to load custom plugins from a per-site plugins/ directory and added a custom SQLite SQL function for rendering markdown, then added that function call to the query that defines the feed.
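
That kind of per-site plugin only takes a few lines. Here’s a minimal sketch using Datasette’s prepare_connection hook—the render_markdown function name is illustrative, not necessarily what the site actually uses:

import markdown
from datasette import hookimpl


@hookimpl
def prepare_connection(conn):
    # Register a render_markdown() SQL function so the Atom feed's SQL query
    # can turn Markdown descriptions into HTML.
    conn.create_function(
        "render_markdown", 1, lambda text: markdown.markdown(text or "")
    )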

Elsewhere

24th January 2020

  • 2020 Web Milestones (via) A lot of stuff is happening in 2020! Mike Sherov rounds it up—highlights include the release of Chromium Edge (Microsoft’s Chrome-powered browser for Windows 7+), Web Components supported in every major browser, Deno 1.x, SameSite Cookies turned on by default (which should dramatically reduce CSRF exposure) and Python 2 and Flash EOLs. #
  • Generated Columns in SQLite (via) SQLite 3.31.0 released today, and generated columns are the single most notable new feature. PostgreSQL 12 added these in October 2019, and MySQL has had them since 5.7 in October 2015. MySQL and SQLite both offer either “stored” or “virtual” generated columns, with virtual columns being calculated at runtime. PostgreSQL currently only supports stored columns. #

18th January 2020

  • Ray Bandar's Bone Palace (via) The hundredth museum I’ve posted to Niche Museums, and this one is my absolute favourite. Ray Bandar spent sixty years collecting over 7,000 animal skulls and putting them on display in his house and basement. Natalie managed to score us a tour a few weeks before the entire collection was donated to the California Academy of Sciences. It was the most amazing room I’ve ever been in. #
  • Code is made of pain, lies, and bad ideas, all so we can pretend that electrified sand will do what we tell it to

    Yoz Grahame #

17th January 2020

  • Your own hosted blog, the easy, free, open way (even if you're not a computer expert) (via) Jeremy Howard and the fast.ai team have released fast_template—a GitHub repository designed to be used as a template to create new repositories with a complete Jekyll blog configured for use with GitHub pages. GitHub’s official documentation recommends you install Ruby on your machine to do this, but Jeremy points out that with the right repository setup you can run a blog entirely by editing files through the GitHub web interface. #

16th January 2020

  • How to do Zero Downtime Deployments of Docker Containers. I’m determined to get reliable zero-downtime deploys working for a new project, because I know from experience that even a few seconds of downtime during a deploy changes the project mentality from “deploy any time you want” to “don’t deploy too often”. I’m using Docker containers behind Traefik, which means new containers should have traffic automatically balanced to them by Traefik based on their labels. After much fiddling around the pattern described by this article worked best for me: it lets me start a new container, then stop the old one and have Traefik’s “retry” mechanism send any requests to the stopped container over to the new one instead. #
  • Demos, Prototypes, and MVPs (via) I really like how Jacob describes the difference between a demo and a prototype: a demo is externally facing and helps explain a concept to a customer; a prototype is internally facing and helps prove that something can be built. #

15th January 2020

  • There is enough wood; Green estimates that it takes about 13 minutes for 20 North American forests to collectively grow enough wood for a 20-story building

    David Roberts #

10th January 2020

  • Serving 100µs reads with 100% availability (via) Fascinating use-case for SQLite from Segment: they needed a massively replicated configuration database across all of their instances that process streaming data. They chose to make the configuration available as a ~50GB SQLite database file mirrored to every instance, meaning lookups against that data could complete in microseconds. Changes to the central MySQL configuration store are pulled every 2-3 seconds, resulting in a trade-off of consistency for availability which fits their use-case just fine. #
  • A visual introduction to machine learning. Beautiful interactive essay explaining how a decision tree machine learning model is constructed, and using that to illustrate the concept of overfitting. This is one of the best explanations of machine learning fundamentals I’ve seen anywhere. #
  • Snowpack (via) Really interesting new twist on build systems for JavaScript. Modern browsers (everything since IE11) support JavaScript modules, but actually working with them is tricky since so much of the JavaScript ecosystem expects you to be using a bundler like Webpack. Snowpack is a tool for converting npm dependencies into JavaScript modules which can then be loaded directly by the browser, taking advantage of HTTP/2 to efficiently load the resulting larger number of files. #
  • Portable Cloud Functions with the Python Functions Framework (via) The new functions-framework library on PyPI lets you run Google Cloud Functions written in Python in other environments—on your local developer machine or bundled in a Docker container for example. I have real trouble trusting serverless platforms that lock you into a single provider (AWS Lambda makes me very uncomfortable) so this is a breath of fresh air. #
  • Async Support - HTTPX (via) HTTPX is the new async-friendly HTTP library for Python spearheaded by Tom Christie. It works in both async and non-async mode with an API very similar to requests. The async support is particularly interesting—it’s a really clean API, and now that Jupyter supports top-level await you can run ’(await httpx.AsyncClient().get(url)).text’ directly in a cell and get back the response. Most excitingly the library lets you pass an ASGI app directly to the client and then perform requests against it—ideal for unit tests. #

8th January 2020

  • I’ve found, in my 20 years of running the site, that whenever you ban an ironic Nazi, suddenly they become actual Nazis

    Richard “Lowtax” Kyanka #

  • A Compiler Writing Journey (via) Warren Toomey has been writing a self-compiling compiler for a subset of C, and extensively documenting every step of the journey here on GitHub. The result is an extremely high quality free textbook on compiler construction. #

7th January 2020

  • Better Python Object Serialization. TIL about functools.singledispatch, a decorator which makes it easy to create Python functions with implementations that vary based on the type of their arguments and which can have additional implementations registered after the fact—great for things like custom JSON serialization. #

3rd January 2020

  • Come version 80, any cookie without a SameSite attribute will be treated as “Lax” by Chrome. This is really important to understand because put simply, it’ll very likely break a bunch of stuff. [...] The fix is easy, all it needs is for everyone responsible for maintaining any system that uses cookies that might be passed from an external origin to understand what’s going on. Can’t be that hard, right? Hello? Oh...

    Troy Hunt #

  • How the Digg team was acquihired. (via) Useful insight into how a talent acquisition can play out from Will Larson, who was an engineering leader at Digg when they negotiated their acquihire exit. #
  • What do you call the parts of a story? Or: why can’t journalists spell “lead”? (via) Carl M. Johnson’s analysis of what journalists call different elements of a story, useful for data modeling a CMS for a news organization. #

2nd January 2020

  • How we use "ship small" to rapidly build new features at GitHub (via) Useful insight into how GitHub develop new features. They make aggressive use of feature flags, shipping a rough skeleton of a new feature to production as early as possible and actively soliciting feedback from other employees as they iterate on the feature. They use static JSON mocks of APIs to unblock their frontend engineers and iterate on the necessary data structures while the real backend is being implemented. #