Simon Willison’s Weblog

Goodbye Zeit Now v1, hello datasette-publish-now—and talking to myself in GitHub issues

8th April 2020

This week I’ve been mostly dealing with the finally announced shutdown of Zeit Now v1. And having long-winded conversations with myself in GitHub issues.

How Zeit Now inspired Datasette

I first started experimenting with Zeit’s serverless Now hosting platform back in October 2017, when I used it to deploy json-head.now.sh—an updated version of an API tool I originally built for Google App Engine in July 2008.

I liked Zeit Now, a lot. Instant, inexpensive deploys of any stateless project that could be defined using a Dockerfile? Just type now to deploy the project in your current directory? Every deployment gets its own permanent URL? Amazing!

There was just one catch: since Now deployments are ephemeral, applications running on them need to be stateless. If you want a database, you need to involve another (potentially costly) service. It’s a limitation shared by other scalable hosting solutions—Heroku, App Engine and so on. How much interesting stuff can you build without a database?

I was musing about this in the shower one day (that old cliche really happened for me) when I had a thought: sure, you can’t write to a database... but if your data is read-only, why not bundle the database alongside the application code as part of the Docker image?

Ever since I helped launch the Datablog at the Guardian back in 2009, I had been interested in finding better ways to publish data journalism datasets than CSV files or Google spreadsheets—so building something that could package and bundle read-only data was of extreme interest to me.

In November 2017 I released the first version of Datasette. The original idea was very much inspired by Zeit Now.

I gave a talk about Datasette at the Zeit Day conference in San Francisco in April 2018. Suffice to say I was a huge fan!

Goodbye, Zeit Now v1

In November 2018, Zeit announced Now v2. And it was... different.

v2 is an entirely different architecture from v1. Where v1 was built on Docker containers, v2 is built on top of serverless functions—AWS Lambda in particular.

I can see why Zeit did this. Lambda functions launch from cold much faster—v1’s Docker infrastructure suffered from painful cold-start times. They are also much cheaper to run—crucial for Zeit given their extremely generous pricing plans.

But it was bad news for my projects. Lambdas are tightly size-constrained, which is tough when you’re bundling potentially large SQLite database files with your deployments.

More importantly, in 2018 Amazon were deliberately excluding the Python sqlite3 standard library module from the Python Lambda environment! I guess they hadn’t considered people who might want to work with read-only database files.

So Datasette on Now v2 just wasn’t going to work. Zeit kept v1 supported for the time being, but the writing was clearly on the wall.

In April 2019 Google announced Cloud Run, a serverless, scale-to-zero hosting environment based around Docker containers. In many ways it’s Google’s version of Zeit Now v1—it has many of the characteristics I loved about v1, albeit with a clunkier developer experience and much more friction in assigning nice URLs to projects. Romain Primet contributed Cloud Run support to Datasette and it has since become my preferred hosting target for my new projects (see Deploying a data API using GitHub Actions and Cloud Run).

Last week, Zeit finally announced the sunset date for v1. From the 1st of May new deployments won’t be allowed, and on the 7th of August they’ll turn off the old v1 infrastructure, deleting all existing Now v1 deployments.

I engaged in an extensive Twitter conversation about this, where I praised Zeit’s handling of the shutdown while bemoaning the loss of the v1 product I had loved so much.

Migrating my projects

My newer projects have been on Cloud Run for quite some time, but I still have a bunch of old projects that I care about and want to keep running past the v1 shutdown.

The first project I ported was latest.datasette.io, a live demo of Datasette which updates with the latest code any time I push to the Datasette master branch on GitHub.

For ops tasks like this I’ve gotten into the habit of meticulously documenting every single step in comments on a GitHub issue. Here’s the issue for porting latest.datasette.io to Cloud Run (and switching from Circle CI to GitHub Actions at the same time).

My next project was global-power-plants-datasette, a small project which takes a database of global power plants published by the World Resources Institute and publishes it using Datasette. It checks for new updates to their repo once a day. I originally built it as a demo for datasette-cluster-map, since it’s fun seeing 33,000 power plants on a single map. Here’s that issue.

Having warmed up with these two, my next target was the most significant: porting my Niche Museums website.

Niche Museums is the most heavily customized Datasette instance I’ve run anywhere—it incorporates custom templates, CSS and plugins.

Here’s the tracking issue for porting it to Cloud Run. I ran into a few hurdles with DNS and TLS certificates, and I had to do some additional work to ensure niche-museums.com redirects to www.niche-museums.com, but it’s now fully migrated.

Hello, Zeit Now v2

Having complained about the lack of that essential sqlite3 module, I figured it would be responsible to double-check that it was still missing.

It was not! Today Now’s Python environment includes sqlite3 after all.
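
Verifying that is as simple as running an import in the target environment, along these lines:

import sqlite3

# If this import succeeds the runtime ships the module; printing the
# bundled SQLite version is a handy bonus check.
print(sqlite3.sqlite_version)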

Datasette’s publish_subcommand() plugin hook lets plugins add new publishing targets to the datasette publish command (I used it to build datasette-publish-fly last month). How hard would it be to build a plugin for Zeit Now v2?
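
The hook hands plugins the Click command group behind datasette publish, and the plugin registers a new subcommand on it. A minimal sketch of the shape (the real plugins also apply Datasette’s shared publish options such as --title and --install via a helper decorator, trimmed here, and the deployment logic is stubbed out):

import click
from datasette import hookimpl


@hookimpl
def publish_subcommand(publish):
    # "publish" is the Click group behind "datasette publish";
    # registering a command on it adds a new publishing target.
    @publish.command()
    @click.argument("files", type=click.Path(exists=True), nargs=-1)
    @click.option("--project", help="Project name to deploy as")
    def mytarget(files, project):
        # Build the deployment artifacts (config files, copies of the
        # database files) and shell out to the provider's CLI here.
        ...

With the plugin installed, datasette publish mytarget my.db would route to that function.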

I fired up a new lengthy talking-to-myself GitHub issue and started prototyping.

Now v2 may not support Docker, but it does support the ASGI Python standard (the asynchronous alternative to WSGI, shepherded by Andrew Godwin).

Zeit are keen proponents of the Jamstack approach, where websites are built using static pre-rendered HTML and JavaScript that calls out to APIs for dynamic data. v2 deployments are expected to consist of static HTML with “serverless functions”—standalone server-side scripts that live in an api/ directory by convention and are compiled into separate lambdas.

Datasette works just fine without JavaScript, which means it needs to handle all of the URL routes for a site. Essentially I need to build a single function that runs the whole of Datasette, then route all incoming traffic to it.

It took me a while to figure it out, but it turns out the Now v2 recipe for that is a now.json file that looks like this:

{
    "version": 2,
    "builds": [
        {
            "src": "index.py",
            "use": "@now/python"
        }
    ],
    "routes": [
        {
            "src": "(.*)",
            "dest": "index.py"
        }
    ]
}

Thanks Aaron Boodman for the tip.

Given the above configuration, Zeit will install any Python dependencies listed in a requirements.txt file, then treat an app variable in the index.py file as an ASGI application it should route all incoming traffic to. Exactly what I need to deploy Datasette!
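
An index.py along those lines can be almost trivially small. A simplified sketch (the database filename here is a hypothetical stand-in; assume it’s deployed alongside the script):

from datasette.app import Datasette

# Now's @now/python builder looks for this module-level "app"
# variable and routes all traffic to it as an ASGI application.
# "my-database.db" is a hypothetical filename deployed with the code.
app = Datasette(["my-database.db"]).app()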

This was everything I needed to build the new plugin. datasette-publish-now is the result.

Here’s the generated source code for a project deployed using the plugin, showing how the underlying ASGI application is configured.

It’s currently an alpha—not every feature is supported (see this milestone) and it relies on a minor deprecated feature (which I’ve implored Zeit to reconsider) but it’s already full-featured enough that I can start using it to upgrade some of my smaller existing Now projects.

The first one I upgraded was one of my favourites: polar-bears.now.sh, which visualizes tracking data from polar bear ear tags (using datasette-cluster-map) that was published by the USGS Alaska Science Center, Polar Bear Research Program.

Here’s the command I used to deploy the site:

$ pip install datasette-publish-now
$ datasette publish now2 polar-bears.db \
    --title "Polar Bear Ear Tags, 2009-2011" \
    --source "USGS Alaska Science Center, Polar Bear Research Program" \
    --source_url "https://alaska.usgs.gov/products/data.php?dataid=130" \
    --install datasette-cluster-map \
    --project=polar-bears

The rest of my projects

I exported a full list of my Now v1 projects from their handy active v1 instances page, scraping it with the following JavaScript, constructed with the help of the instant evaluation console feature in Firefox 75:

console.log(
  JSON.stringify(
    Array.from(
      Array.from(
        document.getElementsByTagName("table")[1].
          getElementsByTagName("tr")
      ).slice(1).map(
        (tr) =>
          Array.from(
            tr.getElementsByTagName("td")
        ).map((td) => td.innerText)
      )
    )
  )
);

Then I loaded them into Datasette for analysis.
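
One way to do that load is with my sqlite-utils Python library; a sketch, with hypothetical column names standing in for whatever cells the dashboard table actually had:

import json
import sqlite_utils

# projects.json holds the array-of-arrays copied from the console.
rows = json.load(open("projects.json"))

db = sqlite_utils.Database("now-projects.db")
db["projects"].insert_all(
    # Hypothetical column names for the scraped <td> cells.
    {"name": row[0], "url": row[1], "state": row[2]}
    for row in rows
)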

After filtering out the datasette-latest-commithash.now.sh projects I had deployed for every push to GitHub, it turns out I have 34 distinct projects running there.

I won’t port all of them, but given datasette-publish-now I should be able to port the ones that I care about without too much trouble.

Debugging Datasette with git bisect run

I fixed two bugs in Datasette this week using git bisect run—a tool I’ve been meaning to figure out for years, which lets you run an automated binary search against a commit log to find the source of a bug.
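
The convention is simple: git bisect run re-executes a script of your choosing at each step, treating exit code 0 as “good” and 1 through 127 (except the special 125) as “bad”. That means the script can be a stub as small as this hypothetical example:

import sys

def bug_is_present():
    # Run whatever reproduces the bug against the currently
    # checked-out revision (hypothetical placeholder).
    return False

# Exit 0 marks this revision "good", non-zero marks it "bad".
sys.exit(1 if bug_is_present() else 0)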

Since I was figuring out a new tool, I fired up another GitHub issue self-conversation: in issue #716 I document my process of both learning to use git bisect run and using it to find a solution to that particular bug.

It worked great, so I used the same trick on issue 689 as well.

Watching git bisect run churn through 32 revisions in a few seconds and pinpoint the exact moment a bug was introduced is pretty delightful:

$ git bisect start master 0.34
Bisecting: 32 revisions left to test after this (roughly 5 steps)
[dc80e779a2e708b2685fc641df99e6aae9ad6f97] Handle scope path if it is a string
$ git bisect run python check_templates_considered.py
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 15 revisions left to test after this (roughly 4 steps)
[7c6a9c35299f251f9abfb03fd8e85143e4361709] Better tests for prepare_connection() plugin hook, refs #678
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 7 revisions left to test after this (roughly 3 steps)
[0091dfe3e5a3db94af8881038d3f1b8312bb857d] More reliable tie-break ordering for facet results
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[ce12244037b60ba0202c814871218c1dab38d729] Release notes for 0.35
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 1 revision left to test after this (roughly 1 step)
[70b915fb4bc214f9d064179f87671f8a378aa127] Datasette.render_template() method, closes #577
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[286ed286b68793532c2a38436a08343b45cfbc91] geojson-to-sqlite
running python check_templates_considered.py
70b915fb4bc214f9d064179f87671f8a378aa127 is the first bad commit
commit 70b915fb4bc214f9d064179f87671f8a378aa127
Author: Simon Willison
Date:   Tue Feb 4 12:26:17 2020 -0800

    Datasette.render_template() method, closes #577

    Pull request #664.

:040000 040000 def9e31252e056845609de36c66d4320dd0c47f8 da19b7f8c26d50a4c05e5a7f05220b968429725c M	datasette
bisect run success

Supporting metadata.yaml

The other Datasette project I completed this week is a relatively small feature with hopefully a big impact: you can now use YAML for Datasette’s metadata configuration as an alternative to JSON.

I’m not crazy about YAML: I still don’t feel like I’ve mastered it, and I’ve been tracking it for 18 years! But it has one big advantage over JSON for configuration files: robust support for multi-line strings.

Datasette’s metadata file can include lengthy SQL statements and strings of HTML, both of which benefit from multi-line strings.
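
As an illustration, a canned query is far more pleasant to express using a YAML block scalar. A hypothetical metadata.yaml fragment:

databases:
  content:
    queries:
      recent_entries: |-
        select id, title, created
        from entries
        order by created desc
        limit 10

The equivalent JSON needs the whole query crammed into a single escaped string.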

I first used YAML for metadata for my Analyzing US Election Russian Facebook Ads project. The metadata file for that demonstrates both embedded HTML and embedded SQL—and an accompanying build_metadata.py script converted it to JSON at build time. I’ve since used the same trick for a number of other projects.
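
The conversion trick itself is tiny. Here is the general shape of such a build_metadata.py (a sketch assuming PyYAML is installed, not the exact script from that repo):

import json
import yaml  # PyYAML

# Read the hand-maintained YAML and emit the metadata.json that
# Datasette, prior to this new feature, actually consumed.
with open("metadata.yaml") as f:
    metadata = yaml.safe_load(f)

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=4)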

The next release of Datasette (hopefully within a week) will ship the new feature, at which point those conversion scripts won’t be necessary.

This should work particularly well with the forthcoming ability for a canned query to write to a database. Getting that wrapped up and shipped will be my focus for the next few days.

This is Goodbye Zeit Now v1, hello datasette-publish-now—and talking to myself in GitHub issues by Simon Willison, posted on 8th April 2020.
