Simon Willison’s Weblog

On docker, weeknotes, logging, github, gis, ...

 

Recent entries

Things I learned about shapefiles building shapefile-to-sqlite four days ago

The latest in my series of x-to-sqlite tools is shapefile-to-sqlite. I learned a whole bunch of things about the ESRI shapefile format while building it.

Governments really love ESRI shapefiles. There is a huge amount of interesting geospatial data made available in the format—4,614 datasets on Data.gov!

shapefile-to-sqlite

shapefile-to-sqlite loads the data from these files into a SQLite database, turning geometry properties into database columns and the geometry itself into a blob of GeoJSON. Let’s try it out on a shapefile containing the boundaries of US national parks.

$ wget http://nrdata.nps.gov/programs/lands/nps_boundary.zip
...
Saving to: ‘nps_boundary.zip’
nps_boundary.zip                           100%[=====================================================================================>]  12.61M   705KB/s    in 22s     
2020-02-18 19:59:22 (597 KB/s) - ‘nps_boundary.zip’ saved [13227561/13227561]

$ unzip nps_boundary.zip 
Archive:  nps_boundary.zip
inflating: temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.xml  
inflating: temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.dbf  
inflating: temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.prj  
inflating: temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.shp  
inflating: temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.shx

$ shapefile-to-sqlite nps.db temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.shp
temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.shp
[####################################]  100%

$ datasette nps.db
Serve! files=('nps.db',) (immutables=()) on port 8001
INFO:     Started server process [33534]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)

I recommend installing the datasette-leaflet-geojson plugin, which will turn any column containing GeoJSON into a Leaflet map.

Screenshot of National Parks in Datasette

If you’ve installed SpatiaLite (installation instructions here) you can use the --spatialite option to instead store the geometry in a SpatiaLite column, unlocking a bewildering array of SQL geometry functions.

$ shapefile-to-sqlite nps.db temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.shp --spatialite --table=nps-spatialite
temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.shp
[##################################--]   94%  00:00:00

I deployed a copy of the resulting database using Cloud Run:

$ datasette publish cloudrun nps.db \
    --service national-parks \
    --title "National Parks" \
    --source_url="https://catalog.data.gov/dataset/national-parks" \
    --source="data.gov" \
    --spatialite \
    --install=datasette-leaflet-geojson \
    --install=datasette-render-binary \
    --extra-options="--config max_returned_rows:5"

I used max_returned_rows:5 there because these geometries are pretty big—without it a page with 100 rows on it can return over 90MB of HTML!

You can browse the GeoJSON version of the table here and the SpatiaLite version here.

The SpatiaLite version defaults to rendering each geometry as an ugly binary blob. You can convert them to GeoJSON for compatibility with datasette-leaflet-geojson using the SpatiaLite AsGeoJSON() function:

select id, UNIT_NAME, AsGeoJSON(geometry)
from [nps-spatialite]

Here’s the result of that query running against the demo.

Understanding shapefiles

The most confusing thing about shapefiles is that they aren’t a single file. A shapefile comes as a minimum of three files: foo.shp containing geometries, foo.shx containing an index into those geometries (really more of an implementation detail) and foo.dbf containing key/value properties for each geometry.

They often come bundled with other files too. foo.prj is a WKT projection for the data for example. Wikipedia lists a whole bunch of other possibilities.

As a result, shapefiles are usually distributed as a zip file. Some shapefile libraries can even read directly from a zip.

The GeoJSON format was designed as a modern alternative to shapefiles, so understanding GeoJSON really helps in understanding shapefiles. In particular the GeoJSON geometry types: Point, LineString, MultiLineString, Polygon and MultiPolygon match how shapefile geometries work.

An important detail in shapefiles is that data in the .shp and .dbf files is matched by array index—so the first geometry can be considered as having ID=0, the second ID=1 and so on.

You can read the properties from the .dbf file using the dbfread Python module like this:

$ ipython
In [1]: import dbfread
In [2]: db = dbfread.DBF("temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.dbf")
In [3]: next(iter(db))
Out[3]: 
OrderedDict([('UNIT_TYPE', 'Park'),
            ('STATE', ''),
            ('REGION', 'NC'),
            ('UNIT_CODE', 'NACC'),
            ('UNIT_NAME', 'West Potomac Park'),
            ('DATE_EDIT', None),
            ('GIS_NOTES', ''),
            ('CREATED_BY', 'Legacy'),
            ('METADATA', ''),
            ('PARKNAME', '')])

Reading shapefiles in Python

I’m a big fan of the Shapely Python library, so I was delighted to see that Sean Gillies, creator of Shapely, also created a library for reading and writing shapefiles: Fiona.

GIS with Python, Shapely, and Fiona by Tom MacWright was particularly useful for figuring this out. I like how he wrote that post in 2012 but added a note in 2017 that it’s still his recommended way of getting started with GIS in Python.
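Fiona’s API is pleasantly simple: a shapefile opens as a collection of GeoJSON-style features, with each geometry already matched up with its key/value properties from the .dbf file. Here’s a minimal sketch, using the Fiona 1.8.x API that was current at the time of writing and the path from the example above:

import fiona

path = "temp/Current_Shapes/Data_Store/06-06-12_Posting/nps_boundary.shp"
with fiona.open(path) as collection:
    print(collection.crs)  # the projection, read from the .prj file
    for feature in collection:
        geometry = feature["geometry"]      # GeoJSON-style geometry dict
        properties = feature["properties"]  # key/value pairs from the .dbf
        print(properties["UNIT_NAME"])
        break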

Projections

The trickiest part of working with any GIS data is always figuring out how to deal with projections.

GeoJSON attempts to standardize on WGS 84, otherwise known as the latitude/longitude model used by GPS. But... shapefiles frequently use something else. The Santa Clara county parks shapefiles for example use EPSG:2227, also known as California zone 3.

(Fun fact: EPSG stands for European Petroleum Survey Group, a now defunct oil industry group that today lives on only as a database of projected coordinate systems.)

I spent quite a while thinking about how to best handle projections. In the end I decided that I’d follow GeoJSON’s lead and attempt to convert everything to WGS 84, but allow users to skip that behaviour using --crs=keep or to specify an alternative projection to convert to with --crs=epsg:2227 or similar.
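That conversion is the kind of thing a projection library handles for you. Here’s a rough sketch of reprojecting a single Shapely geometry from EPSG:2227 to WGS 84 using pyproj—not necessarily how shapefile-to-sqlite does it internally, and the coordinates below are just illustrative:

from pyproj import Transformer
from shapely.geometry import Point
from shapely.ops import transform

# EPSG:2227 (California zone 3, in US survey feet) -> EPSG:4326 (WGS 84)
transformer = Transformer.from_crs("EPSG:2227", "EPSG:4326", always_xy=True)

point = Point(6131200, 1896000)  # an illustrative easting/northing pair
print(transform(transformer.transform, point))  # now longitude/latitude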

SpatiaLite creates its geometry columns with a baked in SRID (a code which usually maps to the EPSG identifier). You can see which SRID was used for a specific geometry using the srid() function:

select srid(geometry) from [nps-spatialite] limit 1

SpatiaLite can also convert to another projection using the Transform() function:

select ':' || AsGeoJSON(Transform(geometry, 2227)) from [nps-spatialite] limit 1

(I’m using ':' || AsGeoJSON(...) here to disable the datasette-leaflet-geojson plugin, since it can’t correctly render data that has been transformed to a non-WGS-84 projection.)

Pulling it all together

I now have two tools for importing geospatial data into SQLite (or SpatiaLite) databases: shapefile-to-sqlite and geojson-to-sqlite.

I’m excited about Datasette’s potential as a tool for GIS. I started exploring this back in 2017 when I used it to build a location to timezone API—but adding easy shapefile imports to the toolchain should unlock all kinds of interesting new geospatial projects.

How to cheat at unit tests with pytest and Black 12 days ago

I’ve been making a lot of progress on Datasette Cloud this week. As an application that provides private hosted Datasette instances (initially targeted at data journalists and newsrooms) the majority of the code I’ve written deals with permissions: allowing people to form teams, invite team members, promote and demote team administrators and suchlike.

The one thing I’ve learned about permissions code over the years is that it absolutely warrants comprehensive unit tests. This is not code that can afford to have dumb bugs, or regressions caused by future development!

I’ve become a big proponent of pytest over the past two years, but this is the first Django project that I’ve built using pytest from day one as opposed to relying on the Django test runner. It’s been a great opportunity to try out pytest-django, and I’m really impressed with it. It maintains my favourite things about Django’s test framework—smart usage of database transactions to reset the database and a handy test client object for sending fake HTTP requests—and adds all of that pytest magic that I’ve grown to love.

It also means I get to use my favourite trick for productively writing unit tests: the combination of pytest and Black, the “uncompromising Python code formatter”.

Cheating at unit tests

In pure test-driven development you write the tests first, and don’t start on the implementation until you’ve watched them fail.

Most of the time I find that this is a net loss on productivity. I tend to prototype my way to solutions, so I often find myself with rough running code before I’ve developed enough of a concrete implementation plan to be able to write the tests.

So… I cheat. Once I’m happy with the implementation I write the tests to match it. Then once I have the tests in place and I know what needs to change I can switch to using changes to the tests to drive the implementation.

In particular, I like using a rough initial implementation to help generate the tests in the first place.

Here’s how I do that with pytest. I’ll write a test that looks something like this:

def test_some_api(client):
    response = client.get("/some/api/")
    assert False == response.json()

Note that I’m using the pytest-django client fixture here, which magically passes a fully configured Django test client object to my test function.

I run this test, and it fails:

pytest -k test_some_api

(pytest -k blah runs just tests that contain blah in their name)

Now… I run the test again, but with the --pdb option to cause pytest to drop me into a debugger at the failure point:

$ pytest -k test_some_api --pdb
=== test session starts ===
platform darwin -- Python 3.7.5, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
django: settings: config.test_settings (from ini)
...
client = <django.test.client.Client object at 0x10cfdb510>

    def test_some_api(client):
        response = client.get("/some/api/")
>       assert False == response.json()
E       assert False == {'this': ['is', 'an', 'example', 'api'], 'that_outputs': 'JSON'}
core/test_docs.py:27: AssertionError
>> entering PDB >>
>> PDB post_mortem (IO-capturing turned off) >>
> core/test_docs.py(27)test_some_api()
-> assert False == response.json()
(Pdb) response.json()
{'this': ['is', 'an', 'example', 'api'], 'that_outputs': 'JSON'}
(Pdb) 

Running response.json() in the debugger dumps out the actual value to the console.

Then I copy that output—in this case {'this': ['is', 'an', 'example', 'api'], 'that_outputs': 'JSON'}—and paste it into the test:

def test_some_api(client):
    response = client.get("/some/api/")
    assert {'this': ['is', 'an', 'example', 'api'], 'that_outputs': 'JSON'} == response.json()

Finally, I run black . in my project root to reformat the test:

def test_some_api(client):
    response = client.get("/some/api/")
    assert {
        "this": ["is", "an", "example", "api"],
        "that_outputs": "JSON",
    } == response.json()

This last step means that no matter how giant and ugly the test comparison has become I’ll always get a neatly formatted test out of it.

I always eyeball the generated test to make sure that it’s what I would have written by hand if I wasn’t so lazy—then I commit it along with the implementation and move on to the next task.

I’ve used this technique to write many of the tests in both Datasette and sqlite-utils, and those are by far the best tested pieces of software I’ve ever released.

I started doing this around two years ago, and I’ve held off writing about it until I was confident I understood the downsides. I haven’t found any yet: I end up with a robust, comprehensive test suite and it takes me less than half the time to write the tests than if I’d been hand-crafting all of those comparisons from scratch.

Also this week

Working on Datasette Cloud has required a few minor releases to some of my open source projects.

Unrelated to Datasette Cloud, I also shipped twitter-to-sqlite 0.16 with a new command for importing your Twitter friends (previously it only had a command for importing your followers).

In bad personal motivation news… I missed my weekly update to Niche Museums and lost my streak!

Weeknotes: Shaving yaks for Datasette Cloud 18 days ago

I’ve been shaving a lot of yaks, but I’m finally ready for other people to start kicking the tires on the MVP of Datasette Cloud.

I’ve started by inviting a small group of people (from my fellowship program) in to start trying out this new hosted, team-oriented version of Datasette.

Getting to this point has been a classic example of the last 10% of the project taking 90% of the time.

Here’s just one example. I need my users to be able to upload CSV files directly into Datasette, rather than making them rely on my growing collection of command-line tools for data ingestion.

So I’ve been trying to knock the very-alpha version of my new datasette-upload-csvs plugin into good enough shape to be usable for this initial round of testing.

But... that plugin needs to render templates. And Datasette plugins didn’t have a pleasant way of rendering templates (if you discount horrifying stack-inspection hacks). I’ve had an issue open about this since September. I finally closed it today, and shipped Datasette 0.35 to celebrate.

I want users to only be able to access the Datasette instances for teams that they belong to. Since authentication is handled by datasette-auth-existing-cookies I needed it to grow some concept of permissions. I ended up shipping versions 0.3, 0.4, 0.4.1, 0.5 and 0.5.1 just in the past two days, but it finally does what I need it to do. Another thoroughly shaved yak.

I ran into an awkward ASGI scope issue, which I ended up figuring out using a new datasette-debug-asgi plugin based on my older asgi-scope project.

I’ve been doing a lot of tinkering with my Docker/Traefik environment too. I can now launch new containers from Python code triggered by a Django Admin action, which is pretty fun.

My other fun project from this week was geojson-to-sqlite, a CLI tool for converting GeoJSON files into a SQLite (or optionally a SpatiaLite) database. Combined with datasette-leaflet-geojson this allows for some really fun geospatial nerdery. I’m looking forward to diving deeper into this set of Datasette use-cases in the near future.

Scrappy weeknotes this week, but I’ve decided that it’s better to keep to the habit and post something untidy than to delay posting and break my streak.

Weeknotes: datasette-auth-existing-cookies and datasette-sentry 25 days ago

Work on Datasette Cloud continues—I’m tantalizingly close to having a MVP I can start to invite people to try out.

I’m trying to get as much work as possible done for it using Datasette Plugins. This week I’ve released two new plugins to assist in the effort.

datasette-auth-existing-cookies

My first attempt at adding authentication to Datasette was datasette-auth-github, which takes advantage of GitHub’s best-in-class OAuth flow to implement sign-in with GitHub and builds a relatively powerful permission system on top of GitHub users, organizations and teams.

For Datasette Cloud I need to go a step further: I’m definitely going to have regular username/password accounts, and I’ll probably implement sign-in-with-Google as well.

I don’t particularly want to implement username/password accounts from scratch. Django (and django-registration) provide a robust and very well tested solution for this. How about I use that?

Datasette Cloud teams will each get their own Datasette instance running on a subdomain. If I implement authentication as a Django app running on example.com I can set that as the cookie domain—then Datasette instances running on teamname.example.com will be able to see the resulting authentication cookie.
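In Django terms that’s a one-line setting—a minimal sketch, with an illustrative domain name:

# settings.py: share the session cookie with every subdomain of example.com
SESSION_COOKIE_DOMAIN = ".example.com"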

Given a Django authentication cookie (which may just be a sessionid) how can I tell if it corresponds to a logged in user? That’s where my new datasette-auth-existing-cookies plugin comes in.

The plugin lets you configure Datasette to read in a specified list of cookies and then forward them on as part of an API request to an underlying application. That application then returns JSON showing if the user is signed in or not. The plugin then sets a short-lived signed cookie that persists that information.

Here’s what the configuration looks like:

{
    "plugins": {
        "datasette-auth-existing-cookies": {
            "api_url": "https://www.example.com/user-from-cookies",
            "auth_redirect_url": "https://www.example.com/login",
            "original_cookies": ["sessionid"]
        }
    }
}

Any hits to teamname.example.com will be checked for a sessionid cookie. That cookie is forwarded on to https://www.example.com/user-from-cookies to see if it’s valid.
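Here’s a rough sketch of what that endpoint could look like as a Django view. The exact JSON shape is up to the application, so treat the payload below as illustrative rather than a documented contract:

from django.http import JsonResponse


def user_from_cookies(request):
    # The plugin forwards the original sessionid cookie with its request,
    # so Django's session middleware resolves request.user as usual.
    if request.user.is_authenticated:
        return JsonResponse({"username": request.user.username})
    return JsonResponse({"error": "not logged in"}, status=403)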

If the cookie is missing or invalid, the user will be redirected to the following URL:

https://www.example.com/login?next=https://teamname.example.com/

The plugin has a few other options: you can request that the ?next= parameter is itself signed to help protect against unvalidated redirects for example. But it’s a pretty simple piece of code that hopefully means I won’t have to spend much more time thinking about login and registration.

httpx for testing ASGI apps

All of my Datasette plugins ship with unit tests—mainly so that I can implement continuous deployment from them, where new tagged releases are automatically shipped to PyPI provided the tests pass.

For ASGI plugins, this means writing unit tests against the ASGI spec. I’ve mainly been doing this using the ApplicationCommunicator class from asgiref, which provides powerful low-level hooks for interacting with an ASGI application. The tests end up being pretty verbose though!

Here’s the ApplicationCommunicator test I first wrote for datasette-auth-existing-cookies.

I’ve been exploring Tom Christie’s httpx library for asynchronous HTTP calls in Python recently, and I spotted an interesting capability buried deep in the documentation: you can pass it an ASGI app and make requests directly against the app, without round-tripping through HTTP!

This looked ideal for unit testing, so I had a go at rewriting my tests using it. The result was delightful:

auth_app = ExistingCookiesAuthTest(
    hello_world_app,
    ...
)
async with httpx.AsyncClient(app=auth_app) as client:
    response = await client.get(
        "https://demo.example.com/", allow_redirects=False
    )
    assert 302 == response.status_code
    location = response.headers["location"]
    assert "https://www.example.com/login" == location

This is a much nicer way of writing tests for ASGI applications and middleware. I’m going to be using this for all of my projects going forward.

datasette-sentry

In starting to deploy Datasette Cloud I quickly ran into the need to start collecting and analyzing errors thrown in production.

I’ve been enjoying using Sentry for this for several years now, and I was pleased to see that the official Sentry SDK grew support for ASGI last July.

Wrapping it up as a Datasette plugin took less than half an hour: datasette-sentry. It’s configured like this:

{
    "plugins": {
        "datasette-sentry": {
            "dsn": {
                "$env": "SENTRY_DSN"
            }
        }
    }
}

The DSN used to configure Sentry will then be read from the SENTRY_DSN environment variable.

Weeknotes: Datasette Cloud and zero downtime deployments one month ago

Yesterday’s piece on Deploying a data API using GitHub Actions and Cloud Run was originally intended to be my weeknotes, but ended up getting a bit too involved.

Aside from playing with GitHub Actions and Cloud Run, my focus over the past week has been working on Datasette Cloud. Datasette Cloud is the current name I’m using for my hosted Datasette product—the idea being that I’ll find it a lot easier to get feedback on Datasette from journalists if they can use it without having to install anything!

My MVP for Datasette Cloud is that I can use it to instantly provision a new, private Datasette instance for a journalist (or team of journalists) that they can then sign into, start playing with and start uploading their data to (initially as CSV files).

I have to solve quite a few problems to get there:

  • Secure, isolated instances of Datasette. A team or user should only be able to see their own files. I plan to solve this using Docker containers that are mounted such that they can only see their own dedicated volumes.
  • The ability to provision new instances as easily as possible—and give each one its own HTTPS subdomain.
  • Authentication: users need to be able to register and sign in to accounts. I could use datasette-auth-github for this but I’d like to be able to support regular email/password accounts too.
  • Users need to be able to upload CSV files and have them converted into a SQLite database compatible with Datasette.

Zero downtime deployments

I have a stretch goal which I’m taking pretty seriously: I want to have a mechanism in place for zero-downtime deployments of new versions of the software.

Arguably this is an unnecessary complication for an MVP. I may not fully implement it, but I do want to at least know that the path I’ve taken is compatible with zero downtime deployments.

Why do zero downtime deployments matter so much to me? Because they are desirable for rapid iteration, and crucial for setting up continuous deployment. Even a couple of seconds of downtime during a deployment creates a psychological incentive not to deploy too often. I’ve seen the productivity boost that deploying fearlessly multiple times a day brings, and I want it.

So I’ve been doing a bunch of research into zero downtime deployment options (thanks to some great help on Twitter) and I think I have something that’s going to work for me.

The first ingredient is Traefik—a new-to-me edge router (similar to nginx) which has a delightful focus on runtime configuration based on automatic discovery.

It works with a bunch of different technology stacks, but I’m going to be using it with regular Docker. Traefik watches for new Docker containers, reads their labels and uses that to reroute traffic to them.

So I can launch a new Docker container, apply the Docker label "traefik.frontend.rule": "Host:subdomain.mydomain.com" and Traefik will start proxying traffic to that subdomain directly to that container.

Traefik also has extremely robust built-in support for Let’s Encrypt to issue certificates. I managed to issue a wildcard TLS certificate for my entire domain, so new subdomains are encrypted straight away. This did require me to give Traefik API access to modify DNS entries—I’m running DNS for this project on Digital Ocean and thankfully Traefik knows how to do this by talking to their API.

That solves provisioning: when I create a new account I can call the Docker API (from Python) to start up a new, labelled container on a subdomain protected by a TLS certificate.
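A rough sketch of that provisioning call using the docker Python SDK—the image, network and domain names here are all placeholders, not the real Datasette Cloud configuration:

import docker

client = docker.from_env()


def provision_instance(subdomain):
    # Start a new Datasette container and label it so that Traefik
    # routes traffic for subdomain.mydomain.com to it.
    return client.containers.run(
        "my-datasette-image:latest",  # placeholder image name
        detach=True,
        network="traefik-network",    # the network Traefik is watching
        labels={
            "traefik.enable": "true",
            "traefik.frontend.rule": "Host:%s.mydomain.com" % subdomain,
        },
    )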

I still needed a way to run a zero-downtime deployment of a new container (for example when I release a new version of Datasette and want to upgrade everyone). After quite a bit of research (during which I discovered you can’t modify the labels on a Docker container without restarting it) I settled on the approach described in this article.

Essentially you configure Traefik to retry failed requests, start a new, updated container with the same routing information as the existing one (causing Traefik to load balance HTTP requests across both), then shut down the old container and trust Traefik to retry in-flight requests against the one that’s still running.
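Sketched in Python against the docker SDK, that recipe looks roughly like this—container names and labels are placeholders, and a real version would health-check the new container before stopping the old one:

import docker

client = docker.from_env()


def rolling_upgrade(old_container_name, new_image, labels):
    # Start the replacement with the same Traefik routing labels, so
    # Traefik briefly load balances across both containers...
    new = client.containers.run(
        new_image, detach=True, network="traefik-network", labels=labels
    )
    # ...then stop the old one and let Traefik retry any in-flight
    # requests against the container that is still running.
    client.containers.get(old_container_name).stop()
    return new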

Rudimentary testing with ab suggested that this is working as desired.

One remaining problem: if Traefik is running in a Docker container and proxying all of my traffic, how can I upgrade Traefik itself without any downtime?

Consensus on Twitter seems to be that Docker on its own doesn’t have a great mechanism for this (I was hoping I could re-route port 80 traffic to the host to a different container in an atomic way). But... iptables has mechanisms that can re-route traffic from one port to another—so I should be able to run a new Traefik container on a different port and re-route to it at the operating system level.

That’s quite enough yak shaving around zero downtime deployments for now!

datasette-upload-csvs

A big problem I’m seeing with the current Datasette ecosystem is that while Datasette offers a web-based user interface for querying and accessing data, the tools I’ve written for actually creating those databases are decidedly command-line only.

Telling journalists they have to learn to install and run software on the command-line is way too high a barrier to entry.

I’ve always intended to have Datasette plugins that can handle uploading and converting data. It’s time to actually build one!

datasette-upload-csvs is what I’ve got so far. Its README carries a big warning not to use it—it’s very alpha software at the moment—but it does prove that the concept can work.

It uses the asgi_wrapper plugin hook to intercept requests to the path /-/upload-csv and forward them on to another ASGI app, written using Starlette, which provides a basic upload form and then handles the upload.
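For anyone curious how that interception works, here’s a rough sketch of the asgi_wrapper pattern. The upload_app function below is just a placeholder standing in for the real Starlette application:

from datasette import hookimpl


async def upload_app(scope, receive, send):
    # Placeholder for the Starlette app that renders the upload form
    # and processes the POSTed CSV file.
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [[b"content-type", b"text/plain"]],
    })
    await send({"type": "http.response.body", "body": b"Upload form goes here"})


@hookimpl
def asgi_wrapper(datasette):
    def wrap(app):
        async def wrapped(scope, receive, send):
            if scope["type"] == "http" and scope.get("path") == "/-/upload-csv":
                await upload_app(scope, receive, send)
            else:
                await app(scope, receive, send)
        return wrapped
    return wrap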

Uploaded CSVs are converted to SQLite using sqlite-utils and written to the first mutable database attached to Datasette.
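The conversion step itself is the kind of thing sqlite-utils makes almost trivial. A minimal sketch using its Python API—the file and table names here are illustrative, not what the plugin actually uses:

import csv
import sqlite_utils

db = sqlite_utils.Database("uploads.db")
with open("upload.csv", newline="") as fp:
    # Insert every row of the CSV into an "upload" table, creating the
    # table (and any missing columns) as needed.
    db["upload"].insert_all(csv.DictReader(fp), alter=True)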

It needs a bunch more work (and tests) before I’m comfortable telling people to use it, but it does at least exist as a proof of concept for me to iterate on.

datasette-auth-django-cookies

No code for this yet, but I’m beginning to flesh it out as a concept.

I don’t particularly want to implement user registration and authentication and cookies and password hashing. I know how to do it, which means I know it’s not something you should roll yourself for every project.

Django has a really well designed, robust authentication system. Can’t I just use that?

Since all of my applications will be running on subdomains of a single domain, my current plan is to have a regular Django application which handles registration and logins. Each subdomain will then run a custom piece of Datasette ASGI middleware which knows how to read and validate the Django authentication cookie.

This should give me single sign-on with a single, audited codebase for registration and login with (hopefully) the least amount of work needed to integrate it with Datasette.

Code for this will hopefully follow over the next week.

Niche Museums—now publishing weekly

I hit a milestone with my Niche Museums project: the site now lists details of 100 museums!

For the 100th entry I decided to celebrate with by far the most rewarding (and exclusive) niche museum experience I’ve ever had: Ray Bandar’s Bone Palace.

You should read the entry. The short version is that Ray Bandar collected 7,000 animal skulls over a sixty year period, and Natalie managed to score us a tour of his incredible basement mere weeks before the collection was donated to the California Academy of Sciences.

The basement full of skulls

Posting one museum a day was taking increasingly more of my time, as I had to delve into the depths of my museums-I-have-visited backlog and do increasing amounts of research. Now that I’ve hit 100 I’m going to switch to publishing one a week, which should also help me visit new ones quickly enough to keep the backlog full!

So I only posted four this week.

I also built a simple JavaScript image gallery to better display the 54 photos I published from our trip to Ray Bandar’s basement.

Deploying a data API using GitHub Actions and Cloud Run one month ago

I’m using the combination of GitHub Actions and Google Cloud Run to retrieve data from the U.S. Department of Justice FARA website and deploy it as a queryable API using Datasette.

FARA background

The Foreign Agents Registration Act (FARA) is a law that requires “certain agents of foreign principals who are engaged in political activities or other activities specified under the statute to make periodic public disclosure of their relationship with the foreign principal, as well as activities, receipts and disbursements in support of those activities”.

The law was introduced in 1938 in response to the large number of German propaganda agents that were operating in the U.S. prior to the war.

Basically, if you are in the United States as a lobbyist for a foreign government you need to register under FARA. It was used in 23 criminal cases during World War II, but hasn’t had much use since it was amended in 1966. Although... if you consult the list of recent cases you’ll see some very interesting recent activity involving Russia and Ukraine.

It’s also for spies! Quoting the FARA FAQ:

Finally, 50 U.S.C. § 851, requires registration of persons who have knowledge of or have received instruction or assignment in espionage, counterespionage or sabotage service or tactics of a foreign country or political party.

I imagine most spies operate in violation of this particular law and don’t take steps to register themselves.

It’s all still pretty fascinating though, in part because it gets updated. A lot. Almost every business day in fact.

Tracking FARA history

I know this because seven months ago I set up a scraper for it. Every twelve hours I have code which downloads the four bulk CSVs published by the Justice department and saves them to a git repository. It’s the same trick I’ve been using to track San Francisco’s database of trees and PG&E’s outage map.

I’ve been running the scraper using Circle CI, but this weekend I decided to switch it over to GitHub Actions to get a better idea of how they work.

Deploying it as an API

I also wanted to upgrade my script to deploy a fresh Datasette instance of the data using Google Cloud Run. I wrote a script to do this on a manual basis last year, but I never combined it with the daily scraper. Combining the two means I can offer a Datasette-powered API directly against the latest data.

https://fara.datasettes.com is that API—it now updates twice a day, assuming there are some changes to the underlying data.

Putting it all together

The final GitHub action workflow can be seen here. I’m going to present an annotated version here.

on:
  repository_dispatch:
  schedule:
    - cron:  '0 0,12 * * *'

This sets when the workflow should be triggered. I’m running it twice a day—at midnight and noon UTC (the 0,12 cron syntax).

The repository_dispatch key means I can also trigger it manually by running the following curl command—useful for testing:

curl -XPOST https://api.github.com/repos/simonw/fara-history/dispatches \
    -H 'Authorization: token MY_PERSONAL_TOKEN_HERE' \
    -d '{"event_type": "trigger_action"}' \
    -H 'Accept: application/vnd.github.everest-preview+json'

Next comes the job itself, which I called scheduled and set to run on the latest Ubuntu:

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:

Next come the steps. Each step runs in turn as a separate process on the same runner, sharing access to the working directory.

- uses: actions/checkout@v2
  name: Check out repo
- name: Set up Python
  uses: actions/setup-python@v1
  with:
    python-version: 3.8

The first two steps check out the fara-history repository and install Python 3.8.

- uses: actions/cache@v1
  name: Configure pip caching
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-

This step should set up a cache so that pip doesn’t have to download fresh dependencies on every run. Unfortunately it doesn’t seem to actually work—it only works for push and pull_request events, but my workflow is triggered by schedule and repository_dispatch. There’s an open issue about this.

- name: Install Python dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt

This step installs my dependencies from requirements.txt.

- name: Fetch, update and commit FARA data
  run: . update_and_commit_all.sh
- name: Build fara.db database
  run: python build_database.py

Now we’re getting to the fun stuff. My update_and_commit_all.sh script downloads the four zip files from the FARA.gov site, unzips them, sorts them, diffs them against the previously stored files and commits the new copy to GitHub if they have changed. See my explanation of csv-diff for more on this—though sadly only one of the files has a reliable row identifier so I can’t generate great commit messages for most of them.

My build_database.py script uses sqlite-utils to convert the CSV files into a SQLite database.

Now that we’ve got a SQLite database, we can deploy it to Google Cloud Run using Datasette.

But should we run a deploy at all? If the database hasn’t changed, there’s no point in deploying it. How can we tell if the database file has changed from the last one that was published?

Datasette has a mechanism for deriving a content hash of a database, part of a performance optimization which is no longer turned on by default and may be removed in the future.

You can generate JSON that includes the hash using the datasette inspect command. The jq tool can then be used to extract out just the hash:

$ datasette inspect fara.db | jq '.fara.hash' -r
fbc9cbaca6de1e232fc14494faa06cc8d4cb9f379d0d568e4711e9a218800906

The -r option to jq causes it to return just the raw string, without quote marks.

Datasette’s /-/databases.json introspection URL reveals the hashes of the currently deployed database. Here’s how to pull the currently deployed hash:

$ curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r
a6c0ab26589bde0d225c5a45044e0adbfa3840b95fbb263d01fd8fb0d2460ed5

If those two hashes differ then we should deploy the new database.

GitHub Actions have a slightly bizarre mechanism for defining “output variables” for steps, which can then be used to conditionally run further steps.

Here’s the step that sets those variables, followed by the step that conditionally installs the Google Cloud CLI tools using their official action:

- name: Set variables to decide if we should deploy
  id: decide_variables
  run: |-
    echo "##[set-output name=latest;]$(datasette inspect fara.db | jq '.fara.hash' -r)"
    echo "##[set-output name=deployed;]$(curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r)"
- name: Set up Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  uses: GoogleCloudPlatform/github-actions/setup-gcloud@master
  with:
    version: '275.0.0'
    service_account_email: ${{ secrets.GCP_SA_EMAIL }}
    service_account_key: ${{ secrets.GCP_SA_KEY }}

Having installed the Google Cloud tools, I can deploy my database using Datasette:

- name: Deploy to Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  run: |-
    gcloud components install beta
    gcloud config set run/region us-central1
    gcloud config set project datasette-222320
    datasette publish cloudrun fara.db --service fara-history -m metadata.json

This was by far the hardest part to figure out.

First, I needed to create a Google Cloud service account with an accompanying service key.

I tried and failed to do this using the CLI, so I switched to their web console following these and then these instructions.

Having downloaded the key JSON file, I converted it to base64 and pasted it into a GitHub Actions secret (hidden away in the repository settings area) called GCP_SA_KEY.

cat ~/Downloads/datasette-222320-2ad02afe6d82.json \
    | base64 | pbcopy

The service account needed permissions in order to run a build through Cloud Build and then deploy the result through Cloud Run. I spent a bunch of time trying out different combinations and eventually gave up and gave the account “Editor” permissions across my entire project. This is bad. I am hoping someone can help me understand what the correct narrow set of permissions are, and how to apply them.

It also took me a while to figure out that I needed to run these three commands before I could deploy to my project. The first installs the beta components needed for Cloud Run; the other two set some required configuration:

gcloud components install beta
gcloud config set run/region us-central1
gcloud config set project datasette-222320

But... having done all of the above, the following command run from an action successfully deploys the site!

datasette publish cloudrun fara.db \
    --service fara-history -m metadata.json

DNS

Google Cloud Run deployments come with extremely ugly default URLs. For this project, that URL is https://fara-history-j7hipcg4aq-uc.a.run.app/.

I wanted something nicer. I own datasettes.com and manage the DNS via Cloudflare, which means I can point subdomains at Cloud Run instances.

This is a two-step process:

  1. I set fara.datasettes.com as a DNS-only (no proxying) CNAME for ghs.googlehosted.com.
  2. In the Google Cloud Console I used Cloud Run -> Manage Custom Domains (a button in the header) -> Add Mapping to specify that fara.datasettes.com should map to my fara-history service (the --service argument from datasette publish earlier).

I had previously verified my domain ownership—I forget quite how I did it. Domains purchased through Google Domains get to skip this step.

Next steps

This was a lot of fiddling around. I’m hoping that by writing this up in detail I’ll be able to get this working much faster next time.

I think this model—GitHub Actions that pull data, build a database and deploy to Cloud Run using datasette publish—is incredibly promising. The end result should be an API that costs cents-to-dollars a month to operate thanks to Cloud Run’s scale-to-zero architecture. And hopefully by publishing this all on GitHub it will be as easy as possible for other people to duplicate it for their own projects.

Elsewhere

21st February 2020

  • So next time someone is giving you feedback about something you made, think to yourself that to win means getting two or three insights, ideas, or suggestions that you are excited about, and that you couldn’t think up on your own.

    Juliette Cezzar #

14th February 2020

  • pup. This is a great idea: a command-line tool for parsing HTML on stdin using CSS selectors. It’s like jq but for HTML. Supports a sensible collection of selectors and has a number of output options for the selected nodes, including plain text and JSON. It also works as a simple pretty-printer for HTML. #

13th February 2020

  • A group of software engineers gathered around a whiteboard are a joint cognitive system. The scrawls on the board are spatial cues for building a shared model of a complex system.

    Eric Dobbs #

8th February 2020

  • We write a lot of JavaScript at Basecamp, but we don’t use it to create “JavaScript applications” in the contemporary sense. All our applications have server-side rendered HTML at their core, then add sprinkles of JavaScript to make them sparkle. [...] It allows us to party with productivity like days of yore. A throwback to when a single programmer could make rapacious progress without getting stuck in layers of indirection or distributed systems. A time before everyone thought the holy grail was to confine their server-side application to producing JSON for a JavaScript-based client application.

    David Heinemeier Hansson #

7th February 2020

  • Deep learning isn’t hard anymore. This article does a great job of explaining how transfer learning is unlocking a new wave of innovation around deep learning. Previously if you wanted to train a model you needed vast amounts of data and thousands of dollars of compute time. Thanks to transfer learning you can now take an existing model (such as GPT2) and train something useful on top of it that’s specific to a new domain in just minutes or hours, with only a few hundred or a few thousand new labeled samples. #

31st January 2020

  • geojson-to-sqlite (via) I just put out the first release of geojson-to-sqlite—a CLI tool that can convert GeoJSON files (consisting of a Feature or a set of features in a FeatureCollection) into a table in a SQLite database. If you use the --spatialite option it will initialize the table with SpatiaLite and store the geometries in a spatially indexed geometry field—without that option it stores them as GeoJSON. #

30th January 2020

  • I used to tolerate and expect complexity. Working on Go the past 10 years has changed my perspective, though. I now value simplicity above almost all else and tolerate complexity only when it’s well isolated, well documented, well tested, and necessary to make things simpler overall at other layers for most people.

    Brad Fitzpatrick #

24th January 2020

  • 2020 Web Milestones (via) A lot of stuff is happening in 2020! Mike Sherov rounds it up—highlights include the release of Chromium Edge (Microsoft’s Chrome-powered browser for Windows 7+), Web Components supported in every major browser, Deno 1.x, SameSite Cookies turned on by default (which should dramatically reduce CSRF exposure) and Python 2 and Flash EOLs. #
  • Generated Columns in SQLite (via) SQLite 3.31.0 released today, and generated columns are the single most notable new feature. PostgreSQL 12 added these in October 2019, and MySQL has had them since 5.7 in October 2015. MySQL and SQLite both offer either “stored” or “virtual” generated columns, with virtual columns being calculated at runtime. PostgreSQL currently only supports stored columns. #

18th January 2020

  • Ray Bandar's Bone Palace (via) The hundredth museum I’ve posted to Niche Museums, and this one is my absolute favourite. Ray Bandar spent sixty years collecting over 7,000 animal skulls and putting them on display in his house and basement. Natalie managed to score us a tour a few weeks before the entire collection was donated to the California Academy of Sciences. It was the most amazing room I’ve ever been in. #
  • Code is made of pain, lies, and bad ideas, all so we can pretend that electrified sand will do what we tell it to

    Yoz Grahame #

17th January 2020

  • Your own hosted blog, the easy, free, open way (even if you're not a computer expert) (via) Jeremy Howard and the fast.ai team have released fast_template—a GitHub repository designed to be used as a template to create new repositories with a complete Jekyll blog configured for use with GitHub Pages. GitHub’s official documentation recommends you install Ruby on your machine to do this, but Jeremy points out that with the right repository setup you can run a blog entirely by editing files through the GitHub web interface. #