Weeknotes: Archiving coronavirus.data.gov.uk, custom pages and directory configuration in Datasette, photos-to-sqlite
29th April 2020
I mainly made progress on three projects this week: Datasette, photos-to-sqlite and a cleaner way of archiving data to a git repository.
Archiving coronavirus.data.gov.uk
The UK government have a new portal website sharing detailed Coronavirus data for regions around the country, at coronavirus.data.gov.uk.
As with everything else built in 2020, it’s a big single-page JavaScript app. Matthew Somerville investigated what it would take to build a much lighter (and faster loading) site displaying the same information by moving much of the rendering to the server.
One of the best things about the SPA craze is that it strongly encourages structured data to be published as JSON files. Matthew’s article inspired me to take a look, and sure enough the government figures are available in an extremely comprehensive (and 3.3MB in size) JSON file, available from https://c19downloads.azureedge.net/downloads/data/data_latest.json.
Any time I see a file like this my first questions are how often does it change—and what kind of changes are being made to it?
I’ve written about scraping to a git repository (see my new gitscraping tag) a bunch in the past:
- Scraping hurricane Irma—September 2017
- Changelogs to help understand the fires in the North Bay—October 2017
- Generating a commit log for San Francisco’s official list of trees—March 2019
- Tracking PG&E outages by scraping to a git repo—October 2019
- Deploying a data API using GitHub Actions and Cloud Run—January 2020
Now that I’ve figured out a really clean way to Commit a file if it changed in a GitHub Action, knocking out new versions of this pattern is really quick.
simonw/coronavirus-data-gov-archive is my new repo that does exactly that: it periodically fetches the latest versions of the JSON data files powering that site and commits them if they have changed. The aim is to build a commit history of changes made to the underlying data.
The first implementation was extremely simple—here’s the entire action:
name: Fetch latest data
on:
  push:
  repository_dispatch:
  schedule:
    - cron: '25 * * * *'
jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out this repo
        uses: actions/checkout@v2
      - name: Fetch latest data
        run: |-
          curl https://c19downloads.azureedge.net/downloads/data/data_latest.json | jq . > data_latest.json
          curl https://c19pub.azureedge.net/utlas.geojson | gunzip | jq . > utlas.geojson
          curl https://c19pub.azureedge.net/countries.geojson | gunzip | jq . > countries.geojson
          curl https://c19pub.azureedge.net/regions.geojson | gunzip | jq . > regions.geojson
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
It uses a combination of curl and jq (both available in the default worker environment) to pull down the data and pretty-print it (better for readable diffs), then commits the result.
Matthew Somerville pointed out that inefficient polling sets a bad precedent. Here I’m hitting azureedge.net, the Azure CDN, so that didn’t particularly worry me—but since I want this pattern to be used widely it’s good to provide a best-practice example.
Figuring out the best way to make conditional GET requests in a GitHub Action led me down something of a rabbit hole. I wanted to use curl’s new ETag support but I ran into a curl bug, so I ended up rolling a simple Python CLI tool called conditional-get to solve my problem. In the time it took me to release that tool (just a few hours) a new curl release came out with a fix for that bug!
Here’s the workflow using my conditional-get tool. See the issue thread for all of the other potential solutions, including a really neat Action shell-script solution by Alf Eaton.
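The underlying idea is simple: remember the ETag header from the previous fetch, send it back as If-None-Match and skip writing the file if the server responds with 304 Not Modified. Here’s a minimal Python sketch of that pattern, purely for illustration; it’s not the actual conditional-get implementation, and the etags.json filename is just something I picked for this example.
import json
import pathlib
import urllib.error
import urllib.request

def fetch_if_changed(url, dest, etags_path="etags.json"):
    # Load the ETags recorded by previous runs, if any
    etags_file = pathlib.Path(etags_path)
    etags = json.loads(etags_file.read_text()) if etags_file.exists() else {}
    request = urllib.request.Request(url)
    if url in etags:
        # Ask the server to only send the body if it has changed
        request.add_header("If-None-Match", etags[url])
    try:
        with urllib.request.urlopen(request) as response:
            pathlib.Path(dest).write_bytes(response.read())
            if response.headers.get("ETag"):
                etags[url] = response.headers["ETag"]
                etags_file.write_text(json.dumps(etags, indent=2))
        return True  # file was downloaded
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return False  # not modified, nothing written
        raise

fetch_if_changed(
    "https://c19downloads.azureedge.net/downloads/data/data_latest.json",
    "data_latest.json",
)
The same trick works with Last-Modified and If-Modified-Since headers for servers that don’t return ETags.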
To my absolute delight, the project has already been forked once by Daniel Langer to capture Canadian Covid-19 cases!
New Datasette features
I pushed two new features to Datasette master, ready for release in 0.41.
Configuration directory mode
This is an idea I had while building datasette-publish-now. Datasette instances can be run with custom metadata, custom plugins and custom templates. I’m increasingly finding myself working on projects that run using something like this:
$ datasette data1.db data2.db data3.db \
    --metadata=metadata.json \
    --template-dir=templates \
    --plugins-dir=plugins
Configuration directory mode introduces the idea that Datasette can configure itself based on a directory layout. The above example can instead be handled by creating the following layout:
my-project/data1.db
my-project/data2.db
my-project/data3.db
my-project/metadata.json
my-project/templates/index.html
my-project/plugins/custom_plugin.py
Then run Datasette directly targeting that directory:
$ datasette my-project/
See issue #731 for more details. Configuration directory mode is documented here.
Define custom pages using templates/pages
In niche-museums.com, powered by Datasette, I described how I built the www.niche-museums.com website as a heavily customized Datasette instance.
That site has /about and /map pages which are served by custom templates—but I had to do some gnarly hacks with empty about.db and map.db files to get them to work.
Issue #648 introduces a new mechanism for creating this kind of page: create a templates/pages/map.html template file and custom 404 handling code will ensure that any hits to /map serve the rendered contents of that template.
This could work really well with the datasette-template-sql plugin, which allows templates to execute arbitrary SQL queries (à la PHP or ColdFusion).
Here’s the new documentation on custom pages, including details of how to use the new custom_status(), custom_header() and custom_redirect() template functions to go beyond just returning HTML.
photos-to-sqlite
My Dogsheep personal analytics project brings my tweets, GitHub activity, Swarm checkins and more together in one place. But the big missing feature is my photos.
As of yesterday, I have 39,000 photos from Apple Photos uploaded to an S3 bucket using my new photos-to-sqlite tool. I can run the following SQL query and get back ten random photos!
select
  json_object(
    'img_src',
    'https://photos.simonwillison.net/i/' ||
      sha256 || '.' || ext || '?w=400'
  ),
  filepath,
  ext
from
  photos
where
  ext in ('jpeg', 'jpg', 'heic')
order by
  random()
limit
  10
photos.simonwillison.net is running a modified version of my heic-to-jpeg image converting and resizing proxy, which I’ll release at some point soon.
There’s still plenty of work to do—I still need to import EXIF data (including locations) into SQLite, and I plan to use osxphotos to export additional metadata from my Apple Photos library. But this week it went from a pure research project to something I can actually start using, which is exciting.
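To give a sense of the direction, here’s a rough Python sketch of what that osxphotos import could look like. This is not how photos-to-sqlite actually works; the apple_photos table name and the columns are placeholders I’ve invented for illustration.
import osxphotos
import sqlite_utils

db = sqlite_utils.Database("photos.db")

# osxphotos reads the Apple Photos library and exposes each photo as a PhotoInfo object
rows = []
for photo in osxphotos.PhotosDB().photos():
    latitude, longitude = photo.location or (None, None)
    rows.append({
        "uuid": photo.uuid,
        "filename": photo.original_filename,
        "date": photo.date.isoformat() if photo.date else None,
        "latitude": latitude,
        "longitude": longitude,
    })

# Upsert on uuid so the script can be re-run safely as the library changes
db["apple_photos"].upsert_all(rows, pk="uuid")
Upserting on uuid means re-running the script won’t create duplicate rows.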
TIL this week
- Fixing “compinit: insecure directories” error
- Restricting SSH connections to devices within a Tailscale network
- Generated a summary of nested JSON data
- Session-scoped temporary directories in pytest
- How to mock httpx using pytest-mock
Generated using this query.