Weeknotes: Datasette and Git scraping at NICAR, VaccinateCA

7th March 2021

This week I virtually attended the NICAR data journalism conference and made a ton of progress on the Django backend for VaccinateCA (see last week).

NICAR 2021

NICAR stands for the National Institute for Computer-Assisted Reporting, an acronym that reflects the age of the organization: it started teaching journalists data-driven reporting back in 1989, long before the term “data journalism” became commonplace.

This was my third NICAR and it has now firmly established itself at the top of my list of favourite conferences. Every year it attracts over 1,000 of the highest quality data nerds, from data journalism veterans who’ve been breaking stories for decades to journalists who are just getting started with data and want to start learning Python or polish up their Excel skills.

I presented an hour-long workshop on Datasette, which I’m planning to turn into the first official Datasette tutorial. I also got to pre-record a five-minute lightning talk about Git scraping.

I published the video and notes for that yesterday. It really seemed to strike a chord at the conference: I showed how you can set up a scheduled scraper using GitHub Actions with just a few lines of YAML configuration, and do so entirely through the GitHub web interface without even opening a text editor.
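Here’s a minimal sketch of the kind of workflow I mean, with a placeholder URL: fetch some JSON on a schedule, then commit the result back to the repo only if it changed.

```yaml
name: Scrape latest data

on:
  push:
  workflow_dispatch:
  schedule:
    # Runs three times an hour, at 6, 26 and 46 minutes past
    - cron: '6,26,46 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out this repo
        uses: actions/checkout@v2
      - name: Fetch latest data
        # Placeholder URL; jq pretty-prints the JSON so diffs stay readable
        run: curl https://example.com/data.json | jq . > data.json
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git diff --quiet && git diff --staged --quiet || git commit -m "Latest data: ${timestamp}"
          git push
```

The git diff trick in the final step means a commit only happens when the data actually changes, so the commit history becomes a timeline of every change to the scraped data.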

Pretty much every data journalist wants to run scrapers, and understands the friction involved in maintaining their own dedicated server, crontabs, storage and backups to run them. Being able to do this for free on GitHub’s infrastructure drops that friction down to almost nothing.

The lightning talk led to a last-minute GitHub Actions and Git scraping office hours session being added to the schedule, and I was delighted to have Ryan Murphy from the LA Times join that session to demonstrate the incredible things the LA Times have been doing with scrapers and GitHub Actions. You can see some of their scrapers in the datadesk/california-coronavirus-scrapers repo.

VaccinateCA

The race continues to build out a Django backend for the VaccinateCA project, to collect data on vaccine availability from people making calls on that organization’s behalf.

The new backend is getting perilously close to launch. I’m leaning heavily on the Django admin for this, refreshing my knowledge of how to customize it with things like admin actions and custom filters.

It’s been quite a while since I’ve done anything sophisticated with the Django admin and it has evolved a LOT. In the past I’ve advised people to drop the admin for custom view functions the moment they want to do anything out-of-the-ordinary—I don’t think that advice holds any more. It’s got really good over the years!
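To give a flavour of the kind of customization I’m talking about, here’s an illustrative sketch combining an admin action with a custom list filter. The CallReport model and its fields are hypothetical placeholders, not the actual VaccinateCA schema:

```python
from django.contrib import admin

from .models import CallReport  # hypothetical model name


class HasNotesFilter(admin.SimpleListFilter):
    # Custom filter: narrows the change list to reports with or without notes
    title = "has notes"
    parameter_name = "has_notes"

    def lookups(self, request, model_admin):
        return [("yes", "Has notes"), ("no", "No notes")]

    def queryset(self, request, queryset):
        if self.value() == "yes":
            return queryset.exclude(notes="")
        if self.value() == "no":
            return queryset.filter(notes="")
        return queryset


@admin.register(CallReport)
class CallReportAdmin(admin.ModelAdmin):
    list_display = ("id", "created_at", "is_flagged")
    list_filter = (HasNotesFilter, "is_flagged")
    actions = ["flag_for_review"]

    def flag_for_review(self, request, queryset):
        # Admin action: applies to whichever rows the user has selected
        updated = queryset.update(is_flagged=True)
        self.message_user(request, f"Flagged {updated} reports for review")

    flag_for_review.short_description = "Flag selected reports for review"
```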

A very smart thing the team at VaccinateCA did a month ago was to start logging the full incoming POST bodies for every API request handled by their existing Netlify functions (which then write to Airtable).

This has given me an invaluable tool for testing the new replacement API: I wrote a script that replays those API logs against my new implementation, letting me confirm that every one of several thousand previously recorded API requests runs without errors against my new code.
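The replay script is only a few lines of Python. Here’s the rough shape of it, assuming (for illustration) that the logs are newline-delimited JSON with one recorded POST body per line, and using a hypothetical local endpoint URL:

```python
import json

import requests

NEW_API = "http://localhost:8000/api/submitReport"  # hypothetical endpoint

failures = []

with open("api-logs.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        body = json.loads(line)
        response = requests.post(NEW_API, json=body)
        if response.status_code >= 400:
            # Record enough context to debug the failing request later
            failures.append((line_number, response.status_code, response.text[:200]))

print(f"{len(failures)} failures out of {line_number} replayed requests")
for line_number, status, text in failures:
    print(f"line {line_number}: HTTP {status} {text}")
```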

Since this logging is so valuable, I’ve written code for the new stack that logs API requests directly to the database. Normally I’d shy away from using a database table for logging like this, but the expected traffic is in the low thousands of API requests a day, and a few thousand extra database rows per day is a tiny price to pay for this level of visibility into how the API is being used.

(I’m also logging the API requests to PostgreSQL using Django’s JSONField, which means I can analyze them in depth later on using PostgreSQL’s JSON functionality!)
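A sketch of what that logging model can look like; the field names here are illustrative rather than the real schema:

```python
from django.db import models


class ApiLog(models.Model):
    # One row per incoming API request; body stores the raw POST payload
    created_at = models.DateTimeField(auto_now_add=True)
    method = models.CharField(max_length=10)
    path = models.CharField(max_length=255)
    body = models.JSONField(null=True, blank=True)
    response_status = models.IntegerField()


# Later analysis can use PostgreSQL's JSON operators via Django lookups,
# e.g. every logged request whose (hypothetical) availability field
# contains a particular tag:
# ApiLog.objects.filter(body__availability__contains=["Yes: walk-ins accepted"])
```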

YouTube subtitles

I decided to add proper subtitles to my lightning talk video, and was delighted to learn that the YouTube subtitle editor pre-populates with an automatically generated transcript, which you can then edit in place to fix spelling and grammar and to remove the various “um” and “so” filler words.

This makes creating high quality captions a remarkably fast process. I’ve also added them to the 17-minute Introduction to Datasette and sqlite-utils video that’s embedded on the datasette.io homepage; editing the transcript for that only took about half an hour.

TIL this week