Simon Willison’s Weblog


Recent entries

Porting VaccinateCA to Django five days ago

As I mentioned back in February, I’ve been working with the VaccinateCA project to try to bring the pandemic to an end a little earlier by helping gather as accurate a model as possible of where the Covid vaccine is available in California and how people can get it.

The key activity at VaccinateCA is calling places to check on their availability and eligibility criteria. Up until last night this was powered by a heavily customized Airtable instance, accompanied by a custom JavaScript app for the callers that communicated with the Airtable API via some Netlify functions.

Today, the flow is powered by a new custom Django backend, running on top of PostgreSQL.

The thing you should never do

Replacing an existing system with a from-scratch rewrite is risky. Replacing a system that is built on something as flexible as Airtable that is evolving on a daily basis is positively terrifying!

Airtable served us extremely well, but unfortunately there are hard limits to the number of rows Airtable can handle and we’ve already bounced up against them and had to archive some of our data. To keep scaling the organization we needed to migrate away.

We needed to build a matching relational database with a comprehensive, permission-controlled interface for editing it, plus APIs to drive our website and application. And we needed to do it using the most boring technology possible, so we could focus on solving problems directly rather than researching anything new.

It will never cease to surprise me that Django has attained boring technology status! VaccinateCA sits firmly in Django’s sweet-spot. So we used that to build our replacement.

The new Django-based system is called VIAL, for “Vaccine Information Archive and Library”—a neat Jesse Vincent bacronym.

We switched things over to VIAL last night, but we still have activity in Airtable as well. I expect we’ll keep using Airtable for the lifetime of the organization—there are plenty of ad-hoc data projects for which it’s a perfect fit.

The most important thing here is to have a trusted single point of truth for any piece of information. I’m not quite ready to declare victory on that point just yet, but hopefully we’ll get there once things settle down over the next few days.

Screenshot of the Django admin VIAL index page

Data synchronization patterns

The first challenge, before even writing any code, was how to get stuff out of Airtable. I built a tool for this a while ago called airtable-export, and it turned out the VaccinateCA team were using it already before I joined!

airtable-export was already running several times an hour, backing up the data in JSON format to a GitHub repository (a form of Git scraping). This gave us a detailed history of changes to the Airtable data, which occasionally proved extremely useful for answering questions about when a specific record was changed or deleted.

Having the data in a GitHub repository was also useful because it gave us somewhere to pull data from that wasn’t governed by Airtable’s rate limits.

I iterated through a number of different approaches for writing importers for the data.

Each Airtable table ended up as a single JSON file in our GitHub repository, containing an array of objects—those files got pretty big, topping out at about 80MB.

I started out with Django management commands, which could be passed a file or a URL. A neat thing about using GitHub for this is that you can use the “raw data” link to obtain a URL with a short-lived token, which grants access to that file. So I could create a short-term URL and paste it directly into my import tool.
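
Here’s a rough sketch of what one of those management commands can look like. This is illustrative rather than the actual VIAL code, and it assumes httpx is available for fetching URLs:

import json

import httpx
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Import records from an exported Airtable JSON file or URL"

    def add_arguments(self, parser):
        parser.add_argument("path_or_url")

    def handle(self, *args, **options):
        path_or_url = options["path_or_url"]
        if path_or_url.startswith("http"):
            # e.g. a GitHub "raw data" URL carrying its short-lived token
            records = httpx.get(path_or_url, timeout=60).json()
        else:
            with open(path_or_url) as fp:
                records = json.load(fp)
        self.stdout.write("Loaded {} records".format(len(records)))
        # ...create or update ORM objects from those records here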

I don’t have a good pattern for running Django management commands on Google Cloud Run, so I started moving to API-based import scripts instead.

The pattern that ended up working best was to provide a /api/importRecords API endpoint which accepts a JSON array of items.

The API expects the input to have a unique primary key in each record—airtable_id in our case. It then uses Django’s update_or_create() ORM method to create new records if they are missing, and update existing records otherwise.
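
As a rough illustration of that pattern (a simplified sketch rather than the actual VIAL code: authentication is omitted and the Location model and its fields are hypothetical):

import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

from .models import Location  # hypothetical model with airtable_id and import_json fields


@csrf_exempt
def import_records(request):
    records = json.loads(request.body)
    created = updated = 0
    for record in records:
        obj, was_created = Location.objects.update_or_create(
            airtable_id=record["airtable_id"],
            defaults={"name": record.get("Name", ""), "import_json": record},
        )
        if was_created:
            created += 1
        else:
            updated += 1
    return JsonResponse({"created": created, "updated": updated})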

One remaining challenge: posting 80MB of JSON to an API in one go would likely run into resource limits. I needed a way to break that input up into smaller batches.

I ended up building a new tool for this called json-post. It has an extremely specific use-case: it’s for when you want to POST a big JSON array to an API endpoint but you want to first break it up into batches!

Here’s how to break up the JSON in Reports.json into 50 item arrays and send them to that API as separate POSTs:

json-post Reports.json \
  "https://example.com/api/importReports" \
  --batch-size 50

Here’s a more complex example. This time we need to pass an Authorization: Bearer XXXtokenXXX API key header, process the array in reverse, record our progress (the JSON responses from the API as newline-delimited JSON) to a log file, set a longer HTTP read timeout and filter for just specific items:

% json-post Reports.json \
  "https://example.com/api/importReports" \
  -h Authorization 'Bearer XXXtokenXXX' \
  --batch-size 50 \
  --reverse \
  --log /tmp/progress.txt \
  --http-read-timeout 20 \
  --filter 'item.get("is_soft_deleted")'

The --filter option proved particularly useful. As we kicked the tires on VIAL we would spot new bugs—things like the import script failing to correctly record the is_soft_deleted field we were using in Airtable. Being able to filter that input file with a command-line flag meant we could easily re-run the import just for a subset of reports that were affected by a particular bug.

--filter takes a Python expression that gets compiled into a function and passed item as the current item in the list. I borrowed the pattern from my sqlite-transform tool.
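
The underlying trick looks roughly like this (a simplified sketch of the pattern, not necessarily json-post’s exact internals):

def compile_filter(expression):
    # Compile the --filter expression once, then evaluate it for each item
    code = compile(expression, "<filter>", "eval")

    def predicate(item):
        return bool(eval(code, {}, {"item": item}))

    return predicate


items = [{"id": 1, "is_soft_deleted": True}, {"id": 2}]
keep = compile_filter('item.get("is_soft_deleted")')
print([item for item in items if keep(item)])
# [{'id': 1, 'is_soft_deleted': True}]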

The value of API logs

VaccinateCA’s JavaScript caller application used to send data to Airtable via a Netlify function, which allowed an additional layer of authentication, built using Auth0, to be added.

Back in February, the team had the bright idea to log the API traffic to that function to a separate base in Airtable—including full request and response bodies.

This proved invaluable for debugging. It also meant that when I started building VIAL’s alternative implementation of the “submit a call report” API I could replay historic API traffic that had been recorded in that table, giving me a powerful way to exercise the new API with real-world traffic.

This meant that when we turned on VIAL we could switch our existing JavaScript SPA over to talking to it using a fully tested clone of the existing Airtable-backed API.

VIAL implements this logging pattern again, this time using Django and PostgreSQL.

Given that the writable APIs will receive in the low thousands of requests a day, keeping them in a database table works great. The table has grown to 90MB so far. I’m hoping that the pandemic will be over before we have to worry about logging capacity!

We’re using PostgreSQL jsonb columns to store the incoming and returned JSON, via Django’s JSONField. This means we can do in-depth API analysis using PostgreSQL’s JSON SQL functions! Being able to examine returned JSON error messages or aggregate across incoming request bodies helped enormously when debugging problems with the API import scripts.
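
A sketch of what such a logging model can look like (the field names here are illustrative, not the exact VIAL schema); Django’s JSONField maps to a jsonb column on PostgreSQL:

from django.db import models


class ApiLog(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)
    method = models.CharField(max_length=10)
    path = models.CharField(max_length=255)
    post_body = models.JSONField(null=True, blank=True)  # incoming request JSON
    response_body = models.JSONField(null=True, blank=True)  # JSON we returned
    response_status = models.IntegerField(null=True)

JSONField lookups then work directly against the stored documents, for example ApiLog.objects.filter(response_body__has_key="error").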

Storing the original JSON

Today, almost all of the data stored in VIAL originated in Airtable. One trick that has really helped build the system is that each of the tables that might contain imported data has both an airtable_id nullable column and an import_json JSON field.

Any time we import a record from Airtable, we record both the ID and the full, original Airtable JSON that we used for the import.

This is another powerful tool for debugging: we can view the original Airtable JSON directly in the Django admin interface for a record, and confirm that it matches the ORM fields we populated from it.
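
In model terms the pattern is just two extra fields on each importable table. A minimal sketch (the real models have many more fields than this):

from django.db import models


class Location(models.Model):
    name = models.CharField(max_length=255)
    # Provenance for records that originated in Airtable
    airtable_id = models.CharField(max_length=64, null=True, blank=True, unique=True)
    import_json = models.JSONField(null=True, blank=True)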

I came up with a simple pattern for Pretty-printing all read-only JSON in the Django admin that helps with this too.

Staying as flexible as possible

The thing that worried me most about replacing Airtable with Django was Airtable’s incredible flexibility. In the organization’s short life it has already solved so many problems by adding new columns in Airtable, or building new views.

Is it possible to switch to custom software without losing that huge cultural advantage?

This is the same reason it’s so hard for custom software to compete with spreadsheets.

We’ve only just made the switch, so we won’t know for a while how well we’ve done at handling this. I have a few mechanisms in place that I’m hoping will help.

The first is django-sql-dashboard. I wrote about this project in previous weeknotes here and here—the goal is to bring some of the ideas from Datasette into the Django/PostgreSQL world, by providing a read-only mechanism for constructing SQL queries, bookmarking and saving the results and outputting simple SQL-driven visualizations.

We have a lot of SQL knowledge at VaccinateCA, so my hope is that people with SQL will be able to solve their own problems, and people who don’t know SQL yet will have no trouble finding someone who can help them.

In the boring technology model of things, django-sql-dashboard counts as the main innovation token I’m spending for this project. I’m optimistic that it will pay off.

I’m also leaning heavily on Django’s migration system, with the aim of making database migrations common and boring, rather than their usual default of being rare and exciting. We’re up to 77 migrations already, in a codebase that is just over two months old!

I think a culture that evolves the database schema quickly and with as little drama as possible is crucial to maintaining the agility that this kind of organization needs.

Aside from the Django Admin providing the editing interface, everything that comes into and goes out of VIAL happens through APIs. These are fully documented: I want people to be able to build against the APIs independently, especially for things like data import.

After seeing significant success with PostgreSQL JSON already, I’m considering using it to add even more API-driven flexibility to VIAL in the future. Allowing our client developers to start collecting a new piece of data from our volunteers in an existing JSON field, then migrating that into a separate column once it has proven its value, is very tempting indeed.

Open source tools we are using

An incomplete list of open source packages we are using for VIAL so far:

Want to help out?

VaccinateCA is hiring! It’s an interesting gig, because the ultimate goal is to end the pandemic and put this non-profit permanently out of business. So if you want to help end things faster, get in touch.

TIL this week

Releases this week

Animated choropleth of vaccinations by US county 13 days ago

Last week I mentioned that I’ve recently started scraping and storing the CDC’s per-county vaccination numbers in my cdc-vaccination-history GitHub repository. This week I used an Observable notebook and d3’s TopoJSON support to render those numbers on an animated choropleth map.

Animated choropleth map of county vaccinations

The full code is available at https://observablehq.com/@simonw/us-county-vaccinations-choropleth-map

From scraper to Datasette

My scraper for this data is a single line in a GitHub Actions workflow:

curl https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_county_condensed_data \
  | jq . > counties.json

I pipe the data through jq to pretty-print it, just to get nicer diffs.

My build_database.py script then iterates over the accumulated git history of that counties.json file and uses sqlite-utils to build a SQLite table:

import json

import sqlite_utils

db = sqlite_utils.Database("cdc.db")  # assumed output filename

# iterate_file_versions() is a helper defined earlier in the script: it yields
# (datetime, commit hash, file contents) for every version of the file in the git history
for i, (when, hash, content) in enumerate(
    iterate_file_versions(".", ("counties.json",))
):
    try:
        counties = json.loads(content)["vaccination_county_condensed_data"]
    except ValueError:
        # Skip commits where the JSON was invalid
        continue
    for county in counties:
        id = county["FIPS"] + "-" + county["Date"]
        db["daily_reports_counties"].insert(
            dict(county, id=id), pk="id", alter=True, replace=True
        )

The resulting table can be seen at cdc/daily_reports_counties.

From Datasette to Observable

Observable notebooks are my absolute favourite tool for prototyping new visualizations. There are examples of pretty much anything you could possibly want to create, and the Observable ecosystem actively encourages forking and sharing new patterns.

Loading data from Datasette into Observable is easy, using Datasette’s various HTTP APIs. For this visualization I needed to pull two separate things from Datasette.

Firstly, for any given date I need the full per-county vaccination data. Here’s the full table filtered for April 2nd for example.

Since that’s 3,221 rows Datasette’s JSON export would need to be paginated... but Datasette’s CSV export can stream all 3,000+ rows in a single request. So I’m using that, fetched using the d3.csv() function:

county_data = await d3.csv(
    `https://cdc-vaccination-history.datasette.io/cdc/daily_reports_counties.csv?_stream=on&Date=${county_date}&_size=max`
);

In order to animate the different dates, I need a list of available dates. I can get those with a SQL query:

select distinct Date
from daily_reports_counties
order by Date

Datasette’s JSON API has a ?_shape=arrayfirst option which will return a single JSON array of the first values in each row, which means I can do this:

https://cdc-vaccination-history.datasette.io/cdc.json?sql=select%20distinct%20Date%20from%20daily_reports_counties%20order%20by%20Date&_shape=arrayfirst

And get back just the dates as an array:

[
  "2021-03-26",
  "2021-03-27",
  "2021-03-28",
  "2021-03-29",
  "2021-03-30",
  "2021-03-31",
  "2021-04-01",
  "2021-04-02",
  "2021-04-03"
]

Mike Bostock has a handy Scrubber implementation which can provide a slider with the ability to play and stop iterating through values. In the notebook that can be used like so:

viewof county_date = Scrubber(county_dates, {
  delay: 500,
  autoplay: false
})

county_dates = (await fetch(
  "https://cdc-vaccination-history.datasette.io/cdc.json?sql=select%20distinct%20Date%20from%20daily_reports_counties%20order%20by%20Date&_shape=arrayfirst"
)).json()

import { Scrubber } from "@mbostock/scrubber"

Drawing the map

The map itself is rendered using TopoJSON, an extension to GeoJSON that efficiently encodes topology.

Consider the map of 3,200 counties in the USA: since counties border each other, most of those border polygons end up duplicating each other to a certain extent.

TopoJSON only stores each shared boundary once, but still knows how they relate to each other, which means the data can be used to draw shapes filled with colours.

I’m using the https://d3js.org/us-10m.v1.json TopoJSON file built and published with d3. Here’s my JavaScript for rendering that into an SVG map:

{
  const svg = d3
    .create("svg")
    .attr("viewBox", [0, 0, width, 700])
    .style("width", "100%")
    .style("height", "auto");

  svg
    .append("g")
    .selectAll("path")
    .data(
      topojson.feature(topojson_data, topojson_data.objects.counties).features
    )
    .enter()
    .append("path")
    .attr("fill", function(d) {
      if (!county_data[d.id]) {
        return 'white';
      }
      let v = county_data[d.id].Series_Complete_65PlusPop_Pct;
      return d3.interpolate("white", "green")(v / 100);
    })
    .attr("d", path)
    .append("title") // Tooltip
    .text(function(d) {
      if (!county_data[d.id]) {
        return '';
      }
      return `${
        county_data[d.id].Series_Complete_65PlusPop_Pct
      }% of the 65+ population in ${county_data[d.id].County}, ${county_data[d.id].StateAbbr.trim()} have had the complete vaccination`;
    });
  return svg.node();
}

Next step: a plugin

Now that I have a working map, my next goal is to package this up as a Datasette plugin. I’m hoping to create a generic choropleth plugin which bundles TopoJSON for some common maps—probably world countries, US states and US counties to start off with—but also allows custom maps to be supported as easily as possible.

Datasette 0.56

Also this week, I shipped Datasette 0.56. It’s a relatively small release—mostly documentation improvements and bug fixes, but I’ve also bundled SpatiaLite 5 with the official Datasette Docker image.

TIL this week

Releases this week

Weeknotes: SpatiaLite 5, Datasette on Azure, more CDC vaccination history 20 days ago

This week I got SpatiaLite 5 working in the Datasette Docker image, improved the CDC vaccination history git scraper, figured out Datasette on Azure and we closed on a new home!

SpatiaLite 5 for Datasette

SpatiaLite 5 came out earlier this year with a bunch of exciting improvements, most notably an implementation of KNN (K-nearest neighbours)—a way to efficiently answer the question “what are the 10 closest rows to this latitude/longitude point”.

I love building X near me websites so I expect I’ll be using this a lot in the future.

I spent a bunch of time this week figuring out how best to install it into a Docker container for use with Datasette. I finally cracked it in issue 1249 and the Dockerfile in the Datasette repository now builds with the SpatiaLite 5.0 extension, using a pattern I figured out for installing Debian unstable packages into a Debian stable base container.

When Datasette 0.56 is released the official Datasette Docker image will bundle SpatiaLite 5.0.

CDC vaccination history in Datasette

I’m tracking the CDC’s per-state vaccination numbers in my cdc-vaccination-history repository, as described in my Git scraping lightning talk.

Scraping data into a git repository to track changes to it over time is easy. What’s harder is extracting that data back out of the commit history in order to analyze and visualize it later.

To demonstrate how this can work I added a build_database.py script to that repository which iterates through the git history and uses it to build a SQLite database containing daily state reports. I also added steps to the GitHub Actions workflow to publish that SQLite database using Datasette and Vercel.
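
The core of that script is a helper that walks the git history of a single file. Here’s a rough sketch of the idea using GitPython; the real build_database.py may differ in its details:

from git import Repo


def iterate_file_versions(repo_path, filepath):
    repo = Repo(repo_path)
    # Oldest commit first, restricted to commits that touched this file
    for commit in reversed(list(repo.iter_commits(paths=filepath))):
        blob = commit.tree / filepath
        yield commit.committed_datetime, commit.hexsha, blob.data_stream.read()


# "data.json" is a placeholder for whichever scraped file you want to replay
for when, sha, content in iterate_file_versions(".", "data.json"):
    print(when, sha, len(content))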

I installed the datasette-vega visualization plugin there too. Here’s a chart showing the number of doses administered over time in California.

Chart of vaccines distributed in California, which is going up at a healthy pace

This morning I started capturing the CDC’s per-county data too, but I’ve not yet written code to load that into Datasette. [UPDATE: that table is now available: cdc/daily_reports_counties]

Datasette on Azure

I’m keen to make Datasette easy to deploy in as many places as possible. I already have mechanisms for publishing to Heroku, Cloud Run, Vercel and Fly.io—today I worked out the recipe needed for Azure Functions.

I haven’t bundled it into a datasette-publish-azure plugin yet but that’s the next step. In the meantime the azure-functions-datasette repo has a working example with instructions on how to deploy it.

Thanks go to Anthony Shaw for building out the ASGI wrapper needed to run ASGI applications like Datasette on Azure Functions.

iam-to-sqlite

I spend way too much time whinging about IAM on Twitter. I’m certain that properly learning IAM will unlock the entire world of AWS, but I have so far been unable to overcome my discomfort with it long enough to actually figure it out.

After yet another unproductive whinge this week I guilted myself into putting in some effort, and it’s already started to pay off: I figured out how to dump out all existing IAM data (users, groups, roles and policies) as JSON using the aws iam get-account-authorization-details command, and got so excited about it that I built iam-to-sqlite as a wrapper around that command that writes the results into SQLite so I can browse them using Datasette!
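
The general approach is only a few lines of Python. Here’s a hedged sketch of it; iam-to-sqlite itself may organize the tables differently:

import json
import subprocess

import sqlite_utils

# Dump every IAM user, group, role and policy as one big JSON document
output = subprocess.check_output(
    ["aws", "iam", "get-account-authorization-details", "--output", "json"]
)
data = json.loads(output)

db = sqlite_utils.Database("iam.db")
db["users"].insert_all(data["UserDetailList"], pk="UserId", replace=True)
db["groups"].insert_all(data["GroupDetailList"], pk="GroupId", replace=True)
db["roles"].insert_all(data["RoleDetailList"], pk="RoleId", replace=True)
db["policies"].insert_all(data["Policies"], pk="PolicyId", replace=True)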

Datasette showing IAM database tables

I’m increasingly realizing that the key to me understanding how pretty much any service works is to pull their JSON into a SQLite database so I can explore it as relational tables.

A useful trick for writing weeknotes

When writing weeknotes like these, it’s really useful to be able to see all of the commits from the past week across many different projects.

Today I realized you can use GitHub search for this. Run a search for author:simonw created:>2021-03-20 and filter to commits, ordered by “Recently committed”.

Here’s that search for me.

Django pull request accepted!

I had a pull request accepted to Django this week! It was a documentation fix for the RawSQL query expression—I found a pattern for using it as part of a .filter(id__in=RawSQL(...)) query that wasn’t covered by the documentation.
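
The pattern in question, sketched against a hypothetical Entry model:

from django.db.models.expressions import RawSQL

# Use RawSQL to supply the subquery for an __in filter
queryset = Entry.objects.filter(
    id__in=RawSQL("select id from blog_entry where created > %s", ["2021-01-01"])
)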

And we found a new home

One other project this week: Natalie and I closed on a new home! We’re moving to El Granada, a tiny town just north of Half Moon Bay, on the coast 40 minutes south of San Francisco. We’ll be ten minutes from the ocean, with plenty of pinnipeds and pelicans. Exciting!

Cleo asleep on the deck with the Pacific ocean in the distance

TIL this week

Releases this week

Weeknotes: django-sql-dashboard widgets 27 days ago

A few small releases this week, for django-sql-dashboard, datasette-auth-passwords and datasette-publish-vercel.

django-sql-dashboard widgets and permissions

django-sql-dashboard, my subset-of-Datasette-for-Django-and-PostgreSQL continues to come together.

New this week: widgets and permissions.

To recap: this Django app borrows some ideas from Datasette: it encourages you to create a read-only PostgreSQL user and grant authenticated users the ability to run one or more raw SQL queries directly against your database.

You can execute more than one SQL query and combine them into a saved dashboard, which will then show multiple tables containing the results.

This week I added support for dashboard widgets. You can construct SQL queries to return specific column patterns which will then be rendered on the page in different ways.

There are four widgets at the moment: “big number”, bar chart, HTML and Markdown.

Big number is the simplest: define a SQL query that returns two columns called label and big_number and the dashboard will display that result as a big number:

select 'Entries' as label, count(*) as big_number from blog_entry;

Entries: 2804 - an example of a big number display

Bar chart is more sophisticated: return columns named bar_label and bar_quantity to display a bar chart of the results:

select
  to_char(date_trunc('month', created), 'YYYY-MM') as bar_label,
  count(*) as bar_quantity
from
  blog_entry
group by
  bar_label
order by
  count(*) desc

A bar chart showing the result of that query

HTML and Markdown are simpler: they display the rendered HTML or Markdown, after filtering it through the Bleach library to strip any harmful elements or scripts.

select
  '## Ten most recent blogmarks (of ' 
  || count(*) || ' total)'
as markdown from blog_blogmark;

I’m running the dashboard application on this blog, and I’ve set up an example dashboard here that illustrates the different types of widget.

An example dashboard with several different widgets

Defining custom widgets is easy: take the column names you would like to respond to, sort them alphabetically, join them with hyphens and create a custom widget in a template file with that name.

So if you wanted to build a widget that looks for label and geojson columns and renders that data on a Leaflet map, you would create a geojson-label.html template and drop it into your Django templates/django-sql-dashboard/widgets folder. See the custom widgets documentation for details.

Which reminds me: I decided a README wasn’t quite enough space for documentation here, so I started a Read The Docs documentation site for the project.

Datasette and sqlite-utils both use Sphinx and reStructuredText for their documentation.

For django-sql-dashboard I’ve decided to try out Sphinx and Markdown instead, using MyST—a Markdown flavour and parser for Sphinx.

I picked this because I want to add inline help to django-sql-dashboard, and since it ships with Markdown as a dependency already (to power the Markdown widget) my hope is that using Markdown for the documentation will allow me to ship some of the user-facing docs as part of the application itself. But it’s also a fun excuse to try out MyST, which so far is working exactly as advertised.

I’ve seen people in the past avoid Sphinx entirely because they preferred Markdown to reStructuredText, so MyST feels like an important addition to the Python documentation ecosystem.

HTTP Basic authentication

datasette-auth-passwords implements password-based authentication to Datasette. The plugin defaults to providing a username and password login form which sets a signed cookie identifying the current user.

Version 0.4 introduces optional support for HTTP Basic authentication instead—where the user’s browser handles the authentication prompt.

Basic auth has some disadvantages—most notably that it doesn’t support logout without the user entirely closing down their browser. But it’s useful for a number of reasons:

  • It’s easy to protect every resource on a website with it—including static assets. Adding "http_basic_auth": true to your plugin configuration adds this protection, covering all of Datasette’s resources.
  • It’s much easier to authenticate with from automated scripts. curl and requests and httpx all have simple built-in support for passing basic authentication usernames and passwords, which makes it a useful target for scripting—without having to install an additional authentication plugin such as datasette-auth-tokens. (See the sketch below.)
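
For example, here’s a quick sketch of hitting a Basic auth protected Datasette instance from a script using httpx (the URL and credentials are placeholders):

import httpx

response = httpx.get(
    "https://example.com/fixtures/facetable.json",
    auth=("my-username", "my-password"),
)
response.raise_for_status()
print(response.json())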

I’m continuing to flesh out authentication options for Datasette, and adding this to datasette-auth-passwords is one of those small improvements that should pay off long into the future.

A fix for datasette-publish-vercel

Datasette instances published to Vercel using the datasette-publish-vercel plugin have previously been affected by an obscure Vercel bug: characters such as + in the query string were being lost, due to Vercel unescaping encoded characters before the request reached the Python application server.

Vercel fixed this earlier this month, and the latest release of datasette-publish-vercel includes their fix by switching to the new @vercel/python builder. Thanks @styfle from Vercel for shepherding this fix through!

New photos on Niche Museums

My Niche Museums project has been in hibernation since the start of the pandemic. Now that vaccines are rolling out it feels like there might be an end to this thing, so I’ve started thinking about my museum hobby again.

I added some new photos to the site today—on the entries for Novelty Automation, DEVIL-ish Little Things, Evergreen Aviation & Space Museum and California State Capitol Dioramas.

Hopefully someday soon I’ll get to visit and add an entirely new museum!

Releases this week

Weeknotes: tableau-to-sqlite, django-sql-dashboard one month ago

This week I started a limited production run of my new backend for Vaccinate CA calling, built a tableau-to-sqlite import tool and started working on a subset of Datasette for PostgreSQL and Django called django-sql-dashboard.

Vaccinate CA backend progress

My key project at the moment is building out a new Django-powered backend for the Vaccinate CA call reporting application—where real human beings constantly call pharmacies and medical sites around California to build a comprehensive guide to where the Covid vaccine is available.

As of this week, the new backend is running for a subset of the overall call volume. It’s exciting! It’s also a reminder that the single hardest piece of logic in any crowdsourcing-style application is the logic that gives a human being their next task. I’m continuing to evolve that logic, which is somewhat harder when the system I’m modifying is actively being used.

tableau-to-sqlite

The Vaccinate CA project is constantly on the lookout for new sources of data that might indicate locations that have the vaccine. Some of this data is locked up in Tableau dashboards, which are notoriously tricky to scrape.

When faced with problems like this, I frequently turn to GitHub code search: I’ll find a unique looking token in the data I’m trying to wrangle and run searches to see if anyone on GitHub has written code to handle it.

In doing so, I came across Tableau Scraper—an open source Python library by Bertrand Martel which does a fantastic job of turning a Tableau dashboard into a Pandas DataFrame.

Writing a Pandas DataFrame to a SQLite database is a one-liner: df.to_sql("table-name", sqlite3.connect(db_path)). So I spun up a quick command-line wrapper around the TableauScraper class called tableau-to-sqlite which lets you do the following:

% tableau-to-sqlite tableau.db https://results.mo.gov/t/COVID19/views/VaccinationsDashboard/Vaccinations
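
Under the hood it’s not much more than this; a hedged sketch, since the real tool adds CLI handling and a few more options:

import sqlite3

from tableauscraper import TableauScraper

ts = TableauScraper()
ts.loads(
    "https://results.mo.gov/t/COVID19/views/VaccinationsDashboard/Vaccinations"
)
workbook = ts.getWorkbook()

conn = sqlite3.connect("tableau.db")
for worksheet in workbook.worksheets:
    # One SQLite table per Tableau worksheet, named after the worksheet
    worksheet.data.to_sql(worksheet.name, conn, if_exists="replace")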

Considering how much valuable data is trapped in government Tableau dashboards I’m really excited to point this tool at more sources. The README includes tips on combining this with sqlite-utils to get a CSV or JSON export which can then be tracked using Git scraping.

django-sql-dashboard

I’m continuing to ponder the idea of getting Datasette to talk to PostgreSQL in addition to SQLite, but in the meantime I have a growing Django application that runs against PostgreSQL and a desire to build some quick dashboards against it.

One of Datasette’s key features is the ability to bookmark a read-only SQL query and share that link with other people. It’s SQL injection attacks repurposed as a feature, and it’s proved to be incredibly useful over the past few years.

Here’s an example from earlier this week where I wanted to see how many GitHub issues I had opened and then closed within 60 seconds. The answer is 17!

django-sql-dashboard is my highly experimental exploration of what that idea looks like against a PostgreSQL database, wrapped inside a Django application.

The key idea is to support executing read-only PostgreSQL statements with a strict time limit (set using PostgreSQL’s statement_timeout setting, described here). Users can execute SQL directly, bookmark and share queries and save them to a database table in order to construct persistent dashboards.
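
A minimal sketch of that time-limited execution idea, assuming a "dashboard" database alias configured with a read-only PostgreSQL user (the real implementation layers error handling, permissions and result rendering on top of this):

from django.db import connections


def run_dashboard_query(sql, timeout_ms=1000):
    connection = connections["dashboard"]  # assumed read-only connection alias
    with connection.cursor() as cursor:
        # Any statement running longer than timeout_ms is cancelled by PostgreSQL
        cursor.execute("SET statement_timeout = %d" % int(timeout_ms))
        cursor.execute(sql)
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]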

It’s very early days for the project yet, and I’m still not 100% convinced it’s a good idea, but early signs are very promising.

A fun feature is that it lets you have more than one SQL query on the same page. Here’s what it looks like running against my blog’s database, showing a count query and the months in which I wrote the most blog entries:

Two SQL queries are shown: select count(*) from blog_entry; and select to_char(date_trunc('month', created), 'Mon YYYY') as month, count(*) from blog_entry group by month order by count(*) desc

Releases this week

TIL this week

Weeknotes: Datasette and Git scraping at NICAR, VaccinateCA one month ago

This week I virtually attended the NICAR data journalism conference and made a ton of progress on the Django backend for VaccinateCA (see last week).

NICAR 2021

NICAR stands for the National Institute for Computer Assisted Reporting—an acronym that reflects the age of the organization, which started teaching journalists data-driven reporting back in 1989, long before the term “data journalism” became commonplace.

This was my third NICAR and it’s now firmly established itself at the top of the list of my favourite conferences. Every year it attracts over 1,000 of the highest quality data nerds—from data journalism veterans who’ve been breaking stories for decades to journalists who are just getting started with data and want to start learning Python or polish up their skills with Excel.

I presented an hour long workshop on Datasette, which I’m planning to turn into the first official Datasette tutorial. I also got to pre-record a five minute lightning talk about Git scraping.

I published the video and notes for that yesterday. It really seemed to strike a chord at the conference: I showed how you can set up a scheduled scraper using GitHub Actions with just a few lines of YAML configuration, and do so entirely through the GitHub web interface without even opening a text editor.

Pretty much every data journalist wants to run scrapers, and understands the friction involved in maintaining your own dedicated server and crontabs and storage and backups for running them. Being able to do this for free on GitHub’s infrastructure drops that friction down to almost nothing.

The lightning talk led to a last-minute GitHub Actions and Git scraping office hours session being added to the schedule, and I was delighted to have Ryan Murphy from the LA Times join that session to demonstrate the incredible things they have been doing with scrapers and GitHub Actions. You can see some of their scrapers in the datadesk/california-coronavirus-scrapers repo.

VaccinateCA

The race continues to build out a Django backend for the VaccinateCA project, to collect data on vaccine availability from people making calls on that organization’s behalf.

The new backend is getting perilously close to launch. I’m leaning heavily on the Django admin for this, refreshing my knowledge of how to customize it with things like admin actions and custom filters.

It’s been quite a while since I’ve done anything sophisticated with the Django admin and it has evolved a LOT. In the past I’ve advised people to drop the admin for custom view functions the moment they want to do anything out-of-the-ordinary—I don’t think that advice holds any more. It’s got really good over the years!

A very smart thing the team at VaccinateCA did a month ago is to start logging the full incoming POST bodies for every API request handled by their existing Netlify functions (which then write to Airtable).

This has given me an invaluable tool for testing out the new replacement API: I wrote a script which replays those API logs against my new implementation—allowing me to test that every one of several thousand previously recorded API requests will run without errors against my new code.
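
The replay script itself is conceptually very simple. A hedged sketch, with the file format, endpoint and token all placeholders:

import json

import httpx

with open("api_logs.json") as fp:
    logged_requests = json.load(fp)  # e.g. a list of {"path": ..., "post_body": ...}

for logged in logged_requests:
    response = httpx.post(
        "https://new-backend.example.com" + logged["path"],
        json=logged["post_body"],
        headers={"Authorization": "Bearer XXXtokenXXX"},
        timeout=20,
    )
    if response.status_code >= 400:
        print(logged["path"], response.status_code, response.text[:200])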

Since this is so valuable, I’ve written code that will log API requests to the new stack directly to the database. Normally I’d shy away from a database table for logging data like this, but the expected traffic is the low thousands of API requests a day—and a few thousand extra database rows per day is a tiny price to pay for having such a high level of visibility into how the API is being used.

(I’m also logging the API requests to PostgreSQL using Django’s JSONField, which means I can analyze them in depth later on using PostgreSQL’s JSON functionality!)

YouTube subtitles

I decided to add proper subtitles to my lightning talk video, and was delighted to learn that the YouTube subtitle editor pre-populates with an automatically generated transcript, which you can then edit in place to fix up spelling, grammar and remove the various “um” and “so” filler words.

This makes creating high quality captions extremely productive. I’ve also added them to the 17 minute Introduction to Datasette and sqlite-utils video that’s embedded on the datasette.io homepage—editing the transcript for that only took about half an hour.

TIL this week

Elsewhere

14th April 2021

  • Why you shouldn't use ENV variables for secret data (via) I do this all the time, but this article provides a good set of reasons that secrets in environment variables are a bad pattern—even when you know there’s no multi-user access to the host you are deploying to. The biggest problem is that they often get captured by error handling scripts, which may not have the right code in place to redact them. This article suggests using Docker secrets instead, but I’d love to see a comprehensive write-up of other recommended patterns for this that go beyond applications running in Docker. #14th April 2021, 6:22 pm

11th April 2021

  • In general, relying only on natural keys is a nightmare. Double nightmare if it’s PII. Natural keys only work if you are flawlessly omniscient about the domain. And you aren’t.

    Jacques Chester # 11th April 2021, 10:48 pm

5th April 2021

  • Behind GitHub’s new authentication token formats (via) This is a really smart design. GitHub’s new tokens use a type prefix of “ghp_” or “gho_” or a few others depending on the type of token, to help support mechanisms that scan for accidental token publication. A further twist is that the last six characters of the tokens are a checksum, which means token scanners can reliably distinguish a real token from a coincidental string without needing to check back with the GitHub database. “One other neat thing about _ is it will reliably select the whole token when you double click on it”—what a useful detail! #5th April 2021, 9:28 pm
  • Render single selected county on a map (via) Another experiment at the intersection of Datasette and Observable notebooks. This one imports a full Datasette table (3,200 US counties) using streaming CSV and loads that into Observable’s new Search and Table filter widgets. Once you select a single county a second Datasette SQL query (this time returning JSON) fetches a GeoJSON representation of that county, which is then rendered as SVG using D3. #5th April 2021, 4:48 am

4th April 2021

  • Spatialite Speed Test. Part of an excellent series of posts about SpatiaLite from 2012—here John C. Zastrow reports on running polygon intersection queries against a 1.9GB database file in 40 seconds without an index and 0.186 seconds using the SpatialIndex virtual table mechanism. #4th April 2021, 4:28 pm

1st April 2021

  • If you measure things by foot traffic we [the SFO Museum] are one of the busiest museums in the world. If that is the case we are also one of the busiest museums in the world that no one knows about. Nothing in modern life really prepares you for the idea that a museum should be part of an airport. San Francisco, as I’ve mentioned, is funny that way.

    Aaron Straup Cope # 1st April 2021, 10:40 pm

31st March 2021

  • This teaches us that—when it’s a big enough deal—Amazon will lie to us. And coming from the company that runs the production infrastructure for our companies, stores our data, and has been granted an outsized position of trust based upon having earned it over 15 years, this is a nightmare.

    Corey Quinn # 31st March 2021, 4:47 pm

30th March 2021

  • ifconfig.co (via) I really like this: “curl ifconfig.co” gives you your IP address as plain text, “curl ifconfig.co/city” tells you your city according to MaxMind GeoLite2, “curl ifconfig.co/json” gives you all sorts of useful extra data. Suggested rate limit is one per minute, but the code is open source Go that you can run yourself. #30th March 2021, 7:53 pm

29th March 2021

  • Hello, HPy (via) HPy provides a new way to write C extensions for Python in a way that is compatible with multiple Python implementations at once, including PyPy. #29th March 2021, 2:40 pm

25th March 2021

  • sqlite-plus (via) Anton Zhiyanov bundled together a bunch of useful SQLite C extensions for things like statistical functions, unicode string normalization and handling CSV files as virtual tables. The GitHub Actions workflow here is a particularly useful example of compiling SQLite extensions for three different platforms. #25th March 2021, 9:13 pm
  • Why all my servers have an 8GB empty file (via) This is such a good idea: keep an 8GB empty spacer.img file on any servers you operate, purely so that if the server runs out of space you can delete the file and get some breathing room for getting everything else working again. I’ve had servers run out of space in the past and it’s an absolute pain to sort out—this trick would have really helped. #25th March 2021, 8:35 pm
  • Homebrew Python Is Not For You. If you’ve been running into frustrations with your Homebrew Python environments breaking over the past few months (the dreaded “Reason: image not found” error) Justin Mayer has a good explanation. Python in Homebrew is designed to work as a dependency for their other packages, and recent policy changes that they made to support smoother upgrades have had catastrophic effects on those of us who try to use it for development environments. #25th March 2021, 3:14 pm

24th March 2021