Simon Willison’s Weblog

On management 31 covid19 12 dogsheep 26 weeknotes 74 graphql 14 ...

 

Recent entries

Trying to end the pandemic a little earlier with VaccinateCA two days ago

This week I got involved with the VaccinateCA effort. We are trying to end the pandemic a little earlier, by building the most accurate database possible of vaccination locations and availability in California.

VaccinateCA

I’ve been following this project for a while through Twitter, mainly via Patrick McKenzie—here’s his tweet about the project from January 20th.

The core idea is one of those things that sounds obviously correct the moment you hear it. The Covid vaccination roll-out is decentralized and pretty chaotic. VaccinateCA figured out that the best way to figure out where the vaccine is available is to call the places that are distributing it—pharmacies, hospitals, clinics—as often as possible and ask if they have any in stock, who is eligible for the shot and how people can sign up for an appointment.

What We’ve Learned (So Far) by Patrick talks about lessons learned in the first 42 days of the project.

There are three public-facing components to VaccinateCA:

  • www.vaccinateca.com is a website to help you find available vaccines near you.
  • help.vaccinateca is the web app used by volunteers who make calls—it provides a script and buttons to submit information gleaned from the call. If you’re interested in volunteering there’s information on the website.
  • api.vaccinateca is the public API, which is documented here and is also used by the end-user facing website. It provides a full dump of collected location data, plus information on county policies and large-scale providers (pharmacy chains, health care providers).

The system currently mostly runs on Airtable, and takes advantage of pretty much every feature of that platform.

Why I got involved

Jesse Vincent convinced me to get involved. It turns out to be a perfect fit for both my interests and my skills and experience.

I’ve built crowdsourcing platforms before—for MP’s expense reports at the Guardian, and then for conference and event listings with our startup, Lanyrd.

VaccinateCA is a very data-heavy organization: the key goal is to build a comprehensive database of vaccine locations and availability. My background in data journalism and the last three years I’ve spent working on Datasette have given me a wealth of relevant experience here.

And finally… VaccinateCA are quickly running up against the limits of what you can sensibly do with Airtable—especially given Airtable’s hard limit at 100,000 records. They need to port critical tables to a custom PostgreSQL database, while maintaining as much as possible the agility that Airtable has enabled for them.

Django is a great fit for this kind of challenge, and I know quite a bit about both Django and using Django to quickly build robust, scalable and maintainable applications!

So I spent this week starting a Django replacement for the Airtable backend used by the volunteer calling application. I hope to get to feature parity (at least as an API backend that the application can write to) in the next few days, to demonstrate that a switch-over is both possible and a good idea.

What about Datasette?

On Monday I spun up a Datasette instance at vaccinateca.datasette.io (underlying repository) against data from the public VaccinateCA API. The map visualization of all of the locations instantly proved useful in helping spot locations that had incorrectly been located with latitudes and longitudes outside of California.

I hope to use Datasette for a variety of tasks like this, but it shouldn’t be the core of the solution. VaccinateCA is the perfect example of a problem that needs to be solved with Boring Technology—it needs to Just Work, and time that could be spent learning exciting new technologies needs to be spent building what’s needed as quickly, robustly and risk-free as possible.

That said, I’m already starting to experiment with the new JSONField introduced in Django 3.1—I’m hoping that a few JSON columns can help compensate for the lack of flexibility compared to Airtable, which makes it ridiculously easy for anyone to add additional columns.

(To be fair JSONField has been a feature of the Django PostgreSQL Django extension since version 1.9 in 2015 so it’s just about made it into the boring technology bucket by now.)

Also this week

Working on VaccinateCA has given me a chance to use some of my tools in new and interesting ways, so I got to ship a bunch of small fixes, detailed in Releases this week below.

On Friday I gave a talk at Speakeasy JS, "the JavaScript meetup for 🥼 mad science, 🧙‍♂️ hacking, and 🧪 experiments" about why "SQL in your client-side JavaScript is a great idea". The video for that is on YouTube and I plan to provide a full write-up soon.

I also recorded a five minute lightning talk about Git Scraping for next week’s NICAR 2021 data journalism conference.

I also made a few small cosmetic upgrades to the way tags are displayed on my blog—they now show with a rounded border and purple background, and include a count of items published with that tag. My tags page is one example of where I’ve now applied this style.

TIL this week

Releases this week

Cross-database queries in SQLite (and weeknotes) nine days ago

I released Datasette 0.55 and sqlite-utils 3.6 this week with a common theme across both releases: supporting cross-database joins.

Cross-database queries in Datasette

SQLite databases are single files on disk. I really love this characteristic—it makes them easy to create, copy and move around. All you need is a disk volume you can create as many SQLite databases as you like.

A lesser known feature of SQLite is that you can run queries, including joins, across tables from more than one database. The secret sauce is the ATTACH DATABASE command. Run the following SQL:

ATTACH 'other.db' AS other;

And now you can reference tables in that database as other.tablename. You can then join against them, combine them with UNION and generally treat them as if they were another table in your first connected database.

I’ve wanted to add support for cross-database queries to Datasette since May 2018. It took me quite a while to settle on a design—SQLite defaults to only allowing ten databases to be attached together, and I needed to figure out how multiple connected databases would fit with the design of the rest of Datasette.

In the end, I decided on the simplest option that would unlock the feature. Run Datasette with the new --crossdb option and the first ten databases passed to Datasette will be ATTACHed to an in-memory database available at the /_memory URL.

The latest.datasette.io demo now exposes two databases using this feature. Here’s an illustrative example query that performs a UNION across the sqlite_master metadata table in two databases:

select
  'fixtures' as database, *
from
  [fixtures].sqlite_master
union
select
  'extra_database' as database, *
from
  [extra_database].sqlite_master

Try that out here.

Cross-database queries in sqlite-utils

sqlite-utils offers both a Python library and a command-line utility in one package. I’ve added ATTACH support to both.

The Python library support looks like this:

db = Database("first.db")
db.attach("second", "second.db")
# Now you can run queries like this:
cursor = db.execute("""
select * from table_in_first
    union all
select * from second.table_in_second
""")
print(cursor.fetchall())

The command-line tool now has a new --attach option which lets you attach a database using an alias. The equivalent query to the above would look like this:

$ sqlite-utils first.db --attach second second.db '
  select * from table_in_first
    union all
  select * from second.table_in_second'

This defaults to returning results as a JSON array, but you can add --csv or --tsv or other options to get the results back in different output formats.

A cosmetic upgrade to tags on my blog

I noticed that Will Larson’s blog shows little numbers next to the tags indicating how many times they have been used. I really liked that, so I’ve implemented it here as well.

Each entry (and quotation and link) now gets a block in the sidebar that looks like this:

Screenshot showing my new tags, each with a number indicating how many times they have been used

As a long-time fan of faceted search interfaces I really like this upgrade—it helps indicate at a glance the kind of content I have stashed away in my blog’s archive.

Releases this week

  • datasette-json-preview: 0.3—2021-02-20
    Preview of new JSON default format for Datasette
  • sqlite-utils: 3.6—2021-02-19
    Python CLI utility and library for manipulating SQLite databases
  • datasette: 0.55—2021-02-19
    An open source multi-tool for exploring and publishing data
  • datasette-graphql: 1.4—2021-02-18
    Datasette plugin providing an automatic GraphQL API for your SQLite databases
  • higher-lower: 0.1—2021-02-16
    Functions for finding numbers using higher/lower
  • download-tiles: 0.4.1—2021-02-16
    Download map tiles and store them in an MBTiles database

TIL this week

Open source projects should run office hours 10 days ago

Back in December I decided to try something new for my Datasette open source project: Datasette Office Hours. The idea is simple: anyone can book a 25 minute conversation with me on a Friday to talk about the project. I’m interested in talking to people who are using Datasette, or who are considering using it, or who just want to have a chat.

I’ve now had 35 conversations and it’s been absolutely fantastic. I’ve talked to people in Iceland, Burundi, Finland, Singapore, Bulgaria and dozens of other places around the world. I’ve seen my software applied to applications ranging from historic cemetery records to library collections to open city data. It’s been thrilling.

I’d like to encourage more open source project maintainers to consider doing something similar.

Reasons to do this

A challenge of open source is that it’s easy to be starved of feedback. People might file bug reports if something breaks, but other than that it can feel like publishing software into a void.

Hearing directly from people who are using your stuff is incredibly motivational. It’s also an amazing source of ideas and feedback on where the project should go next.

In the startup world “talk to your users and potential customers” is advice that becomes a constant drumbeat… because it’s really effective, but it’s also hard to bring up the courage to do!

Talking to users in open source is similarly valuable. And it turns out, especially in these pandemic times, people really do want to talk to you. Office hours is an extremely low-friction way of putting up a sign that says “let’s have a conversation”.

How I do this: 25 minute slots, via Calendly

I’m using Calendly to make 20 minute slots available every Friday between 9am and 5pm Pacific Time (with 12:30-1:30 set aside for lunch), with a ten minute buffer between slots.

In practice, I treat these as 25 minute slots. This gives me five 5 minutes break in between conversations, and also means it’s possible to stretch to 30 minutes if we get to a key topic just before the time slot ends.

I configured Calendly to allow a maximum of five bookings on any Friday. This feels right to me—conversations with five different people can be pretty mentally tiring, and cutting off after five still gives me a good chance to get other work done during the day.

I use Calendly’s Zoom integration, which automatically sends out a calendar invite to both myself and my conversation partner and schedules a Zoom room that’s linked to from the invite. All I have to do is click the link at the appropriate time.

About one in fifteen conversations ends up cancelled. That’s completely fine—I get half an hour of my day back and we can usually reschedule for another week.

How about making some money?

I’ve been having some fascinating conversations on Twitter recently about the challenges of taking an open source project and turning it into a full-time job, earning a salary good enough to avoid the siren call of working for a FAANG company.

People pointed me to a few good examples of open source maintainers who charge for video conference consulting sessions—Graphile’s Benjie and CSS Wizardry’s Harry Roberts both let you book paid sessions with them directly.

I really like this as an opportunity for earning money against an open source project, and I think it could complement office hours nicely: 25 minutes on a Friday free on a first-come, first-served basis could then up-sell to a 1.5 hours paid consulting session, which could then lead to larger consulting contracts.

Try this yourself

If you’re tempted to try office hours for your own project, getting started is easy. I’m using the Calendly free plan, but their paid plans (which include the ability to attach Stripe or PayPal payments to bookings) are reasonably priced. I’ve been promoting my sessions via Twitter, the datasette.io website and the Datasette Newsletter.

This is one of those ideas I wish I’d had sooner. It’s quickly become a highlight of my week.

Weeknotes: Finally, an intro video for Datasette 16 days ago

My big project this week was this Video introduction to Datasette and sqlite-utils. I recorded the video a few weeks ago in advance of FOSDEM, but this week I put together the annotated version. I’m really happy with it, and I’ve added it to the datasette.io homepage as a starting point for helping people understand the project.

Annotating the video

I’m not a huge watcher of video tutorials—I’m impatient and find myself watching them at double speed and still complaining that a text version would be more efficient. So when I publish my own videos I like to accompany them with a useful text version.

The format I’ve settled on—for this video and for others like Personal Data Warehouses: Reclaiming Your Data—is to use screenshots from the video accompanied by notes, links and code samples. Ideally you can read the text version without watching the video at all, but if you do watch the video the text version can provide extended annotations.

I created this one on macOS using the combination of QuickTime Player and Preview. If you hit Command+C while watching a video in QuickTime Player a PNG snapshot gets copied to your clipboard. Switch to Preview with Command+Tab and hit Command+N to create a new untitled document there containing the image.

I ran through the video creating snapshots of each interesting moment in this way, leaving 40+ open Preview documents called “Untitled”, “Untitled 2” and so on.

When I reached the end of the video I switched back to Preview and used Command+S to save each open document in turn, creating a folder full of images with names like “Untitled 2.png”. These were in the correct order.

I used the Finder “Rename 40 items...” right-click menu option to remove the Untitled prefix.

Then I optimized the PNGs using the tricks described in this TIL.

Next step: create the HTML. I have a template that I use for each “slide” which looks like this:

<div class="slide">
    <img alt="" height="{height}" src="{filename}" width="{width}">
    <div>
        <p>Words</p>
    </div>
</div>

I constructed an Observable notebook which accepts a list of filenames (copied and pasted directly from the Finder) and turns them into a sequence of those HTML templates.

Having built out the HTML framework for the page, the last step was to go through and add the annotations. I did that by editing the HTML directly.

datasette-tiles: OSM v.s. TMS

Matthew Somerville pointed out that my datasette-tiles project (described here previously) had the Y axis flipped from the standard used by Google Maps and OpenStreetMap.

It turns out there are two rival standards for this. TMS—Tile Map Service—is a tile specification developed by the Open Source Geospatial Foundation. It’s used by the MBTiles specification which is why datasette-tiles was using it.

Google Maps and OpenStreetMap do things slightly differently—counting Y from the top of the map instead of the bottom. This has become the de facto standard for web mapping. Tom MacWright has published a useful explainer of the diffference between the two.

I made myself a couple of diagrams to ensure I completely understood how the two tile systems work:

TMS tiles have x=0, y=3 at the top left

OSM tiles have x=0, y=0 at the top left

datasette-tiles 0.6 ships a breaking change that switches the default serving mechanism to the OpenStreetMap system—so /-/tiles/db/z/x/y.png now serves tiles using that coordinate system.

If you want the TMS mechanism, you can use the new /-/tiles-tms/db/z/x/y.png endpoint instead.

Releases this week

TIL this week

Video introduction to Datasette and sqlite-utils 22 days ago

I put together a 17 minute video introduction to Datasette and sqlite-utils for FOSDEM 2021, showing how you can use Datasette to explore data, and demonstrating using the sqlite-utils command-line tool to convert a CSV file into a SQLite database, and then publish it using datasette publish. Here’s the video, plus annotated screen captures with further links and commentary.

Weeknotes: Mostly messing around with map tiles 23 days ago

Most of what I worked on this week was covered in Serving map tiles from SQLite with MBTiles and datasette-tiles. I built two new plugins: datasette-tiles for serving map tiles, and datasette-basemap which bundles map tiles for zoom levels 0-6 of OpenStreetMap. I also released download-tiles for downloading tiles and bundling them into an MBTiles database.

sqlite-utils 3.4.1

I added one new feature to sqlite-utils: the sqlite-utils import command can now be configured to read CSV files using alternative delimiters, by passing the --delimiter option or the --quotechar option.

This is covered in the documentation, which provides the following example:

name;description
Cleo;|Very fine; a friendly dog|
Pancakes;A local corgi

Imported using:

sqlite-utils insert dogs.db dogs dogs.csv \
  --delimiter=";" --quotechar="|"

Datasette 0.54.1

I spotted a subtle but nasty regression in Datasette: a change I made to how hidden form fields worked on the table page meant that clearing the _search search input and re-submitting the form didn’t take effect, and the search would persist. Datasette 0.54.1 fixes that bug.

Releases this week

  • datasette-jellyfish: 1.0.1—2021-02-06
    Datasette plugin adding SQL functions for fuzzy text matching powered by Jellyfish
  • sqlite-utils: 3.4.1—2021-02-06
    Python CLI utility and library for manipulating SQLite databases
  • datasette-tiles: 0.5—2021-02-04
    Mapping tile server for Datasette, serving tiles from MBTiles packages
  • download-tiles: 0.4—2021-02-03
    Download map tiles and store them in an MBTiles database
  • datasette-basemap: 0.2—2021-02-02
    A basemap for Datasette and datasette-leaflet
  • datasette: 0.54.1—2021-02-02
    An open source multi-tool for exploring and publishing data
  • datasette-cluster-map: 0.17.1—2021-02-01
    Datasette plugin that shows a map for any data with latitude/longitude columns
  • datasette-leaflet: 0.2.2—2021-02-01
    Datasette plugin adding the Leaflet JavaScript library

TIL this week

Elsewhere

Yesterday

  • How I cut GTA Online loading times by 70% (via) Incredible debugging war story: t0st was fed up of waiting six minutes (!) for GTA Online to load on their PC, so they used a host of devious debugging tricks to try and figure out what was going on. It turned out the game was loading a 10MB JSON file detailing all of the available in-game purchases, but inefficient JSON parsing meant it was pegging an entire CPU for 4 minutes mainly running the strlen() C function. Despite not having access to source code or debugging symbols t0st figured out the problem and managed to inject a custom DLL that hooked some internal functions and dropped load times down from 6m down to to 1m50s! #1st March 2021, 7:12 pm

27th February 2021

  • unasync (via) Today I started wondering out loud if one could write code that takes an asyncio Python library and transforms it into the synchronous equivalent by using some regular expressions to strip out the “await ...” keywords and suchlike. Turns out that can indeed work, and Ratan Kulshreshtha built it! unasync uses the standard library tokenize module to run some transformations against an async library and spit out the sync version automatically. I’m now considering using this for sqlite-utils. #27th February 2021, 10:20 pm
  • cosmopolitan libc (via) “Cosmopolitan makes C a build-once run-anywhere language, similar to Java, except it doesn’t require interpreters or virtual machines be installed beforehand. [...] Instead, it reconfigures stock GCC to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + BIOS with the best possible performance and the tiniest footprint imaginable.” This is a spectacular piece of engineering. #27th February 2021, 6:02 am

26th February 2021

23rd February 2021

  • When building a tool, it’s easy to forget how much you’ve internalized: how much knowledge and context you’ve assumed. Your tool can feel familiar or even obvious to you while being utterly foreign to everyone else. If your goal is for other people to use the darn thing — meaning you’re not just building for yourself, or tinkering for its own sake (which are totally valid reasons) — you gotta help people use it! It doesn’t matter what’s possible or what you intended; all that matters is whether people actually succeed in practice.

    Mike Bostock # 23rd February 2021, 10:55 pm

  • I strongly suspect that the single most impactful thing I did during my 5+ years at Linden Lab was shortly before I left: set up a weekly meeting between a couple of leads from Support and Engineering to go over the top 10 support issues.

    Yoz Grahame # 23rd February 2021, 4:49 am

22nd February 2021

  • Fuzzy Name Matching in Postgres. Paul Ramsey describes how to implement fuzzy name matching in PostgreSQL using the fuzzystrmatch extension and its levenshtein() and soundex() functions, plus functional indexes to query against indexed soundex first and then apply slower Levenshtein. The same tricks should also work against SQLite using the datasette-jellyfish plugin. #22nd February 2021, 9:16 pm
  • Business rules engines are li’l Conway’s Law devices: a manifestation of the distrust between stakeholders, client and contractor. We require BREs so that separate business units need not talk to each other to solve problems. They are communication and organizational dysfunction made silicon.

    Paul Smith # 22nd February 2021, 8:34 pm

  • Blazing fast CI with pytest-split and GitHub Actions (via) pytest-split is a neat looking variant on the pattern of splitting up a test suite to run different parts of it in parallel on different machines. It involves maintaining a periodically updated JSON file in the repo recording the average runtime of different tests, to enable them to be more fairly divided among test runners. Includes a recipe for running as a matrix in GitHub Actions. #22nd February 2021, 7:06 pm
  • People, processes, priorities. Twitter thread from Adrienne Porter Felt outlining her model for thinking about engineering management. I like this trifecta of “people, processes, priorities” a lot. #22nd February 2021, 5:21 pm

19th February 2021

  • One of the hardest things I’ve had to learn is that humans aren’t pure functions: an input that works one day and gets one result, then again another day and get an entirely different result.

    Sarah Drasner # 19th February 2021, 12 am

11th February 2021

  • Litestream runs continuously on a test server with generated load and streams backups to S3. It uses physical replication so it’ll actually restore the data from S3 periodically and compare the checksum byte-for-byte with the current database.

    Ben Johnson # 11th February 2021, 8:50 pm

  • trustme (via) This looks incredibly useful. Run “python -m trustme” and it will create three files for you: server.pem, server.key and a client.pem client certificate, providing a certificate for “localhost” (or another host you spefict) using a fake certificate authority. Looks like it should be the easiest way to test TLS locally. #11th February 2021, 8 pm
  • Why I Built Litestream. Litestream is a really exciting new piece of technology by Ben Johnson, who previously built BoltDB, the key-value store written in Go that is used by etcd. It adds replication to SQLite by running a process that converts the SQLite WAL log into a stream that can be saved to another folder or pushed to S3. The S3 option is particularly exciting—Ben estimates that keeping a full point-in-time recovery log of a high write SQLite database should cost in the order of a few dollars a month. I think this could greatly expand the set of use-cases for which SQLite is sensible choice. #11th February 2021, 7:25 pm

10th February 2021

  • Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies (via) Alex Birsan describes a new category of security vulnerability he discovered in the npm, pip and gem packaging ecosystems: if a company uses a private repository with internal package names, uploading a package with the same name to the public repository can often result in an attacker being able to execute their own code inside the networks of their target. Alex scored over $130,000 in bug bounties from this one, from a number of name-brand companies. Of particular note for Python developers: the --extra-index-url argument to pip will consult both public and private registries and install the package with the highest version number! #10th February 2021, 8:42 pm

8th February 2021

  • Finally, remember that whatever choice is made, you’re going to need to get behind it! You should be able to make a compelling positive case for any of the options you present. If there’s an option you can’t support, don’t present it.

    Jacob Kaplan-Moss # 8th February 2021, 3:21 pm

3rd February 2021

  • Cleaning Up Your Postgres Database (via) Craig Kerstiens provides some invaluable tips on running an initial check of the health of a PostgreSQL database, by using queries against the pg_statio_user_indexes table to find the memory cache hit ratio and the pg_stat_user_tables table to see what percentage of queries to your tables are using an index. #3rd February 2021, 7:32 am