Simon Willison’s Weblog

On python, github, asgi, sql, scaling, ...

 

Recent entries

Weeknotes: Covid-19, First Python Notebook, more Dogsheep, Tailscale nine hours ago

My covid-19.datasettes.com project publishes information on COVID-19 cases around the world. The project started out using data from Johns Hopkins CSSE, but last week the New York Times started publishing high quality USA county- and state-level daily numbers to their own repository. Here’s the change that added the NY Times data.

It’s very easy to use this data to accidentally build misleading things. I’ve been updating the README with links about this—my current favourite is Why It’s So Freaking Hard To Make A Good COVID-19 Model by Maggie Koerth, Laura Bronner and Jasmine Mithani at FiveThirtyEight.

First Python Notebook

Ben Welsh from the LA Times teaches a course called First Python Notebook at journalism conferences such as NICAR. He ran a free online version of the course last weekend, and I offered to help out as a TA.

Most of the help I provided came before the course: Ben asked attendees to confirm that they had working installations of Python 3 and pipenv, and if they didn’t volunteers such as myself would step in to help. I had Zoom and email conversations with at least ten people to help them get their environments into shape.

This XKCD neatly summarizes the problem:

XKCD Python Environments

One of the most common problems I had to debug was PATH issues: people had installed the software, but due to various environmental differences python3 and pipenv weren’t available on the PATH. Talking people through the obscurities of creating a ~/.bashrc file and using it to define a PATH override really helps emphasize how arcane this kind of knowledge is.

I enjoyed this comment:

“Welcome to intro to Tennis. In the first two weeks, we’ll discuss how to rig a net and resurface a court.”—Claus Wilke

Ben’s course itself is hands down the best introduction to Python from a Data Journalism perspective I have ever seen. Within an hour of starting the students are using Pandas in a Jupyter notebook to find interesting discrepancies in California campaign finance data.

If you want to check it out yourself, the entire four hour workshop is now on YouTube and closely follows the material on firstpythonnotebook.org.

Coronavirus Diary

We are clearly living through a notable and very painful period of history right now. On the 19th of March (just under two weeks ago, but time is moving both really fast and incredibly slowly right now) I started a personal diary—something I’ve never done before. It lives in an Apple Note and I’m adding around a dozen paragraphs to it every day. I think it’s helping. I’m sure it will be interesting to look back on in a few years time.

Dogsheep

Much of my development work this past week has gone into my Dogsheep suite of tools for personal analytics.

  • I upgraded the entire family of tools for compatibility with sqlite-utils 2.x.
  • pocket-to-sqlite got a major upgrade: it now fetches items using Pocket’s API pagination (previously it just tried to pull in 5,000 items in one go) and has the ability to only fetch new items. As a result I’m now running it from cron in my personal Dogsheep instance, so “Save to Pocket” is now my preferred Dogsheep-compatible way of bookmarking content.
  • twitter-to-sqlite got a couple of important new features in release 0.20. I fixed a nasty bug in the --since flag where retweets from other accounts could cause new tweets from an account to be ignored. I also added a new count_history table which automatically tracks changes to a Twitter user’s friend, follower and listed counts over time (#40). A rough sketch of the idea follows just after this list.
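
Here’s a rough sketch of the idea behind that count_history table. It is not the plugin’s actual schema or code, just the general pattern of snapshotting counts with a timestamp using sqlite-utils:

    import datetime
    import sqlite_utils

    db = sqlite_utils.Database("twitter.db")

    def record_count_history(profile):
        # profile is a Twitter API user dict with followers_count,
        # friends_count and listed_count fields
        when = datetime.datetime.utcnow().isoformat()
        for count_type in ("followers", "friends", "listed"):
            db["count_history"].insert(
                {
                    "type": count_type,
                    "user": profile["screen_name"],
                    "datetime": when,
                    "count": profile["{}_count".format(count_type)],
                },
                pk=("type", "user", "datetime"),
                replace=True,
            )

Run that against the same accounts on a schedule and you can graph how the numbers change over time.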

I’m also now using Dogsheep for some journalism! I’m working with the Big Local News team at Stanford to help track and archive tweets by a number of different US politicians and health departments relating to the ongoing pandemic. This collaboration resulted in the above improvements to twitter-to-sqlite.

Tailscale

My personal Dogsheep is currently protected by client certificates, so only my personal laptop and iPhone (with the right certificates installed) can connect to the web server it is running on.

I spent a bit of time this week playing with Tailscale, and I’m really impressed by it.

Tailscale is a commercial company built on top of WireGuard, the new approach to VPN tunnels which just got merged into the Linux 5.6 kernel. Tailscale first caught my attention in January when they hired Brad Fitzpatrick.

WireGuard lets you form a private network by having individual hosts exchange public/private keys with each other. Tailscale provides software which manages those keys for you, making it trivial to set up a private network between different nodes.

How trivial? It took me less than ten minutes to get a three-node private network running between my iPhone, laptop and a Linux server. I installed the iPhone app, the Ubuntu package, the OS X app, signed them all into my Google account and I was done.

Each of those devices now has an additional IP address in the 100.x range which they can use to talk to each other. Tailscale guarantees that the IP address will stay constant for each of them.

Since the network is public/private key encrypted between the nodes, Tailscale can’t see any of my traffic—they’re purely acting as a key management mechanism. And it’s free: Tailscale charge for networks with multiple users, but a personal network like this is free of charge.

I’m not running my own personal Dogsheep on it yet, but I’m tempted to switch over. I’d love other people to start running their own personal Dogsheep instances but I’m paranoid about encouraging this when securing them is so important. Tailscale looks like it might be a great solution for making secure personal infrastructure more easily and widely available.

Weeknotes: Datasette 0.39 and many other projects eight days ago

This week’s theme: Well, I’m not going anywhere. So a ton of progress to report on various projects.

Datasette 0.39

This evening I shipped Datasette 0.39. The two big features are a mechanism for setting the default sort order for tables and a new base_url configuration setting.

You can see the new default sort order in action on my Covid-19 project—the daily reports now default to sort by day descending so the most recent figures show up first. Here’s the metadata that makes it happen, and here’s the new documentation.
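
The mechanism is a new pair of table metadata keys, sort and sort_desc. Here’s a minimal metadata.json sketch, with illustrative database and table names rather than the exact ones used by the Covid-19 project:

    import json

    metadata = {
        "databases": {
            "covid": {
                "tables": {
                    "daily_reports": {
                        # Default to sorting by the day column, newest first
                        "sort_desc": "day"
                    }
                }
            }
        }
    }

    with open("metadata.json", "w") as fp:
        json.dump(metadata, fp, indent=4)

Start Datasette with datasette covid.db -m metadata.json and the table will sort that way by default.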

I had to do some extra work on that project this morning when the underlying data changed its CSV column headings without warning.

The base_url feature has been an open issue since January 2019. It lets you run Datasette behind a proxy on a different URL prefix—/tools/datasette/ for example. The trigger for finally getting this solved was a Twitter conversation about running Datasette on Binder in coordination with a Jupyter notebook.

Tony Hirst did some work on this last year, but was stumped by the lack of a base_url equivalent. Terry Jones shared an implementation in December. I finally found the inspiration to pull it all together, and ended up with a working fork of Tony’s project which does indeed launch Datasette on Binder—try launching your own here.

github-to-sqlite

I’ve not done much work on my Dogsheep family of tools in a while. That changed this week: in particular, I shipped a 1.0 of github-to-sqlite.

As you might expect, it’s a tool for importing GitHub data into a SQLite database. Today it can handle repositories, releases, release assets, commits, issues and issue comments. You can see a live demo built from Dogsheep organization data at github-to-sqlite.dogsheep.net (deployed by this GitHub action).

I built this tool primarily to help me better keep track of all of my projects. Pulling the issues into a single database means I can run queries against all open issues across all of my repositories, and importing commits and releases is handy for when I want to write my weeknotes and need to figure out what I’ve worked on lately.
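
As an example of the kind of query this enables, here’s a sketch that lists open issues across every imported repository. The column names are my best guess at the github-to-sqlite schema, so check the database in Datasette if they differ:

    import sqlite3

    conn = sqlite3.connect("github.db")
    sql = """
        select repos.full_name, issues.number, issues.title
        from issues
        join repos on issues.repo = repos.id
        where issues.state = 'open'
        order by issues.updated_at desc
    """
    # Print every open issue across all imported repositories
    for full_name, number, title in conn.execute(sql):
        print("{}#{}: {}".format(full_name, number, title))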

datasette-render-markdown

GitHub issues use Markdown. To correctly display them it’s useful to be able to render that Markdown. I built datasette-render-markdown back in November, but this week I made some substantial upgrades: you can now configure which columns should be rendered, and it includes support for Markdown extensions including GitHub-Flavored Markdown.
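
Configuration happens through plugin metadata. Something like this, a sketch based on my reading of the plugin’s README with placeholder database, table and column names, restricts rendering to a single column and enables the GitHub-Flavored Markdown extension:

    metadata = {
        "databases": {
            "github": {
                "tables": {
                    "issues": {
                        "plugins": {
                            "datasette-render-markdown": {
                                # Only render this column as Markdown
                                "columns": ["body"],
                                # GitHub-Flavored Markdown via the mdx_gfm extension
                                # (tables and the like may also need extra allowed tags)
                                "extensions": ["mdx_gfm"],
                            }
                        }
                    }
                }
            }
        }
    }

Save that structure as metadata.json and pass it to Datasette with -m.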

You can see it in action on the github-to-sqlite demo.

I also upgraded datasette-render-timestamps with the same explicit column configuration pattern.

datasette-publish-fly

Fly is a relatively new hosting provider which lets you host applications bundled as Docker containers in load-balanced data centers geographically close to your users.

It has a couple of characteristics that make it a really good fit for Datasette.

Firstly, the pricing model: Fly will currently host a tiny (128MB of RAM) container for $2.67/month—and they give you $10/month of free service credit, enough for 3 containers.

It turns out Datasette runs just fine in 128MB of RAM, so that’s three always-on Datasette containers! (Unlike Heroku and Cloud Run, Fly keeps your containers running rather than scaling them to zero).

Secondly, it works by shipping it a Dockerfile. This means building datasette publish support for it is really easy.

I added the publish_subcommand plugin hook to Datasette all the way back in 0.25 in September 2018, but I’ve never actually built anything with it. That’s now changed: datasette-publish-fly uses the hook to add a datasette publish fly command for publishing databases directly to your Fly account.
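
The hook itself is pleasantly small to use. Here’s a minimal sketch of registering a new publish target. This is not the Fly implementation, and the option shown is made up:

    from datasette import hookimpl
    import click

    @hookimpl
    def publish_subcommand(publish):
        # "publish" is the click group behind "datasette publish";
        # adding a command to it adds a new publish target
        @publish.command()
        @click.argument("files", type=click.Path(exists=True), nargs=-1)
        @click.option("--app", help="Name of the app to deploy (hypothetical option)")
        def myhost(files, app):
            # A real implementation generates a Dockerfile bundling the
            # database files with Datasette, then hands it to the hosting
            # provider's own build-and-deploy machinery
            click.echo("Would deploy {} as {}".format(files, app))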

hacker-news-to-sqlite

It turns out I created my Hacker News account in 2007, and I’ve posted 2,167 comments and submitted 131 stories since then. Since my personal Dogsheep project is about pulling my data from multiple sources into a single place it made sense to build a tool for importing from Hacker News.

hacker-news-to-sqlite uses the official Hacker News API to import every comment and story posted by a specific user. It can also take one or more item IDs and suck in the entire discussion tree around those items.
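
The API itself is refreshingly simple. Here’s a rough sketch of the core fetching loop, leaving out the comment-tree recursion and the SQLite writes that the real tool performs:

    import requests

    BASE = "https://hacker-news.firebaseio.com/v0"

    def fetch_user_items(username):
        # The user endpoint returns a "submitted" list of item IDs covering
        # both stories and comments; fetch each item individually
        user = requests.get("{}/user/{}.json".format(BASE, username)).json()
        for item_id in user.get("submitted", []):
            yield requests.get("{}/item/{}.json".format(BASE, item_id)).json()

    for item in fetch_user_items("simonw"):
        print(item["type"], item.get("title") or item.get("text", "")[:60])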

The README includes detailed documentation on how to best browse your data using Datasette once you have imported it.

Other projects

Weeknotes: this week was absurd 15 days ago

As of this morning, San Francisco is in a legally mandated shelter-in-place. I can hardly remember what life was like seven days ago. It’s been a very long, very crazy week. This was not a great week for getting stuff done.

So very short weeknotes today.

  • I started work on datasette-edit-tables, a plugin that will eventually allow Datasette users to modify tables—add columns, change the types of existing columns, rename tables and suchlike. So far all it offers is a delete table button.
  • I released sqlite-utils 2.4.2 (and 2.4.1 before it) with a couple of bug fixes. Notably it does a better job detecting the types of different existing SQLite columns—it turns out SQLite supports a wide range of cosmetic column types.
  • I did a bit of work on my github-contents library, which aims to make it as easy as possible to write code that updates text stored in a GitHub repository. I want to be able to use it to programmatically create pull requests (so I can add a visual editor to my cryptozoology crowdsourcing project). I added a branch_exists() method and I’m working on being able to commit to a branch other than master.

Natalie and I are hunkering down for the long run here in San Francisco, attempting to stay mentally healthy through aggressive use of Zoom and Google Hangouts. Best of luck to everyone out there getting through this thing.

Weeknotes: COVID-19 numbers in Datasette 22 days ago

COVID-19, the disease caused by the novel coronavirus, gets more terrifying every day. Johns Hopkins Center for Systems Science and Engineering (CSSE) have been collating data about the spread of the disease and publishing it as CSV files on GitHub.

This morning I used the pattern described in Deploying a data API using GitHub Actions and Cloud Run to set up a scheduled task that grabs their data once an hour and publishes it to https://covid-19.datasettes.com/ as a table in Datasette.

If you’re not yet concerned about COVID-19 you clearly haven’t been paying attention to what’s been happening in Italy. Here’s a query which shows a graph of the number of confirmed cases in Italy over the past few weeks (using datasette-vega):

COVID-19 confirmed cases in Italy, spiking up to 10,149

155 cases 17 days ago to 10,149 cases today is really frightening. And the USA still doesn’t have robust testing in place, so the numbers here are likely to really shock people once they start to become more apparent.

If you’re going to use the data in covid-19.datasettes.com for anything please be responsible with it and read the warnings in the README file in detail: it’s important to fully understand the sources of the data and how it is being processed before you use it to make any assertions about the spread of COVID-19.

My favourite resource to understand Coronavirus and what we should be doing about it is flattenthecurve.com, compiled by Julie McMurry, an assistant professor at Oregon State University College of Public Health. I strongly recommend checking it out.

Other projects

I’ve worked on a bunch of other projects this week, some of which were inspired by my time at NICAR.

  • fec-to-sqlite is a script for saving FEC campaign finance filings to a SQLite database. Since those filings are pulled in via HTTP and can get pretty big, it uses a neat trick to generate a progress bar with the tqdm library—it initiates a progress bar with the Content-Length of the incoming file, then as it iterates over the lines coming in over HTTP it uses the length of each line to update that bar. A sketch of this trick follows just after this list.
  • datasette-search-all is a new plugin that enables search across multiple FTS-enabled SQLite tables at once. I wrote more about that in this blog post on Monday.
  • datasette-column-inspect is an extremely experimental plugin that tries out a “column inspector” tool for Datasette tables—click on a column heading and the plugin shows you interesting facts about that column, such as the min/mean/max/stdev, any outlying values, the most common values and the least common values. Screenshots below. This prototype came about as part of a JSK team project for the Designing Machine Learning course at Stanford—we were thinking about ways in which machine learning could help journalists find stories in large datasets. The prototype doesn’t have any machine learning in it—just some simple statistics to identify outliers—but it’s meant to illustrate how a tool that exposes machine learning insights against tabular data might work.
  • github-to-sqlite grew a new sub-command: github-to-sqlite commits github.db simonw/datasette—which imports information about commits to a repository (just the author and commit message, not the body of the commit itself). I’m running a private version of this against all of my projects, which is really useful for seeing what I worked on over the past week when writing my weeknotes.
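
Here’s a sketch of that fec-to-sqlite progress bar trick. It is simplified, with made-up file handling; the real tool feeds the lines to an FEC filing parser rather than writing them back to disk:

    import requests
    from tqdm import tqdm

    def download_with_progress(url, destination):
        response = requests.get(url, stream=True)
        # Size the bar using the Content-Length header...
        progress = tqdm(
            total=int(response.headers["Content-Length"]),
            unit="B",
            unit_scale=True,
        )
        with open(destination, "wb") as fp:
            # ...then advance it by the length of each line as it streams in.
            # iter_lines() strips the newline, so the count is approximate.
            for line in response.iter_lines():
                fp.write(line + b"\n")
                progress.update(len(line) + 1)
        progress.close()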

Here are two screenshots of datasette-column-inspect in action. You can try out a live demo of the plugin over here.

Outliers in number of appearances in the Avengers: Iron Man, Captain America, Spider-Man and Wolverine

Column summary for states in actions_under_antiquities_act

datasette-search-all: a new plugin for searching multiple Datasette tables at once 24 days ago

I just released a new plugin for Datasette, and it’s pretty fun. datasette-search-all is a plugin written mostly in JavaScript that executes the same search query against every searchable table in every database connected to your Datasette instance.

You can try it out on my FARA (Foreign Agents Registration Act) search site, fara.datasettes.com—see Deploying a data API using GitHub Actions and Cloud Run for background on that project.

Here’s a search for manafort across all four FARA tables (derived from CSVs originally pulled from the Justice Department bulk data site).

Running a search for manafort

I’ve been planning to build cross-table search for Datasette for quite a while now. It’s a potentially very difficult problem: searching a single table is easy, but the moment you attempt to search multiple tables you run into a number of challenges:

  • Different tables have different columns. How do you present those in a single search interface?
  • Full-text search relevance scores make sense within a single table (due to the statistics they rely on, see Exploring search relevance algorithms with SQLite) but cannot be compared across multiple tables.

I have an idea for how I can address these, but it involves creating a single separate full-text index table that incorporates text from many different tables, along with a complex set of indexing mechanisms (maybe driven by triggers) for keeping it up to date.

But maybe I was overthinking this?

While I stewed on the ideal way to solve this problem, projects like my FARA site were stuck without cross-table search. Then this morning I realized that there was another way: I could build pretty much the simplest thing that could possibly work (always a good plan in my experience).

Here’s how the new plugin works: it scans through every table attached to Datasette looking for tables that are configured for full-text search. Then it presents a UI which can execute searches against ALL of those tables, and present the top five results from each one.

The scanning-for-searchable-tables happens in Python, but the actual searching is all in client-side JavaScript. The searches run in parallel, which means the user sees results from the fastest (smallest) tables first, then the larger, slower tables drop in at the bottom.
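
The plugin does this in client-side JavaScript, but the same idea is easy to express in Python against Datasette’s JSON API. Here’s a sketch; the base URL and table list are placeholders, and it assumes the httpx library is installed:

    import asyncio
    import httpx

    async def search_all(base_url, tables, q):
        async with httpx.AsyncClient() as client:

            async def search_one(database, table):
                # Any FTS-enabled table supports ?_search= via the JSON API
                url = "{}/{}/{}.json".format(base_url, database, table)
                response = await client.get(url, params={"_search": q, "_size": 5})
                return table, response.json()["rows"]

            # Run every per-table search concurrently
            return await asyncio.gather(
                *(search_one(database, table) for database, table in tables)
            )

    results = asyncio.run(search_all(
        "https://fara.datasettes.com",
        [("fara", "FARA_All_Registrants")],  # placeholder table list
        "manafort",
    ))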

It’s stupidly simple, but I really like the result. It’s also a neat demonstration of running parallel SQL queries from JavaScript, a technique which I’m keen to apply to all sorts of other interesting problems.

JavaScript style

The JavaScript I wrote for this project is unconventional for 2020: it’s a block of inline script on the page, using no libraries or frameworks, but taking advantage of modern niceties like backtick template literals and fetch().

The code is messy, short and extremely easy to change in the future. It doesn’t require running a build tool. I’m pretty happy with it.

Adding a search form to the homepage

The other thing the plugin does is add a search box to the Datasette homepage (as seen on the FARA site)—but only if the attached databases contain at least one FTS-configured searchable table.

There are two parts to the implementation here. The first is an extra_template_vars() plugin hook which injects a searchable_tables variable into the homepage context—code here.
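
Here’s a minimal sketch of that hook. The real plugin inspects every attached database for FTS-configured tables; this version just returns a hard-coded placeholder list:

    from datasette import hookimpl

    @hookimpl
    def extra_template_vars(view_name):
        # Only add the variable to the homepage template context
        if view_name == "index":
            return {"searchable_tables": ["registrants", "filings"]}  # placeholders
        return {}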

The second is a custom index.html template which ships with the plugin. When Datasette renders a template it looks first in the local --template-dir folder (if that option was used), then in templates provided by installed plugins, and finally in Datasette’s own set of default templates.

The new index.html template starts with {% extends "default:index.html" %}, which means it extends the default template that shipped with Datasette. It then redefines the description_source_license block from that template to conditionally show the search form.

I’m not at all happy with abusing description_source_license in this way—it just happens to be a block located at the top of that page. As I write more plugins that customize the Datasette UI in some way I continually run into this problem: plugins need to add markup to pages at specific points, but they also need to do so in a way that won’t override what other plugins are up to.

I’m beginning to formulate an idea for how Datasette can better support this, but until that’s ready I’ll be stuck with hacks like the one used here.

Using this with datasette-configure-fts and datasette-upload-csvs

The datasette-configure-fts plugin provides a simple UI for configuring search for different tables, by selecting which columns should be searchable.

Combining this with datasette-search-all is really powerful. It means you can dump a bunch of CSVs into Datasette (maybe using datasette-upload-csvs), select some columns and then run searches across all of those different data sources in one place.

Not bad for 93 lines of JavaScript and a bit of Python glue!

Weeknotes: datasette-ics, datasette-upload-csvs, datasette-configure-fts, asgi-csrf 29 days ago

I’ve been preparing for the NICAR 2020 Data Journalism conference this week, which has led me into a flurry of activity across a plethora of different projects and plugins.

datasette-ics

NICAR publish their schedule as a CSV file. I couldn’t resist loading it into a Datasette on Glitch, which inspired me to put together a plugin I’ve been wanting for ages: datasette-ics, a register_output_renderer() plugin that can produce a subscribable iCalendar file from an arbitrary SQL query.

It’s based on datasette-atom and works in a similar way: you construct a query that outputs a required set of columns (event_name and event_dtstart as a minimum), then add the .ics extension to get back an iCalendar file.

You can optionally also include event_dtend, event_duration, event_description, event_uid and most importantly event_tz, which can contain a timezone string. Figuring out how to handle timezones was the fiddliest part of the project.
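
In other words the plugin only cares about column names, so any query can become a calendar by aliasing its columns. A hypothetical example, with a made-up sessions table:

    import sqlite3

    conn = sqlite3.connect("nicar.db")
    sql = """
        select
            title as event_name,
            start_time as event_dtstart,
            end_time as event_dtend,
            description as event_description,
            'America/Chicago' as event_tz
        from sessions
        order by start_time
    """
    for row in conn.execute(sql):
        print(row)

Save a query like that as a canned query and adding the .ics extension to its URL gives you a subscribable calendar.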

If you’re going to NICAR, subscribe to https://nicar-2020.glitch.me/data/calendar.ics in a calendar application to get the full 261 item schedule.

If you just want to see what the iCalendar feed looks like, add ?_plain=1 to preview it with a text/plain content type: https://nicar-2020.glitch.me/data/calendar.ics?_plain=1—and here’s the SQL query that powers it.

datasette-upload-csvs

My work on Datasette Cloud is inspiring all kinds of interesting work on plugins. I released datasette-upload-csvs a while ago, but now that Datasette has official write support I’ve been upgrading the plugin to hopefully achieve its full potential.

In particular, I’ve been improving its usability. CSV files can be big—and if you’re uploading 100MB of CSV it’s not particularly reassuring if your browser just sits for a few minutes spinning on the status bar.

So I added two progress bars to the plugin. The first is a client-side progress bar that shows you the progress of the initial file upload. I used the XMLHttpRequest pattern (and the drag-and-drop recipe) from Joseph Zimmerman’s useful article How To Make A Drag-and-Drop File Uploader With Vanilla JavaScript, since fetch() doesn’t reliably report upload progress just yet.

I’m using Starlette and asyncio so uploading large files doesn’t tie up server resources in the same way that it would if I was using processes and threads.

The second progress bar relates to server-side processing of the file: churning through 100,000 rows of CSV data and inserting them into SQLite can take a while, and I wanted users to be able to see what was going on.

Here’s an animated screenshot of how the interface looks now:

Uploading a CSV

Implementing this was trickier. In the end I took advantage of the new dedicated write thread made available by datasette.execute_write_fn()—since that thread has exclusive access to write to the database, I create a SQLite table called _csv_progress_ and write a new record to it every 10 rows. I use the number of bytes in the CSV file as the total and track how far through that file Python’s CSV parser has got using file.tell().

It seems to work really well. The full server-side code is here—the progress bar itself then polls Datasette’s JSON API for the record in the _csv_progress_ table.
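
Here’s a simplified sketch of the server-side idea, using plain sqlite3 rather than Datasette’s dedicated write thread. The table names and schema here are illustrative, not the plugin’s actual code:

    import csv
    import io
    import os
    import sqlite3

    def import_csv_with_progress(db_path, csv_path, table_name):
        total = os.path.getsize(csv_path)
        conn = sqlite3.connect(db_path)
        conn.execute(
            "create table if not exists _csv_progress_ "
            "(filename text primary key, total integer, bytes_done integer)"
        )
        conn.execute(
            "replace into _csv_progress_ values (?, ?, 0)", (csv_path, total)
        )
        # Keep a handle on the underlying binary file so we can call tell() on
        # it; the value is approximate because TextIOWrapper reads ahead, but
        # it is close enough for a progress bar
        raw = open(csv_path, "rb")
        reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
        headers = next(reader)
        conn.execute(
            "create table if not exists [{}] ({})".format(
                table_name, ", ".join("[{}] text".format(h) for h in headers)
            )
        )
        insert_sql = "insert into [{}] values ({})".format(
            table_name, ", ".join("?" for _ in headers)
        )
        for i, row in enumerate(reader, 1):
            conn.execute(insert_sql, row)
            if i % 10 == 0:
                conn.execute(
                    "update _csv_progress_ set bytes_done = ? where filename = ?",
                    (raw.tell(), csv_path),
                )
                conn.commit()
        conn.commit()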

datasette-configure-fts

SQLite ships with a decent implementation of full-text search. Datasette knows how to tell if a table has been configured for full-text search and adds a search box to the table page, documented here.

datasette-configure-fts is a new plugin that provides an interface for configuring search against existing SQLite tables. Under the hood it uses the sqlite-utils full-text search methods to configure the table and set up triggers to keep the index updated as data in the table changes.
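
The sqlite-utils call at the heart of this is short. A sketch, with placeholder table and column names:

    import sqlite_utils

    db = sqlite_utils.Database("uploads.db")
    # Configure SQLite full-text search on two columns and create triggers
    # that keep the index in sync as rows are inserted, updated or deleted
    db["documents"].enable_fts(["title", "body"], create_triggers=True)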

It’s pretty simple, but it means that users of Datasette Cloud can upload a potentially enormous CSV file and then click to set specific columns as searchable. It’s a fun example of the kind of thing that can be built with Datasette’s new write capabilities.

asgi-csrf

CSRF is one of my favourite web application security vulnerabilities—I first wrote about it on this blog back in 2005!

I was surprised to see that the Starlette/ASGI ecosystem doesn’t yet have much in the way of CSRF prevention. The best option I could find was to use the WTForms library with Starlette.

I don’t need a full forms library for my purposes (at least not yet) but I needed CSRF protection for datasette-configure-fts, so I’ve started working on a small ASGI middleware library called asgi-csrf.

It’s modelled on a subset of Django’s robust CSRF prevention. The README warns people NOT to trust it yet—there are still some OWASP recommendations that it needs to apply (issue here) and I’m not yet ready to declare it robust and secure. It’s a start though, and feels like exactly the kind of problem that ASGI middleware is meant to address.
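
For a sense of the shape of the problem, here’s a minimal sketch of CSRF protection implemented as ASGI middleware. This is not asgi-csrf itself, just the double-submit cookie idea reduced to its simplest form:

    from http.cookies import SimpleCookie

    def csrf_protect(app):
        # Wraps any ASGI application; rejects POST requests unless the
        # x-csrftoken header matches the csrftoken cookie
        async def wrapper(scope, receive, send):
            if scope["type"] == "http" and scope.get("method") == "POST":
                headers = {
                    key.decode("latin-1"): value.decode("latin-1")
                    for key, value in scope.get("headers", [])
                }
                cookies = SimpleCookie()
                cookies.load(headers.get("cookie", ""))
                cookie_token = (
                    cookies["csrftoken"].value if "csrftoken" in cookies else None
                )
                if not cookie_token or cookie_token != headers.get("x-csrftoken"):
                    await send({
                        "type": "http.response.start",
                        "status": 403,
                        "headers": [[b"content-type", b"text/plain"]],
                    })
                    await send({
                        "type": "http.response.body",
                        "body": b"CSRF check failed",
                    })
                    return
            await app(scope, receive, send)
        return wrapper

A real implementation also has to handle tokens submitted in form bodies, token signing and the OWASP recommendations mentioned above.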

Elsewhere

30th March 2020

  • gifcap (via) This is really neat: a purely client-side implementation of animated gif screen capture, using navigator.mediaDevices.getDisplayMedia for the screen capturing, mithril for the UI and the gif.js pure JavaScript GIF encoding library to render the output. #

27th March 2020

  • PostGraphile: Production Considerations. PostGraphile is a tool for building a GraphQL API on top of an existing PostgreSQL schema. Their “production considerations” documentation is particularly interesting because it directly addresses some of my biggest worries about GraphQL: the potential for someone to craft an expensive query that ties up server resources. PostGraphile suggests a number of techniques for avoiding this, including a statement timeout, a query whitelist, pagination caps and (in their “pro” version) a cost limit that uses a calculated cost score for the query. #

26th March 2020

  • Making Datasets Fly with Datasette and Fly (via) It’s always exciting to see a Datasette tutorial that wasn’t written by me! This one is great—it shows how to load Central Park Squirrel Census data into a SQLite database, explore it with Datasette and then publish it to the Fly hosting platform using datasette-publish-fly and datasette-cluster-map. #
  • Slack’s not specifically a “work from home” tool; it’s more of a “create organizational agility” tool. But an all-at-once transition to remote work creates a lot of demand for organizational agility.

    Stewart Butterfield #

21st March 2020

  • hacker-news-to-sqlite (via) The latest in my Dogsheep series of tools: hacker-news-to-sqlite uses the Hacker News API to fetch your comments and submissions from Hacker News and save them to a SQLite database. #

19th March 2020

  • Django: Added support for asynchronous views and middleware (via) An enormously consequential feature just landed in Django, and is set to ship as part of Django 3.1 in August. Asynchronous views will allow Django applications to define views using “async def myview(request)”—taking full advantage of Python’s growing asyncio ecosystem and providing enormous performance improvements for Django sites that do things like hitting APIs over HTTP. Andrew has been puzzling over this for ages and it’s really exciting to see it land in a form that should be usable in a stable Django release in just a few months. #
  • datasette-publish-fly (via) Fly is a neat new Docker hosting provider with a very tempting pricing model: Just $2.67/month for their smallest always-on instance, and they give each user $10/month in free credit. datasette-publish-fly is the first plugin I’ve written using the publish_subcommand plugin hook, which allows extra hosting providers to be added as publish targets. Install the plugin and you can run “datasette publish fly data.db” to deploy SQLite databases to your Fly account. #

12th March 2020

  • New governance model for the Django project. This has been under discussion for a long time: I’m really excited to see it put into action. It’s difficult to summarize, but the key effect should be a much more vibrant, active set of people involved in making decisions about the framework. #
  • Announcing Daylight Map Distribution. Mike Migurski announces a new distribution of OpenStreetMap: a 42GB dump of the version of the data used by Facebook, carefully moderated to minimize the chance of incorrect or maliciously offensive edits. Lots of constructive conversation in the comments about the best way for Facebook to make their moderation decisions more available to the OSM community. #

7th March 2020

  • I called it normalization because then President Nixon was talking a lot about normalizing relations with China. I figured that if he could normalize relations, so could I.

    Edgar F. Codd #

5th March 2020

  • Millions of tiny databases. Fascinating, detailed review of a paper that describes Amazon’s Physalia, a distributed configuration store designed to provide extremely high availability coordination for Elastic Block Store replication. My eyebrows raised at “Physalia is designed to offer consistency and high-availability, even under network partitions.” since that’s such a blatant violation of CAP theorem, but it later justifies it like so: “One desirable property therefore, is that in the event of a partition, a client’s Physalia database will be on the same side of the partition as the client. Clever placement of cells across nodes can maximise the chances of this.” #

28th February 2020

  • Wildcard: Spreadsheet-Driven Customization of Web Applications (via) What a fascinating collection of ideas. Wildcard is a browser extension (currently using Tampermonkey and sadly not yet available to try out) which lets you add “spreadsheet-driven customization” to any web application. Watching the animated screenshots in the videos helps explain what this means—essentially it’s a two-way scraping trick, where content on the page (e.g. Airbnb listings) is extracted into a spreadsheet-like table interface using JavaScript—but then interactions you make in that spreadsheet, like filtering and sorting, are reflected back on the original page. It even has the ability to serve editable cells by mapping them to form inputs on the page. Lots to think about here. #