Simon Willison’s Weblog

Weeknotes: Dogsheep

Having figured out my Stanford schedule, this week I started getting back into the habit of writing some code.

Dogsheep

Dogsheep is the collective name I’ve given to a suite of tools I’m building around the concept of personal analytics.

I generate a lot of data, and while much of it ends up in the silos of the internet giants, thanks to the GDPR most of those silos now feature an “export a copy of your data” button.

Wouldn’t it be cool if you could convert that data into a SQLite database and then use Datasette to run queries against it?

So that’s what I’m doing! The tools I’ve built so far include:

  • healthkit-to-sqlite for my Apple HealthKit data (mostly collected by my Apple Watch).
  • twitter-to-sqlite, by far the most developed tool. For Dogsheep purposes it lets me import my tweets and the tweets I have favourited, but it’s growing all kinds of other useful features for retrieving and analyzing data from Twitter. More on this in my previous weeknotes.
  • swarm-to-sqlite for my Foursquare Swarm checkins.
  • inaturalist-to-sqlite for my iNaturalist observations.
  • google-takeout-to-sqlite for the wealth of data available from Google Takeout. I’ve barely scratched the surface with this one but it does have the ability to export my location history from Google Maps—215,000 latitude/longitude/timestamp records dating back to select min(timestamp) from location_history July 2015!
  • github-to-sqlite for importing my GitHub repositories and the repositories I have starred.
  • pocket-to-sqlite (built over this weekend) for articles I have saved to Pocket.
  • genome-to-sqlite for my 23andMe genome, because being able to run SQL queries against my genetic code is really funny.

My biggest achievement this week is that my own personal Dogsheep is no longer vaporware! I got a Datasette instance running on a AWS Lightsail VPS (using the systemd recipe from this ticket) and started loading in copies of my data.

Securing this instance well is important. In addition to running it behind datasette-auth-github I’ve followed up on a long-term ambition of deploying something protected by client certificates. If your browser doesn’t have the required certificate installed it gets rejected by nginx before it’s even proxied through to Datasette.

I have the certificate successfully installed on my iPhone and my laptop browsers. Setting this all up wasn’t particularly straight-forward—I ended up following this excellent tutorial by Nathan Wittstock for the nginx part of it, then randomly Googling for tips on installing the certificate in Firefox and on my iPhone.

I think there’s a great opportunity for tooling that makes this much easier. Adding that to my maybe-someday pile.

datasette-auth-github asset caching

Dogsheep is a great opportunity for eating my own dogfood. I’ve been having a lot of fun using it to explore my data... but I noticed that there was a suspicous delay every time I clicked a link. Datasette is mostly JavaScript-free but usually it responds so fast (~100ms or less) that clicking around feels “app-like”—but here I was really feeling the delay, often of more than a second.

I fired up the Firefox Network inspector and quickly spotted what was happening. I have the datasette-vega plugin installed so I can create charts, and that weighs in at nearly 1MB of JavaScript. According to Firefox my browser was loading that entire 1M file from scratch for every page load!

The culprit there turned out to be my datasette-auth-github plugin.

As detailed in issue #6, the ASGI middleware adds a cache-control: private header to every HTTP response—to make sure authentication-required content isn’t accidentally cached by any intermediary proxies (I originally designed Datasette to run effectively behind Varnish and Cloudflare).

This was applied to everything... including the static JavaScript and CSS served up by the plugin!

To fix this, I shipped a new release of datasette-auth-github with a cacheable_prefixes setting, which defaults to /-/static/ and /-/static-plugins/ when run as a Datasette plugin. See issue 47 for details.

Areas of research

Connection pooling

My most significant ongoing project for Datasette right now concerns database connection pooling.

For Datasette Library I want to be able to support potentially hundreds of attached databases, and Datasette Edit calls for the ability to use a writable (as opposed to read-only) connection.

This means I need to get smarter about database connections, and last week I finally made some solid progress on figuring out more advanced connection pooling. Still more work to go but I’m finally feeling unblocked on this after sweating over it for a couple of months without writing any code.

I also added a /-/threads page to Datasette (demo) to make it easier to understand the threads being run by the application.

Recursive CTEs in SQLite

twitter-to-sqlite generates a tweets table with both id and in_reply_to_status_id (nullable) columns.

SQLite supports recursive CTEs, documented here. In theory, this means it should be possible to write SQL queries that analyze Twitter threads—showing me the tweets that are at the bottom of the longest reply chain, for example.

I was feeling lazy and asked if anyone knew how to do that on Twitter. Robin Houston, Matthew Somerville and Piers Cawley all chipped in with useful tips, which I’ve collected in this gist.

I particularly enjoyed learning how Piers uses recursive SQL queries to help run his bakery.

logging

I finally made some progress on getting my head around the Python standard library logging module, a long-time nemesis.

The breakthrough was to sit down and actually read the source code—which helped me understand that the entire point of the library is to be a globally configured logging system that makes careful use of threading primitives to ensure you can log without worrying about concurrency.

I’ve been worrying about that aspect of logging for years, so it’s reassuring to see that Vinay Sajip has that comprehensively covered already.

Easier installation for Datasette

I want journalists to be able to install Datasette. I don’t want to have to talk them through installing xcode to install homebrew to install Python 3 first!

Raffaele Messuti did some great work investigating pyinstaller for this a while ago. I started a conversation on Twitter a few days ago after investigating pex (neat but not quite what I want, since users still have to install Python). From that thread I got some interesting further leads:

  • Freezing your code in the The Hitchhiker’s Guide to Python describes the problem space in general.
  • Briefcase is the part of the BeeWare suite that deals with for converting a Python project into a standalone native application (for a bunch of different platforms).
  • PyOxidizer is a relatively new tool for building standalone executables that bundle a Python interpreter, on top of the Rust and Cargo build ecosystem.
  • Conda Constructor helps you construct an installer for a group of Conda packages (I need to figure out Conda).

Next step: actively try some of these out and see what I can do with them.

My longer term goal is to package Datasette up with Electron and make it available as a regular Windows and OS X application, for people who aren’t comfortable using the command-line directly.

Books

I’ve been mostly working through required reading for my Stanford courses. Strategic Communication in particular has some excellent text books:

This is Weeknotes: Dogsheep by Simon Willison, posted on 7th October 2019.

Tagged , , ,

Next: Tracking PG&E outages by scraping to a git repo

Previous: Weeknotes: first week of Stanford classes