Simon Willison’s Weblog

Subscribe

Weeknotes: Design thinking for journalists, genome-to-sqlite, datasette-atom

20th September 2019

I haven’t had much time for code this week: we’ve had a full five day workshop at JSK with Tran Ha (a JSK alumni) learning how to apply Design Thinking to our fellowship projects and generally to challenges facing journalism.

I’ve used aspects of design thinking in building software products, but I’d never really thought about how it could be applied outside of digital product design. It’s been really interesting—especially seeing the other fellows (who, unlike me, are generally not planning to build software during their fellowship) start to apply it to a much wider and more interesting range of problems.

I’ve been commuting in to Stanford on the Caltrain, which did give me a bit of time to work on some code.

genome-to-sqlite

I’m continuing to build out a family of tools for personal analytics, where my principle goal is to reclaim the data that various internet companies have collected about me and pull it into a local SQLite database so I can analyze, visualize and generally and have fun with it.

A few years ago I shared my DNA with 23andMe. I don’t think I’d make the decision to do that today: it’s incredibly personal data, and the horror stories about people making unpleasant discoveries about their family trees keep on building. But since I’ve done it, I decided to see if I could extract out some data…

… and it turns out they let you download your entire genome! You can export it as a zipped up TSV file—mine decompresses to 15MB of data (which feels a little small—I know little about genetics, but I’m presuming that’s because the genome they record and share is just the interesting known genetic markers, not the entire DNA sequence—UPDATE: confirmed, thanks @laurencerowe).

So I wrote a quick utility, genome-to-sqlite, which loads the TSV file (directly from the zip or a file you’ve already extracted) and writes it to a simple SQLite table. Load it into Datasette and you can even facet by chromosome, which is exciting!

This is where my knowledge runs out. I’m confident someone with more insight than me could construct some interesting SQL queries against this—maybe one that determines if you are likely to have red hair?—so I’m hoping someone will step in and provide a few examples.

I filed a help wanted issue on GitHub. I also put a request out on Twitter for an UPDATE statement that could turn me into a dinosaur.

datasette-atom

This is very much a work-in-progress right now: datasette-atom will be a Datasette plugin that adds .atom as an output format (using the register_output_renderer plugin hook contributed by Russ Garrett a few months ago.

The aim is to allow people to subscribe to the output of a query in their feed reader (and potentially through that via email and other mechanisms)—particularly important for databases which are being updated over time.

It’s a slightly tricky plugin to design because valid Atom feed entries require a globally unique ID, a title and an “updated” date—and not all SQL queries produce obvious candidates for these values. As such, I’m going to have the plugin prompt the user for those fields and then persist them in the feed URL that you subscribe to.

This also means you won’t be able to generate an Atom feed for a query that doesn’t return at least one datetime column. I think I’m OK with that.

github-to-sqlite

I released one new feature for github-to-sqlite this week: the github-to-sqlite repos github.db command, which populates a database table of all of the repositories available to the authenticated user. Or use github-to-sqlite repos github.db dogsheep to pull the repos owned by a specific user or organization.

The command configures a SQLite full-text search index against the repo titles and descriptions, so if you have a lot of GitHub repos (I somehow have nearly 300!) you can search through them and use Datasette to facet them against different properties.

github-to-sqlite currently has two other useful subcommands: starred fetches details of every repository a user has starred, and issues pulls details of the issues (but sadly not yet their comment threads) attached to a repository.

Books

I’m trying to spend more time reading books—so I’m going to start including book stuff in my weeknotes in the hope of keeping myself on track.

I acquired two new books this week:

  • Just Enough Research by Erika Hall (recommended by Tom Coates and Tran Ha), because I need to spent the next few months interviewing as many journalists (and other project stakeholders) as possible to ensure I am solving the right problems for them.
  • Producing Open Source Software by Karl Fogel, because my fellowship goal is to build a thriving open source ecosystem around tooling for data journalism and this book looks like it covers a lot of the topics I need to really do a good job of that.

Next step: actually read them! Hopefully I’ll have some notes to share in next week’s update.