Datasette—an ecosystem of tools for working with small data

22nd July 2021

This is the transcript and video from a talk I gave at PyGotham 2020 about using SQLite, Datasette and Dogsheep to work with small data.

I really like the term “small data”, as a counterpoint to big data. I define small data as anything that fits on a mobile phone, and since my iPhone has half a terabyte of storage these days that ends up covering a lot of ground!

I think SQLite is the ideal tool for working with data that fits this definition.

My PyGotham talk demonstrates how to find SQLite databases that already exist on your laptop, how to use Datasette to explore them, and then expands to cover Datasette as a tool for publishing data online and my Dogsheep family of tools for personal analytics.

Here’s the talk video on YouTube. I’ve included a full transcript below with additional links and embedded code samples.

Transcript and notes

I’m going to spend the next 20 minutes trying to convince you that almost every data problem you have should be solved using SQLite.

This is also an excuse for me to promote an open source project I’ve been working on for the last three years, called Datasette.

Datasette is a tool for exploring and publishing data that’s stored in SQLite databases. So the more people I can convince to use SQLite to solve things the better!

So let’s talk about SQLite, which claims to be the most widely deployed and used database in the world, and I think that claim holds up.

You may not have realised it, but you’re using SQLite every single day. It’s in phones and laptops. It’s in embedded devices. I have an Apple Watch, so I’m wearing SQLite databases on my wrist right now.

Let’s take a look on my laptop and see what we can find.

So this right here is a magic incantation that on a Mac will search for the largest SQLite databases across the entire system.

mdfind "kMDItemDisplayName == *.sqlite" -0 | \
  xargs -0 stat "-f%z %N" | sort -nr | head -n 20

I’m gonna run this right now to get my top 20. And here we are. Evernote is using SQLite, Firefox uses it, but the one at the top is an 857 megabyte SQLite file. It’s something to do with Apple Photos, so let’s see what’s in there.

Datasette can be installed as a command-line application, you can run it against the path to a SQLite database, and it starts up a little local web server, which you can then use to explore that data.
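If you want to try this yourself, the basic pattern looks like this (the database path here is just a placeholder):

pip install datasette
datasette /path/to/your-database.sqlite
# starts a local web server, by default at http://127.0.0.1:8001/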

So there are 67 tables in this Apple Photos database, with detected face prints and face crops and albums and all sorts of things.

But the one that I happen to know is super interesting is this one here. It’s called the Z Cloud Master Media Meta Data. It has 44,000 rows in it, and I happen to know I’ve taken about 44,000 photographs—so it looks like there’s some interesting stuff here.

This is kind of garbled, but there’s a clue. The binary column here starts with bplist, and I happen to know that Apple have a format called a binary plist, which is probably what this is.

So I’m gonna install a plugin for Datasette called datasette-bplist that knows what to do with that data. This is being installed from PyPI.

datasette install datasette-bplist

And now I can start Datasette up again. Hit refresh on this page. And now that Z Data has been decoded into something that’s a little bit more readable.

And straightaway we can see that OK, this is classic photo metadata. We’ve got the make and model of the device, we’ve got EXIF data. The really exciting stuff is we’ve got GPS data. So this is a database that includes the latitude and longitude of every photograph I’ve ever taken, and we’re absolutely going to dig into that and see if we can do some interesting stuff with it later on.

So SQLite is clearly a pretty great database—if Apple are using it this extensively it must have some good things going on for it. But then why don’t we use it for web applications?

Well, the classic answer to that is that the one thing SQLite is not so good at is accepting concurrent writes. If you’ve got a high traffic web site with lots of people writing to your database, you’re better off with something like MySQL or Postgres.

But I realised that I have a whole bunch of use cases that don’t require any writes at all. Any time I want to publish data online that isn’t going to change, SQLite is actually a really good fit for that.

So that was the original inspiration for Datasette. I was thinking about how, if I want to publish data online, this could be a way to do it that’s really cheap and flexible and interesting.

So that’s what I’ve been doing. I’ll show you a few examples.

This is Covid-19 case data, which is published online right now by Johns Hopkins University, The New York Times, the L.A. Times and The Economist.

https://covid-19.datasettes.com/

They’re all publishing their data as CSV files in GitHub repositories, which is pretty great because you get a commit history of changes to that CSV data. But it’s not necessarily something you can dive straight into and start using.

That’s where Datasette comes in. This is a Datasette instance which is automatically updated every day with the latest data from these different publications and I can drill through and see things like this New York Times data for example.

The New York Times publishes county level data of cases and deaths from Covid-19 and they’ve got over half a million rows in this spreadsheet. And so when we look at this data, it’s pretty unexciting in a sort of table form. We’ve got the date, the county, the state, the FIPS code, which is a unique identifier for each county, the number of cases and the number of deaths.

But Datasette kicks in and does some really useful things for us.

Firstly, Datasette has a concept of faceting where it looks for columns in the database which have a small number of unique values—like the State column for example, which only has 50 values in it—and it makes those available for navigation at the top along with these summary counts. So I can look down at this and say, Oh, look, Tennessee. Let’s drill into just the rows in this table that mention Tennessee. And then Tennessee has 16,000 rows, of which each county has a whole bunch. So let’s look at Rutherford County in Tennessee. That’s 181 rows. And now that we’ve drilled down these two levels, we’ve got back a table that is just showing us the number of cases and number of deaths ordered by date.

So each of these rows says on the sixth of September, there had been 80 deaths reported and 8,100 cases.

You saw a Datasette plugin earlier when I installed the bplist plugin. Another plugin I wrote is called Datasette Vega. It gives you simple charting options, so we can say I want a bar chart that plots the date on the X axis against the number of cases on the Y axis. And now we’ve got a chart of cases over time in this particular county in the state of Tennessee.

A couple of other things you can do: You can export the data back out again so I could get this data as JSON if I want to load this programmatically into something, and get it out as CSV if I want to load it into something like Microsoft Excel.
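The export works by swapping the extension on any table page URL—something like this, though the exact database and table names on that Covid instance may differ:

# append .json or .csv to a table URL (paths shown are illustrative)
curl https://covid-19.datasettes.com/covid/ny_times_us_counties.json
curl https://covid-19.datasettes.com/covid/ny_times_us_counties.csv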

I can actually export it as YAML because there’s an experimental plugin I’ve been building that does a YAML export. And then, more importantly, you can view and edit the SQL query that was used for this page.

Now this is one of the interesting things about publishing read-only data: because the data is read-only and it’s opened in a special read-only mode, people executing SQL queries can’t cause any damage to that database. So SQL injection, which in most web applications is a serious security vulnerability, becomes a feature for us.

We could say I just want back the data on the cases and the deaths columns—I’m gonna hit “run SQL” and now I’m getting back just that data that I requested, and I could export that data back out again as JSON or CSV. So this becomes a really powerful way of remixing and slicing and dicing data and then getting it back out again in a format that I can use somewhere else.
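The edited query for that Tennessee drill-down looks roughly like this—a sketch, with the table name assumed:

-- table name is illustrative
select date, cases, deaths
from ny_times_us_counties
where state = 'Tennessee'
  and county = 'Rutherford'
order by date;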

So that’s all well and good. But the obvious question is, how do you get this data into a SQLite file in the first place? So I’m gonna show you a demo using one of my favourite pieces of data of all time. This is the results of the 2018 New York Central Park squirrel census, where a bunch of volunteers went out and tried to answer the crucial question: “How many squirrels are there in Central Park?”.

And this is published as a CSV file, it’s got 3,023 rows. Each row is a squirrel. And we can export that out and get back CSV. So I’ve got the file here. I’m gonna turn it into a SQLite database.

I have a tool I wrote called csvs-to-sqlite, which takes a CSV file—or multiple CSV files—and loads them into a SQLite database. So here we go, I’ve run that command and now I have squirrels.db, which is a SQLite database.

csvs-to-sqlite 2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv squirrels.db

If I run sqlite3 squirrels.db .schema, it’ll show me the schema—and it is indeed a table with columns for squirrels with unique squirrel IDs and which hectare they’re in and all of that kind of stuff.

datasette squirrels.db

And so now that we’ve done that, I can run datasette squirrels.db and start up Datasette and I can see that data in my browser. And here we go: here are 3,000 rows of squirrel data.

Faceting is kind of fun: you can see things like primary fur colour showing up as options there. And, actually, if we facet by that, we can say you know what? For the 392 cinnamon squirrels, let’s see what the highlight fur colours are for those. Show me the 10 cinnamon squirrels that have black highlights.

Another thing that stands out about this data is that there are these two columns here, X and Y, which look suspiciously like they might be latitudes and longitudes. We can do something really interesting with that.

I’m gonna load up the SQLite command-line tool for squirrels.db, and I’m gonna rename the X column to longitude.

sqlite3 squirrels.db
alter table "2018_Central_Park_Squirrel_Census_-_Squirrel_Data"
rename column X to longitude;
alter table "2018_Central_Park_Squirrel_Census_-_Squirrel_Data"
rename column Y to latitude;

So “alter table”—um, quite a long table name this one—gonna alter this table here. Rename column X to longitude. And I’m gonna rename column Y to latitude. And now I’m going to check my datasette plugins and make sure I’ve got a plugin installed called datasette-cluster-map.
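Checking for the plugin—and installing it if it isn’t listed—looks like this:

datasette plugins
datasette install datasette-cluster-map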

There it is. So now if I run Datasette against squirrels again and refresh this page, the plugin here notices the latitude and longitude columns and it draws them on the map. So this is a map of everywhere that they saw a cinnamon squirrel with black highlight fur.

And if I close off these filters you can see that here’s a map of all 3,000 squirrels that were spotted in Central Park. And I can zoom in and explore the data that way.

That’s pretty cool. Wouldn’t it be cool if we could share this data on the Internet somehow?

So we’ve got a SQLite database. And this is where another key feature of Datasette comes in, which is the datasette publish command.

Datasette Publish lets you publish databases online, using Google Cloud Run or Heroku or Vercel. I’m going to use Vercel right now, so if I type datasette publish vercel squirrels.db—I need to give it a project name. I’m going to call it pygotham-squirrels. And I know that I want to install that datasette-cluster-map plugin as well.

datasette publish vercel squirrels.db \
  --project=pygotham-squirrels \
  --install=datasette-cluster-map

So I run this command, and it packages up that database with the Datasette application itself, uploads it and starts running a build on Vercel. It gives me a URL here. I can watch it doing its thing—so it’s installing runtimes, it’s installing those required dependencies, that additional plugin. And normally, this only takes about 20 seconds to run. And at the end of that 20 seconds, I will have a Web application that is running online with that data.

https://pygotham-squirrels.vercel.app/

So anyone can visit this URL right now, click on that map and they get that Datasette of squirrels with the map plugin, with the ability to facet by primary fur colour. It’s all up there. It’s running online and it’s ready for other people to visit.

I designed this initially for newspapers, based on my experience working at the Guardian, where we wanted to publish some of the data behind our stories. But as a way of productively getting structured data onto the web in a form you can start using, I think this is a really, really exciting capability.

And remember there are JSON APIs for this. So we have just published an API online for squirrel data that anyone who can write JavaScript or Python can start using to build their own squirrel maps.
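For example, fetching the squirrel data as JSON might look like this (the exact path depends on the database and table names):

# path is illustrative—it follows Datasette's /database/table.json pattern
curl "https://pygotham-squirrels.vercel.app/squirrels/2018_Central_Park_Squirrel_Census_-_Squirrel_Data.json"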

Another thing that I’ve been doing with this is figuring out how to turn these into much more custom experiences.

Datasette allows you to provide custom templates, it lets you provide custom CSS, which means you can actually use it to build entire websites powered under the hood by Datasette.
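A minimal sketch of what that looks like—the file and directory names here are illustrative:

datasette museums.db \
  --template-dir=templates/ \
  --static assets:static/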

One of the first I built was this one here. This is called niche-museums.com—it’s a website for my hobby of trying to track down weird and niche museums around the world. I can click “Use my location” and see just the weird museums near me.

There’s one just around the corner from our house called The Comic Book Rockstars Toilet Seat Museum, where a local comic book store has been collecting toilet seats illustrated by famous comic artists. And if you’re ever in San Francisco, you should absolutely pop by Hayes Valley and check this place out.

But this right here is just a heavily styled Datasette instance. If you go to /browse, it will give you access to the data. You can see that each museum is a row in a SQLite database table. I’ve got that plugin, so I get a free map showing you where everything is. Then the actual museum pages are just templated versions of the underlying Datasette pages.

I really like this as a way of publishing websites because it gives you all of the benefits of static publishing—it scales magically by starting up new instances, and you don’t have to worry about security holes because it’s all read-only data—but because there’s a relational database under it, you can build features like search. So I can search for Bigfoot and get back the Bigfoot Discovery Museum.

Or you can build things like this little “use my location” button at the top of the screen. But again, this is all built using Datasette and Datasette plugins. I have a plugin that knows how to turn a SQL query into an Atom feed so that people can subscribe to new updates to the website as well.
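That plugin is datasette-atom, which works by running a SQL query that returns specially named columns—atom_id, atom_title and atom_updated are required. A sketch, with the table and column names assumed:

-- museums table and its columns are illustrative
select
  'museum:' || id as atom_id,
  name as atom_title,
  updated as atom_updated,
  description as atom_content
from museums
order by updated desc;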

But to finish, I want to talk about a totally different use-case for all of this stuff, and that’s personal analytics.

This is an idea I had a year and a half ago, inspired by an essay written by Stephen Wolfram. Stephen wrote this absolutely fascinating essay about his approach to personal productivity, and the main feature of this essay is the scroll bar down the side. This thing just goes on and on and on, and he talks about how he had a standing desk, but then his heart rate monitor showed him that he gets better health benefits from walking around outside. So he built himself a little popcorn-tray-style rig for his laptop. He scanned every document he’s ever written from the age of 11. He’s got all of this amazing infrastructure set up for being productive, most of which I feel is way over the top for how I want to live my life.

He has a green screen setup in his basement so he can give talks from home.

But there was one thing in here that kind of caught my eye, and that’s this idea that he has of a metasearcher. It’s a personal search engine that he built that can search every e-mail, every file, every paper he’s written and all of the people he knows for any search term.

And I saw this and I thought, you know, that’s something I’d really like. I would love to have a personal search engine for all of my stuff.

But if I’m gonna build that and it’s inspired by Stephen Wolfram, I need to come up with a decent name for it.

And since it’s inspired by Wolf Ram, but it’s not really aiming to be as good or comprehensive as the thing that he built, I decided to call it Dog Sheep. So I have a project called Dogsheep. I love that pun so much I committed myself to actually building the software.

And so Dogsheep is a collection of tools for personal analytics: tools for pulling in data about yourself from lots of different sources, turning those into SQLite databases so that you can start working with them and then you can load them up in Datasette and start doing interesting things against them.
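Each tool follows the same pattern: pull data from an API or an export, write it into a SQLite file, then explore it with Datasette. A sketch (each tool has its own authentication setup, skipped here):

# fetch my own timeline into twitter.db (after twitter-to-sqlite auth)
twitter-to-sqlite user-timeline twitter.db
# fetch repo metadata for a GitHub user into github.db
github-to-sqlite repos github.db simonw
# browse both databases together
datasette twitter.db github.db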

I’ll give you a demonstration of my personal Dogsheep instance, where all of the data from these tools ends up.

This is my personal Dogsheep. It has data from a whole bunch of different places. I’ve got data from Twitter and HealthKit—my Apple Watch tracks my health over time—GitHub data, Foursquare Swarm; I’ve pulled in data from LinkedIn and Goodreads and Hacker News and all of these different places, and I can do all sorts of fun things with this.

So I’ll start by showing you a couple of things from Twitter. I index all of my tweets, any time somebody mentions me, and all of the tweets that I’ve favourited—and because I’ve indexed my favourited tweets I can run searches against them.

So I can search for PyGotham and see all of the tweets that I’ve favourited about PyGotham.

Twitter does not offer this feature themselves, and having it suddenly makes Twitter favourites so, so much more useful.

I can search my followers so if I decide I want to have a conversation with a VC, I can type in VC and see anyone who follows me on Twitter who mentions VC in their bio, and hence is somebody that I might be able to contact via direct messages.

But a really useful example is that my dog, Cleo, has a Twitter account. And every time she goes to the vet, she tweets a selfie, and with her selfie she tweets how much she weighs. So she’ll tweet “I weigh 42.5 pounds. I grew a little bit more dog”. Because she tweets this and I’ve got it coming into Dogsheep, I now have these in a SQL database—so I can construct a SQL query that looks for tweets that mention her weight. And I can use a regular expression to extract that weight value out as the thing before the LB. And then I can use my charting plugin to plot that as a chart. So I now have a chart of Cleo’s weight over time, as reported by her posting selfies at the vet. It’s like having the kind of data warehouse you get at a giant corporation, but just against your own data, and costing about $5 a month in hosting fees.
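As a rough sketch of that weight-extraction idea using only core SQLite string functions—my real query used a regular expression, and the table and column names here are illustrative:

-- assumes tweets like 'I weigh 42.5lb...' in a full_text column
select
  created_at,
  cast(
    substr(
      full_text,
      instr(full_text, 'I weigh ') + 8,
      instr(full_text, 'lb') - instr(full_text, 'I weigh ') - 8
    ) as real
  ) as weight_lb
from tweets
where full_text like 'I weigh %lb%'
order by created_at;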

So that’s tweets. But there was so much more stuff in here. I’m running dozens and dozens of projects on GitHub, and I pull all of the issues, issue comments, commits and releases into my one database.

So here I have a table of all 7,000 commits that I’ve made on GitHub across all of these different projects, so I can search these in one place, I can evaluate them, I can compare commits across different projects and I can also do charting. So here’s a graph that’s constructed by a custom SQL query, which pulls out the date of each commit, groups by date and adds a count. So these are my commits over time as a chart.
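That query is roughly this, assuming the commits table that github-to-sqlite creates:

-- group commits by day and count them
select
  date(committer_date) as day,
  count(*) as commits
from commits
group by day
order by day;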

A key philosophy of Datasette is everything should have a URL. So once I’ve constructed this query and put it on a chart, I can bookmark that page right there and that becomes, essentially, the application encoded in a URL that I can bookmark and link to and visit later on.

I’ve got a copy of my genome. I did 23andMe a few years ago, and I found out recently that they have an export button, so you can export your genome as a CSV file. So I did that. I now have a copy of the 600,000 rows of my genome that 23andMe keep track of. Which means I can use SQL queries to analyse my own genome. I have a query here that tells me what colour my eyes are based on running it against my genome—apparently my eyes are blue 99% of the time, because I have a GG genotype at this particular location on my genome.
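A sketch of that eye colour query—rs12913832 is the well-known eye colour SNP, the GG interpretation is the one from the talk, and the table and column names assume a simple rsid/genotype layout:

-- genome table with rsid/genotype columns is assumed;
-- interpretations for the other genotypes are illustrative
select
  rsid,
  genotype,
  case genotype
    when 'GG' then 'blue eyes, 99% of the time'
    when 'AG' then 'brown or green eyes'
    when 'AA' then 'brown eyes, most of the time'
  end as interpretation
from genome
where rsid = 'rs12913832';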

This took an entire weekend to figure out. I went to Science Hack Day and borrowed a geneticist so I could see if I could figure out some homebrew genetic explorations.

I mentioned the Apple Watch earlier. Apple record an incredible amount of data about me, and they don’t upload it to the cloud—an interesting thing about Apple is that they keep that stuff on your device. So if you dig around in the Health app on the iPhone, there’s an export button that will give you that data back out again. And I wrote code to turn that into SQLite, and there was an enormous quantity of data in here. I’ve got standing time and basal energy burned, my body mass over time reported by my Withings scale, headphone audio exposure—which is something that showed up last year—the number of flights of steps I climbed, all sorts of stuff.
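The tool I wrote for this is healthkit-to-sqlite. It takes the export.zip file that the export button produces:

# convert the Health app export into a SQLite database, then explore it
healthkit-to-sqlite export.zip healthkit.db
datasette healthkit.db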

But the most interesting one is workouts, because every time you track a workout using your Apple Watch, it goes in the database. And if it’s an outdoor workout—where you’re, say, going for a run—it records your GPS location every few seconds. I ran the Bay to Breakers race in San Francisco a couple of years ago, and because of that I’ve got a database table with my location-tracked route. It’s 3,800 points, and I can plot that on a map and see the exact, fine-grained route I took through San Francisco when I ran Bay to Breakers just over three years ago.

It’s kind of amazing to me that this stuff is just sat there on our devices waiting for us to liberate it and start using it to build these visualisations and learn interesting things about ourselves.

I also mentioned earlier on Apple Photos. Apple Photos databases are absolutely fascinating because it turns out Apple run machine learning models on your phone to figure out what you took photographs of. And this means I can do things like run a SQL query that shows me all of the photographs that I’ve taken of a pelican.

My personal Dogsheep right here shows me these pictures—these are Apple Photos I took where Apple itself has labelled them as pelican.

And even more fun than that is Apple calculates scores for your photos. These are hidden away in the SQLite database. And they have names like Z Overall Aesthetic Score or Z Harmonious Colour Score or Z Pleasant Camera Tilt Score. So this query here isn’t just showing me photographs I’ve taken of pelicans—it’s sorting them by the most aesthetically pleasing according to Apple’s classification algorithm that they’re running on my device.
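A sketch of that kind of query—the real table and column names inside the Photos database are more convoluted than the ones assumed here:

-- photos and labels tables are illustrative stand-ins
select
  photos.uuid,
  photos.date,
  photos.ZOVERALLAESTHETICSCORE as aesthetic_score
from photos
join labels on labels.uuid = photos.uuid
where labels.label = 'pelican'
order by aesthetic_score desc;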

I can show you the pelican photo with the most pleasant camera tilt—there’s so much fun stuff you can do digging into all of this.

So I’ve been building this set of Dogsheep tools for about a year now, but the thing I always wanted to do was tie them all together. I wanted to build that personal search engine.

And a couple of weeks ago, I finally got it working.

I’ve called it Dogsheep Beta because Stephen Wolfram’s search engine is called Wolfram Alpha. So I figured Dogsheep Beta was definitely the appropriate name for it.

And so this right here is Dogsheep Beta, and it lets me run searches across my tweets, my photos, my GitHub commits, issue comments, releases, comments I made on Hacker News—all in one place.

And so if I search, for example, for “cleopaws costume”, I will see all of the tweets by my dog, whose Twitter handle is @Cleopaws, where she’s tweeting about her costumes. I think she makes a particularly fetching pirate.

Or again, I can search for PyGotham and I get back not just tweets, but also issue comments from my personal issues where I’ve been tracking the fact that I need to submit a talk.

Again, this is a custom search interface, but really all this is under the hood is SQLite—it’s a SQLite database with a table that acts as a search index and I’m running queries using SQLite’s built-in full text search mechanism.
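A minimal sketch of that kind of search, assuming a full-text search table called search_index:

-- search_index is an assumed FTS table name
select * from search_index where search_index match 'pygotham';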

If any of this has caught your interest, all of this is aggressively open source. So if you go to dogsheep.github.io there’s an index page for all of the Dogsheep projects. Click through to the Datasette repository on GitHub and I’ve got very extensive documentation covering Datasette, what it can do, the plugins available and how to write plugins of your own.

And really, my goal with this project is I want to build a growing ecosystem of not just Datasette plugins, but tools for getting data from all of these different sources and loading them into SQLite databases so that we can join them together and use them to find out interesting things about the world and about our own lives.

If this has piqued your interest, please drop by the project. We have a GitHub discussions board which you’re welcome to join.

[I hadn’t launched the full website when I gave this talk, but today I’d encourage you to explore datasette.io]