Video introduction to Datasette and sqlite-utils
I put together a 17 minute video introduction to Datasette and sqlite-utils for FOSDEM 2021, showing how you can use Datasette to explore data, and demonstrating using the
sqlite-utils command-line tool to convert a CSV file into a SQLite database, and then publish it using
datasette publish. Here’s the video, plus annotated screen captures with further links and commentary.
Datasette is an “open source multi-tool for exploring and publishing data”. The best way to explain what that means it to show you a demo.
Cleo is my dog. I’m going to answer the question “What is Cleo’s favourite coffee shop?”
This is a SQL view that joins against the venues table to retrieve the latitude and longitude, then uses the datasette-cluster-map plugin to show my checkins on a map.
The view is defined by the following SQL:
CREATE VIEW checkin_details AS select checkins.id, created, venues.id as venue_id, venues.name as venue_name, venues.latitude, venues.longitude, group_concat(categories.name) as venue_categories, shout, createdBy, events.name as event_name from checkins join venues on checkins.venue = venues.id left join events on checkins.event = events.id join categories_venues on venues.id = categories_venues.venues_id join categories on categories.id = categories_venues.categories_id group by checkins.id order by createdAt desc;
If I filter for everything where the “shout” contains the wolf emoji I get a map of just the places that Cleo likes.
If I facet by venue_categories I can see that she likes parks and dog runs, but she’s also been to coffee shops 25 times.
I can filter down to just coffee shop checkins, then facet by venue_name to see that she’s been to Blue Bottle 16 times but she has occasionally been to Starbucks.
Everything you can see in the Datasette interface can be got back out as raw data as well. This is a JSON feed of Cleo’s coffee shop checkins.
And here are those checkins as CSV.
So this is clearly super-useful software for categorizing your dog’s different coffee shop visits.
Let’s try something a bit more serious.
The New York Times publish their Covid-19 data to a GitHub repository as CSV files on a daily basis.
I can drill down to just the numbers for San Francisco County in California.
The original inspiration for Datasette was work I did at the Guardian back in 2009. We launched the Datablog to share the data underlying our stories—and we ended up using Google Sheets to share the data. I always felt like there should be a more open way of publishing data than putting it in a Google Sheet.
I started looking at these three years ago, and realized that while they didn’t generally provide a relational database (at least not for free) if your data was read-only you could bundle the data up in a SQLite file and deploy it as part of the application code.
I call this the Baked Data architectural pattern.
The city of San Francisco publishes a CSV file of every tree in the city—195,000 trees!
The CSV file includes latitude, longitude, species, street address and lots more.
Both Datasette and sqlite-utils can be installed on macOS using Homebrew.
brew install datasette sqlite-utils
They are written in Python so they can also be installed using
To run Datasette against the new
trees.db file locally, run this:
http://127.0.0.1:8000/ in your browser.
You can also use the
-o option to Datasette to open the browser for you:
datasette trees.db -o
And here’s that CSV file running in Datasette in tabular format.
To see them on a map, we need the datasette-cluster-map plugin. We can install that like this:
datasette install datasette-cluster-map
And here they are on a map. You can see straight away that sixteen of the trees have an incorrect latitude and longitude and appear in the middle of the ocean.
A lot of the columns have duplicate values—
qLegalStatus for example. sqlite-utils has a tool for helping with that—it can extract those values out into a separate table.
I wrote about this at length in Refactoring databases with sqlite-utils extract.
We can extract that
qLegalStatus column like so:
sqlite-utils extract trees.db trees qLegalStatus
When I refresh the page you can see that those columns now contain links through to the new table.
Here’s the new table.
And here’s a map of every “significant tree” in San Francisco.
We’ll extract two more columns:
sqlite-utils extract trees.db trees qCaretaker sqlite-utils extract trees.db trees qSpecies
Another useful thing we can do is configure full-text search. We can do that against the
qAddress column using the enable-fts command:
sqlite-utils enable-fts trees.db trees qAddress
Now Datasette provides a search field, which we can use to search for “grove” to see a map of all of the trees on Grove street.
And I can facet by species to see the most common species of trees on Grove street. I can even get those back out as CSV or JSON.
Because this is all running on a SQLite database, you can execute custom SQL queries against the database. This is safe because the database is opened in read-only mode and there’s a time limit on how long a query is allowed to execute for.
Let’s publish this to the internet, using the datasette publish command.
I’m going to publish
trees.db to Google Cloud Run using the following commmand:
datasette publish cloudrun trees.db --title "San Francisco Trees" --service sf-trees
This will create a new Docker container with Datasette and the bundled trees database, push that to Google’s cloud infrastructure, and Google will then deploy it to Cloud Run and provide a URL.
There’s also a collection of tools for working with SQLite databases—both manipulating existing databases and creating new ones with data pulled from APIs, different file formats and other relational databases.
Now thet the publish flow has finished, you can interact with the trees database online at sf-trees-j7hipcg4aq-uc.a.run.app.
To deploy it using the datasette-cluster-map plugin, run this:
datasette publish cloudrun trees.db --title "San Francisco Trees" \ --service sf-trees --install=datasette-cluster-map
You can browse the underlying data tables at datasette.io/content.
If you’d like to have a one-on-one conversation with me about Datasette you can sign up for Datasette office hours. I love talking to people about these projects!