Simon Willison’s Weblog


Joining CSV files in your browser using Datasette Lite

20th June 2022

I added a new feature to Datasette Lite—my version of Datasette that runs entirely in your browser using WebAssembly (previously): you can now use it to load one or more CSV files by URL, and then run SQL queries against them—including joins across data from multiple files.

Your CSV file needs to be hosted somewhere with access-control-allow-origin: * CORS headers. Any CSV file hosted on GitHub provides these, if you use the link you get by clicking on the “Raw” version.

Loading CSV data from a URL

Here’s the URL to a CSV file of college fight songs collected by FiveThirtyEight in their data repo as part of the reporting for this story a few years ago:

You can pass this to Datasette Lite in two ways:

Once Datasette has loaded, a data database will be available with a single table called fight-songs.

As you navigate around in Datasette the URL bar will update to reflect current state—which means you can deep-link to table views with applied filters and facets:

Or even link to the result of a custom SQL query:

Loading multiple files and joining data

You can pass the ?csv= parameter more than once to load data from multiple CSV files into the same virtual data database. Each CSV file will result in a separate table.

For this demo I’ll use two CSV files.

The first is us-counties-recent.csv from the NY Times covid-19-data repository, which lists the most recent numbers for Covid cases for every US county.

The second is us_census_county_populations_2019.csv, a CSV file listing the population of each county according to the 2019 US Census which I extracted from this page on the US Census website.

Both of those tables include a column called fips, representing the FIPS county code for each county. These 4-5 digit codes are ideal for joining the two tables.

Here’s a SQL query which joins the two tables, filters for the data for the most recent date represented (using where date = (select max(date) from [us-counties-recent])) and calculates cases_per_million using the cases and the population:

  1.0 * [us-counties-recent].cases / us_census_county_populations_2019.population * 1000000 as cases_per_million
  join us_census_county_populations_2019 on us_census_county_populations_2019.fips = [us-counties-recent].fips
  date = (select max(date) from [us-counties-recent])
order by
  cases_per_million desc

A screenshot of that query running in Datasette. Loving county Texas has the worst result - 1,289,940 cases per million - but that's because they have a population of just 169 people and 218 recorded cases.

And since everything in Datasette Lite can be bookmarked, here’s the super long URL (clickable version here) that executes that query against those two CSV files:*%2C%0A++us_census_county_populations_2019.population%2C%0A++1.0+*+%5Bus-counties-recent%5D.cases+%2F+us_census_county_populations_2019.population+*+1000000+as+cases_per_million%0Afrom%0A++%5Bus-counties-recent%5D%0A++join+us_census_county_populations_2019+on+us_census_county_populations_2019.fips+%3D+%5Bus-counties-recent%5D.fips%0Awhere%0A++date+%3D+%28select+max%28date%29+from+%5Bus-counties-recent%5D%29%0Aorder+by%0A++cases_per_million+desc