Joining CSV files in your browser using Datasette Lite
20th June 2022
I added a new feature to Datasette Lite—my version of Datasette that runs entirely in your browser using WebAssembly (previously): you can now use it to load one or more CSV files by URL, and then run SQL queries against them—including joins across data from multiple files.
Your CSV file needs to be hosted somewhere with
access-control-allow-origin: * CORS headers. Any CSV file hosted on GitHub provides these, if you use the link you get by clicking on the “Raw” version.
Loading CSV data from a URL
You can pass this to Datasette Lite in two ways:
- You can load the web app, click the “Load data by URL to a CSV file” button and paste in the URL
- Or you can pass it as a
?csv=parameter to the application, like this: https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv
Once Datasette has loaded, a
data database will be available with a single table called
As you navigate around in Datasette the URL bar will update to reflect current state—which means you can deep-link to table views with applied filters and facets:
Or even link to the result of a custom SQL query:
Loading multiple files and joining data
You can pass the
?csv= parameter more than once to load data from multiple CSV files into the same virtual
data database. Each CSV file will result in a separate table.
For this demo I’ll use two CSV files.
Both of those tables include a column called
fips, representing the FIPS county code for each county. These 4-5 digit codes are ideal for joining the two tables.
Here’s a SQL query which joins the two tables, filters for the data for the most recent date represented (using
where date = (select max(date) from [us-counties-recent])) and calculates
cases_per_million using the cases and the population:
select [us-counties-recent].*, us_census_county_populations_2019.population, 1.0 * [us-counties-recent].cases / us_census_county_populations_2019.population * 1000000 as cases_per_million from [us-counties-recent] join us_census_county_populations_2019 on us_census_county_populations_2019.fips = [us-counties-recent].fips where date = (select max(date) from [us-counties-recent]) order by cases_per_million desc
And since everything in Datasette Lite can be bookmarked, here’s the super long URL (clickable version here) that executes that query against those two CSV files:
More recent articles
- Datasette Enrichments: a new plugin framework for augmenting your data - 1st December 2023
- llamafile is the new best way to run a LLM on your own computer - 29th November 2023
- Prompt injection explained, November 2023 edition - 27th November 2023
- I'm on the Newsroom Robots podcast, with thoughts on the OpenAI board - 25th November 2023
- Weeknotes: DevDay, GitHub Universe, OpenAI chaos - 22nd November 2023
- Deciphering clues in a news article to understand how it was reported - 22nd November 2023
- Exploring GPTs: ChatGPT in a trench coat? - 15th November 2023
- Financial sustainability for open source projects at GitHub Universe - 10th November 2023
- ospeak: a CLI tool for speaking text in the terminal via OpenAI - 7th November 2023
- DALL-E 3, GPT4All, PMTiles, sqlite-migrate, datasette-edit-schema - 30th October 2023