Simon Willison’s Weblog

On emoji, scaling, docker, documentation, unicode, ...

 

Recent entries

Datasette Publish: a web app for publishing CSV files as an online database 23 hours ago

I’ve just released Datasette Publish, a web tool for turning one or more CSV files into an online database with a JSON API.

Here’s a demo application I built using Datasette Publish, showing Californian campaign finance data using CSV files released by the California Civic Data Coalition.

And here’s an animated screencast showing exactly how I built it:

Animated demo of Datasette Publish

Datasette Publish combines my Datasette tool for publishing SQLite databases as an API with my csvs-to-sqlite tool for generating them.

It’s built on top of the Zeit Now hosting service, which means anything you deploy with it lives on your own account with Zeit and stays entirely under your control. I used the brand new Zeit API 2.0.

Zeit’s generous free plan means you can try the tool out as many times as you like—and if you want to use it for an API powering a production website you can easily upgrade to a paid hosting plan.

Who should use it

Anyone who has data they want to share with the world!

The fundamental idea behind Datasette is that publishing structured data as both a web interface and a JSON API should be as quick and easy as possible.

The world is full of interesting data that often ends up trapped in PDF blobs or other hard-to-use formats, if it gets published at all. Datasette encourages using SQLite instead: a powerful, flexible format that enables analysis via SQL queries and can easily be shared and hosted online.

Since so much of the data that IS published today uses CSV, this first release of Datasette Publish focuses on CSV conversion above anything else. I plan to add support for other useful formats in the future.

The three areas where I’m most excited to see Datasette adopted are data journalism, civic open data and cultural institutions.

Data journalism, because Datasette is the tool I wish I had had for publishing data when I worked at the Guardian. When we started the Guardian Datablog we ended up using Google Sheets for this.

Civic open data because it turns out the open data movement mostly won! It’s incredible how much high quality data is published by local and national governments these days. My San Francisco tree search project for example uses data from the Department of Public Works—a CSV of 190,000 trees around the city.

Cultural institutions because the museums and libraries of the world are sitting on enormous treasure troves of valuable information, and have an institutional mandate to share that data as widely as possible.

If you are involved in any of the above please get in touch. I’d love your help improving the Datasette ecosystem to better serve your needs.

How it works

Datasette Publish would not be possible without Zeit Now. Now is a revolutionary approach to hosting: it lets you instantly create immutable deployments with a unique URL, via a command-line tool or using their recently updated API. It’s by far the most productive hosting environment I’ve ever worked with.

I built the main Datasette Publish interface using React. Building a SPA here made a lot of sense, because it allowed me to construct the entire application without any form of server-side storage (aside from Keen for analytics).

When you sign in via Zeit OAuth I store your access token in a signed cookie. Each time you upload a CSV the file is stored directly using Zeit’s upload API, and the file metadata is persisted in JavaScript state in the React app. When you click “publish” the accumulated state is sent to the server where it is used to construct a new Zeit deployment.
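
The signed cookie piece is straightforward to sketch. Here is a minimal illustration of the idea using the itsdangerous library (an assumption on my part; the actual Datasette Publish code may do this differently):

# Sketch: keeping an OAuth access token in a signed cookie value.
# itsdangerous is assumed here; the real application may use another mechanism.
from itsdangerous import BadSignature, URLSafeSerializer

serializer = URLSafeSerializer("keep-this-secret")  # server-side signing key

def cookie_for_token(access_token):
    # The value is signed, not encrypted: the user can read it but cannot tamper with it.
    return serializer.dumps({"zeit_access_token": access_token})

def token_from_cookie(cookie_value):
    try:
        return serializer.loads(cookie_value)["zeit_access_token"]
    except BadSignature:
        return None  # tampered with, or signed with a different key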

The deployment itself consists of the CSV files plus a Dockerfile that installs Python, Datasette, csvs-to-sqlite and their dependencies, then runs csvs-to-sqlite against the CSV files and starts up Datasette against the resulting database.

If you specified a title, description, source or license I generate a Datasette metadata.json file and include that in the deployment as well.

Since free deployments to Zeit are “source code visible”, you can see exactly how the resulting application is structured by visiting https://datasette-onrlszntsq.now.sh/_src (the campaign finance app I built earlier).

Using the Zeit API in this way has the neat effect that I don’t ever store any user data myself—neither the access token used to access your account nor any of the CSVs that you upload. Uploaded files go straight to your own Zeit account and stay under your control. Access tokens are never persisted. The deployed application lives on your own hosting account, where you can terminate it or upgrade it to a paid plan without any further involvement from the tool I have built.

Not having to worry about storing encrypted access tokens or covering any hosting costs beyond the Datasette Publish tool itself is delightful.

This ability to build tools that themselves deploy other tools is fascinating. I can’t wait to see what other kinds of interesting new applications it enables.

Discussion on Hacker News.

Building a location to time zone API with SpatiaLite, OpenStreetMap and Datasette one month ago

Given a latitude and longitude, how can we tell what time zone that point lies within? Here’s how I built a simple JSON API to answer that question, using a combination of data from OpenStreetMap, the SpatiaLite extension for SQLite and my Datasette API tool.

The API

You can try the API out here: feed it a latitude and longitude and it will return the corresponding time zone ID: https://timezones-api.now.sh/timezones/by_point

Some examples:
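
The original examples were linked here; as an illustration, this minimal Python sketch calls the endpoint with the Tokyo coordinates used later in this post (requests is assumed, and the response shape follows the rows list used by the JavaScript example further down):

# Sketch: calling the by_point canned query as JSON.
# The .json extension and the longitude/latitude parameters are described later in this post.
import requests

response = requests.get(
    "https://timezones-api.now.sh/timezones/by_point.json",
    params={"longitude": "139.7819661", "latitude": "35.6631424"},
)
response.raise_for_status()
print(response.json()["rows"])  # expected to include Asia/Tokyo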

The data

I was first introduced to Eric Muller’s tz_world shapefile by Nick Williams at Eventbrite, who used it to build us an internal time zone lookup API on top of MySQL’s geospatial data types. Eric’s project is no longer updated and he recommends Evan Siroky’s timezone-boundary-builder project as an alternative, which derives time zone shapes from OpenStreetMap and makes the resulting data available under the Open Database License. The shapefile itself can be downloaded from the GitHub releases page for the project.

Loading the data into SpatiaLite

SpatiaLite is a powerful open source extension for SQLite, which adds a comprehensive suite of geospatial functions—including the ability to ingest shapefiles, convert them into geometries and run point within queries against the resulting shapes.

The easiest way to get it running on OS X is via Homebrew:

$ brew install spatialite-tools
$ brew install gdal

Having installed SpatiaLite, we can ingest the shapefile using .loadshp combined_shapefile timezones CP1252 23032—here’s the full process, from downloading the shapefile to ingesting it into a new SQLite database file called timezones.db:

$ wget https://github.com/evansiroky/timezone-boundary-builder/releases/download/2017c/timezones.shapefile.zip
$ unzip timezones.shapefile.zip
$ cd dist
$ spatialite timezones.db
SpatiaLite version ..: 4.3.0a   Supported Extensions:
...
Enter SQL statements terminated with a ";"
spatialite> .loadshp combined_shapefile timezones CP1252 23032
========
Loading shapefile at 'combined_shapefile' into SQLite table 'timezones'

BEGIN;
CREATE TABLE "timezones" (
"PK_UID" INTEGER PRIMARY KEY AUTOINCREMENT,
"tzid" TEXT);
SELECT AddGeometryColumn('timezones', 'Geometry', 23032, 'MULTIPOLYGON', 'XY');
COMMIT;

Inserted 414 rows into 'timezones' from SHAPEFILE
========
spatialite> 

Let’s try it out with a query. Here’s the SQL needed to find the time zone for a point in Tokyo:

select tzid
from
    timezones
where
    within(GeomFromText('POINT(139.7819661 35.6631424)'), timezones.Geometry);

Let’s run that in SpatiaLite and see what we get:

spatialite> select tzid
   ...> from
   ...>     timezones
   ...> where
   ...>     within(GeomFromText('POINT(139.7819661 35.6631424)'), timezones.Geometry);
Asia/Tokyo

Looks good so far! How long is it taking though? We can find out by running .timer on in the spatialite prompt:

spatialite> .timer on
spatialite> select tzid
   ...> from
   ...>     timezones
   ...> where
   ...>     within(GeomFromText('POINT(139.7819661 35.6631424)'), timezones.Geometry);
Asia/Tokyo
CPU Time: user 0.108479 sys 0.064778

That’s a tenth of a second, or 100ms. Fast, but not brilliant.

Speeding it up with a geospatial index

It turns out SpatiaLite includes support for spatial indices, based on SQLite’s R*Tree module. R-Tree indexes can massively accelerate bounding box searches. Our searches are a lot more complex than that, acting as they do against extremely complex polygon shapes—but we can use a bounding box search to dramatically reduce the candidates we need to consider. Let’s create an index against our Geometry column:

SELECT CreateSpatialIndex('timezones', 'geometry');

To take advantage of this index, we need to expand our original SQL to first filter by geometries where their bounding box contains the point we are searching for. SpatiaLite has created an index table called idx_timezones_Geometry against which we can run an R-Tree optimized query. Here’s the SQL we will use:

select tzid
from
    timezones
where
    within(GeomFromText('POINT(139.7819661 35.6631424)'), timezones.Geometry)
    and rowid in (
        SELECT pkid FROM idx_timezones_Geometry
        where xmin < 139.7819661
        and xmax > 139.7819661
        and ymin < 35.6631424
        and ymax > 35.6631424
    );

How does this affect our performance?

spatialite> select tzid
   ...> from
   ...>     timezones
   ...> where
   ...>     within(GeomFromText('POINT(139.7819661 35.6631424)'), timezones.Geometry)
   ...>     and rowid in (
   ...>         SELECT pkid FROM idx_timezones_Geometry
   ...>         where xmin < 139.7819661
   ...>         and xmax > 139.7819661
   ...>         and ymin < 35.6631424
   ...>         and ymax > 35.6631424
   ...>     );
Asia/Tokyo
CPU Time: user 0.001541 sys 0.000111

From 100ms down to 1.5ms—nearly a 70x speedup! Not bad at all.

Building and publishing the API with Datasette

Now that we have a fast SQL query for finding a time zone for a latitude and longitude we can use Datasette to turn it into a JSON API.

The simplest way to do that looks like this:

datasette timezones.db \
    --load-extension=/usr/local/lib/mod_spatialite.dylib

This will start Datasette on port 8001 and load the SpatiaLite extension. You can then navigate to localhost:8001/timezones in your browser and paste in the SQL query… which should look something like this:

https://timezones-api.now.sh/timezones?sql=select+tzid%0D%0Afrom%0D%0Atimezones%0D%0Awhere%0D%0Awithin(GeomFromText(%27POINT(139.7819661+35.6631424)%27)%2C+timezones.Geometry)%0D%0Aand+rowid+in+(%0D%0A++++SELECT+pkid+FROM+idx_timezones_Geometry%0D%0Awhere+xmin+<+139.7819661%0D%0Aand+xmax+>+139.7819661%0D%0Aand+ymin+<+35.6631424%0D%0Aand+ymax+>+35.6631424%0D%0A++++)%3B

This works (click the JSON link to get the result as JSON) but it’s a little inconvenient to use: you have to construct a URL with the same latitude and longitude repeated in multiple places.

We can improve things using Datasette’s support for SQLite named parameters. Here’s that same SQL query using parameters instead of hard-coded latitude and longitude points:

select tzid
from
    timezones
where
    within(GeomFromText('POINT(' || :longitude || ' ' || :latitude || ')'), timezones.Geometry)
    and rowid in (
        SELECT pkid FROM idx_timezones_Geometry
        where xmin < :longitude
        and xmax > :longitude
        and ymin < :latitude
        and ymax > :latitude)

If you paste this into Datasette it will detect the named parameters and turn them into URL querystring parameters hooked up (in the UI) to HTML form fields.

Datasette time zone query showing longitude and latitude form fields

To save us from having to include the full SQL in the URL every time we call our new API, let’s take advantage of a new feature introduced in Datasette 0.14: canned queries. These are named, pre-packaged queries that can be defined in a metadata.json file. The file looks like this:

{
    "title": "OpenStreetMap Time Zone Boundaries",
    "license": "ODbL",
    "license_url": "http://opendatacommons.org/licenses/odbl/",
    "source": "timezone-boundary-builder",
    "source_url": "https://github.com/evansiroky/timezone-boundary-builder",
    "databases": {
        "timezones": {
            "queries": {
                "by_point": "select tzid\nfrom\n    timezones\nwhere\n    within(GeomFromText('POINT(' || :longitude || ' ' || :latitude || ')'), timezones.Geometry)\n    and rowid in (\n        SELECT pkid FROM idx_timezones_Geometry\n        where xmin < :longitude\n        and xmax > :longitude\n        and ymin < :latitude\n        and ymax > :latitude\n    )"
            }
        }
    }
}

The canned query is defined as the by_point key in the queries nested object. I’m also adding license and source information here for the project, because it’s good manners.

We can try this in Datasette on our local machine like so:

datasette timezones.db -m metadata.json \
    --load-extension=/usr/local/lib/mod_spatialite.dylib 

Now visiting localhost:8001/timezones/by_point will provide the interface for the query—and adding .json to the URL will turn it into an API.

Visualizing time zones with Leaflet and GeoJSON

If you browse around in Datasette you’ll quickly run into a rather unfortunate problem. The localhost:8001/timezones/timezones page, which shows the first 100 rows in the table, takes a shockingly long time to load. When it eventually does load you’ll see why: each record includes an enormous binary string containing its geometry. On my machine just that one page weighs in at 62MB of HTML!

This is bad: rendering that much HTML in one go can block the event loop and cause the application to become unresponsive. That giant blob of binary data isn’t exactly useful for humans, either.

We can make some major improvements here using another Datasette 0.14 feature: custom templates. Let’s start with a replacement template that shows just the length of the binary string instead of attempting to render it.

We’ll do that by over-riding the _rows_and_columns.html include template, which is used by Datasette on both the table page and the page used to display individual rows. Since we only want to over-ride this template for one particular table we’ll create a file called _rows_and_columns-timezones-timezones.html—the file name means this will only over-ride the template for the timezones table in our timezones database. Here’s our new template:

<table>
    <thead>
        <tr>
            {% for column in display_columns %}
                <th scope="col">{{ column }}</th>
            {% endfor %}
        </tr>
    </thead>
    <tbody>
    {% for row in display_rows %}
        <tr>
            {% for cell in row %}
                <td>
                    {% if cell.column == 'Geometry' %}
                        {{ cell.value|length }} bytes
                    {% else %}
                        {{ cell.value }}
                    {% endif %}
                </td>
            {% endfor %}
        </tr>
    {% endfor %}
    </tbody>
</table>

If we put that in a directory called templates/ we can tell Datasette to use it like this:

datasette timezones.db -m metadata.json \
    --load-extension=/usr/local/lib/mod_spatialite.dylib \
    --template-dir=templates/

Our localhost:8001/timezones/timezones page now looks like this:

Datasette time zone index showing lengths

But wouldn’t it be neat if we could see the actual shapes of these time zones? It turns out that’s actually pretty easy, using the combination of GeoJSON and the Leaflet mapping library.

GeoJSON is a neat, widely supported standard for encoding geographic information such as polygon shapes as JSON. SpatiaLite ships with built-in GeoJSON support in the form of the AsGeoJSON SQL function. We can use that function to turn any of our time zone geometries into a GeoJSON string:

select AsGeoJSON(Geometry) from timezones where tzid = 'Asia/Tokyo';

If you run that with Datasette you’ll get back a string of GeoJSON. You can paste that into geojson.io to instantly visualize it.

The Leaflet mapping library supports GeoJSON out of the box. We can construct a custom row.html template for our Datasette that loads Leaflet from unpkg.com, uses fetch() to execute the AsGeoJSON query and renders the result in a map on the page. Here’s the result:

Custom Datasette page rendering a GeoJSON map

And here’s the row-timezones-timezones.html template:

{% extends "row.html" %}
{% block extra_head %}
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.2.0/dist/leaflet.css" integrity="sha512-M2wvCLH6DSRazYeZRIm1JnYyh22purTM+FDB5CsyxtQJYeKq83arPe5wgbNmcFXGqiSH2XR8dT/fJISVA1r/zQ==" crossorigin=""/>
<script src="https://unpkg.com/leaflet@1.2.0/dist/leaflet.js" integrity="sha512-lInM/apFSqyy1o6s89K4iQUKg6ppXEgsVxT35HbzUupEVRh2Eu9Wdl4tHj7dZO0s1uvplcYGmt3498TtHq+log==" crossorigin=""></script>
<style>
#map {
  margin-top: 1em;
  width: 100%;
  height: 400px;
}
</style>
{% endblock %}

{% block content %}
{{ super() }}
<div id="map"></div>
<script>
var pk = location.pathname.split('/').slice(-1)[0];
var tiles = L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
    maxZoom: 19,
    detectRetina: true,
    attribution: '&copy; <a href="https://www.openstreetmap.org/copyright">OpenStreetMap</a> contributors, Points &copy; 2012 LINZ'
});
var sql = 'select AsGeoJSON(Geometry) from timezones where PK_UID = ' + pk;
fetch('/timezones.json?sql=' + encodeURIComponent(sql)).then(r => r.json()).then(d => {
  var map = L.map('map', {layers: [tiles]});
  var geoJSON = JSON.parse(d.rows[0][0]);
  var layer = L.geoJSON(geoJSON)
  layer.addTo(map);
  map.fitBounds(layer.getBounds());
});
</script>
{% endblock %}

Publishing it to the internet

Normally we would use the datasette publish command to publish our database to Heroku or Zeit Now, but the SpatiaLite dependency means that won’t work for this case. Instead, we need to construct a custom Dockerfile that builds the SpatiaLite module.

Since we’re using Docker, we may as well have the Dockerfile download the shapefiles and build the SpatiaLite database for us all in one go. Here’s a Dockerfile that does exactly that:

FROM python:3.6-slim-stretch

RUN apt update
RUN apt install -y python3-dev gcc spatialite-bin libsqlite3-mod-spatialite wget unzip

RUN pip install https://github.com/simonw/datasette/archive/master.zip

# Download and import the shapefiles
RUN wget https://github.com/evansiroky/timezone-boundary-builder/releases/download/2017c/timezones.shapefile.zip \
    && unzip timezones.shapefile.zip && \
    cd dist && \
    echo ".loadshp combined_shapefile timezones CP1252 23032\nSELECT CreateSpatialIndex('timezones', 'geometry');" | spatialite timezones.db && \
    mv timezones.db .. && \
    cd .. && rm -rf dist && rm timezones.shapefile.zip

ENV SQLITE_EXTENSIONS /usr/lib/x86_64-linux-gnu/mod_spatialite.so

ADD metadata.json metadata.json

ADD templates templates

RUN datasette inspect timezones.db --inspect-file inspect-data.json

EXPOSE 8001

CMD ["datasette", "serve", "timezones.db", "--host", "0.0.0.0", "--cors", "--port", "8001", "--inspect-file", "inspect-data.json", "-m", "metadata.json", "--template-dir", "templates"]

The full code, including the supporting templates, can be found in simonw/timezones-api on GitHub.

If you have Docker installed (Docker for Mac is a one-click install package these days, it’s impressively painless) you can build the container like this:

docker build . -t timezones-api
# Wait for the image to build
docker run -p 8001:8001 timezones-api

Then you can visit http://localhost:8001/ to see your freshly built Datasette in your browser.

The easiest way to publish it online is using Zeit Now. Simply run the now command in the directory containing the Dockerfile and Zeit will upload the entire directory, build the container in the cloud and deploy it with a fresh URL. It’s by far the easiest Docker deployment environment I’ve ever used.

Now can even deploy directly from a public GitHub repository… so you can deploy your own copy of the API by running the following command:

$ now simonw/timezones-api
> Didn't find directory. Searching on GitHub...
> Deploying GitHub repository "simonw/timezones-api" under simonw
> Ready! https://simonw-timezones-api-fbihjcbnog.now.sh (copied to clipboard) [2s]
> Initializing…

Canned queries + SpatiaLite = instant geospatial APIs

Hopefully this has helped illustrate the ease with which Datasette, SpatiaLite and canned queries can be used to create and publish geospatial APIs. Thanks to projects like OpenStreetMap the world is full of high quality open geospatial data. Go build something cool with it!

New in Datasette: filters, foreign keys and search one month ago

I’ve released Datasette 0.13 with a number of exciting new features (Datasette previously).

Filters

Datasette’s table view supports query-string based filtering. 0.13 introduces a new user interface for constructing those filters. Let’s use it to find every episode where Bob Ross painted clouds and mountains in season 3 of The Joy of Painting:

Animation demonstrating the new filter UI

The resulting querystring looks like this:

?CLOUDS__exact=1&EPISODE__startswith=S03&MOUNTAIN__exact=1

Using the .json or .jsono extension on the same URL returns JSON (in list-of-lists or list-of-objects format), so the new filter UI also acts as a simple API explorer. If you click “View and edit SQL” you will get the generated SQL in an editor, ready for you to further modify it.
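
As a quick illustration, the same filter querystring can be consumed programmatically. This sketch assumes a locally running Datasette and a placeholder database/table path; the rows-of-objects response shape follows the .jsono example later in this post:

# Sketch: reusing the filter querystring above against the .jsono endpoint.
# The host and database/table path are placeholders, not a real deployment.
import requests

url = "http://127.0.0.1:8001/joy-of-painting/episodes.jsono"  # placeholder path
params = {"CLOUDS__exact": "1", "EPISODE__startswith": "S03", "MOUNTAIN__exact": "1"}
data = requests.get(url, params=params).json()
for row in data["rows"]:
    print(row["EPISODE"])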

Foreign key relationships

Datasette now provides special treatment for SQLite foreign key relationships: if it detects a foreign key when displaying a table it will show values in that column as links to the related records—and if the foreign key table has an obvious label column, that label will be displayed in the column as the link label.

Here’s an example, using San Francisco’s Mobile Food Facility Permit dataset… aka food trucks!

Food truck table, showing links in the Applicant and FacilityType columns

And here’s a portion of the corresponding CREATE TABLE statements showing the foreign key relationships:

CREATE TABLE "Mobile_Food_Facility_Permit" (
    "locationid" INTEGER,
    "Applicant" INTEGER,
    "FacilityType" INTEGER,
    "cnn" INTEGER,
    "LocationDescription" TEXT,
    ...,
    FOREIGN KEY ("Applicant") REFERENCES [Applicant](id),
    FOREIGN KEY ("FacilityType") REFERENCES [FacilityType](id)
);
CREATE TABLE "Applicant" (
    "id" INTEGER PRIMARY KEY ,
    "value" TEXT
);
CREATE TABLE "FacilityType" (
    "id" INTEGER PRIMARY KEY ,
     "value" TEXT
);

If you click through to one of the linked records, you’ll see a page like this:

Food truck applicant

The “Links from other tables” section lists all other tables that reference this row, and links to a filtered query showing the corresponding records.

Using csvs-to-sqlite to build foreign key tables

The latest release of my csvs-to-sqlite utility adds a feature which complements Datasette’s foreign key support: you can now tell csvs-to-sqlite to “extract” a specified set of columns and use them to create additional tables.

Here’s how to create the food-trucks.db database used in the above examples.

First step: make sure you have Python 3 installed. On OS X with homebrew you can run brew install python3, otherwise follow the instructions on Python.org.

Ensure you have the most recent releases of csvs-to-sqlite and datasette:

pip3 install csvs-to-sqlite -U
pip3 install datasette -U

You may need to sudo these.

Now export the full CSV file from the Mobile Food Facility Permit page.

How to download CSV from data.sfgov.org

Here’s a sample of that CSV file:

$ head -n 2 Mobile_Food_Facility_Permit.csv 
locationid,Applicant,FacilityType,cnn,LocationDescription,Address,blocklot,block,lot,permit,Status,FoodItems,X,Y,Latitude,Longitude,Schedule,dayshours,NOISent,Approved,Received,PriorPermit,ExpirationDate,Location
751253,Pipo's Grill,Truck,5688000,FOLSOM ST: 14TH ST to 15TH ST (1800 - 1899),1800 FOLSOM ST,3549083,3549,083,16MFF-0010,REQUESTED,Tacos: Burritos: Hot Dogs: and Hamburgers,6007856.719,2107724.046,37.7678524427181,-122.416104892532,http://bsm.sfdpw.org/PermitsTracker/reports/report.aspx?title=schedule&report=rptSchedule&params=permit=16MFF-0010&ExportPDF=1&Filename=16MFF-0010_schedule.pdf,,,,2016-02-04,0,,"(37.7678524427181, -122.416104892532)"

Next, run the following command:

csvs-to-sqlite Mobile_Food_Facility_Permit.csv \
    -c FacilityType \
    -c block \
    -c Status \
    -c Applicant \
    food-trucks.db

The -c options are the real magic here: they tell csvs-to-sqlite to take that column from the CSV file and extract it out into a lookup table.

Having created the new database, you can use Datasette to browse it:

datasette food-trucks.db

Then browse to http://127.0.0.1:8001/ and start exploring.

Full-text search with Datasette and csvs-to-sqlite

SQLite includes a powerful full-text search implementation in the form of the FTS3, FTS4 and (in the most recent versions) FTS5 modules.

Datasette will look for tables that have a FTS virtual table configured against them and, if detected, will add support for a _search= query string argument and a search text interface as well.

Here’s an example of Datasette and SQLite FTS in action, this time using the DataSF list of Film Locations in San Francisco provided by the San Francisco Film Commission.

Searching film locations for Coit Tower

If you click on "View and edit SQL" you’ll see how the underlying query works:

select rowid, *
from Film_Locations_in_San_Francisco
where rowid in (
    select rowid
    from [Film_Locations_in_San_Francisco_fts]
    where [Film_Locations_in_San_Francisco_fts] match :search
)
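
The same search is available over the JSON API via the _search= argument mentioned above. Here is a minimal sketch against a local copy of the database (the exact URL path may differ depending on how your Datasette names the database):

# Sketch: full-text search over the JSON API using the _search= argument.
# Assumes the database is being served locally with: datasette sf-film-locations.db
import requests

url = "http://127.0.0.1:8001/sf-film-locations/Film_Locations_in_San_Francisco.json"
data = requests.get(url, params={"_search": "Coit Tower"}).json()
print(len(data["rows"]), "matching rows")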

csvs-to-sqlite knows how to create the underlying FTS virtual tables from a specified list of columns. Here’s how to create the sf-film-locations database:

csvs-to-sqlite \
    Film_Locations_in_San_Francisco.csv sf-film-locations.db \
    -c Title \
    -c "Release Year" \
    -c "Production Company" \
    -c "Distributor" \
    -c "Director" \
    -c "Writer" \
    -c "Actor 1:Actors" \
    -c "Actor 2:Actors" \
    -c "Actor 3:Actors" \
    -f Title \
    -f "Production Company" \
    -f Director \
    -f Writer \
    -f "Actor 1" \
    -f "Actor 2" \
    -f "Actor 3" \
    -f Locations \
    -f "Fun Facts"

The -f options are used to specify the columns which should be incorporated into the SQLite full-text search index. Note that the -f argument is compatible with the -c argument described above—if you extract a text column into a separate table, csvs-to-sqlite can still incorporate data from that column into the full-text index it creates.

I’m using another new feature above as well: the CSV file has three columns for actors, Actor 1, Actor 2 and Actor 3. I can tell the -c column extractor to refer each of those columns to the same underlying lookup table like this:

    -c "Actor 1:Actors" \
    -c "Actor 2:Actors" \
    -c "Actor 3:Actors" \

If you visit the Eddie Murphy page you can see that he’s listed as Actor 1 in 14 rows and as Actor 2 in 1.

A search engine for trees!

One last demo, this time using my favourite CSV file from data.sfgov.org: the Street Tree List, published by the San Francisco Department of Public Works.

This time, in addition to publishing the database I also put together a custom UI for querying it, based on the Leaflet.markercluster library. You can try that out at https://sf-tree-search.now.sh/.

SF Tree Search

Here’s the command I used to create the database:

csvs-to-sqlite Street_Tree_List.csv sf-trees.db \
    -c qLegalStatus \
    -c qSpecies \
    -c qSiteInfo \
    -c PlantType \
    -c qCaretaker \
    -c qCareAssistant \
    -f qLegalStatus \
    -f qSpecies \
    -f qAddress \
    -f qSiteInfo \
    -f PlantType \
    -f qCaretaker \
    -f qCareAssistant \
    -f PermitNotes

Once again, I’m extracting out specified columns and pointing the SQLite full-text indexer at a subset of them.

Since the JavaScript search needs to pull back a subset of the overall data, I composed a custom SQL query to drive those searches.

The full source code for my tree search demo is available on GitHub.

Datasette: instantly create and publish an API for your SQLite databases two months ago

I just shipped the first public version of datasette, a new tool for creating and publishing JSON APIs for SQLite databases.

You can try it out right now at fivethirtyeight.datasettes.com, where you can explore SQLite databases I built from Creative Commons licensed CSV files published by FiveThirtyEight. Or you can check out parlgov.datasettes.com, derived from the parlgov.org database of world political parties, which illustrates some advanced features such as SQLite views.

Common surnames from fivethirtyeight

Or you can try it out on your own machine. If you run OS X and use Google Chrome, try running the following:

pip3 install datasette
datasette ~/Library/Application\ Support/Google/Chrome/Default/History

This will start a web server on http://127.0.0.1:8001/ displaying an interface that will let you browse your Chrome browser history, which is conveniently stored in a SQLite database.

Got a SQLite database you want to share with the world? Provided you have Zeit Now set up on your machine, you can publish one or more databases with a single command:

datasette publish now my-database.db

The above command will whir away for about a minute and then spit out a URL to a hosted version of datasette with your database (or databases) ready to go. This is how I’m hosting the fivethirtyeight and parlgov example datasets, albeit on a custom domain behind a Cloudflare cache.

The datasette API

Everything datasette can do is driven by URLs. Queries can produce responsive HTML pages (I’m using a variant of this responsive tables pattern for smaller screens) or with the .json or .jsono extension can produce JSON. All JSON responses are served with an Access-Control-Allow-Origin: * HTTP header, meaning you can query them from any page.

You can try that right now in your browser’s developer console. Navigate to http://www.example.com/ and enter the following in the console:

fetch(
    "https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9/avengers%2Favengers.jsono"
).then(
    r => r.json()
).then(data => console.log(
    JSON.stringify(data.rows[0], null, '  ')
))

You’ll see the following:

{
  "rowid": 1,
  "URL": "http://marvel.wikia.com/Henry_Pym_(Earth-616)",
  "Name/Alias": "Henry Jonathan \"Hank\" Pym",
  "Appearances": 1269,
  "Gender": "MALE",
  "Full/Reserve Avengers Intro": "Sep-63",
  "Year": 1963,
  "Years since joining": 52,
  ...
}

Since the API sits behind Cloudflare with a year-long cache expiry header, responses to any query like this should be lightning-fast.

Datasette supports a limited form of filtering based on URL parameters, inspired by Django’s ORM. Here’s an example: by appending ?CLOUDS=1&MOUNTAINS=1&BUSHES=1 to the FiveThirtyEight dataset of episodes of Bob Ross’ The Joy of Painting we can see every episode in which Bob paints clouds, bushes AND mountains:

https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9/bob-ross%2Felements-by-episode?CLOUDS=1&MOUNTAINS=1&BUSHES=1

And here’s the same episode list as JSON.

Arbitrary SQL

The most exciting feature of datasette is that it allows users to execute arbitrary SQL queries against the database. Here’s a convoluted Bob Ross example, returning a count for each of the items that can appear in a painting.

Datasette has a number of limitations in place here: it cuts off any SQL queries that take longer than a threshold (defaulting to 1000ms) and it refuses to return more than 1,000 rows at a time—partly to avoid too much JSON serialization overhead.
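
Those limits are enforced inside Datasette itself, but the general technique is easy to approximate with Python’s sqlite3 module. This sketch uses a progress handler to abort queries that run past a time budget (an illustration of the approach, not Datasette’s actual implementation):

# Sketch: aborting SQLite queries that exceed a time budget, similar in spirit
# to the 1000ms cutoff described above. Illustrative only.
import sqlite3
import time

def query_with_timeout(db_path, sql, params=(), timeout_ms=1000):
    conn = sqlite3.connect(db_path)
    deadline = time.monotonic() + timeout_ms / 1000

    def abort_if_over_budget():
        # Returning a non-zero value from the progress handler interrupts the query,
        # which surfaces as a sqlite3.OperationalError in the calling code.
        return 1 if time.monotonic() > deadline else 0

    # The handler runs every 1,000 SQLite virtual machine instructions.
    conn.set_progress_handler(abort_if_over_budget, 1000)
    try:
        return conn.execute(sql, params).fetchmany(1000)  # cap at 1,000 rows, mirroring Datasette's limit
    finally:
        conn.close()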

Datasette also blocks queries containing the string PRAGMA, since these statements could be used to modify database settings at runtime. If you need to include PRAGMA in an argument to a query you can do so by constructing a prepared statement:

select * from [twitter-ratio/senators] where "text" like :q

You can then construct a URL that incorporates both the SQL and a value for that named argument, like this: https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9?sql=select+rowid%2C+*+from+[twitter-ratio%2Fsenators]+where+%22text%22+like+%3Aq&q=%25pragmatic%25—which returns tweets by US senators that include the word “pragmatic”.
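
Building that kind of URL by hand is fiddly, but it is just standard querystring encoding. A minimal sketch, using the parameter names from the example above:

# Sketch: constructing a Datasette URL that combines a SQL query with a value
# for its :q named parameter, as in the senators example above.
from urllib.parse import urlencode

sql = 'select * from [twitter-ratio/senators] where "text" like :q'
query_string = urlencode({"sql": sql, "q": "%pragmatic%"})
print("https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9?" + query_string)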

Why an immutable API?

A key feature of datasette is that the API it provides is very deliberately read-only. This provides a number of interesting benefits:

Implementation notes

Datasette is built on top of the Sanic asynchronous Python web framework (see my previous notes), and makes extensive use of Python 3’s async/await statements. Since SQLite doesn’t yet have an async Python module all interactions with SQLite are handled inside a thread pool managed by a concurrent.futures.ThreadPoolExecutor.
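
Here is a minimal sketch of that pattern: blocking sqlite3 calls run inside a ThreadPoolExecutor so the event loop stays free. This illustrates the general technique rather than Datasette’s exact code:

# Sketch: the async-framework-plus-thread-pool pattern described above.
# Illustrative only; Datasette's own implementation is more involved.
import asyncio
import sqlite3
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=3)

def run_query(db_path, sql, params=()):
    # A connection per call keeps things simple; connections are not shared across threads.
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()

async def execute(db_path, sql, params=()):
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(executor, run_query, db_path, sql, params)

# Example usage:
#   rows = asyncio.get_event_loop().run_until_complete(
#       execute("timezones.db", "select count(*) from timezones"))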

The CLI is implemented using the Click framework. This is the first time I’ve used Click and it was an absolute joy to work with. I enjoyed it so much I turned one of my Jupyter notebooks into a Click script called csvs-to-sqlite and published it to PyPI.
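
If you have not used Click before, a minimal command looks something like this (an illustrative sketch, not the actual csvs-to-sqlite source):

# Sketch: the general shape of a Click-based CLI such as csvs-to-sqlite.
# The options shown here are illustrative, not the real tool's interface.
import click

@click.command()
@click.argument("csv_paths", nargs=-1, required=True)
@click.argument("db_path")
@click.option("-c", "--extract-column", multiple=True,
              help="Column to extract into a separate lookup table")
def cli(csv_paths, db_path, extract_column):
    """Convert one or more CSV files into a SQLite database."""
    click.echo("Would write {} CSV file(s) to {}".format(len(csv_paths), db_path))
    for column in extract_column:
        click.echo("Would extract column: {}".format(column))

if __name__ == "__main__":
    cli()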

This post is being discussed on Hacker News.

Elsewhere

14th January 2018

  • A SIM Switch Account Takeover (Mine). Someone walked into a T-Mobile store with a fake ID in his name and stole Albert Wenger’s SIM identity, then used it to gain access to his Yahoo mail account, reset his Twitter password and post a tweet boosting a specific cryptocurrency. His accounts with Google Authenticator 2FA stayed safe. #
  • How the industry-breaking Spectre bug stayed secret for seven months. It’s pretty amazing that the bug only became public knowledge a week before the intended embargo date, considering the number of individuals and companies that had to be looped in. The biggest public clues were patches being applied in public to the Linux kernel—one smart observer noted that the page table issue “has all the markings of a security patch being readied under pressure from a deadline.” #

13th January 2018

  • Telling stories through your commits. Joel Chippendale’s excellent guide to writing a useful commit history. I spend a lot of time on my commit messages, because when I’m trying to understand code later on they are the only form of documentation that is guaranteed to remain up-to-date against the code at that exact point in time. These tips are clear, concise and readable, and include some great examples. #
  • Notes on Kafka in Python. Useful review by Matthew Rocklin of the three main open source Python Kafka client libraries as of October 2017. #

11th January 2018

  • Incident report: npm. Fascinating insight into the challenges involved in managing a massive scale community code repository. An algorithm incorrectly labeled a legit user as spam, an NPM staff member acted on the report, dependent package installations started failing and because the package had been removed as spam other users were able to try and fix the bug by publishing fresh copies of the missing package to the same namespace. #

10th January 2018

  • How to compile and run the SQLite JSON1 extension on OS X. Thanks, Stack Overflow! I’ve been battling this one for a while—it turns out you can download the SQLite source bundle, compile just the json1.c file using gcc and load that extension in Python’s sqlite3 module (or with Datasette’s --load-extension= option) to gain access to the full suite of SQLite JSON functions—json(), json_extract() etc. #

8th January 2018

  • csvkit. “A suite of command-line tools for converting to and working with CSV”—includes a huge range of utilities for things like converting Excel and JSON to CSV, grepping, sorting and extracting a subset of columns, combining multiple CSV files together and exporting CSV to a relational database. Worth reading through the tutorial which shows how the different commands can be piped together. #
  • [On Meltdown’s impact on hosting costs] The reality is that we have been living with borrowed performance. The new reality is that security is too important and can not be exchanged for speed. Time to profile, tune and optimize.

    Miguel de Icaza‏ #

  • Statistical NLP on OpenStreetMap. libpostal is ferociously clever: it’s a library for parsing and understanding worldwide addresses, built on top of a machine learning model trained on millions of addresses from OpenStreetMap. Al Barrentine describes how it works in this fascinating and detailed essay. #
  • Himalayan Database: From Visual FoxPro GUI to JSON API with Datasette (via) The Himalayan Database is a compilation of records for all expeditions that have climbed in the Nepalese Himalaya, originally compiled by journalist Elizabeth Hawley over several decades. The database is published as a Visual FoxPro database—here Raffaele Messuti‏ provides step-by-step instructions for extracting the data from the published archive, converting them to CSV using dbfcsv and then converting the CSVs to SQLite using csvs-to-sqlite so you can browse them using Datasette. #

2nd January 2018

  • Most infosec bugs are really boring after a while. But processor ones are always crazy and fascinating because processors are basically a hornet’s nest of witchcraft and mayhem stacked on top of each other all the way down.

    Matt Tait #

29th December 2017

  • Frontend in 2017: The important parts. Keeping track of developments in the frontend and JavaScript community is pretty much a full-time job these days, so I found this summary of trends and developments over 2017 very useful for trying to catch up. #

28th December 2017