Simon Willison’s Weblog

On mysql, docker, django, sqlite, visualization, ...

 

Recent entries

Datasette Facets two months ago

Datasette 0.22 is out with the most significant new feature I’ve added since the initial release: faceted browse.

Datasette lets you deploy an instant web UI and JSON API for any SQLite database. csvs-to-sqlite makes it easy to create a SQLite database out of any collection of CSV files. Datasette Publish is a web app that can run these combined tools against CSV files you upload from your browser. And now the new Datasette Facets feature lets you explore any CSV file using faceted navigation with a couple of clicks.

Exploring characters from Marvel comics

Let’s use facets to explore every character in the Marvel Universe.

FiveThirtyEight have published a CSV file of 16,376 characters from Marvel comics, scraped from Wikia as part of the research for their 2014 story Comic Books Are Still Made By Men, For Men And About Men.

Here’s that CSV file loaded into the latest version of Datasette:

Marvel characters explored using Datasette Facets

We start by applying the identity status, alignment and gender facets. Then we filter down to just the bad characters with a public identity, and apply the eye colour facet. Now we can filter to just the 20 bad characters with a public identity and yellow eyes.

At each stage along the way we could see numerical summaries of the other facets. That’s a pretty sophisticated piece of analysis we’ve been able to run with just a few clicks (and it works responsively on mobile as well).

I’ve published a full copy of everything else in the FiveThirtyEight data repository, which means you can find plenty more examples of facets in action at https://fivethirtyeight.datasettes.com/—one example: Actions under the Antiquities Act, faceted by states, pres_or_congress, action and current_agency.

Analyzing GSA IT Standards with Datasette Publish

The US government’s General Services Administration have a GitHub account, and they use it to publish a repository of assorted data as CSVs.

Let’s take one of those CSVs and analyze it with Datasette Facets, using the Datasette Publish web app to upload and process the CSV.

We’ll start with the it-standards.csv file, downloaded from their repo. We’ll upload it to Datasette Publish and add some associated metadata:

GSA IT standards published using Datasette Publish

Here’s the result, with the Status and Deployment Type facets applied. And here’s a query showing just SaaS tools with status Approved—Preferred.

European Power Stations

The Open Power System Data project publishes data about electricity systems. They publish data in a number of formats, including SQLite databases. Let’s take their conventional_power_plants.sqlite file and explore it with Datasette. With Datasette installed, run the following commands in your terminal:

wget https://data.open-power-system-data.org/conventional_power_plants/2018-02-27/conventional_power_plants.sqlite
datasette conventional_power_plants.sqlite

This will start Datasette running at http://127.0.0.1:8001/ ready for you to explore the data.

Next we can publish the SQLite database directly to the internet using the datasette publish command-line tool:

$ datasette publish now conventional_power_plants.sqlite \
    --source="Open Power System Data. 2018. Data Package Conventional power plants. Version 2018-02-27" \
    --source_url="https://data.open-power-system-data.org/conventional_power_plants/2018-02-27/" \
    --title="Conventional power plants" \
    --branch=master
> Deploying /private/var/folders/jj/fngnv0810tn2lt_kd3911pdc0000gp/T/tmpufvxrzgp/datasette under simonw
> https://datasette-tgngfjddix.now.sh [in clipboard] (sfo1) [11s]
> Synced 3 files (1.28MB) [11s]
> Building…
> ▲ docker build
Sending build context to Docker daemon 1.343MB
> Step 1/7 : FROM python:3
> 3: Pulling from library/python
> 3d77ce4481b1: Already exists
> 534514c83d69: Already exists
...
> Successfully built da7ac223e8aa
> Successfully tagged registry.now.systems/now/3d6d318f0da06d3ea1bc97417c7dc484aaac9026:latest
> ▲ Storing image
> Build completed
> Verifying instantiation in sfo1
> [0] Serve! files=('conventional_power_plants.sqlite',) on port 8001
> [0] [2018-05-20 22:51:51 +0000] [1] [INFO] Goin' Fast @ http://0.0.0.0:8001
> [0] [2018-05-20 22:51:51 +0000] [1] [INFO] Starting worker [1]

Finally, let’s give it a nicer URL using now alias:

 now alias https://datasette-tgngfjddix.now.sh conventional-power-plants.now.sh

The result can now be seen at https://conventional-power-plants.now.sh/

Here’s every conventional power plant in Europe faceted by country, energy source and technology.

Power Plant Facets

Implementation notes

I love faceted search engines. One of my first approaches to understanding any new large dataset has long been to throw it into a faceted search engine and see what comes out. In the past I’ve built them using Solr, Elasticsearch, PostgreSQL and even Whoosh. I guess it was inevitable that I’d try to build one with SQLite.

You can follow the development of Datasette Facets in the now-closed issue #255 on GitHub.

Facets are requested by appending one or more ?_facet=colname parameters to the URL. This causes Datasette to run the following SQL query for each of those specified columns:

select colname as value, count(*) as count
from tablename where (current where clauses)
group by colname order by count desc limit 31
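
To make that concrete, here is a minimal sketch (not Datasette’s actual code) of running such a facet query with Python’s sqlite3 module. The table, column and where clause are placeholders you would substitute yourself:

import sqlite3


def facet_counts(db_path, table, column, where_clause="1 = 1", params=()):
    # Table and column names cannot be passed as SQL parameters, so they are
    # interpolated directly; in real code they must come from a trusted source.
    sql = (
        'select "{col}" as value, count(*) as count '
        'from "{table}" where {where} '
        'group by "{col}" order by count desc limit 31'
    ).format(col=column, table=table, where=where_clause)
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


# Hypothetical example (table and column names are illustrative only):
# facet_counts("marvel.db", "characters", "ALIGN", '"ID" = ?', ("Public Identity",))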

For large tables, this could get expensive. Datasette supports time limits for SQLite queries, and facets are given up to 200ms (by default, this limit can be customized) to finish executing. If the query doesn’t complete in the given time the user sees a warning that the facet could not be displayed.

We ask for 31 values in the limit clause even though we only display 30. This lets us detect if there are more values available and show a ... indicator to let the user know that the facets were truncated.
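
Datasette’s own implementation lives in its codebase and may differ in detail, but here is a rough sketch of one way to enforce that kind of limit with Python’s sqlite3 module, using the progress handler to interrupt a long-running query:

import sqlite3
import time


def execute_with_time_limit(conn, sql, params=(), ms=200):
    # SQLite calls the progress handler every N virtual machine opcodes;
    # a truthy return value interrupts the running query.
    deadline = time.monotonic() + ms / 1000
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.OperationalError:
        return None  # interrupted: the facet is skipped with a warning
    finally:
        conn.set_progress_handler(None, 1000)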

Datasette also suggests facets that you might want to apply. This is implemented using another query, this time run against every column that is not yet being used as a facet. If a table has 20 columns this means 20 queries, so they run with an even tighter 50ms time limit. The query looks like this:

select distinct colname
from tablename where (current where clauses)
limit 31

All we are doing here is trying to determine whether the column in question has 30 or fewer unique values. The limit clause means that if you run this query against a column where every value is distinct (the primary key, for example) the query will terminate extremely quickly—after it has found just the first 31 values.

Once the query has executed, we count the distinct values and check to see if this column, when used as a facet:

  • Will return 30 or fewer unique options
  • Will return more than one unique option
  • Will return fewer unique options than the current total number of filtered rows

If the query takes longer than 50ms we terminate it and do not suggest that column as a potential facet.
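
Putting those checks together, a sketch of the suggestion heuristic might look like this (the function name and arguments are assumptions, not Datasette’s internals). In practice the query would also be wrapped in the 50ms time limit described above:

def suggest_facet(conn, table, column, where_clause, params, filtered_row_count):
    # Sketch only: count distinct values (up to 31) and apply the three checks.
    sql = (
        'select distinct "{col}" from "{table}" where {where} limit 31'
    ).format(col=column, table=table, where=where_clause)
    values = conn.execute(sql, params).fetchall()
    num_distinct = len(values)
    return (
        num_distinct <= 30                      # 30 or fewer unique options
        and num_distinct > 1                    # more than one unique option
        and num_distinct < filtered_row_count   # fewer options than filtered rows
    )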

Facets via JSON

As with everything in Datasette, the facets you can view in your browser are also available as part of the JSON API (which ships with CORS headers so you can easily fetch data from JavaScript running in a browser on any web page).

To get back JSON, add .json to the path (before the first ?). Here’s that power plants example returned as JSON: https://conventional-power-plants.now.sh/conventional_power_plants-e3c301c/conventional_power_plants_EU.json?_facet=country&_facet=energy_source&_facet=technology
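
A few lines of Python are enough to pull that JSON down and inspect it. The exact keys in the response may vary between Datasette versions, so this sketch just prints whatever comes back:

import json
from urllib.request import urlopen

# Fetch the faceted power plants data as JSON and list the top-level keys.
url = (
    "https://conventional-power-plants.now.sh/"
    "conventional_power_plants-e3c301c/conventional_power_plants_EU.json"
    "?_facet=country&_facet=energy_source&_facet=technology"
)
with urlopen(url) as response:
    data = json.load(response)
print(sorted(data.keys()))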

Please let me know if you build something interesting with Datasette Facets!

Notes from my appearance on the Changelog podcast two months ago

After I spoke at Zeit Day SF last weekend I sat down with Adam Stacoviak to record a 25 minute segment for episode 296 of the Changelog podcast, talking about Datasette. We covered a lot of ground!

A transcript should be available soon (the Changelog have excellent transcripts) but in the meantime here are my own notes on the topics that we covered.

My section starts at 54:45 in the episode.

From the intro

Datasette was inspired by my work at the Guardian with Simon Rogers, where we started the Guardian Data Blog.

Here’s Datasette on GitHub, and Datasette: instantly create and publish an API for your SQLite databases where I first announced the project.

My sf-trees.com website, which is powered by this Datasette API using data from the San Francisco Department of Public Works.

The hosting platform I use for immutable deployments of Docker containers is Zeit Now.

The Datasette tool suite

Here’s the full Datasette suite of tools:

  • csvs-to-sqlite is a command-line tool for converting CSV files into a SQLite database.
  • Datasette is a web application for exploring SQLite databases and interacting with them via a read-only JSON API. It also includes the datasette publish command-line tool for publishing those databases using Zeit Now or Heroku.
  • Datasette Publish is a web app that lets you upload CSV files and deploy them to a new Datasette instance without installing any software. Here’s my announcement post.

We talked a lot about SQLite. Here’s a document describing How SQLite is tested and a bunch more about SQLite from my blog.

Datasette examples

A few more fun examples of datasettes that we discussed:

Tools for manipulating data

I’m a huge fan of the combination of Jupyter notebooks and Pandas. Here’s a collection of notes on jupyter and pandas from my blog.

Contributing to Datasette

The best thing you can do is use Datasette to publish your data and then give me feedback!

If you want to get involved in the code, take a look at the help wanted label on GitHub or have a look at the open issues for csvs-to-sqlite.

Even better: write plugins! Datasette’s plugin architecture is just getting started so the more experience and feedback we can get with it the better.

And finally… I mentioned US Navy Airship and Zeppelin photos. I also have a Pinterest Board.

Exploring the UK Register of Members Interests with SQL and Datasette two months ago

Ever wondered which UK Members of Parliament get gifted the most helicopter rides? How about which MPs have been given Christmas hampers by the Sultan of Brunei? (David Cameron, William Hague and Michael Howard apparently). Here’s how to dig through the Register of Members Interests using SQL and Datasette.

Gifts from the Sultan

mySociety have been building incredible civic participation applications like TheyWorkForYou and FixMyStreet for nearly 15 years now, and have accumulated all kinds of interesting data along the way.

They recently launched their own data portal at data.mysociety.org listing all of the information they have available. While exploring it I stumbled across their copy of the UK Register of Members Interests. Every UK Member of Parliament has to register their conflicts of interest and income sources, and mySociety have an ongoing project to parse that data into a more useful format.

It won’t surprise you to hear that I couldn’t resist turning their XML files into a SQLite database.

The result is register-of-members-interests.datasettes.com—a Datasette instance running against a SQLite database containing over 1.3 million line-items registered by 1,419 MPs over the course of 18 years.

Some fun queries

A few of my favourites so far:

Understanding the data model

Most of the action takes place in the items table, where each item is a line-item from an MP’s filing. You can search that table by keyword (see helicopter example above) or apply filters to it using the standard Datasette interface. You can also execute your own SQL directly against the database.

Each item is filed against a category. There appears to have been quite a bit of churn in the way that the categories are defined over the years, plus the data is pretty untidy—there are no less than 10 ways of spelling “Remunerated employment, office, profession etc.” for example!

Categories
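
If you download a copy of the SQLite database you can surface that churn directly. This sketch assumes a categories table with a name column and a hypothetical regmem.db filename:

import sqlite3

# Hypothetical local copy of the register database; the schema is assumed to
# include a categories table with a name column.
conn = sqlite3.connect("regmem.db")
for (name,) in conn.execute(
    "select distinct name from categories "
    "where name like 'Remunerated employment%' order by name"
):
    print(name)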

There are also a LOT of duplicate items in the set—it appears that MPs frequently list the same item (a rental property for example) every time they fill out the register. SQL DISTINCT clauses can help filter through these, as seen in some of the above examples.

The data also has the concepts of both members and people. As far as I can tell, people are distinct, but members may contain duplicates—presumably to represent MPs who have served more than one term in office. It looks like the member field stopped being populated in March 2015, so analysis is best performed against the people table.

One concept I have introduced myself is the record_id. In the XML documents the items are often grouped together into a related collection, like this:

<regmem personid="uk.org.publicwhip/person/10001"
    memberid="uk.org.publicwhip/member/40289" membername="Diane Abbott" date="2014-07-14">
    <category type="2" name="Remunerated employment, office, profession etc">
        <item>Payments from MRL Public Sector Consultants, Pepple House, 8 Broad Street, Great Cambourne, Cambridge CB23 6HJ:</item>
        <item>26 November 2013, I received a fee of £1,000 for speaking at the 1st African Legislative Summit, National Assembly, Abuja, Nigeria.  Hours: 8 hrs. The cost of my flights, transfers and hotel accommodation in Abuja were also met; estimated value £5,000. <em>(Registered 3 December 2013)</em></item>
        <item>23 July 2013, I received a fee of £5,000 for appearing as a contestant on ITV&#8217;s &#8216;The Chase Celebrity &#8211; Series 3&#8217; television programme.  Address of payer:  ITV Studios Ltd, London Television Centre, Upper Ground, London SE1 9Lt.  Hours: 12 hrs.   <em>(Registered 23 July 2013)</em></item>
    </category>
</regmem>

While these items are presented as separate line items, their grouping carries meaning: the first line item here acts as a kind of heading to help provide context to the other items.

To model this in the simplest way possible, I’ve attempted to preserve the order of these groups using a pair of additional columns: the record_id and the sort_order. I construct the record_id using a collection of other fields—the idea is for it to be sortable, and for each line-item in the same grouping to have the same record_id:

record_id = "{date}-{category_id}-{person_id}-{record}".format(
    date=date,
    category_id=category_id,
    person_id=person_id.split("/")[
        -1
    ],
    record=record,
)

The resulting record_id might look like this: 2018-04-16-70b64e89-24878-0

To recreate that particular sequence of line-items, you can search for all items matching that record_id and then sort them by their sort_order. Here’s that record from Diane Abbott shown with its surrounding context.

A single record
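
Against a local copy of the database, the same reconstruction is a single query using the record_id and sort_order columns described above (the regmem.db filename here is hypothetical):

import sqlite3

# Pull every line-item in a grouping, in its original order, using the
# example record_id shown above.
conn = sqlite3.connect("regmem.db")
rows = conn.execute(
    "select item from items where record_id = ? order by sort_order",
    ("2018-04-16-70b64e89-24878-0",),
).fetchall()
for (item,) in rows:
    print(item)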

How I built it

The short version: I downloaded all of the XML files and wrote a Python script which parsed them using ElementTree and inserted them into a SQLite database. I’ve put the code on GitHub.

A couple of fun tricks: firstly, I borrowed some code from csvs-to-sqlite to create the full-text search index and enable searching:

def create_and_populate_fts(conn):
    create_sql = """
        CREATE VIRTUAL TABLE "items_fts"
        USING {fts_version} (item, person_name, content="items")
    """.format(
        fts_version=best_fts_version()
    )
    conn.executescript(create_sql)
    conn.executescript(
        """
        INSERT INTO "items_fts" (rowid, item, person_name)
        SELECT items.rowid, items.item, people.name
        FROM items LEFT JOIN people ON items.person_id = people.id
    """
    )

The best_fts_version() function implements basic feature detection against SQLite by trying operations in an in-memory database.
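
The real function ships with csvs-to-sqlite, but the feature-detection idea is simple enough to sketch: try each full-text search module in turn against an in-memory database and use the newest one that works.

import sqlite3


def best_fts_version():
    # A sketch of the idea only; the real function in csvs-to-sqlite may
    # differ in detail.
    conn = sqlite3.connect(":memory:")
    for fts in ("FTS5", "FTS4", "FTS3"):
        try:
            conn.execute("CREATE VIRTUAL TABLE fts_test USING {}(content)".format(fts))
            conn.execute("DROP TABLE fts_test")
            return fts
        except sqlite3.OperationalError:
            continue
    return None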

Secondly, I ended up writing my own tiny utility function for inserting records into SQLite. SQLite has a useful INSERT OR REPLACE INTO syntax, which allows you to insert a record and will automatically update an existing record if there is a match on the primary key. This meant I could write this utility function and use it for all of my data inserts:

def insert_or_replace(conn, table, record):
    pairs = record.items()
    columns = [p[0] for p in pairs]
    params = [p[1] for p in pairs]
    sql = "INSERT OR REPLACE INTO {table} ({column_list}) VALUES ({value_list});".format(
        table=table,
        column_list=", ".join(columns),
        value_list=", ".join(["?" for p in params]),
    )
    conn.execute(sql, params)

# ...

insert_or_replace(
    db,
    "people",
    {
        "id": person_id,
        "name": regmem_el.attrib["membername"],
    },
)

What can you find?

I’ve really only scratched the surface of what’s in here with my initial queries. What can you find? Send me Datasette query links on Twitter with your discoveries!

Datasette plugins, and building a clustered map visualization three months ago

Datasette now supports plugins!

Last Saturday I asked Twitter for examples of Python projects with successful plugin ecosystems. pytest was the clear winner: the pytest plugin compatibility table (an ingenious innovation that I would love to eventually copy for Datasette) lists 457 plugins, and even the core pytest system itself is built as a collection of default plugins that can be replaced or overridden.

Best of all: pytest’s plugin mechanism is available as a separate package: pluggy. And pluggy was exactly what I needed for Datasette.

You can follow the ongoing development of the feature in issue #14. This morning I released Datasette 0.20 with support for a number of different plugin hooks: plugins can add custom template tags and SQL functions, and can also bundle their own static assets, JavaScript, CSS and templates. The hooks are described in some detail in the Datasette Plugins documentation.

datasette-cluster-map

I also released my first plugin: datasette-cluster-map. Once installed, it looks out for database tables that have a latitude and longitude column. When it finds them, it draws all of the points on an interactive map using Leaflet and Leaflet.markercluster.

Let’s try it out on some polar bears!

Polar Bears on a cluster map

The USGS Alaska Science Center have released a delightful set of data entitled Sensor and Location data from Ear Tag PTTs Deployed on Polar Bears in the Southern Beaufort Sea 2009 to 2011. It’s a collection of CSV files, which means it’s trivial to convert it to SQLite using my csvs-to-sqlite tool.

Having created the SQLite database, we can deploy it to a hosting account on Zeit Now alongside the new plugin like this:

# Make sure we have the latest datasette
pip3 install datasette --upgrade
# Deploy polar-bears.db to now with an increased default page_size
datasette publish now \
    --install=datasette-cluster-map \
    --extra-options "--page_size=500" \
    polar-bears.db

The --install option is new in Datasette 0.20 (it works for datasette publish heroku as well)—it tells the publishing provider to pip install the specified package. You can use it more than once to install multiple plugins, and it accepts a path to a zip file in addition to the name of a PyPI package.

Explore the full demo at https://datasette-cluster-map-demo.now.sh/polar-bears

Visualize any query on a map

Since the plugin inserts itself at the top of any Datasette table view with latitude and longitude columns, there are all sorts of neat tricks you can do with it.

I also loaded the San Francisco tree list (thanks, Department of Public Works) into the demo. Impressively, you can click “load all” on this page and Leaflet.markercluster will load in all 189,144 points and display them on the same map… and it works fine on my laptop and my phone. Computers in 2018 are pretty good!

But since it’s a Datasette table, we can filter it. Here’s a map of every New Zealand Xmas Tree in San Francisco (8,683 points). Here’s every tree where the Caretaker is Friends of the Urban Forest. Here’s every palm tree planted in 1990:

Palm trees planted in 1990

Update: This is an incorrect example: there are 21 matches on “palm avenue” because the FTS search index covers the address field—they’re not actually palm trees. Here’s a corrected query for palm trees planted in 1990.

The plugin currently only works against columns called latitude and longitude… but if your columns are called something else, don’t worry: you can craft a custom SQL query that aliases your columns and everything will work as intended. Here’s an example against some more polar bear data:

select *, "Capture Latitude" as latitude, "Capture Longitude" as longitude
from [USGS_WC_eartag_deployments_2009-2011]

Writing your own plugins

I’m really excited to see what people invent. If you want to have a go, your first stop should be the Plugins documentation. If you want an example of a simple plugin (including the all-important mechanism for packaging it up using setup.py) take a look at datasette-cluster-map on GitHub.
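
To give a flavour of the API, here is a minimal sketch of a plugin that uses the prepare_connection hook to register a custom SQL function. Treat it as illustrative and check the Plugins documentation for the current list of hooks, since the mechanism is still evolving:

from datasette import hookimpl


@hookimpl
def prepare_connection(conn):
    # Register a custom SQL function that becomes available in queries
    # Datasette runs against the database, for example:
    #   select reverse_string(name) from people
    conn.create_function("reverse_string", 1, lambda s: s[::-1] if s else s)

The datasette-cluster-map repository linked above shows how to package a hook like this with setup.py so it can be installed with pip or via the --install option described earlier.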

And if you have any thoughts, ideas or suggestions on how the plugin mechanism can be further employed please join the conversation on issue #14. I’ve literally just got started with Datasette’s plugin hooks, and I’m very keen to hear about things people want to build that aren’t yet supported.

Elsewhere

17th July 2018

  • Teaching Tech Together (via) A new free book compiled by Greg Wilson about teaching programming, based on the instructor training program developed for the Software Carpentry course over the past six years. This is fantastic—it’s filling in a whole load of holes in my understanding of effective teaching theory. #
  • Interviewing a developer for whom English wasn’t his first language and he kept calling legacy code “legendary code” and now that’s all I want to write.

    Mark Norman Francis #

  • ES6 In Depth: Template strings. Useful guide to the advanced features of backtick template strings in ES6. #
  • Feature comparison between hyperHTML and lit-html. Compiled by hyperHTML author Andrea Giammarchi. lit-html is a similar project maintained by Google’s Polymer team. #
  • hyperHTML (via) I’ve been playing around with multiline backtick template literals in JavaScript recently, and Claudio Cicali on Twitter pointed me to hyperHTML. It’s a fascinating project: it implements an alternative to React’s virtual DOM in just 4.6KB of code, taking advantage of the way tagged template literals can call a custom function which gets to affect how the interpolated values from the template are treated. The documentation does a great job of describing the various clever tricks that are used under the hood, in particular how it takes advantage of the way tagged literals “freeze” the incoming array of literal chunks, allowing hyperHTML to compile each template once and reuse the compiled version for different interpolations. #

16th July 2018

  • Bowiebranchia. I spent the weekend learning about Nudibranchs, which are beautiful sea slugs (common on the coast of California) which are definitely best explained by their resemblance to different eras of David Bowie. #

13th July 2018

  • XARs: An efficient system for self-contained executables (via) Really interesting new open source project from Facebook: a XAR is a new way of packaging up a Python executable complete with its dependencies and resources such that it can be distributed and executed elsewhere as a single file. It’s kind of like a Docker container without Docker—it uses the SquashFS compressed read-only filesystem. I can’t wait to try this out with Datasette. #
  • future-fstrings (via) Clever module that backports f-strings to versions of Python earlier than 3.6, by registering itself as a codec and abusing Python’s # -*- coding: future_fstrings -*- feature. Via a conversation on Twitter that pointed out that the JavaScript community have been using transpilation to successfully experiment with new language features for years now. #

12th July 2018

  • The Now CDN (via) Huge announcement from Zeit Now today: all .now.sh deployments are now served through the Cloudflare CDN, which means they benefit from 150 worldwide CDN locations that obey HTTP caching headers. This is particularly relevant for Datasette, since it serves far-future cache headers by default and uses Cloudflare-compatible HTTP/2 push hints to accelerate 302 redirects. This means that both the “datasette publish now” CLI command and the Datasette Publish web app will now result in Cloudflare-accelerated deployments. #
  • Usage of ARIA attributes via HTTP Archive. A neat example of a Google BigQuery query you can run against the HTTP Archive public dataset (a crawl of the “top” websites run periodically by the Internet Archive, which captures the full details of every resource fetched) to see which ARIA attributes are used the most often. Linking to this because I used it successfully today as the basis for my own custom query—I love that it’s possible to analyze a huge representative sample of the modern web in this way. #

10th July 2018

  • scrapely. Neat twist on a screen scraping library: this one lets you “train it” by feeding it examples of URLs paired with a dictionary of the data you would like to have extracted from that URL, then uses an instance-based learning algorithm to run against new URLs. Slightly confusing name since it’s maintained by the scrapy team but is a totally independent project from the scrapy web crawling framework. #

6th July 2018

  • Over the last twenty years, publishing systems for content on [BBC] News pages have come and gone, having been replaced or made obsolete. Although newer content is published through dynamic web applications that can be readily modified, what lies beneath this sometimes resembles layers of sedimentary rock.

    James Donohue #

3rd July 2018

  • react-from-zero (via) Interesting approach to teaching and understanding React: unlike most other tutorials this skips Webpack and ES6 entirely and focuses on things you can get running just using a browser and loading code via script tags. It does eventually load Babel to enable client-side JSX transforms, but before that it shows how React can be used by loading react.js and react-dom.js and then calling React.createElement() manually (or by using the 0xeac7 magic symbol and constructing JavaScript objects manually with $$typeof: magicValue). #

2nd July 2018

  • Our provisioning tools for developer environments broke and no one knew how to fix them, so we reassigned new hires the zombie VMs of recently departed coworkers.

    Will Larson #

  • Digg's v4 launch: an optimism born of necessity. Riveting behind-the-scenes story of the disastrous Digg V4 launch by former Digg engineer Will Larson. #

29th June 2018

  • datasette-vega (via) I wrote a visualization plugin for Datasette that uses the excellent Vega “visualization grammar” library to provide bar, line and scatter charts configurable against any Datasette table or SQL query. #

27th June 2018

  • Migrating Messenger storage to optimize performance (via) Fascinating case-study of a truly gargantuan migration. Messenger has over a billion users, and Facebook successfully migrated its backend storage from HBase to their MyRocks database (a fork of MySQL with a storage engine built on their SSD-optimized RocksDB key/value library) without any user-visible downtime. They ended up using two migration paths: one for the 99.9% of regular accounts, and a separate path for extremely high volume accounts (businesses with very active chat bots or support systems). #