Simon Willison’s Weblog


Recent entries

Datasette Demo (video) from the SF Python Meetup 23 days ago

I gave a short talk about Datasette last month at the SF Python Meetup Holiday Party. They’ve just posted the video, so here it is:

I showed how I built San Francisco Tree Search using Datasette, csvs-to-sqlite and data from the San Francisco Department of Public Works.

Analyzing my Twitter followers with Datasette 24 days ago

I decided to do some ad-hoc analysis of my social network on Twitter this afternoon… and since everything is more fun if you bundle it up into a SQLite database and publish it to the internet, I performed the analysis using Datasette.

The end result

Here’s the Datasette database containing all of my Twitter followers: https://simonw-twitter-followers.now.sh/simonw-twitter-followers-b9bff3a

Much more interesting though are the queries I can now run against it. A few examples:

The thing I find most exciting about this use-case for Datasette is that it allows you to construct entire mini-applications using just a SQL query encoded in a URL. Type queries into the textarea, iterate on them until they do something useful, add some :named parameters (which generate form fields) and bookmark the resulting URL. It’s an incredibly powerful way to build custom interfaces for exploring data.
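
To make that concrete, here is a hedged sketch (not from the original post) of how such a bookmarkable query URL could be assembled in Python. The followers table and the screen_name, name and description columns exist in the published database; the query itself is just an illustration:

import urllib.parse

# Illustrative only: a query with a :search named parameter, encoded into a
# bookmarkable Datasette URL. Datasette renders :search as an HTML form field.
sql = """
select screen_name, name, description
from followers
where description like '%' || :search || '%'
"""
url = 'https://simonw-twitter-followers.now.sh/simonw-twitter-followers-b9bff3a?' + urllib.parse.urlencode({
    'sql': sql,
    'search': 'python',
})
print(url)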

The rest of this post will describe how I pulled the data from Twitter and turned it into a SQLite database for publication with Datasette.

Fetching my followers

To work with the Twitter API, we first need credentials. Twitter still mostly uses the OAuth 1 model of authentication which is infuriatingly complicated, requiring you to sign parameters using two pairs of keys and secrets. OAuth 2 mostly uses a single access token sent over TLS to avoid the signing pain, but Twitter’s API dates back to the times when API client libraries with robust TLS were not a safe assumption.

Since I have to re-figure out the Twitter API every few years, here’s how I got it working this time. I created a new Twitter app using the form on https://apps.twitter.com/ (which is surprisingly hard to find if you start out on the https://developer.twitter.com/ portal). Having created the app I navigated to the “Keys and Access Tokens” tab, scrolled down and clicked the “Create my access token” button. Then I grabbed the four magic tokens from the following spots on the page:

Twitter application setup

Now in Python I can make properly signed calls to the Twitter API like so:

from requests_oauthlib import OAuth1Session
twitter = OAuth1Session(
    client_key='...',
    client_secret='...',
    resource_owner_key='...',
    resource_owner_secret='...'
)
print(twitter.get(
    'https://api.twitter.com/1.1/users/show.json?screen_name=simonw'
).json())

The Twitter API has an endpoint for retrieving everyone who follows an account as a paginated JSON list: followers/list. At some point in the past few years Twitter got really stingy with their rate limits—most endpoints, including followers/list, only allow 15 requests every 15 minutes! You can request up to 200 followers at a time, but with 15,000 followers that meant the full fetch would take 75 minutes. So I set the following running in a Jupyter notebook and went for a walk with the dog.

from requests_oauthlib import OAuth1Session
import urllib.parse
import time

twitter = OAuth1Session(...)
url = 'https://api.twitter.com/1.1/followers/list.json'

def fetch_followers(cursor=-1):
    r = twitter.get(url + '?'+ urllib.parse.urlencode({
        'count': 200,
        'cursor': cursor
    }))
    return r.headers, r.json()

cursor = -1
users = []
while cursor:
    headers, body = fetch_followers(cursor)
    print(headers)
    users.extend(body['users'])
    print(len(users))
    cursor = body['next_cursor']
    time.sleep(70)

A couple of hours later I had a users list with 15,281 user dictionaries in it. I wrote that to disk for safe keeping:

import json
json.dump(users, open('twitter-followers.json', 'w'), indent=4)

Converting that JSON into a SQLite database

I wrote some notes on How to turn a list of JSON objects into a Datasette using Pandas a few weeks ago. This works really well, but we need to do a bit of cleanup first: Pandas prefers a list of flat dictionaries, but the Twitter API has given us back some nested structures.

I won’t do a line-by-line breakdown of it, but here’s the code I ended up using. The expand_entities() function replaces Twitter’s ugly t.co links with their expanded display_url alternatives—then clean_user() flattens a nested user into a simple dictionary:

def expand_entities(s, entities):
    for key, ents in entities.items():
        for ent in ents:
            if 'url' in ent:
                replacement = ent['expanded_url'] or ent['url']
                s = s.replace(ent['url'], replacement)
    return s

def clean_user(user):
    if user['description'] and 'description' in user['entities']:
        user['description'] = expand_entities(
            user['description'], user['entities']['description']
        )
    if user['url'] and 'url' in user['entities']:
        user['url'] = expand_entities(user['url'], user['entities']['url'])
    if 'entities' in user:
        del user['entities']
    if 'status' in user:
        del user['status']

for user in users:
    clean_user(user)

I now have a nice flat list of user dictionaries—a subset of which is provided here for illustration.

One additional step: SQLite’s built-in functions for handling date and time prefer ISO formatted timestamps, but previewing the DataFrame in Jupyter shows that the data I pulled from Twitter has dates in a different format altogether. I can fix this with a one-liner using the ever-handy dateutil library:

from dateutil.parser import parse
import pandas as pd
df = pd.DataFrame(users)
df['created_at'] = df['created_at'].apply(lambda s: parse(s).isoformat())

Here’s the before and after:

df.apply() illustrated

With the dates fixed and the dictionaries flattened, I can convert the DataFrame to a SQLite table like so:

import sqlite3
conn = sqlite3.connect('/tmp/followers.db')
df.to_sql('followers', conn)
conn.close()

Now I can run datasette /tmp/followers.db to preview what I’ve got so far.

Extracting columns and setting up full-text search

This all works fine, but it’s not quite the finished product I demonstrated above. My desired final state has two additional features: common values in the lang, location, time_zone and translator_type columns have been pulled out into lookup tables, and I’ve enabled SQLite full-text search against a subset of the columns.

Normally I would use the -c and -f arguments to my csvs-to-sqlite tool to do this (see my write-up here), but that tool only works against CSV files on disk. I want to work with an in-memory Pandas DataFrame.

So I reverse-engineered my own code and figured out how to apply the same transformations from an interactive Python prompt instead. It ended up looking like this:

from csvs_to_sqlite import utils

conn = sqlite3.connect('/tmp/simonw-twitter-followers.db')

# Define columns I want to refactor:
foreign_keys = {
    'time_zone': ('time_zone', 'value'),
    'translator_type': ('translator_type', 'value'),
    'location': ('location', 'value'),
    'lang': ('lang', 'value'),
}
new_frames = utils.refactor_dataframes(conn, [df], foreign_keys)

# Save my refactored DataFrame to SQLite
utils.to_sql_with_foreign_keys(
    conn, new_frames[0], 'followers',
    foreign_keys, None, index_fks=True
)

# Create the full-text search index across these columns:
fts = ['screen_name', 'description', 'name', 'location']
utils.generate_and_populate_fts(conn, ['followers'], fts, foreign_keys)

conn.close()

Final step: publishing with Datasette

Having run datasette /tmp/simonw-twitter-followers.db to confirm locally that I got the results I was looking for, the last step was to publish it to the internet. As always, I used Zeit Now via the datasette publish command for this final step:

tmp $ datasette publish now simonw-twitter-followers.db \
    --title="@simonw Twitter followers, 27 Jan 2018"
> Deploying /private/var/.../datasette under simonw
> Ready! https://datasette-cmpznehuku.now.sh (copied to clipboard) [14s]
> Synced 2 files (11.29MB) [0ms] 
> Initializing…
> Building
> ▲ docker build
Sending build context to Docker daemon 11.85MB
> Step 1 : FROM python:3
...
> Deployment complete!

Then I ran now alias to assign a permanent, more memorable URL:

 now alias https://datasette-cmpznehuku.now.sh simonw-twitter-followers.now.sh

Datasette Publish: a web app for publishing CSV files as an online database one month ago

I’ve just released Datasette Publish, a web tool for turning one or more CSV files into an online database with a JSON API.

Here’s a demo application I built using Datasette Publish, showing Californian campaign finance data using CSV files released by the California Civic Data Coalition.

And here’s an animated screencast showing exactly how I built it:

Animated demo of Datasette Publish

Datasette Publish combines my Datasette tool for publishing SQLite databases as an API with my csvs-to-sqlite tool for generating them.

It’s built on top of the Zeit Now hosting service, which means anything you deploy with it lives on your own account with Zeit and stays entirely under your control. I used the brand new Zeit API 2.0.

Zeit’s generous free plan means you can try the tool out as many times as you like—and if you want to use it for an API powering a production website you can easily upgrade to a paid hosting plan.

Who should use it

Anyone who has data they want to share with the world!

The fundamental idea behind Datasette is that publishing structured data as both a web interface and a JSON API should be as quick and easy as possible.

The world is full of interesting data that often ends up trapped in PDF blobs or other hard-to-use formats, if it gets published at all. Datasette encourages using SQLite instead: a powerful, flexible format that enables analysis via SQL queries and can easily be shared and hosted online.

Since so much of the data that IS published today uses CSV, this first release of Datasette Publish focuses on CSV conversion above anything else. I plan to add support for other useful formats in the future.

The three areas where I’m most excited to see Datasette adopted are data journalism, civic open data and cultural institutions.

Data journalism because Datasette is the tool I wish I had had for publishing data when I worked at the Guardian. When we started the Guardian Datablog we ended up using Google Sheets for this.

Civic open data because it turns out the open data movement mostly won! It’s incredible how much high quality data is published by local and national governments these days. My San Francisco tree search project for example uses data from the Department of Public Works—a CSV of 190,000 trees around the city.

Cultural institutions because the museums and libraries of the world are sitting on enormous treasure troves of valuable information, and have an institutional mandate to share that data as widely as possible.

If you are involved in any of the above please get in touch. I’d love your help improving the Datasette ecosystem to better serve your needs.

How it works

Datasette Publish would not be possible without Zeit Now. Now is a revolutionary approach to hosting: it lets you instantly create immutable deployments with a unique URL, via a command-line tool or using their recently updated API. It’s by far the most productive hosting environment I’ve ever worked with.

I built the main Datasette Publish interface using React. Building a SPA here made a lot of sense, because it allowed me to construct the entire application without any form of server-side storage (aside from Keen for analytics).

When you sign in via Zeit OAuth I store your access token in a signed cookie. Each time you upload a CSV the file is stored directly using Zeit’s upload API, and the file metadata is persisted in JavaScript state in the React app. When you click “publish” the accumulated state is sent to the server where it is used to construct a new Zeit deployment.
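
The post does not include the server-side code, but the signed-cookie idea can be sketched with the itsdangerous library (my assumption; Datasette Publish may do this differently). The token is signed with a server-side secret, so it can round-trip through the browser without any server-side storage:

from itsdangerous import URLSafeSerializer

# Hypothetical sketch: the secret key and cookie contents are assumptions.
serializer = URLSafeSerializer('server-side-secret')

# After the Zeit OAuth callback, sign the access token into a cookie value:
cookie_value = serializer.dumps({'zeit_access_token': 'token-from-oauth'})

# On subsequent requests, verify the signature and recover the token:
data = serializer.loads(cookie_value)
print(data['zeit_access_token'])

Because the value is signed rather than stored, the server can trust it has not been tampered with while keeping no state of its own.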

The deployment itself consists of the CSV files plus a Dockerfile that installs Python, Datasette, csvs-to-sqlite and their dependencies, then runs csvs-to-sqlite against the CSV files and starts up Datasette against the resulting database.

If you specified a title, description, source or license I generate a Datasette metadata.json file and include that in the deployment as well.
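
Generating that file is simple enough that a minimal sketch fits here (the field values are hypothetical; the keys match the metadata.json options shown in the time zone post further down this page):

import json

# Hypothetical values: in Datasette Publish these come from the web form.
metadata = {
    'title': 'California campaign finance data',
    'description': 'CSV files released by the California Civic Data Coalition',
    'source': 'California Civic Data Coalition',
    'source_url': 'https://www.californiacivicdata.org/',
    'license': 'Example license',
    'license_url': 'https://example.com/license',
}

with open('metadata.json', 'w') as fp:
    json.dump(metadata, fp, indent=4)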

Since free deployments to Zeit are “source code visible”, you can see exactly how the resulting application is structured by visiting https://datasette-onrlszntsq.now.sh/_src (the campaign finance app I built earlier).

Using the Zeit API in this way has the neat effect that I don’t ever store any user data myself—neither the access token used to access your account nor any of the CSVs that you upload. Uploaded files go straight to your own Zeit account and stay under your control. Access tokens are never persisted. The deployed application lives on your own hosting account, where you can terminate it or upgrade it to a paid plan without any further involvement from the tool I have built.

Not having to worry about storing encrypted access tokens or covering any hosting costs beyond the Datasette Publish tool itself is delightful.

This ability to build tools that themselves deploy other tools is fascinating. I can’t wait to see what other kinds of interesting new applications it enables.

Discussion on Hacker News.

Building a location to time zone API with SpatiaLite, OpenStreetMap and Datasette two months ago

Given a latitude and longitude, how can we tell what time zone that point lies within? Here’s how I built a simple JSON API to answer that question, using a combination of data from OpenStreetMap, the SpatiaLite extension for SQLite and my Datasette API tool.

The API

You can try the API out here: feed it a latitude and longitude and it will return the corresponding time zone ID: https://timezones-api.now.sh/timezones/by_point

Some examples:

The data

I was first introduced to Eric Muller’s tz_world shapefile by Nick Williams at Eventbrite, who used it to build us an internal time zone lookup API on top of MySQL’s geospatial data types. Eric’s project is no longer updated and he recommends Evan Siroky’s timezone-boundary-builder project as an alternative, which derives time zone shapes from OpenStreetMap and makes the resulting data available under the Open Database License. The shapefile itself can be downloaded from the GitHub releases page for the project.

Loading the data into SpatiaLite

SpatiaLite is a powerful open source extension for SQLite, which adds a comprehensive suite of geospatial functions—including the ability to ingest shapefiles, convert them into geometries and run point within queries against the resulting shapes.

The easiest way to get it running on OS X is via Homebrew:

$ brew install spatialite-tools
$ brew install gdal

Having installed SpatiaLite, we can ingest the shapefile using .loadshp combined_shapefile timezones CP1252 23032—here’s the full process, from downloading the shapefile to ingesting it into a new SQLite database file called timezones.db:

$ wget https://github.com/evansiroky/timezone-boundary-builder/releases/download/2017c/timezones.shapefile.zip
$ unzip timezones.shapefile.zip
$ cd dist
$ spatialite timezones.db
SpatiaLite version ..: 4.3.0a   Supported Extensions:
...
Enter SQL statements terminated with a ";"
spatialite> .loadshp combined_shapefile timezones CP1252 23032
========
Loading shapefile at 'combined_shapefile' into SQLite table 'timezones'

BEGIN;
CREATE TABLE "timezones" (
"PK_UID" INTEGER PRIMARY KEY AUTOINCREMENT,
"tzid" TEXT);
SELECT AddGeometryColumn('timezones', 'Geometry', 23032, 'MULTIPOLYGON', 'XY');
COMMIT;

Inserted 414 rows into 'timezones' from SHAPEFILE
========
spatialite> 

Let’s try it out with a query. Here’s the SQL needed to find the time zone for a point in Tokyo:

select tzid
from
    timezones
where
    within(GeomFromText('POINT(139.7819661 35.6631424)'), timezones.Geometry);

Let’s run that in SpatiaLite and see what we get:

spatialite> select tzid
   ...> from
   ...>     timezones
   ...> where
   ...>     within(GeomFromText('POINT(139.7819661 35.6631424)'), timezones.Geometry);
Asia/Tokyo

Looks good so far! How long is it taking though? We can find out by running .timer on in the spatialite prompt:

spatialite> .timer on
spatialite> select tzid
   ...> from
   ...>     timezones
   ...> where
   ...>     within(GeomFromText('POINT(139.7819661 35.6631424)'), timezones.Geometry);
Asia/Tokyo
CPU Time: user 0.108479 sys 0.064778

That’s a tenth of a second, or 100ms. Fast, but not brilliant.

Speeding it up with a geospatial index

It turns out SpatiaLite includes support for spatial indices, based on SQLite’s R*Tree module. R-Tree indexes can massively accelerate bounding box searches. Our searches are a lot more complex than that, acting as they do against extremely complex polygon shapes—but we can use a bounding box search to dramatically reduce the candidates we need to consider. Let’s create an index against our Geometry column:

SELECT CreateSpatialIndex('timezones', 'geometry');

To take advantage of this index, we need to expand our original SQL to first filter by geometries where their bounding box contains the point we are searching for. SpatiaLite has created an index table called idx_timezones_Geometry against which we can run an R-Tree optimized query. Here’s the SQL we will use:

select tzid
from
    timezones
where
    within(GeomFromText('POINT(139.7819661 35.6631424)'), timezones.Geometry)
    and rowid in (
        SELECT pkid FROM idx_timezones_Geometry
        where xmin < 139.7819661
        and xmax > 139.7819661
        and ymin < 35.6631424
        and ymax > 35.6631424
    );

How does this affect our performance?

spatialite> select tzid
   ...> from
   ...>     timezones
   ...> where
   ...>     within(GeomFromText('POINT(139.7819661 35.6631424)'), timezones.Geometry)
   ...>     and rowid in (
   ...>         SELECT pkid FROM idx_timezones_Geometry
   ...>         where xmin < 139.7819661
   ...>         and xmax > 139.7819661
   ...>         and ymin < 35.6631424
   ...>         and ymax > 35.6631424
   ...>     );
Asia/Tokyo
CPU Time: user 0.001541 sys 0.000111

From 100ms down to 1.5ms—nearly a 70x speedup! Not bad at all.

Building and publishing the API with Datasette

Now that we have a fast SQL query for finding a time zone for a latitude and longitude we can use Datasette to turn it into a JSON API.

The simplest way to do that looks like this:

datasette timezones.db \
    --load-extension=/usr/local/lib/mod_spatialite.dylib

This will start Datasette on port 8001 and load the SpatiaLite extension. You can then navigate to localhost:8001/timezones in your browser and paste in the SQL query… which should look something like this:

https://timezones-api.now.sh/timezones?sql=select+tzid%0D%0Afrom%0D%0Atimezones%0D%0Awhere%0D%0Awithin(GeomFromText('POINT(139.7819661+35.6631424)')%2C+timezones.Geometry)%0D%0Aand+rowid+in+(%0D%0A++++SELECT+pkid+FROM+idx_timezones_Geometry%0D%0Awhere+xmin+<+139.7819661%0D%0Aand+xmax+>+139.7819661%0D%0Aand+ymin+<+35.6631424%0D%0Aand+ymax+>+35.6631424%0D%0A++++)%3B

This works (click the JSON link to get the result as JSON) but it’s a little inconvenient to use: you have to construct a URL with the same latitude and longitude repeated in multiple places.

We can improve things using Datasette’s support for SQLite named parameters. Here’s that same SQL query using parameters instead of hard-coded latitude and longitude points:

select tzid
from
    timezones
where
    within(GeomFromText('POINT(' || :longitude || ' ' || :latitude || ')'), timezones.Geometry)
    and rowid in (
        SELECT pkid FROM idx_timezones_Geometry
        where xmin < :longitude
        and xmax > :longitude
        and ymin < :latitude
        and ymax > :latitude)

If you paste this into Datasette it will detect the named parameters and turn them into URL querystring parameters hooked up (in the UI) to HTML form fields.

Datasette time zone query showing longitude and latitude form fields

To save us from having to include the full SQL in the URL every time we call our new API, let’s take advantage of a new feature introduced in Datasette 0.14: canned queries. These are named, pre-packaged queries that can be defined in a metadata.json file. The file looks like this:

{
    "title": "OpenStreetMap Time Zone Boundaries",
    "license": "ODbL",
    "license_url": "http://opendatacommons.org/licenses/odbl/",
    "source": "timezone-boundary-builder",
    "source_url": "https://github.com/evansiroky/timezone-boundary-builder",
    "databases": {
        "timezones": {
            "queries": {
                "by_point": "select tzid\nfrom\n    timezones\nwhere\n    within(GeomFromText('POINT(' || :longitude || ' ' || :latitude || ')'), timezones.Geometry)\n    and rowid in (\n        SELECT pkid FROM idx_timezones_Geometry\n        where xmin < :longitude\n        and xmax > :longitude\n        and ymin < :latitude\n        and ymax > :latitude\n    )"
            }
        }
    }
}

The canned query is defined as the by_point key in the queries nested object. I’m also adding license and source information here for the project, because it’s good manners.

We can try this in Datasette on our local machine like so:

datasette timezones.db -m metadata.json \
    --load-extension=/usr/local/lib/mod_spatialite.dylib 

Now visiting localhost:8001/timezones/by_point will provide the interface for the query—and adding .json to the URL will turn it into an API.
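
Here is a hedged example of calling that API from Python with requests; the rows-of-lists JSON shape is the same one the JavaScript later in this post relies on:

import requests

# longitude and latitude map to the :longitude and :latitude named
# parameters defined in the canned query above.
response = requests.get(
    'https://timezones-api.now.sh/timezones/by_point.json',
    params={'longitude': 139.7819661, 'latitude': 35.6631424},
)
data = response.json()
print(data['rows'][0][0])  # 'Asia/Tokyo'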

Visualizing time zones with Leaflet and GeoJSON

If you browse around in Datasette you’ll quickly run into a rather unfortunate problem. The localhost:8001/timezones/timezones page, which shows the first 100 rows in the table, takes a shockingly long time to load. When it eventually does load you’ll see why: each record includes an enormous binary string containing its geometry. On my machine just that one page weighs in at 62MB of HTML!

This is bad: rendering that much HTML in one go can block the event loop and cause the application to become unresponsive. That giant blob of binary data isn’t exactly useful for humans, either.

We can make some major improvements here using another Datasette 0.14 feature: custom templates. Let’s start with a replacement template that shows just the length of the binary string instead of attempting to render it.

We’ll do that by over-riding the _rows_and_columns.html include template, which is used by Datasette on both the table page and the page used to display individual rows. Since we only want to over-ride this template for one particular table we’ll create a file called _rows_and_columns-timezones-timezones.html—the file name means this will only over-ride the template for the timezones table in our timezones database. Here’s our new template:

<table>
    <thead>
        <tr>
            {% for column in display_columns %}
                <th scope="col">{{ column }}</th>
            {% endfor %}
        </tr>
    </thead>
    <tbody>
    {% for row in display_rows %}
        <tr>
            {% for cell in row %}
                <td>
                    {% if cell.column == 'Geometry' %}
                        {{ cell.value|length }} bytes
                    {% else %}
                        {{ cell.value }}
                    {% endif %}
                </td>
            {% endfor %}
        </tr>
    {% endfor %}
    </tbody>
</table>

If we put that in a directory called templates/ we can tell Datasette to use it like this:

datasette timezones.db -m metadata.json \
    --load-extension=/usr/local/lib/mod_spatialite.dylib \
    --template-dir=templates/

Our localhost:8001/timezones/timezones page now looks like this:

Datasette time zone index showing lengths

But wouldn’t it be neat if we could see the actual shapes of these time zones? It turns out that’s actually pretty easy, using the combination of GeoJSON and the Leaflet mapping library.

GeoJSON is a neat, widely supported standard for encoding geographic information such as polygon shapes as JSON. SpatiaLite ships with built-in GeoJSON support in the form of the AsGeoJSON SQL function. We can use that function to turn any of our time zone geometries into a GeoJSON string:

select AsGeoJSON(Geometry) from timezones where tzid = 'Asia/Tokyo';

If you run that with Datasette you’ll get back a string of GeoJSON. You can paste that into geojson.io to instantly visualize it.
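
If you would rather fetch that GeoJSON programmatically, something like this should work against the published API (a sketch, using the same arbitrary-SQL endpoint the custom template below calls):

import json
import requests

sql = "select AsGeoJSON(Geometry) from timezones where tzid = 'Asia/Tokyo'"
response = requests.get(
    'https://timezones-api.now.sh/timezones.json',
    params={'sql': sql},
)
geojson = json.loads(response.json()['rows'][0][0])

# Save it to a file that can be dropped straight onto geojson.io:
with open('asia-tokyo.geojson', 'w') as fp:
    json.dump(geojson, fp)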

The Leaflet mapping library supports GeoJSON out of the box. We can construct a custom row.html template for our Datasette that loads Leaflet from unpkg.com, uses fetch() to execute the AsGeoJSON query and renders the result in a map on the page. Here’s the result:

Custom Datasette page rendering a GeoJSON map

And here’s the row-timezones-timezones.html template:

{% extends "row.html" %}
{% block extra_head %}
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.2.0/dist/leaflet.css" integrity="sha512-M2wvCLH6DSRazYeZRIm1JnYyh22purTM+FDB5CsyxtQJYeKq83arPe5wgbNmcFXGqiSH2XR8dT/fJISVA1r/zQ==" crossorigin=""/>
<script src="https://unpkg.com/leaflet@1.2.0/dist/leaflet.js" integrity="sha512-lInM/apFSqyy1o6s89K4iQUKg6ppXEgsVxT35HbzUupEVRh2Eu9Wdl4tHj7dZO0s1uvplcYGmt3498TtHq+log==" crossorigin=""></script>
<style>
#map {
  margin-top: 1em;
  width: 100%;
  height: 400px;
}
</style>
{% endblock %}

{% block content %}
{{ super() }}
<div id="map"></div>
<script>
var pk = location.pathname.split('/').slice(-1)[0];
var tiles = L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
    maxZoom: 19,
    detectRetina: true,
    attribution: '&copy; <a href="https://www.openstreetmap.org/copyright">OpenStreetMap</a> contributors, Points &copy; 2012 LINZ'
});
var sql = 'select AsGeoJSON(Geometry) from timezones where PK_UID = ' + pk;
fetch('/timezones.json?sql=' + encodeURIComponent(sql)).then(r => r.json()).then(d => {
  var map = L.map('map', {layers: [tiles]});
  var geoJSON = JSON.parse(d.rows[0][0]);
  var layer = L.geoJSON(geoJSON)
  layer.addTo(map);
  map.fitBounds(layer.getBounds());
});
</script>
{% endblock %}

Publishing it to the internet

Normally we would use the datasette publish command to publish our database to Heroku or Zeit Now, but the SpatiaLite dependency means that won’t work for this case. Instead, we need to construct a custom Dockerfile that builds the SpatiaLite module.

Since we’re using Docker, we may as well have the Dockerfile download the shapefiles and build the SpatiaLite database for us all in one go. Here’s a Dockerfile that does exactly that:

FROM python:3.6-slim-stretch

RUN apt update
RUN apt install -y python3-dev gcc spatialite-bin libsqlite3-mod-spatialite wget unzip

RUN pip install https://github.com/simonw/datasette/archive/master.zip

# Download and import the shapefiles
RUN wget https://github.com/evansiroky/timezone-boundary-builder/releases/download/2017c/timezones.shapefile.zip \
    && unzip timezones.shapefile.zip && \
    cd dist && \
    echo ".loadshp combined_shapefile timezones CP1252 23032\nSELECT CreateSpatialIndex('timezones', 'geometry');" | spatialite timezones.db && \
    mv timezones.db .. && \
    cd .. && rm -rf dist && rm timezones.shapefile.zip

ENV SQLITE_EXTENSIONS /usr/lib/x86_64-linux-gnu/mod_spatialite.so

ADD metadata.json metadata.json

ADD templates templates

RUN datasette inspect timezones.db --inspect-file inspect-data.json

EXPOSE 8001

CMD ["datasette", "serve", "timezones.db", "--host", "0.0.0.0", "--cors", "--port", "8001", "--inspect-file", "inspect-data.json", "-m", "metadata.json", "--template-dir", "templates"]

The full code, including the supporting templates, can be found in simonw/timezones-api on GitHub.

If you have Docker installed (Docker for Mac is a one-click install package these days, it’s impressively painless) you can build the container like this:

docker build . -t timezones-api
# Wait for the image to build
docker run -p 8001:8001 timezones-api

Then you can visit http://localhost:8001/ to see your freshly built Datasette in your browser.

The easiest way to publish it online is using Zeit Now. Simply run the now command in the directory containing the Dockerfile and Zeit will upload the entire directory, build the container in the cloud and deploy it with a fresh URL. It’s by far the easiest Docker deployment environment I’ve ever used.

Now can even deploy directly from a public GitHub repository… so you can deploy your own copy of the API by running the following command:

$ now simonw/timezones-api
> Didn't find directory. Searching on GitHub...
> Deploying GitHub repository "simonw/timezones-api" under simonw
> Ready! https://simonw-timezones-api-fbihjcbnog.now.sh (copied to clipboard) [2s]
> Initializing…

Canned queries + SpatiaLite = instant geospatial APIs

Hopefully this has helped illustrate the ease with which Datasette, SpatiaLite and canned queries can be used to create and publish geospatial APIs. Thanks to projects like OpenStreetMap the world is full of high quality open geospatial data. Go build something cool with it!

Elsewhere

Today

  • s3monkey: A Python library that allows you to interact with Amazon S3 Buckets as if they are your local filesystem. (via) A particularly devious hack by Kenneth Reitz—provides a context manager within which various Python filesystem APIs such as open() and os.listdir() are monkeypatched to operate against an S3 bucket instead. Kenneth built it to make it easier to work with files from apps running on Heroku. Under the hood it uses pyfakefs, a filesystem mocking library originally released by Google. #
  • A Promenade of PyTorch. Useful overview of the PyTorch machine learning library from Facebook AI Research described as “a Python library enabling GPU-accelerated tensor computation”. Similar to TensorFlow, but where TensorFlow requires you to explicitly construct an execution graph, PyTorch instead lets you write regular Python code (if statements, for loops etc) which PyTorch then uses to construct the execution graph for you. A tiny sketch of this define-by-run style follows after this list. #
  • Andrew Godwin's www-router Docker container (via) Really clever Docker trick: a container that runs Nginx and uses it to route traffic to other containers based on the hostname—but the hostnames to be routed are configured using environment variables which the run-nginx.py CMD script uses to dynamically construct an nginx config when the container starts. #
  • Photos from our tour of the amazing bone collection of Ray Bandar. Ray Bandar (1927-2017) was an artist, scientist, naturalist and an incredibly prolific collector of bones. His collection is in the process of moving to the California Academy of Sciences but Natalie managed to land us a private tour led by his great nephew. The collection is truly awe-inspiring, and a testament to an extraordinary life lived following a very particular passion. #
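
As mentioned in the PyTorch item above, here is a tiny hedged sketch of that define-by-run style (not taken from the linked article): ordinary Python control flow decides the shape of the graph as the code runs.

import torch

# A plain Python loop grows the computation graph dynamically as it executes;
# there is no separate graph-construction step.
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.norm() < 10:
    y = y * 2
y.sum().backward()
print(x.grad)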

Yesterday

  • Moving a large and old codebase to Python3 (via) Really interesting case study full of good ideas. The codebase in this case was 240,000 lines of Python and Django written over the course of 15 years. The team used Python-Modernize to aid their transition to a six-compatible codebase first. #
  • Python & Async Simplified. Andrew Godwin: “Python’s async framework is actually relatively simple when you treat it at face value, but a lot of tutorials and documentation discuss it in minute implementation detail, so I wanted to make a higher-level overview that deliberately ignores some of the small facts and focuses on the practicalities of writing projects that mix both kinds of code.” ‪This is really useful: clearly explains the two separate worlds of Python (sync and async functions) and describes Andrew’s clever sync_to_async and async_to_sync decorators as well.‬ #

7th February 2018

  • The whole story is basically that Facebook gets so much traffic that they started convincing publishers to post things on Facebook. For a long time, that was fine. People posted things on Facebook, then you would click those links and go to their websites. But then, gradually, Facebook started exerting more and more control of what was being seen, to the point that they, not our website, essentially became the main publishers of everyone’s content. Today, there’s no reason to go to a comedy website that has a video if that video is just right on Facebook. And that would be fine if Facebook compensated those companies for the ad revenue that was generated from those videos, but because Facebook does not pay publishers, there quickly became no money in making high-quality content for the internet.

    Matt Klinman #

  • Googlebot's Javascript random() function is deterministic. random() as executed by Googlebot returns the same predictable sequence. More interestingly, Googlebot runs a much faster timer for setTimeout and setInterval—as Tom Anthony points out, “Why actually wait 5 seconds when you are a bot?” #

4th February 2018

  • owlsnearme source code on GitHub. Here’s the source code for our new owlsnearme.com project. It’s a single-page React application that pulls all of its data from the iNaturalist API. We built it this weekend with the SuperbOwl kick-off as a hard deadline so it’s not the most beautiful React code, but it’s a nice demonstration of how React (and create-react-app in particular) can be used for rapid development. #
  • Owls Near Me. Back in 2010 Natalie and I shipped owlsnearyou.com—a website for finding your nearest owls, using data from the sadly deceased WildlifeNearYou (RIP). To celebrate #SuperbOwl Sunday we rebuilt the same concept on top of the excellent iNaturalist API. Search for a place to see which owls have been spotted there, or click the magic button to geolocate your device and see which owls have been spotted in your nearby area! #

3rd February 2018

  • Conditional aggregation in Django 2.0 (via) I hadn’t realised how clever this new Django ORM feature by Tom Forbes is. It lets you build an aggregation against a subset of rows, e.g. Client.objects.aggregate(regular=Count('pk', filter=Q(account_type=Client.REGULAR)))—then if you are using PostgreSQL it translates it into a fast FILTER WHERE clause, while other databases emulate the same behaviour using a CASE statement. #
  • Imagine a Simon Says style game where I present an article found on the web on a projector. Students research for two to three minutes, then respond by standing or staying seated to signal if they believe the article is true or fake. My students absolutely loved the game. Some refused to go to recess until I gave them another chance to figure out the next article I had queued.

    Scott Bedley #

  • How I made a Who’s On First subset database. Inspired by Paul Ford on Twitter, I tried out a new trick with SQLite: connect to a database containing JSON, attach a brand new empty database file using “attach database”, then populate it using INSERT INTO ... SELECT plus the json_extract() function to extract out a subset of the JSON properties into a new table in the new database. (A short Python sketch of the same trick follows after this list.) #
  • Just switched to {window.localStorage.getItem('debug') && <pre>{JSON.stringify(this.state, null, 2)}</pre>}—now I can ship to production and turn on debugging in my console with localStorage.setItem('debug', 1)

    @simonw #
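
As promised in the Who’s On First item above, here is a hedged Python sketch of that trick. The source table name (places), JSON column (body) and extracted keys are illustrative, and it assumes a SQLite build with the JSON1 extension:

import sqlite3

# Connect to the existing database that stores raw JSON blobs.
conn = sqlite3.connect('whosonfirst.db')

# Attach a brand new, empty database file alongside it.
conn.execute("ATTACH DATABASE 'subset.db' AS subset")

# Create the destination table, then copy a subset of the JSON properties
# across using INSERT INTO ... SELECT plus json_extract().
conn.execute("CREATE TABLE subset.places (id INTEGER, name TEXT, placetype TEXT)")
conn.execute("""
    INSERT INTO subset.places (id, name, placetype)
    SELECT
        json_extract(body, '$.id'),
        json_extract(body, '$.name'),
        json_extract(body, '$.placetype')
    FROM places
""")
conn.commit()
conn.close()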

2nd February 2018

  • asgiref: AsyncToSync and SyncToAsync (via) Andrew’s classes in asgiref that can turn a synchronous callable into an awaitable (that runs in a thread pool) or an awaitable callable into a synchronous callable, using Python 3 futures and asyncio. #
  • Channels 2.0. Andrew just shipped Channels 2.0—a major rewrite and redesign of the Channels project he started back in 2014. Channels brings async to Django, providing a logical, standardized way of supporting things like WebSockets and asynchronous execution on top of a Django application. Previously it required you to run a separate Twisted server and redis/RabbitMQ queue, but thanks to Python 3 async everything can now be deployed as a single process. And the new ASGI spec means it’s turtles all the way down! Everything from URL routing to view functions to middleware can be composed together using the same ASGI interface. #
  • Family fun with deepfakes. Or how I got my wife onto the Tonight Show. deepfakes is dystopian nightmare technology: take a few thousand photos of two different people with similar shaped faces and you can produce an extremely realistic video where you swap one person’s face for the other. Unsurprisingly it’s being used for porn. This is a pleasantly SFW explanation of how it works, complete with a demo where Sven Charleer swaps his wife Elke for Anne Hathaway on the Tonight Show. #
  • Using setup.py in Your (Django) Project. Includes this neat trick: if you list manage.py in the setup(scripts=) argument you can call it from e.g. cron using the full path to manage.py within your virtual environment and it will execute in the correct context without needing to explicitly activate the environment first. #
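
A minimal sketch of the setup.py trick described in the last item (the project name and layout are hypothetical):

from setuptools import setup, find_packages

# Listing manage.py in scripts= installs a copy into the virtualenv's bin/
# directory, so cron can call /path/to/venv/bin/manage.py directly without
# activating the environment first.
setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    scripts=['manage.py'],
)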

1st February 2018

  • How the Citus distributed database rebalances your data. Citus is a fascinating implementation of database sharding built on top of PostgreSQL primitives. PostgreSQL 10 introduced extremely flexible logical replication—in this post Craig Kerstiens explains how Citus use this new ability to re-balance shards (e.g. when you move from two to four physical PostgreSQL nodes) without downtime. #
  • Building a Full-Text Search App Using Docker and Elasticsearch. Deep, comprehensive tutorial from Patrick Triest showing how to use docker-compose to run three containers (Node API, nginx static content, elasticsearch) and then use that to build a neat Vue.js web search UI against 100 books from Project Gutenberg. #
  • What we need to do is come up with a way to help people understand that there are ways to never be lost again, and to listen to any music you want, and to video chat with someone on the other side of the world, without them having to feel disquieted about it. That it’s not OK that you’re made to feel weirded out. That it’s possible for there to be alternatives. That having to feel someone rooting around in your life is not a price you should have to pay.

    Stuart Langridge #

  • Observable notebook: San Francisco trees from Datasette. I used an Observable notebook to rebuild my San Francisco tree search demo against a Datasette API of a CSV of trees published by the SF Department of Public Works. The map updates live as you type a query, and every cell can be toggled to view the underlying source code. #

31st January 2018

  • USGS World Earthquake Map (observable notebook). Here’s an extended version of the notebook constructed by Jeremy Ashkenas in that Observable YouTube demo. You really need to check this thing out—the notebook itself has sliders in that you can manipulate (even on a mobile browser) or you can click to edit the code and see your changes reflected in real-time. If you sign in with GitHub you can fork the project to your own account and save your changes. #
  • Observable: An Earthquake Globe in Ten Minutes. Well worth your time. Jeremy Ashkenas uses Observable to live-code an interactive visualization of recent earthquakes around the world, using USGS data (fetched as JSON), d3, topoJSON and an Observable notebook. I’m sold—this is truly ground-breaking new technology. #
  • Observable Beta (via) Observable just released their beta, and it’s quite something. It’s by Mike Bostock (d3), Jeremy Ashkenas (Backbone, CoffeeScript) and Tom MacWright (Mapbox Studio). The easiest way to describe it is Jupyter notebooks for JavaScript supporting reactive programming—so code is evaluated as you type and you can add interactive widgets (like sliders and canvas views) to construct explorable visualizations on the fly. #

29th January 2018

  • [On SQLite] The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.”

    Paul Ford #

  • SQLite: The Spellfix1 Virtual Table (via) A SQLite extension that lets you create a spellfix1 virtual table which can power “fuzzy” search, by suggesting corrections for misspelled words. I haven’t tried this yet but it looks pretty powerful, including a configurable edit distance and the ability to set up custom “soundslike” terms for words with known unusual spellings. #

28th January 2018

  • 6M observations total! Where has iNaturalist grown in 80 days with 1 million new observations? Citizen science app iNaturalist is seeing explosive growth at the moment—they’ve been around for nearly a decade but 1/6 of the observations posted to the site were added in just the past few months. Having tried the latest version of their iPhone app it’s easy to see why: snap a photo of some nature and upload it to the app and it will use surprisingly effective machine learning to suggest the genus or even the individual species. Submit the observation and within a few minutes other iNaturalist community members will confirm the identification or suggest a correction. It’s brilliantly well executed and an utter delight to use. #
  • If I tweeted a throwaway comment in appreciation for McDonald’s apple pies and some other randos on Twitter happened to also tweet similar thoughts over the last few months, it doesn’t mean by extrapolation that ‘Millennials Can’t Get Enough Of McDonald’s Apple Pies’.  The Twitter search box is not a polling agency and Twitter doesn’t include everybody’s thoughts on everything. Just some people’s thoughts on some things.

    Nick Walker #