Simon Willison’s Weblog

On django, projects, webassembly, opensource, webperformance, ...

 

Recent entries

Single sign-on against GitHub using ASGI middleware one month ago

I released Datasette 0.29 last weekend, the first version of Datasette to be built on top of ASGI (discussed previously in Porting Datasette to ASGI, and Turtles all the way down).

This also marked the introduction of the new asgi_wrapper plugin hook, which allows plugins to wrap the entire Datasette application in their own piece of ASGI middleware.
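
To give a sense of the shape of that hook, here's a minimal sketch of an asgi_wrapper plugin. The middleware class is a stand-in for whatever ASGI middleware you want to apply:

from datasette import hookimpl


class LoggingMiddleware:
    # Stand-in ASGI middleware: wraps an app and passes every connection through.
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        print("Handling {}".format(scope.get("path")))
        await self.app(scope, receive, send)


@hookimpl
def asgi_wrapper(datasette):
    def wrap(app):
        return LoggingMiddleware(app)
    return wrap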

To celebrate this new capability, I also released two new plugins: datasette-cors, which provides fine-grained control over CORS headers (using my asgi-cors library from a few months ago) and datasette-auth-github, the first of hopefully many authentication plugins for Datasette.

datasette-auth-github

The new plugin is best illustrated with a demo.

Visit https://datasette-auth-demo.now.sh/ and you will be redirected to GitHub and asked to approve access to your account (just your e-mail address, not repository access).

Agree, and you’ll be redirected back to the demo with a new element in the Datasette header: your GitHub username, plus a “log out” link in the navigation bar at the top of the screen.

Controlling who can access

The default behaviour of the plugin is to allow in anyone with a GitHub account. Since the primary use-case for the plugin (at least for the moment) is restricting access to view data to a trusted subset of people, the plugin lets you configure who is allowed to view your data in three different ways: by specific GitHub usernames (allow_users), by membership of a GitHub organization (allow_orgs) or by membership of a specific GitHub team (allow_teams).

Datasette inherits quite a sophisticated user management system from GitHub, with very little effort required from the plugin. The user_is_allowed() method implements all three of the above options against the GitHub API in just 40 lines of code.
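
The details live in the plugin's source, but the core idea is a straightforward API check. Here's an illustrative version (not the plugin's actual code) of an organization check, using GitHub's standard endpoint for listing the organizations visible to an access token:

import json
import urllib.request


def user_in_allowed_orgs(access_token, allow_orgs):
    # Illustrative sketch: ask the GitHub API which organizations the token can
    # see, then check for any overlap with the configured allow_orgs list.
    # (Checking organizations requires the token to have an appropriate scope.)
    request = urllib.request.Request(
        "https://api.github.com/user/orgs",
        headers={"Authorization": "token {}".format(access_token)},
    )
    with urllib.request.urlopen(request) as response:
        orgs = json.loads(response.read().decode("utf-8"))
    return any(org["login"] in allow_orgs for org in orgs)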

These options can be set using the "plugins" section of the Datasette metadata.json configuration file. Here’s an example:

{
  "plugins": {
    "datasette-auth-github": {
      "client_id": {"$env": "GITHUB_CLIENT_ID"},
      "client_secret": {"$env": "GITHUB_CLIENT_SECRET"},
      "allow_users": ["simonw"]
    }
  }
}

This also illustrates a new Datasette feature: the ability to set secret plugin configuration values. {"$env": "GITHUB_CLIENT_SECRET"} means "read this configuration option from the environment variable GITHUB_CLIENT_SECRET".
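
The mechanics are simple to picture: walk the plugin configuration and swap any {"$env": ...} dictionary for the corresponding environment variable. A rough sketch of the idea (not Datasette's actual implementation):

import os


def resolve_env_secrets(config):
    # Rough sketch: replace {"$env": "NAME"} values with os.environ["NAME"],
    # recursing into nested dictionaries and lists.
    if isinstance(config, dict):
        if list(config.keys()) == ["$env"]:
            return os.environ.get(config["$env"], "")
        return {key: resolve_env_secrets(value) for key, value in config.items()}
    if isinstance(config, list):
        return [resolve_env_secrets(item) for item in config]
    return config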

Automatic log in

Like many OAuth providers, GitHub only asks the user for their approval the first time they log into a given app. Any subsequent times they are redirected to GitHub it will skip the permission screen and redirect them right back again with a token.

This means we can implement automatic log in: any time a visitor arrives who does not have a cookie we can bounce them directly to GitHub, and if they have already consented they will be logged in instantly.
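
In practice that just means constructing the standard GitHub OAuth authorization URL and issuing a redirect to it. Something like this sketch (the parameter values are illustrative):

from urllib.parse import urlencode


def github_authorize_url(client_id, redirect_uri):
    # Visitors without a valid cookie get bounced here; GitHub skips the consent
    # screen for users who have already approved the application.
    return "https://github.com/login/oauth/authorize?" + urlencode({
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "user:email",
    })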

This is a great user-experience—provided the user is logged into GitHub they will be treated as if they are logged into your application—but it does come with a downside: what if the user clicks the “log out” link?

For the moment I’ve implemented this using another cookie: if the user clicks “log out”, I set an asgi_auth_logout cookie marking the user as having explicitly logged out. While they have that cookie they won’t be logged in automatically, instead having to click an explicit link. See issue 41 for thoughts on how this could be further improved.

One pleasant side-effect of all of this is that datasette-auth-github doesn’t need to persist the user’s GitHub access_token anywhere—it uses it during the initial authentication to check for any required organizations or teams, but then it deliberately forgets the token entirely.

OAuth access tokens are like passwords, so the most responsible thing for a piece of software to do with them is avoid storing them anywhere at all unless they are explicitly needed.

What happens when a user leaves an organization?

When building against a single sign-on provider, consideration needs to be given to offboarding: when a user is removed from a team or organization they should also lose access to their SSO applications.

This is difficult when an application sets its own authentication cookies, like datasette-auth-github does.

One solution would be to make an API call on every request to the application, to verify that the user should still have access. This would slow everything down and is likely to blow through rate limits as well, so we need a more efficient solution.

I ended up solving this with two mechanisms. Since we have automatic log in, our cookies don’t actually need to last very long—so by default the signed cookies set by the plugin last for just one hour. When a user’s cookie has expired they will be redirected back through GitHub—they probably won’t even notice the redirect, and their permissions will be re-verified as part of that flow.

But what if you need to invalidate those cookies instantly?

To cover that case, I’ve incorporated an optional cookie_version configuration option into the signatures on the cookies. If you need to invalidate every signed cookie that is out there—to lock out a compromised GitHub account owner for example—you can do so by changing the cookie_version configuration option and restarting (or re-deploying) Datasette.
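
Under the hood the signing secret is derived rather than stored separately—roughly like the sketch below, so changing cookie_version produces a different key and every existing signature stops validating. The salt and parameters here are illustrative; see the security review notes further down for the real implementation.

import hashlib


def derive_signing_secret(client_id, client_secret, cookie_version="default"):
    # Hypothetical sketch: derive a signing key from values that are already
    # secret, plus cookie_version. Bump cookie_version to invalidate all cookies.
    material = "{}:{}:{}".format(client_id, client_secret, cookie_version)
    return hashlib.pbkdf2_hmac(
        "sha256", material.encode("utf-8"), b"cookie-signing", 100000
    )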

These options are all described in detail in the project README.

Integration with datasette publish

The datasette publish command-line tool lets users instantly publish a SQLite database to the internet, using Heroku, Cloud Run or Zeit Now v1. I’ve added support for setting secret plugin configuration directly to that tool, which means you can publish an authentication-protected SQLite database to the internet with a shell one-liner, using --install=datasette-auth-github to install the plugin and --plugin-secret to configure it:

$ datasette publish cloudrun fixtures.db \
   --install=datasette-auth-github \
   --name datasette-auth-protected \
   --service datasette-auth-protected \
   --plugin-secret datasette-auth-github allow_users simonw \
   --plugin-secret datasette-auth-github client_id 85f6224cb2a44bbad3fa \
   --plugin-secret datasette-auth-github client_secret ...

This creates a Cloud Run instance which only allows GitHub user simonw to log in. You could instead use --plugin-secret datasette-auth-github allow_orgs my-org to allow any users from a specific GitHub organization.

Note that Cloud Run does not yet give you full control over the URL that will be assigned to your deployment. In this case it gave me https://datasette-auth-protected-j7hipcg4aq-uc.a.run.app—which works fine, but I needed to update my GitHub OAuth application’s callback URL manually to https://datasette-auth-protected-j7hipcg4aq-uc.a.run.app/-/auth-callback after deploying the application in order to get the authentication flow to work correctly.

Add GitHub authentication to any ASGI application!

datasette-auth-github isn’t just for Datasette: I deliberately wrote the plugin as ASGI middleware first, with only a very thin layer of extra code to turn it into an installable plugin.

This means that if you are building any other kind of ASGI app (or using an ASGI-compatible framework such as Starlette or Sanic) you can wrap your application directly with the middleware and get the same authentication behaviour as when the plugin is added to Datasette!

Here’s what that looks like:

from datasette_auth_github import GitHubAuth
from starlette.applications import Starlette
from starlette.responses import HTMLResponse
import uvicorn

app = Starlette(debug=True)


@app.route("/")
async def homepage(request):
    return HTMLResponse("Hello, {}".format(
        repr(request.scope["auth"])
    ))


authenticated_app = GitHubAuth(
    app,
    client_id="986f5d837b45e32ee6dd",
    client_secret="...",
    require_auth=True,
    allow_users=["simonw"],
)

if __name__ == "__main__":
    uvicorn.run(authenticated_app, host="0.0.0.0", port=8000)

The middleware adds a scope["auth"] key describing the logged in user, which is then passed through to your application. More on this in the README.

Your security reviews needed!

Since datasette-auth-github adds authentication to Datasette, it is an extremely security-sensitive piece of code. So far I’m the only person who has looked at it: before I start widely recommending it to people I’d really like to get some more eyes on it to check for any potential security problems.

I’ve opened issue #44 encouraging security-minded developers to have a dig through the code and see if there’s anything that can be tightened up or any potential vulnerabilities that need to be addressed. Please get involved!

It’s a pretty small codebase, but here are some areas you might want to inspect:

  • At a high level: is the way I’m verifying the user through the GitHub API and then storing their identity in a signed cookie the right way to go?
  • The cookie signing secret is derived from the GitHub OAuth application’s client_id and client_secret (because that secret is already meant to be a secret), combined with the cookie_version option described above—implementation here. Since this is a derived secret I’m using pbkdf2_hmac with 100,000 iterations. This is by far the most cryptographically interesting part of the code, and could definitely do with some second opinions.
  • The code used to sign and verify cookies is based on Django’s (thoroughly reviewed) implementation, but could benefit from a sanity check.
  • I wanted this library to work on Glitch, which currently only provides Python 3.5.2. Python’s asyncio HTTP libraries such as http3 and aiohttp both require more modern Pythons, so I ended up rolling my own very simple async HTTP function which uses urllib.request inside a loop.run_in_executor thread pool. Is that approach sound? Rolling my own HTTP client in this way feels a little hairy. (A simplified sketch of that pattern follows this list.)
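
Here's roughly what that urllib.request-inside-run_in_executor pattern looks like—a simplified sketch rather than the plugin's exact code:

import asyncio
import json
import urllib.request


async def http_get_json(url, headers=None):
    # Simplified sketch: run blocking urllib.request in the default thread pool
    # so it can be awaited from async code, even on Python 3.5.2.
    def _fetch():
        request = urllib.request.Request(url, headers=headers or {})
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))

    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, _fetch)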

This has been a really fun project so far, and I’m very excited about the potential for authenticated Datasette moving forward—not to mention the possibilities unlocked by an ASGI middleware ecosystem with strong support for wrapping any application in an authentication layer.

Porting Datasette to ASGI, and Turtles all the way down two months ago

This evening I finally closed a Datasette issue that I opened more than 13 months ago: #272: Port Datasette to ASGI. A few notes on why this is such an important step for the project.

ASGI is the Asynchronous Server Gateway Interface standard. It’s been evolving steadily over the past few years under the guidance of Andrew Godwin. It’s intended as an asynchronous replacement for the venerable WSGI.

Turtles all the way down

Ten years ago at EuroDjangoCon 2009 in Prague I gave a talk entitled Django Heresies. After discussing some of the design decisions in Django that I didn’t think had aged well, I spent the last part of the talk talking about Turtles all the way down. I wrote that idea up here on my blog (see also these slides).

The key idea was that Django would be more interesting if the core Django contract—a function that takes a request and returns a response—was extended to more places in the framework. The top level site, the reusable applications, middleware and URL routing could all share that same contract. Everything could be composed from the same raw building blocks.

I’m excited about ASGI because it absolutely fits the turtles all the way down model.

The ASGI contract is an asynchronous function that takes three arguments:

async def application(scope, receive, send):
    ...

scope is a serializable dictionary providing the context for the current connection. receive is an awaitable which can be used to receive incoming messages. send is an awaitable that can be used to send replies.

It’s a pretty low-level set of primitives (and less obvious than a simple request/response)—and that’s because ASGI is about more than just the standard HTTP request/response cycle. This contract works for HTTP, WebSockets and potentially any other protocol that needs to asynchronously send and receive data.
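
For the plain HTTP case, a complete ASGI application can be as small as this:

async def application(scope, receive, send):
    # Minimal HTTP-only ASGI app: one response, no routing, no framework.
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [[b"content-type", b"text/plain"]],
    })
    await send({"type": "http.response.body", "body": b"Hello, ASGI"})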

It’s an extremely elegant piece of protocol design, informed by Andrew’s experience with Django Channels, SOA protocols (we are co-workers at Eventbrite where we’ve both been heavily involved in Eventbrite’s SOA mechanism) and Andrew’s extensive conversations with other maintainers in the Python web community.

The ASGI protocol really is turtles all the way down—it’s a simple, well defined contract which can be composed together to implement all kinds of interesting web architectural patterns.

My asgi-cors library was my first attempt at building an ASGI turtle. The implementation is a simple Python decorator which, when applied to another ASGI callable, adds HTTP CORS headers based on the parameters you pass to the decorator. The library has zero installation dependencies (it has test dependencies on pytest and friends) and can be used on any HTTP ASGI project.
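
Usage is roughly this—see the asgi-cors README for the full set of options:

from asgi_cors import asgi_cors


async def my_app(scope, receive, send):
    # A bare ASGI app to wrap.
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [[b"content-type", b"text/plain"]],
    })
    await send({"type": "http.response.body", "body": b"hello"})


# Wrap the app so its responses gain CORS headers.
app = asgi_cors(my_app, allow_all=True)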

Building asgi-cors completely sold me on ASGI as the turtle pattern I had been desiring for over a decade!

Datasette plugins and ASGI

Which brings me to Datasette.

One of the most promising components of Datasette is its plugin mechanism. Based on pluggy (extracted from pytest), Datasette Plugins allow new features to be added to Datasette without needing to change the underlying code. This means new features can be built, packaged and shipped entirely independently of the core project. A list of currently available plugins can be found here.

WordPress is a very solid blogging engine. Add in the plugin ecosystem around it and it can be used to build literally any CMS you can possibly imagine.

My dream for Datasette is to apply the same model: I want a strong core for publishing and exploring data that’s enhanced by plugins to solve a huge array of data analysis, visualization and API-backed problems.

Datasette has a range of plugin hooks already, but I’ve so far held back on implementing the most useful class of hooks: hooks that allow developers to add entirely new URL routes exposing completely custom functionality.

The reason I held back is that I wanted to be confident that the contract I was offering was something I would continue to support moving forward. A plugin system isn’t much good if the core implementation keeps on changing in backwards-incompatible ways.

ASGI is the exact contract I’ve been waiting for. It’s not quite ready yet, but you can follow #520: prepare_asgi plugin hook (thoughts and suggestions welcome!) to be the first to hear about this hook when it lands. I’m planning to use it to make my asgi-cors library available as a plugin, after which I’m excited to start exploring the idea of bringing authentication plugins to Datasette (and to the wider ASGI world in general).

I’m hoping that many Datasette ASGI plugins will exist in a form that allows them to be used by other ASGI applications as well.

I also plan to use ASGI to make components of Datasette itself available to other ASGI applications. If you just want a single instance of Datasette’s table view to be embedded somewhere in your URL configuration you should be able to do that by routing traffic directly to the ASGI-compatible view class.

I’m really excited about exploring the intersection of ASGI turtles-all-the-way-down and pluggy’s powerful mechanism for gluing components together. Both WSGI and Django’s reusable apps have attempted to create a reusable ecosystem in the past, to limited levels of success. Let’s see if ASGI can finally make the turtle dream come true.

Further reading

Hello ASGI by Tom Christie is the best introduction to ASGI I’ve seen. Tom is the author of the Uvicorn ASGI server (used by Datasette as-of this evening) and Starlette, a delightfully well-designed ASGI web framework. I’ve learned an enormous amount about ASGI by reading Tom’s code. Tom also gave a talk about ASGI at DjangoCon Europe a few months ago.

If you haven’t read A Django Async Roadmap, which Andrew Godwin published last year, you should absolutely catch up. More than just talking about ASGI, Andrew sketches out a detailed and actionable plan for bringing asyncio to Django core. Andrew landed the first Django core ASGI code based on that plan just a few days ago.

If you’re interested in the details of Datasette’s ASGI implementation, I posted detailed commentary on issue #272 over the past thirteen months as I researched and finalized my approach. I added further commentary to the associated pull request, which gathers together the 34 commits it took to ship the feature (squashed into a single commit to master).

Datasette 0.28—and why master should always be releasable three months ago

It’s been quite a while since the last substantial release of Datasette. Datasette 0.27 came out all the way back in January.

This isn’t because development has slowed down. In fact, the project has had 131 commits since then, covering a bewildering array of new functionality and with some significant contributions from developers who aren’t me—Russ Garrett and Romain Primet deserve special recognition here.

The problem has been one of discipline. I’m a big fan of the idea of keeping master shippable at all times in my professional work, but I hadn’t quite adopted this policy for my open-source side projects. A couple of months ago I found myself in a situation where I had two major refactorings (of faceting and of Datasette’s treatment of immutable files) going on in master at the same time, and untangling them turned out to take way longer than I had expected.

So I’ve updated Datasette’s contribution guidelines to specify that master should always be releasable, almost entirely as a reminder to myself.

All of that said, I’m finally back out of the weeds and I’m excited to announce today’s release of Datasette 0.28. It features a salmagundi of new features! I’m replicating the release notes below.

Supporting databases that change

From the beginning of the project, Datasette has been designed with read-only databases in mind. If a database is guaranteed not to change it opens up all kinds of interesting opportunities—from taking advantage of SQLite immutable mode and HTTP caching to bundling static copies of the database directly in a Docker container. The interesting ideas in Datasette explores this idea in detail.

As my goals for the project have developed, I realized that read-only databases are no longer the right default. SQLite actually supports concurrent access very well provided only one thread attempts to write to a database at a time, and I keep encountering sensible use-cases for running Datasette on top of a database that is processing inserts and updates.

So, as-of version 0.28 Datasette no longer assumes that a database file will not change. It is now safe to point Datasette at a SQLite database which is being updated by another process.

Making this change was a lot of work—see tracking tickets #418, #419 and #420. It required new thinking around how Datasette should calculate table counts (an expensive operation against a large, changing database) and also meant reconsidering the “content hash” URLs Datasette has used in the past to optimize the performance of HTTP caches.

Datasette can still run against immutable files and gains numerous performance benefits from doing so, but this is no longer the default behaviour. Take a look at the new Performance and caching documentation section for details on how to make the most of Datasette against data that you know will be staying read-only and immutable.

Faceting improvements, and faceting plugins

Datasette Facets provide an intuitive way to quickly summarize and interact with data. Previously the only supported faceting technique was column faceting, but 0.28 introduces two powerful new capabilities: facet-by-JSON-array and the ability to define further facet types using plugins.

Facet by array (#359) is only available if your SQLite installation provides the json1 extension. Datasette will automatically detect columns that contain JSON arrays of values and offer a faceting interface against those columns—useful for modelling things like tags without needing to break them out into a new table. See Facet by JSON array for more.

The new register_facet_classes() plugin hook (#445) can be used to register additional custom facet classes. Each facet class should provide two methods: suggest() which suggests facet selections that might be appropriate for a provided SQL query, and facet_results() which executes a facet operation and returns results. Datasette’s own faceting implementations have been refactored to use the same API as these plugins.
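
A facet plugin therefore boils down to a class with those two methods plus a hook implementation that registers it. Schematically (the real base class, constructor and return values are documented alongside the register_facet_classes() hook):

from datasette import hookimpl


class ExampleFacet:
    # Schematic outline only, based on the description above.
    type = "example"

    async def suggest(self):
        # Suggest facet selections that might be appropriate for the current query.
        return []

    async def facet_results(self):
        # Execute the facet operation and return its results.
        return []


@hookimpl
def register_facet_classes():
    return [ExampleFacet]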

datasette publish cloudrun

Google Cloud Run is a brand new serverless hosting platform from Google, which allows you to build a Docker container which will run only when HTTP traffic is received and will shut down (and hence cost you nothing) the rest of the time. It’s similar to Zeit’s Now v1 Docker hosting platform which sadly is no longer accepting signups from new users.

The new datasette publish cloudrun command was contributed by Romain Primet (#434) and publishes selected databases to a new Datasette instance running on Google Cloud Run.

See Publishing to Google Cloud Run for full documentation.

register_output_renderer plugins

Russ Garrett implemented a new Datasette plugin hook called register_output_renderer (#441) which allows plugins to create additional output renderers in addition to Datasette’s default .json and .csv.

Russ’s in-development datasette-geo plugin includes an example of this hook being used to output .geojson automatically converted from SpatiaLite.

Medium changes

  • Datasette now conforms to the Black coding style (#449)—and has a unit test to enforce this in the future
  • New Special table arguments:
    • ?columnname__in=value1,value2,value3 filter for executing SQL IN queries against a table, see Table arguments (#433)
    • ?columnname__date=yyyy-mm-dd filter which returns rows where the specified datetime column falls on the specified date (583b22a)
    • ?tags__arraycontains=tag filter which acts against a JSON array contained in a column (78e45ea)
    • ?_where=sql-fragment filter for the table view (#429)
    • ?_fts_table=mytable and ?_fts_pk=mycolumn querystring options can be used to specify which FTS table to use for a search query—see Configuring full-text search for a table or view (#428)
  • You can now pass the same table filter multiple times—for example, ?content__not=world&content__not=hello will return all rows where the content column is neither hello nor world (#288)
  • You can now specify about and about_url metadata (in addition to source and license) linking to further information about a project—see Source, license and about
  • New ?_trace=1 parameter now adds debug information showing every SQL query that was executed while constructing the page (#435)
  • datasette inspect now just calculates table counts, and does not introspect other database metadata (#462)
  • Removed /-/inspect page entirely—this will be replaced by something similar in the future, see #465
  • Datasette can now run against an in-memory SQLite database. You can do this by starting it without passing any files or by using the new --memory option to datasette serve. This can be useful for experimenting with SQLite queries that do not access any data, such as SELECT 1+1 or SELECT sqlite_version().

Small changes

  • We now show the size of the database file next to the download link (#172)
  • New /-/databases introspection page shows currently connected databases (#470)
  • Binary data is no longer displayed on the table and row pages (#442—thanks, Russ Garrett)
  • New show/hide SQL links on custom query pages (#415)
  • The extra_body_script plugin hook now accepts an optional view_name argument (#443—thanks, Russ Garrett)
  • Bumped Jinja2 dependency to 2.10.1 (#426)
  • All table filters are now documented, and documentation is enforced via unit tests (2c19a27)
  • New project guideline: master should stay shippable at all times! (31f36e1)
  • Fixed a bug where sqlite_timelimit() occasionally failed to clean up after itself (bac4e01)
  • We no longer load additional plugins when executing pytest (#438)
  • Homepage now links to database views if there are less than five tables in a database (#373)
  • The --cors option is now respected by error pages (#453)
  • datasette publish heroku now uses the --include-vcs-ignore option, which means it works under Travis CI (#407)
  • datasette publish heroku now publishes using Python 3.6.8 (666c374)
  • Renamed datasette publish now to datasette publish nowv1 (#472)
  • datasette publish nowv1 now accepts multiple --alias parameters (09ef305)
  • Removed the datasette skeleton command (#476)
  • The documentation on how to build the documentation now recommends sphinx-autobuild

Running Datasette on Glitch four months ago

The worst part of any software project is setting up a development environment. It’s by far the biggest barrier for anyone trying to get started learning to code. I’ve been a developer for more than twenty years and I still feel the pain any time I want to do something new.

Glitch is the most promising attempt I’ve ever seen at tackling this problem. It provides an entirely browser-based development environment that allows you to edit code, see the results instantly and view and remix the source code of other people’s projects.

It’s developed into a really fun, super-creative community and a fantastic resource for people looking to get started in the ever-evolving world of software development.

This evening I decided to get Datasette running on it. I’m really impressed with how well it works, and I think Glitch provides an excellent environment for experimenting with Datasette and related tools.

TLDR version: visit https://glitch.com/edit/#!/remix/datasette-csvs right now, drag-and-drop in a CSV file and watch it get served by Datasette on Glitch just a few seconds later.

Running Python on Glitch

The Glitch documentation is all about Node.js and JavaScript, but they actually have very solid Python support as well.

Every Glitch project runs in a container that includes Python 2.7.12 and Python 3.5.2, and you can use pip install --user or pip3 install --user to install Python dependencies.

The key to running non-JavaScript projects on Glitch is the glitch.json file format. You can use this to specify an install script, which sets up your container, and a start script, which starts your application running. Glitch will route HTTP traffic to port 3000, so your application server needs to listen on that port.

This means the most basic Glitch project to run Datasette looks like this:

https://datasette-basic.glitch.me/ (view source)

It contains a single glitch.json file:

{
    "install": "pip3 install --user datasette",
    "start": "datasette -p 3000"
}

This installs Datasette using pip3, then runs it on port 3000.

Since there’s no actual data to serve, this is a pretty boring demo. The most interesting page is this one, which shows the installed versions of the software:

https://datasette-basic.glitch.me/-/versions

Something more interesting: datasette-csvs

Let’s build one with some actual data.

My csvs-to-sqlite tool converts CSV files into a SQLite database. Since it’s also written in Python we can run it against CSV files as part of the Glitch install script.

Glitch provides a special directory called .data/ which can be used as a persistent file storage space that won’t be cleared in between restarts. The following "install" script installs datasette and csvs-to-sqlite, then runs the latter to create a SQLite database from all available CSV files:

{
    "install":  "pip3 install --user datasette csvs-to-sqlite && csvs-to-sqlite *.csv .data/csv-data.db",
    "start": "datasette .data/csv-data.db -p 3000"
}

Now we can simply drag and drop CSV files into the root of the Glitch project and they will be automatically converted into a SQLite database and served using Datasette!

We need a couple of extra details. Firstly, we want Datasette to automatically re-build the database file any time a new CSV file is added or an existing CSV file is changed. We can do that by adding a "watch" block to glitch.json:

"watch": {
    "install": {
        "include": [
            "\\.csv$"
        ]
    }
}

This ensures that our "install" script will run again any time a CSV file changes.

Let’s tone down the rate at which the scripts execute, by using throttle to set the polling interval to once a second:

"throttle": 1000

The above almost worked, but I started seeing errors if I changed the number of columns in a CSV file, since doing so clashed with the schema that had already been created in the database.

My solution was to add code to the install script that would delete the SQLite database file before attempting to recreate it—using the rm ... || true idiom to prevent Glitch from failing the installation if the file it attempted to remove did not already exist.

My final glitch.json file looks like this:

{
  "install": "pip3 install --user datasette csvs-to-sqlite && rm .data/csv-data.db || true && csvs-to-sqlite *.csv .data/csv-data.db",
  "start": "datasette .data/csv-data.db -p 3000 -m metadata.json",
  "watch": {
    "install": {
      "include": [
        "\\.csv$"
      ]
    },
    "restart": {
      "include": [
        "^metadata.json$"
      ]
    },
    "throttle": 1000
  }
}

I also set it up to use Datasette’s metadata.json format, and automatically restart the server any time the contents of that file changes.

https://datasette-csvs.glitch.me/ (view source) shows the results, running against a simple example.csv file I created.

Remixing!

Here’s where things get really fun: Glitch projects support “remixing”, whereby anyone can click a link to create their own editable copy of a project.

Remixing works even if you aren’t logged in to Glitch! Anonymous projects expire after five days, so be sure to sign in with GitHub or Facebook if you want to keep yours around.

Try it out now: Visit https://glitch.com/edit/#!/remix/datasette-csvs to create your own remix of my project. Then drag a new CSV file directly into the editor and within a few seconds Datasette on Glitch will be up and running against a converted copy of your file!

Limitations

The Glitch help center article What technical restrictions are in place? describes their limits. Most importantly, projects are limited to 4,000 requests an hour—and there’s currently no way to increase that limit. They also limit projects to 200MB of disk space—easily enough to get started exploring some interesting CSV files with Datasette.

Next steps

I’m delighted at how easy this was to set up, and how much power the ability to remix these Datasette demos provides. I’m tempted to start creating remixable Glitch demos that illustrate other aspects of Datasette’s functionality such as plugins or full-text search.

Glitch is an exceptionally cool piece of software. I look forward to seeing their Python support continue to evolve.

Generating a commit log for San Francisco’s official list of trees five months ago

San Francisco has a neat open data portal (as do an increasingly large number of cities these days). For a few years my favourite file on there has been Street Tree List, a list of all 190,000 trees in the city maintained by the Department of Public Works.

I’ve been using that file for Datasette demos for a while now, but last week I noticed something intriguing: the file had been recently updated. On closer inspection it turns out it’s updated on a regular basis! I had assumed it was a static snapshot of trees at a certain point in time, but I was wrong: Street_Tree_List.csv is a living document.

Back in September 2017 I built a scraping project relating to hurricane Irma. The idea was to take data sources like FEMA’s list of open shelters and track them over time, by scraping them into a git repository and committing after every fetch.

I’ve been meaning to spend more time with this idea, and building a commit log for San Francisco’s trees looked like an ideal opportunity to do so.

sf-tree-history

Here’s the result: sf-tree-history, a git repository dedicated to recording the history of changes made to the official list of San Francisco’s trees. The repo contains three things: the latest copy of Street_Tree_List.csv, a README, and a Circle CI configuration that grabs a new copy of the file every night and, if it has changed, commits it to git and pushes the result to GitHub.

The most interesting part of the repo is the commit history itself. I’ve only been running the script for just over a week, but I already have some useful illustrative commits:

  • 7ab432cdcb8d7914cfea4a5b59803f38cade532b from March 6th records three new trees added to the file: two Monterey Pines and a Blackwood Acacia.
  • d6b258959af9546909b2eee836f0156ed88cd45d from March 12th shows four changes made to existing records. Of particular interest: TreeID 235981 (a Cherry Plum) had its address updated from 412 Webster St to 410 Webster St and its latitude and longitude tweaked a little bit as well.
  • ca66d9a5fdd632549301d249c487004a5b68abf2 lists 2151 rows changed, 1280 rows added! I found an old copy of Street_Tree_List.csv on my laptop from April 2018, so for fun I loaded it into the repository and used git commit --amend to back-date the commit to almost a year ago. I generated a commit message covering the differences between that file and the version from 9 days ago, which came in at around 10,000 lines of text. Git handled that just fine, but GitHub’s web view sadly truncates it.

csv-diff

One of the things I learned from my hurricane Irma project was the importance of human-readable commit messages that summarize the detected changes. I initially wrote some code to generate those by hand, but then realized that this could be extracted into a reusable tool.

The result is csv-diff, a tiny Python CLI tool which can generate a human (or machine) readable version of the differences between two CSV files.

Using it looks like this:

$ csv-diff one.csv two.csv --key=id
1 row added, 1 row removed, 1 row changed

1 row added

  {"id": "3", "name": "Bailey", "age": "1"}

1 row removed

  {"id": "2", "name": "Pancakes", "age": "2"}

1 row changed

  Row 1
    age: "4" => "5"

The csv-diff README has further details on the tool.
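
csv-diff can also be used as a Python library—roughly like this (see the README for the exact API):

import json
from csv_diff import load_csv, compare

# Load both files keyed on their "id" column, then diff them.
previous = load_csv(open("one.csv"), key="id")
current = load_csv(open("two.csv"), key="id")
diff = compare(previous, current)
print(json.dumps(diff, indent=2))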

Circle CI

My favourite thing about the sf-tree-history project is that it costs me nothing to run—either in hosting costs or (hopefully) in terms of ongoing maintenance.

The git repository is hosted for free on GitHub. Because it’s a public project, Circle CI will run tasks against it for free.

My .circleci/config.yml does the rest. It uses Circle’s cron syntax to schedule a task that runs every night. The task then runs this script (embedded in the YAML configuration):

cp Street_Tree_List.csv Street_Tree_List-old.csv
curl -o Street_Tree_List.csv "https://data.sfgov.org/api/views/tkzw-k3nq/rows.csv?accessType=DOWNLOAD"
git add Street_Tree_List.csv
git config --global user.email "treebot@example.com"
git config --global user.name "Treebot"
sudo pip install csv-diff
csv-diff Street_Tree_List-old.csv Street_Tree_List.csv --key=TreeID > message.txt
git commit -F message.txt && \
  git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master \
  || true

This script does all of the work.

  1. First it backs up the existing Street_Tree_List.csv as Street_Tree_List-old.csv, in order to be able to run a comparison later.
  2. It downloads the latest copy of Street_Tree_List.csv from the San Francisco data portal
  3. It adds the file to the git index and sets itself an identity for use in the commit
  4. It installs my csv-diff utility from PyPI
  5. It uses csv-diff to create a diff of the two files, and writes that diff to a new file called message.txt
  6. Finally, it attempts to create a new commit using message.txt as the commit message, then pushes the result to GitHub

The last line is the most complex. Circle CI will mark a build as failed if any of the commands in the run block return a non-0 exit code. git commit returns a non-0 exit code if you attempt to run it but none of the files have changed.

git commit ... && git push ... || true ensures that if git commit succeeds the git push command will be run, BUT if it fails the || true will still return a 0 exit code for the overall line—so Circle CI will not mark the build as failed.

There’s one last trick here: I’m using git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master to push my changes to GitHub. This takes advantage of Circle CI environment variables, which are the recommended way to configure secrets such that they cannot be viewed by anyone browsing your Circle CI builds. I created a personal GitHub auth token for this project, which I’m using to allow Circle CI to push commits to GitHub on my behalf.

Next steps

I’m really excited about this pattern of using GitHub in combination with Circle CI to track changes to any file that is being posted on the internet. I’m opening up the code (and my csv-diff utility) in the hope that other people will use them to set up their own tracking projects. Who knows, maybe there’s a file out there that’s even more exciting than San Francisco’s official list of trees!

I commissioned an oil painting of Barbra Streisand’s cloned dogs five months ago

Two dogs in a stroller looking at a gravestone, as an oil painting
Two identical puffs of white fur, gazing at the tombstone of the dog they are

Last year, Barbra Streisand cloned her dog, Sammie.

The story is fascinating, as is the background reading on dog cloning from Vanity Fair. But the thing that really stuck with me was the photograph that accompanied “Barbra Streisand Explains: Why I Cloned My Dog” in the New York Times:

Two dogs in a stroller looking at a gravestone

David Ewing Duncan in Vanity Fair described the scenario like this: Barbra Streisand, visiting the grave of her beloved Sammie, with Miss Violet and Miss Scarlett perched next to her in their stroller—two identical puffs of white fur, gazing at the tombstone of the dog they are.

This photograph says so much about the age that we live in. I couldn’t get it out of my head.

I’ve long been fascinated by Dafen, the town in China that was once responsible for 60% of the world’s oil paintings—mostly replicas, but today increasingly original artwork.

I always wanted to commission an oil painting from China, but I never quite found the right subject... until now.

There’s something deliciously appropriate about using a painting cloning service to clone a photograph of some cloned dogs.

So I uploaded a copy of the photo to Instapainting and entered a few extra instructions:

Please paint the stroller closer to the gravestone—adjust the composition so that it fits the 12x16 dimensions while maintaining the two key elements of the image: the stroller with the two dogs in it and the gravestone that they are looking at

A couple of months later, a tube arrived in the mail. I opened it up... and they had nailed it. If anything the painting is an improvement on the original photograph: the composition is tighter, the stroller no longer has its wheel cut off, some dead plants in the background (which I had not even noticed) are given a bit more prominence, and the little doggy faces have exactly the right expressions of mild existential dread.

So thank you Alice Wu at Xiamen Dearchic Arts—I could not be happier.

With a painting this good, obviously it needed to be framed. I took it to Underglass Framing in San Francisco’s Hayes Valley and told them I was looking for something with an air of existential dread. “I think we can do that” they said.

Two dogs in a stroller looking at a gravestone, as an oil painting in an intimidating frame

Natalie says I can keep it in the guest bathroom.

Elsewhere

15th August 2019

  • Subsume JSON a.k.a. JSON ⊂ ECMAScript (via) TIL that JSON isn’t a subset of ECMAScript after all! “In ES2018, ECMAScript string literals couldn’t contain unescaped U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR characters, because they are considered to be line terminators even in that context.” #

9th August 2019

  • Y’all decided you could send 6x as much script because the high-end could take it...but the next billion users can’t. There might have been budget for 2x, but not 6x. Not by a long shot.

    Alex Russell #

7th August 2019

  • OPP (Other People’s Problems) (via) Camille Fournier provides a comprehensive guide to picking your battles: in a large organization how can you navigate the enormous array of problems you can see that you’d like to fix, especially when so many of those problems aren’t directly in your area of control? #

5th August 2019

  • This is when I pull out “we don’t do that here.” It is a conversation ender. If you are the newcomer and someone who has been around a long time says “we don’t do that here”, it is hard to argue. This sentence doesn’t push my morality on anyone. If they want to do whatever it is elsewhere, I’m not telling them not to. I’m just cluing them into the local culture and values.

    Aja Hammerly #

  • Optimizing for the mobile web: Moving from Angular to Preact. Grubhub reduced their mobile web load times from 9-11s to 3-4s by replacing Angular with Preact (and replacing other libraries such as lodash with native JavaScript code). The conversion took 6 months and involved running Angular and Preact simultaneously during the transition—not a huge additional overhead as Preact itself is only 4KB. They used TypeScript throughout and credit it with providing a great deal of confidence and productivity to the overall refactoring. #

3rd August 2019

  • Logs vs. metrics: a false dichotomy (via) Nick Stenning discusses the differences between logs and metrics: most notably that metrics can be derived from logs but logs cannot be reconstituted starting with time-series metrics. #
  • Documentation needs to include and be structured around its four different functions: tutorials, how-to guides, explanation and technical reference. Each of them requires a distinct mode of writing. People working with software need these four different kinds of documentation at different times, in different circumstances—so software usually needs them all.

    Daniele Procida #

1st August 2019

  • PyPI now supports uploading via API token (via) All of my open source Python libraries are set up to automatically deploy new tagged releases as PyPI packages using Circle CI or Travis, but I’ve always been a bit uncomfortable about sharing my PyPI password with those CI platforms to get this to work. PyPI just added scoped authentication tokens, which means I can issue a token that’s only allowed to upload a specific project and see an audit log of when that token was last used. #

24th July 2019

  • Repository driven development (via) I’m already a big fan of keeping documentation and code in the same repo so you can update them both from within the same code review, but this takes it even further: in repository driven development every aspect of the code and configuration needed to define, document, test and ship a service live in the service repository—all the way down to the configurations for reporting dashboards. This sounds like heaven. #
  • Using memory-profiler to debug excessive memory usage in healthkit-to-sqlite. This morning I figured out how to use the memory-profiler module (and mprof command line tool) to debug memory usage of Python processes. I added the details, including screenshots, to this GitHub issue. It helped me knock down RAM usage for my healthkit-to-sqlite from 2.5GB to just 80MB by making smarter usage of the ElementTree pull parser. #
  • Targeted diagnostic logging in production (via) Will Sargent defines diagnostic logging as “debug logging statements with an audience”, and proposes controlling this style of logging via a feature flag system, allowing detailed logging to be turned on in production against a selected subset of users in order to help debug difficult problems. Lots of great background material on the topic of observability here too. #

22nd July 2019

  • When a rewrite isn’t: rebuilding Slack on the desktop. Slack appear to have pulled off the almost impossible: finishing a complete, incremental rewrite of their core product. They moved from jQuery to React over the course of two years, constantly shipping new features as they went along. The biggest gain was in rewriting their code to support multiple workspaces, which means desktop client users no longer have to run a separate copy of Electron for every workspace they are signed into. #
  • healthkit-to-sqlite. Ever since I got an Apple Watch I’ve been itching to get my hands on the step tracking and health data that it’s been collecting for me. I know it’s there in a SQLite database on my wrist, but I couldn’t figure out how to get it! A few days ago I stumbled across the “Export Health Data” button in the iOS Health app, and it turns out it creates a zip file containing XML with a full dump of the data collected by Apple Health. healthkit-to-sqlite is the tool I’ve built that can read that export and use it to create a SQLite database ready to be queried and explored with Datasette. It’s a pretty basic implementation but it’s already giving me access to over 3 million rows of data. Lots of potential here for interesting work with personal analytics. #

20th July 2019

  • Unlocking the Department of State’s foreign military training data for good this time (via) I’m so excited about this: Security Force Monitor used Datasette to publish a 200,000 row database of training engagements between the US military and foreign military units, based on their own massive efforts to clean up the official data (from thousands of PDF files). This is pretty much my dream use-case for Datasette, and their future goals are inspiring: “Our hope is that when the next report arrives in a short few months, we will be able to turn it into machine readable data and pass it around the sector in minutes, rather than months.” #