Simon Willison’s Weblog

On glitch, genetics, asgi, csv, sqlite, ...

Recent entries

Weeknotes: Design thinking for journalists, genome-to-sqlite, datasette-atom two days ago

I haven’t had much time for code this week: we’ve had a full five-day workshop at JSK with Tran Ha (a JSK alum) learning how to apply Design Thinking to our fellowship projects and generally to challenges facing journalism.

I’ve used aspects of design thinking in building software products, but I’d never really thought about how it could be applied outside of digital product design. It’s been really interesting—especially seeing the other fellows (who, unlike me, are generally not planning to build software during their fellowship) start to apply it to a much wider and more interesting range of problems.

I’ve been commuting in to Stanford on the Caltrain, which did give me a bit of time to work on some code.

genome-to-sqlite

I’m continuing to build out a family of tools for personal analytics, where my principal goal is to reclaim the data that various internet companies have collected about me and pull it into a local SQLite database so I can analyze, visualize and generally have fun with it.

A few years ago I shared my DNA with 23andMe. I don’t think I’d make the decision to do that today: it’s incredibly personal data, and the horror stories about people making unpleasant discoveries about their family trees keep on building. But since I’ve done it, I decided to see if I could extract out some data…

… and it turns out they let you download your entire genome! You can export it as a zipped up TSV file—mine decompresses to 15MB of data (which feels a little small—I know little about genetics, but I’m presuming that’s because the genome they record and share is just the interesting known genetic markers, not the entire DNA sequence—UPDATE: confirmed, thanks @laurencerowe).

So I wrote a quick utility, genome-to-sqlite, which loads the TSV file (directly from the zip or a file you’ve already extracted) and writes it to a simple SQLite table. Load it into Datasette and you can even facet by chromosome, which is exciting!
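
If you’re curious what that import involves, here’s a minimal sketch of the idea in plain Python (not the actual genome-to-sqlite code), assuming the standard 23andMe export layout of rsid, chromosome, position and genotype columns, with metadata comment lines prefixed by #:

import sqlite3


def import_genome(tsv_path, db_path="genome.db"):
    # 23andMe exports are tab-separated; comment lines start with "#"
    conn = sqlite3.connect(db_path)
    conn.execute(
        "create table if not exists genome "
        "(rsid text, chromosome text, position integer, genotype text)"
    )
    with open(tsv_path) as f:
        rows = (
            line.rstrip("\n").split("\t")
            for line in f
            if line.strip() and not line.startswith("#")
        )
        conn.executemany("insert into genome values (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()

The real tool also handles reading the TSV straight out of the downloaded zip.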

This is where my knowledge runs out. I’m confident someone with more insight than me could construct some interesting SQL queries against this—maybe one that determines if you are likely to have red hair?—so I’m hoping someone will step in and provide a few examples.
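
I’m definitely not the right person to write those queries, but structurally they boil down to lookups against specific markers. A purely illustrative example (the rsid here is made up, not a real red hair marker):

import sqlite3

conn = sqlite3.connect("genome.db")
# Look up the genotype recorded for a single marker
row = conn.execute(
    "select chromosome, position, genotype from genome where rsid = ?",
    ("rs1234567",),
).fetchone()
print(row)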

I filed a help wanted issue on GitHub. I also put a request out on Twitter for an UPDATE statement that could turn me into a dinosaur.

datasette-atom

This is very much a work-in-progress right now: datasette-atom will be a Datasette plugin that adds .atom as an output format (using the register_output_renderer plugin hook contributed by Russ Garrett a few months ago).

The aim is to allow people to subscribe to the output of a query in their feed reader (and potentially through that via email and other mechanisms)—particularly important for databases which are being updated over time.

It’s a slightly tricky plugin to design because valid Atom feed entries require a globally unique ID, a title and an “updated” date—and not all SQL queries produce obvious candidates for these values. As such, I’m going to have the plugin prompt the user for those fields and then persist them in the feed URL that you subscribe to.

This also means you won’t be able to generate an Atom feed for a query that doesn’t return at least one datetime column. I think I’m OK with that.
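
For the curious, here’s roughly the shape a renderer like this takes. This is a simplified sketch rather than the real plugin, and it assumes the query happens to return id, title and updated columns directly (the actual plugin will instead prompt for which columns to use):

from datasette import hookimpl
from xml.sax.saxutils import escape


def render_atom(args, data, view_name):
    # Assumes each row can be addressed by column name and exposes
    # "id", "title" and "updated" values suitable for Atom entries.
    entries = []
    for row in data["rows"]:
        entries.append(
            "<entry><id>{}</id><title>{}</title><updated>{}</updated></entry>".format(
                escape(str(row["id"])),
                escape(str(row["title"])),
                escape(str(row["updated"])),
            )
        )
    body = '<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom">{}</feed>'.format(
        "".join(entries)
    )
    return {
        "body": body,
        "content_type": "application/xml; charset=utf-8",
        "status_code": 200,
    }


@hookimpl
def register_output_renderer(datasette):
    return {"extension": "atom", "callback": render_atom}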

github-to-sqlite

I released one new feature for github-to-sqlite this week: the github-to-sqlite repos github.db command, which populates a database table of all of the repositories available to the authenticated user. Or use github-to-sqlite repos github.db dogsheep to pull the repos owned by a specific user or organization.

The command configures a SQLite full-text search index against the repo titles and descriptions, so if you have a lot of GitHub repos (I somehow have nearly 300!) you can search through them and use Datasette to facet them against different properties.
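
As an aside, you can query that search index directly too. Something along these lines should work, assuming the table is called repos, sqlite-utils’ convention of an FTS shadow table called repos_fts, and the full_name/description columns from the GitHub API response:

import sqlite3

conn = sqlite3.connect("github.db")
# Full-text search the repos table via its FTS shadow table
rows = conn.execute(
    """
    select repos.full_name, repos.description
    from repos
    join repos_fts on repos.rowid = repos_fts.rowid
    where repos_fts match ?
    """,
    ("datasette",),
).fetchall()
for full_name, description in rows:
    print(full_name, "-", description)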

github-to-sqlite currently has two other useful subcommands: starred fetches details of every repository a user has starred, and issues pulls details of the issues (but sadly not yet their comment threads) attached to a repository.

Books

I’m trying to spend more time reading books—so I’m going to start including book stuff in my weeknotes in the hope of keeping myself on track.

I acquired two new books this week:

  • Just Enough Research by Erika Hall (recommended by Tom Coates and Tran Ha), because I need to spend the next few months interviewing as many journalists (and other project stakeholders) as possible to ensure I am solving the right problems for them.
  • Producing Open Source Software by Karl Fogel, because my fellowship goal is to build a thriving open source ecosystem around tooling for data journalism and this book looks like it covers a lot of the topics I need to really do a good job of that.

Next step: actually read them! Hopefully I’ll have some notes to share in next week’s update.

Weeknotes: ONA19, twitter-to-sqlite, datasette-rure nine days ago

I’ve decided to start writing weeknotes for the duration of my JSK fellowship. Here goes!

I started the fellowship last week, and this week I’ve been attending the Online News Association conference in New Orleans along with the other fellows.

Natalie and I have had a lot of fun exploring New Orleans, including several niche museums (documented in this Twitter thread, which I expect to continue adding to for years to come).

ONA is the largest digital news conference on the calendar, and as such it’s a great opportunity to meet all kinds of different journalists and get a start on figuring out how best to approach the fellowship.

twitter-to-sqlite

I’ve been working on this tool for a couple of weeks now, but this week it really started to prove its worth. twitter-to-sqlite is a utility I’m building to fetch data from Twitter and write it into a SQLite database. It has a comprehensive README, as do all of my recent projects: I’ve been doing this for long enough that I know that time spent on documentation will pay me back enormously in just a few months.

My initial goal for this tool was to use it for personal analytics, partly inspired by my Twitter followers project from last year. Then ONA sent out a spreadsheet with the Twitter names of most of the conference attendees and I realised there was an opportunity to use it for something more interesting.

I won’t share the attendee list here because it’s intended to be private to conference attendees, but the short version of what I did with it is this:

First, load the attendee list into a database using csvs-to-sqlite (after first converting the XLS file to CSV using Numbers):

$ csvs-to-sqlite attendees.csv ona.db

Next, fetch the full Twitter profiles for every one of those attendees. I added a --sql option to twitter-to-sqlite for this (issue #8) and ran the following:

$ twitter-to-sqlite users-lookup ona.db --sql="select Twitter from attendees"

Now I can run datasette ona.db and execute full-text searches against the complete Twitter profiles of those attendees—great for figuring out e.g. who is attending the conference from the Washington Post.

I took this a step further: can I identify people I follow on Twitter who are attending the conference? I can pull the list of IDs of people I follow like so:

$ twitter-to-sqlite friends-ids ona.db simonw

Now I can see which of my Twitter friends are attending the conference by loading it into Datasette and using this query:

select * from users
where screen_name in (select Twitter from attendees)
and id in (select followed_id from following where follower_id = 12497)

This is not a particularly efficient SQL query… and it doesn’t matter! The great thing about working against tiny SQLite databases that contain just a few thousand rows of data is that you can nest SQL queries like this with little concern for their performance—this one runs on my laptop in 18.213ms.

datasette-rure

SQLite has a REGEXP operator, which is documented thus:

The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If an application-defined SQL function named “regexp” is added at run-time, then the “X REGEXP Y” operator will be implemented as a call to “regexp(Y,X)”.

Python’s sqlite3 module lets you register custom functions, so this should be an easy fix. I’ve been holding off on implementing this in Datasette for a while though because Datasette allows user-provided queries, and Python’s regular expressions have a nasty quality: you can easily construct a regular expression and input that will hang the Python interpreter.
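
Registering the function itself really is trivial. Here’s the general pattern, using Python’s built-in re module purely for illustration (which is exactly the module you don’t want to expose to untrusted input, for the reason above):

import re
import sqlite3

conn = sqlite3.connect(":memory:")


def regexp(pattern, value):
    # SQLite calls regexp(Y, X) for "X REGEXP Y"
    return re.search(pattern, value) is not None


conn.create_function("regexp", 2, regexp)

print(conn.execute(
    "select 'datasette-rure' REGEXP '^datasette-'"
).fetchone()[0])  # prints 1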

So I needed a safe regular expression library that I could trust not to hang my interpreter on the wrong input.

Google have such a library—re2—but the various Python bindings for it aren’t cleanly installable via pip install.

I asked about this on Twitter and Alex Willmer pointed me to rure-python—a Python wrapper for the Rust regular expression library, which “guarantees linear time searching using finite automata”—exactly what I was looking for!

So I built a quick Datasette plugin: datasette-rure, which provides a regexp() function to enable the REGEXP operator, powered by the Rust regular expression engine.

I also added regexp_match(pattern, text, index) and regexp_matches(pattern, text) functions. Here are their examples from the README:

select regexp_match('.*( and .*)', title) as n from articles where n is not null
-- Returns the ' and X' component of any matching titles, e.g.
--     and Recognition
--     and Transitions Their Place
-- etc
select regexp_matches(
    'hello (?P<name>\w+) the (?P<species>\w+)',
    'hello bob the dog, hello maggie the cat, hello tarquin the otter'
)
-- Returns a JSON array:
-- [{"name": "bob", "species": "dog"},
--  {"name": "maggie", "species": "cat"},
--  {"name": "tarquin", "species": "otter"}]

Returning JSON from a SQLite custom function is a fun trick: it means you can then loop over the returned rows or even join them against other tables within SQL using the bundled json_each() function.
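
Here’s a tiny demonstration of that trick, using a stand-in custom function (not the plugin’s regexp_matches()) together with SQLite’s json_each(); this assumes your SQLite build includes the json1 extension:

import json
import sqlite3

conn = sqlite3.connect(":memory:")

# A toy custom function that returns a JSON array, standing in for
# regexp_matches() from the plugin
conn.create_function(
    "split_words", 1, lambda text: json.dumps(text.split())
)

# json_each() turns that JSON array back into rows you can select or join
rows = conn.execute(
    "select value from json_each(split_words('hello bob the dog'))"
).fetchall()
print([r[0] for r in rows])  # ['hello', 'bob', 'the', 'dog']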

I deployed an interactive demo of Datasette running the plugin. I’ve been trying to do this for all of the plugins I release—datasette-jq is another recent example (the interactive demos are linked from the README).

My JSK Fellowship: Building an open source ecosystem of tools for data journalism 12 days ago

I started a new chapter of my career last week: I began a year long fellowship with the John S. Knight Journalism Fellowships program at Stanford.

I’m going to spend the year thinking about and working on tools for data journalism. More details below, but the short version is that I want to help make the kind of data reporting we’re seeing from well funded publications like the New York Times, the Washington Post and the LA Times more accessible to smaller publications that don’t have the budget for full-time software engineers.

I’ve worked with newspapers a few times in the past: I helped create what would later become Django at the Lawrence Journal-World fifteen years ago, and I spent two years working on data journalism projects at the Guardian in London before being sucked into the tech startup world. My Datasette project was inspired by the challenges I saw at the Guardian, and I’m hoping to evolve it (and its accompanying ecosystem) in as useful a way as possible.

This fellowship is a chance for me to get fully embedded back in that world. I could not be more excited about it!

I’m at the Online News Association conference in New Orleans this week: if you’d like to meet up for a chat please drop me a line on Twitter or via email (swillison is my Gmail).

Here’s the part of my fellowship application (written back in January) which describes what I’m hoping to do. The program is extremely flexible and there is plenty of opportunity for me to change my focus if something more useful emerges from my research, but this provides a good indication of where my current thinking lies.

What is your fellowship proposal?

Think of this as your title or headline for your proposal. (25 words or less)

How might we grow an open source ecosystem of tools to help data journalists collect, analyze and publish the data underlying their stories?

Now, tell us more about your proposal. Why is it important to the challenges facing journalism and journalists today? How might it create meaningful change or advance the work of journalists? (600 words or less)

Data journalism is a crucial discipline for discovering and explaining true stories about the modern world—but effective data-driven reporting requires tools and skills that are still not widely available outside of large, well funded news organizations.

Making these techniques readily available to smaller, local publications can help them punch above their weight, producing more impactful stories that overcome the challenges posed by their constrained resources.

Tools that work for smaller publications can work for larger publications as well. Reducing the time and money needed to produce great data journalism raises all boats and enables journalists to re-invest their improved productivity in ever more ambitious reporting projects.

Academic journals are moving towards publishing both the code and data that underlies their papers, encouraging reproducibility and better sharing of the underlying techniques. I want to encourage the same culture for data journalism, in the hope that “showing your working” can help fight misinformation and improve readers’ trust in the stories that are derived from the data.

I would like to use a JSK fellowship to build an ecosystem of data journalism tools that make data-driven reporting as productive and reproducible as possible, while opening it up to a much wider group of journalists.

At the core of my proposal is my Datasette open source project. I’ve been running this as a side-project for a year with some success: newspapers that have used it include the Baltimore Sun, who used it for their public salary records project: https://salaries.news.baltimoresun.com/. By dedicating myself to the project full-time I anticipate being able to greatly accelerate the pace of development and my ability to spend time teaching news organizations how to take advantage of it.

More importantly, the JSK fellowship would give me high quality access to journalism students, professors and professionals. A large portion of my fellowship would be spent talking to a wide pool of potential users and learning exactly what people need from the project.

I do not intend to be the only developer behind Datasette: I plan to deliberately grow the pool of contributors, both to the Datasette core project but also in developing tools and plugins that enhance the project’s capabilities. The great thing about a plugin ecosystem is that it removes the need for a gatekeeper: anyone can build and release a plugin independent of Datasette core, which both lowers the barriers to entry and dramatically increases the rate at which new functionality becomes available to all Datasette users.

My goal for the fellowship is to encourage the growth of open source tools that can be used by data journalists to increase the impact of their work. My experience at the Guardian led me to Datasette as a promising avenue for this, but in talking to practitioners and students I hope to find other opportunities for tools that can help. My experience as a startup founder, R&D software engineer and open source contributor puts me in an excellent position to help create these tools in partnership with the wider open source community.

Single sign-on against GitHub using ASGI middleware two months ago

I released Datasette 0.29 last weekend, the first version of Datasette to be built on top of ASGI (discussed previously in Porting Datasette to ASGI, and Turtles all the way down).

This also marked the introduction of the new asgi_wrapper plugin hook, which allows plugins to wrap the entire Datasette application in their own piece of ASGI middleware.

To celebrate this new capability, I also released two new plugins: datasette-cors, which provides fine-grained control over CORS headers (using my asgi-cors library from a few months ago) and datasette-auth-github, the first of hopefully many authentication plugins for Datasette.

datasette-auth-github

The new plugin is best illustrated with a demo.

Visit https://datasette-auth-demo.now.sh/ and you will be redirected to GitHub and asked to approve access to your account (just your e-mail address, not repository access).

Agree, and you’ll be redirected back to the demo with a new element in the Datasette header: your GitHub username, plus a “log out” link in the navigation bar at the top of the screen.

Controlling who can access

The default behaviour of the plugin is to allow in anyone with a GitHub account. Since the primary use-case for the plugin (at least for the moment) is restricting access to view data to a trusted subset of people, the plugin lets you configure who is allowed to view your data in three different ways: as a list of specific GitHub usernames (allow_users), as the members of one or more GitHub organizations (allow_orgs), or as the members of specific teams within an organization (allow_teams).

Datasette inherits quite a sophisticated user management system from GitHub, with very little effort required from the plugin. The user_is_allowed() method implements all three of the above options against the GitHub API in just 40 lines of code.

These options can be set using the "plugins" section of the Datasette metadata.json configuration file. Here’s an example:

{
  "plugins": {
    "datasette-auth-github": {
      "client_id": {"$env": "GITHUB_CLIENT_ID"},
      "client_secret": {"$env": "GITHUB_CLIENT_SECRET"},
      "allow_users": ["simonw"]
    }
  }
}

This also illustrates a new Datasette feature: the ability to set secret plugin configuration values. {"$env": "GITHUB_CLIENT_SECRET"} means "read this configuration option from the environment variable GITHUB_CLIENT_SECRET".

Automatic log in

Like many OAuth providers, GitHub only asks the user for their approval the first time they log into a given app. Any subsequent times they are redirected to GitHub it will skip the permission screen and redirect them right back again with a token.

This means we can implement automatic log in: any time a visitor arrives who does not have a cookie we can bounce them directly to GitHub, and if they have already consented they will be logged in instantly.

This is a great user-experience—provided the user is logged into GitHub they will be treated as if they are logged into your application—but it does come with a downside: what if the user clicks the “log out” link?

For the moment I’ve implemented this using another cookie: if the user clicks “log out”, I set an asgi_auth_logout cookie marking the user as having explicitly logged out. While they have that cookie they won’t be logged in automatically, instead having to click an explicit link. See issue 41 for thoughts on how this could be further improved.

One pleasant side-effect of all of this is that datasette-auth-github doesn’t need to persist the user’s GitHub access_token anywhere—it uses it during initial authentication to check for any required organizations or teams, but then it deliberately forgets the token entirely.

OAuth access tokens are like passwords, so the most responsible thing for a piece of software to do with them is avoid storing them anywhere at all unless they are explicitly needed.

What happens when a user leaves an organization?

When building against a single sign-in provider, consideration needs to be given to offboarding: when a user is removed from a team or organization they should also lose access to their SSO applications.

This is difficult when an application sets its own authentication cookies, like datasette-auth-github does.

One solution would be to make an API call on every request to the application, to verify that the user should still have access. This would slow everything down and is likely to blow through rate limits as well, so we need a more efficient solution.

I ended up solving this with two mechanisms. Since we have automatic log in, our cookies don’t actually need to last very long—so by default the signed cookies set by the plugin last for just one hour. When a user’s cookie has expired they will be redirected back through GitHub—they probably won’t even notice the redirect, and their permissions will be re-verified as part of that flow.

But what if you need to invalidate those cookies instantly?

To cover that case, I’ve incorporated an optional cookie_version configuration option into the signatures on the cookies. If you need to invalidate every signed cookie that is out there—to lock out a compromised GitHub account owner for example—you can do so by changing the cookie_version configuration option and restarting (or re-deploying) Datasette.

These options are all described in detail in the project README.

Integration with datasette publish

The datasette publish command-line tool lets users instantly publish a SQLite database to the internet, using Heroku, Cloud Run or Zeit Now v1. I’ve added support for setting secret plugin configuration directly to that tool, which means you can publish an authentication-protected SQLite database to the internet with a shell one-liner, using --install=datasette-auth-github to install the plugin and --plugin-secret to configure it:

$ datasette publish cloudrun fixtures.db \
   --install=datasette-auth-github \
   --name datasette-auth-protected \
   --service datasette-auth-protected \
   --plugin-secret datasette-auth-github allow_users simonw \
   --plugin-secret datasette-auth-github client_id 85f6224cb2a44bbad3fa \
   --plugin-secret datasette-auth-github client_secret ...

This creates a Cloud Run instance which only allows GitHub user simonw to log in. You could instead use --plugin-secret datasette-auth-github allow_orgs my-org to allow any users from a specific GitHub organization.

Note that Cloud Run does not yet give you full control over the URL that will be assigned to your deployment. In this case it gave me https://datasette-auth-protected-j7hipcg4aq-uc.a.run.app—which works fine, but I needed to update my GitHub OAuth application’s callback URL manually to https://datasette-auth-protected-j7hipcg4aq-uc.a.run.app/-/auth-callback after deploying the application in order to get the authentication flow to work correctly.

Add GitHub authentication to any ASGI application!

datasette-auth-github isn’t just for Datasette: I deliberately wrote the plugin as ASGI middleware first, with only a very thin layer of extra code to turn it into an installable plugin.

This means that if you are building any other kind of ASGI app (or using an ASGI-compatible framework such as Starlette or Sanic) you can wrap your application directly with the middleware and get the same authentication behaviour as when the plugin is added to Datasette!

Here’s what that looks like:

from datasette_auth_github import GitHubAuth
from starlette.applications import Starlette
from starlette.responses import HTMLResponse
import uvicorn

app = Starlette(debug=True)


@app.route("/")
async def homepage(request):
    return HTMLResponse("Hello, {}".format(
        repr(request.scope["auth"])
    ))


authenticated_app = GitHubAuth(
    app,
    client_id="986f5d837b45e32ee6dd",
    client_secret="...",
    require_auth=True,
    allow_users=["simonw"],
)

if __name__ == "__main__":
    uvicorn.run(authenticated_app, host="0.0.0.0", port=8000)

The middleware adds a scope["auth"] key describing the logged in user, which is then passed through to your application. More on this in the README.

Your security reviews needed!

Since datasette-auth-github adds authentication to Datasette, it is an extremely security-sensitive piece of code. So far I’m the only person who has looked at it: before I start widely recommending it to people I’d really like to get some more eyes on it to check for any potential security problems.

I’ve opened issue #44 encouraging security-minded developers to have a dig through the code and see if there’s anything that can be tightened up or any potential vulnerabilities that need to be addressed. Please get involved!

It’s a pretty small codebase, but here are some areas you might want to inspect:

  • At a high level: is the way I’m verifying the user through the GitHub API and then storing their identity in a signed cookie the right way to go?
  • The cookie signing secret is derived from the GitHub OAuth application’s client_id and client_secret (because that secret is already meant to be a secret), combined with the cookie_version option described above—implementation here. Since this is a derived secret I’m using pbkdf2_hmac with 100,000 iterations. This is by far the most cryptographically interesting part of the code, and could definitely do with some second opinions (there’s a simplified sketch of the derivation after this list).
  • The code used to sign and verify cookies is based on Django’s (thoroughly reviewed) implementation, but could benefit from a sanity check.
  • I wanted this library to work on Glitch, which currently only provides Python 3.5.2. Python’s asyncio HTTP libraries such as http3 and aiohttp require more modern Pythons, so I ended up rolling my own very simple async HTTP function which uses urllib.request inside a loop.run_in_executor thread pool (also sketched after this list). Is that approach sound? Rolling my own HTTP client in this way feels a little hairy.
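
Here’s a simplified sketch of that key derivation (not the plugin’s exact code; the salt and parameter names here are illustrative):

import hashlib


def derive_signing_secret(client_id, client_secret, cookie_version="default"):
    # Derive the cookie-signing key from secrets we already have, so users
    # don't need to configure a separate signing secret. Changing
    # cookie_version changes the derived key, invalidating existing cookies.
    material = "{}:{}:{}".format(client_id, client_secret, cookie_version)
    return hashlib.pbkdf2_hmac(
        "sha256",
        material.encode("utf-8"),
        b"illustrative-salt",  # the real implementation uses its own salt
        100000,
    )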
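
And here’s the general shape of that thread-pool HTTP workaround, again as a simplified sketch rather than the plugin’s actual code:

import asyncio
import json
import urllib.request


async def http_get_json(url, headers=None):
    # urllib.request is blocking, so run it in the default thread pool
    # to avoid stalling the event loop. Works on Python 3.5's asyncio.
    loop = asyncio.get_event_loop()

    def _fetch():
        request = urllib.request.Request(url, headers=headers or {})
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))

    return await loop.run_in_executor(None, _fetch)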

This has been a really fun project so far, and I’m very excited about the potential for authenticated Datasette moving forward—not to mention the possibilities unlocked by an ASGI middleware ecosystem with strong support for wrapping any application in an authentication layer.

Porting Datasette to ASGI, and Turtles all the way down three months ago

This evening I finally closed a Datasette issue that I opened more than 13 months ago: #272: Port Datasette to ASGI. A few notes on why this is such an important step for the project.

ASGI is the Asynchronous Server Gateway Interface standard. It’s been evolving steadily over the past few years under the guidance of Andrew Godwin. It’s intended as an asynchronous replacement for the venerable WSGI.

Turtles all the way down

Ten years ago at EuroDjangoCon 2009 in Prague I gave a talk entitled Django Heresies. After discussing some of the design decisions in Django that I didn’t think had aged well, I spent the last part of the talk talking about Turtles all the way down. I wrote that idea up here on my blog (see also these slides).

The key idea was that Django would be more interesting if the core Django contract—a function that takes a request and returns a response—was extended to more places in the framework. The top level site, the reusable applications, middleware and URL routing could all share that same contract. Everything could be composed from the same raw building blocks.

I’m excited about ASGI because it absolutely fits the turtles all the way down model.

The ASGI contract is an asynchronous function that takes three arguments:

async def application(scope, receive, send):
    ...

scope is a serializable dictionary providing the context for the current connection. receive is an awaitable which can be used to receive incoming messages. send is an awaitable that can be used to send replies.

It’s a pretty low-level set of primitives (and less obvious than a simple request/response)—and that’s because ASGI is about more than just the standard HTTP request/response cycle. This contract works for HTTP, WebSockets and potentially any other protocol that needs to asynchronously send and receive data.
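
Here’s about the smallest useful HTTP turtle: a complete ASGI application that says hello.

async def application(scope, receive, send):
    # This example only handles HTTP connections; a real application would
    # also deal with "websocket" and "lifespan" scopes.
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [[b"content-type", b"text/plain"]],
    })
    await send({
        "type": "http.response.body",
        "body": b"Hello, ASGI",
    })

Save it as hello.py and you can serve it with any ASGI server, for example uvicorn hello:application.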

It’s an extremely elegant piece of protocol design, informed by Andrew’s experience with Django Channels, SOA protocols (we are co-workers at Eventbrite where we’ve both been heavily involved in Eventbrite’s SOA mechanism) and Andrew’s extensive conversations with other maintainers in the Python web community.

The ASGI protocol really is turtles all the way down—it’s a simple, well defined contract which can be composed together to implement all kinds of interesting web architectural patterns.

My asgi-cors library was my first attempt at building an ASGI turtle. The implementation is a simple Python decorator which, when applied to another ASGI callable, adds HTTP CORS headers based on the parameters you pass to the decorator. The library has zero installation dependencies (it has test dependencies on pytest and friends) and can be used on any HTTP ASGI project.
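
The wrapping pattern itself is tiny. This isn’t asgi-cors’s actual implementation, but it shows the general shape of an ASGI turtle that decorates another application’s responses, here by appending a single response header:

def add_header(header_name, header_value):
    "Decorator: wrap an ASGI app so every HTTP response gains an extra header."
    header_pair = [header_name.encode("utf-8"), header_value.encode("utf-8")]

    def wrapper(app):
        async def wrapped_app(scope, receive, send):
            async def wrapped_send(event):
                if event["type"] == "http.response.start":
                    event = dict(
                        event,
                        headers=list(event.get("headers", [])) + [header_pair],
                    )
                await send(event)

            await app(scope, receive, wrapped_send)

        return wrapped_app

    return wrapper

Apply it with @add_header("access-control-allow-origin", "*") above any ASGI callable and every HTTP response it sends will carry that header.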

Building asgi-cors completely sold me on ASGI as the turtle pattern I had been desiring for over a decade!

Datasette plugins and ASGI

Which brings me to Datasette.

One of the most promising components of Datasette is its plugin mechanism. Based on pluggy (extracted from pytest), Datasette Plugins allow new features to be added to Datasette without needing to change the underlying code. This means new features can be built, packaged and shipped entirely independently of the core project. A list of currently available plugins can be found here.

WordPress is a very solid blogging engine. Add in the plugin ecosystem around it and it can be used to build literally any CMS you can possibly imagine.

My dream for Datasette is to apply the same model: I want a strong core for publishing and exploring data that’s enhanced by plugins to solve a huge array of data analysis, visualization and API-backed problems.

Datasette has a range of plugin hooks already, but I’ve so far held back on implementing the most useful class of hooks: hooks that allow developers to add entirely new URL routes exposing completely custom functionality.

The reason I held back is that I wanted to be confident that the contract I was offering was something I would continue to support moving forward. A plugin system isn’t much good if the core implementation keeps on changing in backwards-incompatible ways.

ASGI is the exact contract I’ve been waiting for. It’s not quite ready yet, but you can follow #520: prepare_asgi plugin hook (thoughts and suggestions welcome!) to be the first to hear about this hook when it lands. I’m planning to use it to make my asgi-cors library available as a plugin, after which I’m excited to start exploring the idea of bringing authentication plugins to Datasette (and to the wider ASGI world in general).

I’m hoping that many Datasette ASGI plugins will exist in a form that allows them to be used by other ASGI applications as well.

I also plan to use ASGI to make components of Datasette itself available to other ASGI applications. If you just want a single instance of Datasette’s table view to be embedded somewhere in your URL configuration you should be able to do that by routing traffic directly to the ASGI-compatible view class.

I’m really excited about exploring the intersection of ASGI turtles-all-the-way-down and pluggy’s powerful mechanism for gluing components together. Both WSGI and Django’s reusable apps have attempted to create a reusable ecosystem in the past, to limited levels of success. Let’s see if ASGI can finally make the turtle dream come true.

Further reading

Hello ASGI by Tom Christie is the best introduction to ASGI I’ve seen. Tom is the author of the Uvicorn ASGI server (used by Datasette as-of this evening) and Starlette, a delightfully well-designed ASGI web framework. I’ve learned an enormous amount about ASGI by reading Tom’s code. Tom also gave a talk about ASGI at DjangoCon Europe a few months ago.

If you haven’t read A Django Async Roadmap, which Andrew Godwin published last year, you should absolutely catch up. More than just talking about ASGI, Andrew sketches out a detailed and actionable plan for bringing asyncio to Django core. Andrew landed the first Django core ASGI code based on that plan just a few days ago.

If you’re interested in the details of Datasette’s ASGI implementation, I posted detailed commentary on issue #272 over the past thirteen months as I researched and finalized my approach. I added further commentary to the associated pull request, which gathers together the 34 commits it took to ship the feature (squashed into a single commit to master).

Datasette 0.28—and why master should always be releasable four months ago

It’s been quite a while since the last substantial release of Datasette. Datasette 0.27 came out all the way back in January.

This isn’t because development has slowed down. In fact, the project has had 131 commits since then, covering a bewildering array of new functionality and with some significant contributions from developers who aren’t me—Russ Garrett and Romain Primet deserve special recognition here.

The problem has been one of discipline. I’m a big fan of the idea of keeping master shippable at all times in my professional work, but I hadn’t quite adopted this policy for my open-source side projects. A couple of months ago I found myself in a situation where I had two major refactorings (of faceting and of Datasette’s treatment of immutable files) going on in master at the same time, and untangling them turned out to take way longer than I had expected.

So I’ve updated Datasette’s contribution guidelines to specify that master should always be releasable, almost entirely as a reminder to myself.

All of that said, I’m finally back out of the weeds and I’m excited to announce today’s release of Datasette 0.28. It features a salmagundi of new features! I’m replicating the release notes below.

Supporting databases that change

From the beginning of the project, Datasette has been designed with read-only databases in mind. If a database is guaranteed not to change it opens up all kinds of interesting opportunities—from taking advantage of SQLite immutable mode and HTTP caching to bundling static copies of the database directly in a Docker container. The interesting ideas in Datasette explores this idea in detail.

As my goals for the project have developed, I realized that read-only databases are no longer the right default. SQLite actually supports concurrent access very well provided only one thread attempts to write to a database at a time, and I keep encountering sensible use-cases for running Datasette on top of a database that is processing inserts and updates.

So, as-of version 0.28 Datasette no longer assumes that a database file will not change. It is now safe to point Datasette at a SQLite database which is being updated by another process.

Making this change was a lot of work—see tracking tickets #418, #419 and #420. It required new thinking around how Datasette should calculate table counts (an expensive operation against a large, changing database) and also meant reconsidering the “content hash” URLs Datasette has used in the past to optimize the performance of HTTP caches.

Datasette can still run against immutable files and gains numerous performance benefits from doing so, but this is no longer the default behaviour. Take a look at the new Performance and caching documentation section for details on how to make the most of Datasette against data that you know will be staying read-only and immutable.

Faceting improvements, and faceting plugins

Datasette Facets provide an intuitive way to quickly summarize and interact with data. Previously the only supported faceting technique was column faceting, but 0.28 introduces two powerful new capabilities: facet-by-JSON-array and the ability to define further facet types using plugins.

Facet by array (#359) is only available if your SQLite installation provides the json1 extension. Datasette will automatically detect columns that contain JSON arrays of values and offer a faceting interface against those columns—useful for modelling things like tags without needing to break them out into a new table. See Facet by JSON array for more.

The new register_facet_classes() plugin hook (#445) can be used to register additional custom facet classes. Each facet class should provide two methods: suggest() which suggests facet selections that might be appropriate for a provided SQL query, and facet_results() which executes a facet operation and returns results. Datasette’s own faceting implementations have been refactored to use the same API as these plugins.
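
Based on the description above, a minimal custom facet class looks roughly like this. Treat it as a loose sketch: the import path, constructor and exact return values are assumptions, so check the plugin documentation for the real contract.

from datasette import hookimpl
from datasette.facets import Facet  # assumed import path for the base class


class ExampleFacet(Facet):
    # A do-nothing facet that illustrates the two required methods
    type = "example"

    async def suggest(self):
        # Return a list of facet suggestions appropriate for the current SQL query
        return []

    async def facet_results(self):
        # Execute the facet operation and return (results, timed_out)
        return [], []


@hookimpl
def register_facet_classes():
    return [ExampleFacet]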

datasette publish cloudrun

Google Cloud Run is a brand new serverless hosting platform from Google, which allows you to build a Docker container which will run only when HTTP traffic is received and will shut down (and hence cost you nothing) the rest of the time. It’s similar to Zeit’s Now v1 Docker hosting platform which sadly is no longer accepting signups from new users.

The new datasette publish cloudrun command was contributed by Romain Primet (#434) and publishes selected databases to a new Datasette instance running on Google Cloud Run.

See Publishing to Google Cloud Run for full documentation.

register_output_renderer plugins

Russ Garrett implemented a new Datasette plugin hook called register_output_renderer (#441) which allows plugins to create additional output renderers in addition to Datasette’s default .json and .csv.

Russ’s in-development datasette-geo plugin includes an example of this hook being used to output .geojson automatically converted from SpatiaLite.

Medium changes

  • Datasette now conforms to the Black coding style (#449)—and has a unit test to enforce this in the future
  • New Special table arguments:
    • ?columnname__in=value1,value2,value3 filter for executing SQL IN queries against a table, see Table arguments (#433)
    • ?columnname__date=yyyy-mm-dd filter which returns rows where the specified datetime column falls on the specified date (583b22a)
    • ?tags__arraycontains=tag filter which acts against a JSON array contained in a column (78e45ea)
    • ?_where=sql-fragment filter for the table view (#429)
    • ?_fts_table=mytable and ?_fts_pk=mycolumn querystring options can be used to specify which FTS table to use for a search query—see Configuring full-text search for a table or view (#428)
  • You can now pass the same table filter multiple times—for example, ?content__not=world&content__not=hello will return all rows where the content column is neither hello nor world (#288)
  • You can now specify about and about_url metadata (in addition to source and license) linking to further information about a project—see Source, license and about
  • New ?_trace=1 parameter now adds debug information showing every SQL query that was executed while constructing the page (#435)
  • datasette inspect now just calculates table counts, and does not introspect other database metadata (#462)
  • Removed /-/inspect page entirely—this will be replaced by something similar in the future, see #465
  • Datasette can now run against an in-memory SQLite database. You can do this by starting it without passing any files or by using the new --memory option to datasette serve. This can be useful for experimenting with SQLite queries that do not access any data, such as SELECT 1+1 or SELECT sqlite_version().

Small changes

  • We now show the size of the database file next to the download link (#172)
  • New /-/databases introspection page shows currently connected databases (#470)
  • Binary data is no longer displayed on the table and row pages (#442—thanks, Russ Garrett)
  • New show/hide SQL links on custom query pages (#415)
  • The extra_body_script plugin hook now accepts an optional view_name argument (#443—thanks, Russ Garrett)
  • Bumped Jinja2 dependency to 2.10.1 (#426)
  • All table filters are now documented, and documentation is enforced via unit tests (2c19a27)
  • New project guideline: master should stay shippable at all times! (31f36e1)
  • Fixed a bug where sqlite_timelimit() occasionally failed to clean up after itself (bac4e01)
  • We no longer load additional plugins when executing pytest (#438)
  • Homepage now links to database views if there are fewer than five tables in a database (#373)
  • The --cors option is now respected by error pages (#453)
  • datasette publish heroku now uses the --include-vcs-ignore option, which means it works under Travis CI (#407)
  • datasette publish heroku now publishes using Python 3.6.8 (666c374)
  • Renamed datasette publish now to datasette publish nowv1 (#472)
  • datasette publish nowv1 now accepts multiple --alias parameters (09ef305)
  • Removed the datasette skeleton command (#476)
  • The documentation on how to build the documentation now recommends sphinx-autobuild

Elsewhere

19th September 2019

  • Anyone with solid knowledge of both SQL and genetic engineering want to write me an UPDATE query to turn me into a dinosaur?

    @simonw #

  • genome-to-sqlite. I just found out 23andMe let you export your genome as a zipped TSV file, so I wrote a little Python command-line tool to import it into a SQLite database. #

10th September 2019

  • Evolving “nofollow” – new ways to identify the nature of links (via) Slightly confusing announcement from Google: they’re introducing rel=ugc and rel=sponsored in addition to rel=nofollow, and will be treating all three values as “hints” for their indexing system. They’re very unclear as to what the concrete effects of these hints will be, presumably because they will become part of the secret sauce of their ranking algorithm. #

3rd September 2019

  • sqlite-utils 1.11. Amjith Ramanujam contributed an excellent new feature to sqlite-utils, which I’ve now released as part of version 1.11. Previously you could enable SQLite full-text-search on a table using the .enable_fts() method (or the “sqlite-utils enable-fts” CLI command) but it wouldn’t reflect future changes to the table—you had to use populate_fts() any time you inserted new records. Thanks to Amjith you can now pass create_triggers=True (or --create-triggers) to cause sqlite-utils to automatically add triggers that keep the FTS index up-to-date any time a row is inserted, updated or deleted from the table. #

15th August 2019

  • Subsume JSON a.k.a. JSON ⊂ ECMAScript (via) TIL that JSON isn’t a subset of ECMAScript after all! “In ES2018, ECMAScript string literals couldn’t contain unescaped U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR characters, because they are considered to be line terminators even in that context.” #

9th August 2019

  • Y’all decided you could send 6x as much script because the high-end could take it...but the next billion users can’t. There might have been budget for 2x, but not 6x. Not by a long shot.

    Alex Russell #

7th August 2019

  • OPP (Other People’s Problems) (via) Camille Fournier provides a comprehensive guide to picking your battles: in a large organization how can you navigate the enormous array of problems you can see that you’d like to fix, especially when so many of those problems aren’t directly in your area of control? #

5th August 2019

  • This is when I pull out “we don’t do that here.” It is a conversation ender. If you are the newcomer and someone who has been around a long time says “we don’t do that here”, it is hard to argue. This sentence doesn’t push my morality on anyone. If they want to do whatever it is elsewhere, I’m not telling them not to. I’m just cluing them into the local culture and values.

    Aja Hammerly #

  • Optimizing for the mobile web: Moving from Angular to Preact. Grubhub reduced their mobile web load times from 9-11s to 3-4s by replacing Angular with Preact (and replacing other libraries such as lodash with native JavaScript code). The conversion took 6 months and involved running Angular and Preact simultaneously during the transition—not a huge additional overhead as Preact itself is only 4KB. They used TypeScript throughout and credit it with providing a great deal of confidence and productivity to the overall refactoring. #

3rd August 2019

  • Logs vs. metrics: a false dichotomy (via) Nick Stenning discusses the differences between logs and metrics: most notably that metrics can be derived from logs but logs cannot be reconstituted starting with time-series metrics. #
  • Documentation needs to include and be structured around its four different functions: tutorials, how-to guides, explanation and technical reference. Each of them requires a distinct mode of writing. People working with software need these four different kinds of documentation at different times, in different circumstances—so software usually needs them all.

    Daniele Procida #

1st August 2019

  • PyPI now supports uploading via API token (via) All of my open source Python libraries are set up to automatically deploy new tagged releases as PyPI packages using Circle CI or Travis, but I’ve always been a bit uncomfortable about sharing my PyPI password with those CI platforms to get this to work. PyPI just added scoped authentication tokens, which means I can issue a token that’s only allowed to upload a specific project and see an audit log of when that token was last used. #