Simon Willison’s Weblog

Recent entries

Weeknotes: ONA19, twitter-to-sqlite, datasette-rure one day ago

I’ve decided to start writing weeknotes for the duration of my JSK fellowship. Here goes!

I started the fellowship last week, and this week I’ve been attending the Online News Association conference in New Orleans along with the other fellows.

Natalie and I have had a lot of fun exploring New Orleans, including several niche museums (documented in this Twitter thread, which I expect to continue adding to for years to come).

ONA is the largest digital news conference on the calendar, and as such it’s a great opportunity to meet all kinds of different journalists and get a start on figuring out how best to approach the fellowship.

twitter-to-sqlite

I’ve been working on this tool for a couple of weeks now, but this week it really started to prove its worth. twitter-to-sqlite is a utility I’m building to fetch data from Twitter and write it into a SQLite database. It has a comprehensive README, as do all of my recent projects: I’ve been doing this for long enough that I know that time spent on documentation will pay me back enormously in just a few months.

My initial goal for this tool was to use it for personal analytics, partly inspired by my Twitter followers project from last year. Then ONA sent out a spreadsheet with the Twitter names of most of the conference attendees and I realised there was an opportunity to use it for something more interesting.

I won’t share the attendee list here because it’s intended to be private to conference attendees, but the short version of what I did with it is this:

First, load the attendee list into a database using csvs-to-sqlite (after first converting the XLS file to CSV using Numbers):

$ csvs-to-sqlite attendees.csv ona.db

Next, fetch the full Twitter profiles for every one of those attendees. I added a --sql option to twitter-to-sqlite for this (issue #8) and ran the following:

$ twitter-to-sqlite users-lookup ona.db --sql="select Twitter from attendees"

Now I can run datasette ona.db and execute full-text searches against the complete Twitter profiles of those attendees—great for figuring out e.g. who is attending the conference from the Washington Post.
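
The post doesn’t show how the full-text index was set up, but one way to do it is with the sqlite-utils enable-fts command (the column names here are my assumption about what twitter-to-sqlite stores in its users table):

$ sqlite-utils enable-fts ona.db users name description location
$ datasette ona.db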

I took this a step further: can I identify people I follow on Twitter who are attending the conference? I can pull the list of IDs of people I follow like so:

$ twitter-to-sqlite friends-ids ona.db simonw

Now I can see which of my Twitter friends are attending the conference by loading it into Datasette and using this query:

select * from users
where screen_name in (select Twitter from attendees)
and id in (select followed_id from following where follower_id = 12497)

This is not a particularly efficient SQL query… and it doesn’t matter! The great thing about working against tiny SQLite databases that contain just a few thousand rows of data is that you can nest SQL queries like this with little concern for their performance—this one runs on my laptop in 18.213ms.

datasette-rure

SQLite has a REGEXP operator, which is documented thus:

The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If an application-defined SQL function named “regexp” is added at run-time, then the “X REGEXP Y” operator will be implemented as a call to “regexp(Y,X)”.

Python’s sqlite3 module lets you register custom functions, so this should be an easy fix. I’ve been holding off on implementing this in Datasette for a while though because Datasette allows user-provided queries, and Python’s regular expressions have a nasty quality: you can easily construct a regular expression and input that will hang the Python interpreter.
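
As a minimal sketch of that registration step, using Python’s built-in re module (which is exactly the approach that is vulnerable to the pathological-input problem described above):

import re
import sqlite3

def regexp(pattern, value):
    # SQLite rewrites "X REGEXP Y" as regexp(Y, X), so the pattern
    # arrives first and the value being tested second.
    return 1 if value is not None and re.search(pattern, value) else 0

conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)
print(conn.execute("select 'hello world' REGEXP 'w.rld'").fetchone())  # (1,)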

So I needed a safe regular expression library that I could trust not to hang my interpreter on the wrong input.

Google have such a library—re2—but the various Python bindings for it aren’t cleanly installable via pip install.

I asked about this on Twitter and Alex Willmer pointed me to rure-python—a Python wrapper for the Rust regular expression library, which “guarantees linear time searching using finite automata”—exactly what I was looking for!

So I built a quick Datasette plugin: datasette-rure, which provides a regexp() function to enable the REGEXP operator, powered by the Rust regular expression engine.

I also added regexp_match(pattern, text, index) and regexp_matches(pattern, text) functions. Here are their examples from the README:

select regexp_match('.*( and .*)', title) as n from articles where n is not null
-- Returns the ' and X' component of any matching titles, e.g.
--     and Recognition
--     and Transitions Their Place
-- etc
select regexp_matches(
    'hello (?P<name>\w+) the (?P<species>\w+)',
    'hello bob the dog, hello maggie the cat, hello tarquin the otter'
)
-- Returns a JSON array:
-- [{"name": "bob", "species": "dog"},
--  {"name": "maggie", "species": "cat"},
--  {"name": "tarquin", "species": "otter"}]

Returning JSON from a SQLite custom function is a fun trick: it means you can then loop over the returned rows or even join them against other tables within SQL using the bundled json_each() function.
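
For example, a query along these lines (assuming the plugin’s regexp_matches() function is installed) flattens that JSON array back into rows using json_each():

select json_extract(j.value, '$.name') as name,
       json_extract(j.value, '$.species') as species
from json_each(regexp_matches(
    'hello (?P<name>\w+) the (?P<species>\w+)',
    'hello bob the dog, hello maggie the cat, hello tarquin the otter'
)) as j
-- name    | species
-- bob     | dog
-- maggie  | cat
-- tarquin | otter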

I deployed an interactive demo of Datasette running the plugin. I’ve been trying to do this for all of the plugins I release—datasette-jq is another recent example (the interactive demos are linked from the README).

My JSK Fellowship: Building an open source ecosystem of tools for data journalism four days ago

I started a new chapter of my career last week: I began a year long fellowship with the John S. Knight Journalism Fellowships program at Stanford.

I’m going to spend the year thinking about and working on tools for data journalism. More details below, but the short version is that I want to help make the kind of data reporting we’re seeing from well funded publications like the New York Times, the Washington Post and the LA Times more accessible to smaller publications that don’t have the budget for full-time software engineers.

I’ve worked with newspapers a few times in the past: I helped create what would later become Django at the Lawrence Journal-World fifteen years ago, and I spent two years working on data journalism projects at the Guardian in London before being sucked into the tech startup world. My Datasette project was inspired by the challenges I saw at the Guardian, and I’m hoping to evolve it (and its accompanying ecosystem) in as useful a way as possible.

This fellowship is a chance for me to get fully embedded back in that world. I could not be more excited about it!

I’m at the Online News Association conference in New Orleans this week: if you’d like to meet up for a chat please drop me a line on Twitter or via email (swillison is my Gmail).

Here’s the part of my fellowship application (written back in January) which describes what I’m hoping to do. The program is extremely flexible and there is plenty of opportunity for me to change my focus if something more useful emerges from my research, but this provides a good indication of where my current thinking lies.

What is your fellowship proposal?

Think of this as your title or headline for your proposal. (25 words or less)

How might we grow an open source ecosystem of tools to help data journalists collect, analyze and publish the data underlying their stories?

Now, tell us more about your proposal. Why is it important to the challenges facing journalism and journalists today? How might it create meaningful change or advance the work of journalists? (600 words or less)

Data journalism is a crucial discipline for discovering and explaining true stories about the modern world—but effective data-driven reporting still requires tools and skills that are not widely available outside of large, well funded news organizations.

Making these techniques readily available to smaller, local publications can help them punch above their weight, producing more impactful stories that overcome the challenges posed by their constrained resources.

Tools that work for smaller publications can work for larger publications as well. Reducing the time and money needed to produce great data journalism raises all boats and enables journalists to re-invest their improved productivity in ever more ambitious reporting projects.

Academic journals are moving towards publishing both the code and data that underlie their papers, encouraging reproducibility and better sharing of the underlying techniques. I want to encourage the same culture for data journalism, in the hope that “showing your working” can help fight misinformation and improve readers’ trust in the stories that are derived from the data.

I would like to use a JSK fellowship to build an ecosystem of data journalism tools that make data-driven reporting as productive and reproducible as possible, while opening it up to a much wider group of journalists.

At the core of my proposal is my Datasette open source project. I’ve been running this as a side-project for a year with some success: newspapers that have used it include the Baltimore Sun, who used it for their public salary records project: https://salaries.news.baltimoresun.com/. By dedicating myself to the project full-time I anticipate being able to greatly accelerate the pace of development and my ability to spend time teaching news organizations how to take advantage of it.

More importantly, the JSK fellowship would give me high quality access to journalism students, professors and professionals. A large portion of my fellowship would be spent talking to a wide pool of potential users and learning exactly what people need from the project.

I do not intend to be the only developer behind Datasette: I plan to deliberately grow the pool of contributors, both to the Datasette core project and to the ecosystem of tools and plugins that enhance its capabilities. The great thing about a plugin ecosystem is that it removes the need for a gatekeeper: anyone can build and release a plugin independent of Datasette core, which both lowers the barriers to entry and dramatically increases the rate at which new functionality becomes available to all Datasette users.

My goal for the fellowship is to encourage the growth of open source tools that can be used by data journalists to increase the impact of their work. My experience at the Guardian led me to Datasette as a promising avenue for this, but in talking to practitioners and students I hope to find other opportunities for tools that can help. My experience as a startup founder, R&D software engineer and open source contributor puts me in an excellent position to help create these tools in partnership with the wider open source community.

Single sign-on against GitHub using ASGI middleware two months ago

I released Datasette 0.29 last weekend, the first version of Datasette to be built on top of ASGI (discussed previously in Porting Datasette to ASGI, and Turtles all the way down).

This also marked the introduction of the new asgi_wrapper plugin hook, which allows plugins to wrap the entire Datasette application in their own piece of ASGI middleware.

To celebrate this new capability, I also released two new plugins: datasette-cors, which provides fine-grained control over CORS headers (using my asgi-cors library from a few months ago) and datasette-auth-github, the first of hopefully many authentication plugins for Datasette.

datasette-auth-github

The new plugin is best illustrated with a demo.

Visit https://datasette-auth-demo.now.sh/ and you will be redirected to GitHub and asked to approve access to your account (just your e-mail address, not repository access).

Agree, and you’ll be redirected back to the demo with a new element in the Datasette header: your GitHub username, plus a “log out” link in the navigation bar at the top of the screen.

Controlling who can access

The default behaviour of the plugin is to allow in anyone with a GitHub account. Since the primary use-case for the plugin (at least for the moment) is restricting access to view data to a trusted subset of people, the plugin lets you configure who is allowed to view your data in three different ways: a list of specific GitHub users, members of one or more GitHub organizations, or members of specific teams within those organizations.

Datasette inherits quite a sophisticated user management system from GitHub, with very little effort required from the plugin: the user_is_allowed() method implements all three of the above options against the GitHub API in just 40 lines of code.

These options can be set using the "plugins" section of the Datasette metadata.json configuration file. Here’s an example:

{
  "plugins": {
    "datasette-auth-github": {
      "client_id": {"$env": "GITHUB_CLIENT_ID"},
      "client_secret": {"$env": "GITHUB_CLIENT_SECRET"},
      "allow_users": ["simonw"]
    }
  }
}

This also illustrates a new Datasette feature: the ability to set secret plugin configuration values. {"$env": "GITHUB_CLIENT_SECRET"} means "read this configuration option from the environment variable GITHUB_CLIENT_SECRET".

Automatic log in

Like many OAuth providers, GitHub only asks the user for their approval the first time they log into a given app. Any subsequent times they are redirected to GitHub it will skip the permission screen and redirect them right back again with a token.

This means we can implement automatic log in: any time a visitor arrives who does not have a cookie we can bounce them directly to GitHub, and if they have already consented they will be logged in instantly.

This is a great user-experience—provided the user is logged into GitHub they will be treated as if they are logged into your application—but it does come with a downside: what if the user clicks the “log out” link?

For the moment I’ve implemented this using another cookie: if the user clicks “log out”, I set an asgi_auth_logout cookie marking the user as having explicitly logged out. While they have that cookie they won’t be logged in automatically, instead having to click an explicit link. See issue 41 for thoughts on how this could be further improved.

One pleasant side-effect of all of this is that datasette-auth-github doesn’t need to persist the user’s GitHub access_token anywhere—it uses it during the initial authentication to check for any required organizations or teams, but then it deliberately forgets the token entirely.

OAuth access tokens are like passwords, so the most responsible thing for a piece of software to do with them is avoid storing them anywhere at all unless they are explicitly needed.

What happens when a user leaves an organization?

When building against a single sign-on provider, consideration needs to be given to offboarding: when a user is removed from a team or organization they should also lose access to their SSO applications.

This is difficult when an application sets its own authentication cookies, like datasette-auth-github does.

One solution would be to make an API call on every request to the application, to verify that the user should still have access. This would slow everything down and is likely to blow through rate limits as well, so we need a more efficient solution.

I ended up solving this with two mechanisms. Since we have automatic log in, our cookies don’t actually need to last very long—so by default the signed cookies set by the plugin last for just one hour. When a user’s cookie has expired they will be redirected back through GitHub—they probably won’t even notice the redirect, and their permissions will be re-verified as part of that flow.

But what if you need to invalidate those cookies instantly?

To cover that case, I’ve incorporated an optional cookie_version configuration option into the signatures on the cookies. If you need to invalidate every signed cookie that is out there—to lock out a compromised GitHub account owner for example—you can do so by changing the cookie_version configuration option and restarting (or re-deploying) Datasette.
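
Building on the metadata.json example earlier, bumping the version might look something like this (the value itself is arbitrary; changing it is what invalidates the existing cookies):

{
  "plugins": {
    "datasette-auth-github": {
      "client_id": {"$env": "GITHUB_CLIENT_ID"},
      "client_secret": {"$env": "GITHUB_CLIENT_SECRET"},
      "allow_users": ["simonw"],
      "cookie_version": 2
    }
  }
}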

These options are all described in detail in the project README.

Integration with datasette publish

The datasette publish command-line tool lets users instantly publish a SQLite database to the internet, using Heroku, Cloud Run or Zeit Now v1. I’ve added support for setting secret plugin configuration directly to that tool, which means you can publish an authentication-protected SQLite database to the internet with a shell one-liner, using --install=datasette-auth-github to install the plugin and --plugin-secret to configure it:

$ datasette publish cloudrun fixtures.db \
   --install=datasette-auth-github \
   --name datasette-auth-protected \
   --service datasette-auth-protected \
   --plugin-secret datasette-auth-github allow_users simonw \
   --plugin-secret datasette-auth-github client_id 85f6224cb2a44bbad3fa \
   --plugin-secret datasette-auth-github client_secret ...

This creates a Cloud Run instance which only allows GitHub user simonw to log in. You could instead use --plugin-secret datasette-auth-github allow_orgs my-org to allow any users from a specific GitHub organization.

Note that Cloud Run does not yet give you full control over the URL that will be assigned to your deployment. In this case it gave me https://datasette-auth-protected-j7hipcg4aq-uc.a.run.app—which works fine, but I needed to update my GitHub OAuth application’s callback URL manually to https://datasette-auth-protected-j7hipcg4aq-uc.a.run.app/-/auth-callback after deploying the application in order to get the authentication flow to work correctly.

Add GitHub authentication to any ASGI application!

datasette-auth-github isn’t just for Datasette: I deliberately wrote the plugin as ASGI middleware first, with only a very thin layer of extra code to turn it into an installable plugin.

This means that if you are building any other kind of ASGI app (or using an ASGI-compatible framework such as Starlette or Sanic) you can wrap your application directly with the middleware and get the same authentication behaviour as when the plugin is added to Datasette!

Here’s what that looks like:

from datasette_auth_github import GitHubAuth
from starlette.applications import Starlette
from starlette.responses import HTMLResponse
import uvicorn

app = Starlette(debug=True)


@app.route("/")
async def homepage(request):
    return HTMLResponse("Hello, {}".format(
        repr(request.scope["auth"])
    ))


authenticated_app = GitHubAuth(
    app,
    client_id="986f5d837b45e32ee6dd",
    client_secret="...",
    require_auth=True,
    allow_users=["simonw"],
)

if __name__ == "__main__":
    uvicorn.run(authenticated_app, host="0.0.0.0", port=8000)

The middleware adds a scope["auth"] key describing the logged in user, which is then passed through to your application. More on this in the README.

Your security reviews needed!

Since datasette-auth-github adds authentication to Datasette, it is an extremely security-sensitive piece of code. So far I’m the only person who has looked at it: before I start widely recommending it to people I’d really like to get some more eyes on it to check for any potential security problems.

I’ve opened issue #44 encouraging security-minded developers to have a dig through the code and see if there’s anything that can be tightened up or any potential vulnerabilities that need to be addressed. Please get involved!

It’s a pretty small codebase, but here are some areas you might want to inspect:

  • At a high level: is the way I’m verifying the user through the GitHub API and then storing their identity in a signed cookie the right way to go?
  • The cookie signing secret is derived from the GitHub OAuth application’s client_id and client_secret (because that secret is already meant to be a secret), combined with the cookie_version option described above—implementation here. Since this is a derived secret I’m using pbkdf2_hmac with 100,000 iterations. This is by far the most cryptographically interesting part of the code, and could definitely do with some second opinions.
  • The code used to sign and verify cookies is based on Django’s (thoroughly reviewed) implementation, but could benefit from a sanity check.
  • I wanted this library to work on Glitch, which currently only provides Python 3.5.2. Python’s asyncio HTTP libraries such as http3 and aiohttp both require more modern Pythons, so I ended up rolling my own very simple async HTTP function which uses urllib.request inside a loop.run_in_executor thread pool. Is that approach sound? Rolling my own HTTP client in this way feels a little hairy.
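
As a rough sketch of that last pattern (not the plugin’s actual helper, just the general shape of wrapping urllib.request in a thread pool so it can be awaited):

import asyncio
import json
import urllib.request

async def http_get_json(url, headers=None):
    # Run the blocking urllib.request call in the default thread pool
    # so the event loop itself is never blocked while waiting on GitHub.
    def _fetch():
        request = urllib.request.Request(url, headers=headers or {})
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, _fetch)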

This has been a really fun project so far, and I’m very excited about the potential for authenticated Datasette moving forward—not to mention the possibilities unlocked by an ASGI middleware ecosystem with strong support for wrapping any application in an authentication layer.

Porting Datasette to ASGI, and Turtles all the way down two months ago

This evening I finally closed a Datasette issue that I opened more than 13 months ago: #272: Port Datasette to ASGI. A few notes on why this is such an important step for the project.

ASGI is the Asynchronous Server Gateway Interface standard. It’s been evolving steadily over the past few years under the guidance of Andrew Godwin. It’s intended as an asynchronous replacement for the venerable WSGI.

Turtles all the way down

Ten years ago at EuroDjangoCon 2009 in Prague I gave a talk entitled Django Heresies. After discussing some of the design decisions in Django that I didn’t think had aged well, I spent the last part of the talk talking about Turtles all the way down. I wrote that idea up here on my blog (see also these slides).

The key idea was that Django would be more interesting if the core Django contract—a function that takes a request and returns a response—was extended to more places in the framework. The top level site, the reusable applications, middleware and URL routing could all share that same contract. Everything could be composed from the same raw building blocks.

I’m excited about ASGI because it absolutely fits the turtles all the way down model.

The ASGI contract is an asynchronous function that takes three arguments:

async def application(scope, receive, send):
    ...

scope is a serializable dictionary providing the context for the current connection. receive is an awaitable which can be used to receive incoming messages. send is an awaitable that can be used to send replies.

It’s a pretty low-level set of primitives (and less obvious than a simple request/response)—and that’s because ASGI is about more than just the standard HTTP request/response cycle. This contract works for HTTP, WebSockets and potentially any other protocol that needs to asynchronously send and receive data.
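
Here’s a minimal sketch of what an HTTP “hello world” looks like written directly against that contract (runnable with any ASGI server, e.g. uvicorn module:application):

async def application(scope, receive, send):
    # This sketch only handles HTTP connections; a real application would
    # also inspect scope["type"] for websocket and lifespan events.
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [[b"content-type", b"text/plain"]],
    })
    await send({
        "type": "http.response.body",
        "body": b"Hello, ASGI",
    })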

It’s an extremely elegant piece of protocol design, informed by Andrew’s experience with Django Channels, SOA protocols (we are co-workers at Eventbrite where we’ve both been heavily involved in Eventbrite’s SOA mechanism) and Andrew’s extensive conversations with other maintainers in the Python web community.

The ASGI protocol really is turtles all the way down—it’s a simple, well defined contract which can be composed together to implement all kinds of interesting web architectural patterns.

My asgi-cors library was my first attempt at building an ASGI turtle. The implementation is a simple Python decorator which, when applied to another ASGI callable, adds HTTP CORS headers based on the parameters you pass to the decorator. The library has zero installation dependencies (it has test dependencies on pytest and friends) and can be used on any HTTP ASGI project.
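
Usage looks roughly like this; the allow_all parameter is from memory, so check the asgi-cors README for the definitive signature:

from asgi_cors import asgi_cors

async def app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200,
                "headers": [[b"content-type", b"text/plain"]]})
    await send({"type": "http.response.body", "body": b"hello"})

# Wrap the ASGI callable; allow_all=True adds Access-Control-Allow-Origin: *
app = asgi_cors(app, allow_all=True)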

Building asgi-cors completely sold me on ASGI as the turtle pattern I had been desiring for over a decade!

Datasette plugins and ASGI

Which brings me to Datasette.

One of the most promising components of Datasette is its plugin mechanism. Based on pluggy (extracted from pytest), Datasette Plugins allow new features to be added to Datasette without needing to change the underlying code. This means new features can be built, packaged and shipped entirely independently of the core project. A list of currently available plugins can be found here.
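
For a sense of how small a plugin can be, here’s a sketch of one that uses the existing prepare_connection hook to register a hypothetical custom SQL function:

from datasette import hookimpl

@hookimpl
def prepare_connection(conn):
    # conn is a standard sqlite3 connection; this exposes
    # select reverse_string('abc') to every query Datasette runs.
    conn.create_function(
        "reverse_string", 1, lambda s: s[::-1] if s is not None else None
    )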

WordPress is a very solid blogging engine. Add in the plugin ecosystem around it and it can be used to build literally any CMS you can possibly imagine.

My dream for Datasette is to apply the same model: I want a strong core for publishing and exploring data that’s enhanced by plugins to solve a huge array of data analysis, visualization and API-backed problems.

Datasette has a range of plugin hooks already, but I’ve so far held back on implementing the most useful class of hooks: hooks that allow developers to add entirely new URL routes exposing completely custom functionality.

The reason I held back is that I wanted to be confident that the contract I was offering was something I would continue to support moving forward. A plugin system isn’t much good if the core implementation keeps on changing in backwards-incompatible ways.

ASGI is the exact contract I’ve been waiting for. It’s not quite ready yet, but you can follow #520: prepare_asgi plugin hook (thoughts and suggestions welcome!) to be the first to hear about this hook when it lands. I’m planning to use it to make my asgi-cors library available as a plugin, after which I’m excited to start exploring the idea of bringing authentication plugins to Datasette (and to the wider ASGI world in general).

I’m hoping that many Datasette ASGI plugins will exist in a form that allows them to be used by other ASGI applications as well.

I also plan to use ASGI to make components of Datasette itself available to other ASGI applications. If you just want a single instance of Datasette’s table view to be embedded somewhere in your URL configuration you should be able to do that by routing traffic directly to the ASGI-compatible view class.

I’m really excited about exploring the intersection of ASGI turtles-all-the-way-down and pluggy’s powerful mechanism for gluing components together. Both WSGI and Django’s reusable apps have attempted to create a reusable ecosystem in the past, to limited levels of success. Let’s see if ASGI can finally make the turtle dream come true.

Further reading

Hello ASGI by Tom Christie is the best introduction to ASGI I’ve seen. Tom is the author of the Uvicorn ASGI server (used by Datasette as-of this evening) and Starlette, a delightfully well-designed ASGI web framework. I’ve learned an enormous amount about ASGI by reading Tom’s code. Tom also gave a talk about ASGI at DjangoCon Europe a few months ago.

If you haven’t read A Django Async Roadmap, which Andrew Godwin published last year, you should absolutely catch up. More than just talking about ASGI, Andrew sketches out a detailed and actionable plan for bringing asyncio to Django core. Andrew landed the first Django core ASGI code based on that plan just a few days ago.

If you’re interested in the details of Datasette’s ASGI implementation, I posted detailed commentary on issue #272 over the past thirteen months as I researched and finalized my approach. I added further commentary to the associated pull request, which gathers together the 34 commits it took to ship the feature (squashed into a single commit to master).

Datasette 0.28—and why master should always be releasable three months ago

It’s been quite a while since the last substantial release of Datasette. Datasette 0.27 came out all the way back in January.

This isn’t because development has slowed down. In fact, the project has had 131 commits since then, covering a bewildering array of new functionality and with some significant contributions from developers who aren’t me—Russ Garrett and Romain Primet deserve special recognition here.

The problem has been one of discipline. I’m a big fan of the idea of keeping master shippable at all times in my professional work, but I hadn’t quite adopted this policy for my open-source side projects. A couple of months ago I found myself in a situation where I had two major refactorings (of faceting and of Datasette’s treatment of immutable files) going on in master at the same time, and untangling them turned out to take way longer than I had expected.

So I’ve updated Datasette’s contribution guidelines to specify that master should always be releasable, almost entirely as a reminder to myself.

All of that said, I’m finally back out of the weeds and I’m excited to announce today’s release of Datasette 0.28. It features a salmagundi of new features! I’m replicating the release notes below.

Supporting databases that change

From the beginning of the project, Datasette has been designed with read-only databases in mind. If a database is guaranteed not to change it opens up all kinds of interesting opportunities—from taking advantage of SQLite immutable mode and HTTP caching to bundling static copies of the database directly in a Docker container. The interesting ideas in Datasette explores this idea in detail.

As my goals for the project have developed, I realized that read-only databases are no longer the right default. SQLite actually supports concurrent access very well provided only one thread attempts to write to a database at a time, and I keep encountering sensible use-cases for running Datasette on top of a database that is processing inserts and updates.

So, as-of version 0.28 Datasette no longer assumes that a database file will not change. It is now safe to point Datasette at a SQLite database which is being updated by another process.

Making this change was a lot of work—see tracking tickets #418, #419 and #420. It required new thinking around how Datasette should calculate table counts (an expensive operation against a large, changing database) and also meant reconsidering the “content hash” URLs Datasette has used in the past to optimize the performance of HTTP caches.

Datasette can still run against immutable files and gains numerous performance benefits from doing so, but this is no longer the default behaviour. Take a look at the new Performance and caching documentation section for details on how to make the most of Datasette against data that you know will be staying read-only and immutable.

Faceting improvements, and faceting plugins

Datasette Facets provide an intuitive way to quickly summarize and interact with data. Previously the only supported faceting technique was column faceting, but 0.28 introduces two powerful new capabilities: facet-by-JSON-array and the ability to define further facet types using plugins.

Facet by array (#359) is only available if your SQLite installation provides the json1 extension. Datasette will automatically detect columns that contain JSON arrays of values and offer a faceting interface against those columns—useful for modelling things like tags without needing to break them out into a new table. See Facet by JSON array for more.

The new register_facet_classes() plugin hook (#445) can be used to register additional custom facet classes. Each facet class should provide two methods: suggest() which suggests facet selections that might be appropriate for a provided SQL query, and facet_results() which executes a facet operation and returns results. Datasette’s own faceting implementations have been refactored to use the same API as these plugins.
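
A skeleton based purely on that description might look like this; the exact constructor signature and return shapes are defined by Datasette’s own facet classes, so treat it as a rough outline only:

from datasette import hookimpl

class ExampleFacet:
    # Hypothetical facet type; Datasette's base class supplies the constructor.
    async def suggest(self):
        # Return suggested facet selections appropriate for the current query
        return []

    async def facet_results(self):
        # Execute the facet operation and return its results
        return []

@hookimpl
def register_facet_classes():
    return [ExampleFacet]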

datasette publish cloudrun

Google Cloud Run is a brand new serverless hosting platform from Google, which allows you to build a Docker container which will run only when HTTP traffic is received and will shut down (and hence cost you nothing) the rest of the time. It’s similar to Zeit’s Now v1 Docker hosting platform, which sadly is no longer accepting signups from new users.

The new datasette publish cloudrun command was contributed by Romain Primet (#434) and publishes selected databases to a new Datasette instance running on Google Cloud Run.

See Publishing to Google Cloud Run for full documentation.
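
A hypothetical invocation looks like this (the database and service names are placeholders):

$ datasette publish cloudrun mydatabase.db --service my-datasette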

register_output_renderer plugins

Russ Garrett implemented a new Datasette plugin hook called register_output_renderer (#441) which allows plugins to create additional output renderers in addition to Datasette’s default .json and .csv.

Russ’s in-development datasette-geo plugin includes an example of this hook being used to output .geojson automatically converted from SpatiaLite.

Medium changes

  • Datasette now conforms to the Black coding style (#449)—and has a unit test to enforce this in the future
  • New Special table arguments:
    • ?columnname__in=value1,value2,value3 filter for executing SQL IN queries against a table, see Table arguments (#433)
    • ?columnname__date=yyyy-mm-dd filter which returns rows where the specified datetime column falls on the specified date (583b22a)
    • ?tags__arraycontains=tag filter which acts against a JSON array contained in a column (78e45ea)
    • ?_where=sql-fragment filter for the table view (#429)
    • ?_fts_table=mytable and ?_fts_pk=mycolumn querystring options can be used to specify which FTS table to use for a search query—see Configuring full-text search for a table or view (#428)
  • You can now pass the same table filter multiple times—for example, ?content__not=world&content__not=hello will return all rows where the content column is neither hello nor world (#288)
  • You can now specify about and about_url metadata (in addition to source and license) linking to further information about a project—see Source, license and about
  • New ?_trace=1 parameter now adds debug information showing every SQL query that was executed while constructing the page (#435)
  • datasette inspect now just calculates table counts, and does not introspect other database metadata (#462)
  • Removed /-/inspect page entirely—this will be replaced by something similar in the future, see #465
  • Datasette can now run against an in-memory SQLite database. You can do this by starting it without passing any files or by using the new --memory option to datasette serve. This can be useful for experimenting with SQLite queries that do not access any data, such as SELECT 1+1 or SELECT sqlite_version().

Small changes

  • We now show the size of the database file next to the download link (#172)
  • New /-/databases introspection page shows currently connected databases (#470)
  • Binary data is no longer displayed on the table and row pages (#442—thanks, Russ Garrett)
  • New show/hide SQL links on custom query pages (#415)
  • The extra_body_script plugin hook now accepts an optional view_name argument (#443—thanks, Russ Garrett)
  • Bumped Jinja2 dependency to 2.10.1 (#426)
  • All table filters are now documented, and documentation is enforced via unit tests (2c19a27)
  • New project guideline: master should stay shippable at all times! (31f36e1)
  • Fixed a bug where sqlite_timelimit() occasionally failed to clean up after itself (bac4e01)
  • We no longer load additional plugins when executing pytest (#438)
  • Homepage now links to database views if there are less than five tables in a database (#373)
  • The --cors option is now respected by error pages (#453)
  • datasette publish heroku now uses the --include-vcs-ignore option, which means it works under Travis CI (#407)
  • datasette publish heroku now publishes using Python 3.6.8 (666c374)
  • Renamed datasette publish now to datasette publish nowv1 (#472)
  • datasette publish nowv1 now accepts multiple --alias parameters (09ef305)
  • Removed the datasette skeleton command (#476)
  • The documentation on how to build the documentation now recommends sphinx-autobuild

Running Datasette on Glitch four months ago

The worst part of any software project is setting up a development environment. It’s by far the biggest barrier for anyone trying to get started learning to code. I’ve been a developer for more than twenty years and I still feel the pain any time I want to do something new.

Glitch is the most promising attempt I’ve ever seen at tackling this problem. It provides an entirely browser-based development environment that allows you to edit code, see the results instantly and view and remix the source code of other people’s projects.

It’s developed into a really fun, super-creative community and a fantastic resource for people looking to get started in the ever-evolving world of software development.

This evening I decided to get Datasette running on it. I’m really impressed with how well it works, and I think Glitch provides an excellent environment for experimenting with Datasette and related tools.

TLDR version: visit https://glitch.com/edit/#!/remix/datasette-csvs right now, drag-and-drop in a CSV file and watch it get served by Datasette on Glitch just a few seconds later.

Running Python on Glitch

The Glitch documentation is all about Node.js and JavaScript, but they actually have very solid Python support as well.

Every Glitch project runs in a container that includes Python 2.7.12 and Python 3.5.2, and you can use pip install --user or pip3 install --user to install Python dependencies.

The key to running non-JavaScript projects on Glitch is the glitch.json file format. You can use this to specify an install script, which sets up your container, and a start script, which starts your application running. Glitch will route HTTP traffic to port 3000, so your application server needs to listen on that port.

This means the most basic Glitch project to run Datasette looks like this:

https://datasette-basic.glitch.me/ (view source)

It contains a single glitch.json file:

{
    "install": "pip3 install --user datasette",
    "start": "datasette -p 3000"
}

This installs Datasette using pip3, then runs it on port 3000.

Since there’s no actual data to serve, this is a pretty boring demo. The most interesting page is this one, which shows the installed versions of the software:

https://datasette-basic.glitch.me/-/versions

Something more interesting: datasette-csvs

Let’s build one with some actual data.

My csvs-to-sqlite tool converts CSV files into a SQLite database. Since it’s also written in Python we can run it against CSV files as part of the Glitch install script.

Glitch provides a special directory called .data/ which can be used as a persistent file storage space that won’t be cleared in between restarts. The following "install" script installs datasette and csvs-to-sqlite, then runs the latter to create a SQLite database from all available CSV files:

{
    "install":  "pip3 install --user datasette csvs-to-sqlite && csvs-to-sqlite *.csv .data/csv-data.db",
    "start": "datasette .data/csv-data.db -p 3000"
}

Now we can simply drag and drop CSV files into the root of the Glitch project and they will be automatically converted into a SQLite database and served using Datasette!

We need a couple of extra details. Firstly, we want Datasette to automatically re-build the database file any time a new CSV file is added or an existing CSV file is changed. We can do that by adding a "watch" block to glitch.json:

"watch": {
    "install": {
        "include": [
            "\\.csv$"
        ]
    }
}

This ensures that our "install" script will run again any time a CSV file changes.

Let’s tone down the rate at which the scripts execute, by using throttle to set the polling interval to once a second:

"throttle": 1000

The above almost worked, but I started seeing errors if I changed the number of columns in a CSV file, since doing so clashed with the schema that had already been created in the database.

My solution was to add code to the install script that would delete the SQLite database file before attempting to recreate it—using the rm ... || true idiom to prevent Glitch from failing the installation if the file it attempted to remove did not already exist.

My final glitch.json file looks like this:

{
  "install": "pip3 install --user datasette csvs-to-sqlite && rm .data/csv-data.db || true && csvs-to-sqlite *.csv .data/csv-data.db",
  "start": "datasette .data/csv-data.db -p 3000 -m metadata.json",
  "watch": {
    "install": {
      "include": [
        "\\.csv$"
      ]
    },
    "restart": {
      "include": [
        "^metadata.json$"
      ]
    },
    "throttle": 1000
  }
}

I also set it up to use Datasette’s metadata.json format, and automatically restart the server any time the contents of that file changes.
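
A metadata.json for a project like this might look something like the following (the values are purely illustrative):

{
    "title": "datasette-csvs demo",
    "description": "CSV files converted with csvs-to-sqlite and served by Datasette",
    "source": "example.csv",
    "license": "CC BY 4.0"
}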

https://datasette-csvs.glitch.me/ (view source) shows the results, running against a simple example.csv file I created.

Remixing!

Here’s where things get really fun: Glitch projects support “remixing”, whereby anyone can click a link to create their own editable copy of a project.

Remixing works even if you aren’t logged in to Glitch! Anonymous projects expire after five days, so be sure to sign in with GitHub or Facebook if you want to keep yours around.

Try it out now: Visit https://glitch.com/edit/#!/remix/datasette-csvs to create your own remix of my project. Then drag a new CSV file directly into the editor and within a few seconds Datasette on Glitch will be up and running against a converted copy of your file!

Limitations

The Glitch help center article What technical restrictions are in place? describes their limits. Most importantly, projects are limited to 4,000 requests an hour—and there’s currently no way to increase that limit. They also limit projects to 200MB of disk space—easily enough to get started exploring some interesting CSV files with Datasette.

Next steps

I’m delighted at how easy this was to set up, and how much power the ability to remix these Datasette demos provides. I’m tempted to start creating remixable Glitch demos that illustrate other aspects of Datasette’s functionality such as plugins or full-text search.

Glitch is an exceptionally cool piece of software. I look forward to seeing their Python support continue to evolve.

Elsewhere

10th September 2019

  • Evolving “nofollow” – new ways to identify the nature of links (via) Slightly confusing announcement from Google: they’re introducing rel=ugc and rel=sponsored in addition to rel=nofollow, and will be treating all three values as “hints” for their indexing system. They’re very unclear as to what the concrete effects of these hints will be, presumably because they will become part of the secret sauce of their ranking algorithm. #

3rd September 2019

  • sqlite-utils 1.11. Amjith Ramanujam contributed an excellent new feature to sqlite-utils, which I’ve now released as part of version 1.11. Previously you could enable SQLite full-text-search on a table using the .enable_fts() method (or the “sqlite-utils enable-fts” CLI command) but it wouldn’t reflect future changes to the table—you had to use populate_fts() any time you inserted new records. Thanks to Amjith you can now pass create_triggers=True (or --create-triggers) to cause sqlite-utils to automatically add triggers that keep the FTS index up-to-date any time a row is inserted, updated or deleted from the table. #
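
    A quick sketch of the new option in use, against a hypothetical table:

    import sqlite_utils

    db = sqlite_utils.Database("data.db")
    db["documents"].insert({"title": "Hello", "body": "A document about FTS"})
    # create_triggers=True keeps the FTS index in sync with future writes
    db["documents"].enable_fts(["title", "body"], create_triggers=True)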

15th August 2019

  • Subsume JSON a.k.a. JSON ⊂ ECMAScript (via) TIL that JSON isn’t a subset of ECMAScript after all! “In ES2018, ECMAScript string literals couldn’t contain unescaped U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR characters, because they are considered to be line terminators even in that context.” #

9th August 2019

  • Y’all decided you could send 6x as much script because the high-end could take it...but the next billion users can’t. There might have been budget for 2x, but not 6x. Not by a long shot.

    Alex Russell #

7th August 2019

  • OPP (Other People’s Problems) (via) Camille Fournier provides a comprehensive guide to picking your battles: in a large organization how can you navigate the enormous array of problems you can see that you’d like to fix, especially when so many of those problems aren’t directly in your area of control? #

5th August 2019

  • This is when I pull out “we don’t do that here.” It is a conversation ender. If you are the newcomer and someone who has been around a long time says “we don’t do that here”, it is hard to argue. This sentence doesn’t push my morality on anyone. If they want to do whatever it is elsewhere, I’m not telling them not to. I’m just cluing them into the local culture and values.

    Aja Hammerly #

  • Optimizing for the mobile web: Moving from Angular to Preact. Grubhub reduced their mobile web load times from 9-11s to 3-4s by replacing Angular with Preact (and replacing other libraries such as lodash with native JavaScript code). The conversion took 6 months and involved running Angular and Preact simultaneously during the transition—not a huge additional overhead as Preact itself is only 4KB. They used TypeScript throughout and credit it with providing a great deal of confidence and productivity to the overall refactoring. #

3rd August 2019

  • Logs vs. metrics: a false dichotomy (via) Nick Stenning discusses the differences between logs and metrics: most notably that metrics can be derived from logs but logs cannot be reconstituted starting with time-series metrics. #
  • Documentation needs to include and be structured around its four different functions: tutorials, how-to guides, explanation and technical reference. Each of them requires a distinct mode of writing. People working with software need these four different kinds of documentation at different times, in different circumstances—so software usually needs them all.

    Daniele Procida #

1st August 2019

  • PyPI now supports uploading via API token (via) All of my open source Python libraries are set up to automatically deploy new tagged releases as PyPI packages using Circle CI or Travis, but I’ve always been a bit uncomfortable about sharing my PyPI password with those CI platforms to get this to work. PyPI just added scoped authentication tokens, which means I can issue a token that’s only allowed to upload a specific project and see an audit log of when that token was last used. #

24th July 2019

  • Repository driven development (via) I’m already a big fan of keeping documentation and code in the same repo so you can update them both from within the same code review, but this takes it even further: in repository driven development every aspect of the code and configuration needed to define, document, test and ship a service live in the service repository—all the way down to the configurations for reporting dashboards. This sounds like heaven. #
  • Using memory-profiler to debug excessive memory usage in healthkit-to-sqlite. This morning I figured out how to use the memory-profiler module (and mprof command line tool) to debug memory usage of Python processes. I added the details, including screenshots, to this GitHub issue. It helped me knock down RAM usage for my healthkit-to-sqlite from 2.5GB to just 80MB by making smarter usage of the ElementTree pull parser. #
  • Targeted diagnostic logging in production (via) Will Sargent defines diagnostic logging as “debug logging statements with an audience”, and proposes controlling this style of logging via a feature flag system to allow detailed logging to be turned on in production against a selected subset of users in order to help debug difficult problems. Lots of great background material on the topic of observability here too. #