<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: asgi</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/asgi.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2023-07-24T19:51:33+00:00</updated><author><name>Simon Willison</name></author><entry><title>asgi-replay</title><link href="https://simonwillison.net/2023/Jul/24/asgi-replay/#atom-tag" rel="alternate"/><published>2023-07-24T19:51:33+00:00</published><updated>2023-07-24T19:51:33+00:00</updated><id>https://simonwillison.net/2023/Jul/24/asgi-replay/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/asgi-replay"&gt;asgi-replay&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
As part of submitting LLM to Homebrew core I needed an automated test that demonstrated that the tool was working—but I couldn’t test against the live OpenAI API because I didn’t want to have to reveal my API token as part of the test. I solved this by creating a dummy HTTP endpoint that simulates a hit to the OpenAI API, then configuring the Homebrew test to hit that instead. As part of THAT I ended up building this tiny tool which uses my asgi-proxy-lib package to intercept and log the details of hits made to a service, then provides a mechanism to replay that traffic.
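&lt;p&gt;The actual asgi-replay and asgi-proxy-lib APIs may differ, but the record-then-replay idea can be sketched as plain ASGI callables (all names here are illustrative, not the package's real interface):&lt;/p&gt;

```python
# Hypothetical sketch of the record/replay idea, not the asgi-replay API:
# wrap an ASGI app, capture each response it sends, then serve the captured
# responses back later without touching the upstream service.
import asyncio

recorded = {}  # path -> (status, headers, body)

async def upstream(scope, receive, send):
    # Stand-in for the real proxied service (e.g. the OpenAI API)
    body = b'{"model": "gpt-3.5-turbo"}'
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"application/json")]})
    await send({"type": "http.response.body", "body": body})

def record(app):
    # Middleware that logs responses while passing them through unchanged
    async def middleware(scope, receive, send):
        captured = {"status": None, "headers": None, "body": b""}
        async def capture(message):
            if message["type"] == "http.response.start":
                captured["status"] = message["status"]
                captured["headers"] = message["headers"]
            elif message["type"] == "http.response.body":
                captured["body"] += message.get("body", b"")
            await send(message)
        await app(scope, receive, capture)
        recorded[scope["path"]] = (
            captured["status"], captured["headers"], captured["body"])
    return middleware

def replay():
    # A standalone app that serves previously recorded traffic
    async def app(scope, receive, send):
        status, headers, body = recorded[scope["path"]]
        await send({"type": "http.response.start",
                    "status": status, "headers": headers})
        await send({"type": "http.response.body", "body": body})
    return app
```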


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="asgi"/></entry><entry><title>Writing a chat application in Django 4.2 using async StreamingHttpResponse, Server-Sent Events and PostgreSQL LISTEN/NOTIFY</title><link href="https://simonwillison.net/2023/May/19/chat-application-in-django/#atom-tag" rel="alternate"/><published>2023-05-19T15:42:03+00:00</published><updated>2023-05-19T15:42:03+00:00</updated><id>https://simonwillison.net/2023/May/19/chat-application-in-django/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://valberg.dk/django-sse-postgresql-listen-notify.html"&gt;Writing a chat application in Django 4.2 using async StreamingHttpResponse, Server-Sent Events and PostgreSQL LISTEN/NOTIFY&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Excellent tutorial by Víðir Valberg Guðmundsson on implementing chat with server-sent events using the newly async-capable StreamingHttpResponse from Django 4.2.&lt;/p&gt;

&lt;p&gt;He uses PostgreSQL’s LISTEN/NOTIFY mechanism, which can be used asynchronously in psycopg3—at the cost of a separate connection per user of the chat.&lt;/p&gt;

&lt;p&gt;The article also covers how to use the Last-Event-ID header to implement reconnections in server-sent events, re-transmitting any events that were missed while the connection was dropped.&lt;/p&gt;
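&lt;p&gt;The catch-up logic behind Last-Event-ID is straightforward to sketch in plain Python (these helper names are hypothetical, not taken from the article's code):&lt;/p&gt;

```python
# Hypothetical sketch of SSE event formatting and Last-Event-ID catch-up.

def format_sse(event_id, data):
    # Each SSE event carries an id: line; the browser remembers the last id
    # it saw and sends it back in a Last-Event-ID header on reconnect.
    return f"id: {event_id}\ndata: {data}\n\n"

def missed_events(backlog, last_event_id):
    # backlog is a list of (event_id, data) pairs kept server-side;
    # on reconnect, replay everything after the client's last seen id.
    if last_event_id is None:
        return backlog
    ids = [eid for eid, _ in backlog]
    try:
        idx = ids.index(last_event_id)
    except ValueError:
        return backlog  # unknown id: replay everything
    return backlog[idx + 1:]
```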

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/qyler8/writing_chat_application_django_4_2_using"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;&lt;/p&gt;



</summary><category term="async"/><category term="django"/><category term="postgresql"/><category term="asgi"/></entry><entry><title>datasette-granian</title><link href="https://simonwillison.net/2023/Jan/20/datasette-granian/#atom-tag" rel="alternate"/><published>2023-01-20T02:12:03+00:00</published><updated>2023-01-20T02:12:03+00:00</updated><id>https://simonwillison.net/2023/Jan/20/datasette-granian/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-granian"&gt;datasette-granian&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Granian is a new Python web server, similar in role to Gunicorn but written in Rust. I built a small plugin that adds a “datasette granian” command which starts a Granian server serving Datasette’s ASGI application, using the same pattern as my existing datasette-gunicorn plugin.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://github.com/emmett-framework/granian/issues/35#issuecomment-1397829516"&gt;Granian issue tracker: Ability to serve an ASGI app object directly, rather than passing a module string&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;&lt;/p&gt;



</summary><category term="rust"/><category term="datasette"/><category term="asgi"/></entry><entry><title>Deploying Python web apps as AWS Lambda functions</title><link href="https://simonwillison.net/2022/Sep/19/mangum/#atom-tag" rel="alternate"/><published>2022-09-19T04:05:03+00:00</published><updated>2022-09-19T04:05:03+00:00</updated><id>https://simonwillison.net/2022/Sep/19/mangum/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/awslambda/asgi-mangum"&gt;Deploying Python web apps as AWS Lambda functions&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
After literally years of failed half-hearted attempts, I finally managed to deploy an ASGI Python web application (Datasette) to an AWS Lambda function! Here are my extensive notes.
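&lt;p&gt;The adapter I used there is Mangum, whose real usage is just wrapping the ASGI app in a handler. Here's a toy, dependency-free sketch of the translation such an adapter performs, turning a Lambda HTTP event into an ASGI scope and collecting the response (the event shape and field names here are a simplified assumption, not Mangum's implementation):&lt;/p&gt;

```python
# Toy sketch of what an ASGI-to-Lambda adapter like Mangum does internally:
# build an ASGI scope from the Lambda event, run the app, gather the response.
import asyncio

async def app(scope, receive, send):
    # Minimal ASGI application standing in for Datasette
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": b"Hello from Lambda"})

def lambda_handler(event, context=None):
    # Translate the (simplified) Lambda event into an ASGI scope
    scope = {
        "type": "http",
        "method": event.get("requestContext", {}).get("http", {}).get("method", "GET"),
        "path": event.get("rawPath", "/"),
        "headers": [],
        "query_string": (event.get("rawQueryString") or "").encode(),
    }
    response = {"statusCode": None, "body": b""}
    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}
    async def send(message):
        if message["type"] == "http.response.start":
            response["statusCode"] = message["status"]
        elif message["type"] == "http.response.body":
            response["body"] += message.get("body", b"")
    asyncio.run(app(scope, receive, send))
    return {"statusCode": response["statusCode"],
            "body": response["body"].decode()}
```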


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lambda"&gt;lambda&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/serverless"&gt;serverless&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="lambda"/><category term="python"/><category term="serverless"/><category term="datasette"/><category term="asgi"/></entry><entry><title>Weeknotes: Datasette Lite, nogil Python, HYTRADBOI</title><link href="https://simonwillison.net/2022/May/6/weeknotes/#atom-tag" rel="alternate"/><published>2022-05-06T22:56:39+00:00</published><updated>2022-05-06T22:56:39+00:00</updated><id>https://simonwillison.net/2022/May/6/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;My big project this week was &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt;, a new way to run Datasette directly in a browser, powered by WebAssembly and &lt;a href="https://pyodide.org/"&gt;Pyodide&lt;/a&gt;. I also continued my research into running SQL queries in parallel, described &lt;a href="https://simonwillison.net/2022/Apr/27/parallel-queries/"&gt;last week&lt;/a&gt;. Plus I spoke at &lt;a href="https://www.hytradboi.com/"&gt;HYTRADBOI&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Datasette Lite&lt;/h4&gt;
&lt;p&gt;This started out as a research project, inspired by the excitement around Python in the browser from PyCon US last week (which I didn't attend, but observed with some jealousy on Twitter).&lt;/p&gt;
&lt;p&gt;I've been wanting to explore this possibility for a while. &lt;a href="https://jupyterlite.readthedocs.io/en/latest/"&gt;JupyterLite&lt;/a&gt; had convinced me that it would be feasible to run Datasette using Pyodide, especially after I found out that the &lt;code&gt;sqlite3&lt;/code&gt; module from the Python standard library works there already.&lt;/p&gt;
&lt;p&gt;I have a private "notes" GitHub repository which I use to keep notes in GitHub issues. I started a thread there researching the possibility of running an ASGI application in Pyodide, thinking that might be a good starting point to getting Datasette to work.&lt;/p&gt;
&lt;p&gt;The proof of concept moved remarkably quickly, especially once I realized that Service Workers weren't going to work but Web Workers might.&lt;/p&gt;
&lt;p&gt;Once I had committed to Datasette Lite as a full project I started &lt;a href="https://github.com/simonw/datasette-lite"&gt;a new repository&lt;/a&gt; for it and transferred across my initial prototype issue thread. You can read that full thread for a blow-by-blow account of how my research came together in &lt;a href="https://github.com/simonw/datasette-lite/issues/1"&gt;datasette-lite issue #1&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The rest of the project is documented in detail in &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;my blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Since launching it the biggest change I've made was a change of URL: since it's clearly going to be a core component of the Datasette project going forward I promoted it from &lt;code&gt;simonw.github.io/datasette-lite/&lt;/code&gt; to its new permanent home at &lt;a href="https://lite.datasette.io"&gt;lite.datasette.io&lt;/a&gt;. It's still hosted by GitHub Pages - here's &lt;a href="https://til.simonwillison.net/github/custom-subdomain-github-pages"&gt;my TIL&lt;/a&gt; about setting up the new domain.&lt;/p&gt;
&lt;p&gt;It may have started as a proof of concept tech demo, but the response to it so far has convinced me that I should really take it seriously. Being able to host Datasette without needing to run any server-side code at all is an incredibly compelling experience.&lt;/p&gt;
&lt;p&gt;It doesn't matter how hard I work on getting the Datasette &lt;a href="https://docs.datasette.io/en/stable/publish.html"&gt;deployment experience&lt;/a&gt; as easy as possible, static file hosting will always be an order of magnitude more accessible. And even at this early stage Datasette Lite is already proving to be a genuinely useful way to run the software.&lt;/p&gt;
&lt;p&gt;As part of this research I also shipped &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-26-1"&gt;sqlite-utils 3.26.1&lt;/a&gt; with a minor dependency fix that means it works in Pyodide now. You can try that out by running the following in the &lt;a href="https://pyodide.org/en/stable/console.html"&gt;Pyodide REPL&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-text-python-console"&gt;&lt;pre&gt;&amp;gt;&amp;gt;&amp;gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; micropip
&amp;gt;&amp;gt;&amp;gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; micropip.install(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sqlite-utils&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;)
&amp;gt;&amp;gt;&amp;gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; sqlite_utils
&amp;gt;&amp;gt;&amp;gt; db &lt;span class="pl-k"&gt;=&lt;/span&gt; sqlite_utils.Database(&lt;span class="pl-v"&gt;memory&lt;/span&gt;&lt;span class="pl-k"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
&amp;gt;&amp;gt;&amp;gt; &lt;span class="pl-c1"&gt;list&lt;/span&gt;(db.query(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;select 3 * 5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;))
[{'3 * 5': 15}]&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="nogil"&gt;Parallel SQL queries work... if you can get rid of the GIL&lt;/h4&gt;
&lt;p&gt;Last week I described my effort to implement &lt;a href="https://simonwillison.net/2022/Apr/27/parallel-queries/"&gt;Parallel SQL queries for Datasette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The idea there was that many Datasette pages execute multiple SQL queries - a &lt;code&gt;count(*)&lt;/code&gt; and a &lt;code&gt;select ... limit 101&lt;/code&gt; for example - that could be run in parallel instead of serial, for a potential improvement in page load times.&lt;/p&gt;
&lt;p&gt;My hope was that I could get away with this despite Python's infamous Global Interpreter Lock because the &lt;code&gt;sqlite3&lt;/code&gt; C module releases the GIL when it executes a query.&lt;/p&gt;
&lt;p&gt;My initial results weren't showing an increase in performance, even while the queries were shown to be overlapping each other. I opened &lt;a href="https://github.com/simonw/datasette/issues/1727"&gt;a research thread&lt;/a&gt; and spent some time this week investigating.&lt;/p&gt;
&lt;p&gt;My conclusion, sadly, was that the GIL was indeed to blame. &lt;code&gt;sqlite3&lt;/code&gt; releases the GIL to execute the query, but there's still a lot of work that happens in Python land itself - most importantly the code that assembles the objects that represent the rows returned by the query, which is still subject to the GIL.&lt;/p&gt;
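&lt;p&gt;A minimal version of the experiment can be sketched with the standard library alone: two queries, each on its own connection, submitted to a thread pool. The queries genuinely overlap while &lt;code&gt;sqlite3&lt;/code&gt; holds no GIL, but the row-object assembly afterwards is still serialized (the queries here are trivial stand-ins, not Datasette's actual SQL):&lt;/p&gt;

```python
# Minimal sketch of the parallel-query experiment: two sqlite3 queries on
# separate connections, run in threads. sqlite3 releases the GIL while a
# query executes, but building the Python row tuples still requires it.
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    # One connection per thread, created and used in the same worker thread
    conn = sqlite3.connect(":memory:")
    return conn.execute(sql).fetchall()

with ThreadPoolExecutor(max_workers=2) as pool:
    count_future = pool.submit(run_query, "select count(*) from (select 1)")
    rows_future = pool.submit(run_query, "select 3 * 5")
    count, rows = count_future.result(), rows_future.result()
```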
&lt;p&gt;Then &lt;a href="https://lobste.rs/s/9hj80j/when_python_can_t_thread_deep_dive_into_gil#c_2n0fga"&gt;this comment&lt;/a&gt; on a thread about the GIL on Lobsters reminded me of the &lt;a href="https://github.com/colesbury/nogil"&gt;nogil fork&lt;/a&gt; of Python by Sam Gross, who has been working on this problem for several years now.&lt;/p&gt;
&lt;p&gt;Since that fork has &lt;a href="https://github.com/colesbury/nogil#docker"&gt;a Docker image&lt;/a&gt; trying it out was easy... and to my amazement &lt;a href="https://simonwillison.net/2022/Apr/29/nogil/"&gt;it worked&lt;/a&gt;! Running my parallel queries implementation against &lt;code&gt;nogil&lt;/code&gt; Python reduced a page load time from 77ms to 47ms.&lt;/p&gt;
&lt;p&gt;Sam's work is against Python 3.9, but he's &lt;a href="https://lukasz.langa.pl/5d044f91-49c1-4170-aed1-62b6763e6ad0/"&gt;discussing options&lt;/a&gt; for bringing his improvements into Python itself with the core maintainers. I'm hopeful that this might happen in the next few years. It's an incredible piece of work.&lt;/p&gt;
&lt;p&gt;An amusing coincidence: one restriction of WASM and Pyodide is that they can't start new threads - so as part of getting Datasette to work on that platform I had to &lt;a href="https://github.com/simonw/datasette/issues/1735"&gt;add a new setting&lt;/a&gt; that disables the ability to run SQL queries in threads entirely!&lt;/p&gt;
&lt;h4&gt;datasette-copy-to-memory&lt;/h4&gt;
&lt;p&gt;One question I found myself asking while investigating parallel SQL queries (before I determined that the GIL was to blame) was whether parallel SQLite queries against the same database file were suffering from some form of file locking or contention.&lt;/p&gt;
&lt;p&gt;To rule that out, I built a new plugin: &lt;a href="https://datasette.io/plugins/datasette-copy-to-memory"&gt;datasette-copy-to-memory&lt;/a&gt; - which reads a SQLite database from disk and copies it into an in-memory database when Datasette first starts up.&lt;/p&gt;
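&lt;p&gt;The core of that copy step can be done with the standard library's SQLite backup API. Here's a sketch of the technique (the plugin's actual implementation may differ):&lt;/p&gt;

```python
# Sketch of copying an on-disk SQLite database into memory using the
# sqlite3 Connection.backup() API - the kind of startup copy a plugin
# like datasette-copy-to-memory performs.
import os
import sqlite3
import tempfile

# Build a small on-disk database to copy
path = os.path.join(tempfile.mkdtemp(), "demo.db")
disk = sqlite3.connect(path)
disk.execute("create table t (n integer)")
disk.execute("insert into t values (42)")
disk.commit()

# Copy it wholesale into an in-memory database
memory = sqlite3.connect(":memory:")
disk.backup(memory)
disk.close()
```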
&lt;p&gt;This didn't make an observable difference in performance, but I've not tested it extensively - especially not against larger databases on servers with more available RAM.&lt;/p&gt;
&lt;p&gt;If you're inspired to give this plugin a go I'd love to hear about your results.&lt;/p&gt;
&lt;h4&gt;asgi-gzip and datasette-gzip&lt;/h4&gt;
&lt;p&gt;I mentioned &lt;code&gt;datasette-gzip&lt;/code&gt; last week: a plugin that acts as a wrapper around the excellent &lt;code&gt;GZipMiddleware&lt;/code&gt; from &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The performance improvements from this - especially for larger HTML tables, which it turns out compress extremely well - were significant. Enough so that I plan to bring gzip support into Datasette core very shortly.&lt;/p&gt;
&lt;p&gt;Since I don't want to add the whole of Starlette as a dependency just to get gzip support, I extracted that code out into a new Python package called &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The obvious risk with doing this is that it might fall behind the excellent Starlette implementation. So I came up with a pattern based on &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; that would automatically open a new GitHub issue should the borrowed Starlette code change in the future.&lt;/p&gt;
&lt;p&gt;I wrote about that pattern in &lt;a href="https://simonwillison.net/2022/Apr/28/issue-on-changes/"&gt;Automatically opening issues when tracked file content changes&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Speaking at HYTRADBOI&lt;/h4&gt;
&lt;p&gt;I spoke at the &lt;a href="https://www.hytradboi.com/"&gt;HYTRADBOI conference&lt;/a&gt; last week: Have You Tried Rubbing A Database On It.&lt;/p&gt;
&lt;p&gt;HYTRADBOI was organized by Jamie Brandon. It was a neat event, with a smart format: 34 pre-recorded ten-minute talks, arranged into a schedule to encourage people to watch and discuss them at specific times during the day of the event.&lt;/p&gt;
&lt;p&gt;It's worth reading Jamie's &lt;a href="https://www.scattered-thoughts.net/writing/hytradboi-2022-postmortem/"&gt;postmortem of the event&lt;/a&gt; for some insightful thinking on online event organization.&lt;/p&gt;
&lt;p&gt;My talk was &lt;a href="https://www.hytradboi.com/2022/datasette-a-big-bag-of-tricks-for-solving-interesting-problems-using-sqlite"&gt;Datasette: a big bag of tricks for solving interesting problems using SQLite&lt;/a&gt;. It ended up working out as a lightning-fast 10 minute tutorial on using the &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html"&gt;sqlite-utils CLI&lt;/a&gt; to clean up some data (in this case &lt;a href="https://geodata.myfwc.com/datasets/myfwc::manatee-carcass-recovery-locations-in-florida/about"&gt;Manatee Carcass Recovery Locations in Florida&lt;/a&gt; since 1974) and then using Datasette to explore and publish it.&lt;/p&gt;
&lt;p&gt;I've posted &lt;a href="https://gist.github.com/simonw/c61447d866f7f29d368183fb09d9bf41"&gt;some basic notes&lt;/a&gt; to accompany the talk. My plan is to use this as the basis for an official tutorial on &lt;code&gt;sqlite-utils&lt;/code&gt; for the &lt;a href="https://datasette.io/tutorials"&gt;tutorials section&lt;/a&gt; of the Datasette website.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.62a0"&gt;0.62a0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;111 releases total&lt;/a&gt;) - 2022-05-02
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.26.1"&gt;3.26.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;100 releases total&lt;/a&gt;) - 2022-05-02
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/click-default-group-wheel"&gt;click-default-group-wheel&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/click-default-group-wheel/releases/tag/1.2.2"&gt;1.2.2&lt;/a&gt; - 2022-05-02
&lt;br /&gt;Extends click.Group to invoke a command without explicit subcommand name (this version publishes a wheel)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.11"&gt;0.11&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;11 releases total&lt;/a&gt;) - 2022-05-01
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-copy-to-memory"&gt;datasette-copy-to-memory&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-copy-to-memory/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-copy-to-memory/releases"&gt;5 releases total&lt;/a&gt;) - 2022-04-30
&lt;br /&gt;Copy database files into an in-memory database on startup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-gzip"&gt;datasette-gzip&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-gzip/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-gzip/releases"&gt;2 releases total&lt;/a&gt;) - 2022-04-28
&lt;br /&gt;Add gzip compression to Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/asgi-gzip/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-04-28
&lt;br /&gt;gzip middleware for ASGI applications, extracted from Starlette&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/service-workers/intercept-fetch"&gt;Intercepting fetch in a service worker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github/custom-subdomain-github-pages"&gt;Setting up a custom subdomain for a GitHub Pages site&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gil"&gt;gil&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speaking"&gt;speaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pyodide"&gt;pyodide&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="gil"/><category term="projects"/><category term="python"/><category term="speaking"/><category term="datasette"/><category term="asgi"/><category term="webassembly"/><category term="weeknotes"/><category term="pyodide"/><category term="datasette-lite"/></entry><entry><title>Automatically opening issues when tracked file content changes</title><link href="https://simonwillison.net/2022/Apr/28/issue-on-changes/#atom-tag" rel="alternate"/><published>2022-04-28T17:18:14+00:00</published><updated>2022-04-28T17:18:14+00:00</updated><id>https://simonwillison.net/2022/Apr/28/issue-on-changes/#atom-tag</id><summary type="html">
    &lt;p&gt;I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.&lt;/p&gt;
&lt;h4&gt;Extracting GZipMiddleware from Starlette&lt;/h4&gt;
&lt;p&gt;Here's why I needed to solve this problem.&lt;/p&gt;
&lt;p&gt;I want to add gzip support to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; open source project. Datasette builds on the Python &lt;a href="https://asgi.readthedocs.io/"&gt;ASGI&lt;/a&gt; standard, and &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt; provides an extremely well tested, robust &lt;a href="https://www.starlette.io/middleware/#gzipmiddleware"&gt;GZipMiddleware class&lt;/a&gt; that adds gzip support to any ASGI application. As with everything else in Starlette, it's &lt;em&gt;really&lt;/em&gt; good code.&lt;/p&gt;
&lt;p&gt;The problem is, I don't want to add the whole of Starlette as a dependency. I'm trying to keep Datasette's core as small as possible, so I'm very careful about new dependencies. Starlette itself is actually very light (and only has a tiny number of dependencies of its own) but I still don't want the whole thing just for that one class.&lt;/p&gt;
&lt;p&gt;So I decided to extract the &lt;code&gt;GZipMiddleware&lt;/code&gt; class into a separate Python package, under the same BSD license as Starlette itself.&lt;/p&gt;
&lt;p&gt;The result is my new &lt;a href="https://pypi.org/project/asgi-gzip/"&gt;asgi-gzip&lt;/a&gt; package, now available on PyPI.&lt;/p&gt;
&lt;h4&gt;What if Starlette fixes a bug?&lt;/h4&gt;
&lt;p&gt;The problem with extracting code like this is that Starlette is a very effectively maintained package. What if they make improvements or fix bugs in the &lt;code&gt;GZipMiddleware&lt;/code&gt; class? How can I make sure to apply those same fixes to my extracted copy?&lt;/p&gt;
&lt;p&gt;As I thought about this challenge, I realized I had most of the solution already.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt;&lt;/strong&gt; is the name I've given to the trick of running a periodic scraper that writes to a git repository in order to track changes to data over time.&lt;/p&gt;
&lt;p&gt;It may seem redundant to do this against a file that already &lt;a href="https://github.com/encode/starlette/commits/master/starlette/middleware/gzip.py"&gt;lives in version control&lt;/a&gt; elsewhere - but in addition to tracking changes, Git scraping can offer a cheap and easy way to add automation that triggers when a change is detected.&lt;/p&gt;
&lt;p&gt;I need an actionable alert any time the Starlette code changes so I can review the change and apply a fix to my own library, if necessary.&lt;/p&gt;
&lt;p&gt;Since I already run all of my projects out of GitHub issues, automatically opening an issue against the &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip repository&lt;/a&gt; would be ideal.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/asgi-gzip/blob/0.1/.github/workflows/track.yml"&gt;track.yml workflow&lt;/a&gt; does exactly that: it implements the Git scraping pattern against the &lt;a href="https://github.com/encode/starlette/blob/master/starlette/middleware/gzip.py"&gt;gzip.py module&lt;/a&gt; in Starlette, and files an issue any time it detects changes to that file.&lt;/p&gt;
&lt;p&gt;Starlette haven't made any changes to that file since I started tracking it, so I created &lt;a href="https://github.com/simonw/issue-when-changed"&gt;a test repo&lt;/a&gt; to try this out.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/issue-when-changed/issues/3"&gt;one of the example issues&lt;/a&gt;. I decided to include the visual diff in the issue description and have a link to it from the underlying commit as well.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/issue-when-changed.jpg" alt="Screenshot of an open issue page. The issue is titled &amp;quot;gzip.py was updated&amp;quot; and contains a visual diff showing the change to a file. A commit that references the issue is listed too." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;How it works&lt;/h4&gt;
&lt;p&gt;The implementation is contained entirely in this &lt;a href="https://github.com/simonw/asgi-gzip/blob/0.1/.github/workflows/track.yml"&gt;track.yml workflow&lt;/a&gt;. I designed it as a single self-contained file so it's easy to copy, paste and adapt for other projects.&lt;/p&gt;
&lt;p&gt;It uses &lt;a href="https://github.com/actions/github-script"&gt;actions/github-script&lt;/a&gt;, which makes it easy to do things like file new issues using JavaScript.&lt;/p&gt;
&lt;p&gt;Here's a heavily annotated copy:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Track the Starlette version of this&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Run on repo pushes, and if a user clicks the "run this action" button,&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; and on a schedule at 5:21am UTC every day&lt;/span&gt;
&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;21 5 * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Without this block I got this error when the action ran:&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; HttpError: Resource not accessible by integration&lt;/span&gt;
&lt;span class="pl-ent"&gt;permissions&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow the action to create issues&lt;/span&gt;
  &lt;span class="pl-ent"&gt;issues&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow the action to commit back to the repository&lt;/span&gt;
  &lt;span class="pl-ent"&gt;contents&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;check&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/github-script@v6&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Using env: here to demonstrate how an action like this can&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; be adjusted to take dynamic inputs&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;URL&lt;/span&gt;: &lt;span class="pl-s"&gt;https://raw.githubusercontent.com/encode/starlette/master/starlette/middleware/gzip.py&lt;/span&gt;
        &lt;span class="pl-ent"&gt;FILE_NAME&lt;/span&gt;: &lt;span class="pl-s"&gt;tracking/gzip.py&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;script&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;          const { URL, FILE_NAME } = process.env;&lt;/span&gt;
&lt;span class="pl-s"&gt;          // promisify pattern for getting an await version of child_process.exec&lt;/span&gt;
&lt;span class="pl-s"&gt;          const util = require("util");&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Used exec_ here because 'exec' variable name is already used:&lt;/span&gt;
&lt;span class="pl-s"&gt;          const exec_ = util.promisify(require("child_process").exec);&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Use curl to download the file&lt;/span&gt;
&lt;span class="pl-s"&gt;          await exec_(`curl -o ${FILE_NAME} ${URL}`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          // Use 'git diff' to detect if the file has changed since last time&lt;/span&gt;
&lt;span class="pl-s"&gt;          const { stdout } = await exec_(`git diff ${FILE_NAME}`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          if (stdout) {&lt;/span&gt;
&lt;span class="pl-s"&gt;            // There was a diff to that file&lt;/span&gt;
&lt;span class="pl-s"&gt;            const title = `${FILE_NAME} was updated`;&lt;/span&gt;
&lt;span class="pl-s"&gt;            const body =&lt;/span&gt;
&lt;span class="pl-s"&gt;              `${URL} changed:` +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "\n\n```diff\n" +&lt;/span&gt;
&lt;span class="pl-s"&gt;              stdout +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "\n```\n\n" +&lt;/span&gt;
&lt;span class="pl-s"&gt;              "Close this issue once those changes have been integrated here";&lt;/span&gt;
&lt;span class="pl-s"&gt;            const issue = await github.rest.issues.create({&lt;/span&gt;
&lt;span class="pl-s"&gt;              owner: context.repo.owner,&lt;/span&gt;
&lt;span class="pl-s"&gt;              repo: context.repo.repo,&lt;/span&gt;
&lt;span class="pl-s"&gt;              title: title,&lt;/span&gt;
&lt;span class="pl-s"&gt;              body: body,&lt;/span&gt;
&lt;span class="pl-s"&gt;            });&lt;/span&gt;
&lt;span class="pl-s"&gt;            const issueNumber = issue.data.number;&lt;/span&gt;
&lt;span class="pl-s"&gt;            // Now commit and reference that issue number, so the commit shows up&lt;/span&gt;
&lt;span class="pl-s"&gt;            // listed at the bottom of the issue page&lt;/span&gt;
&lt;span class="pl-s"&gt;            const commitMessage = `${FILE_NAME} updated, refs #${issueNumber}`;&lt;/span&gt;
&lt;span class="pl-s"&gt;            // https://til.simonwillison.net/github-actions/commit-if-file-changed&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git config user.name "Automated"`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git config user.email "actions@users.noreply.github.com"`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git add -A`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git commit -m "${commitMessage}" || exit 0`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git pull --rebase`);&lt;/span&gt;
&lt;span class="pl-s"&gt;            await exec_(`git push`);&lt;/span&gt;
&lt;span class="pl-s"&gt;          }&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the &lt;a href="https://github.com/simonw/asgi-gzip"&gt;asgi-gzip&lt;/a&gt; repository I keep the fetched &lt;code&gt;gzip.py&lt;/code&gt; file in a &lt;code&gt;tracking/&lt;/code&gt; directory. This directory isn't included in the Python package that gets uploaded to PyPI - it's there only so that my code can track changes to it over time.&lt;/p&gt;
&lt;h4&gt;More interesting applications&lt;/h4&gt;
&lt;p&gt;I built this to solve my "tell me when Starlette update their &lt;code&gt;gzip.py&lt;/code&gt; file" problem, but clearly this pattern has much more interesting uses.&lt;/p&gt;
&lt;p&gt;You could point this at any web page to get a new GitHub issue opened when that page content changes. Subscribe to notifications for that repository and you get a robust, shared mechanism for alerts - plus an issue system where you can post additional comments and close the issue once someone has reviewed the change.&lt;/p&gt;
&lt;p&gt;There's a lot of potential here for solving all kinds of interesting problems. And it doesn't cost anything either: GitHub Actions (somehow) remains completely free for public repositories!&lt;/p&gt;
&lt;h4&gt;Update: October 13th 2022&lt;/h4&gt;
&lt;p&gt;Almost six months after writing about this... it triggered for the first time!&lt;/p&gt;
&lt;p&gt;Here's the issue that the script opened: &lt;a href="https://github.com/simonw/asgi-gzip/issues/4"&gt;#4: tracking/gzip.py was updated&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I applied the improvement (Marcelo Trylesinski and Kai Klingenberg updated Starlette's code to avoid gzipping if the response already had a Content-Encoding header) and released &lt;a href="https://github.com/simonw/asgi-gzip/releases/tag/0.2"&gt;version 0.2&lt;/a&gt; of the package.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gzip"&gt;gzip&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="gzip"/><category term="projects"/><category term="python"/><category term="datasette"/><category term="asgi"/><category term="github-actions"/><category term="git-scraping"/><category term="github-issues"/></entry><entry><title>Weeknotes: sqlite-utils updates, Datasette and asgi-csrf, open-sourcing VIAL</title><link href="https://simonwillison.net/2021/Jun/28/weeknotes/#atom-tag" rel="alternate"/><published>2021-06-28T17:23:21+00:00</published><updated>2021-06-28T17:23:21+00:00</updated><id>https://simonwillison.net/2021/Jun/28/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;Some work on &lt;code&gt;sqlite-utils&lt;/code&gt;, &lt;code&gt;asgi-csrf&lt;/code&gt;, a Datasette alpha and we open-sourced VIAL.&lt;/p&gt;
&lt;h4&gt;sqlite-utils&lt;/h4&gt;
&lt;p&gt;Last week's &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-10"&gt;sqlite-utils 3.10&lt;/a&gt; introduced a huge new feature: the ability to &lt;a href="https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/"&gt;run joins directly against CSV and JSON files from the command-line&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've since released &lt;a href="https://sqlite-utils.datasette.io/en/latest/changelog.html#v3-11"&gt;sqlite-utils 3.11&lt;/a&gt; and &lt;a href="https://sqlite-utils.datasette.io/en/latest/changelog.html#v3-12"&gt;3.12&lt;/a&gt;, much smaller releases.&lt;/p&gt;
&lt;p&gt;3.11 added a new &lt;code&gt;--schema&lt;/code&gt; option to the &lt;code&gt;sqlite-utils memory&lt;/code&gt; command which lets you see the schema you'll be querying for the imported data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl 'https://api.github.com/users/dogsheep/repos' | \
  sqlite-utils memory - --schema
CREATE TABLE [stdin] (
   [id] INTEGER,
   [node_id] TEXT,
   [name] TEXT,
   [full_name] TEXT,
   [private] INTEGER,
   [owner] TEXT,
   [html_url] TEXT,
   [description] TEXT,
   ...
   [watchers] INTEGER,
   [default_branch] TEXT
);
CREATE VIEW t1 AS select * from [stdin];
CREATE VIEW t AS select * from [stdin];
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;3.12 focused on the Python library side of the package. It adds a new method, &lt;code&gt;db.query(sql)&lt;/code&gt; which returns an iterator over Python dictionaries representing the results of a query.&lt;/p&gt;
&lt;p&gt;This was a pretty obvious missing feature of the library: the rest of &lt;code&gt;sqlite-utils&lt;/code&gt; deals with rows that are represented as dictionaries - you pass a list of Python dictionaries to &lt;code&gt;db[table_name].insert_all(list_of_dicts)&lt;/code&gt; to create a table with the correct schema, for example. But if you wanted to execute &lt;code&gt;SELECT&lt;/code&gt; queries you had to use &lt;code&gt;db.execute()&lt;/code&gt;, which returns a standard library cursor object that in turn produces plain tuples when you call &lt;code&gt;.fetchall()&lt;/code&gt; on it.&lt;/p&gt;
&lt;p&gt;It was only when I started to work on an interactive Jupyter notebook tutorial for &lt;code&gt;sqlite-utils&lt;/code&gt; that I realized how weird it was not to have an equivalent method for reading data out of the database again.&lt;/p&gt;
&lt;p&gt;Here's what the new method looks like:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;db&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Database&lt;/span&gt;(&lt;span class="pl-s1"&gt;memory&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
&lt;span class="pl-s1"&gt;db&lt;/span&gt;[&lt;span class="pl-s"&gt;"dogs"&lt;/span&gt;].&lt;span class="pl-en"&gt;insert_all&lt;/span&gt;([
    {&lt;span class="pl-s"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;"Cleo"&lt;/span&gt;},
    {&lt;span class="pl-s"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;"Pancakes"&lt;/span&gt;}
])
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;row&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;db&lt;/span&gt;.&lt;span class="pl-en"&gt;query&lt;/span&gt;(&lt;span class="pl-s"&gt;"select * from dogs"&lt;/span&gt;):
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;row&lt;/span&gt;)
&lt;span class="pl-c"&gt;# Outputs:&lt;/span&gt;
&lt;span class="pl-c"&gt;# {'name': 'Cleo'}&lt;/span&gt;
&lt;span class="pl-c"&gt;# {'name': 'Pancakes'}&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#db-query-sql-params"&gt;Full documentation here&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;asgi-csrf and a Datasette alpha&lt;/h4&gt;
&lt;p&gt;I'm building a custom Datasette integration for a consulting client at the moment which needs to be able to accept &lt;code&gt;POST&lt;/code&gt; form data as part of an API. Datasette has &lt;a href="https://docs.datasette.io/en/stable/internals.html#csrf-protection"&gt;CSRF protection&lt;/a&gt; but for this particular project I need to opt-out of that protection for this one endpoint.&lt;/p&gt;
&lt;p&gt;I ended up releasing &lt;a href="https://github.com/simonw/asgi-csrf/releases/tag/0.9"&gt;asgi-csrf 0.9&lt;/a&gt; with a new &lt;code&gt;skip_if_scope=&lt;/code&gt; mechanism for dynamically disabling CSRF protection based on the incoming ASGI scope. I then shipped a &lt;a href="https://github.com/simonw/datasette/releases/tag/0.58a1"&gt;Datasette 0.58a1&lt;/a&gt; alpha release with a new &lt;a href="https://docs.datasette.io/en/latest/plugin_hooks.html#plugin-hook-skip-csrf"&gt;skip_csrf(datasette, scope)&lt;/a&gt; plugin hook for plugins to take advantage of that mechanism.&lt;/p&gt;
&lt;p&gt;Expect another alpha release shortly to preview the new &lt;a href="https://github.com/simonw/datasette/issues/1384"&gt;get_metadata plugin hook&lt;/a&gt; contributed by Brandon Roberts. I've decided that alphas are the ideal way to explore new plugin hooks while they are still being developed as it lets projects &lt;code&gt;pip install&lt;/code&gt; the alpha while making it clear that the interface may not yet be fully baked.&lt;/p&gt;
&lt;h4&gt;Open-sourcing VIAL&lt;/h4&gt;
&lt;p&gt;VIAL is the project I've been working on for VaccinateCA/VaccinateTheStates - see &lt;a href="https://simonwillison.net/tags/vaccinateca/"&gt;previous posts&lt;/a&gt;. It's a Django application which powers a crowd-sourced and scraper-driven effort to catalogue all of the places in the USA that you can get the Covid vaccine - 77,000 and counting right now.&lt;/p&gt;
&lt;p&gt;We had always intended to open-source the code and now we have! &lt;a href="https://github.com/CAVaccineInventory/vial"&gt;github.com/CAVaccineInventory/vial&lt;/a&gt; is the newly-made-public repository.&lt;/p&gt;
&lt;p&gt;I still need to produce a bunch of extra documentation about VIAL, likely including a video introduction to the project. But it's great to have it out there!&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.12"&gt;3.12&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;80 releases total&lt;/a&gt;) - 2021-06-25
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.58a1"&gt;0.58a1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;92 releases total&lt;/a&gt;) - 2021-06-24
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/asgi-csrf"&gt;asgi-csrf&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/asgi-csrf/releases/tag/0.9"&gt;0.9&lt;/a&gt; - (&lt;a href="https://github.com/simonw/asgi-csrf/releases"&gt;17 releases total&lt;/a&gt;) - 2021-06-23
&lt;br /&gt;ASGI middleware for protecting against CSRF attacks&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/reddit/scraping-reddit-json"&gt;Scraping Reddit via their JSON API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csrf"&gt;csrf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vaccinate-ca"&gt;vaccinate-ca&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csrf"/><category term="datasette"/><category term="asgi"/><category term="weeknotes"/><category term="sqlite-utils"/><category term="vaccinate-ca"/></entry><entry><title>Notes on streaming large API responses</title><link href="https://simonwillison.net/2021/Jun/25/streaming-large-api-responses/#atom-tag" rel="alternate"/><published>2021-06-25T16:26:49+00:00</published><updated>2021-06-25T16:26:49+00:00</updated><id>https://simonwillison.net/2021/Jun/25/streaming-large-api-responses/#atom-tag</id><summary type="html">
    &lt;p&gt;I started &lt;a href="https://twitter.com/simonw/status/1405554676993433605"&gt;a Twitter conversation&lt;/a&gt; last week about API endpoints that stream large amounts of data as an alternative to APIs that return 100 results at a time and require clients to paginate through all of the pages in order to retrieve all of the data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p dir="ltr" lang="en"&gt;Any unexpected downsides to offering streaming HTTP API endpoints that serve up eg 100,000 JSON objects in a go rather than asking users to paginate 100 at a time over 1,000 requests, assuming efficient implementation of that streaming endpoint?&lt;/p&gt;— Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1405554676993433605?ref_src=twsrc%5Etfw"&gt;June 17, 2021&lt;/a&gt;
&lt;/blockquote&gt;
&lt;p&gt;I got a ton of great replies. I tried to tie them together in a thread attached to the tweet, but I'm also going to synthesize them into some thoughts here.&lt;/p&gt;
&lt;h4&gt;Bulk exporting data&lt;/h4&gt;
&lt;p&gt;The more time I spend with APIs, especially with regard to my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://simonwillison.net/2020/Nov/14/personal-data-warehouses/"&gt;Dogsheep&lt;/a&gt; projects, the more I realize that my favourite APIs are the ones that let you extract &lt;em&gt;all&lt;/em&gt; of your data as quickly and easily as possible.&lt;/p&gt;
&lt;p&gt;There are generally three ways an API might provide this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Click an "export everything" button, then wait for a while for an email to show up with a link to a downloadable zip file. This isn't really an API, in particular since it's usually hard if not impossible to automate that initial "click", but it's still better than nothing. Google's &lt;a href="https://takeout.google.com/"&gt;Takeout&lt;/a&gt; is one notable implementation of this pattern.&lt;/li&gt;
&lt;li&gt;Provide a JSON API which allows users to paginate through their data. This is a very common pattern, although it can run into difficulties: what happens if new data is added while you are paginating through the original data, for example? Some systems only allow access to the first N pages too, for performance reasons.&lt;/li&gt;
&lt;li&gt;Provide a single HTTP endpoint you can hit that will return ALL of your data - potentially dozens or hundreds of MBs of it - in one go.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It's that last option that I'm interested in talking about today.&lt;/p&gt;
&lt;h4&gt;Efficiently streaming data&lt;/h4&gt;
&lt;p&gt;It used to be that most web engineers would quickly discount the idea of an API endpoint that streams out an unlimited number of rows. HTTP requests should be served as quickly as possible! Anything more than a couple of seconds spent processing a request is a red flag that something should be reconsidered.&lt;/p&gt;
&lt;p&gt;Almost everything in the web stack is optimized for quickly serving small requests. But over the past decade the tide has turned somewhat: Node.js made async web servers commonplace, WebSockets taught us to handle long-running connections and in the Python world asyncio and &lt;a href="https://asgi.readthedocs.io/"&gt;ASGI&lt;/a&gt; provided a firm foundation for handling long-running requests using smaller amounts of RAM and CPU.&lt;/p&gt;
&lt;p&gt;I've been experimenting in this area for a few years now.&lt;/p&gt;
&lt;p&gt;Datasette has the ability to &lt;a href="https://github.com/simonw/datasette/blob/0.57.1/datasette/views/base.py#L264-L428"&gt;use ASGI trickery&lt;/a&gt; to &lt;a href="https://docs.datasette.io/en/stable/csv_export.html#streaming-all-records"&gt;stream all rows from a table&lt;/a&gt; (or filtered table) as CSV, potentially returning hundreds of MBs of data.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://django-sql-dashboard.datasette.io/"&gt;Django SQL Dashboard&lt;/a&gt; can export the full results of a SQL query as CSV or TSV, this time using Django's &lt;a href="https://docs.djangoproject.com/en/3.2/ref/request-response/#django.http.StreamingHttpResponse"&gt;StreamingHttpResponse&lt;/a&gt; (which does tie up a full worker process, but that's OK if you restrict it to a controlled number of authenticated users).&lt;/p&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/tags/vaccinateca/"&gt;VIAL&lt;/a&gt; implements streaming responses to offer an &lt;a href="https://github.com/CAVaccineInventory/vial/blob/cdaaab053a9cf1cef40104a2cdf480b7932d58f7/vaccinate/core/admin_actions.py"&gt;"export from the admin" feature&lt;/a&gt;. It also has an API-key-protected search API which can stream out all matching rows &lt;a href="https://github.com/CAVaccineInventory/vial/blob/cdaaab053a9cf1cef40104a2cdf480b7932d58f7/vaccinate/api/serialize.py#L38"&gt;in JSON or GeoJSON&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Implementation notes&lt;/h4&gt;
&lt;p&gt;The key thing to watch out for when implementing this pattern is memory usage: if your server buffers 100MB+ of data any time it needs to serve an export request you're going to run into trouble.&lt;/p&gt;
&lt;p&gt;Some export formats are friendlier for streaming than others. CSV and TSV are pretty easy to stream, as is newline-delimited JSON.&lt;/p&gt;
&lt;p&gt;Regular JSON requires a bit more thought: you can output a &lt;code&gt;[&lt;/code&gt; character, then output each row in a stream with a comma suffix, then skip the comma for the last row and output a &lt;code&gt;]&lt;/code&gt;. Doing that requires peeking ahead (looping two at a time) to verify that you haven't yet reached the end.&lt;/p&gt;
&lt;p&gt;Or... Martin De Wulf &lt;a href="https://twitter.com/madewulf/status/1405559088994467844"&gt;pointed out&lt;/a&gt; that you can output the first row, then output every other row with a preceding comma - which avoids the whole "iterate two at a time" problem entirely.&lt;/p&gt;
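&lt;p&gt;That trick is simple to express as a generator - a sketch, not code from any of the projects mentioned here:&lt;/p&gt;

```python
import json

def stream_json(rows):
    # Stream a JSON array one chunk at a time: emit the first row bare,
    # then prefix every subsequent row with a comma. This avoids the
    # look-ahead otherwise needed to skip the comma after the last row.
    yield "["
    first = True
    for row in rows:
        if first:
            first = False
            yield "\n  " + json.dumps(row)
        else:
            yield ",\n  " + json.dumps(row)
    yield "\n]"
```

Joining the chunks back together produces valid JSON, but in a real endpoint each chunk would be written straight to the response.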
&lt;p&gt;The next challenge is efficiently looping through every database result without first pulling them all into memory.&lt;/p&gt;
&lt;p&gt;PostgreSQL (and the &lt;code&gt;psycopg2&lt;/code&gt; Python module) offers &lt;a href="https://www.psycopg.org/docs/usage.html#server-side-cursors"&gt;server-side cursors&lt;/a&gt;, which means you can stream results through your code without loading them all at once. I use these &lt;a href="https://github.com/simonw/django-sql-dashboard/blob/dd1bb18e45b40ce8f3d0553a72b7ec3cdc329e69/django_sql_dashboard/views.py#L397-L399"&gt;in Django SQL Dashboard&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Server-side cursors make me nervous though, because they seem like they likely tie up resources in the database itself. So the other technique I would consider here is &lt;a href="https://use-the-index-luke.com/no-offset"&gt;keyset pagination&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Keyset pagination works against any data that is ordered by a unique column - it works especially well against a primary key (or other indexed column). Each page of data is retrieved using a query something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; items &lt;span class="pl-k"&gt;order by&lt;/span&gt; id &lt;span class="pl-k"&gt;limit&lt;/span&gt; &lt;span class="pl-c1"&gt;21&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note the &lt;code&gt;limit 21&lt;/code&gt; - if we are retrieving pages of 20 items we ask for 21, since then we can use the last returned item to tell if there is a next page or not.&lt;/p&gt;
&lt;p&gt;Then for subsequent pages take the 20th &lt;code&gt;id&lt;/code&gt; value and ask for things greater than that:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; items &lt;span class="pl-k"&gt;where&lt;/span&gt; id &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;20&lt;/span&gt; &lt;span class="pl-k"&gt;limit&lt;/span&gt; &lt;span class="pl-c1"&gt;21&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each of these queries is fast (since it runs against an ordered index) and uses a predictable, fixed amount of memory. Using keyset pagination we can loop through an arbitrarily large table of data, streaming each page out one at a time, without exhausting any resources.&lt;/p&gt;
&lt;p&gt;And since each query is small and fast, we don't need to worry about huge queries tying up database resources either.&lt;/p&gt;
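&lt;p&gt;Here's a sketch of that loop against SQLite (table and column names are invented, and it treats an empty page as the end of the data rather than using the limit-21 trick, which a server-side streaming loop doesn't need):&lt;/p&gt;

```python
import sqlite3

def stream_rows(conn, table, page_size=20):
    # Keyset pagination: each query is an indexed range scan starting
    # after the last id seen, so memory use stays fixed no matter how
    # large the table is.
    last_id = 0
    while True:
        rows = conn.execute(
            f"select id, name from {table} where id > ? order by id limit ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            break
        for row in rows:
            yield row
        last_id = rows[-1][0]
```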
&lt;h4&gt;What can go wrong?&lt;/h4&gt;
&lt;p&gt;I really like these patterns. They haven't bitten me yet, though I've not deployed them for anything truly huge scale. So I &lt;a href="https://twitter.com/simonw/status/1405554676993433605"&gt;asked Twitter&lt;/a&gt; what kind of problems I should look for.&lt;/p&gt;
&lt;p&gt;Based on the Twitter conversation, here are some of the challenges that this approach faces.&lt;/p&gt;
&lt;h4&gt;Challenge: restarting servers&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p dir="ltr" lang="en"&gt;If the stream takes a significantly long time to finish then rolling out updates becomes a problem. You don't want to interrupt a download but also don't want to wait forever for it to finish to spin down the server.&lt;/p&gt;— Adam Lowry (@robotadam) &lt;a href="https://twitter.com/robotadam/status/1405556544897384459?ref_src=twsrc%5Etfw"&gt;June 17, 2021&lt;/a&gt;
&lt;/blockquote&gt;
&lt;p&gt;This came up a few times, and is something I hadn't considered. If your deployment process involves restarting your servers (and it's hard to imagine one that doesn't) you need to take long-running connections into account when you do that. If there's a user halfway through a 500MB stream you can either truncate their connection or wait for them to finish.&lt;/p&gt;
&lt;h4 id="challenge-errors"&gt;Challenge: how to return errors&lt;/h4&gt;
&lt;p&gt;If you're streaming a response, you start with an HTTP 200 code... but then what happens if an error occurs halfway through, potentially while paginating through the database?&lt;/p&gt;
&lt;p&gt;You've already started sending the response, so you can't change the status code to a 500. Instead, you need to write some kind of error to the stream that's being produced.&lt;/p&gt;
&lt;p&gt;If you're serving up a huge JSON document, you can at least make that JSON become invalid, which should indicate to your client that something went wrong.&lt;/p&gt;
&lt;p&gt;Formats like CSV are harder. How do you let your user know that their CSV data is incomplete?&lt;/p&gt;
&lt;p&gt;And what if someone's connection drops - are they definitely going to notice that they are missing something, or will they assume that the truncated file is all of the data?&lt;/p&gt;
&lt;h4&gt;Challenge: resumable downloads&lt;/h4&gt;
&lt;p&gt;If a user is paginating through your API, they get resumability for free: if something goes wrong they can start again at the last page that they fetched.&lt;/p&gt;
&lt;p&gt;Resuming a single stream is a lot harder.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests"&gt;HTTP range mechanism&lt;/a&gt; can be used to provide resumable downloads against large files, but it only works if you generate the entire file in advance.&lt;/p&gt;
&lt;p&gt;There is a way to design APIs to support this, provided the data in the stream is in a predictable order (which it has to be if you're using keyset pagination, described above).&lt;/p&gt;
&lt;p&gt;Have the endpoint that triggers the download take an optional &lt;code&gt;?since=&lt;/code&gt; parameter, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /stream-everything?since=b24ou34
[
    {"id": "m442ecc", "name": "..."},
    {"id": "c663qo2", "name": "..."},
    {"id": "z434hh3", "name": "..."}
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here the &lt;code&gt;b24ou34&lt;/code&gt; is an identifier - it can be a deliberately opaque token, but it needs to be served up as part of the response.&lt;/p&gt;
&lt;p&gt;If the user is disconnected for any reason, they can start back where they left off by passing in the last ID that they successfully retrieved:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /stream-everything?since=z434hh3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This still requires some level of intelligence from the client application, but it's a reasonably simple pattern both to implement on the server and as a client.&lt;/p&gt;
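&lt;p&gt;The client-side half of that pattern can be as simple as this sketch, where &lt;code&gt;fetch_page(since)&lt;/code&gt; is a hypothetical function yielding records from the streaming endpoint:&lt;/p&gt;

```python
def fetch_all(fetch_page):
    # Keep requesting the stream, resuming from the last successfully
    # received ID after any dropped connection. fetch_page(since) is a
    # hypothetical callable returning an iterator of {"id": ...} records
    # which may raise ConnectionError part-way through.
    records = []
    since = None
    while True:
        try:
            for record in fetch_page(since):
                records.append(record)
                since = record["id"]
            return records
        except ConnectionError:
            # Retry, picking up where we left off
            continue
```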
&lt;h4&gt;Easiest solution: generate and return from cloud storage&lt;/h4&gt;
&lt;p&gt;It seems the most robust way to implement this kind of API is the least technically exciting: spin off a background task that generates the large response and pushes it to cloud storage (S3 or GCS), then redirect the user to a signed URL to download the resulting file.&lt;/p&gt;
&lt;p&gt;This is easy to scale, gives users complete files with content-length headers that they know they can download (and even resume-downloading, since range headers are supported by S3 and GCS). It also avoids any issues with server restarts caused by long connections.&lt;/p&gt;
&lt;p&gt;This is how Mixpanel handle their export feature, and it's &lt;a href="https://seancoates.com/blogs/lambda-payload-size-workaround"&gt;the solution Sean Coates came to&lt;/a&gt; when trying to find a workaround for the AWS Lambda/API Gateway response size limit.&lt;/p&gt;
&lt;p&gt;If your goal is to provide your users a robust, reliable bulk-export mechanism for their data, export to cloud storage is probably the way to go.&lt;/p&gt;
&lt;p&gt;But streaming dynamic responses are a really neat trick, and I plan to keep exploring them!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/streaming"&gt;streaming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http-range-requests"&gt;http-range-requests&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="apis"/><category term="scaling"/><category term="streaming"/><category term="asgi"/><category term="http-range-requests"/></entry><entry><title>Weeknotes: datasette-ics, datasette-upload-csvs, datasette-configure-fts, asgi-csrf</title><link href="https://simonwillison.net/2020/Mar/4/weeknotes-plethora/#atom-tag" rel="alternate"/><published>2020-03-04T02:27:47+00:00</published><updated>2020-03-04T02:27:47+00:00</updated><id>https://simonwillison.net/2020/Mar/4/weeknotes-plethora/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been preparing for the &lt;a href="https://www.ire.org/events-and-training/conferences/nicar-2020"&gt;NICAR 2020&lt;/a&gt; Data Journalism conference this week which has led me into a flurry of activity across a plethora of different projects and plugins.&lt;/p&gt;

&lt;h3 id="weeknotes-24-datasette-ics"&gt;datasette-ics&lt;/h3&gt;

&lt;p&gt;NICAR publish &lt;a href="https://github.com/ireapps/nicar-2020-schedule"&gt;their schedule&lt;/a&gt; as a CSV file. I couldn't resist loading it into &lt;a href="https://nicar-2020.glitch.me/"&gt;a Datasette on Glitch&lt;/a&gt;, which inspired me to put together a plugin I've been wanting for ages: &lt;a href="https://github.com/simonw/datasette-ics"&gt;datasette-ics&lt;/a&gt;, a &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#register-output-renderer-datasette"&gt;register_output_renderer()&lt;/a&gt; plugin that can produce a subscribable iCalendar file from an arbitrary SQL query.&lt;/p&gt;

&lt;p&gt;It's based on &lt;a href="https://github.com/simonw/datasette-atom"&gt;datasette-atom&lt;/a&gt; and works in a similar way: you construct a query that outputs a required set of columns (&lt;code&gt;event_name&lt;/code&gt; and &lt;code&gt;event_dtstart&lt;/code&gt; as a minimum), then add the &lt;code&gt;.ics&lt;/code&gt; extension to get back an iCalendar file.&lt;/p&gt;

&lt;p&gt;You can optionally also include &lt;code&gt;event_dtend&lt;/code&gt;, &lt;code&gt;event_duration&lt;/code&gt;, &lt;code&gt;event_description&lt;/code&gt;, &lt;code&gt;event_uid&lt;/code&gt; and most importantly &lt;code&gt;event_tz&lt;/code&gt;, which can contain a timezone string. Figuring out how to handle timezones was &lt;a href="https://github.com/simonw/datasette-ics/issues/1"&gt;the fiddliest part of the project&lt;/a&gt;.&lt;/p&gt;
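&lt;p&gt;A sketch of what such a query might look like, using an invented &lt;code&gt;sessions&lt;/code&gt; table and aliasing its columns to the names the renderer expects:&lt;/p&gt;

```python
import sqlite3

# Hypothetical conference schedule table - the real data would come
# from the NICAR CSV file
conn = sqlite3.connect(":memory:")
conn.execute("create table sessions (title text, start text, room text)")
conn.execute(
    "insert into sessions values "
    "('Intro to Datasette', '2020-03-05 09:00:00', 'Salon A')"
)

# Alias the columns to the names datasette-ics looks for:
# event_name and event_dtstart are required, event_tz is optional
rows = conn.execute(
    """
    select
        title as event_name,
        start as event_dtstart,
        'America/New_York' as event_tz
    from sessions
    """
).fetchall()
```

Adding the `.ics` extension to the query's URL would then render those rows as an iCalendar file.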

&lt;p&gt;If you're going to NICAR, subscribe to &lt;a href="https://nicar-2020.glitch.me/data/calendar.ics"&gt;https://nicar-2020.glitch.me/data/calendar.ics&lt;/a&gt; in a calendar application to get the full 261 item schedule.&lt;/p&gt;

&lt;p&gt;If you just want to see what the iCalendar feed looks like, add &lt;code&gt;?_plain=1&lt;/code&gt; to preview it with a &lt;code&gt;text/plain&lt;/code&gt; content type: &lt;a href="https://nicar-2020.glitch.me/data/calendar.ics?_plain=1"&gt;https://nicar-2020.glitch.me/data/calendar.ics?_plain=1&lt;/a&gt; - and here's &lt;a href="https://nicar-2020.glitch.me/data/calendar"&gt;the SQL query&lt;/a&gt; that powers it.&lt;/p&gt;

&lt;h3 id="weeknotes-24-datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/h3&gt;

&lt;p&gt;My work on &lt;a href="https://simonwillison.net/tags/datasettecloud/"&gt;Datasette Cloud&lt;/a&gt; is inspiring all kinds of interesting work on plugins. I released &lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt; a while ago, but now that Datasette has &lt;a href="https://simonwillison.net/2020/Feb/26/weeknotes-datasette-writes/"&gt;official write support&lt;/a&gt; I've been upgrading the plugin to hopefully achieve its full potential.&lt;/p&gt;

&lt;p&gt;In particular, I've been improving its usability. CSV files can be big - and if you're uploading 100MB of CSV it's not particularly reassuring if your browser just sits for a few minutes spinning on the status bar.&lt;/p&gt;

&lt;p&gt;So I added two progress bars to the plugin. The first is a client-side progress bar that shows you the progress of the initial file upload. I used the &lt;code&gt;XMLHttpRequest&lt;/code&gt; pattern (and the drag-and-drop recipe) from Joseph Zimmerman's useful article &lt;a href="https://www.smashingmagazine.com/2018/01/drag-drop-file-uploader-vanilla-js/"&gt;How To Make A Drag-and-Drop File Uploader With Vanilla JavaScript&lt;/a&gt; - &lt;code&gt;fetch()&lt;/code&gt; doesn't reliably report upload progress just yet.&lt;/p&gt;

&lt;p&gt;I'm using &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt; and &lt;code&gt;asyncio&lt;/code&gt; so uploading large files doesn't tie up server resources in the same way that it would if I was using processes and threads.&lt;/p&gt;

&lt;p&gt;The second progress bar relates to server-side processing of the file: churning through 100,000 rows of CSV data and inserting them into SQLite can take a while, and I wanted users to be able to see what was going on.&lt;/p&gt;

&lt;p&gt;Here's an animated screenshot of how the interface looks now:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/upload-csvs.gif" style="max-width: 100%" alt="Uploading a CSV" /&gt;&lt;/p&gt;

&lt;p&gt;Implementing this was trickier. In the end I took advantage of the new dedicated write thread made available by &lt;code&gt;datasette.execute_write_fn()&lt;/code&gt; - since that thread has exclusive access to write to the database, I create a SQLite table called &lt;code&gt;_csv_progress_&lt;/code&gt; and write a new record to it every 10 rows. I use the number of bytes in the CSV file as the total and track how far through that file Python's CSV parser has got using &lt;code&gt;file.tell()&lt;/code&gt;.&lt;/p&gt;
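&lt;p&gt;A rough stdlib-only sketch of the same idea (the real plugin routes its writes through &lt;code&gt;datasette.execute_write_fn()&lt;/code&gt;; the column names on the progress table here are illustrative): track how far the CSV parser has read using &lt;code&gt;file.tell()&lt;/code&gt; and record it every few rows.&lt;/p&gt;

```python
import csv
import io
import sqlite3


def import_csv_with_progress(conn, csv_text, table="data", every=10):
    # Track progress by how far through the CSV text the parser has read,
    # using file.tell(), and record it in a _csv_progress_ table.
    total = len(csv_text)
    f = io.StringIO(csv_text)
    reader = csv.reader(f)
    headers = next(reader)
    conn.execute(
        'CREATE TABLE "{}" ({})'.format(
            table, ", ".join('"{}"'.format(h) for h in headers)
        )
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS _csv_progress_ "
        "(table_name TEXT PRIMARY KEY, bytes_done INTEGER, bytes_total INTEGER)"
    )
    placeholders = ", ".join("?" for _ in headers)
    for i, row in enumerate(reader, 1):
        conn.execute(
            'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), row
        )
        if i % every == 0:
            conn.execute(
                "INSERT OR REPLACE INTO _csv_progress_ VALUES (?, ?, ?)",
                (table, f.tell(), total),
            )
    # A final record marks the import as complete
    conn.execute(
        "INSERT OR REPLACE INTO _csv_progress_ VALUES (?, ?, ?)",
        (table, total, total),
    )
    conn.commit()
```

&lt;p&gt;A progress bar can then poll &lt;code&gt;SELECT bytes_done, bytes_total FROM _csv_progress_&lt;/code&gt; while the import runs.&lt;/p&gt;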

&lt;p&gt;It seems to work really well. The full &lt;a href="https://github.com/simonw/datasette-upload-csvs/blob/013d540797b2600bb34cfb8a923386d83f5ff25d/datasette_upload_csvs/app.py#L30-L123"&gt;server-side code is here&lt;/a&gt; - the progress bar itself then &lt;a href="https://github.com/simonw/datasette-upload-csvs/blob/013d540797b2600bb34cfb8a923386d83f5ff25d/datasette_upload_csvs/templates/upload_csv.html#L122-L145"&gt;polls Datasette's JSON API&lt;/a&gt; for the record in the &lt;code&gt;_csv_progress_&lt;/code&gt; table.&lt;/p&gt;

&lt;h3 id="weeknotes-24-datasette-configure-fts"&gt;datasette-configure-fts&lt;/h3&gt;

&lt;p&gt;SQLite ships with &lt;a href="https://www.sqlite.org/fts5.html"&gt;a decent implementation&lt;/a&gt; of full-text search. Datasette knows how to tell if a table has been configured for full-text search and adds a search box to the table page, &lt;a href="https://datasette.readthedocs.io/en/stable/full_text_search.html"&gt;documented here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-configure-fts"&gt;datasette-configure-fts&lt;/a&gt; is a new plugin that provides an interface for configuring search against existing SQLite tables. Under the hood it uses the &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html#enabling-full-text-search"&gt;sqlite-utils full-text search methods&lt;/a&gt; to configure the table and set up triggers to keep the index updated as data in the table changes.&lt;/p&gt;
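&lt;p&gt;Roughly what those sqlite-utils methods set up under the hood: an external-content FTS5 table plus triggers that keep it in sync with the source table. This stdlib sketch only creates the insert trigger - sqlite-utils also handles updates and deletes.&lt;/p&gt;

```python
import sqlite3


def enable_fts(conn, table, columns):
    # Create an external-content FTS5 table named <table>_fts, populate it
    # from the source table, and add an AFTER INSERT trigger to keep it
    # updated as new rows arrive.
    cols = ", ".join(columns)
    new_cols = ", ".join("new." + c for c in columns)
    conn.executescript(
        """
        CREATE VIRTUAL TABLE {fts} USING fts5({cols}, content='{table}');
        INSERT INTO {fts}(rowid, {cols}) SELECT rowid, {cols} FROM {table};
        CREATE TRIGGER {table}_ai AFTER INSERT ON {table} BEGIN
            INSERT INTO {fts}(rowid, {cols}) VALUES (new.rowid, {new_cols});
        END;
        """.format(fts=table + "_fts", table=table, cols=cols, new_cols=new_cols)
    )
```

&lt;p&gt;After calling &lt;code&gt;enable_fts(conn, "docs", ["title", "body"])&lt;/code&gt;, queries like &lt;code&gt;SELECT title FROM docs_fts WHERE docs_fts MATCH 'search term'&lt;/code&gt; will work, including against rows inserted later.&lt;/p&gt;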

&lt;p&gt;It's pretty simple, but it means that users of Datasette Cloud can upload a potentially enormous CSV file and then click to set specific columns as searchable. It's a fun example of the kind of thing that can be built with Datasette's new write capabilities.&lt;/p&gt;

&lt;h3 id="weeknotes-24-asgi-csrf"&gt;asgi-csrf&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://simonwillison.net/tags/csrf/"&gt;CSRF&lt;/a&gt; is one of my favourite web application security vulnerabilities - I first wrote about it on this blog &lt;a href="https://simonwillison.net/2005/May/6/bad/"&gt;back in 2005&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;I was surprised to see that the Starlette/ASGI ecosystem doesn't yet have much in the way of CSRF prevention. The best option I could find was to use &lt;a href="https://wtforms.readthedocs.io/en/stable/csrf.html"&gt;the WTForms library&lt;/a&gt; with Starlette.&lt;/p&gt;

&lt;p&gt;I don't need a full forms library for my purposes (at least not yet) but I needed CSRF protection for &lt;code&gt;datasette-configure-fts&lt;/code&gt;, so I've started working on a small ASGI middleware library called &lt;a href="https://github.com/simonw/asgi-csrf"&gt;asgi-csrf&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's modelled on a subset of Django's &lt;a href="https://github.com/django/django/blob/3.0.3/django/middleware/csrf.py"&gt;robust CSRF prevention&lt;/a&gt;. The README warns people NOT to trust it yet - there are still &lt;a href="https://owasp.org/www-project-cheat-sheets/cheatsheets/Cross-Site_Request_Forgery_Prevention_Cheat_Sheet#double-submit-cookie"&gt;some OWASP recommendations&lt;/a&gt; that it needs to apply (&lt;a href="https://github.com/simonw/asgi-csrf/issues/2"&gt;issue here&lt;/a&gt;) and I'm not yet ready to declare it robust and secure. It's a start though, and feels like exactly the kind of problem that ASGI middleware is meant to address.&lt;/p&gt;
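&lt;p&gt;To illustrate the general shape of the problem (this is a toy, not the asgi-csrf implementation - the cookie and header names are made up for the example): a double-submit-cookie check as ASGI middleware might look something like this.&lt;/p&gt;

```python
import secrets
from http.cookies import SimpleCookie


class CsrfMiddleware:
    # Reject unsafe requests unless the csrftoken cookie value is echoed
    # back in an x-csrftoken header - something an attacker's cross-site
    # form submission cannot do.
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http" and scope["method"] not in ("GET", "HEAD", "OPTIONS"):
            headers = dict(scope.get("headers") or [])
            cookies = SimpleCookie(headers.get(b"cookie", b"").decode("latin-1"))
            cookie_token = cookies["csrftoken"].value if "csrftoken" in cookies else ""
            header_token = headers.get(b"x-csrftoken", b"").decode("latin-1")
            # compare_digest avoids leaking the token through timing
            if not (cookie_token and secrets.compare_digest(cookie_token, header_token)):
                await send({
                    "type": "http.response.start",
                    "status": 403,
                    "headers": [(b"content-type", b"text/plain")],
                })
                await send({"type": "http.response.body", "body": b"CSRF check failed"})
                return
        await self.app(scope, receive, send)
```

&lt;p&gt;The real library has to handle quite a bit more than this - extracting tokens from submitted forms, setting the cookie in the first place, and the OWASP recommendations mentioned above.&lt;/p&gt;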
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csrf"&gt;csrf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/icalendar"&gt;icalendar&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csrf"/><category term="data-journalism"/><category term="icalendar"/><category term="plugins"/><category term="projects"/><category term="search"/><category term="security"/><category term="datasette"/><category term="asgi"/><category term="weeknotes"/><category term="datasette-cloud"/></entry><entry><title>Async Support - HTTPX</title><link href="https://simonwillison.net/2020/Jan/10/httpx/#atom-tag" rel="alternate"/><published>2020-01-10T04:49:59+00:00</published><updated>2020-01-10T04:49:59+00:00</updated><id>https://simonwillison.net/2020/Jan/10/httpx/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.python-httpx.org/async/"&gt;Async Support - HTTPX&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
HTTPX is the new async-friendly HTTP library for Python spearheaded by Tom Christie. It works in both async and non-async mode with an API very similar to requests. The async support is particularly interesting - it's a really clean API, and now that Jupyter supports top-level await you can run &lt;code&gt;(await httpx.AsyncClient().get(url)).text&lt;/code&gt; directly in a cell and get back the response. Most excitingly the library lets you pass an ASGI app directly to the client and then perform requests against it - ideal for unit tests.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/_tomchristie/status/1215240517962870784"&gt;@_tomchristie&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tom-christie"&gt;tom-christie&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/httpx"&gt;httpx&lt;/a&gt;&lt;/p&gt;



</summary><category term="async"/><category term="http"/><category term="python"/><category term="asgi"/><category term="tom-christie"/><category term="httpx"/></entry><entry><title>Logging to SQLite using ASGI middleware</title><link href="https://simonwillison.net/2019/Dec/16/logging-sqlite-asgi-middleware/#atom-tag" rel="alternate"/><published>2019-12-16T22:30:46+00:00</published><updated>2019-12-16T22:30:46+00:00</updated><id>https://simonwillison.net/2019/Dec/16/logging-sqlite-asgi-middleware/#atom-tag</id><summary type="html">
    &lt;p&gt;I had some fun playing around with &lt;a href="https://asgi.readthedocs.io/en/latest/specs/main.html#middleware"&gt;ASGI middleware&lt;/a&gt; and logging during our flight back to England for the holidays.&lt;/p&gt;
&lt;h3 id="asgi-log-to-sqlite"&gt;asgi-log-to-sqlite&lt;/h3&gt;
&lt;p&gt;I decided to experiment with SQLite as a logging mechanism. I wouldn’t use this on a high traffic site, but most of my Datasette related projects are small enough that logging HTTP traffic directly to a SQLite database feels like it should work reasonably well.&lt;/p&gt;
&lt;p&gt;Once your logs are in a SQLite database, you can use &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt; to analyze them. I think this could be a lot of fun.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/asgi-log-to-sqlite"&gt;asgi-log-to-sqlite&lt;/a&gt; is my first exploration of this idea. It’s a piece of ASGI middleware which wraps an ASGI application and then logs relevant information from the request and response to an attached SQLite database.&lt;/p&gt;
&lt;p&gt;You use it like this:&lt;/p&gt;
&lt;pre class=" language-python"&gt;&lt;code class="prism  language-python"&gt;&lt;span class="token keyword"&gt;from&lt;/span&gt; asgi_log_to_sqlite &lt;span class="token keyword"&gt;import&lt;/span&gt; AsgiLogToSqlite
&lt;span class="token keyword"&gt;from&lt;/span&gt; my_asgi_app &lt;span class="token keyword"&gt;import&lt;/span&gt; app

app &lt;span class="token operator"&gt;=&lt;/span&gt; AsgiLogToSqlite&lt;span class="token punctuation"&gt;(&lt;/span&gt;app&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token string"&gt;"/tmp/log.db"&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s a demo Datasette instance showing logs from my testing: &lt;a href="https://asgi-log-demo-j7hipcg4aq-uc.a.run.app"&gt;asgi-log-demo-j7hipcg4aq-uc.a.run.app&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As always with Datasette, the data is at its most interesting once you &lt;a href="https://asgi-log-demo-j7hipcg4aq-uc.a.run.app/asgi-log-demo/requests?_sort_desc=rowid&amp;amp;_facet=path&amp;amp;_facet=user_agent&amp;amp;_facet=content_type#facet-content_type"&gt;apply some facets&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="intercepting-requests-to-and-from-the-wrapped-asgi-app"&gt;Intercepting requests to and from the wrapped ASGI app&lt;/h3&gt;
&lt;p&gt;There are a couple of interesting parts of the implementation. The first is how the information is gathered from the request and response.&lt;/p&gt;
&lt;p&gt;This is a classic pattern for ASGI middleware. &lt;a href="https://asgi.readthedocs.io/en/latest/specs/main.html#applications"&gt;The ASGI protocol&lt;/a&gt; has three key components: a &lt;code&gt;scope&lt;/code&gt; dictionary describing the incoming request, and two async functions called &lt;code&gt;receive&lt;/code&gt; and &lt;code&gt;send&lt;/code&gt; which are used to retrieve and send data to the connected client (usually a browser).&lt;/p&gt;
&lt;p&gt;Most middleware works by wrapping those functions with custom replacements. That’s what I’m doing here:&lt;/p&gt;
&lt;pre class=" language-python"&gt;&lt;code class="prism  language-python"&gt;&lt;span class="token keyword"&gt;class&lt;/span&gt; &lt;span class="token class-name"&gt;AsgiLogToSqlite&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt;
    &lt;span class="token keyword"&gt;def&lt;/span&gt; &lt;span class="token function"&gt;__init__&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;self&lt;span class="token punctuation"&gt;,&lt;/span&gt; app&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token builtin"&gt;file&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt;
        self&lt;span class="token punctuation"&gt;.&lt;/span&gt;app &lt;span class="token operator"&gt;=&lt;/span&gt; app
        self&lt;span class="token punctuation"&gt;.&lt;/span&gt;db &lt;span class="token operator"&gt;=&lt;/span&gt; sqlite_utils&lt;span class="token punctuation"&gt;.&lt;/span&gt;Database&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token builtin"&gt;file&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
    &lt;span class="token comment"&gt;# ...&lt;/span&gt;
    &lt;span class="token keyword"&gt;async&lt;/span&gt; &lt;span class="token keyword"&gt;def&lt;/span&gt; &lt;span class="token function"&gt;__call__&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;self&lt;span class="token punctuation"&gt;,&lt;/span&gt; scope&lt;span class="token punctuation"&gt;,&lt;/span&gt; receive&lt;span class="token punctuation"&gt;,&lt;/span&gt; send&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt;
        response_headers &lt;span class="token operator"&gt;=&lt;/span&gt; &lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;
        body_size &lt;span class="token operator"&gt;=&lt;/span&gt; &lt;span class="token number"&gt;0&lt;/span&gt;
        http_status &lt;span class="token operator"&gt;=&lt;/span&gt; &lt;span class="token boolean"&gt;None&lt;/span&gt;

        &lt;span class="token keyword"&gt;async&lt;/span&gt; &lt;span class="token keyword"&gt;def&lt;/span&gt; &lt;span class="token function"&gt;wrapped_send&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;message&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt;
            &lt;span class="token keyword"&gt;nonlocal&lt;/span&gt; body_size&lt;span class="token punctuation"&gt;,&lt;/span&gt; response_headers&lt;span class="token punctuation"&gt;,&lt;/span&gt; http_status
            &lt;span class="token keyword"&gt;if&lt;/span&gt; message&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"type"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt; &lt;span class="token operator"&gt;==&lt;/span&gt; &lt;span class="token string"&gt;"http.response.start"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt;
                response_headers &lt;span class="token operator"&gt;=&lt;/span&gt; message&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"headers"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;
                http_status &lt;span class="token operator"&gt;=&lt;/span&gt; message&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"status"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;

            &lt;span class="token keyword"&gt;if&lt;/span&gt; message&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"type"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt; &lt;span class="token operator"&gt;==&lt;/span&gt; &lt;span class="token string"&gt;"http.response.body"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt;
                body_size &lt;span class="token operator"&gt;+=&lt;/span&gt; &lt;span class="token builtin"&gt;len&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;message&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"body"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;

            &lt;span class="token keyword"&gt;await&lt;/span&gt; send&lt;span class="token punctuation"&gt;(&lt;/span&gt;message&lt;span class="token punctuation"&gt;)&lt;/span&gt;

        start &lt;span class="token operator"&gt;=&lt;/span&gt; time&lt;span class="token punctuation"&gt;.&lt;/span&gt;time&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
        &lt;span class="token keyword"&gt;await&lt;/span&gt; self&lt;span class="token punctuation"&gt;.&lt;/span&gt;app&lt;span class="token punctuation"&gt;(&lt;/span&gt;scope&lt;span class="token punctuation"&gt;,&lt;/span&gt; receive&lt;span class="token punctuation"&gt;,&lt;/span&gt; wrapped_send&lt;span class="token punctuation"&gt;)&lt;/span&gt;
        end &lt;span class="token operator"&gt;=&lt;/span&gt; time&lt;span class="token punctuation"&gt;.&lt;/span&gt;time&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My &lt;code&gt;wrapped_send()&lt;/code&gt; function replaces the original &lt;code&gt;send()&lt;/code&gt; function with one that pulls out some of the data I want to log from the messages that are being sent to the client.&lt;/p&gt;
&lt;p&gt;I record a start time, then &lt;code&gt;await&lt;/code&gt; the original ASGI application, then record an end time when it finishes.&lt;/p&gt;
&lt;h3 id="logging-to-sqlite-using-sqlite-utils"&gt;Logging to SQLite using sqlite-utils&lt;/h3&gt;
&lt;p&gt;I’m using my &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html"&gt;sqlite-utils library&lt;/a&gt; to implement the logging. My first version looked like this:&lt;/p&gt;
&lt;pre class=" language-python"&gt;&lt;code class="prism  language-python"&gt;db&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"requests"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;.&lt;/span&gt;insert&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token punctuation"&gt;{&lt;/span&gt;
    &lt;span class="token string"&gt;"path"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; scope&lt;span class="token punctuation"&gt;.&lt;/span&gt;get&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token string"&gt;"path"&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
    &lt;span class="token string"&gt;"response_headers"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;str&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;response_headers&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
    &lt;span class="token string"&gt;"body_size"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; body_size&lt;span class="token punctuation"&gt;,&lt;/span&gt;
    &lt;span class="token string"&gt;"http_status"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; http_status&lt;span class="token punctuation"&gt;,&lt;/span&gt;
    &lt;span class="token string"&gt;"scope"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;str&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;scope&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
&lt;span class="token punctuation"&gt;}&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; alter&lt;span class="token operator"&gt;=&lt;/span&gt;&lt;span class="token boolean"&gt;True&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; automatically creates a table with the correct schema the first time you try to insert a record into it. This makes it ideal for rapid prototyping. In this case I captured stringified versions of various data structures so I could look at them in my browser with Datasette.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;alter=True&lt;/code&gt; argument here means that if I attempt to insert a new shape of record into an existing table any missing columns will be added automatically as well. Again, handy for prototyping.&lt;/p&gt;
&lt;p&gt;Based on the above, I evolved the code into recording the values I wanted to see in my logs - the full URL path, the User-Agent, the HTTP referrer, the IP and so on.&lt;/p&gt;
&lt;p&gt;This resulted in a LOT of duplicative data. Values like the path, user-agent and HTTP referrer are the same across many different requests.&lt;/p&gt;
&lt;p&gt;Regular plain text logs can solve this with gzip compression, but you can’t gzip a SQLite database and still expect it to work.&lt;/p&gt;
&lt;p&gt;Since we are logging to a relational database, we can eliminate those duplicate values using normalization. We can extract out those lengthy strings into separate lookup tables - that way we can store mostly integer foreign key references in the requests table itself.&lt;/p&gt;
&lt;p&gt;After a few iterations, my database code ended up looking like this:&lt;/p&gt;
&lt;pre class=" language-python"&gt;&lt;code class="prism  language-python"&gt;&lt;span class="token keyword"&gt;with&lt;/span&gt; db&lt;span class="token punctuation"&gt;.&lt;/span&gt;conn&lt;span class="token punctuation"&gt;:&lt;/span&gt;  &lt;span class="token comment"&gt;# Use a transaction&lt;/span&gt;
    db&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"requests"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;.&lt;/span&gt;insert&lt;span class="token punctuation"&gt;(&lt;/span&gt;
        &lt;span class="token punctuation"&gt;{&lt;/span&gt;
            &lt;span class="token string"&gt;"start"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; start&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"method"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; scope&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"method"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"path"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; lookup&lt;span class="token punctuation"&gt;(&lt;/span&gt;db&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token string"&gt;"paths"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; path&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"query_string"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; lookup&lt;span class="token punctuation"&gt;(&lt;/span&gt;db&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token string"&gt;"query_strings"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; query_string&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"user_agent"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; lookup&lt;span class="token punctuation"&gt;(&lt;/span&gt;db&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token string"&gt;"user_agents"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; user_agent&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"referer"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; lookup&lt;span class="token punctuation"&gt;(&lt;/span&gt;db&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token string"&gt;"referers"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; referer&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"accept_language"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; lookup&lt;span class="token punctuation"&gt;(&lt;/span&gt;db&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token string"&gt;"accept_languages"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; accept_language&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"http_status"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; http_status&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"content_type"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; lookup&lt;span class="token punctuation"&gt;(&lt;/span&gt;db&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token string"&gt;"content_types"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; content_type&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"client_ip"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; scope&lt;span class="token punctuation"&gt;.&lt;/span&gt;get&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token string"&gt;"client"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token boolean"&gt;None&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token boolean"&gt;None&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token number"&gt;0&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"duration"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; end &lt;span class="token operator"&gt;-&lt;/span&gt; start&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token string"&gt;"body_size"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; body_size&lt;span class="token punctuation"&gt;,&lt;/span&gt;
        &lt;span class="token punctuation"&gt;}&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
        alter&lt;span class="token operator"&gt;=&lt;/span&gt;&lt;span class="token boolean"&gt;True&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
        foreign_keys&lt;span class="token operator"&gt;=&lt;/span&gt;self&lt;span class="token punctuation"&gt;.&lt;/span&gt;lookup_columns&lt;span class="token punctuation"&gt;,&lt;/span&gt;
    &lt;span class="token punctuation"&gt;)&lt;/span&gt;


&lt;span class="token keyword"&gt;def&lt;/span&gt; &lt;span class="token function"&gt;lookup&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;db&lt;span class="token punctuation"&gt;,&lt;/span&gt; table&lt;span class="token punctuation"&gt;,&lt;/span&gt; value&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt;
    &lt;span class="token keyword"&gt;return&lt;/span&gt; db&lt;span class="token punctuation"&gt;[&lt;/span&gt;table&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;.&lt;/span&gt;lookup&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token punctuation"&gt;{&lt;/span&gt;
        &lt;span class="token string"&gt;"name"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; value
    &lt;span class="token punctuation"&gt;}&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt; &lt;span class="token keyword"&gt;if&lt;/span&gt; value &lt;span class="token keyword"&gt;else&lt;/span&gt; &lt;span class="token boolean"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html#working-with-lookup-tables"&gt;table.lookup() method&lt;/a&gt; in &lt;code&gt;sqlite-utils&lt;/code&gt; is designed for exactly this use-case. If you pass it a value (or multiple values) it will ensure the underlying table has those columns with a unique index on them, then get-or-insert your data and return you the primary key.&lt;/p&gt;
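&lt;p&gt;A stdlib-only sketch of what &lt;code&gt;table.lookup()&lt;/code&gt; is doing - a get-or-insert against a unique column that returns the primary key (named &lt;code&gt;get_or_insert&lt;/code&gt; here to avoid clashing with the &lt;code&gt;lookup()&lt;/code&gt; helper above):&lt;/p&gt;

```python
import sqlite3


def get_or_insert(conn, table, value):
    # Ensure the lookup table exists with a unique "name" column, then
    # return the primary key of the row holding value, inserting it first
    # if necessary.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS "{t}" (id INTEGER PRIMARY KEY, name TEXT)'.format(t=table)
    )
    conn.execute(
        'CREATE UNIQUE INDEX IF NOT EXISTS "idx_{t}_name" ON "{t}" (name)'.format(t=table)
    )
    row = conn.execute(
        'SELECT id FROM "{t}" WHERE name = ?'.format(t=table), (value,)
    ).fetchone()
    if row:
        return row[0]
    return conn.execute(
        'INSERT INTO "{t}" (name) VALUES (?)'.format(t=table), (value,)
    ).lastrowid
```

&lt;p&gt;Calling it twice with the same value returns the same primary key, which is exactly what you want when normalizing repeated strings like paths and user-agents.&lt;/p&gt;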
&lt;p&gt;Automatically creating tables is fine for an initial prototype, but it starts getting a little messy once you have foreign key relationships that you need to be able to rely on. I moved to explicit table creation in &lt;a href="https://github.com/simonw/asgi-log-to-sqlite/blob/5e58e577fea4bd99a7ae5e61b8d389684d55389c/asgi_log_to_sqlite.py#L21-L43"&gt;an ensure_tables() method&lt;/a&gt; that’s called once when the middleware class is used to wrap the underlying ASGI app:&lt;/p&gt;
&lt;pre class=" language-python"&gt;&lt;code class="prism  language-python"&gt;    lookup_columns &lt;span class="token operator"&gt;=&lt;/span&gt; &lt;span class="token punctuation"&gt;(&lt;/span&gt;
        &lt;span class="token string"&gt;"path"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
        &lt;span class="token string"&gt;"user_agent"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
        &lt;span class="token string"&gt;"referer"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
        &lt;span class="token string"&gt;"accept_language"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
        &lt;span class="token string"&gt;"content_type"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
        &lt;span class="token string"&gt;"query_string"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
    &lt;span class="token punctuation"&gt;)&lt;/span&gt;

    &lt;span class="token keyword"&gt;def&lt;/span&gt; &lt;span class="token function"&gt;ensure_tables&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;self&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt;
        &lt;span class="token keyword"&gt;for&lt;/span&gt; column &lt;span class="token keyword"&gt;in&lt;/span&gt; self&lt;span class="token punctuation"&gt;.&lt;/span&gt;lookup_columns&lt;span class="token punctuation"&gt;:&lt;/span&gt;
            table &lt;span class="token operator"&gt;=&lt;/span&gt; &lt;span class="token string"&gt;"{}s"&lt;/span&gt;&lt;span class="token punctuation"&gt;.&lt;/span&gt;&lt;span class="token builtin"&gt;format&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;column&lt;span class="token punctuation"&gt;)&lt;/span&gt;
            &lt;span class="token keyword"&gt;if&lt;/span&gt; &lt;span class="token operator"&gt;not&lt;/span&gt; self&lt;span class="token punctuation"&gt;.&lt;/span&gt;db&lt;span class="token punctuation"&gt;[&lt;/span&gt;table&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;.&lt;/span&gt;exists&lt;span class="token punctuation"&gt;:&lt;/span&gt;
                self&lt;span class="token punctuation"&gt;.&lt;/span&gt;db&lt;span class="token punctuation"&gt;[&lt;/span&gt;table&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;.&lt;/span&gt;create&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token punctuation"&gt;{&lt;/span&gt;
                    &lt;span class="token string"&gt;"id"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;int&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                    &lt;span class="token string"&gt;"name"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;str&lt;/span&gt;
                &lt;span class="token punctuation"&gt;}&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; pk&lt;span class="token operator"&gt;=&lt;/span&gt;&lt;span class="token string"&gt;"id"&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
        &lt;span class="token keyword"&gt;if&lt;/span&gt; &lt;span class="token operator"&gt;not&lt;/span&gt; self&lt;span class="token punctuation"&gt;.&lt;/span&gt;db&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"requests"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;.&lt;/span&gt;exists&lt;span class="token punctuation"&gt;:&lt;/span&gt;
            self&lt;span class="token punctuation"&gt;.&lt;/span&gt;db&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"requests"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;.&lt;/span&gt;create&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token punctuation"&gt;{&lt;/span&gt;
                &lt;span class="token string"&gt;"start"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;float&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"method"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;str&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"path"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;int&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"query_string"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;int&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"user_agent"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;int&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"referer"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;int&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"accept_language"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;int&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"http_status"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;int&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"content_type"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;int&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"client_ip"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;str&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"duration"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;float&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
                &lt;span class="token string"&gt;"body_size"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token builtin"&gt;int&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
            &lt;span class="token punctuation"&gt;}&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; foreign_keys&lt;span class="token operator"&gt;=&lt;/span&gt;self&lt;span class="token punctuation"&gt;.&lt;/span&gt;lookup_columns&lt;span class="token punctuation"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’m increasingly using this pattern in my &lt;code&gt;sqlite-utils&lt;/code&gt; projects. It’s not a full-grown migrations system, but it’s a pretty low-effort way of creating tables correctly if they don’t yet exist.&lt;/p&gt;
&lt;p&gt;Here’s &lt;a href="https://github.com/simonw/asgi-log-to-sqlite/blob/5e58e577fea4bd99a7ae5e61b8d389684d55389c/asgi_log_to_sqlite.py"&gt;the full implementation of the middleware&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="configuring-the-middleware-for-use-with-datasette"&gt;Configuring the middleware for use with Datasette&lt;/h3&gt;
&lt;p&gt;Publishing standalone ASGI middleware for this kind of thing is neat because it can be used with any ASGI application, not just with Datasette.&lt;/p&gt;
&lt;p&gt;To make it as easy as possible to use with Datasette, I also want it available as a plugin.&lt;/p&gt;
&lt;p&gt;I’ve tried two different patterns for this in the past.&lt;/p&gt;
&lt;p&gt;My first ASGI middleware was &lt;a href="https://github.com/simonw/asgi-cors"&gt;asgi-cors&lt;/a&gt;. I published that as two separate packages to PyPI: &lt;code&gt;asgi-cors&lt;/code&gt; is the middleware itself, and &lt;a href="https://github.com/simonw/datasette-cors"&gt;datasette-cors&lt;/a&gt; is a very thin plugin wrapper around it that hooks into Datasette’s &lt;a href="https://datasette.readthedocs.io/en/0.32/plugins.html#plugin-configuration"&gt;plugin configuration mechanism&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For &lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt; I decided not to publish two packages. Instead I published a single plugin package and then described how to use it as standalone ASGI middleware in its documentation.&lt;/p&gt;
&lt;p&gt;This lazier approach is confusing: it’s not at all clear that a package called &lt;code&gt;datasette-auth-github&lt;/code&gt; can be used independently of Datasette. But I did get to avoid having to publish two packages.&lt;/p&gt;
&lt;h3 id="datasette-configure-asgi"&gt;datasette-configure-asgi&lt;/h3&gt;
&lt;p&gt;Since I want to do a lot more experiments with ASGI plugins in the future, I decided to try solving the ASGI configuration issue once and for all. I built a new experimental plugin, &lt;a href="https://github.com/simonw/datasette-configure-asgi"&gt;datasette-configure-asgi&lt;/a&gt; which can be used to configure ANY ASGI middleware that conforms to an expected protocol.&lt;/p&gt;
&lt;p&gt;Here’s what that looks like at the configuration level, using &lt;a href="https://datasette.readthedocs.io/en/0.32/metadata.html"&gt;a metadata.json&lt;/a&gt; settings file (which &lt;a href="https://github.com/simonw/datasette/issues/493"&gt;I should really rename&lt;/a&gt; since it’s more about configuration than metadata these days):&lt;/p&gt;
&lt;pre class=" language-json"&gt;&lt;code class="prism  language-json"&gt;&lt;span class="token punctuation"&gt;{&lt;/span&gt;
  &lt;span class="token string"&gt;"plugins"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token punctuation"&gt;{&lt;/span&gt;
    &lt;span class="token string"&gt;"datasette-configure-asgi"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token punctuation"&gt;[&lt;/span&gt;
      &lt;span class="token punctuation"&gt;{&lt;/span&gt;
        &lt;span class="token string"&gt;"class"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token string"&gt;"asgi_log_to_sqlite.AsgiLogToSqlite"&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt;
        &lt;span class="token string"&gt;"args"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token punctuation"&gt;{&lt;/span&gt;
          &lt;span class="token string"&gt;"file"&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt; &lt;span class="token string"&gt;"/tmp/log.db"&lt;/span&gt;
        &lt;span class="token punctuation"&gt;}&lt;/span&gt;
      &lt;span class="token punctuation"&gt;}&lt;/span&gt;
    &lt;span class="token punctuation"&gt;]&lt;/span&gt;
  &lt;span class="token punctuation"&gt;}&lt;/span&gt;
&lt;span class="token punctuation"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The implementation of this plugin is very simple: here’s the entire thing:&lt;/p&gt;
&lt;pre class=" language-python"&gt;&lt;code class="prism  language-python"&gt;&lt;span class="token keyword"&gt;from&lt;/span&gt; datasette &lt;span class="token keyword"&gt;import&lt;/span&gt; hookimpl
&lt;span class="token keyword"&gt;import&lt;/span&gt; importlib


@hookimpl
&lt;span class="token keyword"&gt;def&lt;/span&gt; &lt;span class="token function"&gt;asgi_wrapper&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;datasette&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt;
    &lt;span class="token keyword"&gt;def&lt;/span&gt; &lt;span class="token function"&gt;wrap_with_classes&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;app&lt;span class="token punctuation"&gt;)&lt;/span&gt;&lt;span class="token punctuation"&gt;:&lt;/span&gt;
        configs &lt;span class="token operator"&gt;=&lt;/span&gt; datasette&lt;span class="token punctuation"&gt;.&lt;/span&gt;plugin_config&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token string"&gt;"datasette-configure-asgi"&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt; &lt;span class="token operator"&gt;or&lt;/span&gt; &lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;
        &lt;span class="token keyword"&gt;for&lt;/span&gt; config &lt;span class="token keyword"&gt;in&lt;/span&gt; configs&lt;span class="token punctuation"&gt;:&lt;/span&gt;
            module_path&lt;span class="token punctuation"&gt;,&lt;/span&gt; class_name &lt;span class="token operator"&gt;=&lt;/span&gt; config&lt;span class="token punctuation"&gt;[&lt;/span&gt;&lt;span class="token string"&gt;"class"&lt;/span&gt;&lt;span class="token punctuation"&gt;]&lt;/span&gt;&lt;span class="token punctuation"&gt;.&lt;/span&gt;rsplit&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token string"&gt;"."&lt;/span&gt;&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token number"&gt;1&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt;
            mod &lt;span class="token operator"&gt;=&lt;/span&gt; importlib&lt;span class="token punctuation"&gt;.&lt;/span&gt;import_module&lt;span class="token punctuation"&gt;(&lt;/span&gt;module_path&lt;span class="token punctuation"&gt;)&lt;/span&gt;
            klass &lt;span class="token operator"&gt;=&lt;/span&gt; &lt;span class="token builtin"&gt;getattr&lt;/span&gt;&lt;span class="token punctuation"&gt;(&lt;/span&gt;mod&lt;span class="token punctuation"&gt;,&lt;/span&gt; class_name&lt;span class="token punctuation"&gt;)&lt;/span&gt;
            args &lt;span class="token operator"&gt;=&lt;/span&gt; config&lt;span class="token punctuation"&gt;.&lt;/span&gt;get&lt;span class="token punctuation"&gt;(&lt;/span&gt;&lt;span class="token string"&gt;"args"&lt;/span&gt;&lt;span class="token punctuation"&gt;)&lt;/span&gt; &lt;span class="token operator"&gt;or&lt;/span&gt; &lt;span class="token punctuation"&gt;{&lt;/span&gt;&lt;span class="token punctuation"&gt;}&lt;/span&gt;
            app &lt;span class="token operator"&gt;=&lt;/span&gt; klass&lt;span class="token punctuation"&gt;(&lt;/span&gt;app&lt;span class="token punctuation"&gt;,&lt;/span&gt; &lt;span class="token operator"&gt;**&lt;/span&gt;args&lt;span class="token punctuation"&gt;)&lt;/span&gt;
        &lt;span class="token keyword"&gt;return&lt;/span&gt; app

    &lt;span class="token keyword"&gt;return&lt;/span&gt; wrap_with_classes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It hooks into the &lt;a href="https://datasette.readthedocs.io/en/0.32/plugins.html#asgi-wrapper-datasette"&gt;asgi_wrapper plugin hook&lt;/a&gt;, reads its configuration from the &lt;code&gt;datasette&lt;/code&gt; object (using &lt;a href="https://datasette.readthedocs.io/en/0.32/plugins.html#writing-plugins-that-accept-configuration"&gt;plugin_config()&lt;/a&gt;), then loops through the list of configured plugins and dynamically loads each implementation using &lt;a href="https://docs.python.org/3/library/importlib.html#importlib.import_module"&gt;importlib&lt;/a&gt;. Then it wraps the ASGI app with each of them in turn.&lt;/p&gt;
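&lt;p&gt;That dynamic loading step is just &lt;code&gt;rsplit&lt;/code&gt; plus &lt;code&gt;importlib&lt;/code&gt;. Here’s a tiny standalone sketch of the same technique, separated out from the plugin (the standard library class used to demonstrate it is arbitrary):&lt;/p&gt;

```python
import importlib


def load_class(dotted_path):
    # Split "package.module.ClassName" into a module path and a class
    # name, import the module, then pull the class off it - the same
    # technique datasette-configure-asgi uses for its "class" option.
    module_path, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Any importable class works, for example one from the standard library:
OrderedDict = load_class("collections.OrderedDict")
```

One nice property of this approach is that the middleware being configured needs no knowledge of Datasette at all - it just has to accept the wrapped app as its first constructor argument.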
&lt;h3 id="open-questions"&gt;Open questions&lt;/h3&gt;
&lt;p&gt;This is where I’ve got to with my experiments so far. Should you use this stuff in production? Almost certainly not! I wrote it on a plane just now. It definitely needs a bit more thought.&lt;/p&gt;
&lt;p&gt;A couple of obvious open questions:&lt;/p&gt;
&lt;p&gt;Python async functions &lt;strong&gt;shouldn’t make blocking calls&lt;/strong&gt;, since doing so will block the entire event loop for everyone else.&lt;/p&gt;
&lt;p&gt;Interacting with SQLite is a blocking call. Datasette works around this by &lt;a href="https://github.com/simonw/datasette/blob/d6b6c9171f3fd945c4e5e4144923ac831c43c208/datasette/database.py#L56-L67"&gt;running SQL queries in a thread pool&lt;/a&gt;; my logging plugin doesn’t bother with that.&lt;/p&gt;
&lt;p&gt;Maybe it should? My hunch is that inserting into SQLite in this way is so fast it won’t actually cause any noticeable overhead. It would be nice to test that assumption thoroughly though.&lt;/p&gt;
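&lt;p&gt;If the plugin did offload its writes, a minimal sketch could look like this - using a hypothetical &lt;code&gt;log_request()&lt;/code&gt; function and a much simpler schema than the real middleware, just to show the &lt;code&gt;run_in_executor&lt;/code&gt; pattern:&lt;/p&gt;

```python
import asyncio
import sqlite3
import time


def log_request(db_path, method, path, duration):
    # Blocking work: this runs in a worker thread, so the event loop
    # stays free to serve other requests while SQLite does its thing.
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS requests"
            " (method TEXT, path TEXT, duration REAL)"
        )
        conn.execute(
            "INSERT INTO requests (method, path, duration) VALUES (?, ?, ?)",
            (method, path, duration),
        )
    conn.close()


async def handle_request(db_path):
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    # ... generate and send the HTTP response here ...
    # Hand the blocking insert off to the default thread pool:
    await loop.run_in_executor(
        None, log_request, db_path, "GET", "/", time.perf_counter() - start
    )
```

Note that this still awaits the insert before finishing the request - a further refinement would be to fire it off without awaiting it at all.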
&lt;p&gt;&lt;strong&gt;Log rotation&lt;/strong&gt;. This is an important detail for any well designed logging system, and I’ve punted on it entirely. Figuring out an elegant way to handle this with the underlying SQLite database files would be an interesting design challenge - &lt;a href="https://github.com/simonw/asgi-log-to-sqlite/issues/1"&gt;relevant issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Would my SQLite logging middleware work with &lt;strong&gt;Django 3.0&lt;/strong&gt;? I don’t see why not - the documentation covers &lt;a href="https://docs.djangoproject.com/en/3.0/howto/deployment/asgi/#applying-asgi-middleware"&gt;how to wrap entire Django applications with ASGI middleware&lt;/a&gt;. I should try that out!&lt;/p&gt;
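&lt;p&gt;A hypothetical sketch of what that could look like in a Django project’s &lt;code&gt;asgi.py&lt;/code&gt; entrypoint - untested, with the middleware import path and &lt;code&gt;file&lt;/code&gt; argument taken from the configuration example earlier, and &lt;code&gt;mysite.settings&lt;/code&gt; standing in for a real settings module:&lt;/p&gt;

```python
# asgi.py - sketch of wrapping an entire Django application in
# ASGI middleware, following Django's documented deployment pattern.
import os

from django.core.asgi import get_asgi_application

from asgi_log_to_sqlite import AsgiLogToSqlite

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

# The ASGI server (uvicorn, daphne, ...) is pointed at this wrapped app:
application = AsgiLogToSqlite(get_asgi_application(), file="/tmp/log.db")
```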
&lt;h3 id="this-weeks-niche-museums"&gt;This week’s Niche Museums&lt;/h3&gt;
&lt;p&gt;These are technically my weeknotes, but logging experiments aside it’s been a quiet week for me.&lt;/p&gt;
&lt;p&gt;I finally added paragraph breaks to &lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt; (using &lt;a href="https://github.com/simonw/datasette-render-markdown"&gt;datasette-render-markdown&lt;/a&gt;, implementation &lt;a href="https://github.com/simonw/museums/commit/a9e105196e4987710bc982837bfda24a7aefebeb"&gt;here&lt;/a&gt;). As a result my descriptions have been getting a whole lot longer. Added this week:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/61"&gt;The Tonga Room&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/62"&gt;London Silver Vaults&lt;/a&gt; in London&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/63"&gt;Rosie the Riveter National Historical Park&lt;/a&gt; in Richmond, CA&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/64"&gt;LA Bureau of Street Lighting Museum&lt;/a&gt; in Los Angeles&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/65"&gt;Aye-Aye Island&lt;/a&gt; in Madagascar&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/66"&gt;Monarch Bear Grove&lt;/a&gt; in San Francisco&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/67"&gt;Alverstone Mead Red Squirrel Hide&lt;/a&gt; on the Isle of Wight&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/logging"&gt;logging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prototyping"&gt;prototyping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="django"/><category term="logging"/><category term="projects"/><category term="prototyping"/><category term="sqlite"/><category term="datasette"/><category term="asgi"/><category term="weeknotes"/><category term="sqlite-utils"/></entry><entry><title>Datasette 0.31</title><link href="https://simonwillison.net/2019/Nov/12/datasette/#atom-tag" rel="alternate"/><published>2019-11-12T06:11:57+00:00</published><updated>2019-11-12T06:11:57+00:00</updated><id>https://simonwillison.net/2019/Nov/12/datasette/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-31"&gt;Datasette 0.31&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Released today: this version adds compatibility with Python 3.8 and breaks compatibility with Python 3.5. Since Glitch support Python 3.7.3 now I decided I could finally give up on 3.5. This means Datasette can use f-strings now, but more importantly it opens up the opportunity to start taking advantage of Starlette, which makes all kinds of interesting new ASGI-based plugins much easier to build.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/glitch"&gt;glitch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;&lt;/p&gt;



</summary><category term="glitch"/><category term="projects"/><category term="python"/><category term="datasette"/><category term="asgi"/></entry><entry><title>Single sign-on against GitHub using ASGI middleware</title><link href="https://simonwillison.net/2019/Jul/14/sso-asgi/#atom-tag" rel="alternate"/><published>2019-07-14T01:18:56+00:00</published><updated>2019-07-14T01:18:56+00:00</updated><id>https://simonwillison.net/2019/Jul/14/sso-asgi/#atom-tag</id><summary type="html">
    &lt;p&gt;I released &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-29"&gt;Datasette 0.29&lt;/a&gt; last weekend, the first version of Datasette to be built on top of ASGI (discussed previously in &lt;a href="https://simonwillison.net/2019/Jun/23/datasette-asgi/"&gt;Porting Datasette to ASGI, and Turtles all the way down&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This also marked the introduction of the new &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#plugin-asgi-wrapper"&gt;asgi_wrapper&lt;/a&gt; plugin hook, which allows plugins to wrap the entire Datasette application in their own piece of ASGI middleware.&lt;/p&gt;
&lt;p&gt;To celebrate this new capability, I also released two new plugins: &lt;a href="https://github.com/simonw/datasette-cors"&gt;datasette-cors&lt;/a&gt;, which provides fine-grained control over CORS headers (using my &lt;a href="https://github.com/simonw/asgi-cors"&gt;asgi-cors&lt;/a&gt; library from a few months ago) and &lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt;, the first of hopefully many authentication plugins for Datasette.&lt;/p&gt;
&lt;h3&gt;&lt;a id="datasetteauthgithub_8"&gt;&lt;/a&gt;datasette-auth-github&lt;/h3&gt;
&lt;p&gt;The new plugin is best illustrated with a demo.&lt;/p&gt;
&lt;p&gt;Visit &lt;a href="https://datasette-auth-demo.now.sh/"&gt;https://datasette-auth-demo.now.sh/&lt;/a&gt; and you will be redirected to GitHub and asked to approve access to your account (just your e-mail address, not repository access).&lt;/p&gt;
&lt;p&gt;Agree, and you’ll be redirected back to the demo with a new element in the Datasette header: your GitHub username, plus a “log out” link in the navigation bar at the top of the screen.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Controlling_who_can_access_16"&gt;&lt;/a&gt;Controlling who can access&lt;/h3&gt;
&lt;p&gt;The default behaviour of the plugin is to allow in anyone with a GitHub account. Since the primary use-case for the plugin (at least for the moment) is restricting data viewing to a trusted subset of people, the plugin lets you configure who is allowed to view your data in three different ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can restrict access to a specific list of GitHub accounts, using the &lt;code&gt;allow_users&lt;/code&gt; configuration option.&lt;/li&gt;
&lt;li&gt;You can restrict access to members of one or more GitHub &lt;a href="https://help.github.com/en/articles/about-organizations"&gt;organizations&lt;/a&gt;, with &lt;code&gt;allow_orgs&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You can restrict access to members of specific &lt;a href="https://help.github.com/en/articles/about-teams"&gt;teams&lt;/a&gt; within an organization, using &lt;code&gt;allow_teams&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Datasette inherits quite a sophisticated user management system from GitHub, with very little effort required from the plugin. The &lt;code&gt;user_is_allowed()&lt;/code&gt; method implements all three of the above options against the GitHub API in &lt;a href="https://github.com/simonw/datasette-auth-github/blob/f69781d11115b2685ff48bdaa2ab0367d4f8d306/datasette_auth_github/github_auth.py#L145-L188"&gt;just 40 lines of code&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These options can be set using the &lt;code&gt;&amp;quot;plugins&amp;quot;&lt;/code&gt; section of the Datasette &lt;code&gt;metadata.json&lt;/code&gt; configuration file. Here’s an example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &amp;quot;plugins&amp;quot;: {
    &amp;quot;datasette-auth-github&amp;quot;: {
      &amp;quot;client_id&amp;quot;: {&amp;quot;$env&amp;quot;: &amp;quot;GITHUB_CLIENT_ID&amp;quot;},
      &amp;quot;client_secret&amp;quot;: {&amp;quot;$env&amp;quot;: &amp;quot;GITHUB_CLIENT_SECRET&amp;quot;},
      &amp;quot;allow_users&amp;quot;: [&amp;quot;simonw&amp;quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This also illustrates a new Datasette feature: the ability to set &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#secret-configuration-values"&gt;secret plugin configuration values&lt;/a&gt;. &lt;code&gt;{&amp;quot;$env&amp;quot;: &amp;quot;GITHUB_CLIENT_SECRET&amp;quot;}&lt;/code&gt; means &amp;quot;read this configuration option from the environment variable &lt;code&gt;GITHUB_CLIENT_SECRET&lt;/code&gt;&amp;quot;.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Automatic_log_in_40"&gt;&lt;/a&gt;Automatic log in&lt;/h3&gt;
&lt;p&gt;Like many OAuth providers, GitHub only asks the user for their approval the first time they log into a given app. Any subsequent times they are redirected to GitHub it will skip the permission screen and redirect them right back again with a token.&lt;/p&gt;
&lt;p&gt;This means we can implement automatic log in: any time a visitor arrives who does not have a cookie we can bounce them directly to GitHub, and if they have already consented they will be logged in instantly.&lt;/p&gt;
&lt;p&gt;This is a great user-experience - provided the user is logged into GitHub they will be treated as if they are logged into your application - but it does come with a downside: what if the user clicks the “log out” link?&lt;/p&gt;
&lt;p&gt;For the moment I’ve implemented this using another cookie: if the user clicks “log out”, I set an &lt;code&gt;asgi_auth_logout&lt;/code&gt; cookie marking the user as having explicitly logged out. While they have that cookie they won’t be logged in automatically, instead having to click an explicit link. See &lt;a href="https://github.com/simonw/datasette-auth-github/issues/41"&gt;issue 41&lt;/a&gt; for thoughts on how this could be further improved.&lt;/p&gt;
&lt;p&gt;One pleasant side-effect of all of this is that &lt;code&gt;datasette-auth-github&lt;/code&gt; doesn’t need to persist the user’s GitHub &lt;code&gt;access_token&lt;/code&gt; anywhere - it uses it during the initial authentication to check for any required organizations or teams, but then it deliberately forgets the token entirely.&lt;/p&gt;
&lt;p&gt;OAuth access tokens are like passwords, so the most responsible thing for a piece of software to do with them is avoid storing them anywhere at all unless they are explicitly needed.&lt;/p&gt;
&lt;h3&gt;&lt;a id="What_happens_when_a_user_leaves_an_organization_54"&gt;&lt;/a&gt;What happens when a user leaves an organization?&lt;/h3&gt;
&lt;p&gt;When building against a single sign-in provider, consideration needs to be given to offboarding: when a user is removed from a team or organization they should also lose access to their SSO applications.&lt;/p&gt;
&lt;p&gt;This is difficult when an application sets its own authentication cookies, like &lt;code&gt;datasette-auth-github&lt;/code&gt; does.&lt;/p&gt;
&lt;p&gt;One solution would be to make an API call on every request to the application, to verify that the user should still have access. This would slow everything down and is likely to blow through rate limits as well, so we need a more efficient solution.&lt;/p&gt;
&lt;p&gt;I ended up solving this with two mechanisms. Since we have automatic log in, our cookies don’t actually need to last very long - so by default the signed cookies set by the plugin last for just one hour. When a user’s cookie has expired they will be redirected back through GitHub - they probably won’t even notice the redirect, and their permissions will be re-verified as part of that flow.&lt;/p&gt;
&lt;p&gt;But what if you need to invalidate those cookies instantly?&lt;/p&gt;
&lt;p&gt;To cover that case, I’ve incorporated an optional &lt;code&gt;cookie_version&lt;/code&gt; configuration option into the signatures on the cookies. If you need to invalidate &lt;em&gt;every&lt;/em&gt; signed cookie that is out there - to lock out a compromised GitHub account owner for example - you can do so by changing the &lt;code&gt;cookie_version&lt;/code&gt; configuration option and restarting (or re-deploying) Datasette.&lt;/p&gt;
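&lt;p&gt;To illustrate why that works, here’s a hypothetical sketch of deriving a signing secret that incorporates a version string - not the plugin’s exact implementation, just the idea: change the version and every previously issued cookie signature stops validating.&lt;/p&gt;

```python
import hashlib


def derive_signing_secret(client_id, client_secret, cookie_version="default"):
    # Derive a cookie-signing key from the OAuth app credentials plus a
    # version string. Changing cookie_version changes the derived key,
    # which invalidates the signatures on every existing cookie.
    # The salt value here is illustrative, not the plugin's.
    material = "{}:{}:{}".format(client_id, client_secret, cookie_version)
    return hashlib.pbkdf2_hmac(
        "sha256", material.encode("utf-8"), b"cookie-signing-salt", 100_000
    )
```

The derivation is deterministic, so restarted processes agree on the key without any shared state beyond the configuration itself.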
&lt;p&gt;These options are all described in detail in the &lt;a href="https://github.com/simonw/datasette-auth-github/blob/master/README.md"&gt;project README&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Integration_with_datasette_publish_71"&gt;&lt;/a&gt;Integration with datasette publish&lt;/h3&gt;
&lt;p&gt;The &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html#datasette-publish"&gt;datasette publish&lt;/a&gt; command-line tool lets users instantly publish a SQLite database to the internet, using Heroku, Cloud Run or Zeit Now v1. I’ve added support for setting secret plugin configuration directly to that tool, which means you can publish an authentication-protected SQLite database to the internet with a shell one-liner, using &lt;code&gt;--install=datasette-auth-github&lt;/code&gt; to install the plugin and &lt;code&gt;--plugin-secret&lt;/code&gt; to configure it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ datasette publish cloudrun fixtures.db \
   --install=datasette-auth-github \
   --name datasette-auth-protected \
   --service datasette-auth-protected \
   --plugin-secret datasette-auth-github allow_users simonw \
   --plugin-secret datasette-auth-github client_id 85f6224cb2a44bbad3fa \
   --plugin-secret datasette-auth-github client_secret ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a Cloud Run instance which only allows GitHub user &lt;a href="https://github.com/simonw"&gt;simonw&lt;/a&gt; to log in. You could instead use &lt;code&gt;--plugin-secret datasette-auth-github allow_orgs my-org&lt;/code&gt; to allow any users from a specific GitHub organization.&lt;/p&gt;
&lt;p&gt;Note that Cloud Run does not yet give you full control over the URL that will be assigned to your deployment. In this case it gave me &lt;code&gt;https://datasette-auth-protected-j7hipcg4aq-uc.a.run.app&lt;/code&gt; - which works fine, but I needed to update my GitHub OAuth application’s callback URL manually to &lt;code&gt;https://datasette-auth-protected-j7hipcg4aq-uc.a.run.app/-/auth-callback&lt;/code&gt; after deploying the application in order to get the authentication flow to work correctly.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Add_GitHub_authentication_to_any_ASGI_application_87"&gt;&lt;/a&gt;Add GitHub authentication to any ASGI application!&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;datasette-auth-github&lt;/code&gt; isn’t just for Datasette: I deliberately wrote the plugin as ASGI middleware first, with only a very thin layer of extra code to turn it into an installable plugin.&lt;/p&gt;
&lt;p&gt;This means that if you are building any other kind of ASGI app (or using an ASGI-compatible framework such as Starlette or Sanic) you can wrap your application directly with the middleware and get the same authentication behaviour as when the plugin is added to Datasette!&lt;/p&gt;
&lt;p&gt;Here’s what that looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; datasette_auth_github &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; GitHubAuth
&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; starlette.applications &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; Starlette
&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; starlette.responses &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; HTMLResponse
&lt;span class="hljs-keyword"&gt;import&lt;/span&gt; uvicorn

app = Starlette(debug=&lt;span class="hljs-keyword"&gt;True&lt;/span&gt;)


&lt;span class="hljs-decorator"&gt;@app.route("/")&lt;/span&gt;
&lt;span class="hljs-keyword"&gt;async&lt;/span&gt; &lt;span class="hljs-function"&gt;&lt;span class="hljs-keyword"&gt;def&lt;/span&gt; &lt;span class="hljs-title"&gt;homepage&lt;/span&gt;&lt;span class="hljs-params"&gt;(request)&lt;/span&gt;:&lt;/span&gt;
    &lt;span class="hljs-keyword"&gt;return&lt;/span&gt; HTMLResponse(&lt;span class="hljs-string"&gt;"Hello, {}"&lt;/span&gt;.format(
        repr(request.scope[&lt;span class="hljs-string"&gt;"auth"&lt;/span&gt;])
    ))


authenticated_app = GitHubAuth(
    app,
    client_id=&lt;span class="hljs-string"&gt;"986f5d837b45e32ee6dd"&lt;/span&gt;,
    client_secret=&lt;span class="hljs-string"&gt;"..."&lt;/span&gt;,
    require_auth=&lt;span class="hljs-keyword"&gt;True&lt;/span&gt;,
    allow_users=[&lt;span class="hljs-string"&gt;"simonw"&lt;/span&gt;],
)

&lt;span class="hljs-keyword"&gt;if&lt;/span&gt; __name__ == &lt;span class="hljs-string"&gt;"__main__"&lt;/span&gt;:
    uvicorn.run(authenticated_app, host=&lt;span class="hljs-string"&gt;"0.0.0.0"&lt;/span&gt;, port=&lt;span class="hljs-number"&gt;8000&lt;/span&gt;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The middleware adds a &lt;code&gt;scope[&amp;quot;auth&amp;quot;]&lt;/code&gt; key describing the logged in user, which is then passed through to your application. More on this &lt;a href="https://github.com/simonw/datasette-auth-github#using-this-as-asgi-middleware-without-datasette"&gt;in the README&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Your_security_reviews_needed_125"&gt;&lt;/a&gt;Your security reviews needed!&lt;/h3&gt;
&lt;p&gt;Since &lt;code&gt;datasette-auth-github&lt;/code&gt; adds authentication to Datasette, it is an extremely security-sensitive piece of code. So far I’m the only person who has looked at it: before I start widely recommending it to people I’d really like to get some more eyes on it to check for any potential security problems.&lt;/p&gt;
&lt;p&gt;I’ve opened &lt;a href="https://github.com/simonw/datasette-auth-github/issues/44"&gt;issue #44&lt;/a&gt; encouraging security-minded developers to have a dig through the code and see if there’s anything that can be tightened up or any potential vulnerabilities that need to be addressed. Please get involved!&lt;/p&gt;
&lt;p&gt;It’s a pretty small codebase, but here are some areas you might want to inspect:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;At a high level: is the way I’m verifying the user through the GitHub API and then storing their identity in a signed cookie the right way to go?&lt;/li&gt;
&lt;li&gt;The cookie signing secret is derived from the GitHub OAuth application’s &lt;code&gt;client_id&lt;/code&gt; and &lt;code&gt;client_secret&lt;/code&gt; (because that secret is already meant to be a secret), combined with the &lt;code&gt;cookie_version&lt;/code&gt; option described above - &lt;a href="https://github.com/simonw/datasette-auth-github/blob/bf01f8f01b87a6cb09c47380ba0a86e0546ebb38/datasette_auth_github/github_auth.py#L81-L88"&gt;implementation here&lt;/a&gt;. Since this is a derived secret I’m using &lt;a href="https://docs.python.org/3/library/hashlib.html#hashlib.pbkdf2_hmac"&gt;pbkdf2_hmac&lt;/a&gt; with 100,000 iterations. This is by far the most cryptographically interesting part of the code, and could definitely do with some second opinions.&lt;/li&gt;
&lt;li&gt;The code used &lt;a href="https://github.com/simonw/datasette-auth-github/blob/bf01f8f01b87a6cb09c47380ba0a86e0546ebb38/datasette_auth_github/utils.py#L16-L36"&gt;to sign and verify cookies&lt;/a&gt; is based on Django’s (thoroughly reviewed) implementation, but could benefit from a sanity check.&lt;/li&gt;
&lt;li&gt;I wanted this library to work on &lt;a href="https://glitch.com/"&gt;Glitch&lt;/a&gt;, which currently &lt;a href="https://support.glitch.com/t/can-you-upgrade-python-to-latest-version/7980"&gt;only provides Python 3.5.2&lt;/a&gt;. Python’s asyncio HTTP libraries such as &lt;a href="https://github.com/encode/http3"&gt;http3&lt;/a&gt; and &lt;a href="https://aiohttp.readthedocs.io/en/stable/"&gt;aiohttp&lt;/a&gt; require more modern Pythons, so I ended up &lt;a href="https://github.com/simonw/datasette-auth-github/pull/40"&gt;rolling my own&lt;/a&gt; very simple async HTTP function which uses &lt;code&gt;urllib.request&lt;/code&gt; inside a &lt;code&gt;loop.run_in_executor&lt;/code&gt; thread pool. Is that approach sound? Rolling my own HTTP client in this way feels a little hairy.&lt;/li&gt;
&lt;/ul&gt;
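&lt;p&gt;The derived-secret scheme can be sketched like this (illustrative only: the salt, hash choice and function name here are my own, not copied from the &lt;code&gt;datasette-auth-github&lt;/code&gt; source):&lt;/p&gt;

```python
import hashlib


def derive_signing_secret(client_id, client_secret, cookie_version="default"):
    """Derive a cookie-signing secret from existing OAuth credentials.

    A simplified sketch of the idea described above; the fixed salt and
    parameter names are illustrative assumptions.
    """
    # pbkdf2_hmac stretches the combined secret material through
    # 100,000 iterations, making the derived key expensive to
    # brute-force even if the derivation scheme is known.
    material = "{}:{}:{}".format(client_id, client_secret, cookie_version)
    return hashlib.pbkdf2_hmac(
        "sha256",
        material.encode("utf-8"),
        b"cookie-signing",  # fixed salt: illustrative only
        100000,
    )
```

&lt;p&gt;Bumping &lt;code&gt;cookie_version&lt;/code&gt; changes the derived key, which invalidates every previously issued cookie.&lt;/p&gt;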
&lt;p&gt;This has been a really fun project so far, and I’m very excited about the potential for authenticated Datasette moving forward - not to mention the possibilities unlocked by an ASGI middleware ecosystem with strong support for wrapping any application in an authentication layer.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/middleware"&gt;middleware&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="middleware"/><category term="projects"/><category term="security"/><category term="datasette"/><category term="asgi"/></entry><entry><title>datasette-cors</title><link href="https://simonwillison.net/2019/Jul/8/datasette-cors/#atom-tag" rel="alternate"/><published>2019-07-08T04:30:53+00:00</published><updated>2019-07-08T04:30:53+00:00</updated><id>https://simonwillison.net/2019/Jul/8/datasette-cors/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-cors"&gt;datasette-cors&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
My other Datasette ASGI plugin: this one wraps my asgi-cors project and lets you configure CORS access from a list of domains (or a set of domain wildcards) so you can make JavaScript calls to a Datasette instance from a specific set of other hosts.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1148084447687786498"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="datasette"/><category term="asgi"/><category term="cors"/></entry><entry><title>datasette-auth-github</title><link href="https://simonwillison.net/2019/Jul/8/datasette-auth-github/#atom-tag" rel="alternate"/><published>2019-07-08T04:28:17+00:00</published><updated>2019-07-08T04:28:17+00:00</updated><id>https://simonwillison.net/2019/Jul/8/datasette-auth-github/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
My first big ASGI plugin for Datasette: datasette-auth-github adds the ability to require users to authenticate against the GitHub OAuth API. You can whitelist specific users, or you can restrict access to members of specific GitHub organizations or teams. While it’s structured as a Datasette plugin it also includes ASGI middleware which can be applied to any ASGI application.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1148085030448578561"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/oauth"&gt;oauth&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="oauth"/><category term="projects"/><category term="datasette"/><category term="asgi"/></entry><entry><title>Porting Datasette to ASGI, and Turtles all the way down</title><link href="https://simonwillison.net/2019/Jun/23/datasette-asgi/#atom-tag" rel="alternate"/><published>2019-06-23T21:39:00+00:00</published><updated>2019-06-23T21:39:00+00:00</updated><id>https://simonwillison.net/2019/Jun/23/datasette-asgi/#atom-tag</id><summary type="html">
    &lt;p&gt;This evening I finally closed a &lt;a href="https://simonwillison.net/tags/datasette/"&gt;Datasette&lt;/a&gt; issue that I opened more than 13 months ago: &lt;a href="https://github.com/simonw/datasette/issues/272"&gt;#272: Port Datasette to ASGI&lt;/a&gt;. A few notes on why this is such an important step for the project.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://asgi.readthedocs.io/"&gt;ASGI&lt;/a&gt; is the Asynchronous Server Gateway Interface standard. It’s been evolving steadily over the past few years under the guidance of Andrew Godwin. It’s intended as an asynchronous replacement for the venerable &lt;a href="https://wsgi.readthedocs.io/"&gt;WSGI&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Turtles_all_the_way_down_6"&gt;&lt;/a&gt;Turtles all the way down&lt;/h3&gt;
&lt;p&gt;Ten years ago at EuroDjangoCon 2009 in Prague I gave a talk entitled &lt;a href="https://www.slideshare.net/simon/django-heresies"&gt;Django Heresies&lt;/a&gt;. After discussing some of the design decisions in Django that I didn’t think had aged well, I spent the last part of the talk talking about &lt;em&gt;Turtles all the way down&lt;/em&gt;. I &lt;a href="https://simonwillison.net/2009/May/19/djng/?#turtles-all-the-way-down"&gt;wrote that idea up here&lt;/a&gt; on my blog (see also &lt;a href="https://www.slideshare.net/simon/django-heresies/65-The_Django_Contract_A_view"&gt;these slides&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The key idea was that Django would be more interesting if the core Django contract - a function that takes a request and returns a response - was extended to more places in the framework. The top level site, the reusable applications, middleware and URL routing could all share that same contract. Everything could be composed from the same raw building blocks.&lt;/p&gt;
&lt;p&gt;I’m excited about ASGI because it absolutely fits the &lt;em&gt;turtles all the way down&lt;/em&gt; model.&lt;/p&gt;
&lt;p&gt;The ASGI contract is an asynchronous function that takes three arguments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;async def application(scope, receive, send):
    ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;scope&lt;/code&gt; is a serializable dictionary providing the context for the current connection. &lt;code&gt;receive&lt;/code&gt; is an awaitable which can be used to receive incoming messages. &lt;code&gt;send&lt;/code&gt; is an awaitable that can be used to send replies.&lt;/p&gt;
&lt;p&gt;It’s a pretty low-level set of primitives (and less obvious than a simple request/response) - and that’s because ASGI is about more than just the standard HTTP request/response cycle. This contract works for HTTP, WebSockets and potentially any other protocol that needs to asynchronously send and receive data.&lt;/p&gt;
&lt;p&gt;It’s an extremely elegant piece of protocol design, informed by Andrew’s experience with Django Channels, SOA protocols (we are co-workers at Eventbrite, where we’ve both been heavily involved in its &lt;a href="https://github.com/eventbrite/pysoa"&gt;SOA mechanism&lt;/a&gt;) and his extensive conversations with other maintainers in the Python web community.&lt;/p&gt;
&lt;p&gt;The ASGI protocol really is turtles all the way down - it’s a simple, well defined contract which can be composed together to implement all kinds of interesting web architectural patterns.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/asgi-cors/"&gt;asgi-cors library&lt;/a&gt; was my first attempt at building an ASGI turtle. &lt;a href="https://github.com/simonw/asgi-cors/blob/master/asgi_cors.py"&gt;The implementation&lt;/a&gt; is a simple Python decorator which, when applied to another ASGI callable, adds HTTP CORS headers based on the parameters you pass to the decorator. The library has zero installation dependencies (it has test dependencies on pytest and friends) and can be used on any HTTP ASGI project.&lt;/p&gt;
&lt;p&gt;Building &lt;code&gt;asgi-cors&lt;/code&gt; completely sold me on ASGI as the turtle pattern I had been desiring for over a decade!&lt;/p&gt;
&lt;h3&gt;&lt;a id="Datasette_plugins_and_ASGI_31"&gt;&lt;/a&gt;Datasette plugins and ASGI&lt;/h3&gt;
&lt;p&gt;Which brings me to Datasette.&lt;/p&gt;
&lt;p&gt;One of the most promising components of Datasette is its plugin mechanism. Based on &lt;a href="https://pluggy.readthedocs.io/en/latest/"&gt;pluggy&lt;/a&gt; (extracted from pytest), &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html"&gt;Datasette Plugins&lt;/a&gt; allow new features to be added to Datasette without needing to change the underlying code. This means new features can be built, packaged and shipped entirely independently of the core project. A list of currently available plugins &lt;a href="https://datasette.readthedocs.io/en/latest/ecosystem.html#datasette-plugins"&gt;can be found here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;WordPress is a very solid blogging engine. Add in the plugin ecosystem around it and it can be used to build literally any CMS you can possibly imagine.&lt;/p&gt;
&lt;p&gt;My dream for Datasette is to apply the same model: I want a strong core for publishing and exploring data that’s enhanced by plugins to solve a huge array of data analysis, visualization and API-backed problems.&lt;/p&gt;
&lt;p&gt;Datasette has &lt;a href="https://datasette.readthedocs.io/en/latest/plugins.html#plugin-hooks"&gt;a range of plugin hooks already&lt;/a&gt;, but I’ve so far held back on implementing the most useful class of hooks: hooks that allow developers to add entirely new URL routes exposing completely custom functionality.&lt;/p&gt;
&lt;p&gt;The reason I held back is that I wanted to be confident that the contract I was offering was something I would continue to support moving forward. A plugin system isn’t much good if the core implementation keeps on changing in backwards-incompatible ways.&lt;/p&gt;
&lt;p&gt;ASGI is the exact contract I’ve been waiting for. It’s not quite ready yet, but you can follow &lt;a href="https://github.com/simonw/datasette/issues/520"&gt;#520: prepare_asgi plugin hook&lt;/a&gt; (thoughts and suggestions welcome!) to be the first to hear about this hook when it lands. I’m planning to use it to make my asgi-cors library available as a plugin, after which I’m excited to start exploring the idea of bringing authentication plugins to Datasette (and to the wider ASGI world in general).&lt;/p&gt;
&lt;p&gt;I’m hoping that many Datasette ASGI plugins will exist in a form that allows them to be used by other ASGI applications as well.&lt;/p&gt;
&lt;p&gt;I also plan to use ASGI to make components of Datasette itself available to other ASGI applications. If you just want a single instance of Datasette’s &lt;a href="https://datasette.readthedocs.io/en/stable/pages.html#table"&gt;table view&lt;/a&gt; to be embedded somewhere in your URL configuration you should be able to do that by routing traffic directly to the ASGI-compatible view class.&lt;/p&gt;
&lt;p&gt;I’m really excited about exploring the intersection of ASGI turtles-all-the-way-down and pluggy’s powerful mechanism for gluing components together. Both WSGI and Django’s reusable apps have attempted to create a reusable ecosystem in the past, with limited success. Let’s see if ASGI can finally make the turtle dream come true.&lt;/p&gt;

&lt;h3&gt;&lt;a id="Further_reading_53"&gt;&lt;/a&gt;Further reading&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://www.encode.io/articles/hello-asgi/"&gt;Hello ASGI&lt;/a&gt; by Tom Christie is the best introduction to ASGI I’ve seen. Tom is the author of the &lt;a href="https://www.uvicorn.org/"&gt;Uvicorn&lt;/a&gt; ASGI server (used by Datasette as-of this evening) and &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt;, a delightfully well-designd ASGI web framework. I’ve learned an enormous amount about ASGI by reading Tom’s code. Tom also gave &lt;a href="https://www.youtube.com/watch?v=u8GSFEg5lnU"&gt;a talk about ASGI&lt;/a&gt; at DjangoCon Europe a few months ago.&lt;/p&gt;
&lt;p&gt;If you haven’t read &lt;a href="https://www.aeracode.org/2018/06/04/django-async-roadmap/"&gt;A Django Async Roadmap&lt;/a&gt;, which Andrew Godwin published last year, you should absolutely catch up. More than just talking about ASGI, Andrew sketches out a detailed and actionable plan for bringing asyncio to Django core. He landed &lt;a href="https://github.com/django/django/pull/11209"&gt;the first Django core ASGI code&lt;/a&gt; based on that plan just a few days ago.&lt;/p&gt;
&lt;p&gt;If you're interested in the details of Datasette's ASGI implementation, I posted &lt;a href="https://github.com/simonw/datasette/issues/272"&gt;detailed commentary on issue #272&lt;/a&gt; over the past thirteen months as I researched and finalized my approach. I added further commentary to &lt;a href="https://github.com/simonw/datasette/pull/518"&gt;the associated pull request&lt;/a&gt;, which gathers together the 34 commits it took to ship the feature (squashed into a single commit to master).&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andrew-godwin"&gt;andrew-godwin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tom-christie"&gt;tom-christie&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="andrew-godwin"/><category term="projects"/><category term="datasette"/><category term="asgi"/><category term="tom-christie"/><category term="pytest"/></entry><entry><title>asgi-cors</title><link href="https://simonwillison.net/2019/May/7/asgi-cors/#atom-tag" rel="alternate"/><published>2019-05-07T00:12:37+00:00</published><updated>2019-05-07T00:12:37+00:00</updated><id>https://simonwillison.net/2019/May/7/asgi-cors/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/asgi-cors"&gt;asgi-cors&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’ve been trying out the new ASGI 3.0 spec and I just released my first piece of ASGI middleware: asgi-cors, which lets you wrap an ASGI application with Access-Control-Allow-Origin CORS headers (either “*” or dynamic headers based on an origin whitelist).

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1125553970007568384"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="security"/><category term="asgi"/><category term="cors"/></entry><entry><title>Hello world for ASGI running on Glitch</title><link href="https://simonwillison.net/2019/Apr/26/hello-world-asgi-running-glitch/#atom-tag" rel="alternate"/><published>2019-04-26T05:06:12+00:00</published><updated>2019-04-26T05:06:12+00:00</updated><id>https://simonwillison.net/2019/Apr/26/hello-world-asgi-running-glitch/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://glitch.com/~asgi"&gt;Hello world for ASGI running on Glitch&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’m continuing to experiment with Python 3 running on Glitch. This evening on my walk home from work I built this “hello world” demo on my phone, partly to see if Glitch was a workable mobile development environment—it passed with flying colours! The demo is a simple hello world implemented using the new ASGI 3.0 specification, running on the daphne reference server. Click the “via” link for my accompanying thread on Twitter, which includes a short screencast (also recorded on my phone) showing Glitch in action.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1121572280637595648"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/glitch"&gt;glitch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;&lt;/p&gt;



</summary><category term="glitch"/><category term="projects"/><category term="asgi"/></entry><entry><title>Quoting Tom Christie</title><link href="https://simonwillison.net/2018/Oct/8/tom-christie/#atom-tag" rel="alternate"/><published>2018-10-08T14:43:16+00:00</published><updated>2018-10-08T14:43:16+00:00</updated><id>https://simonwillison.net/2018/Oct/8/tom-christie/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.encode.io/articles/hello-asgi/"&gt;&lt;p&gt;The ASGI specification provides an opportunity for Python to hit a productivity/performance sweet-spot for a wide range of use-cases, from writing high-volume proxy servers through to bringing large-scale web applications to market at speed.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.encode.io/articles/hello-asgi/"&gt;Tom Christie&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/asgi"&gt;asgi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tom-christie"&gt;tom-christie&lt;/a&gt;&lt;/p&gt;



</summary><category term="async"/><category term="python"/><category term="asgi"/><category term="tom-christie"/></entry></feed>