Simon Willison’s Weblog


 

Recent entries

Weeknotes: Building Datasette Cloud on Fly Machines, Furo for documentation three days ago

Hosting provider Fly released Fly Machines this week. I got an early preview and I’ve been working with it for a few days—it’s a fascinating new piece of technology. I’m using it to get my hosting service for Datasette ready for wider release.

Datasette Cloud

Datasette Cloud is the name I’ve given my forthcoming hosted SaaS version of Datasette. I’m building it for two reasons:

  1. This is an obvious step towards building a sustainable business model for my open source project. It’s a reasonably well-trodden path at this point: plenty of projects have demonstrated that offering paid hosting for an open source project can build a valuable business. GitLab are an especially good example of this model.
  2. There are plenty of people who could benefit from Datasette, but the friction involved in hosting it prevents them from taking advantage of the software. I’ve tried to make it as easy to host as possible, but without a SaaS hosted version I’m failing to deliver value to the people that I most want the software to help.

My previous alpha was built directly on Docker, running everything on a single large VPS. Obviously it needed to scale beyond one machine, and I started experimenting with Kubernetes to make this happen.

I also want to allow users to run their own plugins, without risk of malicious code causing problems for other accounts. Docker and Kubernetes containers don’t offer the isolation that I need to feel comfortable doing this, so I started researching Firecracker—constructed by AWS to power Lambda and Fargate, so very much designed with potentially malicious code in mind.

Spinning up Firecracker on a Kubernetes cluster is no small lift!

And then I heard about Fly Machines. And it looks like it’s exactly what I need to get this project to the next milestone.

Fly Machines

Fly’s core offering allows you to run Docker containers in regions around the world, automatically converted by Fly into Firecracker microVMs, with geo-load-balancing so users automatically get routed to an instance running near them.

Their new Fly Machines product gives you a new way to run containers there: you get full control over when containers are created, updated, started, stopped and destroyed. It’s the exact level of control I need to build Datasette Cloud.

It also implements scale-to-zero: you can stop a container, and Fly will automatically start it back up again for you (generally in less than a second) when fresh traffic comes in.

(I had built my own version of this for my Datasette Cloud alpha, but the spin up time took more like 10s and involved showing the user a custom progress bar to help them see what was going on.)
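Everything is driven over a REST API. Here’s a rough sketch of what creating and then stopping a machine looks like from Python; treat the api.machines.dev host, the endpoint paths and the payload fields as assumptions based on Fly’s Machines API documentation rather than as tested code (and definitely not as code from Datasette Cloud):

import os

import requests

FLY_API_TOKEN = os.environ["FLY_API_TOKEN"]  # a Fly API token you supply yourself
APP_NAME = "my-datasette-cloud-app"          # placeholder app name
API = "https://api.machines.dev/v1"
headers = {"Authorization": f"Bearer {FLY_API_TOKEN}"}

# Create a machine running a Docker image in a specific region
machine = requests.post(
    f"{API}/apps/{APP_NAME}/machines",
    headers=headers,
    json={
        "region": "sjc",
        "config": {"image": "datasetteproject/datasette:latest"},
    },
).json()

# Stop it again - it can be started back up later, or on demand
requests.post(f"{API}/apps/{APP_NAME}/machines/{machine['id']}/stop", headers=headers)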

Being able to programmatically start and stop Firecracker VMs was exactly what I’d been trying to piece together using Kubernetes—and the ability to control which global region they go in (with the potential for Litestream replication between regions in the future) is a feature I hadn’t expected to be able to offer for years.

So I spent most of this week on a proof of concept. I’ve successfully demonstrated that the Fly Machines product has almost exactly the features that I need to ship Datasette Cloud on Fly Machines—and I’ve confirmed that the gaps I need to fill are on Fly’s near-term roadmap.

I don’t have anything to demonstrate publicly just yet, but I do have several new TILs.

If this sounds interesting to you or your organization and you’d like to try it out, drop me an email at swillison @ Google’s email service.

The Furo theme for Sphinx

My shot-scraper automated screenshot tool’s README had got a little too long, so I decided to upgrade it to a full documentation website.

I chose to use MyST and Sphinx for this, hosted on Read The Docs.

MyST adds Markdown syntax to Sphinx, which is easier to remember (and for people to contribute to) than reStructuredText.

After putting the site live, Adam Johnson suggested I take a look at the Furo theme. I’d previously found Sphinx themes hard to navigate because they had so much differing functionality, but a personal recommendation turned out to be exactly what I needed.

Furo is really nice—it fixed a slight rendering complaint I had about nested lists in the theme I was using, and since it doesn’t use web fonts it dropped the bytes transferred for a page of documentation by more than half!

I switched shot-scraper over to Furo, and liked it so much that I switched over Datasette and sqlite-utils too.
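The configuration involved is pleasantly small. Here’s a minimal sketch of the relevant parts of a Sphinx conf.py, not the exact configuration shot-scraper uses:

# conf.py - minimal sketch, not the exact shot-scraper configuration
project = "shot-scraper"

# myst_parser teaches Sphinx to build Markdown files alongside reStructuredText
extensions = ["myst_parser"]
source_suffix = {".rst": "restructuredtext", ".md": "markdown"}

# Switching to Furo is a one-line change once the furo package is installed
html_theme = "furo"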

Here’s what the shot-scraper documentation looks like now:

A screenshot of the shot-scraper documentation, showing the table of contents

Screenshot taken using shot-scraper itself, like this:

shot-scraper \
  https://shot-scraper.datasette.io/en/latest/ \
  --retina --height 1200

Full details of those theme migrations (including more comparative screenshots) can be found in these issues:

Releases this week

TIL this week

Bundling binary tools in Python wheels five days ago

I spotted a new (to me) pattern which I think is pretty interesting: projects are bundling compiled binary applications as part of their Python packaging wheels. I think it’s really neat.

pip install ziglang

Zig is a new programming language led by Andrew Kelley that sits somewhere near Rust: Wikipedia calls it an “imperative, general-purpose, statically typed, compiled system programming language”.

One of its most notable features is that it bundles its own C/C++ compiler, as a “hermetic” compiler—it’s completely standalone, unaffected by the system that it is operating within. I learned about this usage of the word hermetic this morning from How Uber Uses Zig by Motiejus Jakštys.

The concept reminds me of Gregory Szorc’s python-build-standalone, which provides redistributable Python builds and was key to getting my Datasette Desktop Electron application working with its own hermetic build of Python.

One of the options provided for installing Zig (and its bundled toolchain) is to use pip:

% pip install ziglang
...
% python -m ziglang cc --help
OVERVIEW: clang LLVM compiler

USAGE: zig [options] file...

OPTIONS:
  -###                    Print (but do not run) the commands to run for this compilation
  --amdgpu-arch-tool=<value>
                          Tool used for detecting AMD GPU arch in the system.
...

This means you can now pip install a full C compiler for your current platform!

The way this works is really simple. The ziglang package that you install has two key files: a zig binary (155MB on my system) containing the full compiled Zig implementation, and a __main__.py module containing the following:

import os, sys, subprocess
sys.exit(subprocess.call([
    os.path.join(os.path.dirname(__file__), "zig"),
    *sys.argv[1:]
]))

The package also bundles lib and doc folders with supporting files used by Zig itself, unrelated to Python.

The Zig project then bundles and ships eight different Python wheels targeting different platforms. Here’s their code that does that, which lists the platforms that are supported:

for zig_platform, python_platform in {
    'windows-i386':   'win32',
    'windows-x86_64': 'win_amd64',
    'macos-x86_64':   'macosx_10_9_x86_64',
    'macos-aarch64':  'macosx_11_0_arm64',
    'linux-i386':     'manylinux_2_12_i686.manylinux2010_i686',
    'linux-x86_64':   'manylinux_2_12_x86_64.manylinux2010_x86_64',
    'linux-armv7a':   'manylinux_2_17_armv7l.manylinux2014_armv7l',
    'linux-aarch64':  'manylinux_2_17_aarch64.manylinux2014_aarch64',
}.items():
    # Build the wheel here...

They suggest that if you want to run their tools from a Python program you do so like this, to ensure your script can find the installed binary:

import sys, subprocess

subprocess.call([sys.executable, "-m", "ziglang"])

I find this whole approach pretty fascinating. I really love the idea that I can add a full C/C++ compiler as a dependency to any of my Python projects, and thanks to Python wheels I’ll automatically get a binary executable compiled for my current platform.
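Here’s a quick sketch of what that looks like in practice, compiling a hypothetical hello.c with the bundled compiler using the same sys.executable pattern they recommend:

import sys
import subprocess

# "cc" invokes the bundled Clang-compatible C compiler shown in the
# --help output above; hello.c is a file you would supply yourself
subprocess.check_call([
    sys.executable, "-m", "ziglang", "cc", "-o", "hello", "hello.c"
])
subprocess.check_call(["./hello"])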

Playwright Python

I spotted another example of this pattern recently in Playwright Python. Playwright is Microsoft’s open source browser automation and testing framework—a kind of modern Selenium. I used it recently to build my shot-scraper screenshot automation tool.

Playwright provides a full-featured API for controlling headless (and headful) browser instances, with implementations in Node.js, Python, Java and .NET.

I was intrigued as to how they had developed such a sophisticated API for four different platforms/languages at once, providing full equivalence for all of their features across all four.

So I dug around in their Python package (from pip install playwright) and found this:

77M ./venv/lib/python3.10/site-packages/playwright/driver/node

That’s a full copy of the Node.js binary!

% ./venv/lib/python3.10/site-packages/playwright/driver/node --version
v16.13.0

Playwright Python works by providing a Python layer on top of the existing JavaScript API library. It runs a Node.js process which does the actual work; the Python library just communicates with that JavaScript process for you.
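None of that plumbing shows through in the Python API. Here’s a short example using the synchronous API—loosely the kind of thing shot-scraper does, rather than its actual code:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Uses the Chromium that `python -m playwright install` downloads
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://shot-scraper.datasette.io/")
    page.screenshot(path="example.png")
    browser.close()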

As with Zig, the Playwright team offer seven pre-compiled wheels for different platforms. The list today is:

  • playwright-1.22.0-py3-none-win_amd64.whl
  • playwright-1.22.0-py3-none-win32.whl
  • playwright-1.22.0-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  • playwright-1.22.0-py3-none-manylinux1_x86_64.whl
  • playwright-1.22.0-py3-none-macosx_11_0_universal2.whl
  • playwright-1.22.0-py3-none-macosx_11_0_arm64.whl
  • playwright-1.22.0-py3-none-macosx_10_13_x86_64.whl

I wish I could say "you can now pip install a browser!" but Playwright doesn’t actually bundle the browsers themselves—you need to run python -m playwright install to download those separately.

Pretty fascinating example of the same pattern though!

pip install a SQLite database

It’s not quite the same thing, since it’s not packaging an executable, but the one project I have that fits this mould if you squint a little is my datasette-basemap plugin.

It’s a Datasette plugin which bundles a 23MB SQLite database file containing OpenStreetMap tiles for the first seven zoom levels of their world map—5,461 tile images total.

I built it so that people could use my datasette-cluster-map and datasette-leaflet-geojson plugins entirely standalone, without needing to load tiles from a central tile server.

You can play with a demo here. I wrote more about that project in Serving map tiles from SQLite with MBTiles and datasette-tiles. It’s pretty fun to be able to run pip install datasette-basemap to install a full map of the world.

Seen any other interesting examples of pip install being (ab)used in this way? Ping them to me on Twitter.

Update: Paul O’Leary McCann points out that PyPI has a default 60MB size limit for packages, though it can be raised on a case-by-case basis. He wrote about this in Distributing Large Files with PyPI Packages.

Weeknotes: Camping, a road trip and two new museums 13 days ago

Natalie and I took a week-long road trip and camping holiday. The plan was to camp on Santa Rosa Island in the California Channel Islands, but the boat to the island was cancelled due to bad weather. We treated ourselves to a Central Californian road trip instead.

The Madonna Inn

If you’re driving down from San Francisco to Santa Barbara and you don’t stay a night at the Madonna Inn in San Luis Obispo you’re missing out.

This legendary hotel/motel built 110 guest rooms in the 1960s, each of them with a different theme. We ended up staying two nights thanks to our boat cancellation—one in the Kona Rock room (Hawaii themed, mostly carved out of solid rock, the shower has a waterfall) and one in Safari. Epic.

The Kona Rock room - the walls are all made of rocks. The bathroom in Kona Rock is made of rocks too - the sink is a huge uneven piece of rock. The Safari room has a beautiful four poster bed and exciting wallpaper

Camping

Camping in California generally requires booking a site, often months in advance. Our travel companions knew what they were doing and managed to grab us last minute spots for one night at Islay Creek near Los Osos and two nights in the beautiful Los Padres National Forest.

The Victorian Mansion

I have a habit of dropping labels on Google Maps with tips that people have given me about different places. Labels have quite a strict length limit, which means my tips are often devoid of context—including when and from whom the tip came.

This means I’m constantly stumbling across little tips from my past self, with no recollection of where the tip came from. This is delightful.

As we were planning the last leg of our trip, I spotted a label north of Santa Barbara which just said “6 rooms puts Madonna Inn to shame”.

I have no recollection of saving this tip. I had attached it to the Victorian Mansion Bed & Breakfast in Los Alamos, California—an old Victorian house with six uniquely themed rooms.

We stayed in the 1950s suite. It was full of neon and the bed was a 1956 Cadillac convertible which the house had been reconstructed around when the building was moved to its present location. We watched Sideways, a movie set in the area, on the projector that simulated a drive-in movie theater on a screen in front of the car.

The outside of the Victorian Mansion is a beautiful, well, Victorian mansion - with suspiciously boarded up windows The 1950s suite with neon lights and a car for a bed

And some museums

On the way down to San Luis Obispo we stumbled across the Paso Robles Pioneer Museum. This was the best kind of local history museum—entirely run by volunteers, and with an eclectic accumulation of donated exhibits covering all kinds of details of the history of the surrounding area. I particularly enjoyed the Swift Jewell Barbed Wire Collection—the fourth largest collection of barbed wire on public display in the world!

(This raised the obvious question: what are the top three? From this category on Atlas Obscura it looks like there are two in Kansas and one in Texas.)

The museum has an indoor street with recreations of historic businesses from the local town A sign says: Swift Jewell Barbed Wire Collection - above a wall full of barbed wire samples

Then on the way back up we checked Roadside America and found its listing for Mendenhall’s Museum of Gasoline Pumps & Petroliana. This was the absolute best kind of niche museum: an obsessive collection, in someone’s home, available to view by appointment only.

We got lucky: one of the museum’s operators spotted us lurking around the perimeter looking optimistic and let us have a look around despite not having pre-booked.

The museum features neon, dozens of gas pumps, more than 400 porcelain gas pump globes, thousands of gas station signs plus classic and historic racing cars too. My write-up and photos are available on Niche Museums.

A wall outside the museum covered in signs and neon Beautiful old historic gas pumps A very exciting classic racing car in a garage covered in more signs A bar area, with signs covering every inch of the walls and ceiling

Museums this week

TIL this week

Weeknotes: Datasette Lite, nogil Python, HYTRADBOI 22 days ago

My big project this week was Datasette Lite, a new way to run Datasette directly in a browser, powered by WebAssembly and Pyodide. I also continued my research into running SQL queries in parallel, described last week. Plus I spoke at HYTRADBOI.

Datasette Lite

This started out as a research project, inspired by the excitement around Python in the browser from PyCon US last week (which I didn’t attend, but observed with some jealousy on Twitter).

I’ve been wanting to explore this possibility for a while. JupyterLite had convinced me that it would be feasible to run Datasette using Pyodide, especially after I found out that the sqlite3 module from the Python standard library works there already.

I have a private “notes” GitHub repository which I use to keep notes in GitHub issues. I started a thread there researching the possibility of running an ASGI application in Pyodide, thinking that might be a good starting point to getting Datasette to work.

The proof of concept moved remarkably quickly, especially once I realized that Service Workers weren’t going to work but Web Workers might.

Once I had committed to Datasette Lite as a full project I started a new repository for it and transferred across my initial prototype issue thread. You can read that full thread for a blow-by-blow account of how my research pulled together in datasette-lite issue #1.

The rest of the project is documented in detail in my blog post.

Since launching it the biggest change I’ve made was a change of URL: since it’s clearly going to be a core component of the Datasette project going forward I promoted it from simonw.github.io/datasette-lite/ to its new permanent home at lite.datasette.io. It’s still hosted by GitHub Pages—here’s my TIL about setting up the new domain.

It may have started as a proof of concept tech demo, but the response to it so far has convinced me that I should really take it seriously. Being able to host Datasette without needing to run any server-side code at all is an incredibly compelling experience.

It doesn’t matter how hard I work on getting the Datasette deployment experience as easy as possible, static file hosting will always be an order of magnitude more accessible. And even at this early stage Datasette Lite is already proving to be a genuinely useful way to run the software.

As part of this research I also shipped sqlite-utils 3.26.1 with a minor dependency fix that means it works in Pyodide now. You can try that out by running the following in the Pyodide REPL:

>>> import micropip
>>> await micropip.install("sqlite-utils")
>>> import sqlite_utils
>>> db = sqlite_utils.Database(memory=True)
>>> list(db.query("select 3 * 5"))
[{'3 * 5': 15}]

Parallel SQL queries work... if you can get rid of the GIL

Last week I described my effort to implement Parallel SQL queries for Datasette.

The idea there was that many Datasette pages execute multiple SQL queries—a count(*) and a select ... limit 101 for example—that could be run in parallel instead of serial, for a potential improvement in page load times.

My hope was that I could get away with this despite Python’s infamous Global Interpreter Lock because the sqlite3 C module releases the GIL when it executes a query.
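The basic idea is simple enough to sketch with the standard library. This isn’t Datasette’s actual implementation (which runs queries in an asyncio thread pool), but it shows the shape of it, borrowing table and database names from the Datasette test fixtures:

import sqlite3
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    # Each thread gets its own connection; the sqlite3 C module releases
    # the GIL while the underlying query is executing
    conn = sqlite3.connect("fixtures.db")
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=2) as pool:
    count_future = pool.submit(run_query, "select count(*) from facetable")
    rows_future = pool.submit(run_query, "select * from facetable limit 101")
    count, rows = count_future.result(), rows_future.result()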

My initial results weren’t showing an increase in performance, even while the queries were shown to be overlapping each other. I opened a research thread and spent some time this week investigating.

My conclusion, sadly, was that the GIL was indeed to blame. sqlite3 releases the GIL to execute the query, but there’s still a lot of work that happens in Python land itself—most importantly the code that assembles the objects that represent the rows returned by the query, which is still subject to the GIL.

Then this comment on a thread about the GIL on Lobsters reminded me of the nogil fork of Python by Sam Gross, who has been working on this problem for several years now.

Since that fork has a Docker image, trying it out was easy... and to my amazement it worked! Running my parallel queries implementation against nogil Python reduced a page load time from 77ms to 47ms.

Sam’s work is against Python 3.9, but he’s discussing options for bringing his improvements into Python itself with the core maintainers. I’m hopeful that this might happen in the next few years. It’s an incredible piece of work.

An amusing coincidence: one restriction of WASM and Pyodide is that they can’t start new threads—so as part of getting Datasette to work on that platform I had to add a new setting that disables the ability to run SQL queries in threads entirely!

datasette-copy-to-memory

One question I found myself asking while investigating parallel SQL queries (before I determined that the GIL was to blame) was whether parallel SQLite queries against the same database file were suffering from some form of file locking or contention.

To rule that out, I built a new plugin: datasette-copy-to-memory—which reads a SQLite database from disk and copies it into an in-memory database when Datasette first starts up.
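Python’s sqlite3 module has a backup API that makes this kind of copy a one-liner. A minimal sketch of the idea, though not necessarily the plugin’s exact implementation:

import sqlite3

# Copy an on-disk database into an in-memory one using the backup API
disk = sqlite3.connect("fixtures.db")
memory = sqlite3.connect(":memory:")
disk.backup(memory)
disk.close()

print(memory.execute("select count(*) from facetable").fetchone())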

This didn’t make an observable difference in performance, but I’ve not tested it extensively—especially not against larger databases using servers with increased amounts of available RAM.

If you’re inspired to give this plugin a go I’d love to hear about your results.

asgi-gzip and datasette-gzip

I mentioned datasette-gzip last week: a plugin that acts as a wrapper around the excellent GZipMiddleware from Starlette.

The performance improvements from this—especially for larger HTML tables, which it turns out compress extremely well—were significant. Enough so that I plan to bring gzip support into Datasette core very shortly.

Since I don’t want to add the whole of Starlette as a dependency just to get gzip support, I extracted that code out into a new Python package called asgi-gzip.
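Using the extracted package looks just like using the Starlette original: you wrap any ASGI application in the middleware class. A sketch, assuming the asgi_gzip module name matches the package:

from asgi_gzip import GZipMiddleware  # assumed import name, matching the package

async def app(scope, receive, send):
    # A tiny ASGI application that returns a repetitive plain text response
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"Hello, world! " * 1000})

# Responses over minimum_size bytes are gzipped for clients that send
# an Accept-Encoding: gzip header
app = GZipMiddleware(app, minimum_size=500)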

The obvious risk with doing this is that it might fall behind the excellent Starlette implementation. So I came up with a pattern based on Git scraping that would automatically open a new GitHub issue should the borrowed Starlette code change in the future.

I wrote about that pattern in Automatically opening issues when tracked file content changes.

Speaking at HYTRADBOI

I spoke at the HYTRADBOI conference last week: Have You Tried Rubbing A Database On It.

HYTRADBOI was organized by Jamie Brandon. It was a neat event, with a smart format: 34 pre-recorded 10 minute long talks, arranged into a schedule to encourage people to watch and discuss them at specific times during the day of the event.

It’s worth reading Jamie’s postmortem of the event for some insightful thinking on online event organization.

My talk was Datasette: a big bag of tricks for solving interesting problems using SQLite. It ended up working out as a lightning-fast 10 minute tutorial on using the sqlite-utils CLI to clean up some data (in this case Manatee Carcass Recovery Locations in Florida since 1974) and then using Datasette to explore and publish it.
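The talk itself uses the sqlite-utils CLI, but the same workflow works from the Python library too. A rough sketch, with made-up rows standing in for the real CSV of recovery locations:

import sqlite_utils

db = sqlite_utils.Database("manatees.db")
# Stand-in records; in the talk these rows come from the Florida CSV file
rows = [
    {"id": 1, "county": "Brevard", "year": 1974},
    {"id": 2, "county": "Lee", "year": 1975},
]
db["locations"].insert_all(rows, pk="id", replace=True)
db["locations"].create_index(["county"], if_not_exists=True)
print(list(db["locations"].rows))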

I’ve posted some basic notes to accompany the talk. My plan is to use this as the basis for an official tutorial on sqlite-utils for the tutorials section of the Datasette website.

Releases this week

TIL this week

Datasette Lite: a server-side Python web application running in a browser 24 days ago

Datasette Lite is a new way to run Datasette: entirely in a browser, taking advantage of the incredible Pyodide project which provides Python compiled to WebAssembly plus a whole suite of useful extras.

You can try it out here:

https://lite.datasette.io/

A screenshot of the pypi_packages database table running in Google Chrome in a page with the URL of lite.datasette.io/#/content/pypi_packages?_facet=author

The initial example loads two databases—the classic fixtures.db used by the Datasette test suite, and the content.db database that powers the official datasette.io website (described in some detail in my post about Baked Data).

You can instead use the “Load database by URL to a SQLite DB” button to paste in a URL to your own database. That file will need to be served with CORS headers that allow it to be fetched by the website (see README).
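If you want to experiment with serving your own database file, the key requirement is that Access-Control-Allow-Origin header. Here’s a minimal sketch using the Python standard library for local experiments; any static host that can add that header will work just as well:

# cors_server.py - serve the current directory with a CORS header so
# a page on another origin is allowed to fetch the .db files
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class CORSRequestHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()

if __name__ == "__main__":
    ThreadingHTTPServer(("127.0.0.1", 8000), CORSRequestHandler).serve_forever()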

Try this URL, for example:

https://congress-legislators.datasettes.com/legislators.db

You can follow this link to open that database in Datasette Lite.

Datasette Lite supports almost all of Datasette’s regular functionality: you can view tables, apply facets, run your own custom SQL queries and export the results as CSV or JSON.

It’s basically the full Datasette experience, except it’s running entirely in your browser with no server (other than the static file hosting provided here by GitHub Pages) required.

I’m pretty stunned that this is possible now.

I had to make some small changes to Datasette to get this to work, detailed below, but really nothing extravagant—the demo is running the exact same Python code as the regular server-side Datasette application, just inside a web worker process in a browser rather than on a server.

The implementation is pretty small—around 300 lines of JavaScript. You can see the code in the simonw/datasette-lite repository—in two files, index.html and webworker.js

Why build this?

I built this because I want as many people as possible to be able to use my software.

I’ve invested a ton of effort in reducing the friction to getting started with Datasette. I’ve documented the install process, I’ve packaged it for Homebrew, I’ve written guides to running it on Glitch, I’ve built tools to help deploy it to Heroku, Cloud Run, Vercel and Fly.io. I even taught myself Electron and built a macOS Datasette Desktop application, so people could install it without having to think about their Python environment.

Datasette Lite is my latest attempt at this. Anyone with a browser that can run WebAssembly can now run Datasette in it—if they can afford the 10MB load (which in many places with metered internet access is way too much).

I also built this because I’m fascinated by WebAssembly and I’ve been looking for an opportunity to really try it out.

And, I find this project deeply amusing. Running a Python server-side web application in a browser still feels like an absurd thing to do. I love that it works.

I’m deeply inspired by JupyterLite. Datasette Lite’s name is a tribute to that project.

How it works: Python in a Web Worker

Datasette Lite does most of its work in a Web Worker—a separate process that can run expensive CPU operations (like an entire Python interpreter) without blocking the main browser’s UI thread.

The worker starts running when you load the page. It loads a WebAssembly compiled Python interpreter from a CDN, then installs Datasette and its dependencies into that interpreter using micropip.

It also downloads the specified SQLite database files using the browser’s HTTP fetching mechanism and writes them to a virtual in-memory filesystem managed by Pyodide.

Once everything is installed, it imports datasette and creates a Datasette() object called ds. This object stays resident in the web worker.

To render pages, the index.html page sends a message to the web worker specifying which Datasette path has been requested—/ for the homepage, /fixtures for the database index page, /fixtures/facetable for a table page and so on.

The web worker then simulates an HTTP GET against that path within Datasette using the following code:

response = await ds.client.get(path, follow_redirects=True)

This takes advantage of a really useful internal Datasette API: datasette.client is an HTTPX client object that can be used to execute HTTP requests against Datasette internally, without doing a round-trip across the network.

I initially added datasette.client with the goal of making any JSON APIs that Datasette provides available for internal calls by plugins as well, and to make it easier to write automated tests. It turns out to have other interesting applications too!
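The same trick works anywhere you have a Datasette instance, which is what makes it so useful for tests and plugins. A small standalone example (the SQL and URL shape here mirror the test script later in this post):

import asyncio
from datasette.app import Datasette

async def main():
    ds = Datasette(memory=True)
    # Simulates a GET against the in-memory database's JSON API, with no
    # network round-trip involved
    response = await ds.client.get(
        "/_memory.json?sql=select+3+*+5+as+answer&_shape=array"
    )
    print(response.json())  # [{"answer": 15}]

asyncio.run(main())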

The web worker sends a message back to index.html with the status code, content type and content retrieved from Datasette. JavaScript in index.html then injects that HTML into the page using .innerHTML.

To get internal links working, Datasette Lite uses a trick I originally learned from jQuery: it applies a capturing event listener to the area of the page displaying the content, such that any link clicks or form submissions will be intercepted by a JavaScript function. That JavaScript can then turn them into new messages to the web worker rather than navigating to another page.

Some annotated code

Here are annotated versions of the most important pieces of code. In index.html this code manages the worker and updates the page when it receives messages from it:

// Load the worker script
const datasetteWorker = new Worker("webworker.js");

// Extract the ?url= from the current page's URL
const initialUrl = new URLSearchParams(location.search).get('url');

// Message that to the worker: {type: 'startup', initialUrl: url}
datasetteWorker.postMessage({type: 'startup', initialUrl});

// This function does most of the work - it responds to messages sent
// back from the worker to the index page:
datasetteWorker.onmessage = (event) => {
  // {type: log, line: ...} messages are appended to a log textarea:
  var ta = document.getElementById('loading-logs');
  if (event.data.type == 'log') {
    loadingLogs.push(event.data.line);
    ta.value = loadingLogs.join("\n");
    ta.scrollTop = ta.scrollHeight;
    return;
  }
  let html = '';
  // If it's an {error: ...} message show it in a <pre> in a <div>
  if (event.data.error) {
    html = `<div style="padding: 0.5em"><h3>Error</h3><pre>${escapeHtml(event.data.error)}</pre></div>`;
  // If contentType is text/html, show it as straight HTML
  } else if (/^text\/html/.exec(event.data.contentType)) {
    html = event.data.text;
  // For contentType of application/json parse and pretty-print it
  } else if (/^application\/json/.exec(event.data.contentType)) {
    html = `<pre style="padding: 0.5em">${escapeHtml(JSON.stringify(JSON.parse(event.data.text), null, 4))}</pre>`;
  // Anything else (likely CSV data) escape it and show in a <pre>
  } else {
    html = `<pre style="padding: 0.5em">${escapeHtml(event.data.text)}</pre>`;
  }
  // Add the result to <div id="output"> using innerHTML
  document.getElementById("output").innerHTML = html;
  // Update the document.title if a <title> element is present
  let title = document.getElementById("output").querySelector("title");
  if (title) {
    document.title = title.innerText;
  }
  // Scroll to the top of the page after each new page is loaded
  window.scrollTo({top: 0, left: 0});
  // If we're showing the initial loading indicator, hide it
  document.getElementById('loading-indicator').style.display = 'none';
};

The webworker.js script is where the real magic happens:

// Load Pyodide from the CDN
importScripts("https://cdn.jsdelivr.net/pyodide/dev/full/pyodide.js");

// Deliver log messages back to the index.html page
function log(line) {
  self.postMessage({type: 'log', line: line});
}

// This function initializes Pyodide and installs Datasette
async function startDatasette(initialUrl) {
  // Mechanism for downloading and saving specified DB files
  let toLoad = [];
  if (initialUrl) {
    let name = initialUrl.split('.db')[0].split('/').slice(-1)[0];
    toLoad.push([name, initialUrl]);
  } else {
    // If no ?url= provided, loads these two demo databases instead:
    toLoad.push(["fixtures.db", "https://latest.datasette.io/fixtures.db"]);
    toLoad.push(["content.db", "https://datasette.io/content.db"]);
  }
  // This does a LOT of work - it pulls down the WASM blob and starts it running
  self.pyodide = await loadPyodide({
    indexURL: "https://cdn.jsdelivr.net/pyodide/dev/full/"
  });
  // We need these packages for the next bit of code to work
  await pyodide.loadPackage('micropip', log);
  await pyodide.loadPackage('ssl', log);
  await pyodide.loadPackage('setuptools', log); // For pkg_resources
  try {
    // Now we switch to Python code
    await self.pyodide.runPythonAsync(`
    # Here's where we download and save those .db files - they are saved
    # to a virtual in-memory filesystem provided by Pyodide

    # pyfetch is a wrapper around the JS fetch() function - calls using
    # it are handled by the browser's regular HTTP fetching mechanism
    from pyodide.http import pyfetch
    names = []
    for name, url in ${JSON.stringify(toLoad)}:
        response = await pyfetch(url)
        with open(name, "wb") as fp:
            fp.write(await response.bytes())
        names.append(name)

    import micropip
    # Workaround for Requested 'h11<0.13,>=0.11', but h11==0.13.0 is already installed
    await micropip.install("h11==0.12.0")
    # Install Datasette itself!
    await micropip.install("datasette==0.62a0")
    # Now we can create a Datasette() object that can respond to fake requests
    from datasette.app import Datasette
    ds = Datasette(names, settings={
        "num_sql_threads": 0,
    }, metadata = {
        # This metadata is displayed in Datasette's footer
        "about": "Datasette Lite",
        "about_url": "https://github.com/simonw/datasette-lite"
    })
    `);
    datasetteLiteReady();
  } catch (error) {
    self.postMessage({error: error.message});
  }
}

// Outside promise pattern
// https://github.com/simonw/datasette-lite/issues/25#issuecomment-1116948381
let datasetteLiteReady;
let readyPromise = new Promise(function(resolve) {
  datasetteLiteReady = resolve;
});

// This function handles messages sent from index.html to webworker.js
self.onmessage = async (event) => {
  // The first message should be that startup message, carrying the URL
  if (event.data.type == 'startup') {
    await startDatasette(event.data.initialUrl);
    return;
  }
  // This promise trick ensures that we don't run the next block until we
  // are certain that startDatasette() has finished and the ds.client
  // Python object is ready to use
  await readyPromise;
  // Run the request in Python to get a status code, content type and text
  try {
    let [status, contentType, text] = await self.pyodide.runPythonAsync(
      `
      import json
      # ds.client.get(path) simulates running a request through Datasette
      response = await ds.client.get(
          # Using json here is a quick way to generate a quoted string
          ${JSON.stringify(event.data.path)},
          # If Datasette redirects to another page we want to follow that
          follow_redirects=True
      )
      [response.status_code, response.headers.get("content-type"), response.text]
      `
    );
    // Message the results back to index.html
    self.postMessage({status, contentType, text});
  } catch (error) {
    // If an error occurred, send that back as a {error: ...} message
    self.postMessage({error: error.message});
  }
};

One last bit of code: here’s the JavaScript in index.html which intercepts clicks on links and turns them into messages to the worker:

let output = document.getElementById('output');
// This captures any click on any element within <div id="output">
output.addEventListener('click', (ev => {
  // .closest("a") traverses up the DOM to find if this is an a
  // or an element nested in an a. We ignore other clicks.
  var link = ev.srcElement.closest("a");
  if (link && link.href) {
    // It was a click on a <a href="..."> link! Cancel the event:
    ev.stopPropagation();
    ev.preventDefault();
    // I want #fragment links to still work, using scrollIntoView()
    if (isFragmentLink(link.href)) {
      // Jump them to that element, but don't update the URL bar
      // since we use # in the URL to mean something else
      let fragment = new URL(link.href).hash.replace("#", "");
      if (fragment) {
        let el = document.getElementById(fragment);
        el.scrollIntoView();
      }
      return;
    }
    let href = link.getAttribute("href");
    // Links to external sites should open in a new window
    if (isExternal(href)) {
      window.open(href);
      return;
    }
    // It's an internal link navigation - send it to the worker
    loadPath(href);
  }
}), true);

function loadPath(path) {
  // We don't want anything after #, and we only want the /path
  path = path.split("#")[0].replace("http://localhost", "");
  // Update the URL with the new # location
  history.pushState({path: path}, path, "#" + path);
  // Plausible analytics, see:
  // https://github.com/simonw/datasette-lite/issues/22
  useAnalytics && plausible('pageview', {u: location.href.replace('?url=', '').replace('#', '/')});
  // Send a {path: "/path"} message to the worker
  datasetteWorker.postMessage({path});
}

Getting Datasette to work in Pyodide

Pyodide is the secret sauce that makes this all possible. That project provides several key components:

  • A custom WebAssembly build of the core Python interpreter, bundling the standard library (including a compiled WASM version of SQLite)
  • micropip—a package that can install additional Python dependencies by downloading them from PyPI
  • A comprehensive JavaScript to Python bridge, including mechanisms for translating Python objects to JavaScript and vice-versa
  • A JavaScript API for launching and then managing a Python interpreter process

I found the documentation on Using Pyodide in a web worker particularly helpful.

I had to make a few changes to Datasette to get it working with Pyodide. My tracking issue for that has the full details, but the short version is:

  • Ensure each of Datasette’s dependencies had a wheel package on PyPI (as opposed to just a .tar.gz)—micropip only works with wheels. I ended up removing python-baseconv as a dependency and replacing click-default-group with my own click-default-group-wheel forked package (repo here). I got sqlite-utils working in Pyodide with this change too, see the 3.26.1 release notes.
  • Work around an error caused by importing uvicorn. Since Datasette Lite doesn’t actually run its own web server that dependency wasn’t necessary, so I changed my code to catch the ImportError in the right place.
  • The biggest change: WebAssembly can’t run threads, which means Python can’t run threads, which means any attempts to start a thread in Python cause an error. Datasette only uses threads in one place: to execute SQL queries in a thread pool where they won’t block the event loop. I added a new --setting num_sql_threads 0 feature for disabling threading entirely, see issue 1735.

Having made those changes I shipped them in a Datasette 0.62a0 release. It’s this release that Datasette Lite installs from PyPI.

Fragment hashes for navigation

You may have noticed that as you navigate through Datasette Lite the URL bar updates with URLs that look like the following:

https://lite.datasette.io/#/content/pypi_packages?_facet=author

I’m using the # here to separate out the path within the virtual Datasette instance from the URL to the Datasette Lite application itself.

Maintaining the state in the URL like this means that the Back and Forward browser buttons work, and also means that users can bookmark pages within the application and share links to them.

I usually like to avoid # URLs—the HTML history API makes it possible to use “real” URLs these days, even for JavaScript applications. But in the case of Datasette Lite those URLs wouldn’t actually work—if someone attempted to refresh the page or navigate to a link GitHub Pages wouldn’t know what file to serve.

I could run this on my own domain with a catch-all page handler that serves the Datasette Lite HTML and JavaScript no matter what path is requested, but I wanted to keep this as pure and simple as possible.

This also means I can reserve Datasette Lite’s own query string for things like specifying the database to load, and potentially other options in the future.

Web Workers or Service Workers?

My initial idea for this project was to build it with Service Workers.

Service Workers are some deep, deep browser magic: they let you install a process that can intercept browser traffic to a specific domain (or path within that domain) and run custom code to return a result. Effectively they let you run your own server-side code in the browser itself.

They’re mainly designed for building offline applications, but my hope was that I could use them to offer a full simulation of a server-side application instead.

Here’s my TIL on Intercepting fetch in a service worker that came out of my initial research.

I managed to get a server-side JavaScript “hello world” demo working, but when I tried to add Pyodide I ran into some unavoidable road blocks. It turns out Service Workers are very restricted in which APIs they provide—in particular, they don’t allow XMLHttpRequest calls. Pyodide apparently depends on XMLHttpRequest, so it was unable to run in a Service Worker at all. I filed an issue about it with the Pyodide project.

Initially I thought this would block the whole project, but eventually I figured out a way to achieve the same goals using Web Workers instead.

Is this an SPA or an MPA?

SPAs are Single Page Applications. MPAs are Multi Page Applications. Datasette Lite is a weird hybrid of the two.

This amuses me greatly.

Datasette itself is very deliberately architected as a multi page application.

I think SPAs, as developed over the last decade, have mostly been a mistake. In my experience they take longer to build, have more bugs and provide worse performance than a server-side, multi-page alternative.

Obviously if you are building Figma or VS Code then SPAs are the right way to go. But most web applications are not Figma, and don’t need to be!

(I used to think Gmail was a shining example of an SPA, but it’s so sludgy and slow loading these days that I now see it as more of an argument against the paradigm.)

Datasette Lite is an SPA wrapper around an MPA. It literally simulates the existing MPA by running it in a web worker.

It’s very heavy—it loads 11MB of assets before it can show you anything. But it also inherits many of the benefits of the underlying MPA: it has obvious distinctions between pages, a deeply interlinked interface, working back and forward buttons, it’s bookmarkable and it’s easy to maintain and add new features.

I’m not sure what my conclusion here is. I’m skeptical of SPAs, and now I’ve built a particularly weird one. Is this even a good idea? I’m looking forward to finding that out for myself.

Coming soon: JavaScript!

Another amusing detail about Datasette Lite is that the one part of Datasette that doesn’t work yet is Datasette’s existing JavaScript features!

Datasette currently makes very sparing use of JavaScript in the UI: it’s used to add some drop-down interactive menus (including the handy “cog” menu on column headings) and for a CodeMirror-enhanced SQL editing interface.

JavaScript is used much more extensively by several popular Datasette plugins, including datasette-cluster-map and datasette-vega.

Unfortunately none of this works in Datasette Lite at the moment—because I don’t yet have a good way to turn <script src="..."> links into things that can load content from the Web Worker.

This is one of the reasons I was initially hopeful about Service Workers.

Thankfully, since Datasette is built on the principles of progressive enhancement this doesn’t matter: the application remains usable even if none of the JavaScript enhancements are applied.

I have an open issue for this. I welcome suggestions as to how I can get all of Datasette’s existing JavaScript working in the new environment with as little effort as possible.

Bonus: Testing it with shot-scraper

In building Datasette Lite, I’ve committed to making Pyodide a supported runtime environment for Datasette. How can I ensure that future changes I make to Datasette—accidentally introducing a new dependency that doesn’t work there for example—don’t break in Pyodide without me noticing?

This felt like a great opportunity to exercise my shot-scraper CLI tool, in particular its ability to run some JavaScript against a page and pass or fail a CI job depending on if that JavaScript throws an error.

Pyodide needs you to run it from a real web server, not just an HTML file saved to disk—so I put together a very scrappy shell script which builds a Datasette wheel package, starts a localhost file server (using python3 -m http.server), then uses shot-scraper javascript to execute a test against it that installs Datasette from the wheel using micropip and confirms that it can execute a simple SQL query via the JSON API.

Here’s the script in full, with extra comments:

#!/bin/bash
set -e
# I always forget to do this in my bash scripts - without it, any
# commands that fail in the script won't result in the script itself
# returning a non-zero exit code. I need it for running tests in CI.

# Build the wheel - this generates a file with a name similar to
# dist/datasette-0.62a0-py3-none-any.whl
python3 -m build

# Find the name of that wheel file, strip off the dist/
wheel=$(basename $(ls dist/*.whl))
# $wheel is now datasette-0.62a0-py3-none-any.whl

# Create a blank index page that loads Pyodide
echo '
<script src="https://cdn.jsdelivr.net/pyodide/v0.20.0/full/pyodide.js"></script>
' > dist/index.html

# Run a localhost web server for that dist/ folder, in the background
# so we can do more stuff in this script
cd dist
python3 -m http.server 8529 &
cd ..

# Now we use shot-scraper to run a block of JavaScript against our
# temporary web server. This will execute in the context of that
# index.html page we created earlier, which has loaded Pyodide
shot-scraper javascript http://localhost:8529/ "
async () => {
  // Load Pyodide and all of its necessary assets
  let pyodide = await loadPyodide();
  // We also need these packages for Datasette to work
  await pyodide.loadPackage(['micropip', 'ssl', 'setuptools']);
  // We need to escape the backticks because of Bash escaping rules
  let output = await pyodide.runPythonAsync(\`
    import micropip
    # This is needed to avoid a dependency conflict error
    await micropip.install('h11==0.12.0')
    # Here we install the Datasette wheel package we created earlier
    await micropip.install('http://localhost:8529/$wheel')
    # These imports avoid Pyodide errors importing datasette itself
    import ssl
    import setuptools
    from datasette.app import Datasette
    # num_sql_threads=0 is essential or Datasette will crash, since
    # Pyodide and WebAssembly cannot start threads
    ds = Datasette(memory=True, settings={'num_sql_threads': 0})
    # Simulate a hit to execute 'select 55 as itworks' and return the text
    (await ds.client.get(
      '/_memory.json?sql=select+55+as+itworks&_shape=array'
    )).text
  \`);
  // The last expression in the runPythonAsync block is returned, here
  // that's the text returned by the simulated HTTP response to the JSON API
  if (JSON.parse(output)[0].itworks != 55) {
    // This throws if the JSON API did not return the expected result
    // shot-scraper turns that into a non-zero exit code for the script
    // which will cause the CI task to fail
    throw 'Got ' + output + ', expected itworks: 55';
  }
  // This gets displayed on the console, with a 0 exit code for a pass
  return 'Test passed!';
}
"

# Shut down the server we started earlier, by searching for and killing
# a process that's running on the port we selected
pkill -f 'http.server 8529'

Automatically opening issues when tracked file content changes one month ago

I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.

Extracting GZipMiddleware from Starlette

Here’s why I needed to solve this problem.

I want to add gzip support to my Datasette open source project. Datasette builds on the Python ASGI standard, and Starlette provides an extremely well tested, robust GZipMiddleware class that adds gzip support to any ASGI application. As with everything else in Starlette, it’s really good code.

The problem is, I don’t want to add the whole of Starlette as a dependency. I’m trying to keep Datasette’s core as small as possible, so I’m very careful about new dependencies. Starlette itself is actually very light (and only has a tiny number of dependencies of its own) but I still don’t want the whole thing just for that one class.

So I decided to extract the GZipMiddleware class into a separate Python package, under the same BSD license as Starlette itself.

The result is my new asgi-gzip package, now available on PyPI.

What if Starlette fixes a bug?

The problem with extracting code like this is that Starlette is a very effectively maintained package. What if they make improvements or fix bugs in the GZipMiddleware class? How can I make sure to apply those same fixes to my extracted copy?

As I thought about this challenge, I realized I had most of the solution already.

Git scraping is the name I’ve given to the trick of running a periodic scraper that writes to a git repository in order to track changes to data over time.

It may seem redundant to do this against a file that already lives in version control elsewhere—but in addition to tracking changes, Git scraping can offer a cheap and easy way to add automation that triggers when a change is detected.

I need an actionable alert any time the Starlette code changes so I can review the change and apply a fix to my own library, if necessary.

Since I already run all of my projects out of GitHub issues, automatically opening an issue against the asgi-gzip repository would be ideal.

My track.yml workflow does exactly that: it implements the Git scraping pattern against the gzip.py module in Starlette, and files an issue any time it detects changes to that file.

Starlette haven’t made any changes to that file since I started tracking it, so I created a test repo to try this out.

Here’s one of the example issues. I decided to include the visual diff in the issue description and have a link to it from the underlying commit as well.

Screenshot of an open issue page. The issues is titled "gzip.py was updated" and contains a visual diff showing the change to a file. A commit that references the issue is listed too.

How it works

The implementation is contained entirely in this track.yml workflow. I designed this to be contained as a single file to make it easy to copy and paste it to adapt it for other projects.

It uses actions/github-script, which makes it easy to do things like file new issues using JavaScript.

Here’s a heavily annotated copy:

name: Track the Starlette version of this

# Run on repo pushes, and if a user clicks the "run this action" button,
# and on a schedule at 5:21am UTC every day
on:
  push:
  workflow_dispatch:
  schedule:
  - cron:  '21 5 * * *'

# Without this block I got this error when the action ran:
# HttpError: Resource not accessible by integration
permissions:
  # Allow the action to create issues
  issues: write
  # Allow the action to commit back to the repository
  contents: write

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: actions/github-script@v6
      # Using env: here to demonstrate how an action like this can
      # be adjusted to take dynamic inputs
      env:
        URL: https://raw.githubusercontent.com/encode/starlette/master/starlette/middleware/gzip.py
        FILE_NAME: tracking/gzip.py
      with:
        script: |
          const { URL, FILE_NAME } = process.env;
          // promisify pattern for getting an await version of child_process.exec
          const util = require("util");
          // Used exec_ here because 'exec' variable name is already used:
          const exec_ = util.promisify(require("child_process").exec);
          // Use curl to download the file
          await exec_(`curl -o ${FILE_NAME} ${URL}`);
          // Use 'git diff' to detect if the file has changed since last time
          const { stdout } = await exec_(`git diff ${FILE_NAME}`);
          if (stdout) {
            // There was a diff to that file
            const title = `${FILE_NAME} was updated`;
            const body =
              `${URL} changed:` +
              "\n\n```diff\n" +
              stdout +
              "\n```\n\n" +
              "Close this issue once those changes have been integrated here";
            const issue = await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: title,
              body: body,
            });
            const issueNumber = issue.data.number;
            // Now commit and reference that issue number, so the commit shows up
            // listed at the bottom of the issue page
            const commitMessage = `${FILE_NAME} updated, refs #${issueNumber}`;
            // https://til.simonwillison.net/github-actions/commit-if-file-changed
            await exec_(`git config user.name "Automated"`);
            await exec_(`git config user.email "actions@users.noreply.github.com"`);
            await exec_(`git add -A`);
            await exec_(`git commit -m "${commitMessage}" || exit 0`);
            await exec_(`git pull --rebase`);
            await exec_(`git push`);
          }

In the asgi-gzip repository I keep the fetched gzip.py file in a tracking/ directory. This directory isn’t included in the Python package that gets uploaded to PyPI—it’s there only so that my code can track changes to it over time.

More interesting applications

I built this to solve my "tell me when Starlette update their gzip.py file" problem, but clearly this pattern has much more interesting uses.

You could point this at any web page to get a new GitHub issue opened when that page content changes. Subscribe to notifications for that repository and you get a robust, shared mechanism for alerts—plus an issue system where you can post additional comments and close the issue once someone has reviewed the change.

There’s a lot of potential here for solving all kinds of interesting problems. And it doesn’t cost anything either: GitHub Actions (somehow) remains completely free for public repositories!

Elsewhere

27th May 2022

  • Architecture Notes: Datasette (via) I was interviewed for the first edition of Architecture Notes—a new publication (website and newsletter) about software architecture created by Mahdi Yusuf. We covered a bunch of topics in detail: ASGI, SQLite and asyncio, Baked Data, plugin hook design, Python in WebAssembly, Python in an Electron app and more. Mahdi also turned my scrappy diagrams into beautiful illustrations for the piece. #27th May 2022, 3:20 pm

26th May 2022

  • upptime (via) “Open-source uptime monitor and status page, powered entirely by GitHub Actions, Issues, and Pages.” This is a very creative (ab)use of GitHub Actions: it runs a scheduled action to check the availability of sites that you specify, records the results in a YAML file (with the commit history tracking them over time) and can automatically open a GitHub issue for you if it detects a new incident. #26th May 2022, 3:53 am
  • Benjamin "Zags" Zagorsky: Handling Timezones in Python. The talks from PyCon US have started appearing on YouTube. I found this one really useful for shoring up my Python timezone knowledge: It reminds that if your code calls datetime.now(), datetime.utcnow() or date.today(), you have timezone bugs—you’ve been working with ambiguous representations of instances in time that could span a 26 hour interval from UTC-12 to UTC+14. date.today() represents a 24 hour period and hence is prone to timezone surprises as well. My code has a lot of timezone bugs! #26th May 2022, 3:40 am

22nd May 2022

  • Paint Holding - reducing the flash of white on same-origin navigations. I missed this when it happened back in 2019: Chrome (and apparently Safari too—not sure about Firefox) implemented a feature where rather than showing a blank screen in between page navigations Chrome “waits briefly before starting to paint, especially if the page is fast enough”. As a result, fast loading multi-page applications become almost indistinguishable from SPAs (single-page apps). It’s a really neat feature, and now that I know how it works I realize that it explains why page navigations have felt a lot snappier to me over the past few years. #22nd May 2022, 2:50 am
  • The balance has shifted away from SPAs (via) “There’s a feeling in the air. A zeitgeist. SPAs are no longer the cool kids they once were 10 years ago.” Nolan Lawson offers some opinions on why the pendulum seems to be swinging back in favour of server-side rendering over rendering every page entirely on the client. He argues that paint holding, back-forward caching and service workers have made the benefits of SPAs over MPAs much less apparent. I’m inclined to agree. #22nd May 2022, 2:47 am

21st May 2022

  • GOV.UK Guidance: Documenting APIs (via) Characteristically excellent guide from GOV.UK on writing great API documentation. “Task-based guidance helps users complete the most common integration tasks, based on the user needs from your research.” #21st May 2022, 11:31 pm

18th May 2022

  • Comby (via) Describes itself as “Structural search and replace for any language”. Lets you execute search and replace patterns that look a little bit like simplified regular expressions, but with some deep OCaml-powered magic that makes them aware of comment, string and nested parenthesis rules for different languages. This means you can use it to construct scripts that automate common refactoring or code upgrade tasks. #18th May 2022, 5:47 am

16th May 2022

  • Supercharging GitHub Actions with Job Summaries (via) GitHub Actions workflows can now generate a rendered Markdown summary of, well, anything that you can think to generate as part of the workflow execution. I particularly like the way this is designed: they provide a filename in a $GITHUB_STEP_SUMMARY environment variable which you can then append data to from each of your steps. #16th May 2022, 11:02 pm
  • Heroku: Core Impact (via) Ex-Heroku engineer Brandur Leach pulls together some of the background information circulating concerning the now more than a month long Heroku security incident and provides some ex-insider commentary on what went right and what went wrong with a platform that left a huge, if somewhat underappreciated impact on the technology industry at large. #16th May 2022, 4:24 am
