Weeknotes: Datasette Lite, nogil Python, HYTRADBOI
6th May 2022
My big project this week was Datasette Lite, a new way to run Datasette directly in a browser, powered by WebAssembly and Pyodide. I also continued my research into running SQL queries in parallel, described last week. Plus I spoke at HYTRADBOI.
Datasette Lite
This started out as a research project, inspired by the excitement around Python in the browser from PyCon US last week (which I didn’t attend, but observed with some jealousy on Twitter).
I’ve been wanting to explore this possibility for a while. JupyterLite had convinced me that it would be feasible to run Datasette using Pyodide, especially after I found out that the sqlite3
module from the Python standard library works there already.
I have a private “notes” GitHub repository which I use to keep notes in GitHub issues. I started a thread there researching the possibility of running an ASGI application in Pyodide, thinking that might be a good starting point to getting Datasette to work.
The proof of concept moved remarkably quickly, especially once I realized that Service Workers weren’t going to work but Web Workers might.
Once I had committed to Datasette Lite as a full project I started a new repository for it and transferred across my initial prototype issue thread. You can read that full thread for a blow-by-blow account of how my research came together in datasette-lite issue #1.
The rest of the project is documented in detail in my blog post.
Since launching it, the biggest change I’ve made was a change of URL: since it’s clearly going to be a core component of the Datasette project going forward I promoted it from simonw.github.io/datasette-lite/
to its new permanent home at lite.datasette.io. It’s still hosted by GitHub Pages—here’s my TIL about setting up the new domain.
It may have started as a proof of concept tech demo, but the response to it so far has convinced me that I should really take it seriously. Being able to host Datasette without needing to run any server-side code at all is an incredibly compelling experience.
It doesn’t matter how hard I work on getting the Datasette deployment experience as easy as possible, static file hosting will always be an order of magnitude more accessible. And even at this early stage Datasette Lite is already proving to be a genuinely useful way to run the software.
As part of this research I also shipped sqlite-utils 3.26.1 with a minor dependency fix that means it works in Pyodide now. You can try that out by running the following in the Pyodide REPL:
>>> import micropip
>>> await micropip.install("sqlite-utils")
>>> import sqlite_utils
>>> db = sqlite_utils.Database(memory=True)
>>> list(db.query("select 3 * 5"))
[{'3 * 5': 15}]
Parallel SQL queries work... if you can get rid of the GIL
Last week I described my effort to implement Parallel SQL queries for Datasette.
The idea there was that many Datasette pages execute multiple SQL queries—a count(*) and a select ... limit 101, for example—that could be run in parallel rather than in series, for a potential improvement in page load times.
My hope was that I could get away with this despite Python’s infamous Global Interpreter Lock because the sqlite3
C module releases the GIL when it executes a query.
My initial results weren’t showing an increase in performance, even while the queries were shown to be overlapping each other. I opened a research thread and spent some time this week investigating.
My conclusion, sadly, was that the GIL was indeed to blame. sqlite3
releases the GIL to execute the query, but there’s still a lot of work that happens in Python land itself—most importantly the code that assembles the objects that represent the rows returned by the query, which is still subject to the GIL.
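The effect is easy to reproduce with the standard library alone. This is a hypothetical sketch using a throwaway database, not Datasette’s actual implementation, but it runs the same shape of workload described above—a count and a page of rows on separate threads:

```python
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Build a throwaway database file to stand in for a real Datasette database
db_path = tempfile.NamedTemporaryFile(suffix=".db", delete=False).name
conn = sqlite3.connect(db_path)
conn.execute("create table facts (id integer primary key, body text)")
conn.executemany(
    "insert into facts (body) values (?)",
    [("row %d" % i,) for i in range(1000)],
)
conn.commit()
conn.close()

def run_query(sql):
    # One connection per thread. sqlite3 releases the GIL while the C
    # library executes the query, but assembling the Python row tuples
    # afterwards happens with the GIL held - which is why the overlap
    # delivers less speedup than you might hope.
    c = sqlite3.connect(db_path)
    try:
        return c.execute(sql).fetchall()
    finally:
        c.close()

queries = ["select count(*) from facts", "select * from facts limit 101"]
with ThreadPoolExecutor(max_workers=2) as pool:
    count_rows, page_rows = pool.map(run_query, queries)

print(count_rows[0][0])  # 1000
print(len(page_rows))    # 101
```

Timing this shows the queries overlapping, but the Python-side row assembly keeps the total wall-clock saving small on a stock CPython build.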
Then this comment on a thread about the GIL on Lobsters reminded me of the nogil fork of Python by Sam Gross, who has been working on this problem for several years now.
Since that fork has a Docker image, trying it out was easy... and to my amazement it worked! Running my parallel queries implementation against nogil Python reduced a page load time from 77ms to 47ms.
Sam’s work is against Python 3.9, but he’s discussing options for bringing his improvements into Python itself with the core maintainers. I’m hopeful that this might happen in the next few years. It’s an incredible piece of work.
An amusing coincidence: one restriction of WASM and Pyodide is that they can’t start new threads—so as part of getting Datasette to work on that platform I had to add a new setting that disables the ability to run SQL queries in threads entirely!
datasette-copy-to-memory
One question I found myself asking while investigating parallel SQL queries (before I determined that the GIL was to blame) was whether parallel SQLite queries against the same database file were suffering from some form of file locking or contention.
To rule that out, I built a new plugin: datasette-copy-to-memory—which reads a SQLite database from disk and copies it into an in-memory database when Datasette first starts up.
This didn’t make an observable difference in performance, but I’ve not tested it extensively—especially not against larger databases using servers with increased amounts of available RAM.
If you’re inspired to give this plugin a go I’d love to hear about your results.
asgi-gzip and datasette-gzip
I mentioned datasette-gzip
last week: a plugin that acts as a wrapper around the excellent GZipMiddleware
from Starlette.
The performance improvements from this—especially for larger HTML tables, which it turns out compress extremely well—were significant. Enough so that I plan to bring gzip support into Datasette core very shortly.
Since I don’t want to add the whole of Starlette as a dependency just to get gzip support, I extracted that code out into a new Python package called asgi-gzip.
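For a sense of what gzip middleware like this does, here’s a deliberately simplified sketch—it buffers the whole response body rather than streaming it, unlike the real Starlette code, and the class and demo app names are my own invention:

```python
import asyncio
import gzip

class SimpleGzipMiddleware:
    """Simplified sketch of gzip ASGI middleware - buffers the whole
    response body, unlike the real streaming implementation."""

    def __init__(self, app, minimum_size=500):
        self.app = app
        self.minimum_size = minimum_size

    async def __call__(self, scope, receive, send):
        request_headers = dict(scope.get("headers") or [])
        if b"gzip" not in request_headers.get(b"accept-encoding", b""):
            # Client didn't ask for gzip: pass straight through
            await self.app(scope, receive, send)
            return

        messages = []

        async def capture(message):
            messages.append(message)

        await self.app(scope, receive, capture)

        start = next(m for m in messages if m["type"] == "http.response.start")
        body = b"".join(
            m.get("body", b"") for m in messages
            if m["type"] == "http.response.body"
        )
        if len(body) >= self.minimum_size:
            body = gzip.compress(body)
            headers = [
                (k, v) for k, v in start["headers"]
                if k.lower() != b"content-length"
            ]
            headers += [
                (b"content-encoding", b"gzip"),
                (b"content-length", str(len(body)).encode()),
            ]
            start = {**start, "headers": headers}
        await send(start)
        await send({"type": "http.response.body", "body": body})

# Demo: wrap a trivial ASGI app and check the body round-trips
async def app(scope, receive, send):
    payload = b"x" * 2000
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": payload})

async def demo():
    sent = []
    async def send(message):
        sent.append(message)
    async def receive():
        return {"type": "http.request"}
    scope = {"type": "http", "headers": [(b"accept-encoding", b"gzip")]}
    await SimpleGzipMiddleware(app)(scope, receive, send)
    return sent

sent = asyncio.run(demo())
print(gzip.decompress(sent[1]["body"]) == b"x" * 2000)  # True
```

The compressed 2,000-byte body here shrinks to a few dozen bytes, which is the same property that makes large repetitive HTML tables compress so well.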
The obvious risk with doing this is that it might fall behind the upstream Starlette implementation. So I came up with a pattern based on Git scraping that would automatically open a new GitHub issue should the borrowed Starlette code change in the future.
I wrote about that pattern in Automatically opening issues when tracked file content changes.
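The core of that pattern is just a fingerprint comparison: record a hash of the borrowed file, then re-fetch the upstream copy on a schedule and alert when the hash changes. A minimal illustration—the file content here is made up, not real Starlette source:

```python
import hashlib

def fingerprint(content: bytes) -> str:
    # Stable fingerprint of a file's exact bytes
    return hashlib.sha256(content).hexdigest()

def upstream_changed(fetched: bytes, recorded_hash: str) -> bool:
    # In the real workflow, "fetched" would come from downloading the
    # tracked upstream source file on a schedule; a change here is what
    # would trigger opening a GitHub issue.
    return fingerprint(fetched) != recorded_hash

tracked = b"class GZipMiddleware:\n    ...\n"
recorded = fingerprint(tracked)

print(upstream_changed(tracked, recorded))                          # False
print(upstream_changed(tracked + b"# upstream edit\n", recorded))   # True
```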
Speaking at HYTRADBOI
I spoke at the HYTRADBOI conference last week: Have You Tried Rubbing A Database On It.
HYTRADBOI was organized by Jamie Brandon. It was a neat event with a smart format: 34 pre-recorded ten-minute talks, arranged into a schedule to encourage people to watch and discuss them at specific times during the day of the event.
It’s worth reading Jamie’s postmortem of the event for some insightful thinking on online event organization.
My talk was Datasette: a big bag of tricks for solving interesting problems using SQLite. It ended up working out as a lightning-fast 10 minute tutorial on using the sqlite-utils CLI to clean up some data (in this case Manatee Carcass Recovery Locations in Florida since 1974) and then using Datasette to explore and publish it.
I’ve posted some basic notes to accompany the talk. My plan is to use this as the basis for an official tutorial on sqlite-utils
for the tutorials section of the Datasette website.
Releases this week
- datasette: 0.62a0—(111 releases total)—2022-05-02
  An open source multi-tool for exploring and publishing data
- sqlite-utils: 3.26.1—(100 releases total)—2022-05-02
  Python CLI utility and library for manipulating SQLite databases
- click-default-group-wheel: 1.2.2—2022-05-02
  Extends click.Group to invoke a command without explicit subcommand name (this version publishes a wheel)
- s3-credentials: 0.11—(11 releases total)—2022-05-01
  A tool for creating credentials for accessing S3 buckets
- datasette-copy-to-memory: 0.2—(5 releases total)—2022-04-30
  Copy database files into an in-memory database on startup
- datasette-gzip: 0.2—(2 releases total)—2022-04-28
  Add gzip compression to Datasette
- asgi-gzip: 0.1—2022-04-28
  gzip middleware for ASGI applications, extracted from Starlette
TIL this week