Weeknotes: A bunch of things I learned this week, plus datasette-explain
The Datasette table view refactor, JSON redesign and
?_extra= continues this week, mainly in this ongoing pull request and this tracking issue.
I’ve also written an unusually large number of TILs, which reflect some of the other pieces of research I’ve been digging into.
I’m going to start these weeknotes with an annotated set of links to TILs.
TIL: Using recursive CTEs to explore hierarchical Twitter threads
This TIL is actually a learned-several-years-ago: I wrote up notes on hierarchical CTEs in SQLite in an old Gist, long before I started my TIL website. I’ve now promoted that to a full article.
TIL: Calculating embeddings with gtr-t5-large in Python
I’ve been wanting to run some form of language model on my own hardware for ages... and I finally found one that works!
gtr-t5-large isn’t a full GPT-3 style generative text language model: it’s an embeddings model, similar to the OpenAI embeddings API I described a few weeks ago.
Crucially though, it’s small enough to run on a MacBook M2 (and likely on less powerful machines as well, though I haven’t tried that). And the embeddings it generates seem to provide really good results for finding similar content, which is exactly what I want to use them for.
TIL: Running nanoGPT on a MacBook M2 to generate terrible Shakespeare
... even more exciting, I did manage to train and run a GPT-style generative language model! I used nanoGPT by Andrej Karpathy, which trains really nicely on an M2. I managed to get it to spit out some terrible Shakespeare, which you can see in the TIL.
Not mentioned in the TIL: I also tried training a model against the full text content of my blog. Here’s some text I managed to get it to output:
Google is a group of software as it looks further reading as PHP’s enough for Linux, but I need it’s pretty much more for me to be served for me than Pingback. I’ll put out a background idea with a problem with a new tool (in group time accessible). I am getting a web page for that it’s great site in my HTML. I’m done using an entry interface for their mouse gestures (which is done) I have a few more of the site.
I mean, it’s total junk... but it does have a hint of my voice to it.
Update 9th Feb 2023: A few people asked for more details about this so I wrote them up in another TIL: Training nanoGPT entirely on content from my blog.
This is trained from scratch, it’s not from a pre-existing language model. Training it to the point where it can output meaningful sentences would require mixing in TBs of text and training for many years. But it’s cool to have got something working!
TIL: Run Python code in a WebAssembly sandbox
This TIL represented another of my long-term goals.
The thing that excites me most about WebAssembly is sandboxing: I want to be able to run untrusted code safely my own hardware—both for personal use and for things like server-side web applications I build that might want to let users customize with their own code.
There are multiple builds of Python in WebAssembly now—I use one of them in my Datasette Lite web application. But until recently I’ve been unable to figure out how to run them in a sandbox inside a regular Python script, using a WebAssembly library such as wasmer-python or wasmtime-py or pywasm3 (all three of which are frustratingly under-documented).
After I complained about how hard this was noisily in a few different places, Tim Bart stepped in and figured out this example using
wasmtime and the new Python WASM build released by VMware labs.
It works! My TIL shows my own version inspired by that example. It’s a little inelegant—in particular, you have to redirect standard output and standard error to temporary files—but I do now at least have a way to run untrusted Python code in a sandbox, for the first time.
TIL: Building Mastodon bots with GitHub Actions and toot
Twitter announced they were ending all free bots. Then they un-announced that a few days later, but by then I’d already decided to move my bot to Mastodon.
https://fedi.simonwillison.net/@covidsewage is my Mastodon bot that posts a screenshot of the latest Covid sewage numbers for parts of the Bay Area every morning.
My TIL describes how I built it, using scheduled GitHub Actions. I was about to build a new tiny Python CLI tool for this, but in checking for available names on PyPI I found toot which is a superset of the tool I had been planning to build myself (a Mastodon version of my tweet-images utility.)
TIL: The SQLite now argument is stable within the same query
Not much to say about this one: it relates to
select datetime('now') in SQLite. There’s nothing that’s too small for the TIL format!
TIL: Avoiding “length” errors in Apache Bench with the -l option
Another tiny one. I write TILs like this one mainly to ensure that when I (or anyone else) searches for a confusing error message in the future there’s an obvious explanation of what it means.
TIL: Subqueries in select expressions in SQLite
This one is a lot more substantial: I wrote up a detailed explanation of a pattern I’ve been exploring in SQLite that lets you fetch back a row AND a limited number of related rows in a single SELECT query. If you’ve ever used Django’s prefetch_related you’ll recognise the problem.
As part of working on this I built a little Datasette plugin for running and showing
explain query plan ... output for a query, which deserves its own mention...
Just one new plugin this week: datasette-explain, which adds a feature to the Datasette SQL query editing page which constantly runs
explain query plan ... on the query you’re typing and shows the results.
This is partly an experiment in what it feels like for Datasette to run interactive code relating to queries as you type them.
Here’s a demo as an animated GIF:
The plugin works by sending the SQL query you have typed to a
/db/-/explain endpoint in the background. This endpoint does a few things:
- It runs
explain query plan ...on the query you have typed so far. If this returns an error message (because the query isn’t complete or includes problems) that message is displayed to the end user.
- If there’s no error, the output of that explanation is shown on the page. This gets pretty interesting as queries touch on more features, like the subquery select expression described earlier—demo of that here.
- It has one more trick up its sleeve: it runs a
explain select ...query and uses the results from that to figure out which SQLite tables are referenced in the query—then fetches a full list of columns for those tables and shows them on the page as well. The intent here is to help remind you of which columns are available on the tables you are querying or joining against, as you type.
This plugin is very much an alpha: I’m not convinced it’s displaying the right data in the right way yet.
I really like the error checking mode—much more so than the explain output and table and columns. I’m tempted to bring live error checking like this to Datasette core.
It’s worth highlighting that the queries themselves are really fast—no matter how complex the SQL query, running
explain query plan ... against it is very quick. That’s why I’m currently not thinking too hard about having the full results of the query show live on the page while the query is being typed.
... although, that might be worth exploring in another prototype at some point. Datasette has a good mechanism for cutting off queries that take too long to load, so I could try previewing queries but only if they can be run in less than 100ms or so.
At any rate, it’s an interesting direction to explore.
Releases this week
Explain and validate SQL queries as you type them into Datasette
db-to-sqlite: 1.5—(17 releases total)—2023-02-07
CLI tool for exporting tables or queries from any SQL database to a SQLite file
shot-scraper: 1.1.1—(26 releases total)—2023-01-30
A command-line utility for taking automated screenshots of websites
More recent articles
- I built a ChatGPT plugin to answer questions about data hosted in Datasette - 24th March 2023
- Weeknotes: AI won't slow down, a new newsletter and a huge Datasette refactor - 22nd March 2023
- Don't trust AI to talk accurately about itself: Bard wasn't trained on Gmail - 22nd March 2023
- A conversation about prompt engineering with CBC Day 6 - 18th March 2023
- Could you train a ChatGPT-beating model for $85,000 and run it in a browser? - 17th March 2023
- Stanford Alpaca, and the acceleration of on-device large language model development - 13th March 2023