Weeknotes: A bunch of things I learned this week, plus datasette-explain
9th February 2023
I’ve also written an unusually large number of TILs, which reflect some of the other pieces of research I’ve been digging into.
I’m going to start these weeknotes with an annotated set of links to TILs.
This TIL is actually a learned-several-years-ago: I wrote up notes on hierarchical CTEs in SQLite in an old Gist, long before I started my TIL website. I’ve now promoted that to a full article.
I’ve been wanting to run some form of language model on my own hardware for ages... and I finally found one that works!
Crucially though, it’s small enough to run on a MacBook M2 (and likely on less powerful machines as well, though I haven’t tried that). And the embeddings it generates seem to provide really good results for finding similar content, which is exactly what I want to use them for.
... even more exciting, I did manage to train and run a GPT-style generative language model! I used nanoGPT by Andrej Karpathy, which trains really nicely on an M2. I managed to get it to spit out some terrible Shakespeare, which you can see in the TIL.
Not mentioned in the TIL: I also tried training a model against the full text content of my blog. Here’s some text I managed to get it to output:
Google is a group of software as it looks further reading as PHP’s enough for Linux, but I need it’s pretty much more for me to be served for me than Pingback. I’ll put out a background idea with a problem with a new tool (in group time accessible). I am getting a web page for that it’s great site in my HTML. I’m done using an entry interface for their mouse gestures (which is done) I have a few more of the site.
I mean, it’s total junk... but it does have a hint of my voice to it.
Update 9th Feb 2023: A few people asked for more details about this so I wrote them up in another TIL: Training nanoGPT entirely on content from my blog.
This is trained from scratch, it’s not from a pre-existing language model. Training it to the point where it can output meaningful sentences would require mixing in TBs of text and training for many years. But it’s cool to have got something working!
This TIL represented another of my long-term goals.
The thing that excites me most about WebAssembly is sandboxing: I want to be able to run untrusted code safely my own hardware—both for personal use and for things like server-side web applications I build that might want to let users customize with their own code.
There are multiple builds of Python in WebAssembly now—I use one of them in my Datasette Lite web application. But until recently I’ve been unable to figure out how to run them in a sandbox inside a regular Python script, using a WebAssembly library such as wasmer-python or wasmtime-py or pywasm3 (all three of which are frustratingly under-documented).
It works! My TIL shows my own version inspired by that example. It’s a little inelegant—in particular, you have to redirect standard output and standard error to temporary files—but I do now at least have a way to run untrusted Python code in a sandbox, for the first time.
Twitter announced they were ending all free bots. Then they un-announced that a few days later, but by then I’d already decided to move my bot to Mastodon.
https://fedi.simonwillison.net/@covidsewage is my Mastodon bot that posts a screenshot of the latest Covid sewage numbers for parts of the Bay Area every morning.
My TIL describes how I built it, using scheduled GitHub Actions. I was about to build a new tiny Python CLI tool for this, but in checking for available names on PyPI I found toot which is a superset of the tool I had been planning to build myself (a Mastodon version of my tweet-images utility.)
Not much to say about this one: it relates to
select datetime('now') in SQLite. There’s nothing that’s too small for the TIL format!
Another tiny one. I write TILs like this one mainly to ensure that when I (or anyone else) searches for a confusing error message in the future there’s an obvious explanation of what it means.
This one is a lot more substantial: I wrote up a detailed explanation of a pattern I’ve been exploring in SQLite that lets you fetch back a row AND a limited number of related rows in a single SELECT query. If you’ve ever used Django’s prefetch_related you’ll recognise the problem.
As part of working on this I built a little Datasette plugin for running and showing
explain query plan ... output for a query, which deserves its own mention...
Just one new plugin this week: datasette-explain, which adds a feature to the Datasette SQL query editing page which constantly runs
explain query plan ... on the query you’re typing and shows the results.
This is partly an experiment in what it feels like for Datasette to run interactive code relating to queries as you type them.
Here’s a demo as an animated GIF:
The plugin works by sending the SQL query you have typed to a
/db/-/explain endpoint in the background. This endpoint does a few things:
- It runs
explain query plan ...on the query you have typed so far. If this returns an error message (because the query isn’t complete or includes problems) that message is displayed to the end user.
- If there’s no error, the output of that explanation is shown on the page. This gets pretty interesting as queries touch on more features, like the subquery select expression described earlier—demo of that here.
- It has one more trick up its sleeve: it runs a
explain select ...query and uses the results from that to figure out which SQLite tables are referenced in the query—then fetches a full list of columns for those tables and shows them on the page as well. The intent here is to help remind you of which columns are available on the tables you are querying or joining against, as you type.
This plugin is very much an alpha: I’m not convinced it’s displaying the right data in the right way yet.
I really like the error checking mode—much more so than the explain output and table and columns. I’m tempted to bring live error checking like this to Datasette core.
It’s worth highlighting that the queries themselves are really fast—no matter how complex the SQL query, running
explain query plan ... against it is very quick. That’s why I’m currently not thinking too hard about having the full results of the query show live on the page while the query is being typed.
... although, that might be worth exploring in another prototype at some point. Datasette has a good mechanism for cutting off queries that take too long to load, so I could try previewing queries but only if they can be run in less than 100ms or so.
At any rate, it’s an interesting direction to explore.
Releases this week
Explain and validate SQL queries as you type them into Datasette
db-to-sqlite: 1.5—(17 releases total)—2023-02-07
CLI tool for exporting tables or queries from any SQL database to a SQLite file
shot-scraper: 1.1.1—(26 releases total)—2023-01-30
A command-line utility for taking automated screenshots of websites
More recent articles
- Weeknotes: the Datasette Cloud API, a podcast appearance and more - 1st October 2023
- Things I've learned about building CLI tools in Python - 30th September 2023
- Talking Large Language Models with Rooftop Ruby - 29th September 2023
- Weeknotes: Embeddings, more embeddings and Datasette Cloud - 17th September 2023
- Build an image search engine with llm-clip, chat with models with llm chat - 12th September 2023
- LLM now provides tools for working with embeddings - 4th September 2023
- Datasette 1.0a4 and 1.0a5, plus weeknotes - 30th August 2023
- Making Large Language Models work for you - 27th August 2023
- Datasette Cloud, Datasette 1.0a3, llm-mlc and more - 16th August 2023
- How I make annotated presentations - 6th August 2023