Weeknotes: Embeddings, more embeddings and Datasette Cloud

17th September 2023

Since my last weeknotes, a flurry of activity. LLM has embeddings support now, and Datasette Cloud has driven some major improvements to the wider Datasette ecosystem.

Embeddings in LLM

LLM gained embedding support in version 0.9, and then got binary embedding support (for CLIP) in version 0.10. I wrote about those releases in detail in two dedicated blog entries.

Embeddings are a fascinating tool. If you haven’t got your head around them yet the first of my blog entries tries to explain why they are so interesting.

There’s a lot more I want to build on top of embeddings—most notably, LLM (or Datasette, or likely a combination of the two) will be growing support for Retrieval Augmented Generation on top of the LLM embedding mechanism.

Annotated releases

I always include a list of new releases in my weeknotes. This time I’m going to use those to illustrate the themes I’ve been working on.

The first group of releases relates to LLM and its embedding support. LLM 0.10 extended that support:

  • llm 0.10—2023-09-12
    Access large language models from the command-line

Embedding models can now be built as LLM plugins. I’ve released two of those so far: llm-sentence-transformers and llm-clip.

The CLIP one is particularly fun, because it genuinely allows you to build a sophisticated image search engine that runs entirely on your own computer!

  • symbex 1.4—2023-09-05
    Find the Python code for specified symbols

Symbex is my tool for extracting symbols—functions, methods and classes—from Python code. I introduced that in Symbex: search Python code for functions and classes, then pipe them into a LLM.

Symbex 1.4 adds a tiny but impactful feature: it can now output a list of symbols as JSON, CSV or TSV. These output formats are designed to be compatible with the new llm embed-multi command, which means you can easily create embeddings for all of your functions:

symbex '*' '*:*' --nl | \
  llm embed-multi symbols - \
  --format nl --database embeddings.db --store

I haven’t fully explored what this enables yet, but it should mean that both related functions and semantic function search (“Find me a function that downloads a CSV”) are now easy to build.
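The LLM Python API can run those similarity queries directly. Here’s a minimal sketch, assuming the embeddings.db and symbols collection created by the command above (and that the embedding model used to populate it is still installed):

import llm
import sqlite_utils

# Open the database created by the embed-multi command above
db = sqlite_utils.Database("embeddings.db")
collection = llm.Collection("symbols", db)

# Embed the query text, then print the closest stored symbols
for entry in collection.similar("function that downloads a CSV", number=5):
    print(entry.score, entry.id)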

Yet another thing you can do with embeddings is use them to find clusters of related items. My llm-cluster plugin adds an llm cluster command for doing exactly that.

The neatest feature of llm-cluster is that you can ask it to generate names for these clusters by sending the names of the items in each cluster through another language model, something like this:

llm cluster issues 10 \
  -d issues.db \
  --summary \
  --prompt 'Short, concise title for this cluster of related documents'

One last embedding-related project: datasette-llm-embed is a tiny plugin that adds an llm_embed() SQL function, used like this: select llm_embed('sentence-transformers/all-mpnet-base-v2', 'This is some text'). I built it to support quickly prototyping embedding-related ideas in Datasette.

Spending time with embedding models has led me to spend more time with Hugging Face. I realized last week that the Hugging Face models page sorted by downloads doubles as a list of the models that are most likely to be easy to use.

One of the models I tried out was Salesforce BLIP, an astonishing model that can genuinely produce usable captions for images.

It’s really easy to work with. I ended up building this tiny little CLI tool that wraps the model:

  • blip-caption 0.1—2023-09-10
    Generate captions for images with Salesforce BLIP
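The underlying pattern is only a few lines of code. Here’s a rough sketch using the transformers library (the model ID, image path and token limit are illustrative; blip-caption itself may differ in the details):

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the BLIP captioning model and its pre/post-processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Generate a caption for a single image
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))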

Releases driven by Datasette Cloud

Datasette Cloud continues to drive improvements to the wider Datasette ecosystem as a whole.

It runs on the latest Datasette 1.0 alpha series, taking advantage of the JSON write API.
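That API means you can insert rows into a hosted table with a single HTTP POST. A minimal sketch—the URL, table and token here are hypothetical, while the /-/insert endpoint and {"rows": [...]} payload shape are from the 1.0 alpha documentation:

import httpx

response = httpx.post(
    "https://demo.datasette.cloud/data/reports/-/insert",
    json={"rows": [{"title": "Example", "score": 42}]},
    headers={"Authorization": "Bearer dstok_xxx"},  # a database-backed API token
)
response.raise_for_status()
print(response.json())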

This also means that it’s been highlighting breaking changes in 1.0 that have caused old plugins to break, either subtly or completely.

This has driven a bunch of new plugin releases. Some of these are compatible with both 0.x and 1.x—the ones that only work with the 1.x alphas are themselves marked as alpha releases.

Datasette Cloud’s API works using database-backed access tokens, to ensure users can revoke tokens if they need to (something that’s not easily done with purely signed tokens) and that each token can record when it was most recently used.

I’ve been building that into the existing datasette-auth-tokens plugin.

Alex Garcia has been working with me building out features for Datasette Cloud, generously sponsored by Fly.io.

We’re beginning to build out social features for Datasette Cloud—features that will help teams privately collaborate on data investigations together.

Alex has been building datasette-short-links as an experimental link shortener. In building that, we realized that we needed a mechanism for resolving actor IDs displayed in a list (e.g. this link created by X) to their actual names.

Datasette doesn’t dictate the shape of actor representations, and there’s no guarantee that actors would be represented in a predictable table.

So... we needed a new plugin hook. I released Datasette 1.0a6 with a new hook, actors_from_ids(datasette, actor_ids), which can be used to answer the question “who are the actors represented by these IDs?”
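A plugin can implement the hook by looking actor IDs up wherever they actually live. Here’s a sketch, assuming a hypothetical users table in an attached database (the hook can also return a plain dictionary instead of an awaitable):

from datasette import hookimpl

@hookimpl
def actors_from_ids(datasette, actor_ids):
    db = datasette.get_database("users")

    async def inner():
        # One placeholder per ID, then map id -> actor dictionary
        sql = "select id, name from users where id in ({})".format(
            ", ".join("?" for _ in actor_ids)
        )
        result = await db.execute(sql, list(actor_ids))
        return {row["id"]: dict(row) for row in result.rows}

    return inner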

Alex is using this in datasette-short-links, and I built two plugins to work with the new hook as well:

Datasette Cloud lets users insert, edit and delete rows from their tables, using the plugin Alex built called datasette-write-ui which he introduced on the Datasette Cloud blog.

This inspired me to finally put out a fresh release of datasette-edit-schema—the plugin which provides the ability to edit table schemas—adding and removing columns, changing column types, even altering the order columns are stored in the table.

datasette-edit-schema 0.6 is a major release, with three significant new features:

  • You can now create a brand new table from scratch!
  • You can edit the table’s primary key
  • You can modify the foreign key constraints on the table

Those last two became important when I realized that Datasette’s API is much more interesting if there are foreign key relationships to follow.

Combine that with datasette-write-ui and Datasette Cloud now has a full set of features for building, populating and editing tables—backed by a comprehensive JSON API.

  • sqlite-migrate 0.1a2—2023-09-03
    A simple database migration system for SQLite, based on sqlite-utils

sqlite-migrate is still marked as an alpha, but won’t be for much longer: it’s my attempt at a migration system for SQLite, inspired by Django migrations but with a less sophisticated set of features.
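Migrations live in a migrations.py file as decorated Python functions, applied in order by the sqlite-utils migrate command. A sketch of the shape, based on the current alpha (the app, table and migration names here are illustrative):

from sqlite_migrate import Migrations

migrations = Migrations("myapp")

@migrations()
def create_embeddings_table(db):
    # db is a sqlite_utils.Database; each migration runs once per database
    db["embeddings"].create({"id": str, "embedding": bytes}, pk="id")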

I’m using it in LLM now to manage the schema used to store embeddings, and it’s beginning to show up in some Datasette plugins as well. I’ll be promoting this to non-alpha status pretty soon.

  • sqlite-utils 3.35.1—2023-09-09
    Python CLI utility and library for manipulating SQLite databases

A tiny fix in this release which, with hindsight, was less impactful than I thought.

I spotted a bug on Datasette Cloud when I configured full-text search on a column, then edited the schema and found that searches no longer returned the correct results.

It turned out the rowid column in SQLite was being rewritten by calls to the sqlite-utils table.transform() method. FTS records are related to their underlying row by rowid, so this was breaking search!

I pushed out a fix for this in 3.35.1. But then... I learned that rowids in SQLite have always been unstable—they can be rewritten any time someone VACUUMs a table!
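You can demonstrate that instability in a few lines of Python—a self-contained sketch using the standard library sqlite3 module:

import os
import sqlite3
import tempfile

conn = sqlite3.connect(os.path.join(tempfile.mkdtemp(), "demo.db"))
conn.execute("create table docs (body text)")
conn.executemany("insert into docs (body) values (?)", [("a",), ("b",), ("c",)])
conn.commit()

# Delete the row with rowid 1, leaving a gap
conn.execute("delete from docs where body = 'a'")
conn.commit()
print(conn.execute("select rowid, body from docs").fetchall())
# [(2, 'b'), (3, 'c')]

conn.execute("VACUUM")
print(conn.execute("select rowid, body from docs").fetchall())
# [(1, 'b'), (2, 'c')] - the rowids have been rewritten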

I’ve been designing future features for Datasette that assume that rowid is a useful stable identifier for a row. This clearly isn’t going to work! I’m still thinking through the consequences, but I think there may be Datasette features (like the ability to comment on a row) that will only work for tables with a proper primary key.

sqlite-chronicle

  • sqlite-chronicle 0.1—2023-09-11
    Use triggers to track when rows in a SQLite table were updated or deleted

This is very early, but I’m excited about the direction it’s going in.

I keep on finding problems where I want to be able to synchronize various processes with the data in a table.

I built sqlite-history a few months ago, which uses SQLite triggers to create a full copy of the updated data every time a row in a table is edited.

That’s a pretty heavyweight solution. What if there was something lighter that could achieve a lot of the same goals?

sqlite-chronicle uses triggers to instead create what I’m calling a “chronicle table”. This is a shadow table that records, for every row in the main table, four integer values:

  • added_ms—the timestamp in milliseconds when the row was added
  • updated_ms—the timestamp in milliseconds when the row was last updated
  • version—a constantly incrementing version number, global across the entire table
  • deleted—set to 1 if the row has been deleted

Just storing four integers (plus copies of the primary key) makes this a pretty tiny table, and hopefully one that’s cheap to update via triggers.

But... having this table enables some pretty interesting things—because external processes can track the last version number that they saw and use it to see just which rows have been inserted and updated since that point.
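That suggests a simple polling pattern. A sketch, assuming the shadow table for a creatures table is called _chronicle_creatures (the exact naming is an assumption on my part):

import sqlite3

conn = sqlite3.connect("data.db")

last_seen = 42  # the highest version this process handled previously
changed = conn.execute(
    "select * from _chronicle_creatures where version > ? order by version",
    (last_seen,),
).fetchall()

for row in changed:
    ...  # re-index, sync or alert on the matching rows in creatures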

I gave a talk at DjangoCon a few years ago called the denormalized query engine pattern, describing the challenge of syncing an external search index like Elasticsearch with data held in a relational database.

These chronicle tables can solve that problem, and can be applied to a whole host of other problems too. So far I’m thinking about the following:

  • Publishing SQLite databases up to Datasette, sending only the rows that have changed since the last sync. I wrote a prototype that does this and it seems to work very well.
  • Copying a table from Datasette Cloud to other places—a desktop copy, or another instance, or even into an alternative database such as PostgreSQL or MySQL, in a way that only copies and deletes rows that have changed.
  • Saved search alerts: run a SQL query against just rows that were modified since the last time that query ran, then send alerts if any rows are matched.
  • Showing users a note that “34 rows in this table have changed since your last visit”, then displaying those rows.

I’m sure there are many more applications for this. I’m looking forward to finding out what they are!

I needed to fix a bug in Datasette Cloud by moving a table from one database to another... so I built a little plugin for sqlite-utils that adds a sqlite-utils move-tables origin.db destination.db tablename command. I love being able to build single-use features as plugins like this.

And some TILs

This was a fun TIL exercising the new embeddings feature in LLM. I used Django SQL Dashboard to break my blog entries up into paragraphs, exported those as CSV, piped the results into llm embed-multi and then used that to build a CLI-driven semantic search engine for my blog.

llama.cpp has grammars now, which enable you to control the exact output format of the LLM. I’m optimistic that these could be used to implement an equivalent to OpenAI Functions on top of Llama 2 and similar models. So far I’ve just got them to output arrays of JSON objects.
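Here’s the rough shape of that using the llama-cpp-python bindings—the model path is illustrative, and json_arr.gbnf is one of the example grammars that ships with llama.cpp:

from llama_cpp import Llama, LlamaGrammar

# Constrain generation so the model can only emit a valid JSON array
grammar = LlamaGrammar.from_file("json_arr.gbnf")
llm = Llama(model_path="llama-2-13b-chat.Q4_K_M.gguf")

output = llm(
    "List three dog breeds as an array of JSON objects with name and size keys:",
    grammar=grammar,
    max_tokens=512,
)
print(output["choices"][0]["text"])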

I’m using this trick a lot at the moment. I have API access to Claude now, which has a 100,000 token context limit (GPT-4 is just 8,000 by default). That’s enough to summarize 100+ comment threads from Hacker News, for which I’m now using this prompt:

Summarize the themes of the opinions expressed here, including quotes (with author attribution) where appropriate.

The quotes part has been working really well—it turns out summaries of themes with illustrative quotes are much more interesting, and so far my spot checks haven’t found any that were hallucinated.

cr-sqlite adds full CRDTs to SQLite, which should enable multiple databases to accept writes independently and then seamlessly merge them together. It’s a very exciting capability!

It turns out Hugging Face offers free scale-to-zero hosting for demos that run in Docker containers on machines with a full 16GB of RAM! I’m used to optimizing Datasette for tiny 256MB containers, so having this much memory available is a real treat.
