Simon Willison’s Weblog

Recent entries

datasette-scraper, Big Local News and other weeknotes nine days ago

In addition to exploring the new MusicCaps training and evaluation data I’ve been working on the big Datasette JSON refactor, and getting excited about a Datasette project that I didn’t work on at all.

datasette-scraper

The best thing about a plugin system is that you can wake up one day and your software has grown extra features without you even having to review a pull request.

Colin Dellow’s datasette-scraper—first released a few weeks ago—takes that a step further: it’s a plugin that builds an entire custom application on top of Datasette.

It’s really cool!

Colin has a ten minute demo up on YouTube which is well worth checking out.

Screenshot of a dss_crawl_queue_history table, showing a history of crawls - with a window to the side showing the terminal output as the crawler ran

The plugin implements a website crawler which can crawl pages, build up a queue (using sitemap.xml if it’s available), then fetch and cache the content—storing its progress and the resulting data in a SQLite database hosted by Datasette.

It uses some really neat tricks to customize various pieces of the Datasette interface to provide an interface for configuring and controlling the crawler.

Most impressively of all, it implements its own plugin hooks... which means you can use small custom plugins to define how you would like data to be extracted from the pages you are crawling.

Colin has a bunch of other Datasette plugins that are worth checking out too:

  • datasette-rewrite-sql monkey-patches Datasette’s database connection code (since there’s no appropriate plugin hook there yet) to provide its own hook for further plugin functions that can rewrite SQL queries before they get executed.
  • datasette-ersatz-table-valued-functions (ersatz (adj.): made or used as a substitute, typically an inferior one, for something else.) is a delightfully gnarly hack which supports custom table-valued SQL functions in SQLite, despite Python’s sqlite3 module not providing those. It works by rewriting a SQL query against a function that returns JSON to use a gnarly CTE and json_each() combo instead.
  • datasette-ui-extras is brand new: it tweaks the Datasette default interface in various ways, adding features like a sticky header and facets in a sidebar. I’m so excited to see someone experimenting with changes to the default UI in this way, and I fully expect that some of the ideas Colin is playing with here will make it into Datasette core in the future.

Datasette and Big Local News

Big Local News is a collaborative data archive for journalists run by a team out of Stanford.

I worked with Ben Welsh this week to build an updated version of an old integration with Datasette, which went live on Friday.

Here’s the animated GIF demo from their announcement:

The user starts out on the Big Local site, browses into a project, selects a menu item next to a CSV file in that project and runs "Open in Datasette" - they are then taken to a separate Datasette instance where that CSV file is imported into a table, ready for them to interact with.

The code I wrote for this integration lives in simonw/datasette-big-local. I ended up building a custom plugin which exclusively works with Big Local (so it’s not shipped to PyPI).

The main challenge with this project involved authentication and permissions. Big Local has both public and private datasets, and we wanted the integration to work with both. But if someone opened a file in a private project, only they and others with access to that project should be able to view the resulting table.

In the end I solved this by creating a new SQLite database for each project, then configuring Datasette’s authentication system to run a permission check for each user to check that they were allowed to access files in the associated project, via a call to the Big Local GraphQL API.

The first time a user opens a file in Big Local, JavaScript on the site performs an HTTP POST to Datasette, transmitting the project ID, the filename and an authentication token for that user. Datasette can then use that token to call the GraphQL API on their behalf, verifying their permissions and signing them into the Datasette instance using a signed cookie.
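
Here’s a rough sketch of how that flow can be wired up in a Datasette plugin, using the register_routes() plugin hook and Datasette’s standard ds_actor signed cookie. The GraphQL endpoint and query below are placeholders rather than the real Big Local API.

# Rough sketch only - the GraphQL URL and query are placeholders,
# not the real Big Local API.
import httpx
from datasette import hookimpl
from datasette.utils.asgi import Response

GRAPHQL_URL = "https://biglocalnews.example/graphql"  # placeholder


async def open_file(request, datasette):
    data = await request.post_vars()
    project_id = data["project_id"]
    token = data["token"]
    # Use the user's token to ask the GraphQL API who they are and
    # whether they are allowed to see this project:
    async with httpx.AsyncClient() as client:
        api_response = await client.post(
            GRAPHQL_URL,
            json={"query": "{ currentUser { id displayName } }"},
            headers={"Authorization": "Bearer {}".format(token)},
        )
    user = api_response.json()["data"]["currentUser"]
    # Sign the user into Datasette by setting the standard actor cookie:
    response = Response.redirect("/{}".format(project_id))
    response.set_cookie(
        "ds_actor",
        datasette.sign({"a": {"id": user["id"], "token": token}}, "actor"),
    )
    return response


@hookimpl
def register_routes():
    return [(r"^/-/open$", open_file)]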

The datasette-big-local README includes full details about how this all works.

Progress on ?_extra=

I finally started making progress on Datasette issue #262: Add ?_extra= mechanism for requesting extra properties in JSON—first opened back in May 2018!

This is the key step on the journey to slimming down Datasette’s default JSON representation for tables and queries.

I want to return this by default:

{
    "ok": true,
    "rows": [
        {"id": 1, "title": "Example 1"},
        {"id": 2, "title": "Example 2"},
        {"id": 3, "title": "Example 3"}
    ],
    "next": null
}

Then allow users to specify all kinds of extra information—the table schema, the suggested facets, information on columns, the underlying SQL query... all by adding on ?_extra=x parameters to the URL.

This turns out to be a lot of work: I’m having to completely refactor the internals of the most complicated part of Datasette’s codebase.

Still lots more to go, but I’m happy to finally be making progress here.

Improved Datasette examples

The Datasette website has long had an Examples page linked to from the top navigation—and analytics show that it’s one of the most visited pages on the site.

I finally gave that page the upgrade it needed. It now starts with illustrated examples that have been selected to help highlight what Datasette can do—both the variety of problems it can be used to solve, and the way plugins can be used to add extra functionality.

Screenshot of the Examples of sites built using Datasette page, featuring datasette.io and LAION-Aesthetics and github-to-sqlite with GraphQL

shot-scraper 1.1

I implemented the new screenshots for the examples page using my shot-scraper screenshot automation tool.

The screenshots are taken by a GitHub Actions workflow in the datasette-screenshots repository.

I added 5 new screenshot definitions to the YAML that powers that workflow, which is used by the shot-scraper multi command.

In working with shot-scraper I spotted a couple of opportunities for small improvements, so I released version 1.1 with the following changes:

  • New --log-console option for logging the output of calls to console.log() to standard error. #101
  • New --skip and --fail options to specify what should happen if an HTTP 4xx or 5xx error is encountered while trying to load the page. --skip will ignore the error and either exit cleanly or move on to the next screenshot (in the case of multi). --fail will cause the tool to return a non-zero exit code, useful for running in CI environments. #102

datasette-granian

Granian is a new web server for running Python WSGI and ASGI applications, written in Rust.

A while ago I built datasette-gunicorn, a plugin which adds a datasette gunicorn my.db command for serving Datasette using the Gunicorn WSGI server.

datasette-granian now provides the same thing using Granian. It’s an alpha release because I haven’t actually used it in production yet, but it seems to work well and it adds yet another option for people who want to deploy Datasette.

Granian’s creator, Giovanni Barillari, was really helpful in figuring out how to dynamically serve a freshly configured ASGI application rather than just passing a module path to the granian CLI command.

datasette-faiss 0.2

I introduced datasette-faiss a few weeks ago. It’s a plugin that supports fast vector similarity lookups within Datasette using the FAISS vector search library by Facebook Research.

The first release of the plugin created a FAISS index on server startup for each table that contains an embeddings column. Any similarity searches would then be run against that entire table.

But what if you want to combine those searches with other filters in a query? For example, first filter to every article published in 2022, then run a similarity search on what’s left.

In datasette-faiss 0.2 I introduced two new SQLite aggregate functions, faiss_agg() and faiss_agg_with_scores(), designed to handle this case.

The new functions work by constructing a new FAISS index from scratch every time they are called, covering just the rows that were processed by the aggregation.
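
Here’s a simplified illustration of that idea using Python’s sqlite3 create_aggregate() and FAISS. This is a stand-in sketch rather than the plugin’s actual code:

# Simplified illustration of a per-query FAISS aggregate - not the
# actual datasette-faiss implementation.
import json
import sqlite3
import struct

import faiss
import numpy as np


def unpack(blob):
    # Embeddings are stored as packed 32-bit floats
    return struct.unpack("f" * (len(blob) // 4), blob)


class FaissAgg:
    def __init__(self):
        self.ids = []
        self.vectors = []
        self.query = None
        self.k = None

    def step(self, id, embedding, query_embedding, k):
        self.ids.append(id)
        self.vectors.append(unpack(embedding))
        self.query = unpack(query_embedding)
        self.k = k

    def finalize(self):
        if not self.vectors:
            return json.dumps([])
        # Build a brand new index covering just the aggregated rows
        index = faiss.IndexFlatL2(len(self.query))
        index.add(np.array(self.vectors, dtype="float32"))
        _, positions = index.search(np.array([self.query], dtype="float32"), self.k)
        return json.dumps([self.ids[i] for i in positions[0] if i != -1])


conn = sqlite3.connect("blog.db")
conn.create_aggregate("faiss_agg", 4, FaissAgg)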

This is best illustrated with an example. The following query first selects the embeddings for just the blog entries published in 2022, then uses those to find items that are most similar to the provided ID.

with entries_2022 as (
  select
    id,
    embedding
  from
    blog_entry_embeddings
  where
    id in (select id from blog_entry where created like '2022%')
),
faiss as (
  select
    faiss_agg(
      id,
      embedding,
      (select embedding from blog_entry_embeddings where id = :id),
      10
    ) as results
  from
    entries_2022
),
ids as (
  select
    value as id
  from
    json_each(faiss.results),
    faiss
)
select
  blog_entry.id,
  blog_entry.title,
  blog_entry.created
from
  ids
  join blog_entry on ids.id = blog_entry.id

You can try the query out here.

Releases this week

TIL this week

Exploring MusicCaps, the evaluation data released to accompany Google’s MusicLM text-to-music model 11 days ago

Google Research just released MusicLM: Generating Music From Text. It’s a new generative AI model that takes a descriptive prompt and produces a “high-fidelity” music track. Here’s the paper (and a more readable version using arXiv Vanity).

There’s no interactive demo yet, but there are dozens of examples on the site. The prompts are things like this:

A fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. Induces the experience of being lost in space, and the music would be designed to evoke a sense of wonder and awe, while being danceable.

Included are examples of opera, jazz, Peruvian punk, Berlin 90s house and many more. It’s a really fun page to explore.

The MusicCaps dataset

The paper abstract includes this line:

To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.

Later in the paper they provide a little more detail:

To evaluate MusicLM, we prepare MusicCaps, a high-quality music caption dataset, which we make publicly available. This dataset includes 5.5k music clips from AudioSet (Gemmeke et al., 2017), each paired with corresponding text descriptions in English, written by ten professional musicians. For each 10-second music clip, MusicCaps provides: (1) a free-text caption consisting of four sentences on average, describing the music and (2) a list of music aspects, describing genre, mood, tempo, singer voices, instrumentation, dissonances, rhythm, etc. On average, the dataset includes eleven aspects per clip.

Here’s where they published that data on Kaggle.

I love digging into these training datasets—and this one is pretty tiny. I decided to take a look and see what I could learn.

I built musiccaps.datasette.io to support exploring and searching the data.

A Datasette table. Title is "musiccaps_details (view)", subheading is "3 rows where search matches "ballad" and musiccaps_names contains "Electric piano"". The first result is a YouTube video from the channel "Marks Piano - PGN Piano", with a musiccaps_caption of "This audio contains someone playing a ballad on a piano then a male voice starts talking and the piano-playing ends. This song may be playing at home practicing piano."

The dataset itself turns out to not have any audio clips in it at all—instead, each row of the data includes a YouTube video ID and a start and end time for a clip within it.

This is similar to how the LAION dataset used for Stable Diffusion works—that dataset contains URLs to images, but not the images themselves.

A YouTube video ID is just a string like zYM0gtd_PRo—the dataset has no further information.

But... information on those videos is available via the YouTube API. So I fetched back full details of all 5,500 videos and included them in the database as well.

This means we can start answering all sorts of interesting questions.

The search feature is configured to run against the human-written descriptions provided as the key feature of the MusicCaps dataset—try some searches like opera, ballad, guitar, or whimsical.

How I built this

The dataset is available on Kaggle, but since it’s licensed CC BY-SA 4.0 I grabbed a copy of it and dropped the CSV into this GitHub repo.

You can explore that using Datasette Lite at this URL:

https://lite.datasette.io/?csv=https://github.com/simonw/musiccaps-datasette/blob/main/data/musiccaps-public.csv

This was a decent starting point, but it felt really important to be able to click “play” and listen to that audio.

I built a new Datasette plugin for this: datasette-youtube-embed.

The plugin works by looking out for YouTube URLs of the following format:

  • https://www.youtube.com/watch?v=-U16iKiXGuY
  • https://www.youtube.com/watch?v=-U16iKiXGuY&start=30
  • https://www.youtube.com/watch?v=-U16iKiXGuY&start=30&end=40

If it finds one of those, it replaces it with a YouTube embed that passes through the start and end parameters, if present.

This means it can play the exact clip that was referenced by the MusicCaps dataset.

My first attempt at this plugin used regular YouTube embeds, but Datasette defaults to returning up to 100 rows on a page, and 100 YouTube iframe embeds is pretty heavy!

Instead, I switched the plugin to use the Lite YouTube Embed Web Component by Paul Irish.
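
A plugin like this can be built on Datasette’s render_cell() plugin hook. Here’s a simplified sketch of the idea, not the plugin’s exact code:

# Simplified sketch of the idea behind datasette-youtube-embed -
# not the plugin's exact implementation.
from urllib.parse import parse_qs, urlparse

import markupsafe
from datasette import hookimpl


@hookimpl
def render_cell(value):
    if not isinstance(value, str) or not value.startswith(
        "https://www.youtube.com/watch?v="
    ):
        return None  # None means "render this cell as usual"
    params = parse_qs(urlparse(value).query)
    player_params = "&".join(
        "{}={}".format(key, params[key][0])
        for key in ("start", "end")
        if key in params
    )
    # Render a <lite-youtube> Web Component instead of a heavy iframe.
    # The component's JavaScript needs loading separately, e.g. via the
    # extra_js_urls hook.
    return markupsafe.Markup(
        '<lite-youtube videoid="{}" params="{}"></lite-youtube>'.format(
            markupsafe.escape(params["v"][0]), markupsafe.escape(player_params)
        )
    )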

Sadly this means the plugin doesn’t work with Datasette Lite, so I switched to deploying a full Datasette instance to Vercel instead.

Adding video details from the YouTube API

I wanted to add more context about each of the videos. The YouTube Data API has a videos endpoint which accepts a comma-separated list of video IDs (up to 50 at a time, not mentioned in the documentation) and returns details about each video.

After some experimentation, this turned out to be the recipe that gave me the key data I wanted:

https://www.googleapis.com/youtube/v3/videos
  ?part=snippet,statistics
  &id=video_id1,video_id2,video_id3
  &key=youtube-api-key

I built a Jupyter notebook that batched up all of the IDs into groups of 50, fetched the data and wrote it into my SQLite database using sqlite-utils.
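
The shape of that code was roughly this (a simplified sketch, with no error handling):

# Simplified sketch of the fetching code - no error handling.
import httpx
import sqlite_utils

db = sqlite_utils.Database("musiccaps.db")
ids = [row["ytid"] for row in db["musiccaps"].rows]
API_KEY = "..."  # YouTube Data API key


def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i : i + size]


videos = []
for batch in chunks(ids, 50):  # the API accepts up to 50 IDs per call
    data = httpx.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={
            "part": "snippet,statistics",
            "id": ",".join(batch),
            "key": API_KEY,
        },
    ).json()
    for item in data["items"]:
        videos.append(dict(id=item["id"], **item["snippet"], **item["statistics"]))

db["videos"].insert_all(videos, pk="id", alter=True, replace=True)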

The audioset_positive_labels column in the initial CSV had values like /m/0140xf,/m/02cjck,/m/04rlf—these turned out to match IDs in this CSV file of AudioSet categories in the tensorflow/models GitHub repo, so I fetched and transformed those as well.
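
A sketch of that mapping step, assuming the standard class_labels_indices.csv layout of index, mid and display_name columns (the URL below is a placeholder, not the file’s real location):

# Sketch of mapping AudioSet label IDs like /m/04rlf to display names.
# The CSV URL is a placeholder and the column names are assumptions
# based on the standard class_labels_indices.csv layout.
import csv
import io

import httpx

CSV_URL = "https://example.com/class_labels_indices.csv"  # placeholder

reader = csv.DictReader(io.StringIO(httpx.get(CSV_URL).text))
names_by_mid = {row["mid"]: row["display_name"] for row in reader}


def audioset_names(label_string):
    # "/m/0140xf,/m/02cjck,/m/04rlf" -> comma-separated display names
    return ", ".join(
        names_by_mid.get(label, label) for label in label_string.split(",")
    )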

I had to do a little bit of extra cleanup to get everything working how I wanted. The final result was two tables, with the following schema:

CREATE TABLE [musiccaps] (
   [ytid] TEXT PRIMARY KEY,
   [url] TEXT,
   [caption] TEXT,
   [aspect_list] TEXT,
   [audioset_names] TEXT,
   [author_id] TEXT,
   [start_s] TEXT,
   [end_s] TEXT,
   [is_balanced_subset] INTEGER,
   [is_audioset_eval] INTEGER,
   [audioset_ids] TEXT
);
CREATE TABLE [videos] (
   [id] TEXT PRIMARY KEY,
   [publishedAt] TEXT,
   [channelId] TEXT,
   [title] TEXT,
   [description] TEXT,
   [thumbnails] TEXT,
   [channelTitle] TEXT,
   [tags] TEXT,
   [categoryId] TEXT,
   [liveBroadcastContent] TEXT,
   [localized] TEXT,
   [viewCount] INTEGER,
   [likeCount] INTEGER,
   [favoriteCount] INTEGER,
   [commentCount] INTEGER,
   [defaultAudioLanguage] TEXT,
   [defaultLanguage] TEXT
);

I configured SQLite full-text search against the musiccaps.caption column.
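
With sqlite-utils that step is a one-liner, something like this:

import sqlite_utils

db = sqlite_utils.Database("musiccaps.db")
# Enable SQLite FTS on the caption column, with triggers to keep it updated
db["musiccaps"].enable_fts(["caption"], create_triggers=True)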

The last step was to create a SQL view that combined the key data from the two tables. After some more iteration I came up with this one:

CREATE VIEW musiccaps_details AS select
  musiccaps.url as video,
  json_object(
    'label',
    coalesce(videos.title, 'Missing from YouTube'),
    'href',
    musiccaps.url
  ) as youtube_link,
  musiccaps.caption as musiccaps_caption,
  videos.publishedAt as youtube_published,
  videos.channelTitle as youtube_channel,
  videos.description as youtube_description,
  musiccaps.audioset_names as musiccaps_names,
  musiccaps.aspect_list as musiccaps_aspects,
  musiccaps.author_id as musiccaps_author,
  videos.id as youtube_id,
  musiccaps.rowid as musiccaps_rowid
from
  musiccaps
  left join videos on musiccaps.ytid = videos.id;

I built a custom template for the instance homepage to add a search box, then shipped the whole thing to Vercel using the datasette-publish-vercel plugin.

Let me know what you find

Digging around in this data is a lot of fun. I’d love to hear what you find. Hit me up on Mastodon if you find anything interesting!

Weeknotes: AI hacking and a SpatiaLite tutorial 23 days ago

Short weeknotes this time because the key things I worked on have already been covered here:

Releases this week

TIL this week

How to implement Q&A against your documentation with GPT3, embeddings and Datasette 25 days ago

If you’ve spent any time with GPT-3 or ChatGPT, you’ve likely thought about how useful it would be if you could point them at a specific, current collection of text or documentation and have it use that as part of its input for answering questions.

It turns out there is a neat trick for doing exactly that. I’ve been experimenting with it using my Datasette web application as a rapid prototyping platform for trying out new AI techniques using custom SQL functions.

Here’s how to do this:

  • Run a text search (or a semantic search, described later) against your documentation to find content that looks like it could be relevant to the user’s question
  • Grab extracts of that content and glue them all together into a blob of text
  • Construct a prompt consisting of that text followed by “Given the above content, answer the following question: ” and the user’s question
  • Send the whole thing through the GPT-3 API and see what comes back

I’ve been calling this the semantic search answers pattern.

This is a form of prompt engineering (and hence is prone to prompt injection attacks, so bear that in mind). You need to fit your overall prompt into the token limit for GPT-3, which is currently 4,000 tokens. A token is more-or-less the same thing as a word.
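
Here’s the overall shape of that pattern as a short Python sketch. The extracts come from whatever search you ran in the first step, and the model and parameters are just illustrative:

# Sketch of the overall pattern. The extracts come from your own
# search step; the model and parameters here are illustrative.
import openai


def answer_question(question, extracts, api_key):
    openai.api_key = api_key
    context = "\n\n".join(extracts)  # glue the extracts together
    prompt = (
        context
        + "\n\nGiven the above content, answer the following question: "
        + question
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=500,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()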

Since publishing this post I’ve been pointed to the Question Answering using Embeddings notebook in the OpenAI cookbook which describes this exact same technique. I’ve also learned that this technique is also known as Retrieval-Augmented Generation, as described in this paper from May 2020.

In this post I’ll show how I implemented this pattern using SQLite, Datasette and some custom SQL functions and tools written in Python.

First, some demos

I’ve implemented this pattern against my blog—3,000 posts going back over twenty years. I built it using some custom SQLite SQL functions that call the OpenAI APIs.

First, some examples of questions and answers I ran against my blog:

What does shot scraper do?

Shot-scraper is a command-line tool that helps automate the process of taking screenshots of web pages and scraping data from them using JavaScript. It can be used to take one-off screenshots or take multiple screenshots in a repeatable way by defining them in a YAML file. It can also be used to execute JavaScript on a page and return the resulting value.

shot-scraper is a tool I released in 2022—after the 2021 GPT-3 training cut-off date. This is a very good summary—better I think than anything I’ve written about shot-scraper myself.

What museums has Simon been to?

Simon has been to the Pitt Rivers museum, the British Museum, the Science Museum, the Natural History Museum, the V&A, the Victoria and Albert Museum, and the Pioneer Museum in Paso Robles. He has also seen the Giant Squid at the Natural History Museum and Charles Darwin’s collection at the spirit building.

This is entirely correct, though I’ve talked about a lot more museums than that on my blog.

What are some unconventional things you can do with GitHub Actions?

  1. Create dynamic content for GitHub repository templates using cookiecutter and GitHub Actions
  2. Automate repetitive tasks like code linting, testing, and deployment using GitHub Actions
  3. Run scheduled scrapers in GitHub Actions
  4. Use GitHub Actions to trigger a webhook call when a repository is forked
  5. Create an automated issue-tracking system with GitHub Issues and GitHub Actions
  6. Automatically extract captions and transcripts from online videos using Whisper and GitHub Actions.

Some of my relevant posts:

I’ve never written anything about triggering webhook calls or issue-tracking systems though, so 4 and 5 in that list are a little suspicious.

What is the big grey box at the top of the utility pole? What does it do?

The big grey box at the top of the utility pole is likely a transformer. Transformers are used to step down the voltage of electricity from the high-voltage lines that run along the pole to the lower-voltage lines that are used to provide electricity to homes and businesses.

I’m including this example because I’ve never written anything even remotely related to transformers and utility poles on my blog. This demonstrates that in the absence of useful context GPT-3 will answer the question entirely by itself—which may or may not be what you want from this system.

Try this out yourself

If you want to try this out yourself you’ll need to get your own API key from OpenAI. I don’t want to foot the bill for people using my blog as a free source of GPT-3 prompt answers!

You can sign up for one here. I believe they are still running a free trial period.

Now head over to this page:

https://datasette.simonwillison.net/simonwillisonblog/answer_question?_hide_sql=1

You’ll need to paste in your OpenAI key. I am not logging these anywhere, and the form stores these in a cookie in order to avoid transmitting it over a GET query string where it could be accidentally logged somewhere.

Then type in your question and see what comes back!

Let’s talk about how this all works—in a whole lot of detail.

Semantic search using embeddings

You can implement the first step of this sequence using any search engine you like—but there’s a catch: we are encouraging users here to ask questions, which increases the chance that they might include text in their prompt which doesn’t exactly match documents in our index.

“What are the key features of Datasette?” for example might miss blog entries that don’t include the word “feature” even though they describe functionality of the software in detail.

What we want here is semantic search—we want to find documents that match the meaning of the user’s search term, even if the matching keywords are not present.

OpenAI have a less well-known API that can help here, which had a big upgrade (and major price reduction) back in December: their embedding model.

Update 31st January 2023: I figured out how to run an alternative embeddings model that can execute entirely on my laptop, described here: Calculating embeddings with gtr-t5-large in Python.

An embedding is a list of floating point numbers.

As an example, consider a latitude/longitude location: it’s a list of two floating point numbers. You can use those numbers to find other nearby points by calculating distances between them.

Add a third number and now you can plot locations in three dimensional space—and still calculate distances between them to find the closest points.

This idea keeps on working even as we go beyond three dimensions: you can calculate distances between vectors of any length, no matter how many dimensions they have.

So if we can represent some text in a many-dimensional vector space, we can calculate distances between those vectors to find the closest matches.

The OpenAI embedding model lets you take any string of text (up to a ~8,000 word length limit) and turn that into a list of 1,536 floating point numbers. We’ll call this list the “embedding” for the text.

These numbers are derived from a sophisticated language model. They take a vast amount of knowledge of human language and flatten that down to a list of floating point numbers—at 4 bytes per floating point number that’s 4*1,536 = 6,144 bytes per embedding—6KiB.

The distance between two embeddings represents how semantically similar the two texts are to each other.

The two most obvious applications of this are search and similarity scores.

Take a user’s search term. Calculate its embedding. Now find the distance between that embedding and every pre-calculated embedding in a corpus and return the 10 closest results.

Or for document similarity: calculate embeddings for every document in a collection, then look at each one in turn and find the closest other embeddings: those are the documents that are most similar to it.

For my semantic search answers implementation, I use an embedding-based semantic search as the first step to find the best matches for the question. I then assemble these top 5 matches into the prompt to pass to GPT-3.

Calculating embeddings

Embeddings can be calculated from text using the OpenAI embeddings API. It’s really easy to use:

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"input": "Your text string goes here",
       "model":"text-embedding-ada-002"}'

The documentation doesn’t mention this, but you can pass a list of strings (up to 2048 according to the official Python library source code) as "input" to run embeddings in bulk:

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"input": ["First string", "Second string", "Third string"],
       "model":"text-embedding-ada-002"}'

The returned data from this API looks like this:

{
  "data": [
    {
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],
      "index": 0,
      "object": "embedding"
    }
  ]
}

As expected, it’s a list of 1,536 floating point numbers.
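
The equivalent bulk call using the official openai Python library looks something like this:

import openai

openai.api_key = "sk-..."  # your OpenAI API key

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=["First string", "Second string", "Third string"],
)
# One embedding per input string, each a list of 1,536 floats
embeddings = [item["embedding"] for item in response["data"]]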

I’ve been storing embeddings as a binary string that appends all of the floating point numbers together, using their 4-byte representation.

Here are the tiny Python functions I’ve been using for doing that:

import struct

def decode(blob):
    return struct.unpack("f" * 1536, blob)

def encode(values):
    return struct.pack("f" * 1536, *values)

I then store them in SQLite blob columns in my database.

I wrote a custom tool for doing this, called openai-to-sqlite. I can run it like this:

openai-to-sqlite embeddings simonwillisonblog.db \
  --sql 'select id, title, body from blog_entry' \
  --table blog_entry_embeddings

This concatenates together the title and body columns from that table, runs them through the OpenAI embeddings API and stores the results in a new table called blog_entry_embeddings with the following schema:

CREATE TABLE [blog_entry_embeddings] (
   [id] INTEGER PRIMARY KEY,
   [embedding] BLOB
)

I can join this against the blog_entry table by ID later on.

Finding the closest matches

The easiest way to calculate similarity between two embedding arrays is to use cosine similarity. A simple Python function for that looks like this:

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)

You can brute-force find the top matches for a table by executing that comparison for every row and returning the ones with the highest score.
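
In plain Python that brute force looks something like this, reusing the decode() and cosine_similarity() functions from above:

# Brute-force scan: score every stored embedding against the query.
import sqlite3

db = sqlite3.connect("simonwillisonblog.db")


def top_matches(query_embedding, n=10):
    scored = [
        (cosine_similarity(query_embedding, decode(blob)), id)
        for id, blob in db.execute(
            "select id, embedding from blog_entry_embeddings"
        )
    ]
    scored.sort(reverse=True)
    return scored[:n]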

I added this to my datasette-openai Datasette plugin as a custom SQL function called openai_embedding_similarity(). Here’s a query that uses it:

with input as (
  select
    embedding
  from
    blog_entry_embeddings
  where
    id = :entry_id
),
top_n as (
  select
    id,
    openai_embedding_similarity(
      blog_entry_embeddings.embedding,
      input.embedding
    ) as score
  from
    blog_entry_embeddings,
    input
  order by
    score desc
  limit
    20
)
select
  score,
  blog_entry.id,
  blog_entry.title
from
  blog_entry
  join top_n on blog_entry.id = top_n.id

Try that out here.

This takes as input the ID of one of my blog entries and returns a list of the other entries, ordered by their similarity score.

Unfortunately this is pretty slow! It takes over 1.3s to run against all 3,000 embeddings in my blog.

I did some research and found that a highly regarded solution for fast vector similarity calculations is FAISS, by Facebook AI Research. It has neat Python bindings and can be installed using pip install faiss-cpu (the -gpu version requires a GPU).

FAISS works against an in-memory index. My blog’s Datasette instance uses the baked data pattern which means the entire thing is re-deployed any time the data changes—as such, I can spin up an in-memory index once on startup without needing to worry about updating the index continually as rows in the database change.

So I built another plugin to do that: datasette-faiss—which can be configured to build an in-memory FAISS index against a configured table on startup, and can then be queried using another custom SQL function.

Here’s the related entries query from above rewritten to use the FAISS index:

with input as (
  select
    embedding
  from
    blog_entry_embeddings
  where
    id = :entry_id
),
top_n as (
  select value as id from json_each(
    faiss_search(
      'simonwillisonblog',
      'blog_entry_embeddings',
      input.embedding,
      20
    )
  ), input
)
select
  blog_entry.id,
  blog_entry.title
from
  blog_entry
  join top_n on blog_entry.id = top_n.id

This one runs in 4.8ms!

faiss_search(database_name, table_name, embedding, n) returns a JSON array of the top n IDs from the specified embeddings table, based on distance scores from the provided embedding.

The json_each() trick here is a workaround for the fact that Python’s SQLite driver doesn’t yet provide an easy way to write table-valued functions—SQL functions that return something in the shape of a table.

Instead, I use json_each() to turn the JSON array of IDs returned by faiss_search() into a table that I can run further joins against.

Implementing semantic search with embeddings

So far we’ve just seen embeddings used for finding similar items. Let’s implement semantic search, using a user-provided query.

This is going to need an API key again, because it involves a call to OpenAI to run embeddings against the user’s search query.

Here’s the SQL query:

select
  value,
  blog_entry.title,
  substr(blog_entry.body, 0, 500)
from
  json_each(
    faiss_search(
      'simonwillisonblog',
      'blog_entry_embeddings',
      (
        select
          openai_embedding(:query, :_cookie_openai_api_key)
      ),
      10
    )
  )
  join blog_entry on value = blog_entry.id
  where length(coalesce(:query, '')) > 0

Try that here (with some extra cosmetic tricks).

We’re using a new function here: openai_embedding()—which takes some text and an API key and returns an embedding for that text.

The API key comes from :_cookie_openai_api_key—this is a special Datasette mechanism called magic parameters which can read variables from cookies.

The datasette-cookies-for-magic-parameters plugin notices these and turns them into an interface for the user to populate the cookies with, as described earlier.

One last trick: adding where length(coalesce(:query, '')) > 0 to the query means that the query won’t run if the user hasn’t entered any text into the search box.

Constructing a prompt from semantic search query results

Getting back to our semantic search answers pattern.

We need a way to construct a prompt for GPT-3 using the results of our semantic search query.

There’s one big catch: GPT-3 has a length limit, and it’s strictly enforced. If you pass even one token over that limit you’ll get an error.

We want to use as much material from the top five search results as possible, leaving enough space for the rest of the prompt (the user’s question and our own text) and the prompt response.

I ended up solving this with another custom SQL function:

select openai_build_prompt(content, 'Context:
------------
', '
------------
Given the above context, answer the following question: ' || :question,
  500
  ) from search_results

This function works as an aggregate function—it takes a table of results and returns a single string.

It takes the column to aggregate—in this case content—as the first argument. Then it takes a prefix and a suffix, which are concatenated together with the aggregated content in the middle.

The final argument is the number of tokens to allow for the response.

The function then attempts to truncate each of the input values to the maximum length that will still allow them all to be concatenated together while staying inside that 4,000 token limit.
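
Here’s a crude sketch of that idea, approximating a token as four characters. This is not the real openai_build_prompt() implementation:

# Crude illustration of the truncation idea - not the real
# openai_build_prompt() implementation. Approximates 1 token ~= 4 chars.
def build_prompt(texts, prefix, suffix, response_tokens, token_limit=4000):
    overhead_tokens = (len(prefix) + len(suffix)) // 4 + response_tokens
    budget_chars = ((token_limit - overhead_tokens) // len(texts)) * 4
    truncated = [text[:budget_chars] for text in texts]
    return prefix + "\n".join(truncated) + suffix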

Adding it all together

With all of the above in place, the following query is my full implementation of semantic search answers against my blog:

with query as (
  select
    openai_embedding(:question, :_cookie_openai_api_key) as q
),
top_n as (
  select
    value
  from json_each(
    faiss_search(
      'simonwillisonblog',
      'blog_entry_embeddings',
      (select q from query),
      5
    )
  )
  where length(coalesce(:question, '')) > 0
),
texts as (
  select 'Created: ' || created || ', Title: ' || title || 
  ', Body: ' || openai_strip_tags(body) as text
  from blog_entry where id in (select value from top_n)
),
prompt as (
  select openai_build_prompt(text, 'Context:
------------
', '
------------
Given the above context, answer the following question: ' || :question,
  500
  ) as prompt from texts
)
select
  'Response' as title,
  openai_davinci(
    prompt,
    500,
    0.7,
    :_cookie_openai_api_key
  ) as value
  from prompt
  where length(coalesce(:question, '')) > 0
union all
select
  'Prompt' as title,
  prompt from prompt

As you can see, I really like using CTEs (the with name as (...) pattern) to assemble complex queries like this.

The texts as ... CTE is where I strip HTML tags from my content (using another custom function from the datasette-openai plugin called openai_strip_tags()) and assemble it along with the Created and Title metadata. Adding these gave the system a better chance of answering questions like “When did Natalie and Simon get married?” with the correct year.

The last part of this query uses a handy debugging trick: it returns two rows via a union all—the first has a Response label and shows the response from GPT-3, while the second has a Prompt label and shows the prompt that I passed to the model.

A Datasette form page. Question is When did Natalie and Simon get married?. Answer is Natalie and Simon got married on Saturday the 5th of June in 2010. The prompt is then displayed, which is a whole bunch of text from relevant blog entries.

Next steps

There are so many ways to improve this system.

  • Smarter prompt design. My prompt here is the first thing that I got to work—I’m certain there are all kinds of tricks that could be used to make this more effective.
  • Better selection of the content to include in the prompt. I’m using embedding search but then truncating to the first portion: a smarter implementation would attempt to crop out the most relevant parts of each entry, maybe by using embeddings against smaller chunks of text.
    • Yoz tipped me off to GPT Index, a project which aims to solve this exact problem by using a pre-trained LLM to help summarize text to better fit in a prompt used for these kinds of queries.
    • Spotted this idea from Hassan Hayat: “don’t embed the question when searching. Ask GPT-3 to generate a fake answer, embed this answer, and use this to search”. See also this paper about Hypothetical Document Embeddings, via Jay Hack.
  • Hold out for GPT-4: I’ve heard rumours that the next version of the model will have a significantly larger token limit, which should result in much better results from this mechanism.

Want my help implementing this?

I plan to use this pattern to add semantic search and semantic search answers to my Datasette Cloud SaaS platform. Please get in touch if this sounds like a feature that could be relevant to your organization.

Datasette 0.64, with a warning about SpatiaLite 29 days ago

I released Datasette 0.64 this morning. This release is mainly a response to the realization that it’s not safe to run Datasette with the SpatiaLite extension loaded if that Datasette instance is configured to enable arbitrary SQL queries from untrusted users.

Here are the release notes quoted in full:

The problem with SpatiaLite

Datasette allows arbitrary SQL execution as a core feature. It takes a bunch of steps to provide this safely: database connections are opened in read-only mode, it imposes a strict time limit on SQL queries and Datasette is designed to be run in containers for a further layer of protection.

SQLite itself is an excellent platform for this feature: it has a set of default functionality that supports this well, protected by a legendarily thorough test suite.

SpatiaLite is a long-running third-party extension for SQLite that adds a bewildering array of additional functionality—much of it around GIS, but with a whole host of extras as well. It includes debugging routines, XML parsers and even its own implementation of stored procedures!

Unfortunately, not all of this functionality is safe to expose to untrusted queries—even for databases that have been opened in read-only mode.

After identifying functions which could crash the Datasette instance, I decided that Datasette should make a strong recommendation not to expose SpatiaLite in an unprotected manner.

In addition to the new documentation, I also added a feature I’ve been planning for a while: a simple setting for disabling arbitrary SQL queries entirely:

datasette --setting default_allow_sql off

Prior to 0.64 you could achieve the same thing by adding the following to your metadata.json file:

{
    "allow_sql": false
}

Or in metadata.yml:

allow_sql: false

The new setting achieves the same thing, but is more obvious and can be easily applied even for Datasette instances that don’t use metadata.

A new SpatiaLite tutorial

The documentation now recommends running SpatiaLite instances with pre-approved SQL implemented using Datasette’s canned queries feature.

To help clarify how this works, I decided to publish a new entry in the official series of Datasette tutorials:

Building a location to time zone API with SpatiaLite

This is an updated version of a tutorial I first wrote back in 2017.

The new tutorial now includes material on Chris Amico’s datasette-geojson-map plugin, SpatiaLite point-in-polygon queries, polygon intersection queries, spatial indexes and how to use the simplify() function to reduce huge polygons down to a size that is more practical to display on a map.

I’m really happy with this new tutorial. Not only does it show a safe way to run SpatiaLite, but it also illustrates a powerful pattern for using Datasette to create and deploy custom APIs.

The resulting API can be accessed here:

https://timezones.datasette.io/timezones

It’s hosted on Fly, using their $1.94/month instance size with 256MB of RAM—easily powerful enough to host this class of application.

I also updated the datasette-publish-fly plugin to make it easier to deploy instances with SQL execution disabled, see the 1.3 release notes.

2022 in projects and blogging one month ago

In lieu of my regular weeknotes (I took two weeks off for the holidays) here’s a look back at 2022, mainly in terms of projects and things I’ve written about.

January

February

March

April

May

June

July

August

September

October

November

December

Elsewhere

Today

  • PocketPy. PocketPy is “a lightweight(~5000 LOC) Python interpreter for game engines”. It’s implemented as a single C++ header which provides an impressive subset of the Python language: functions, dictionaries, lists, strings and basic classes too. There’s also a browser demo that loads a 766.66 KB pypocket.wasm file (240.72 KB compressed) and uses it to power a basic terminal interface. I tried and failed to get that pypocket.wasm file working from wasmer/wasmtime/wasm3—it should make a really neat lightweight language to run in a WebAssembly sandbox. #8th February 2023, 5:13 am

Yesterday

  • Big Data is Dead (via) Don’t be distracted by the headline, this is very worth your time. Jordan Tigani spent ten years working on Google BigQuery, during which time he was surprised to learn that the median data storage size for regular customers was much less than 100GB. In this piece he argues that genuine Big Data solutions are relevant to a tiny fraction of companies, and there’s way more value in solving problems for everyone else. I’ve been talking about Datasette as a tool for solving “small data” problems for a while, and this article has given me a whole bunch of new arguments I can use to support that concept. #7th February 2023, 7:25 pm

6th February 2023

  • Making SQLite extensions pip install-able (via) Alex Garcia figured out how to bundle a compiled SQLite extension in a Python wheel (building different wheels for different platforms) and publish them to PyPI. This is a huge leap forward in terms of the usability of SQLite extensions, which have previously been pretty difficult to actually install and run. Alex also created Datasette plugins that depend on his packages, so you can now “datasette install datasette-sqlite-regex” (or datasette-sqlite-ulid, datasette-sqlite-fastrand, datasette-sqlite-jsonschema) to gain access to his custom SQLite extensions in your Datasette instance. It even works with “datasette publish --install” to deploy to Vercel, Fly.io and Cloud Run. #6th February 2023, 7:44 pm
  • The technology behind GitHub’s new code search (via) I’ve been a beta user of the new GitHub code search for a while and I absolutely love it: you really can run a regular expression search across the entirety of GitHub, which is absurdly useful for both finding code examples of under-documented APIs and for seeing how people are using open source code that you have released yourself. It turns out GitHub built their own search engine for this from scratch, called Blackbird. It’s implemented in Rust and makes clever use of sharded ngram indexes—not just trigrams, because it turns out those aren’t quite selective enough for a corpus that includes a lot of three letter keywords like “for”.

    I also really appreciated the insight into how they handle visibility permissions: they compile those into additional internal search clauses, resulting in things like “RepoIDs(...) or PublicRepo()” #6th February 2023, 6:38 pm

5th February 2023

  • When you start a creative project but don’t finish, the experience drags you down. Worst of all is when you never decisively abandon a project, instead allowing it to fade into forgetfulness. The fades add up; they become a gloomy haze that whispers, you’re not the kind of person who DOES things.

    When you start and finish, by contrast — and it can be a project of any scope: a 24-hour comic, a one-page short story, truly anything — it is powerful fuel that goes straight back into the tank. When a project is finished, it exits the realm of “this is gonna be great” and becomes instead something you (and perhaps others) can actually evaluate. Even if that evaluation is disastrous, it is also, I will insist, thrilling and productive. A project finished is the pump of a piston, preparing the engine for the next one.

    Robin Sloan # 5th February 2023, 5:59 pm

  • The 21st century is being delayed: We’re stuck with corporations building these incredible artifacts and then staring at them and realizing the questions they encode are too vast and unwieldy to be worth the risk of tackling. The future is here – and it’s locked up in a datacenter, experimented with by small groups of people who are aware of their own power and fear to exercise it. What strange times we are in.

    Jack Clark (on MusicLM) # 5th February 2023, 5:51 pm

4th February 2023

  • The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.

    Andrej Karpathy # 4th February 2023, 12:08 am

3rd February 2023

  • Just used prompt injection to read out the secret OpenAI API key of a very well known GPT-3 application.

    In essence, whenever parts of the returned response from GPT-3 is executed directly, e.g. using eval() in Python, malicious user can basically execute arbitrary code

    Ludwig Stumpp # 3rd February 2023, 1:52 am

  • I’m Now a Full-Time Professional Open Source Maintainer. Filippo Valsorda, previously a member of the Go team at Google, is now independent and making a full-time living as a maintainer of various open source projects relating to Go. He’s managing to pull in an amount “equivalent to my Google total compensation package”, which is a huge achievement: the greatest cost involved in independent open source is usually the opportunity cost of turning down a big tech salary. He’s doing this through a high touch retainer model, where six client companies pay him to keep working on his projects and also provide them with varying amounts of expert consulting. #3rd February 2023, 1:12 am

2nd February 2023

  • GROUNDHOG-DAY.com (via) “The leading Groundhog Day data source”. I love this so much: it’s a collection of predictions from all 59 groundhogs active in towns scattered across North America (I had no idea there were that many). The data is available via a JSON API too. #2nd February 2023, 10:05 pm
  • Carving the Scheduler Out of Our Orchestrator (via) Thomas Ptacek describes Fly’s new custom-built alternative to Nomad and Kubernetes in detail, including why they eventually needed to build something custom to best serve their platform. In doing so he provides the best explanation I’ve ever seen of what an orchestration system actually does. #2nd February 2023, 9:46 pm

1st February 2023

  • Python’s “Disappointing” Superpowers. Luke Plant provides a fascinating detailed list of Python libraries that use dynamic meta-programming tricks in interesting ways—including SQLAlchemy, Django, Werkzeug, pytest and more. #1st February 2023, 10:41 pm
  • pyfakefs usage (via) New to me pytest fixture library that provides a really easy way to mock Python’s filesystem functions—open(), os.listdir() and so on—so a test can run against a fake set of files. This looks incredibly useful. #1st February 2023, 10:37 pm

29th January 2023

  • datasette-scraper walkthrough on YouTube (via) datasette-scraper is Colin Dellow’s new plugin that turns Datasette into a powerful web scraping tool, with a web UI based on plugin-driven customizations to the Datasette interface. It’s really impressive, and this ten minute demo shows quite how much it is capable of: it can crawl sitemaps and fetch pages, caching them (using zstandard with optional custom dictionaries for extra compression) to speed up subsequent crawls... and you can add your own plugins to extract structured data from crawled pages and save it to a separate SQLite table! #29th January 2023, 5:23 am
  • Examples of sites built using Datasette (via) I gave the examples page on the Datasette website a significant upgrade today: it now includes screenshots (taken using shot-scraper) of six projects chosen to illustrate the variety of problems Datasette can be used to tackle. #29th January 2023, 3:40 am

28th January 2023

  • We’ve built many tools for publishing to the web—but I want to make the claim that we have underdeveloped the tools and platforms for publishing collections, indexes and small databases. It’s too hard to build these kinds of experiences, too hard to maintain them and a lack of collaborative tools.

    Tom Critchlow # 28th January 2023, 4:43 pm

  • Cyber (via) “Cyber is a new language for fast, efficient, and concurrent scripting.” Lots of interesting ideas in here, but the one that really caught my eye is that it’s designed to be easily embedded into other languages and “will allow the host to insert gas mileage checks in user scripts. This allows the host to control how long a script can run”—my dream feature for implementing a safe, sandboxed extension mechanism! Cyber is implemented using Zig and LLVM. #28th January 2023, 4:25 am
  • sqlite-jsonschema. “A SQLite extension for validating JSON objects with JSON Schema”, building on the jsonschema Rust crate. SQLite and JSON are already a great combination—Alex suggests using this extension to implement check constraints to validate JSON columns before inserting into a table, or just to run queries finding existing data that doesn’t match a given schema. #28th January 2023, 3:50 am
  • sqlite-ulid. Alex Garcia’s sqlite-ulid adds lightning-fast SQL functions for generating ULIDs—Universally Unique Lexicographically Sortable Identifiers. These work like UUIDs but are smaller and faster to generate, and can be canonically encoded as a URL-safe 26 character string (UUIDs are 36 characters). Again, this builds on a Rust crate—ulid-rs—and can generate 1 million byte-represented ULIDs with the ulid_bytes() function in just 88.4ms. #28th January 2023, 3:45 am
  • sqlite-fastrand. Alex Garcia just dropped three new SQLite extensions, and I’m going to link to all of them. The first is sqlite-fastrand, which adds new functions for generating random numbers (and alphanumeric characters too). Impressively, these out-perform the default SQLite random() and randomblob() functions by about 1.6-2.6x, thanks to being built on the Rust fastrand crate which builds on wyhash, an extremely fast (though not cryptographically secure) hashing function. #28th January 2023, 3:41 am