
LLM now provides tools for working with embeddings

4th September 2023

LLM is my Python library and command-line tool for working with language models. I just released LLM 0.9 with a new set of features that extend LLM to provide tools for working with embeddings.

This is a long post with a lot of theory and background. If you already know what embeddings are, here’s a TLDR you can try out straight away:

# Install LLM
pip install llm

# If you already installed via Homebrew/pipx you can upgrade like this:
llm install -U llm

# Install the llm-sentence-transformers plugin
llm install llm-sentence-transformers

# Install the all-MiniLM-L6-v2 embedding model
llm sentence-transformers register all-MiniLM-L6-v2

# Generate and store embeddings for every README.md in your home directory, recursively
llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md'
  # Add --store to store the text content as well

# Run a similarity search for "sqlite" against those embeddings
llm similar readmes -c sqlite

For everyone else, read on and the example above should hopefully make sense by the end.

Embeddings

Embeddings are a fascinating concept within the larger world of language models.

I explain embeddings in detail (with both a video and heavily annotated slides) in Embeddings: What they are and why they matter.

An embedding model lets you take a string of text—a word, sentence, paragraph or even a whole document—and turn that into an array of floating point numbers called an embedding vector.

[Image: on the left, a text post from one of my sites, Storing and serving related documents with openai-to-sqlite and embeddings. An arrow points to a huge JSON array on the right, labeled “1,536 floating point numbers”.]

A model will always produce the same length of array—1,536 numbers for the OpenAI embedding model, 384 for all-MiniLM-L6-v2—but the array itself is inscrutable. What are you meant to do with it?

The answer is that you can compare them. I like to think of an embedding vector as a location in 1,536-dimensional space. The distance between two vectors is a measure of how similar the two pieces of text are in meaning, at least according to the model that produced them.

[Image: a 3D plot of 400 red dots scattered across three axes, captioned “A location in 1,536-dimensional space”.]

“One happy dog” and “A playful hound” will end up close together, even though they don’t share any keywords. The embedding vector represents the language model’s interpretation of the meaning of the text.
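
Here’s a minimal sketch of that comparison in Python, using the llm library’s embedding API (introduced properly later in this post) plus a hand-rolled cosine similarity function. The example phrases are just illustrative:

import math
import llm

# Any installed embedding model will do; this assumes the
# llm-sentence-transformers plugin from the TLDR above
model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    return dot / (magnitude_a * magnitude_b)

dog = model.embed("One happy dog")
hound = model.embed("A playful hound")
unrelated = model.embed("Quarterly financial results")

print(cosine_similarity(dog, hound))      # close to 1: similar meaning
print(cosine_similarity(dog, unrelated))  # noticeably lower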

Things you can do with embeddings include:

  1. Find related items. I use this on my TIL site to display related articles, as described in Storing and serving related documents with openai-to-sqlite and embeddings.
  2. Build semantic search. As shown above, an embeddings-based search engine can find content relevant to the user’s search term even if none of the keywords match.
  3. Implement retrieval augmented generation—the trick where you take a user’s question, find relevant documentation in your own corpus and use that to get an LLM to spit out an answer. More on that here.
  4. Clustering: you can find clusters of nearby items and identify patterns in a corpus of documents.
  5. Classification: calculate the embedding of a piece of text and compare it to pre-calculated “average” embeddings for different categories (see the sketch after this list).
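
Here’s a rough sketch of that classification idea: average together a few example embeddings per category, then pick the category whose average is closest to the new text. The categories and example texts here are made up:

import math
import llm

model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    )

def average(vectors):
    # Element-wise mean of a list of equal-length vectors
    return [sum(values) / len(values) for values in zip(*vectors)]

# Made-up categories, each defined by the average embedding of a few examples
categories = {
    "sport": average([model.embed(t) for t in [
        "football match report", "tennis final", "100m sprint result"
    ]]),
    "cooking": average([model.embed(t) for t in [
        "sourdough starter tips", "how to roast vegetables", "pasta sauce recipe"
    ]]),
}

def classify(text):
    vector = model.embed(text)
    return max(categories, key=lambda name: cosine_similarity(vector, categories[name]))

print(classify("A recap of last night's baseball game"))  # hopefully "sport"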

LLM’s new embedding features

My goal with LLM is to provide a plugin-driven abstraction around a growing collection of language models. I want to make installing, using and comparing these models as easy as possible.

The new release adds several command-line tools for working with embeddings, plus a new Python API for working with embeddings in your own code.

It also adds support for installing additional embedding models via plugins. I’ve released one plugin for this so far: llm-sentence-transformers, which adds support for new models based on the sentence-transformers library.

The example above shows how to use sentence-transformers. LLM also supports API-driven access to the OpenAI ada-002 model.

Here’s how to embed some text using ada-002, assuming you have installed LLM already:

# Set your OpenAI API key
llm keys set openai
# <paste key here>

# Embed some text
llm embed -m ada-002 -c "Hello world"

This will output a huge JSON list of floating point numbers to your terminal. You can add -f base64 (or -f hex) to get that back in a different format, though none of these outputs are instantly useful.

Embeddings are much more interesting when you store them.

LLM already uses SQLite to store prompts and responses. It was a natural fit to use SQLite to store embeddings as well.

Embedding collections

LLM 0.9 introduces the concept of a collection of embeddings. A collection has a name—like readmes—and contains a set of embeddings, each of which has an ID and an embedding vector.

All of the embeddings in a collection are generated by the same model, to ensure they can be compared with each other.

The llm embed command can store the vector in the database instead of returning it to the console. Pass it the name of an existing (or to-be-created) collection and the ID to use to store the embedding.

Here we’ll store the embedding for the phrase “Hello world” in a collection called phrases with the ID hello, using that ada-002 embedding model:

llm embed phrases hello -m ada-002 -c "Hello world"

Future phrases can be added without needing to specify the model again, since it is remembered by the collection:

llm embed phrases goodbye -c "Goodbye world"

The llm embed-db collections command shows a list of collections:

phrases: ada-002
  2 embeddings
readmes: sentence-transformers/all-MiniLM-L6-v2
  16796 embeddings

The data is stored in a SQLite embeddings table with the following schema:

CREATE TABLE [collections] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT,
   [model] TEXT
);
CREATE TABLE "embeddings" (
   [collection_id] INTEGER REFERENCES [collections]([id]),
   [id] TEXT,
   [embedding] BLOB,
   [content] TEXT,
   [content_hash] BLOB,
   [metadata] TEXT,
   [updated] INTEGER,
   PRIMARY KEY ([collection_id], [id])
);

CREATE UNIQUE INDEX [idx_collections_name]
    ON [collections] ([name]);
CREATE INDEX [idx_embeddings_content_hash]
    ON [embeddings] ([content_hash]);

By default this is the SQLite database at the location revealed by llm embed-db path, but you can pass --database my-embeddings.db to various LLM commands to use a different database.

Each embedding vector is stored as a binary BLOB in the embedding column, consisting of those floating point numbers packed together as 32-bit floats.

The content_hash column contains an MD5 hash of the content. This helps avoid re-calculating the embedding (which can cost actual money for API-based embedding models like ada-002) unless the content has changed.
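
To make that concrete, here’s a sketch of roughly what those two columns hold. The exact byte layout is an assumption on my part; in real code you’d use the llm.encode() and llm.decode() helpers covered later in this post rather than packing vectors yourself:

import hashlib
import struct

vector = [0.013, -0.452, 0.229]  # really 384 or 1,536 floats

# embedding column: the floats packed together as 32-bit floats
# (assuming a little-endian layout here)
blob = struct.pack("<" + "f" * len(vector), *vector)

# content_hash column: MD5 of the content, used to skip re-embedding
# text that hasn't changed since last time
content_hash = hashlib.md5("Hello world".encode("utf-8")).digest()

print(len(blob), content_hash.hex())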

The content column is usually null, but can contain a copy of the original text content if you pass the --store option to the llm embed command.

The metadata column can contain a JSON object, if you pass --metadata '{"json": "goes here"}'.

You don’t have to pass content using -c—you can instead pass a file path using the -i/--input option:

llm embed docs llm-setup -m ada-002 -i llm/docs/setup.md

Or pipe things to standard input like this:

cat llm/docs/setup.md | llm embed docs llm-setup -m ada-002 -i -

Embedding similarity search

Once you’ve built a collection, you can search for similar embeddings using the llm similar command.

The -c "term" option will embed the text you pass in using the embedding model for the collection and use that as the comparison vector:

llm similar readmes -c sqlite

You can also pass the ID of an object in that collection to use that embedding instead. This gets you related documents, for example:

llm similar readmes sqlite-utils/README.md

The output from this command is currently newline-delimited JSON.
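
Newline-delimited JSON is easy to consume from a script. Here’s a sketch that shells out to the CLI and parses each line; the fields of each result object aren’t listed here, so inspect them yourself:

import json
import subprocess

# Run the similarity search and capture its newline-delimited JSON output
result = subprocess.run(
    ["llm", "similar", "readmes", "-c", "sqlite"],
    capture_output=True,
    text=True,
    check=True,
)

for line in result.stdout.splitlines():
    if line.strip():
        print(json.loads(line))  # each non-empty line is one JSON object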

Embedding in bulk

The llm embed command embeds a single string at a time. llm embed-multi is much more powerful: you can feed it a CSV or JSON file, point it at a SQLite database, or have it read from a directory of files in order to embed multiple items at once.

Many embedding models are optimized for batch operations, so embedding multiple items at a time can provide a significant speed boost.

The embed-multi command is described in detail in the documentation. Here are a couple of fun things you can do with it.

First, I’m going to create embeddings for every single one of my Apple Notes.

My apple-notes-to-sqlite tool can export Apple Notes to a SQLite database. I’ll run that first:

apple-notes-to-sqlite notes.db

This took quite a while to run on my machine and generated an 828MB SQLite database containing 6,462 records!

Next, I’m going to embed the content of all of those notes using the sentence-transformers/all-MiniLM-L6-v2 model:

llm embed-multi notes \
  -d notes.db \
  --sql 'select id, title, body from notes' \
  -m sentence-transformers/all-MiniLM-L6-v2

This took around 15 minutes to run, and increased the size of my database by 13MB.

The --sql option here specifies a SQL query. The first column must be an id, then any subsequent columns will be concatenated together to form the content to embed.

In this case the embeddings are written back to the same notes.db database that the content came from.
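
For reference, the Python equivalent looks something like this sketch, using the Collection API described later in this post. The way I concatenate the columns here is an approximation of what the CLI does:

import sqlite_utils
import llm

db = sqlite_utils.Database("notes.db")
collection = llm.Collection(
    "notes", db, model_id="sentence-transformers/all-MiniLM-L6-v2"
)

# First column is the ID; the remaining columns are joined to form the content
collection.embed_multi([
    (str(row["id"]), " ".join(str(row[key] or "") for key in ("title", "body")))
    for row in db.query("select id, title, body from notes")
])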

And now I can run embedding similarity operations against all of my Apple notes!

llm similar notes -d notes.db -c 'ideas for blog posts'

Embedding files in a directory

Let’s revisit the example from the top of this post. In this case, I’m using the --files option to search for files on disk and embed each of them:

llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md'

The --files option takes two arguments: a path to a directory and a pattern to match against filenames. In this case I’m searching my home directory recursively for any files named README.md.

Running this command gives me embeddings for all of my README.md files, which I can then search against like this:

llm similar readmes -c sqlite
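
Conceptually this is similar to globbing for the files yourself and handing them to embed_multi, something like the following sketch. The database filename, ID scheme and error handling are my own choices here, not necessarily what the CLI does:

from pathlib import Path

import llm
import sqlite_utils

db = sqlite_utils.Database("readmes.db")
collection = llm.Collection(
    "readmes", db, model_id="sentence-transformers/all-MiniLM-L6-v2"
)

home = Path.home()
items = []
for path in home.glob("**/README.md"):
    try:
        # Use the path relative to the search root as the ID
        items.append((str(path.relative_to(home)), path.read_text()))
    except (UnicodeDecodeError, OSError):
        continue  # skip unreadable files

collection.embed_multi(items, store=True)

The Collection API used in that sketch is covered in the next section.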

Embeddings in Python

So far I’ve only covered the command-line tools. LLM 0.9 also introduces a new Python API for working with embeddings.

There are two aspects to this. If you just want to embed content and handle the resulting vectors yourself, you can use llm.get_embedding_model():

import llm

# This takes model IDs and aliases defined by plugins:
model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
vector = model.embed("This is text to embed")

vector will then be a Python list of floating point numbers.

You can serialize that to the same binary format that LLM uses like this:

binary_vector = llm.encode(vector)
# And to deserialize:
vector = llm.decode(binary_vector)

The second aspect of the Python API is the llm.Collection class, for working with collections of embeddings. This example code is quoted from the documentation:

import sqlite_utils
import llm

# This collection will use an in-memory database that will be
# discarded when the Python process exits
collection = llm.Collection("entries", model_id="ada-002")

# Or you can persist the database to disk like this:
db = sqlite_utils.Database("my-embeddings.db")
collection = llm.Collection("entries", db, model_id="ada-002")

# You can pass a model directly using model= instead of model_id=
embedding_model = llm.get_embedding_model("ada-002")
collection = llm.Collection("entries", db, model=embedding_model)

# Store a string in the collection with an ID:
collection.embed("hound", "my happy hound")

# Or to store content and extra metadata:
collection.embed(
    "hound",
    "my happy hound",
    metadata={"name": "Hound"},
    store=True
)

# Or embed things in bulk:
collection.embed_multi(
    [
        ("hound", "my happy hound"),
        ("cat", "my dissatisfied cat"),
    ],
    # Add this to store the strings in the content column:
    store=True,
)

As with everything else in LLM, the goal is that anything you can do with the CLI can be done with the Python API, and vice-versa.
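
That includes similarity search: the collection.similar() method (mentioned again in the indexing section below) is the Python counterpart to llm similar. A rough sketch, where the number= argument and the attributes on the returned entries are assumptions on my part:

# Continuing from the "entries" collection created above
results = collection.similar("happy dogs", number=3)
for entry in results:
    # entry.id and entry.score are assumed attribute names; check the docs
    print(entry.id, entry.score)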

Clustering with llm-cluster

Another interesting application of embeddings is that you can use them to cluster content—identifying patterns in a corpus of documents.

I’ve started exploring this area with a new plugin, called llm-cluster.

You can install it like this:

llm install llm-cluster

Let’s create a new collection using data pulled from GitHub. I’m going to import all of the LLM issues from the GitHub API, using my paginate-json tool:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --store

Running this gives me an issues.db SQLite database with 218 embeddings contained in a collection called llm-issues.

Now let’s try out the llm cluster command added by that plugin, requesting ten clusters from that collection:

llm cluster llm-issues --database issues.db 10

The output from this command, truncated, looks like this:

[
  {
    "id": "0",
    "items": [
      {
        "id": "1784149135",
        "content": "Tests fail with pydantic 2"
      },
      {
        "id": "1837084995",
        "content": "Allow for use of Pydantic v1 as well as v2."
      },
      {
        "id": "1857942721",
        "content": "Get tests passing against Pydantic 1"
      }
    ]
  },
  {
    "id": "1",
    "items": [
      {
        "id": "1724577618",
        "content": "Better ways of storing and accessing API keys"
      },
      {
        "id": "1772024726",
        "content": "Support for `-o key value` options such as `temperature`"
      },
      {
        "id": "1784111239",
        "content": "`--key` should be used in place of the environment variable"
      }
    ]
  },
  {
    "id": "8",
    "items": [
      {
        "id": "1835739724",
        "content": "Bump the python-packages group with 1 update"
      },
      {
        "id": "1848143453",
        "content": "Python library support for adding aliases"
      },
      {
        "id": "1857268563",
        "content": "Bump the python-packages group with 1 update"
      }
    ]
  }
]

These look pretty good! But wouldn’t it be neat if we had a snappy title for each one?

The --summary option can provide exactly that, by piping the members of each cluster through a call to another LLM in order to generate a useful summary.

llm cluster llm-issues --database issues.db 10 --summary

This uses gpt-3.5-turbo to generate a summary for each cluster, with this default prompt:

Short, concise title for this cluster of related documents.

The results I got back are pretty good, including:

  • Template Storage and Management Improvements
  • Package and Dependency Updates and Improvements
  • Adding Conversation Mechanism and Tools

I tried the same thing using a Llama 2 model running on my own laptop, with a custom prompt:

llm cluster llm-issues --database issues.db 10 \
  --summary --model mlc-chat-Llama-2-13b-chat-hf-q4f16_1 \
  --prompt 'Concise title for this cluster of related documents, just return the title'

I didn’t quite get what I wanted! Llama 2 is proving a lot harder to prompt, so each cluster came back with something that looked like this:

Sure! Here’s a concise title for this cluster of related documents:

“Design Improvements for the Neat Prompt System”

This title captures the main theme of the documents, which is to improve the design of the Neat prompt system. It also highlights the focus on improving the system’s functionality and usability

llm-cluster only took a few hours to throw together, which I’m seeing as a positive indicator that the LLM library is developing in the right direction.

Future plans

The two future features I’m most excited about are indexing and chunking.

Indexing

The llm similar command and collection.similar() Python method currently use effectively the slowest brute force approach possible: calculate the cosine distance between the input vector and every other embedding in the collection, then sort the results.

This works fine for collections with a few hundred items, but will start to suffer for collections of 100,000 or more.
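
To make the brute force approach concrete, here’s roughly what it looks like if you do it by hand against the embeddings table, using llm.decode() to unpack the stored BLOBs. This is a sketch, not how LLM implements it internally, and it assumes your embeddings live in an embeddings.db file rather than the default database:

import math

import llm
import sqlite_utils

db = sqlite_utils.Database("embeddings.db")  # or the path from: llm embed-db path
model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
query = model.embed("sqlite")

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    )

# Score every stored embedding, then sort: O(n) comparisons per query
scored = [
    (row["id"], cosine_similarity(query, llm.decode(row["embedding"])))
    for row in db.query(
        "select embeddings.id, embedding from embeddings "
        "join collections on collections.id = embeddings.collection_id "
        "where collections.name = ?",
        ["readmes"],
    )
]
scored.sort(key=lambda pair: pair[1], reverse=True)
print(scored[:5])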

There are plenty of potential ways of speeding this up: you can run a vector index like FAISS or hnswlib, use a database extension like sqlite-vss or pgvector, or turn to a hosted vector database like Pinecone or Milvus.

With this many potential solutions, the obvious answer for LLM is to address this with plugins.

I’m still thinking through the details, but the core idea is that users should be able to define an index against one or more collections, and LLM will then coordinate updates to that index. These may not happen in real-time—some indexes can be expensive to rebuild, so there are benefits to applying updates in batches.

I experimented with FAISS earlier this year in datasette-faiss. That’s likely to be the base for my first implementation.

The embeddings table has an updated timestamp column to support this use-case—so indexers can run against just the items that have changed since the last indexing run.
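
For example, an indexing plugin could record when it last ran and only fetch the rows that changed since then. A minimal sketch, assuming updated holds a Unix timestamp:

import sqlite_utils

db = sqlite_utils.Database("embeddings.db")
last_indexed = 1693526400  # placeholder: Unix timestamp of the previous run

# Only these embeddings need to be (re-)added to the index
changed = list(db.query(
    "select collection_id, id, embedding from embeddings where updated > ?",
    [last_indexed],
))
print(len(changed), "embeddings to re-index")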

Follow issue #216 for updates on this feature.

Chunking

When building an embeddings-based search engine, the hardest challenge is deciding how best to “chunk” the documents.

Users will type in short phrases or questions. The embedding for a four word question might not necessarily map closely to the embedding of a thousand word article, even if the article itself should be a good match for that query.

To maximize the chance of returning the most relevant content, we need to be smarter about what we embed.

I’m still trying to get a good feeling for the strategies that make sense here. Some that I’ve seen include:

  • Split a document up into fixed-length shorter segments.
  • Split into segments, but include a ~10% overlap with the previous and next segments, to reduce problems caused by words and sentences being split in a way that disrupts their semantic meaning.
  • Split by sentence, using NLP techniques.
  • Split into higher-level sections, based on things like document headings.

Then there are more exciting, LLM-driven approaches:

  • Generate an LLM summary of a document and embed that.
  • Ask an LLM “What questions are answered by the following text?” and then embed each of the resulting questions!

It’s possible to try out these different techniques using LLM already: write code that does the splitting, then feed the results to Collection.embed_multi() or llm embed-multi.
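
Here’s a sketch of the fixed-length-with-overlap strategy from the list above, splitting on whitespace and feeding the chunks to Collection.embed_multi(). The chunk sizes and ID scheme are arbitrary choices, and article.txt is a stand-in for whatever document you are splitting:

import llm
import sqlite_utils

def chunk(text, size=200, overlap=20):
    # Fixed-length word windows with a ~10% overlap between neighbours
    words = text.split()
    step = size - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield " ".join(words[start:start + size])

db = sqlite_utils.Database("chunks.db")
collection = llm.Collection(
    "article-chunks", db, model_id="sentence-transformers/all-MiniLM-L6-v2"
)

document = open("article.txt").read()
collection.embed_multi(
    [(f"article.txt:{i}", text) for i, text in enumerate(chunk(document))],
    store=True,
)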

But... it would be really cool if LLM could split documents for you—with the splitting techniques themselves defined by plugins, to make it easy to try out new approaches.

Get involved

It should be clear by now that the potential scope of the LLM project is enormous. I’m trying to use plugins to tie together a vast and rapidly growing ecosystem of models and techniques into something that’s as easy for people to work with and build on as possible.

There are plenty of ways you can help!

  • Join the #llm Discord to talk about the project.
  • Try out plugins and run different models with them. There are 12 plugins already, and several of those can be used to run dozens if not hundreds of models (llm-mlc, llm-gpt4all and llm-llama-cpp in particular). I’ve hardly scratched the surface of these myself, and I’m testing exclusively on Apple Silicon. I’m really keen to learn more about which models work well, which models don’t and which perform the best on different hardware.
  • Try building a plugin for a new model. My dream here is that every significant Large Language Model will have an LLM plugin that makes it easy to install and use.
  • Build stuff using LLM and let me know what you’ve built. Nothing fuels an open source project more than stories of cool things people have built with it.