Simon Willison’s Weblog

Subscribe

34 items tagged “embeddings”

2024

Introducing sqlite-lembed: A SQLite extension for generating text embeddings locally (via) Alex Garcia's latest SQLite extension is a C wrapper around the llama.cpp that exposes just its embedding support, allowing you to register a GGUF file containing an embedding model:

INSERT INTO temp.lembed_models(name, model)
  select 'all-MiniLM-L6-v2',
  lembed_model_from_file('all-MiniLM-L6-v2.e4ce9877.q8_0.gguf');

And then use it to calculate embeddings as part of a SQL query:

select lembed(
  'all-MiniLM-L6-v2',
  'The United States Postal Service is an independent agency...'
); -- X'A402...09C3' (1536 bytes)

all-MiniLM-L6-v2.e4ce9877.q8_0.gguf here is a 24MB file, so this should run quite happily even on machines without much available RAM.

What if you don't want to run the models locally at all? Alex has another new extension for that, described in Introducing sqlite-rembed: A SQLite extension for generating text embeddings from remote APIs. The rembed is for remote embeddings, and this extension uses Rust to call multiple remotely-hosted embeddings APIs, registered like this:

INSERT INTO temp.rembed_clients(name, options)
  VALUES ('text-embedding-3-small', 'openai');
select rembed(
  'text-embedding-3-small',
  'The United States Postal Service is an independent agency...'
); -- X'A452...01FC', Blob<6144 bytes>

Here's the Rust code that implements Rust wrapper functions for HTTP JSON APIs from OpenAI, Nomic, Cohere, Jina, Mixedbread and localhost servers provided by Ollama and Llamafile.

Both of these extensions are designed to complement Alex's sqlite-vec extension, which is nearing a first stable release.

# 25th July 2024, 8:30 pm / embeddings, rust, sqlite, c, alex-garcia

Searching an aerial photo with text queries. Robin Wilson built a demo that lets you search a large aerial photograph of Southampton for things like "roundabout" or "tennis court". He explains how it works in detail: he used the SkyCLIP model, which is trained on "5.2 million remote sensing image-text pairs in total, covering more than 29K distinct semantic tags" to generate embeddings for 200x200 image segments (with 100px of overlap), then stored them in Pinecone.

# 12th July 2024, 6:07 pm / gis, embeddings, clip

The Super Effectiveness of Pokémon Embeddings Using Only Raw JSON and Images. A deep dive into embeddings from Max Woolf, exploring 1,000 different Pokémon (loaded from PokéAPI using this epic GraphQL query) and then embedding the cleaned up JSON data using nomic-embed-text-v1.5 and the official Pokémon image representations using nomic-embed-vision-v1.5.

I hadn't seen nomic-embed-vision-v1.5 before: it brings multimodality to Nomic embeddings and operates in the same embedding space as nomic-embed-text-v1.5 which means you can use it to perform CLIP-style tricks comparing text and images. Here's their announcement from June 5th:

Together, Nomic Embed is the only unified embedding space that outperforms OpenAI CLIP and OpenAI Text Embedding 3 Small on multimodal and text tasks respectively.

Sadly the new vision weights are available under a non-commercial Creative Commons license (unlike the text weights which are Apache 2), so if you want to use the vision weights commercially you'll need to access them via Nomic's paid API.

# 30th June 2024, 9:22 pm / ai, embeddings, max-woolf, clip

Val Vibes: Semantic search in Val Town. A neat case-study by JP Posma on how Val Town's developers can use Val Town Vals to build prototypes of new features that later make it into Val Town core.

This one explores building out semantic search against Vals using OpenAI embeddings and the PostgreSQL pgvector extension.

# 21st June 2024, 2:16 am / openai, postgresql, ai, embeddings, val-town

Using DuckDB for Embeddings and Vector Search (via) Sören Brunk's comprehensive tutorial combining DuckDB 1.0, a subset of German Wikipedia from Hugging Face (loaded using Parquet), the BGE M3 embedding model and DuckDB's new vss extension for implementing an HNSW vector index.

# 15th June 2024, 2:39 pm / embeddings, ai, duckdb

My Twitter thread figuring out the AI features in Microsoft’s Recall. I posed this question on Twitter about why Microsoft Recall (previously) is being described as "AI":

Is it just that the OCR uses a machine learning model, or are there other AI components in the mix here?

I learned that Recall works by taking full desktop screenshots and then applying both OCR and some sort of CLIP-style embeddings model to their content. Both the OCRd text and the vector embeddings are stored in SQLite databases (schema here, thanks Daniel Feldman) which can then be used to search your past computer activity both by text but also by semantic vision terms - "blue dress" to find blue dresses in screenshots, for example. The si_diskann_graph table names hint at Microsoft's DiskANN vector indexing library

A Microsoft engineer confirmed on Hacker News that Recall uses on-disk vector databases to provide local semantic search for both text and images, and that they aren't using Microsoft's Phi-3 or Phi-3 Vision models. As far as I can tell there's no LLM used by the Recall system at all at the moment, just embeddings.

# 5th June 2024, 10:39 pm / twitter, ai, embeddings, microsoft, sqlite, recall

Exploring Hacker News by mapping and analyzing 40 million posts and comments for fun (via) A real tour de force of data engineering. Wilson Lin fetched 40 million posts and comments from the Hacker News API (using Node.js with a custom multi-process worker pool) and then ran them all through the BGE-M3 embedding model using RunPod, which let him fire up ~150 GPU instances to get the whole run done in a few hours, using a custom RocksDB and Rust queue he built to save on Amazon SQS costs.

Then he crawled 4 million linked pages, embedded that content using the faster and cheaper jina-embeddings-v2-small-en model, ran UMAP dimensionality reduction to render a 2D map and did a whole lot of follow-on work to identify topic areas and make the map look good.

That's not even half the project - Wilson built several interactive features on top of the resulting data, and experimented with custom rendering techniques on top of canvas to get everything to render quickly.

There's so much in here, and both the code and data (multiple GBs of arrow files) are available if you want to dig in and try some of this out for yourself.

In the Hacker News comments Wilson shares that the total cost of the project was a couple of hundred dollars.

One tiny detail I particularly enjoyed - unrelated to the embeddings - was this trick for testing which edge location is closest to a user using JavaScript:

const edge = await Promise.race(
  EDGES.map(async (edge) => {
    // Run a few times to avoid potential cold start biases.
    for (let i = 0; i < 3; i++) {
      await fetch(`https://${edge}.edge-hndr.wilsonl.in/healthz`);
    }
    return edge;
  }),
);

# 10th May 2024, 4:42 pm / hacker-news, embeddings

I’m writing a new vector search SQLite Extension. Alex Garcia is working on sqlite-vec, a spiritual successor to his sqlite-vss project. The new SQLite C extension will have zero other dependencies (sqlite-vss used some tricky C++ libraries) and will work using virtual tables, storing chunks of vectors in shadow tables to avoid needing to load everything into memory at once.

# 3rd May 2024, 3:16 am / embeddings, sqlite, vectors, c, alex-garcia

llm-nomic-api-embed. My new plugin for LLM which adds API access to the Nomic series of embedding models. Nomic models can be run locally too, which makes them a great long-term commitment as there’s no risk of the models being retired in a way that damages the value of your previously calculated embedding vectors.

# 31st March 2024, 3:17 pm / llm, plugins, projects, nomic, ai, embeddings

Cohere int8 & binary Embeddings—Scale Your Vector Database to Large Datasets (via) Jo Kristian Bergum told me “The accuracy retention [of binary embedding vectors] is sensitive to whether the model has been using this binarization as part of the loss function.”

Cohere provide an API for embeddings, and last week added support for returning binary vectors specifically tuned in this way.

250M embeddings (Cohere provide a downloadable dataset of 250M embedded documents from Wikipedia) at float32 (4 bytes) is 954GB.

Cohere claim that reducing to 1 bit per dimension knocks that down to 30 GB (954/32) while keeping “90-98% of the original search quality”.

# 26th March 2024, 6:19 am / embeddings, cohere

My binary vector search is better than your FP32 vectors. I’m still trying to get my head around this, but here’s what I understand so far.

Embedding vectors as calculated by models such as OpenAI text-embedding-3-small are arrays of floating point values, which look something like this:

[0.0051681744, 0.017187592, -0.018685209, -0.01855924, -0.04725188...]—1356 elements long

Different embedding models have different lengths, but they tend to be hundreds up to low thousands of numbers. If each float is 32 bits that’s 4 bytes per float, which can add up to a lot of memory if you have millions of embedding vectors to compare.

If you look at those numbers you’ll note that they are all pretty small positive or negative numbers, close to 0.

Binary vector search is a trick where you take that sequence of floating point numbers and turn it into a binary vector—just a list of 1s and 0s, where you store a 1 if the corresponding float was greater than 0 and a 0 otherwise.

For the above example, this would start [1, 1, 0, 0, 0...]

Incredibly, it looks like the cosine distance between these 0 and 1 vectors captures much of the semantic relevant meaning present in the distance between the much more accurate vectors. This means you can use 1/32nd of the space and still get useful results!

Ce Gao here suggests a further optimization: use the binary vectors for a fast brute-force lookup of the top 200 matches, then run a more expensive re-ranking against those filtered values using the full floating point vectors.

# 26th March 2024, 4:56 am / embeddings

Adaptive Retrieval with Matryoshka Embeddings (via) Nomic Embed v1 only came out two weeks ago, but the same team just released Nomic Embed v1.5 trained using a new technique called Matryoshka Representation.

This means that unlike v1 the v1.5 embeddings are resizable—instead of a fixed 768 dimension embedding vector you can trade size for quality and drop that size all the way down to 64, while still maintaining strong semantically relevant results.

Joshua Lochner build this interactive demo on top of Transformers.js which illustrates quite how well this works: it lets you embed a query, embed a series of potentially matching text sentences and then adjust the number of dimensions and see what impact it has on the results.

# 15th February 2024, 4:19 am / transformers-js, nomic, ai, embeddings, llms

Announcing DuckDB 0.10.0. Somewhat buried in this announcement: DuckDB has Fixed-Length Arrays now, along with array_cross_product(a1, a2), array_cosine_similarity(a1, a2) and array_inner_product(a1, a2) functions.

This means you can now use DuckDB to find related content (and other tricks) using vector embeddings!

Also notable: “DuckDB can now attach MySQL, Postgres, and SQLite databases in addition to databases stored in its own format. This allows data to be read into DuckDB and moved between these systems in a convenient manner, as attached databases are fully functional, appear just as regular tables, and can be updated in a safe, transactional manner.”

# 13th February 2024, 5:57 pm / embeddings, sql, duckdb, databases, mysql, postgresql, sqlite

llm-sentence-transformers 0.2. I added a new --trust-remote-code option when registering an embedding model, which means LLM can now run embeddings through the new Nomic AI nomic-embed-text-v1 model.

# 4th February 2024, 7:39 pm / llm, embeddings, plugins, projects, ai, transformers, nomic

Introducing Nomic Embed: A Truly Open Embedding Model. A new text embedding model from Nomic AI which supports 8192 length sequences, claims better scores than many other models (including OpenAI’s new text-embedding-3-small) and is available as both a hosted API and a run-yourself model. The model is Apache 2 licensed and Nomic have released the full set of training data and code.

From the accompanying paper: “Full training of nomic-embed-text-v1 can be conducted in a single week on one 8xH100 node.”

# 3rd February 2024, 11:13 pm / ai, embeddings, nomic

ChunkViz (via) Handy tool by Greg Kamradt to help understand how different text chunking mechanisms work by visualizing them. Chunking is an important part of preparing text to be embedded for semantic search, and thanks to this tool I’ve finally got a solid mental model of what recursive character text splitting does.

# 2nd February 2024, 2:23 am / ai, embeddings

ColBERT query-passage scoring interpretability (via) Neat interactive visualization tool for understanding what the ColBERT embedding model does—this works by loading around 50MB of model files directly into your browser and running them with WebAssembly.

# 28th January 2024, 4:49 pm / webassembly, ai, embeddings, interpretability

Text Embeddings Reveal (Almost) As Much As Text. Embeddings of text—where a text string is converted into a fixed-number length array of floating point numbers—are demonstrably reversible: “a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly”.

This means that if you’re using a vector database for embeddings of private data you need to treat those embedding vectors with the same level of protection as the original text.

# 8th January 2024, 5:22 am / ai, privacy, security, embeddings

2023

Fleet Context. This project took the source code and documentation for 1221 popular Python libraries and ran them through the OpenAI text-embedding-ada-002 embedding model, then made those pre-calculated embedding vectors available as Parquet files for download from S3 or via a custom Python CLI tool.

I haven’t seen many projects release pre-calculated embeddings like this, it’s an interesting initiative.

# 15th November 2023, 10:20 pm / embeddings, ai, python, llms

Execute Jina embeddings with a CLI using llm-embed-jina

Berlin-based Jina AI just released a new family of embedding models, boasting that they are the “world’s first open-source 8K text embedding model” and that they rival OpenAI’s text-embedding-ada-002 in quality.

[... 1,392 words]

Embeddings: What they are and why they matter

Visit Embeddings: What they are and why they matter

Embeddings are a really neat trick that often come wrapped in a pile of intimidating jargon.

[... 5,835 words]

Bottleneck T5 Text Autoencoder (via) Colab notebook by Linus Lee demonstrating his Contra Bottleneck T5 embedding model, which can take up to 512 tokens of text, convert that into a 1024 floating point number embedding vector... and then then reconstruct the original text (or a close imitation) from the embedding again.

This allows for some fascinating tricks, where you can do things like generate embeddings for two completely different sentences and then reconstruct a new sentence that combines the weights from both.

# 10th October 2023, 2:12 am / llms, ai, embeddings, generative-ai, jupyter, python

Finding Bathroom Faucets with Embeddings. Absolutely the coolest thing I’ve seen someone build on top of my LLM tool so far: Drew Breunig is renovating a bathroom and needed a way to filter through literally thousands of options for facet taps. He scraped 20,000 images of fixtures from a plumbing supply site and used LLM to embed every one of them via CLIP... and now he can ask for “faucets that look like this one”, or even run searches for faucets that match “Gawdy” or “Bond Villain” or “Nintendo 64”. Live demo included!

# 27th September 2023, 6:18 pm / llm, embeddings, generative-ai, ai, drew-breunig, clip

Weeknotes: Embeddings, more embeddings and Datasette Cloud

Since my last weeknotes, a flurry of activity. LLM has embeddings support now, and Datasette Cloud has driven some major improvements to the wider Datasette ecosystem.

[... 2,427 words]

Build an image search engine with llm-clip, chat with models with llm chat

Visit Build an image search engine with llm-clip, chat with models with llm chat

LLM is my combination CLI tool and Python library for working with Large Language Models. I just released LLM 0.10 with two significant new features: embedding support for binary files and the llm chat command.

[... 1,188 words]

Symbex 1.4. New release of my Symbex tool for finding symbols (functions, methods and classes) in a Python codebase. Symbex can now output matching symbols in JSON, CSV or TSV in addition to plain text.

I designed this feature for compatibility with the new “llm embed-multi” command—so you can now use Symbex to find every Python function in a nested directory and then pipe them to LLM to calculate embeddings for every one of them.

I tried it on my projects directory and embedded over 13,000 functions in just a few minutes! Next step is to figure out what kind of interesting things I can do with all of those embeddings.

# 5th September 2023, 5:29 pm / symbex, llm, generative-ai, projects, ai, embeddings

LLM now provides tools for working with embeddings

Visit LLM now provides tools for working with embeddings

LLM is my Python library and command-line tool for working with language models. I just released LLM 0.9 with a new set of features that extend LLM to provide tools for working with embeddings.

[... 3,466 words]

Getting creative with embeddings (via) Amelia Wattenberger describes a neat application of embeddings I haven’t seen before: she wanted to build a system that could classify individual sentences in terms of how “concrete” or “abstract” they are. So she generated several example sentences for each of those categories, embedded then and calculated the average of those embeddings.

And now she can get a score for how abstract vs concrete a new sentence is by calculating its embedding and seeing where it falls in the 1500 dimension space between those two other points.

# 10th August 2023, 7:05 pm / llms, ai, embeddings, generative-ai, amelia-wattenberger

Vector Search. Amjith Ramanujam provides a very thorough tutorial on implementing vector similarity search using SentenceTransformers embeddings (all-MiniLM-L6-v2) executed using sqlite-utils, then served via datasette-sqlite-vss and deployed using Fly.

# 2nd June 2023, 5:02 am / fly, sqlite, datasette, ai, embeddings

ImageBind. New model release from Facebook/Meta AI research: “An approach to learn a joint embedding across six different modalities—images, text, audio, depth, thermal, and IMU (inertial measurement units) data”. The non-interactive demo shows searching audio starting with an image, searching images starting with audio, using text to retrieve images and audio, using image and audio to retrieve images (e.g. a barking sound and a photo of a beach to get dogs on a beach) and using audio as input to an image generator.

# 9th May 2023, 7:04 pm / facebook, ai, generative-ai, embeddings