Execute Jina embeddings with a CLI using llm-embed-jina

26th October 2023

Berlin-based Jina AI just released a new family of embedding models, boasting that they are the “world’s first open-source 8K text embedding model” and that they rival OpenAI’s text-embedding-ada-002 in quality.

I wrote about embeddings extensively the other day—if you’re not familiar with what they are and what you can do with them I suggest reading that first.

This evening I built and released a new plugin for my LLM tool which adds support for Jina’s new embedding models.

Trying out llm-embed-jina

The plugin is called llm-embed-jina. Here’s the quickest way to get started with it:

First, install LLM if you haven’t already. You can use pipx:

pipx install llm

Or pip:

pip install llm

Unfortunately installing LLM using Homebrew doesn’t currently work with this plugin as PyTorch has not yet been released for Python 3.12—details in this issue.

Now you can install the llm-embed-jina plugin:

llm install llm-embed-jina

The llm install command ensures it gets installed in the correct virtual environment, no matter how you installed LLM itself.
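You can also run the llm plugins command to confirm the plugin was installed. It lists your installed plugins, and llm-embed-jina should be among them:

llm plugins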

Run this command to check that it added the models:

llm embed-models

You should see output like this:

ada-002 (aliases: ada, oai)
jina-embeddings-v2-small-en
jina-embeddings-v2-base-en
jina-embeddings-v2-large-en

The jina-embeddings-v2-large-en model isn’t available yet, but should work as soon as Jina release it. I expect it will show up at huggingface.co/jinaai/jina-embeddings-v2-large-en (currently a 404).

Now you can run one of the models. The -small-en model is a good starting point: it’s only a 65MB download. The -base-en model is 275MB.

The model will download the first time you try to use it. Run this:

llm embed -m jina-embeddings-v2-small-en -c 'Hello world'

This will return a JSON array of 512 floating point numbers—the embedding vector for the string “Hello world”.
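You can get the same vector from Python using LLM’s embeddings API. Here’s a quick sketch of that, using the model ID the plugin registers:

import llm

# Loads the model, downloading it on first use
model = llm.get_embedding_model("jina-embeddings-v2-small-en")

# embed() returns a list of floats - 512 of them for the small model
vector = model.embed("Hello world")
print(len(vector))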

Embeddings are much more interesting if you store them somewhere and then use them to run comparisons. The llm embed-multi command can do that.

Change directory to a folder that you know contains README.md files (anything with a node_modules folder will do) and run this:

llm embed-multi readmes \
    -m jina-embeddings-v2-small-en \
    --files . '**/README.md' \
    --database readmes.db

This will create a SQLite database called readmes.db, then search for every README.md file in the current directory and all subdirectories, embed the content of each one and store the results in that database.

Those embeddings will live in a collection called readmes.

If you leave off the --database readmes.db option the collections will be stored in a default SQLite database tucked away somewhere on your system.
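If you want to know where that default database lives, the llm collections path command should tell you, and llm collections list shows which collections a database contains:

llm collections path
llm collections list -d readmes.db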

Having done this, you can run semantic similarity searches against the new collection like this:

llm similar readmes -d readmes.db -c 'utility functions'

When I ran that in my hmb-map directory I got these:

{"id": "node_modules/@maplibre/maplibre-gl-style-spec/src/feature_filter/README.md", "score": 0.7802185991017785, "content": null, "metadata": null}
{"id": "node_modules/kind-of/README.md", "score": 0.7725600920927725, "content": null, "metadata": null}
{"id": "node_modules/which/README.md", "score": 0.7645426557095619, "content": null, "metadata": null}
{"id": "node_modules/@mapbox/point-geometry/README.md", "score": 0.7636548563018607, "content": null, "metadata": null}
{"id": "node_modules/esbuild/README.md", "score": 0.7633325127194481, "content": null, "metadata": null}
{"id": "node_modules/maplibre-gl/src/shaders/README.md", "score": 0.7614428292518743, "content": null, "metadata": null}
{"id": "node_modules/minimist/README.md", "score": 0.7581314986768929, "content": null, "metadata": null}
{"id": "node_modules/split-string/README.md", "score": 0.7563253351715924, "content": null, "metadata": null}
{"id": "node_modules/assign-symbols/README.md", "score": 0.7555915219064293, "content": null, "metadata": null}
{"id": "node_modules/maplibre-gl/build/README.md", "score": 0.754027372081506, "content": null, "metadata": null}

These are the top ten results by similarity to the string I entered.
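You can run the same search from Python too. This sketch uses LLM’s Collection class. Since a collection records which model it was created with, you shouldn’t need to specify the model again (see the embeddings docs for the full API):

import sqlite_utils
import llm

db = sqlite_utils.Database("readmes.db")

# Attach to the existing "readmes" collection in that database
collection = llm.Collection("readmes", db)

# similar() embeds the query and returns the closest stored items
for entry in collection.similar("utility functions", number=10):
    print(entry.score, entry.id)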

You can also pass in the ID of an item in the collection to see other similar items:

llm similar readmes -d readmes.db node_modules/esbuild/README.md | jq .id

I piped it through | jq .id to get back just the IDs. I got this:

"node_modules/@esbuild/darwin-arm64/README.md"
"node_modules/rollup/README.md"
"node_modules/assign-symbols/README.md"
"node_modules/split-string/node_modules/extend-shallow/README.md"
"node_modules/isobject/README.md"
"node_modules/maplibre-gl/build/README.md"
"node_modules/vite/README.md"
"node_modules/nanoid/README.md"
"node_modules/@mapbox/tiny-sdf/README.md"
"node_modules/split-string/node_modules/is-extendable/README.md"

See the LLM embeddings documentation for more details on things you can do with this tool.

How I built the plugin

I built the first version of this plugin in about 15 minutes. It took another hour to iron out a couple of bugs.

I started with this cookiecutter template, then pasted in the recipe from the LLM documentation on writing embedding model plugins, combined with some example code that Jina provided in their model release. Here’s their code:

from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))

The numpy and cos_sim bit isn’t needed for the plugin, so I ignored it.

The first working version of the plugin was a file called llm_embed_jina.py that looked like this:

import llm
from transformers import AutoModel


@llm.hookimpl
def register_embedding_models(register):
    for model_id in (
        "jina-embeddings-v2-small-en",
        "jina-embeddings-v2-base-en",
        "jina-embeddings-v2-large-en",
    ):
        register(JinaEmbeddingModel(model_id))


class JinaEmbeddingModel(llm.EmbeddingModel):
    def __init__(self, model_id):
        self.model_id = model_id
        self._model = None

    def embed_batch(self, texts):
        if self._model is None:
            self._model = AutoModel.from_pretrained(
                "jinaai/{}".format(self.model_id), trust_remote_code=True
            )
        results = self._model.encode(texts)
        return (list(map(float, result)) for result in results)

There’s really not a lot to it.

The register_embedding_models() function is a plugin hook that LLM calls to register all of the embedding models.

JinaEmbeddingModel is a subclass of llm.EmbeddingModel. It just needs to implement two things: a constructor and that embed_batch(self, texts) method.

AutoModel.from_pretrained() is provided by Hugging Face Transformers. It downloads and caches the model the first time you call it.

The model returns numpy arrays, but LLM wants a regular Python list of floats—that’s what that last return line is doing.
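To illustrate that conversion (just a sketch, not part of the plugin):

import numpy as np

# Each row returned by model.encode() is an array of numpy float32 values
row = np.array([0.25, 0.5], dtype=np.float32)

# list(map(float, row)) turns them into plain Python floats
print(list(map(float, row)))  # [0.25, 0.5]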

I found a couple of bugs with this. The model didn’t like having .encode(texts) called with a generator, so I needed to convert that into a list first. Then later I found that text longer than 8192 characters could cause the model to hang in some situations, so I added my own truncation.

The current version (0.1.2) of the plugin, with fixes for both of those issues, looks like this:

import llm
from transformers import AutoModel

MAX_LENGTH = 8192


@llm.hookimpl
def register_embedding_models(register):
    for model_id in (
        "jina-embeddings-v2-small-en",
        "jina-embeddings-v2-base-en",
        "jina-embeddings-v2-large-en",
    ):
        register(JinaEmbeddingModel(model_id))


class JinaEmbeddingModel(llm.EmbeddingModel):
    def __init__(self, model_id):
        self.model_id = model_id
        self._model = None

    def embed_batch(self, texts):
        if self._model is None:
            self._model = AutoModel.from_pretrained(
                "jinaai/{}".format(self.model_id), trust_remote_code=True
            )
        results = self._model.encode([text[:MAX_LENGTH] for text in texts])
        return (list(map(float, result)) for result in results)

I’m really pleased with how quickly this came together—I think it’s a strong signal that the LLM embeddings plugin design is working well.