Simon Willison’s Weblog

7 items tagged “clip”

CLIP is an embedding model by OpenAI that supports comparisons between images and text.

2024

Searching an aerial photo with text queries. Robin Wilson built a demo that lets you search a large aerial photograph of Southampton for things like "roundabout" or "tennis court". He explains how it works in detail: he used the SkyCLIP model, which is trained on "5.2 million remote sensing image-text pairs in total, covering more than 29K distinct semantic tags" to generate embeddings for 200x200 image segments (with 100px of overlap), then stored them in Pinecone.
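
Here's a minimal sketch of that pipeline, with a generic OpenCLIP checkpoint standing in for SkyCLIP and a brute-force in-memory search standing in for Pinecone:

```python
# Sketch: embed overlapping tiles of a large image with a CLIP model,
# then rank tiles against a text query. A generic OpenCLIP checkpoint
# stands in for SkyCLIP; in-memory cosine search stands in for Pinecone.
import numpy as np
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def tile_embeddings(path, size=200, stride=100):
    """Embed size x size tiles taken every `stride` pixels (batching omitted for brevity)."""
    image = Image.open(path).convert("RGB")
    tiles, boxes = [], []
    for y in range(0, image.height - size + 1, stride):
        for x in range(0, image.width - size + 1, stride):
            box = (x, y, x + size, y + size)
            tiles.append(preprocess(image.crop(box)))
            boxes.append(box)
    with torch.no_grad():
        vecs = model.encode_image(torch.stack(tiles))
        vecs = vecs / vecs.norm(dim=-1, keepdim=True)
    return boxes, vecs.numpy()

def search(query, boxes, vecs, top_k=5):
    with torch.no_grad():
        q = model.encode_text(tokenizer([query]))
        q = (q / q.norm(dim=-1, keepdim=True)).numpy()[0]
    scores = vecs @ q  # cosine similarity, both sides normalised
    return [(boxes[i], float(scores[i])) for i in np.argsort(-scores)[:top_k]]

boxes, vecs = tile_embeddings("southampton.tif")  # hypothetical file name
print(search("tennis court", boxes, vecs))
```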

# 12th July 2024, 6:07 pm / gis, embeddings, clip

The Super Effectiveness of Pokémon Embeddings Using Only Raw JSON and Images. A deep dive into embeddings from Max Woolf, exploring 1,000 different Pokémon (loaded from PokéAPI using this epic GraphQL query) and then embedding the cleaned up JSON data using nomic-embed-text-v1.5 and the official Pokémon image representations using nomic-embed-vision-v1.5.

I hadn't seen nomic-embed-vision-v1.5 before: it brings multimodality to Nomic embeddings and operates in the same embedding space as nomic-embed-text-v1.5, which means you can use it to perform CLIP-style tricks comparing text and images. Here's their announcement from June 5th:

Together, Nomic Embed is the only unified embedding space that outperforms OpenAI CLIP and OpenAI Text Embedding 3 Small on multimodal and text tasks respectively.

Sadly the new vision weights are available under a non-commercial Creative Commons license (unlike the text weights which are Apache 2), so if you want to use the vision weights commercially you'll need to access them via Nomic's paid API.
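
A rough sketch of the shared-space trick, assuming the nomic Python client's embed.text and embed.image helpers (which call Nomic's hosted API and need an API key); the file names here are made up:

```python
# Sketch: text and image vectors from the two Nomic models share one
# embedding space, so a dot product can rank images against a text query.
# Assumes the nomic client (pip install nomic) and a configured API key.
import numpy as np
from nomic import embed

images = ["pikachu.png", "charizard.png", "snorlax.png"]  # hypothetical files

text_out = embed.text(
    texts=["a sleepy round pokemon"],
    model="nomic-embed-text-v1.5",
    task_type="search_query",
)
image_out = embed.image(images=images, model="nomic-embed-vision-v1.5")

query = np.array(text_out["embeddings"][0])
vectors = np.array(image_out["embeddings"])

# Normalise and rank by cosine similarity
query /= np.linalg.norm(query)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
for i in np.argsort(-(vectors @ query)):
    print(images[i], float(vectors[i] @ query))
```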

# 30th June 2024, 9:22 pm / ai, max-woolf, embeddings, clip

2023

Embeddings: What they are and why they matter

Embeddings are a really neat trick that often comes wrapped in a pile of intimidating jargon.

[... 5,835 words]

Finding Bathroom Faucets with Embeddings. Absolutely the coolest thing I’ve seen someone build on top of my LLM tool so far: Drew Breunig is renovating a bathroom and needed a way to filter through literally thousands of options for faucet taps. He scraped 20,000 images of fixtures from a plumbing supply site and used LLM to embed every one of them via CLIP... and now he can ask for “faucets that look like this one”, or even run searches for faucets that match “Gawdy” or “Bond Villain” or “Nintendo 64”. Live demo included!
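
Here's a rough sketch of that workflow using LLM's Python embeddings API, assuming the llm-clip plugin is installed and that Collection.embed accepts raw image bytes for binary-capable models:

```python
# Sketch: embed a folder of fixture photos with CLIP via the llm-clip
# plugin, then search the collection with a text phrase. Assumes
# `llm install llm-clip` has been run; raw bytes are passed because
# CLIP supports binary input.
from pathlib import Path
import llm
import sqlite_utils

db = sqlite_utils.Database("faucets.db")
collection = llm.Collection("faucets", db, model_id="clip")

for path in Path("images").glob("*.jpg"):
    collection.embed(path.name, path.read_bytes(), store=False)

# CLIP puts text and images in the same space, so a text query works:
for entry in collection.similar("bond villain", number=5):
    print(entry.id, entry.score)
```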

# 27th September 2023, 6:18 pm / ai, generative-ai, embeddings, llm, drew-breunig, clip

Build an image search engine with llm-clip, chat with models with llm chat

LLM is my combination CLI tool and Python library for working with Large Language Models. I just released LLM 0.10 with two significant new features: embedding support for binary files and the llm chat command.

[... 1,188 words]
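
As a quick illustration of the binary embedding feature, a sketch using the Python API (assuming the llm-clip plugin is installed to provide the clip model, and that it accepts raw bytes):

```python
# Sketch: LLM 0.10's embedding API can take bytes as well as strings,
# so a CLIP-backed model (from the llm-clip plugin) can embed an image
# file and a caption into the same vector space.
import llm

model = llm.get_embedding_model("clip")  # assumes llm-clip is installed
image_vector = model.embed(open("raccoon.jpg", "rb").read())  # hypothetical file
text_vector = model.embed("a raccoon in a bin")

print(len(image_vector), len(text_vector))  # same dimensionality for both
```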

AI photo sorter (via) Really interesting implementation of machine learning photo classification by Alexander Visheratin. This tool lets you select as many photos as you like from your own machine, then provides a web interface for classifying them into labels that you provide. It loads a 102MB quantized CLIP model and executes it in the browser using WebAssembly. Once classified, a “Generate script” button produces a copyable list of shell commands for moving your images into corresponding folders on your own machine. Your photos never get uploaded to a server—everything happens directly in your browser.
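
Under the hood this is zero-shot classification: embed each photo and each label with CLIP, then pick the label with the highest similarity. A minimal Python sketch of the same idea using the Hugging Face transformers CLIP classes (the tool itself runs a quantized model in the browser via WebAssembly):

```python
# Sketch: zero-shot photo classification with CLIP - score an image
# against user-provided labels and pick the most likely one.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["beach holiday", "birthday party", "screenshots", "pets"]
image = Image.open("IMG_0001.jpg")  # hypothetical file

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
best = probs.argmax().item()
print(labels[best], float(probs[best]))
```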

# 2nd April 2023, 4:27 am / machine-learning, webassembly, openai, clip

Announcing Open Flamingo (via) New from LAION: “OpenFlamingo is a framework that enables training and evaluation of large multimodal models (LMMs)”. Multimodal here means it can answer questions about images—their interactive demo includes tools for image captioning, animal recognition, counting objects and visual question answering. They’ve released the OpenFlamingo-9B model built on top of LLaMA 7B and CLIP ViT-L/14—the model checkpoint is a 5.24 GB download from Hugging Face, and is available under a non-commercial research license.

# 28th March 2023, 9:59 pm / ai, generative-ai, llama, laion, llms, clip