Simon Willison’s Weblog

Subscribe
Atom feed

Blogmarks

Filters: Sorted by date

My binary vector search is better than your FP32 vectors. I’m still trying to get my head around this, but here’s what I understand so far.

Embedding vectors as calculated by models such as OpenAI text-embedding-3-small are arrays of floating point values, which look something like this:

[0.0051681744, 0.017187592, -0.018685209, -0.01855924, -0.04725188...]—1356 elements long

Different embedding models have different lengths, but they tend to be hundreds up to low thousands of numbers. If each float is 32 bits that’s 4 bytes per float, which can add up to a lot of memory if you have millions of embedding vectors to compare.

If you look at those numbers you’ll note that they are all pretty small positive or negative numbers, close to 0.

Binary vector search is a trick where you take that sequence of floating point numbers and turn it into a binary vector—just a list of 1s and 0s, where you store a 1 if the corresponding float was greater than 0 and a 0 otherwise.

For the above example, this would start [1, 1, 0, 0, 0...]

Incredibly, it looks like the cosine distance between these 0 and 1 vectors captures much of the semantic relevant meaning present in the distance between the much more accurate vectors. This means you can use 1/32nd of the space and still get useful results!

Ce Gao here suggests a further optimization: use the binary vectors for a fast brute-force lookup of the top 200 matches, then run a more expensive re-ranking against those filtered values using the full floating point vectors.

# 26th March 2024, 4:56 am / vector-search, embeddings

Semgrep: AutoFixes using LLMs (via) semgrep is a really neat tool for semantic grep against source code—you can give it a pattern like “log.$A(...)” to match all forms of log.warning(...) / log.error(...) etc.

Ilia Choly built semgrepx— xargs for semgrep—and here shows how it can be used along with my llm CLI tool to execute code replacements against matches by passing them through an LLM such as Claude 3 Opus.

# 26th March 2024, 12:51 am / cli, ai, generative-ai, llms, llm, claude

sqlite-schema-diagram.sql (via) A SQLite SQL query that directly returns a GraphViz definition that renders a diagram of the database schema, by Tim Allen.

The SQL is beautifully commented. It works as a big set of UNION ALL statements against queries that join data from pragma_table_list(), pragma_table_info() and pragma_foreign_key_list().

# 25th March 2024, 5:12 am

Reviving PyMiniRacer (via) PyMiniRacer is “a V8 bridge in Python”—it’s a library that lets Python code execute JavaScript code in a V8 isolate and pass values back and forth (provided they serialize to JSON) between the two environments.

It was originally released in 2016 by Sqreen, a web app security startup startup. They were acquired by Datadog in 2021 and the project lost its corporate sponsor, but in this post Ben Creech announces that he is revitalizing the project, with the approval of the original maintainers.

I’m always interested in new options for running untrusted code in a safe sandbox. PyMiniRacer has the three features I care most about: code can’t access the filesystem or network by default, you can limit the RAM available to it and you can have it raise an error if code execution exceeds a time limit.

The documentation includes a newly written architecture overview which is well worth a read. Rather than embed V8 directly in Python the authors chose to use ctypes—they build their own V8 with a thin additional C++ layer to expose a ctypes-friendly API, then the Python library code uses ctypes to call that.

I really like this. V8 is a notoriously fast moving and complex dependency, so reducing the interface to just a thin C++ wrapper via ctypes feels very sensible to me.

This blog post is fun too: it’s a good, detailed description of the process to update something like this to use modern Python and modern CI practices. The steps taken to build V8 (6.6 GB of miscellaneous source and assets!) across multiple architectures in order to create binary wheels are particularly impressive—the Linux aarch64 build takes several days to run on GitHub Actions runners (via emulation), so they use Mozilla’s Sccache to cache compilation steps so they can retry until it finally finishes.

On macOS (Apple Silicon) installing the package with “pip install mini-racer” got me a 37MB dylib and a 17KB ctypes wrapper module.

# 24th March 2024, 5 pm / ctypes, javascript, open-source, python, v8

shelmet (via) This looks like a pleasant ergonomic alternative to Python's subprocess module, plus a whole bunch of other useful utilities. Lets you do things like this:

sh.cmd("ps", "aux").pipe("grep", "-i", check=False).run("search term")

I like the way it uses context managers as well: with sh.environ({"KEY1": "val1"}) sets new environment variables for the duration of the block, with sh.cd("path/to/dir") temporarily changes the working directory and with sh.atomicfile("file.txt") as fp lets you write to a temporary file that will be atomically renamed when the block finishes.

# 24th March 2024, 4:37 am / python

Strachey love letter algorithm (via) This is a beautiful piece of computer history. In 1952, Christopher Strachey—a contemporary of Alan Turing—wrote a love letter generation program for a Manchester Mark 1 computer. It produced output like this:

"Darling Sweetheart,

You are my avid fellow feeling. My affection curiously clings to your passionate wish. My liking yearns for your heart. You are my wistful sympathy: my tender liking.

Yours beautifully

M. U. C."

The algorithm simply combined a small set of predefined sentence structures, filled in with random adjectives.

Wikipedia notes that "Strachey wrote about his interest in how “a rather simple trick” can produce an illusion that the computer is thinking, and that “these tricks can lead to quite unexpected and interesting results”.

LLMs, 1952 edition!

# 23rd March 2024, 9:55 pm / computer-history, history, generative-ai, llms

time-machine example test for a segfault in Python (via) Here's a really neat testing trick by Adam Johnson. Someone reported a segfault bug in his time-machine library. How you you write a unit test that exercises a segfault without crashing the entire test suite?

Adam's solution is a test that does this:

subprocess.run([sys.executable, "-c", code_that_crashes_python], check=True)

sys.executable is the path to the current Python executable - ensuring the code will run in the same virtual environment as the test suite itself. The -c option can be used to have it run a (multi-line) string of Python code, and check=True causes the subprocess.run() function to raise an error if the subprocess fails to execute cleanly and returns an error code.

I'm absolutely going to be borrowing this pattern next time I need to add tests to cover a crashing bug in one of my projects.

# 23rd March 2024, 7:44 pm / python, testing, adam-johnson

mapshaper.org (via) It turns out the mapshaper CLI tool for manipulating geospatial data—including converting shapefiles to GeoJSON and back again—also has a web UI that runs the conversions entirely in your browser. If you need to convert between those (and other) formats it’s hard to imagine a more convenient option.

# 23rd March 2024, 3:44 am / cli, geospatial, javascript, shapefiles, geojson

Threads has entered the fediverse (via) Threads users with public profiles in certain countries can now turn on a setting which makes their posts available in the fediverse—so users of ActivityPub systems such as Mastodon can follow their accounts to subscribe to their posts.

It’s only a partial integration at the moment: Threads users can’t themselves follow accounts from other providers yet, and their notifications will show them likes but not boosts or replies: “For now, people who want to see replies on their posts on other fediverse servers will have to visit those servers directly.”

Depending on how you count, Mastodon has around 9m user accounts of which 1m are active. Threads claims more than 130m active monthly users. The Threads team are developing these features cautiously which is reassuring to see—a clumsy or thoughtless integration could cause all sorts of damage just from the sheer scale of their service.

# 22nd March 2024, 8:15 pm / facebook, threads, mastodon, activitypub, fediverse

The Dropflow Playground (via) Dropflow is a “CSS layout engine” written in TypeScript and taking advantage of the HarfBuzz text shaping engine (used by Chrome, Android, Firefox and more) compiled to WebAssembly to implement glyph layout.

This linked demo is fascinating: on the left hand side you can edit HTML with inline styles, and the right hand side then updates live to show that content rendered by Dropflow in a canvas element.

Why would you want this? It lets you generate images and PDFs with excellent performance using your existing knowledge HTML and CSS. It’s also just really cool!

# 22nd March 2024, 1:33 am / css, javascript, typescript, webassembly

DuckDB as the New jq (via) The DuckDB CLI tool can query JSON files directly, making it a surprisingly effective replacement for jq. Paul Gross demonstrates the following query:

select license->>'key' as license, count(*) from 'repos.json' group by 1

repos.json contains an array of {"license": {"key": "apache-2.0"}..} objects. This example query shows counts for each of those licenses.

# 21st March 2024, 8:36 pm / cli, sql, jq, duckdb

Redis Adopts Dual Source-Available Licensing (via) Well this sucks: after fifteen years (and contributions from more than 700 people), Redis is dropping the 3-clause BSD license going forward, instead being “dual-licensed under the Redis Source Available License (RSALv2) and Server Side Public License (SSPLv1)” from Redis 7.4 onwards.

# 21st March 2024, 2:24 am / open-source, redis

Talking about Django’s history and future on Django Chat (via) Django co-creator Jacob Kaplan-Moss sat down with the Django Chat podcast team to talk about Django’s history, his recent return to the Django Software Foundation board and what he hopes to achieve there.

Here’s his post about it, where he used Whisper and Claude to extract some of his own highlights from the conversation.

# 21st March 2024, 12:42 am / django, jacob-kaplan-moss, podcasts, python, dsf

GitHub Public repo history tool (via) I built this Observable Notebook to run queries against the GH Archive (via ClickHouse) to try to answer questions about repository history—in particular, were they ever made public as opposed to private in the past.

It works by combining together PublicEvent event (moments when a private repo was made public) with the most recent PushEvent event for each of a user’s repositories.

# 20th March 2024, 9:56 pm / github, projects, observable, clickhouse

Releasing Common Corpus: the largest public domain dataset for training LLMs (via) Released today. 500 billion words from "a wide diversity of cultural heritage initiatives". 180 billion words of English, 110 billion of French, 30 billion of German, then Dutch, Spanish and Italian.

Includes quite a lot of US public domain data - 21 million digitized out-of-copyright newspapers (or do they mean newspaper articles?)

This is only an initial part of what we have collected so far, in part due to the lengthy process of copyright duration verification. In the following weeks and months, we’ll continue to publish many additional datasets also coming from other open sources, such as open data or open science.

Coordinated by French AI startup Pleias and supported by the French Ministry of Culture, among others.

I can't wait to try a model that's been trained on this.

# 20th March 2024, 7:34 pm / copyright, ethics, ai, generative-ai, llms, training-data, pleias, ai-ethics

Skew protection in Vercel (via) Version skew is a name for the bug that occurs when your user loads a web application and then unintentionally keeps that browser tab open across a deployment of a new version of the app. If you’re unlucky this can lead to broken behaviour, where a client makes a call to a backend endpoint that has changed in an incompatible way.

Vercel have an ingenious solution to this problem. Their platform already makes it easy to deploy many different instances of an application. You can now turn on “skew protection” for a number of hours which will keep older versions of your backend deployed.

The application itself can then include its desired deployment ID in a x-deployment-id header, a __vdpl cookie or a ?dpl= query string parameter.

# 20th March 2024, 2:06 pm / frontend, zero-downtime, vercel

Every dunder method in Python. Trey Hunner: "Python includes 103 'normal' dunder methods, 12 library-specific dunder methods, and at least 52 other dunder attributes of various types."

This cheat sheet doubles as a tour of many of the more obscure corners of the Python language and standard library.

I did not know that Python has over 100 dunder methods now! Quite a few of these were new to me, like __class_getitem__ which can be used to implement type annotations such as list[int].

# 20th March 2024, 3:45 am / python, trey-hunner

AI Prompt Engineering Is Dead. Long live AI prompt engineering. Ignoring the clickbait in the title, this article summarizes research around the idea of using machine learning models to optimize prompts—as seen in tools such as Stanford’s DSPy and Google’s OPRO.

The article includes possibly the biggest abuse of the term “just” I have ever seen:

“But that’s where hopefully this research will come in and say ‘don’t bother.’ Just develop a scoring metric so that the system itself can tell whether one prompt is better than another, and then just let the model optimize itself.”

Developing a scoring metric to determine which prompt works better remains one of the hardest challenges in generative AI!

Imagine if we had a discipline of engineers who could reliably solve that problem—who spent their time developing such metrics and then using them to optimize their prompts. If the term “prompt engineer” hadn’t already been reduced to basically meaning “someone who types out prompts” it would be a pretty fitting term for such experts.

# 20th March 2024, 3:22 am / ai, prompt-engineering, generative-ai, llms, dspy

Papa Parse (via) I’ve been trying out this JavaScript library for parsing CSV and TSV data today and I’m very impressed. It’s extremely fast, has all of the advanced features I want (streaming support, optional web workers, automatically detecting delimiters and column types), has zero dependencies and weighs just 19KB minified—6.8KB gzipped.

The project is 11 years old now. It was created by Matt Holt, who later went on to create the Caddy web server. Today it’s maintained by Sergi Almacellas Abellana.

# 20th March 2024, 12:53 am / csv, javascript, matt-holt

DiskCache (via) Grant Jenks built DiskCache as an alternative caching backend for Django (also usable without Django), using a SQLite database on disk. The performance numbers are impressive—it even beats memcached in microbenchmarks, due to avoiding the need to access the network.

The source code (particularly in core.py) is a great case-study in SQLite performance optimization, after five years of iteration on making it all run as fast as possible.

# 19th March 2024, 3:43 pm / django, performance, python, sqlite

The Tokenizer Playground (via) I built a tool like this a while ago, but this one is much better: it provides an interface for experimenting with tokenizers from a wide range of model architectures, including Llama, Claude, Mistral and Grok-1—all running in the browser using Transformers.js.

# 19th March 2024, 2:18 am / ai, generative-ai, llms, transformers-js, tokenization

900 Sites, 125 million accounts, 1 vulnerability (via) Google’s Firebase development platform encourages building applications (mobile an web) which talk directly to the underlying data store, reading and writing from “collections” with access protected by Firebase Security Rules.

Unsurprisingly, a lot of development teams make mistakes with these.

This post describes how a security research team built a scanner that found over 124 million unprotected records across 900 different applications, including huge amounts of PII: 106 million email addresses, 20 million passwords (many in plaintext) and 27 million instances of “Bank details, invoices, etc”.

Most worrying of all, only 24% of the site owners they contacted shipped a fix for the misconfiguration.

# 18th March 2024, 6:53 pm / google, security

Grok-1 code and model weights release (via) xAI have released their Grok-1 model under an Apache 2 license (for both weights and code). It's distributed as a 318.24G torrent file and likely requires 320GB of VRAM to run, so needs some very hefty hardware.

The accompanying blog post says "Trained from scratch by xAI using a custom training stack on top of JAX and Rust in October 2023", and describes it as a "314B parameter Mixture-of-Experts model with 25% of the weights active on a given token".

Very little information on what it was actually trained on, all we know is that it was "a large amount of text data, not fine-tuned for any particular task".

# 17th March 2024, 8:20 pm / ai, generative-ai, llms, grok, llm-release, xai

Add ETag header for static responses. I’ve been procrastinating on adding better caching headers for static assets (JavaScript and CSS) served by Datasette for several years, because I’ve been wanting to implement the perfect solution that sets far-future cache headers on every asset and ensures the URLs change when they are updated.

Agustin Bacigalup just submitted the best kind of pull request: he observed that adding ETag support for static assets would side-step the complexity while adding much of the benefit, and implemented it along with tests.

It’s a substantial performance improvement for any Datasette instance with a number of JavaScript plugins... like the ones we are building on Datasette Cloud. I’m just annoyed we didn’t ship something like this sooner!

# 17th March 2024, 7:25 pm / etags, web-performance, datasette, datasette-cloud

How does SQLite store data? Michal Pitr explores the design of the SQLite on-disk file format, as part of building an educational implementation of SQLite from scratch in Go.

# 17th March 2024, 6:47 pm / sqlite

npm install everything, and the complete and utter chaos that follows (via) Here’s an experiment which went really badly wrong: a team of mostly-students decided to see if it was possible to install every package from npm (all 2.5 million of them) on the same machine. As part of that experiment they created and published their own npm package that depended on every other package in the registry.

Unfortunately, in response to the leftpad incident a few years ago npm had introduced a policy that a package cannot be removed from the registry if there exists at least one other package that lists it as a dependency. The new “everything” package inadvertently prevented all 2.5m packages—including many that had no other dependencies—from ever being removed!

# 16th March 2024, 5:18 am / packaging, security, npm

Phanpy. Phanpy is "a minimalistic opinionated Mastodon web client" by Chee Aun.

I think that description undersells it. It's beautifully crafted and designed and has a ton of innovative ideas - they way it displays threads and replies, the "Catch-up" beta feature, it's all a really thoughtful and fresh perspective on how Mastodon can work.

I love that all Mastodon servers (including my own dedicated instance) offer a CORS-enabled JSON API which directly supports building these kinds of alternative clients.

Building a full-featured client like this one is a huge amount of work, but building a much simpler client that just displays the user's incoming timeline could be a pretty great educational project for people who are looking to deepen their front-end development skills.

# 16th March 2024, 1:34 am / javascript, mastodon, fediverse, cors

Google Scholar search: “certainly, here is” -chatgpt -llm (via) Searching Google Scholar for “certainly, here is” turns up a huge number of academic papers that include parts that were evidently written by ChatGPT—sections that start with “Certainly, here is a concise summary of the provided sections:” are a dead giveaway.

# 15th March 2024, 1:43 pm / ethics, google, search-engines, ai, generative-ai, chatgpt, llms, ai-ethics, ai-misuse

Advanced Topics in Reminders and To Do Lists. Fred Benenson’s advanced guide to the Apple Reminders ecosystem. I live my life by Reminders—I particularly like that you can set them with Siri, so “Hey Siri, remind me to check the chickens made it to bed at 7pm every evening” sets up a recurring reminder without having to fiddle around in the UI. Fred has some useful tips here I hadn’t seen before.

# 15th March 2024, 2:38 am / productivity

How Figma’s databases team lived to tell the scale (via) The best kind of scaling war story:

"Figma’s database stack has grown almost 100x since 2020. [...] In 2020, we were running a single Postgres database hosted on AWS’s largest physical instance, and by the end of 2022, we had built out a distributed architecture with caching, read replicas, and a dozen vertically partitioned databases."

I like the concept of "colos", their internal name for sharded groups of related tables arranged such that those tables can be queried using joins.

Also smart: separating the migration into "logical sharding" - where queries all still run against a single database, even though they are logically routed as if the database was already sharded - followed by "physical sharding" where the data is actually copied to and served from the new database servers.

Logical sharding was implemented using PostgreSQL views, which can accept both reads and writes:

CREATE VIEW table_shard1 AS SELECT * FROM table WHERE hash(shard_key) >= min_shard_range AND hash(shard_key) < max_shard_range)

The final piece of the puzzle was DBProxy, a custom PostgreSQL query proxy written in Go that can parse the query to an AST and use that to decide which shard the query should be sent to. Impressively it also has a scatter-gather mechanism, so select * from table can be sent to all shards at once and the results combined back together again.

# 14th March 2024, 9:23 pm / databases, postgresql, scaling, sharding

Years

Tags