Simon Willison’s Weblog

Subscribe

Items in 2024

Filters: Year: 2024 × Sorted by date


I’ve worked out why I don’t get much value out of LLMs. The hardest and most time-consuming parts of my job involve distinguishing between ideas that are correct, and ideas that are plausible-sounding but wrong. Current AI is great at the latter type of ideas, and I don’t need more of those.

Martin Kleppmann # 27th April 2024, 7:31 pm

Everything Google’s Python team were responsible for. In a questionable strategic move, Google laid off the majority of their internal Python team a few days ago. Someone on Hacker News asked what the team had been responsible for, and team member zem relied with this fascinating comment providing detailed insight into how the team worked and indirectly how Python is used within Google. # 27th April 2024, 6:52 pm

It’s very fast to build something that’s 90% of a solution. The problem is that the last 10% of building something is usually the hard part which really matters, and with a black box at the center of the product, it feels much more difficult to me to nail that remaining 10%. With vibecheck, most of the time the results to my queries are great; some percentage of the time they aren’t. Closing that gap with gen AI feels much more fickle to me than a normal engineering problem. It could be that I’m unfamiliar with it, but I also wonder if some classes of generative AI based products are just doomed to mediocrity as a result.

Moxie Marlinspike # 26th April 2024, 9:40 pm

If you’re auditioning for your job every day, and you’re auditioning against every other brilliant employee there, and you know that at the end of the year, 6% of you are going to get cut no matter what, and at the same time, you have access to unrivaled data on partners, sellers, and competitors, you might be tempted to look at that data to get an edge and keep your job and get to your restricted stock units.

Dana Mattioli # 26th April 2024, 5:43 pm

Food Delivery Leak Unmasks Russian Security Agents. This story is from April 2022 but I realize now I never linked to it.

Yandex Food, a popular food delivery service in Russia, suffered a major data leak.

The data included an order history with names, addresses and phone numbers of people who had placed food orders through that service.

Bellingcat were able to cross-reference this leak with addresses of Russian security service buildings—including those linked to the GRU and FSB.This allowed them to identify the names and phone numbers of people working for those organizations, and then combine that information with further leaked data as part of their other investigations.

If you look closely at the screenshots in this story they may look familiar: Bellingcat were using Datasette internally as a tool for exploring this data! # 26th April 2024, 1:59 am

The only difference between screwing around and science is writing it down

Alex Jason, via Adam Savage # 25th April 2024, 2:17 pm

I’ve been at OpenAI for almost a year now. In that time, I’ve trained a lot of generative models. [...] It’s becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. [...] What this manifests as is – trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. [...] This is a surprising observation! It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else. Everything else is a means to an end in efficiently delivery compute to approximating that dataset.

James Betker # 25th April 2024, 5:13 am

Blogmarks that use markdown. I needed to attach a correction to an older blogmark (my 20-year old name for short-form links with commentary on my blog) today - but the commentary field has always been text, not HTML, so I didn't have a way to add the necessary link.

This motivated me to finally add optional Markdown support for blogmarks to my blog's custom Django CMS. I then went through and added inline code markup to a bunch of different older posts, and built this Django SQL Dashboard to keep track of which posts I had updated. # 25th April 2024, 4:34 am

No, Most Books Don’t Sell Only a Dozen Copies. I linked to a story the other day about book sales claiming "90 percent of them sold fewer than 2,000 copies and 50 percent sold less than a dozen copies", based on numbers released in the Penguin antitrust lawsuit. It turns out those numbers were interpreted incorrectly.

In this piece from September 2022 Lincoln Michel addresses this and other common misconceptions about book statistics.

Understanding these numbers requires understanding a whole lot of intricacies about how publishing actually works. Here's one illustrative snippet:

"Take the statistic that most published books only sell 99 copies. This seems shocking on its face. But if you dig into it, you’ll notice it was counting one year’s sales of all books that were in BookScan’s system. That’s quite different statistic than saying most books don’t sell 100 copies in total! A book could easily be a bestseller in, say, 1960 and sell only a trickle of copies today."

The top comment on the post comes from Kristen McLean of NPD BookScan, the organization who's numbers were misrepresented is the trial. She wasn't certain how the numbers had been sliced to get that 90% result, but in her own analysis of "frontlist sales for the top 10 publishers by unit volume in the U.S. Trade market" she found that 14.7% sold less than 12 copies and the 51.4% spot was for books selling less than a thousand. # 25th April 2024, 3:41 am

Snowflake Arctic Cookbook. Today's big model release was Snowflake Arctic, an enormous 480B model with a 128×3.66B MoE (Mixture of Experts) architecture. It's Apache 2 licensed and Snowflake state that "in addition, we are also open sourcing all of our data recipes and research insights."

The research insights will be shared on this Arctic Cookbook blog - which currently has two articles covering their MoE architecture and describing how they optimized their training run in great detail.

They also list dozens of "coming soon" posts, which should be pretty interesting given how much depth they've provided in their writing so far. # 25th April 2024, 2:47 am

When I said “Send a text message to Julian Chokkattu,” who’s a friend and fellow AI Pin reviewer over at Wired, I thought I’d be asked what I wanted to tell him. Instead, the device simply said OK and told me it sent the words “Hey Julian, just checking in. How’s your day going?” to Chokkattu. I’ve never said anything like that to him in our years of friendship, but I guess technically the AI Pin did do what I asked.

Cherlynn Low # 24th April 2024, 3:07 pm

openelm/README-pretraining.md. Apple released something big three hours ago, and I’m still trying to get my head around exactly what it is.

The parent project is called CoreNet, described as “A library for training deep neural networks”. Part of the release is a new LLM called OpenELM, which includes completely open source training code and a large number of published training checkpoint.

I’m linking here to the best documentation I’ve found of that training data: it looks like the bulk of it comes from RefinedWeb, RedPajama, The Pile and Dolma. # 24th April 2024, 2:57 am

A bad survey won’t tell you it’s bad. It’s actually really hard to find out that a bad survey is bad — or to tell whether you have written a good or bad set of questions. Bad code will have bugs. A bad interface design will fail a usability test. It’s possible to tell whether you are having a bad user interview right away. Feedback from a bad survey can only come in the form of a second source of information contradicting your analysis of the survey results.

Most seductively, surveys yield responses that are easy to count and counting things feels so certain and objective and truthful.

Even if you are counting lies.

Erika Hall # 24th April 2024, 12:31 am

We [Bluesky] took a somewhat novel approach of giving every user their own SQLite database. By removing the Postgres dependency, we made it possible to run a ‘PDS in a box’ [Personal Data Server] without having to worry about managing a database. We didn’t have to worry about things like replicas or failover. For those thinking this is irresponsible: don’t worry, we are backing up all the data on our PDSs!

SQLite worked really well because the PDS – in its ideal form – is a single-tenant system. We owned up to that by having these single tenant SQLite databases.

Daniel Holmgren # 23rd April 2024, 7 pm

microsoft/Phi-3-mini-4k-instruct-gguf (via) Microsoft’s Phi-3 LLM is out and it’s really impressive. This 4,000 token context GGUF model is just a 2.2GB (for the Q4 version) and ran on my Mac using the llamafile option described in the README. I could then run prompts through it using the llm-llamafile plugin.

The vibes are good! Initial test prompts I’ve tried feel similar to much larger 7B models, despite using just a few GBs of RAM. Tokens are returned fast too—it feels like the fastest model I’ve tried yet.

And it’s MIT licensed. # 23rd April 2024, 5:40 pm

Weeknotes: Llama 3, AI for Data Journalism, llm-evals and datasette-secrets

Llama 3 landed on Thursday. I ended up updating a whole bunch of different plugins to work with it, described in Options for accessing Llama 3 from the terminal using LLM.

[... 1030 words]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (via) By far the most detailed paper on prompt injection I’ve seen yet from OpenAI, published a few days ago and with six credited authors: Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke and Alex Beutel.

The paper notes that prompt injection mitigations which completely refuse any form of instruction in an untrusted prompt may not actually be ideal: some forms of instruction are harmless, and refusing them may provide a worse experience.

Instead, it proposes a hierarchy—where models are trained to consider if instructions from different levels conflict with or support the goals of the higher-level instructions—if they are aligned or misaligned with them.

The authors tested this idea by fine-tuning a model on top of GPT 3.5, and claim that it shows greatly improved performance against numerous prompt injection benchmarks.

As always with prompt injection, my key concern is that I don’t think “improved” is good enough here. If you are facing an adversarial attacker reducing the chance that they might find an exploit just means they’ll try harder until they find an attack that works.

The paper concludes with this note: “Finally, our current models are likely still vulnerable to powerful adversarial attacks. In the future, we will conduct more explicit adversarial training, and study more generally whether LLMs can be made sufficiently robust to enable high-stakes agentic applications.” # 23rd April 2024, 3:36 am

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.

Phi-3 Technical Report # 23rd April 2024, 3 am

timpaul/form-extractor-prototype (via) Tim Paul, Head of Interaction Design at the UK’s Government Digital Service, published this brilliant prototype built on top of Claude 3 Opus.

The video shows what it can do. Give it an image of a form and it will extract the form fields and use them to create a GDS-style multi-page interactive form, using their GOV.UK Forms design system and govuk-frontend npm package.

It works for both hand-drawn napkin illustrations and images of existing paper forms.

The bulk of the prompting logic is the schema definition in data/extract-form-questions.json

I’m always excited to see applications built on LLMs that go beyond the chatbot UI. This is a great example of exactly that. # 22nd April 2024, 10:01 pm

No one buys books (via) Fascinating insights into the book publishing industry gathered by Elle Griffin from details that came out during the Penguin vs. DOJ antitrust lawsuit.

Publishing turns out to be similar to VC investing: a tiny percentage of books are hits that cover the costs for the vast majority that didn't sell well. The DOJ found that, of 58,000 books published in a year, "90 percent of them sold fewer than 2,000 copies and 50 percent sold less than a dozen copies."

UPDATE: This story is inaccurate: those statistics were grossly misinterpreted during the trial. See this post for updated information.

Here's an even better debunking: Yes, People Do Buy Books (subtitle: "Despite viral claims, Americans buy over a billion books a year"). # 22nd April 2024, 9:55 pm

Options for accessing Llama 3 from the terminal using LLM

Llama 3 was released on Thursday. Early indications are that it’s now the best available openly licensed model—Llama 3 70b Instruct has taken joint 5th place on the LMSYS arena leaderboard, behind only Claude 3 Opus and some GPT-4s and sharing 5th place with Gemini Pro and Claude 3 Sonnet. But unlike those other models Llama 3 70b is weights available and can even be run on a (high end) laptop!

[... 1962 words]

qrank (via) Interesting and very niche project by Colin Dellow.

Wikidata has pages for huge numbers of concepts, people, places and things.

One of the many pieces of data they publish is QRank—“ranking Wikidata entities by aggregating page views on Wikipedia, Wikispecies, Wikibooks, Wikiquote, and other Wikimedia projects”. Every item gets a score and these scores can be used to answer questions like “which island nations get the most interest across Wikipedia”—potentially useful for things like deciding which labels to display on a highly compressed map of the world.

QRank is published as a gzipped CSV file.

Colin’s hikeratlas/qrank GitHub repository runs weekly, fetches the latest qrank.csv.gz file and loads it into a SQLite database using SQLite’s “.import” mechanism. Then it publishes the resulting SQLite database as an asset attached to the “latest” GitHub release on that repo—currently a 307MB file.

The database itself has just a single table mapping the Wikidata ID (a primary key integer) to the latest QRank—another integer. You’d need your own set of data with Wikidata IDs to join against this to do anything useful.

I’d never thought of using GitHub Releases for this kind of thing. I think it’s a really interesting pattern. # 21st April 2024, 10:28 pm

tiny-world-map (via) I love this project. It’s a JavaScript file (694K uncompressed, 283KB compressed) which can be used with the Leaflet mapping library and provides a SVG base map of the world with country borders and labels for every world city with a population more than 48,000—10,000 cities total.

This means you can bundle an offline map of the world as part of any application that doesn’t need a higher level of detail. A lot of smaller island nations are missing entirely though, so this may not be right for every project.

It even includes a service worker to help implement offline mapping support, plus several variants of the map with less cities that are even smaller. # 21st April 2024, 10:11 pm

doom-htop (via) Ludicrous, brilliant hack: it runs Doom, converts each frame to ASCII art, then runs one process for each line of ASCII and sets each process to allocate enough memory such that sorting by M_VIRT will show the lines in the correct order. Then it updates the argv[0] for each process on every frame such that htop displays the state of the game.

Probably only works on Ubuntu.

From the FAQ: “Q: Why did you make this? A: I thought it would be funny.” # 21st April 2024, 1:59 pm

The blog post announcing the shutdown was done one day early. The idea was to take the opportunity of the new Pope being announced and Andy Rubin being replaced as head of Android, so that the [Google] Reader news may be drowned out. PR didn’t apparently realize that the kinds of people that care about the other two events (especially the Pope) are not the same kind of people that care about Reader, so it didn’t work.

Mihai Parparita # 20th April 2024, 9:55 pm

Tips on Adding JSON Output to Your CLI App (via) Kelly Brazil - also the author of jc, the neat CLI tool that converts the output of common Unix utilities such as dig into JSON - provides some useful do's and don'ts for adding JSON output as an option to a command-line tool.

Kelly recommends defaulting to arrays of flat objects - or newline-delimited objects - and suggests including an "unbuffer" option for streaming tools that discourages the OS from buffering output that is being sent through a pipe. # 20th April 2024, 9:43 pm

llm-gpt4all. New release of my LLM plugin which builds on Nomic's excellent gpt4all Python library. I've upgraded to their latest version which adds support for Llama 3 8B Instruct, so after a 4.4GB model download this works:

llm -m Meta-Llama-3-8B-Instruct "say hi in Spanish" # 20th April 2024, 5:58 pm

Ruff v0.4.0: a hand-written recursive descent parser for Python. The latest release of Ruff—a Python linter and formatter, written in Rust—includes a complete rewrite of the core parser. Previously Ruff used a parser borrowed from RustPython, generated using the LALRPOP parser generator. Victor Hugo Gomes contributed a new parser written from scratch, which provided a 2x speedup and also added error recovery, allowing parsing of invalid Python—super-useful for a linter.

I tried Ruff 0.4.0 just now against Datasette—a reasonably large Python project—and it ran in less than 1/10th of a second. This thing is Fast. # 19th April 2024, 5 am

A POI Database in One Line (via) Overture maps offer an extraordinarily useful freely licensed databases of POI (point of interest) listings, principally derived from partners such as Facebook and including restaurants, shops, museums and other locations from all around the world.

Their new "overturemaps" Python CLI utility makes it easy to quickly pull subsets of their data... but requires you to provide a bounding box to do so.

Drew Breunig came up with this delightful recipe for fetching data using LLM and gpt-3.5-turbo to fill in those bounding boxes:

overturemaps download --bbox=$(llm 'Give me a bounding box for Alameda, California expressed as only four numbers delineated by commas, with no spaces, longitude preceding latitude.') -f geojsonseq --type=place | geojson-to-sqlite alameda.db places - --nl --pk=id # 19th April 2024, 2:44 am

Andrej Karpathy’s Llama 3 review. The most interesting coverage I’ve seen so far of Meta’s Llama 3 models (8b and 70b so far, 400b promised later).

Andrej notes that Llama 3 trained on 15 trillion tokens—up from 2 trillion for Llama 2—and they used that many even for the smaller 8b model, 75x more than the chinchilla scaling laws would suggest.

The tokenizer has also changed—they now use 128,000 tokens, up from 32,000. This results in a 15% drop in the tokens needed to represent a string of text.

The one disappointment is the context length—just 8,192, 2x that of Llama 2 and 4x LLaMA 1 but still pretty small by today’s standards.

If early indications hold, the 400b model could be the first genuinely GPT-4 class openly licensed model. We’ll have to wait and see. # 18th April 2024, 8:50 pm