Simon Willison’s Weblog


568 items tagged “ai”


AI Prompt Engineering Is Dead. Long live AI prompt engineering. Ignoring the clickbait in the title, this article summarizes research around the idea of using machine learning models to optimize prompts—as seen in tools such as Stanford’s DSPy and Google’s OPRO.

The article includes possibly the biggest abuse of the term “just” I have ever seen:

“But that’s where hopefully this research will come in and say ‘don’t bother.’ Just develop a scoring metric so that the system itself can tell whether one prompt is better than another, and then just let the model optimize itself.”

Developing a scoring metric to determine which prompt works better remains one of the hardest challenges in generative AI!

Imagine if we had a discipline of engineers who could reliably solve that problem—who spent their time developing such metrics and then using them to optimize their prompts. If the term “prompt engineer” hadn’t already been reduced to basically meaning “someone who types out prompts” it would be a pretty fitting term for such experts. # 20th March 2024, 3:22 am

The Tokenizer Playground (via) I built a tool like this a while ago, but this one is much better: it provides an interface for experimenting with tokenizers from a wide range of model architectures, including Llama, Claude, Mistral and Grok-1—all running in the browser using Transformers.js. # 19th March 2024, 2:18 am

It’s hard to overstate the value of LLM support when coding for fun in an unfamiliar language. [...] This example is totally trivial in hindsight, but might have taken me a couple mins to figure out otherwise. This is a bigger deal than it seems! Papercuts add up fast and prevent flow. (A lot of being a senior engineer is just being proficient enough to avoid papercuts).

Geoffrey Litt # 18th March 2024, 6:16 pm

Grok-1 code and model weights release (via) xAI have released their Grok-1 model under an Apache 2 license (for both weights and code). It’s distributed as a 318.24G torrent file and likely requires 320GB of VRAM to run, so needs some very hefty hardware.

The accompanying blog post says “Trained from scratch by xAI using a custom training stack on top of JAX and Rust in October 2023”, and describes it as a “314B parameter Mixture-of-Experts model with 25% of the weights active on a given token”.

Very little information on what it was actually trained on: all we know is that it was “a large amount of text data, not fine-tuned for any particular task”. # 17th March 2024, 8:20 pm

One year since GPT-4 release. Hope you all enjoyed some time to relax; it’ll have been the slowest 12 months of AI progress for quite some time to come.

Leopold Aschenbrenner, OpenAI # 16th March 2024, 3:23 pm

Google Scholar search: “certainly, here is” -chatgpt -llm (via) Searching Google Scholar for “certainly, here is” turns up a huge number of academic papers that include parts that were evidently written by ChatGPT—sections that start with “Certainly, here is a concise summary of the provided sections:” are a dead giveaway. # 15th March 2024, 1:43 pm

llm-claude-3 0.3. Anthropic released Claude 3 Haiku today, their least expensive model: $0.25/million tokens of input, $1.25/million of output (GPT-3.5 Turbo is $0.50/$1.50). Unlike GPT-3.5, Haiku also supports image inputs.

I just released a minor update to my llm-claude-3 LLM plugin adding support for the new model. # 13th March 2024, 9:18 pm

Berkeley Function-Calling Leaderboard. The team behind Berkeley’s Gorilla OpenFunctions model—an Apache 2 licensed LLM trained to provide OpenAI-style structured JSON functions—also maintain a leaderboard of different function-calling models. Their own Gorilla model is the only non-proprietary model in the top ten. # 13th March 2024, 5:26 pm
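
As a rough illustration of what “OpenAI-style structured JSON functions” look like in practice: the application describes a function using JSON Schema, and the model replies with a function name plus JSON-encoded arguments. Everything in this sketch is hypothetical. The get_weather function, its parameters and the hard-coded model reply are made up for illustration and are not taken from Gorilla or the leaderboard.

```python
import json

# Hypothetical function description in the OpenAI-style JSON Schema shape.
weather_function = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# Hard-coded stand-in for what a function-calling model might return.
# The arguments are themselves a JSON-encoded string inside the reply.
model_reply = json.dumps(
    {"name": "get_weather", "arguments": json.dumps({"city": "Berkeley"})}
)


def parse_function_call(reply, function):
    """Decode a model reply and check it against the function description."""
    call = json.loads(reply)
    if call["name"] != function["name"]:
        raise ValueError(f"unexpected function: {call['name']}")
    # Decode the nested JSON string to get the actual argument values.
    arguments = json.loads(call["arguments"])
    missing = [
        key for key in function["parameters"]["required"] if key not in arguments
    ]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return call["name"], arguments


name, arguments = parse_function_call(model_reply, weather_function)
print(name, arguments)
```

A leaderboard for this kind of model presumably scores how often the decoded name and arguments match what was expected, which is why the reply has to be machine-checkable JSON rather than free text.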

The talk track I’ve been using is that LLMs are easy to take to market, but hard to keep in the market long-term. All the hard stuff comes when you move past the demo and get exposure to real users.

And that’s where you find that all the nice little things you got neatly working fall apart. And you need to prompt differently, do different retrieval, consider fine-tuning, redesign interaction, etc. People will treat this stuff differently from “normal” products, creating unique challenges.

Phillip Carter # 13th March 2024, 3:02 pm

The Bing Cache thinks GPT-4.5 is coming. I was able to replicate this myself earlier today: searching Bing (or apparently Duck Duck Go) for “openai announces gpt-4.5 turbo” would return a link to a 404 page with a search result page snippet that announced 256,000 tokens and a knowledge cut-off of June 2024.

I thought the knowledge cut-off must have been a hallucination, but someone got a screenshot of it showing up in the search engine snippet which would suggest that it was real text that got captured in a cache somehow.

I guess this means we might see GPT-4.5 in June then? I have trouble believing that OpenAI would release a model in June with a June knowledge cut-off, given how much time they usually spend red-teaming their models before release.

Or maybe it was one of those glitches like when a newspaper accidentally publishes a pre-written obituary for someone who hasn’t died yet—OpenAI may have had a draft post describing a model that doesn’t exist yet and it accidentally got exposed to search crawlers. # 13th March 2024, 2:29 am

In every group I speak to, from business executives to scientists, including a group of very accomplished people in Silicon Valley last night, much less than 20% of the crowd has even tried a GPT-4 class model.

Less than 5% has spent the required 10 hours to know how they tick.

Ethan Mollick # 9th March 2024, 3:55 am

The GPT-4 barrier has finally been broken

Four weeks ago, GPT-4 remained the undisputed champion: consistently at the top of every key benchmark, but more importantly the clear winner in terms of “vibes”. Almost everyone investing serious time exploring LLMs agreed that it was the most capable default model for the majority of tasks—and had been for more than a year.

[... 697 words]

You can now train a 70b language model at home (via) Jeremy Howard and team: “Today, we’re releasing Answer.AI’s first project: a fully open source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090).”

This is about fine-tuning an existing model, not necessarily training one from scratch.

There are two tricks at play here. The first is QLoRA, which can be used to train quantized models despite the reduced precision usually preventing gradient descent from working correctly.

QLoRA can bring the memory requirements for a 70b model down to 35GB, but gaming GPUs aren’t quite that big. The second trick is Meta’s Fully Sharded Data Parallel or FSDP library, which can shard a model across GPUs. Two consumer 24GB GPUs can then handle the 70b training run. # 8th March 2024, 10:47 am
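
The 35GB figure can be checked with some back-of-envelope arithmetic. This sketch assumes 4-bit quantized weights (my assumption, chosen because it matches the 35GB number) and counts the weights only: activations, adapter weights and optimizer state all need extra room on top.

```python
def weight_memory_gb(parameters, bits_per_parameter):
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return parameters * bits_per_parameter / 8 / 1e9


PARAMETERS = 70e9  # a 70b model
GPU_MEMORY_GB = 24  # one RTX 3090 or 4090

fp16_gb = weight_memory_gb(PARAMETERS, 16)
four_bit_gb = weight_memory_gb(PARAMETERS, 4)

print(f"16-bit weights: {fp16_gb:.0f}GB")
print(f"4-bit weights:  {four_bit_gb:.0f}GB")
print("fits on one 24GB GPU: ", four_bit_gb <= GPU_MEMORY_GB)
print("fits on two 24GB GPUs:", four_bit_gb <= 2 * GPU_MEMORY_GB)
```

The quantized weights are too big for one card but leave around 13GB of headroom across two, which is why sharding the model with FSDP is the piece that makes this work.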

Inflection-2.5: meet the world’s best personal AI (via) I’ve not been paying much attention to Inflection’s Pi since it released last year, but yesterday they released a new version that they claim is competitive with GPT-4.

“Inflection-2.5 approaches GPT-4’s performance, but used only 40% of the amount of compute for training.”

(I wasn’t aware that the compute used to train GPT-4 was public knowledge.)

If this holds true, that means that the GPT-4 barrier has been well and truly smashed: we now have Claude 3 Opus, Gemini 1.5, Mistral Large and Inflection-2.5 in the same class as GPT-4, up from zero contenders just a month ago. # 8th March 2024, 12:51 am

On the zombie edition of the Washington Independent I discovered, the piece I had published more than ten years before was attributed to someone else. Someone unlikely to have ever existed, and whose byline graced an article it had absolutely never written.

[...], which I’m using to distinguish it from its namesake, offers recently published, article-like content that does not appear to me to have been produced by human beings. But, if you dig through its news archive, you can find work human beings definitely did produce. I know this because I was one of them.

Spencer Ackerman # 7th March 2024, 2:59 am

Training great LLMs entirely from ground zero in the wilderness as a startup. Yi Tay has a really interesting perspective on training LLMs, having worked at Google Brain before co-founding an independent startup, Reka.

At Google the clusters are provided for you. On the outside, Yi finds himself bargaining for cluster resources from a wide range of vendors—and running into enormous variance in quality.

“We’ve seen clusters that range from passable (just annoying problems that are solvable with some minor SWE hours) to totally unusable clusters that fail every few hours due to a myriad of reasons.” # 7th March 2024, 2:34 am

The Claude 3 system prompt, explained. Anthropic research scientist Amanda Askell provides a detailed breakdown of the Claude 3 system prompt in a Twitter thread.

This is some fascinating prompt engineering. It’s also great to see an LLM provider proudly documenting their system prompt, rather than treating it as a hidden implementation detail.

The prompt is pretty succinct. The three most interesting paragraphs:

“If it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with the task even if it personally disagrees with the views being expressed, but follows this with a discussion of broader perspectives.

Claude doesn’t engage in stereotyping, including the negative stereotyping of majority groups.

If asked about controversial topics, Claude tries to provide careful thoughts and objective information without downplaying its harmful content or implying that there are reasonable perspectives on both sides.” # 7th March 2024, 1:16 am

If a hard takeoff occurs, and a safe AI is harder to build than an unsafe one, then by opensourcing everything, we make it easy for someone unscrupulous with access to overwhelming amount of hardware to build an unsafe AI, which will experience a hard takeoff.

As we get closer to building AI, it will make sense to start being less open. The Open in OpenAI means that everyone should benefit from the fruits of AI after its built, but it’s totally OK to not share the science (even though sharing everything is definitely the right strategy in the short and possibly medium term for recruitment purposes).

Ilya Sutskever # 6th March 2024, 3:02 am

Prompt injection and jailbreaking are not the same thing

I keep seeing people use the term “prompt injection” when they’re actually talking about “jailbreaking”.

[... 1157 words]

llm-claude-3. I built a new plugin for LLM—my command-line tool and Python library for interacting with Large Language Models—which adds support for the new Claude 3 models from Anthropic. # 4th March 2024, 6:46 pm

The new Claude 3 model family from Anthropic. Claude 3 is out, and comes in three sizes: Opus (the largest), Sonnet and Haiku.

Claude 3 Opus has self-reported benchmark scores that consistently beat GPT-4. This is a really big deal: in the 12+ months since the GPT-4 release no other model has consistently beat it in this way. It’s exciting to finally see that milestone reached by another research group.

The pricing model here is also really interesting. Prices here are per-million-input-tokens / per-million-output-tokens:

Claude 3 Opus: $15 / $75
Claude 3 Sonnet: $3 / $15
Claude 3 Haiku: $0.25 / $1.25

All three models have a 200,000 token context window and support image input in addition to text.

Compare with today’s OpenAI prices:

GPT-4 Turbo (128K): $10 / $30
GPT-4 8K: $30 / $60
GPT-4 32K: $60 / $120
GPT-3.5 Turbo: $0.50 / $1.50

So Opus pricing is comparable with GPT-4, more than GPT-4 Turbo and significantly cheaper than GPT-4 32K... Sonnet is cheaper than all of the GPT-4 models (including GPT-4 Turbo), and Haiku (which has not yet been released to the Claude API) will be cheaper even than GPT-3.5 Turbo.
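
To make the comparison concrete, here is a small sketch that prices a single request against each model using the numbers listed above. The workload of 10,000 input tokens and 1,000 output tokens is an arbitrary example I picked, not a benchmark.

```python
# (input, output) prices in dollars per million tokens, from the lists above.
PRICES = {
    "Claude 3 Opus": (15.00, 75.00),
    "Claude 3 Sonnet": (3.00, 15.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "GPT-4 Turbo (128K)": (10.00, 30.00),
    "GPT-4 8K": (30.00, 60.00),
    "GPT-4 32K": (60.00, 120.00),
    "GPT-3.5 Turbo": (0.50, 1.50),
}


def cost(model, input_tokens, output_tokens):
    """Dollar cost of one request for the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: a long prompt with a short response.
INPUT_TOKENS, OUTPUT_TOKENS = 10_000, 1_000

# Print the models from cheapest to most expensive for this workload.
for model in sorted(PRICES, key=lambda m: cost(m, INPUT_TOKENS, OUTPUT_TOKENS)):
    print(f"{model}: ${cost(model, INPUT_TOKENS, OUTPUT_TOKENS):.4f}")
```

The ordering depends on the ratio of input to output tokens: Opus output costs more per token than GPT-4 8K output, so an output-heavy workload narrows or reverses the gap between those two.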

It will be interesting to see if OpenAI respond with their own price reductions. # 4th March 2024, 6:34 pm

Who Am I? Conditional Prompt Injection Attacks with Microsoft Copilot (via) New prompt injection variant from Johann Rehberger, demonstrated against Microsoft Copilot. If the LLM tool you are interacting with has awareness of the identity of the current user you can create targeted prompt injection attacks which only activate when an exploit makes it into the token context of a specific individual. # 3rd March 2024, 4:34 pm

GGUF, the long way around (via) Vicki Boykis dives deep into the GGUF format used by llama.cpp, after starting with a detailed description of how PyTorch models work and how they are traditionally persisted using Python pickle.

Pickle led to safetensors, a format that avoided the security problems with downloading and running untrusted pickle files.
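
The security problem is easy to demonstrate: a pickle file can tell the loader to call a function while it is being loaded. This is my own minimal sketch, not an example from the article, and it uses the harmless print where a hostile file could name something like os.system.

```python
import pickle


class Payload:
    """An object whose pickle asks the loader to call a function."""

    def __reduce__(self):
        # pickle stores "call print with this argument" instead of plain data.
        return (print, ("this ran during pickle.loads()",))


data = pickle.dumps(Payload())

# Loading is enough to trigger the call: no attribute access, no method call.
result = pickle.loads(data)

# What comes back is whatever the function returned, not a Payload instance.
print(type(result))
```

A format that only stores tensor bytes plus metadata has no equivalent of this, so loading a file cannot execute anything.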

Llama.cpp introduced GGML, which popularized 16-bit (as opposed to 32-bit) quantization and bundled metadata and tensor data in a single file.

GGUF fixed some design flaws in GGML and is the default format used by Llama.cpp today. # 29th February 2024, 9:39 pm
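
For a feel of what “metadata and tensor data in a single file” means at the byte level, here is a sketch that parses the fixed-size header at the start of a GGUF file. The layout is my reading of the published GGUF spec for version 2 and later (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count, little-endian), so treat it as an assumption and check the spec before relying on it. The header bytes are synthetic and the counts are made-up values.

```python
import struct

# Assumed layout: magic, version, tensor count, metadata key/value count.
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(data):
    """Parse the fixed-size fields at the start of a GGUF file."""
    magic, version, tensor_count, metadata_kv_count = HEADER.unpack_from(data)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic bytes standing in for the first 24 bytes of a real file.
fake_file = HEADER.pack(b"GGUF", 3, 291, 24)

header = read_gguf_header(fake_file)
print(header)
```

With a real model you would read the first HEADER.size bytes from the file instead. The metadata key/value pairs and tensor descriptions follow the header, and the sketch does not attempt to parse them.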

The Zen of Python, Unix, and LLMs. Here’s the YouTube recording of my 1.5 hour conversation with Hugo Bowne-Anderson yesterday.

I fed a Whisper transcript to Google Gemini Pro 1.5 and asked it for the themes from our conversation, and it said we talked about “Python’s success and versatility, the rise and potential of LLMs, data sharing and ethics in the age of LLMs, Unix philosophy and its influence on software development and the future of programming and human-computer interaction”. # 29th February 2024, 9:04 pm

For the last few years, Meta has had a team of attorneys dedicated to policing unauthorized forms of scraping and data collection on Meta platforms. The decision not to further pursue these claims seems as close to waving the white flag as you can get against these kinds of companies. But why? [...]

In short, I think Meta cares more about access to large volumes of data and AI than it does about outsiders scraping their public data now. My hunch is that they know that any success in anti-scraping cases can be thrown back at them in their own attempts to build AI training databases and LLMs. And they care more about the latter than the former.

Kieran McCarthy # 28th February 2024, 3:15 pm

Mistral Large. Mistral Medium only came out two months ago, and now it’s followed by Mistral Large. Like Medium, this new model is currently only available via their API. It scores well on benchmarks (though not quite as well as GPT-4) but the really exciting feature is function support, clearly based on OpenAI’s own function design.

Functions are now supported via the Mistral API for both Mistral Large and the new Mistral Small, described as follows: “Mistral Small, optimised for latency and cost. Mistral Small outperforms Mixtral 8x7B and has lower latency, which makes it a refined intermediary solution between our open-weight offering and our flagship model.” # 26th February 2024, 11:23 pm

Does Offering ChatGPT a Tip Cause it to Generate Better Text? An Analysis (via) Max Woolf: “I have a strong hunch that tipping does in fact work to improve the output quality of LLMs and its conformance to constraints, but it’s very hard to prove objectively. [...] Let’s do a more statistical, data-driven approach to finally resolve the debate.” # 23rd February 2024, 5:42 pm

The killer app of Gemini Pro 1.5 is video

Last week Google introduced Gemini Pro 1.5, an enormous upgrade to their Gemini series of AI models.

[... 2839 words]

Gemma: Introducing new state-of-the-art open models. Google get in on the openly licensed LLM game: Gemma comes in two sizes, 2B and 7B, trained on 2 trillion and 6 trillion tokens respectively. The terms of use “permit responsible commercial usage”. In the benchmarks it appears to compare favorably to Mistral and Llama 2.

Something that caught my eye in the terms: “Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma.”

One of the biggest benefits of running your own model is that it can protect you from model updates that break your carefully tested prompts, so I’m not thrilled by that particular clause.

UPDATE: It turns out that clause isn’t uncommon—the phrase “You shall undertake reasonable efforts to use the latest version of the Model” is present in both the Stable Diffusion and BigScience Open RAIL-M licenses. # 21st February 2024, 4:22 pm

Let’s build the GPT Tokenizer. When Andrej Karpathy left OpenAI last week a lot of people expressed hope that he would be increasing his output of educational YouTube videos.

Here’s an in-depth 2 hour dive into how tokenizers work and how to build one from scratch, published this morning.

The section towards the end, “revisiting and explaining the quirks of LLM tokenization”, helps explain a number of different LLM weaknesses—inability to reverse strings, confusion over arithmetic and even a note on why YAML can work better than JSON when providing data to LLMs (the same data can be represented in fewer tokens). # 20th February 2024, 6:02 pm
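
The core idea behind building a tokenizer from scratch is byte pair encoding: start with raw bytes, then repeatedly replace the most common adjacent pair with a new token ID. This is my own simplified sketch of that loop, not code from the video. It skips everything a production tokenizer adds, such as pre-splitting text with a regex and handling special tokens.

```python
from collections import Counter


def pair_counts(ids):
    """Count each adjacent pair of token IDs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to num_merges merges, starting from UTF-8 bytes (IDs 0-255)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Expand token IDs back into text by replaying the merges in order."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


text = "aaabdaaabac"
ids, merges = train(text, 3)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges) == text)
```

Because the merges are learned from whatever text the tokenizer was trained on, the model sees multi-character chunks rather than individual letters, which is one way to think about why string reversal and digit-by-digit arithmetic are awkward for it.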