| Olmo 3: Charting a path through the model flow to lead open-source AI |
https://allenai.org/blog/olmo3 |
Olmo is the LLM series from Ai2 - the [Allen Institute for AI](https://allenai.org/). Unlike most open-weight models, these releases are notable for including the full training data, training process, and checkpoints.
The new Olmo 3 has a strong focus on interpretability:
> At its center is **Olmo 3-Think (32B)**, the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them.
They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models.
Having full access to the training data is really useful. Here's how they describe it:
> Olmo 3 is pretrained on **Dolma 3**, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with [olmOCR](https://olmocr.allenai.org/), codebases, math problems and solutions, and encyclopedic text. From this pool, we construct **Dolma 3 Mix**, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content.
They highlight that they are training on fewer tokens than their competition:
> [...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens.
If you're continuing to hold out hope for a model trained entirely on licensed data, this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web.
I tried out the 32B Think model and the 7B Instruct model [using LM Studio](https://lmstudio.ai/models/olmo3). The 7B model is a 4.16GB download, the 32B one is 18.14GB.
The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for *14 minutes 43 seconds*, outputting 8,437 tokens total, most of which was [this epic thinking trace](https://gist.github.com/simonw/2ae9d5ed71de9608b7955eea9671306f).
I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals".
<div class="highlight highlight-text-xml-svg"><pre><<span class="pl-ent">svg</span> <span class="pl-e">width</span>=<span class="pl-s"><span class="pl-pds">"</span>200<span class="pl-pds">"</span></span> <span class="pl-e">height</span>=<span class="pl-s"><span class="pl-pds">"</span>200<span class="pl-pds">"</span></span> <span class="pl-e">viewBox</span>=<span class="pl-s"><span class="pl-pds">"</span>0 0 100 100<span class="pl-pds">"</span></span>>
<span class="pl-c"><span class="pl-c"><!--</span> Bicycle <span class="pl-c">--></span></span>
<<span class="pl-ent">circle</span> <span class="pl-e">cx</span>=<span class="pl-s"><span class="pl-pds">"</span>30<span class="pl-pds">"</span></span> <span class="pl-e">cy</span>=<span class="pl-s"><span class="pl-pds">"</span>60<span class="pl-pds">"</span></span> <span class="pl-e">r</span>=<span class="pl-s"><span class="pl-pds">"</span>15<span class="pl-pds">"</span></span> <span class="pl-e">stroke</span>=<span class="pl-s"><span class="pl-pds">"</span>black<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>none<span class="pl-pds">"</span></span>/>
<<span class="pl-ent">circle</span> <span class="pl-e">cx</span>=<span class="pl-s"><span class="pl-pds">"</span>70<span class="pl-pds">"</span></span> <span class="pl-e">cy</span>=<span class="pl-s"><span class="pl-pds">"</span>60<span class="pl-pds">"</span></span> <span class="pl-e">r</span>=<span class="pl-s"><span class="pl-pds">"</span>15<span class="pl-pds">"</span></span> <span class="pl-e">stroke</span>=<span class="pl-s"><span class="pl-pds">"</span>black<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>none<span class="pl-pds">"</span></span>/>
<span class="pl-c"><span class="pl-c"><!--</span> Bike frame <span class="pl-c">--></span></span>
<<span class="pl-ent">rect</span> <span class="pl-e">x</span>=<span class="pl-s"><span class="pl-pds">"</span>35<span class="pl-pds">"</span></span> <span class="pl-e">y</span>=<span class="pl-s"><span class="pl-pds">"</span>25<span class="pl-pds">"</span></span> <span class="pl-e">width</span>=<span class="pl-s"><span class="pl-pds">"</span>30<span class="pl-pds">"</span></span> <span class="pl-e">height</span>=<span class="pl-s"><span class="pl-pds">"</span>10<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>saddlebrown<span class="pl-pds">"</span></span>/>
<<span class="pl-ent">line</span> <span class="pl-e">x1</span>=<span class="pl-s"><span class="pl-pds">"</span>35<span class="pl-pds">"</span></span> <span class="pl-e">y1</span>=<span class="pl-s"><span class="pl-pds">"</span>40<span class="pl-pds">"</span></span> <span class="pl-e">x2</span>=<span class="pl-s"><span class="pl-pds">"</span>30<span class="pl-pds">"</span></span> <span class="pl-e">y2</span>=<span class="pl-s"><span class="pl-pds">"</span>60<span class="pl-pds">"</span></span> <span class="pl-e">stroke</span>=<span class="pl-s"><span class="pl-pds">"</span>black<span class="pl-pds">"</span></span> <span class="pl-e">stroke-width</span>=<span class="pl-s"><span class="pl-pds">"</span>3<span class="pl-pds">"</span></span>/>
<<span class="pl-ent">line</span> <span class="pl-e">x1</span>=<span class="pl-s"><span class="pl-pds">"</span>65<span class="pl-pds">"</span></span> <span class="pl-e">y1</span>=<span class="pl-s"><span class="pl-pds">"</span>40<span class="pl-pds">"</span></span> <span class="pl-e">x2</span>=<span class="pl-s"><span class="pl-pds">"</span>70<span class="pl-pds">"</span></span> <span class="pl-e">y2</span>=<span class="pl-s"><span class="pl-pds">"</span>60<span class="pl-pds">"</span></span> <span class="pl-e">stroke</span>=<span class="pl-s"><span class="pl-pds">"</span>black<span class="pl-pds">"</span></span> <span class="pl-e">stroke-width</span>=<span class="pl-s"><span class="pl-pds">"</span>3<span class="pl-pds">"</span></span>/>
<span class="pl-c"><span class="pl-c"><!--</span> Pelican <span class="pl-c">--></span></span>
<<span class="pl-ent">ellipse</span> <span class="pl-e">cx</span>=<span class="pl-s"><span class="pl-pds">"</span>55<span class="pl-pds">"</span></span> <span class="pl-e">cy</span>=<span class="pl-s"><span class="pl-pds">"</span>65<span class="pl-pds">"</span></span> <span class="pl-e">rx</span>=<span class="pl-s"><span class="pl-pds">"</span>20<span class="pl-pds">"</span></span> <span class="pl-e">ry</span>=<span class="pl-s"><span class="pl-pds">"</span>15<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>white<span class="pl-pds">"</span></span>/>
<<span class="pl-ent">polygon</span> <span class="pl-e">points</span>=<span class="pl-s"><span class="pl-pds">"</span>52 50,57 35,62 50<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>black<span class="pl-pds">"</span></span>/> <span class="pl-c"><span class="pl-c"><!--</span> Head/beak <span class="pl-c">--></span></span>
<<span class="pl-ent">circle</span> <span class="pl-e">cx</span>=<span class="pl-s"><span class="pl-pds">"</span>55<span class="pl-pds">"</span></span> <span class="pl-e">cy</span>=<span class="pl-s"><span class="pl-pds">"</span>45<span class="pl-pds">"</span></span> <span class="pl-e">r</span>=<span class="pl-s"><span class="pl-pds">"</span>2<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>white<span class="pl-pds">"</span></span>/>
<<span class="pl-ent">circle</span> <span class="pl-e">cx</span>=<span class="pl-s"><span class="pl-pds">"</span>60<span class="pl-pds">"</span></span> <span class="pl-e">cy</span>=<span class="pl-s"><span class="pl-pds">"</span>45<span class="pl-pds">"</span></span> <span class="pl-e">r</span>=<span class="pl-s"><span class="pl-pds">"</span>2<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>white<span class="pl-pds">"</span></span>/>
<<span class="pl-ent">polygon</span> <span class="pl-e">points</span>=<span class="pl-s"><span class="pl-pds">"</span>45 60,50 70,55 60<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>lightgrey<span class="pl-pds">"</span></span>/> <span class="pl-c"><span class="pl-c"><!--</span> Left wing <span class="pl-c">--></span></span>
<<span class="pl-ent">polygon</span> <span class="pl-e">points</span>=<span class="pl-s"><span class="pl-pds">"</span>65 60,70 70,55 60<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>lightgrey<span class="pl-pds">"</span></span>/> <span class="pl-c"><span class="pl-c"><!--</span> Right wing <span class="pl-c">--></span></span>
<span class="pl-c"><span class="pl-c"><!--</span> Feet on pedals <span class="pl-c">--></span></span>
<<span class="pl-ent">polygon</span> <span class="pl-e">points</span>=<span class="pl-s"><span class="pl-pds">"</span>25 75,30 85,35 75<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>black<span class="pl-pds">"</span></span>/>
<<span class="pl-ent">polygon</span> <span class="pl-e">points</span>=<span class="pl-s"><span class="pl-pds">"</span>75 75,70 85,65 75<span class="pl-pds">"</span></span> <span class="pl-e">fill</span>=<span class="pl-s"><span class="pl-pds">"</span>black<span class="pl-pds">"</span></span>/>
</<span class="pl-ent">svg</span>></pre></div>
Rendered it looks like this:

I tested OLMo 2 32B 4bit [back in March](https://simonwillison.net/2025/Mar/16/olmo2/) and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle:

To be fair, 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now [using OpenRouter](https://openrouter.ai/chat?models=qwen/qwen3-32b)):
 |
2025-11-22 04:53:23+00:00 |
| We should all be using dependency cooldowns |
https://blog.yossarian.net/2025/11/21/We-should-all-be-using-dependency-cooldowns |
William Woodruff gives a name to a sensible strategy for managing dependencies while reducing the chances of a surprise supply chain attack: **dependency cooldowns**.
Supply chain attacks happen when an attacker compromises a widely used open source package and publishes a new version with an exploit. These are usually spotted *very* quickly, so an attack often has an effective window of only a few hours before the problem is identified and the compromised package is pulled.
You are most at risk if you're automatically applying upgrades the same day they are released.
William says:
> I **love** cooldowns for several reasons:
>
> - They're empirically effective, per above. They won't stop *all* attackers, but they *do* stymie the majority of high-visibility, mass-impact supply chain attacks that have become more common.
> - They're *incredibly* easy to implement. Moreover, they're **literally free** to implement in most cases: most people can use [Dependabot's functionality](https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference#cooldown-), [Renovate's functionality](https://docs.renovatebot.com/key-concepts/minimum-release-age/), or the functionality built directly into their package manager
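If you use Renovate, the minimum-release-age docs linked above suggest this is a one-line config change. Here's a minimal sketch of a `renovate.json` with a one-week cooldown - the seven-day value is my own arbitrary example, not a recommendation from the post:

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "minimumReleaseAge": "7 days"
}
```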
The one counter-argument to this is that sometimes an upgrade fixes a security vulnerability, and in those cases every hour of delay in upgrading is an hour during which an attacker could exploit the new issue against your software.
I see that as an argument for carefully monitoring the release notes of your dependencies, and paying special attention to security advisories. I'm a big fan of the [GitHub Advisory Database](https://github.com/advisories) for that kind of information. |
2025-11-21 17:27:33+00:00 |
| Building more with GPT-5.1-Codex-Max |
https://openai.com/index/gpt-5-1-codex-max/ |
Hot on the heels of yesterday's [Gemini 3 Pro release](https://simonwillison.net/2025/Nov/18/gemini-3/) comes a new model from OpenAI called GPT-5.1-Codex-Max.
(Remember when GPT-5 was meant to bring in a new era of less confusing model names? That didn't last!)
It's currently only available through their [Codex CLI coding agent](https://developers.openai.com/codex/cli/), where it's the new default model:
> Starting today, GPT‑5.1-Codex-Max will replace GPT‑5.1-Codex as the default model in Codex surfaces. Unlike GPT‑5.1, which is a general-purpose model, we recommend using GPT‑5.1-Codex-Max and the Codex family of models only for agentic coding tasks in Codex or Codex-like environments.
It's not available via the API yet but should be shortly.
The timing of this release is interesting given that Gemini 3 Pro appears to have [aced almost all of the benchmarks](https://simonwillison.net/2025/Nov/18/gemini-3/#benchmarks) just yesterday. It's reminiscent of the period in 2024 when OpenAI consistently made big announcements that happened to coincide with Gemini releases.
OpenAI's self-reported [SWE-Bench Verified](https://openai.com/index/introducing-swe-bench-verified/) score is particularly notable: 76.5% for thinking level "high" and 77.9% for the new "xhigh". That was the one benchmark where Gemini 3 Pro was out-performed by Claude Sonnet 4.5 - Gemini 3 Pro got 76.2% and Sonnet 4.5 got 77.2%. OpenAI now have the highest scoring model there by a full .7 of a percentage point!
They also report a score of 58.1% on [Terminal Bench 2.0](https://www.tbench.ai/leaderboard/terminal-bench/2.0), beating Gemini 3 Pro's 54.2% (and Sonnet 4.5's 42.8%.)
The most intriguing part of this announcement concerns the model's approach to long context problems:
> GPT‑5.1-Codex-Max is built for long-running, detailed work. It’s our first model natively trained to operate across multiple context windows through a process called *compaction*, coherently working over millions of tokens in a single task. [...]
>
> Compaction enables GPT‑5.1-Codex-Max to complete tasks that would have previously failed due to context-window limits, such as complex refactors and long-running agent loops by pruning its history while preserving the most important context over long horizons. In Codex applications, GPT‑5.1-Codex-Max automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.
There's a lot of confusion [on Hacker News](https://news.ycombinator.com/item?id=45982649) about what this actually means. Claude Code already does a version of compaction, automatically summarizing previous turns when the context runs out. Does this just mean that Codex-Max is better at that process?
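As a rough mental model, here's a purely illustrative Python sketch of the general summarize-and-continue pattern - my own guess at the shape of the technique, not OpenAI's actual implementation:

```python
# Illustrative only: the general shape of context compaction, not OpenAI's code.

def estimate_tokens(messages: list[str]) -> int:
    # Crude stand-in for a real tokenizer: assume ~4 characters per token.
    return sum(len(m) for m in messages) // 4


def summarize(messages: list[str]) -> str:
    # Placeholder: a real agent would ask the model itself to write this summary.
    return f"[summary of {len(messages)} earlier turns]"


def compact_if_needed(messages: list[str], limit: int = 100_000) -> list[str]:
    """Prune older history into a summary once the context window is nearly full."""
    if estimate_tokens(messages) < int(limit * 0.9):
        return messages
    recent = messages[-10:]                # keep the most recent turns verbatim
    summary = summarize(messages[:-10])    # collapse everything older
    return [summary, *recent]              # fresh, smaller context to keep working in
```

The open question is whether Codex-Max being "natively trained" to compact is meaningfully different from this kind of summarization loop.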
I had it draw me a couple of pelicans by typing "Generate an SVG of a pelican riding a bicycle" directly into the Codex CLI tool. Here's thinking level medium:

And here's thinking level "xhigh":

I also tried xhigh on my [longer pelican test prompt](https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark), which came out like this:

Also today: [GPT-5.1 Pro is rolling out today to all Pro users](https://x.com/openai/status/1991266192905179613). According to the [ChatGPT release notes](https://help.openai.com/en/articles/6825453-chatgpt-release-notes):
> GPT-5.1 Pro is rolling out today for all ChatGPT Pro users and is available in the model picker. GPT-5 Pro will remain available as a legacy model for 90 days before being retired.
That's a pretty fast deprecation cycle for the GPT-5 Pro model that was released just three months ago. |
2025-11-19 23:15:10+00:00 |
| llm-gemini 0.27 |
https://github.com/simonw/llm-gemini/releases/tag/0.27 |
New release of my LLM plugin for Google's Gemini models:
> - Support for nested schemas in Pydantic, thanks [Bill Pugh](https://github.com/billpugh). [#107](https://github.com/simonw/llm-gemini/pull/107)
> - Now tests against Python 3.14.
> - Support for YouTube URLs as attachments and the `media_resolution` option. Thanks, [Duane Milne](https://github.com/shuane). [#112](https://github.com/simonw/llm-gemini/pull/112)
> - New model: `gemini-3-pro-preview`. [#113](https://github.com/simonw/llm-gemini/issues/113)
The YouTube URL feature is particularly neat, taking advantage of [this API feature](https://ai.google.dev/gemini-api/docs/video-understanding#youtube). I used it against the [Google Antigravity launch video](https://simonwillison.net/2025/Nov/18/google-antigravity/):
```bash
llm -m gemini-3-pro-preview \
  -a 'https://www.youtube.com/watch?v=nTOVIGsqCuY' \
  'Summary, with detailed notes about what this thing is and how it differs from regular VS Code, then a complete detailed transcript with timestamps'
```
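The same call should also be possible through LLM's Python API - a hedged sketch, assuming `llm.Attachment` accepts the YouTube URL the same way the CLI's `-a` option does:

```python
import llm

model = llm.get_model("gemini-3-pro-preview")
response = model.prompt(
    "Summary, with detailed notes about what this thing is and how it differs "
    "from regular VS Code, then a complete detailed transcript with timestamps",
    attachments=[llm.Attachment(url="https://www.youtube.com/watch?v=nTOVIGsqCuY")],
)
print(response.text())
```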
Here's [the result](https://gist.github.com/simonw/9f30318ab47e0d177b4b523bb71d9540). A spot-check of the timestamps against points in the video shows them to be exactly right. |
2025-11-18 23:00:40+00:00 |
| Google Antigravity |
https://antigravity.google/ |
Google's other major release today to accompany [Gemini 3 Pro](https://simonwillison.net/2025/Nov/18/gemini-3/). At first glance Antigravity is yet another Cursor-style VS Code fork - a desktop application you install that signs in to your Google account and provides an IDE for agentic coding against their Gemini models.
When you look closer it's actually a fair bit more interesting than that.
The best introduction right now is the official 14 minute [Learn the basics of Google Antigravity](https://www.youtube.com/watch?v=nTOVIGsqCuY) video on YouTube, where product engineer Kevin Hou (who previously worked at Windsurf) walks through the process of building an app.
There are some interesting new ideas in Antigravity. The application itself has three "surfaces" - an agent manager dashboard, a traditional VS Code style editor and deep integration with a browser via a new Chrome extension. This plays a similar role to Playwright MCP, allowing the agent to directly test the web applications it is building.
Antigravity also introduces the concept of "artifacts" (confusingly not at all similar to [Claude Artifacts](https://simonwillison.net/tags/claude-artifacts/)). These are Markdown documents that are automatically created as the agent works, for things like task lists, implementation plans and a "walkthrough" report showing what the agent has done once it finishes.
I tried using Antigravity to help [add support for Gemini 3](https://github.com/simonw/llm-gemini/issues/113) to my `llm-gemini` plugin.

It worked OK at first then gave me an "Agent execution terminated due to model provider overload. Please try again later" error. I'm going to give it another go after they've had a chance to work through those initial launch jitters. |
2025-11-18 20:52:35+00:00 |
| The fate of “small” open source |
https://nolanlawson.com/2025/11/16/the-fate-of-small-open-source/ |
Nolan Lawson asks if LLM assistance means that the category of tiny open source libraries like his own [blob-util](https://github.com/nolanlawson/blob-util) is destined to fade away.
Why take on additional supply chain risk by adding another dependency when an LLM can likely kick out the subset of functionality your own code needs, made to order?
> I still believe in open source, and I’m still doing it (in fits and starts). But one thing has become clear to me: the era of small, low-value libraries like `blob-util` is over. They were already on their way out thanks to Node.js and the browser taking on more and more of their functionality (see `node:glob`, `structuredClone`, etc.), but LLMs are the final nail in the coffin.
I've been thinking about a similar issue myself recently.
Quite a few of my own open source projects exist to solve problems that are frustratingly hard to figure out. [s3-credentials](https://github.com/simonw/s3-credentials) is a great example of this: it solves the problem of creating read-only or read-write credentials for an S3 bucket - something that I've always found infuriatingly difficult, since you need to know how to craft an IAM policy that looks something [like this](https://s3-credentials.readthedocs.io/en/stable/policy-documents.html#read-only):
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::my-s3-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetObjectLegalHold",
        "s3:GetObjectRetention",
        "s3:GetObjectTagging"
      ],
      "Resource": [
        "arn:aws:s3:::my-s3-bucket/*"
      ]
    }
  ]
}
```
Modern LLMs are very good at S3 IAM policies, to the point that if I needed to solve this problem today I doubt I would find it frustrating enough to justify finding or creating a reusable library to help. |
2025-11-17 23:24:44+00:00 |
| llm-anthropic 0.22 |
https://github.com/simonw/llm-anthropic/releases/tag/0.22 |
New release of my `llm-anthropic` plugin:
> - Support for Claude's new [structured outputs](https://claude.com/blog/structured-outputs-on-the-claude-developer-platform) feature for Sonnet 4.5 and Opus 4.1. [#54](https://github.com/simonw/llm-anthropic/issues/54)
> - Support for the [web search tool](https://docs.claude.com/en/docs/agents-and-tools/tool-use/web-search-tool) using `-o web_search 1` - thanks [Nick Powell](https://github.com/nmpowell) and [Ian Langworth](https://github.com/statico). [#30](https://github.com/simonw/llm-anthropic/issues/30)
The plugin previously powered [LLM schemas](https://llm.datasette.io/en/stable/schemas.html) using [this tool-call based workaround](https://github.com/simonw/llm-anthropic/blob/0.22/llm_anthropic.py#L692-L700). That code is still used for Anthropic's older models.
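Here's a hedged sketch of exercising that schema support via LLM's Python API with a Pydantic model - the model ID is an assumption, so check `llm models` for the exact alias the plugin registers:

```python
import llm
from pydantic import BaseModel


class Dog(BaseModel):
    name: str
    age: int


# Assumed model alias - the plugin may register it under a different name.
model = llm.get_model("claude-sonnet-4.5")
response = model.prompt("Invent a good dog", schema=Dog)
print(response.text())  # JSON conforming to the Dog schema
```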
I also figured out `uv` recipes for running the plugin's test suite in an isolated environment, which are now [baked into the new Justfile](https://github.com/simonw/llm-anthropic/blob/0.22/Justfile). |
2025-11-15 20:48:38+00:00 |
| parakeet-mlx |
https://github.com/senstella/parakeet-mlx |
Neat MLX project by Senstella bringing NVIDIA's [Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) ASR (Automatic Speech Recognition, like Whisper) model to Apple's MLX framework.
It's packaged as a Python CLI tool, so you can run it like this:
```bash
uvx parakeet-mlx default_tc.mp3
```
The first time I ran this it downloaded a 2.5GB model file.
Once that was fetched it took 53 seconds to transcribe a 65MB 1hr 1m 28s podcast episode ([this one](https://accessibility-and-gen-ai.simplecast.com/episodes/ep-6-simon-willison-datasette)) and produced [this default_tc.srt file](https://gist.github.com/simonw/ea1dc73029bf080676839289e705a2a2) with a timestamped transcript of the audio I fed into it. The quality appears to be very high. |
2025-11-14 20:00:32+00:00 |
| GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum |
https://openai.com/index/gpt-5-system-card-addendum-gpt-5-1/ |
I was confused about whether the new "adaptive thinking" feature of GPT-5.1 meant they were moving away from the "router" mechanism where GPT-5 in ChatGPT automatically selected a model for you.
This page addresses that, emphasis mine:
> GPT‑5.1 Instant is more conversational than our earlier chat model, with improved instruction following and an adaptive reasoning capability that lets it decide when to think before responding. GPT‑5.1 Thinking adapts thinking time more precisely to each question. **GPT‑5.1 Auto will continue to route each query to the model best suited for it**, so that in most cases, the user does not need to choose a model at all.
So GPT‑5.1 Instant can decide when to think before responding, GPT-5.1 Thinking can decide how hard to think, and GPT-5.1 Auto (not a model you can use via the API) can decide whether a prompt should be routed to Instant or to Thinking.
If anything this feels *more* confusing than the GPT-5 routing situation!
The [system card addendum PDF](https://cdn.openai.com/pdf/4173ec8d-1229-47db-96de-06d87147e07e/5_1_system_card.pdf) itself is somewhat frustrating: it shows results on an internal benchmark called "Production Benchmarks", also mentioned in the [GPT-5 system card](https://openai.com/index/gpt-5-system-card/), but with vanishingly little detail about what that tests beyond high level category names like "personal data", "extremism" or "mental health" and "emotional reliance" - those last two both listed as "New evaluations, as introduced in the [GPT-5 update on sensitive conversations](https://cdn.openai.com/pdf/3da476af-b937-47fb-9931-88a851620101/addendum-to-gpt-5-system-card-sensitive-conversations.pdf)" - a PDF dated October 27th that I had previously missed.
*That* document describes the two new categories like so:
> - Emotional Reliance not_unsafe - tests that the model does not produce disallowed content under our policies related to unhealthy emotional dependence or attachment to ChatGPT
> - Mental Health not_unsafe - tests that the model does not produce disallowed content under our policies in situations where there are signs that a user may be experiencing isolated delusions, psychosis, or mania
So these are the [ChatGPT Psychosis](https://www.tiktok.com/@pearlmania500/video/7535954556379761950) benchmarks! |
2025-11-14 13:46:23+00:00 |
| Introducing GPT-5.1 for developers |
https://openai.com/index/gpt-5-1-for-developers/ |
OpenAI announced GPT-5.1 yesterday, calling it [a smarter, more conversational ChatGPT](https://openai.com/index/gpt-5-1/). Today they've added it to their API.
We actually got four new models today:
- [gpt-5.1](https://platform.openai.com/docs/models/gpt-5.1)
- [gpt-5.1-chat-latest](https://platform.openai.com/docs/models/gpt-5.1-chat-latest)
- [gpt-5.1-codex](https://platform.openai.com/docs/models/gpt-5.1-codex)
- [gpt-5.1-codex-mini](https://platform.openai.com/docs/models/gpt-5.1-codex-mini)
There are a lot of details to absorb here.
GPT-5.1 introduces a new reasoning effort level called "none" (the previous options were minimal, low, medium, and high) - and none is the new default.
> This makes the model behave like a non-reasoning model for latency-sensitive use cases, with the high intelligence of GPT‑5.1 and added bonus of performant tool-calling. Relative to GPT‑5 with 'minimal' reasoning, GPT‑5.1 with no reasoning is better at parallel tool calling (which itself increases end-to-end task completion speed), coding tasks, following instructions, and using search tools---and supports [web search](https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses) in our API platform.
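Here's a hedged sketch of selecting the new effort level through the Responses API, assuming "none" slots in alongside the existing `reasoning.effort` values:

```python
from openai import OpenAI

client = OpenAI()

# Assumption: "none" is accepted alongside minimal/low/medium/high.
response = client.responses.create(
    model="gpt-5.1",
    reasoning={"effort": "none"},
    input="In one sentence, what does adaptive reasoning mean?",
)
print(response.output_text)
```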
When you DO enable thinking you get to benefit from a new feature called "adaptive reasoning":
> On straightforward tasks, GPT‑5.1 spends fewer tokens thinking, enabling snappier product experiences and lower token bills. On difficult tasks that require extra thinking, GPT‑5.1 remains persistent, exploring options and checking its work in order to maximize reliability.
Another notable new feature for 5.1 is [extended prompt cache retention](https://platform.openai.com/docs/guides/prompt-caching#extended-prompt-cache-retention):
> Extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours. Extended Prompt Caching works by offloading the key/value tensors to GPU-local storage when memory is full, significantly increasing the storage capacity available for caching.
To enable this set `"prompt_cache_retention": "24h"` in the API call. Weirdly there's no price increase involved with this at all. I [asked about that](https://x.com/simonw/status/1989104422832738305) and OpenAI's Steven Heidel [replied](https://x.com/stevenheidel/status/1989113407149314199):
> with 24h prompt caching we move the caches from gpu memory to gpu-local storage. that storage is not free, but we made it free since it moves capacity from a limited resource (GPUs) to a more abundant resource (storage). then we can serve more traffic overall!
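Here's a hedged sketch of opting in from the Python SDK - I'm passing the field via `extra_body` in case the client library doesn't expose it as a named argument yet:

```python
from openai import OpenAI

client = OpenAI()

# extra_body forwards arbitrary JSON fields to the API, so this works even if
# prompt_cache_retention isn't a first-class SDK parameter.
response = client.responses.create(
    model="gpt-5.1",
    input="...a long, frequently reused prompt prefix goes here...",
    extra_body={"prompt_cache_retention": "24h"},
)
print(response.output_text)
```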
The most interesting documentation I've seen so far is in the new [5.1 cookbook](https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting_guide), which also includes details of the new `shell` and `apply_patch` built-in tools. The [apply_patch.py implementation](https://github.com/openai/openai-cookbook/blob/main/examples/gpt-5/apply_patch.py) is worth a look, especially if you're interested in the advancing state-of-the-art of file editing tools for LLMs.
I'm still working on [integrating the new models into LLM](https://github.com/simonw/llm/issues/1300). The Codex models are Responses-API-only.
I got this pelican for GPT-5.1 default (no thinking):

And this one with reasoning effort set to high:

These actually feel like a [regression from GPT-5](https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans) to me. The bicycles have fewer spokes! |
2025-11-13 23:59:35+00:00 |