<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llm-release</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llm-release.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-05T23:56:09+00:00</updated><author><name>Simon Willison</name></author><entry><title>Introducing GPT‑5.4</title><link href="https://simonwillison.net/2026/Mar/5/introducing-gpt54/#atom-tag" rel="alternate"/><published>2026-03-05T23:56:09+00:00</published><updated>2026-03-05T23:56:09+00:00</updated><id>https://simonwillison.net/2026/Mar/5/introducing-gpt54/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-4/"&gt;Introducing GPT‑5.4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Two new API models: &lt;a href="https://developers.openai.com/api/docs/models/gpt-5.4"&gt;gpt-5.4&lt;/a&gt; and &lt;a href="https://developers.openai.com/api/docs/models/gpt-5.4-pro"&gt;gpt-5.4-pro&lt;/a&gt;, also available in ChatGPT and Codex CLI. August 31st 2025 knowledge cutoff, 1 million token context window. Priced &lt;a href="https://www.llm-prices.com/#sel=gpt-5.2%2Cgpt-5.2-pro%2Cgpt-5.4%2Cgpt-5.4-272k%2Cgpt-5.4-pro%2Cgpt-5.4-pro-272k"&gt;slightly higher&lt;/a&gt; than the GPT-5.2 family, with a bump in price for both models if you go above 272,000 tokens.&lt;/p&gt;
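&lt;p&gt;Both models should be reachable from the terminal via &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; - a sketch, assuming the new IDs are registered with the default OpenAI plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm keys set openai # paste in key
# Assumes the gpt-5.4 ID is recognized - check "llm models" first
llm -m gpt-5.4 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;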
&lt;p&gt;5.4 beats coding specialist GPT-5.3-Codex on all of the relevant benchmarks. I wonder if we'll get a 5.4 Codex or if that model line has now been merged into main?&lt;/p&gt;
&lt;p&gt;Given Claude's recent focus on business applications, it's interesting to see OpenAI highlight this in their announcement of GPT-5.4:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We put a particular focus on improving GPT‑5.4’s ability to create and edit spreadsheets, presentations, and documents. On an internal benchmark of spreadsheet modeling tasks that a junior investment banking analyst might do, GPT‑5.4 achieves a mean score of &lt;strong&gt;87.3%&lt;/strong&gt;, compared to &lt;strong&gt;68.4%&lt;/strong&gt; for GPT‑5.2.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's a pelican on a bicycle &lt;a href="https://gist.github.com/simonw/7fe75b8dab6ec9c2b6bd8fd1a5a640a6"&gt;drawn by GPT-5.4&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="alt text by GPT-5.4: Illustration of a cartoon pelican riding a bicycle, with a light gray background, dark blue bike frame and wheels, orange beak and legs, and motion lines suggesting movement." src="https://static.simonwillison.net/static/2026/gpt-5.4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/688c0d5d93a5539b93d3f549a0b733ad"&gt;here's one&lt;/a&gt; by GPT-5.4 Pro, which took 4m45s and cost me &lt;a href="https://www.llm-prices.com/#it=16&amp;amp;ot=8593&amp;amp;sel=gpt-5.4-pro"&gt;$1.55&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Described by GPT-5.4: Illustration of a cartoon pelican riding a blue bicycle on pale green grass against a light gray background, with a large orange beak, gray-and-white body, and orange legs posed on the pedals." src="https://static.simonwillison.net/static/2026/gpt-5.4-pro-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="generative-ai"/><category term="openai"/><category term="ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>Gemini 3.1 Flash-Lite</title><link href="https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag" rel="alternate"/><published>2026-03-03T21:53:54+00:00</published><updated>2026-03-03T21:53:54+00:00</updated><id>https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Google's latest model is an update to their inexpensive Flash-Lite family. At $0.25/million input tokens and $1.50/million output tokens, this is 1/8th the price of Gemini 3.1 Pro.&lt;/p&gt;
&lt;p&gt;It supports four different thinking levels, so I had it output &lt;a href="https://gist.github.com/simonw/99fb28dc11d0c24137d4ff8a33978a9e"&gt;four different pelicans&lt;/a&gt;:&lt;/p&gt;
&lt;div style="
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 8px;
    margin: 0 auto;
  "&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-minimal.png" alt="A minimalist vector-style illustration of a stylized bird riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;minimal&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-low.png" alt="A minimalist graphic of a light blue round bird with a single black dot for an eye, wearing a yellow backpack and riding a black bicycle on a flat grey line." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;low&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-medium.png" alt="A minimalist digital illustration of a light blue bird wearing a yellow backpack while riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;medium&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-high.png" alt="A minimal, stylized line drawing of a bird-like creature with a yellow beak riding a bicycle made of simple geometric lines." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;high&lt;/p&gt;
    &lt;/div&gt;
&lt;/div&gt;
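&lt;p&gt;Here's roughly how I'd run those four via my &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; plugin - a sketch, assuming the new model gets a preview ID following the usual pattern and keeps the &lt;code&gt;--thinking-level&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
# Hypothetical model ID - check "llm models" for the registered name
llm -m gemini-3.1-flash-lite-preview --thinking-level minimal \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;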


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/></entry><entry><title>Gemini 3.1 Pro</title><link href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/#atom-tag" rel="alternate"/><published>2026-02-19T17:58:37+00:00</published><updated>2026-02-19T17:58:37+00:00</updated><id>https://simonwillison.net/2026/Feb/19/gemini-31-pro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/"&gt;Gemini 3.1 Pro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The first in the Gemini 3.1 series, priced the same as Gemini 3 Pro ($2/million input, $12/million output under 200,000 tokens, $4/$18 from 200,000 to 1,000,000). That's less than half the price of Claude Opus 4.6, with very similar benchmark scores.&lt;/p&gt;
&lt;p&gt;They boast about its improved SVG animation performance compared to Gemini 3 Pro in the announcement!&lt;/p&gt;
&lt;p&gt;I tried "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221ugF9fBfLGxnNoe8_rLlluzo9NSPJDWuF%22%5D,%22action%22:%22open%22,%22userId%22:%22106366615678321494423%22,%22resourceKeys%22:%7B%7D%7D&amp;amp;usp=sharing"&gt;in Google AI Studio&lt;/a&gt; and it thought for 323.9 seconds (&lt;a href="https://gist.github.com/simonw/03a755865021739a3659943a22c125ba#thinking-trace"&gt;thinking trace here&lt;/a&gt;) before producing this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat-style illustration of a pelican wearing a blue and white baseball cap, riding a red bicycle with yellow-rimmed wheels along a road. The pelican has a large orange bill and a green scarf. A small fish peeks out of a brown basket on the handlebars. The background features a light blue sky with a yellow sun, white clouds, and green hills." src="https://static.simonwillison.net/static/2026/gemini-3.1-pro-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;It's good to see the legs clearly depicted on both sides of the frame (should &lt;a href="https://twitter.com/elonmusk/status/2023833496804839808"&gt;satisfy Elon&lt;/a&gt;), the fish in the basket is a nice touch and I appreciated this comment in &lt;a href="https://gist.github.com/simonw/03a755865021739a3659943a22c125ba#response"&gt;the SVG code&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Black Flight Feathers on Wing Tip --&amp;gt;
&amp;lt;path d="M 420 175 C 440 182, 460 187, 470 190 C 450 210, 430 208, 410 198 Z" fill="#374151" /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've &lt;a href="https://github.com/simonw/llm-gemini/issues/121"&gt;added&lt;/a&gt; the two new model IDs &lt;code&gt;gemini-3.1-pro-preview&lt;/code&gt; and &lt;code&gt;gemini-3.1-pro-preview-customtools&lt;/code&gt; to my &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini plugin&lt;/a&gt; for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;. That "custom tools" one is &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview#gemini-31-pro-preview-customtools"&gt;described here&lt;/a&gt; - apparently it may provide better tool performance than the default model in some situations.&lt;/p&gt;
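&lt;p&gt;With the updated plugin installed, trying the new model looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3.1-pro-preview "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;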
&lt;p&gt;The model appears to be &lt;em&gt;incredibly&lt;/em&gt; slow right now - it took 104s to respond to a simple "hi", and a few of my other tests ran into "Error: This model is currently experiencing high demand. Spikes in demand are usually temporary. Please try again later." or "Error: Deadline expired before operation could complete" errors. I'm assuming that's just teething problems on launch day.&lt;/p&gt;
&lt;p&gt;It sounds like last week's &lt;a href="https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/"&gt;Deep Think release&lt;/a&gt; was our first exposure to the 3.1 family:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: In &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;What happens if AI labs train for pelicans riding bicycles?&lt;/a&gt; last November I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Google's Gemini Lead Jeff Dean &lt;a href="https://x.com/JeffDean/status/2024525132266688757"&gt;tweeted this video&lt;/a&gt; featuring an animated pelican riding a bicycle, plus a frog on a penny-farthing, a giraffe driving a tiny car, an ostrich on roller skates, a turtle kickflipping a skateboard, and a dachshund driving a stretch limousine.&lt;/p&gt;
&lt;video style="margin-bottom: 1em; max-width: 100%" poster="https://static.simonwillison.net/static/2026/gemini-animated-pelicans.jpg" muted controls preload="none"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-animated-pelicans.mp4" type="video/mp4"&gt;
&lt;/video&gt;

&lt;p&gt;I've been saying for a while that I wish AI labs would highlight things that their new models can do that their older models could not, so top marks to the Gemini team for this video.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: I used &lt;code&gt;llm-gemini&lt;/code&gt; to run my &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;more detailed Pelican prompt&lt;/a&gt;, with &lt;a href="https://gist.github.com/simonw/a3bdd4ec9476ba9e9ba7aa61b46d8296"&gt;this result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Flat-style illustration of a brown pelican riding a teal bicycle with dark blue-rimmed wheels against a plain white background. Unlike the previous image's white cartoon pelican, this pelican has realistic brown plumage with detailed feather patterns, a dark maroon head, yellow eye, and a large pink-tinged pouch bill. The bicycle is a simpler design without a basket, and the scene lacks the colorful background elements like the sun, clouds, road, hills, cap, and scarf from the first illustration, giving it a more minimalist feel." src="https://static.simonwillison.net/static/2026/gemini-3.1-pro-pelican-2.png" /&gt;&lt;/p&gt;
&lt;p&gt;From the SVG comments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Pouch Gradient (Breeding Plumage: Red to Olive/Green) --&amp;gt;
...
&amp;lt;!-- Neck Gradient (Breeding Plumage: Chestnut Nape, White/Yellow Front) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/><category term="svg"/></entry><entry><title>Introducing Claude Sonnet 4.6</title><link href="https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag" rel="alternate"/><published>2026-02-17T23:58:58+00:00</published><updated>2026-02-17T23:58:58+00:00</updated><id>https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-sonnet-4-6"&gt;Introducing Claude Sonnet 4.6&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;November's Opus 4.5&lt;/a&gt; while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus models are $5/$25). Here's &lt;a href="https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf"&gt;the system card PDF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 has a "reliable knowledge cutoff" of August 2025, compared to Opus 4.6's May 2025 and Haiku 4.5's February 2025. Both Opus and Sonnet default to 200,000 max input tokens but can stretch to 1 million in beta and at a higher cost.&lt;/p&gt;
&lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.24"&gt;llm-anthropic 0.24&lt;/a&gt; with support for both Sonnet 4.6 and Opus 4.6. Claude Code &lt;a href="https://github.com/simonw/llm-anthropic/pull/65"&gt;did most of the work&lt;/a&gt; - the new models had a fiddly amount of extra details around adaptive thinking and no longer supporting prefixes, as described &lt;a href="https://platform.claude.com/docs/en/about-claude/models/migration-guide"&gt;in Anthropic's migration guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b185576a95e9321b441f0a4dfc0e297c"&gt;what I got&lt;/a&gt; from:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with llm-anthropic llm 'Generate an SVG of a pelican riding a bicycle' -m claude-sonnet-4.6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="The pelican has a jaunty top hat with a red band. There is a string between the upper and lower beaks for some reason. The bicycle frame is warped in the wrong way." src="https://static.simonwillison.net/static/2026/pelican-sonnet-4.6.png" /&gt;&lt;/p&gt;
&lt;p&gt;The SVG comments include:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Hat (fun accessory) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried a second time and also got a top hat. Sonnet 4.6 apparently loves top hats!&lt;/p&gt;
&lt;p&gt;For comparison, here's the pelican Opus 4.5 drew me &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;in November&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's Anthropic's current best pelican, drawn by Opus 4.6 &lt;a href="https://simonwillison.net/2026/Feb/5/two-new-models/"&gt;on February 5th&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers." src="https://static.simonwillison.net/static/2026/opus-4.6-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Opus 4.6 produces the best pelican beak/pouch. I do think the top hat from Sonnet 4.6 is a nice touch though.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47050488"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="generative-ai"/><category term="pelican-riding-a-bicycle"/><category term="claude-code"/></entry><entry><title>Qwen3.5: Towards Native Multimodal Agents</title><link href="https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag" rel="alternate"/><published>2026-02-17T04:30:57+00:00</published><updated>2026-02-17T04:30:57+00:00</updated><id>https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=qwen3.5"&gt;Qwen3.5: Towards Native Multimodal Agents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Alibaba's Qwen just released the first two models in the Qwen 3.5 series - one open weights, one proprietary. Both are multi-modal, accepting image inputs.&lt;/p&gt;
&lt;p&gt;The open weight one is a Mixture of Experts model called Qwen3.5-397B-A17B. Interesting to see Qwen call out serving efficiency as a benefit of that architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's &lt;a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B"&gt;807GB on Hugging Face&lt;/a&gt;, and Unsloth have a &lt;a href="https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF"&gt;collection of smaller GGUFs&lt;/a&gt; ranging in size from 94.2GB 1-bit to 462GB Q8_K_XL.&lt;/p&gt;
&lt;p&gt;I got this &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican&lt;/a&gt; from the &lt;a href="https://openrouter.ai/qwen/qwen3.5-397b-a17b"&gt;OpenRouter hosted model&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/625546cf6b371f9c0040e64492943b82"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pelican is quite good although the neck lacks an outline for some reason. Bicycle is very basic with an incomplete frame" src="https://static.simonwillison.net/static/2026/qwen3.5-397b-a17b.png" /&gt;&lt;/p&gt;
&lt;p&gt;The proprietary hosted model is called Qwen3.5 Plus 2026-02-15, and the naming is a little confusing. Qwen researcher &lt;a href="https://twitter.com/JustinLin610/status/2023340126479569140"&gt;Junyang Lin says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens, Qwen3.5-Plus supports 1M token context length. Additionally it supports search and code interpreter, which you can use on Qwen Chat with Auto mode.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9507dd47483f78dc1195117735273e20"&gt;its pelican&lt;/a&gt;, which is similar in quality to the open weights model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Similar quality pelican. The bicycle is taller and has a better frame shape. They are visually quite similar." src="https://static.simonwillison.net/static/2026/qwen3.5-plus-02-15.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="ai"/><category term="qwen"/><category term="llms"/><category term="ai-in-china"/><category term="llm-release"/><category term="generative-ai"/><category term="openrouter"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>Introducing GPT‑5.3‑Codex‑Spark</title><link href="https://simonwillison.net/2026/Feb/12/codex-spark/#atom-tag" rel="alternate"/><published>2026-02-12T21:16:07+00:00</published><updated>2026-02-12T21:16:07+00:00</updated><id>https://simonwillison.net/2026/Feb/12/codex-spark/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/"&gt;Introducing GPT‑5.3‑Codex‑Spark&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI announced a partnership with Cerebras &lt;a href="https://openai.com/index/cerebras-partnership/"&gt;on January 14th&lt;/a&gt;. Four weeks later they're already launching the first integration, "an ultra-fast model for real-time coding in Codex".&lt;/p&gt;
&lt;p&gt;Despite being named GPT-5.3-Codex-Spark it's not purely an accelerated alternative to GPT-5.3-Codex - the blog post calls it "a smaller version of GPT‑5.3-Codex" and clarifies that "at launch, Codex-Spark has a 128k context window and is text-only."&lt;/p&gt;
&lt;p&gt;I had some preview access to this model and I can confirm that it's significantly faster than their other models.&lt;/p&gt;
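&lt;p&gt;If you have Codex CLI access you should be able to select it with the &lt;code&gt;--model&lt;/code&gt; option - a sketch, assuming the slug matches the announced name:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical model slug - check the Codex CLI model picker for the real ID
codex exec -m gpt-5.3-codex-spark "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;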
&lt;p&gt;Here's what that speed looks like running in Codex CLI:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium-last.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;That was the "Generate an SVG of a pelican riding a bicycle" prompt - here's the rendered result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat illustration of an orange duck merged with a bicycle, where the duck's body forms the seat and frame area while its head extends forward over the handlebars, set against a simple light blue sky and green grass background." src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Compare that to the speed of regular GPT-5.3 Codex medium:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium-last.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;Significantly slower, but the pelican is a lot better:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat illustration of a white pelican riding a dark blue bicycle at speed, with motion lines behind it, its long orange beak streaming back in the wind, set against a light blue sky and green grass background." src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;What's interesting about this model isn't the quality though, it's the &lt;em&gt;speed&lt;/em&gt;. When a model responds this fast you can stay in flow state and iterate with the model much more productively.&lt;/p&gt;
&lt;p&gt;I showed a demo of Cerebras running Llama 3.1 70B at 2,000 tokens/second on Val Town &lt;a href="https://simonwillison.net/2024/Oct/31/cerebras-coder/"&gt;back in October 2024&lt;/a&gt;. OpenAI claim 1,000 tokens/second for their new model, and I expect it will prove to be a ferociously useful partner for hands-on iterative coding sessions.&lt;/p&gt;
&lt;p&gt;It's not yet clear what the pricing will look like for this new model.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-performance"/><category term="openai"/><category term="cerebras"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="codex-cli"/><category term="generative-ai"/></entry><entry><title>Gemini 3 Deep Think</title><link href="https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/#atom-tag" rel="alternate"/><published>2026-02-12T18:12:17+00:00</published><updated>2026-02-12T18:12:17+00:00</updated><id>https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/"&gt;Gemini 3 Deep Think&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New from Google. They say it's "built to push the frontier of intelligence and solve modern challenges across science, research, and engineering".&lt;/p&gt;
&lt;p&gt;It drew me a &lt;em&gt;really good&lt;/em&gt; &lt;a href="https://gist.github.com/simonw/7e317ebb5cf8e75b2fcec4d0694a8199"&gt;SVG of a pelican riding a bicycle&lt;/a&gt;! I think this is the best one I've seen so far - here's &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my previous collection&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="This alt text also generated by Gemini 3 Deep Think: A highly detailed, colorful, flat vector illustration with thick dark blue outlines depicting a stylized white pelican riding a bright cyan blue bicycle from left to right across a sandy beige beach with white speed lines indicating forward motion. The pelican features a light blue eye, a pink cheek blush, a massive bill with a vertical gradient from yellow to orange, a backward magenta cap with a cyan brim and a small yellow top button, and a matching magenta scarf blowing backward in the wind. Its white wing, accented with a grey mid-section and dark blue feather tips, reaches forward to grip the handlebars, while its long tan leg and orange foot press down on an orange pedal. Attached to the front handlebars is a white wire basket carrying a bright blue cartoon fish that is pointing upwards and forwards. The bicycle itself has a cyan frame, dark blue tires, striking neon pink inner rims, cyan spokes, a white front chainring, and a dark blue chain. Behind the pelican, a grey trapezoidal pier extends from the sand toward a horizontal band of deep blue ocean water detailed with light cyan wavy lines. A massive, solid yellow-orange semi-circle sun sits on the horizon line, setting directly behind the bicycle frame. The background sky is a smooth vertical gradient transitioning from soft pink at the top to warm golden-yellow at the horizon, decorated with stylized pale peach fluffy clouds, thin white horizontal wind streaks, twinkling four-pointed white stars, and small brown v-shaped silhouettes of distant flying birds." src="https://static.simonwillison.net/static/2026/gemini-3-deep-think-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(And since it's an FAQ, here's my answer to &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;What happens if AI labs train for pelicans riding bicycles?&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Since it did so well on my basic &lt;code&gt;Generate an SVG of a pelican riding a bicycle&lt;/code&gt; I decided to try the &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;more challenging version&lt;/a&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/154c0cc7b4daed579f6a5e616250ecc8"&gt;what I got&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Also described by Gemini 3 Deep Think: A highly detailed, vibrant, and stylized vector illustration of a whimsical bird resembling a mix between a pelican and a frigatebird enthusiastically riding a bright cyan bicycle from left to right across a flat tan and brown surface. The bird leans horizontally over the frame in an aerodynamic racing posture, with thin, dark brown wing-like arms reaching forward to grip the silver handlebars and a single thick brown leg, patterned with white V-shapes, stretching down to press on a black pedal. The bird's most prominent and striking feature is an enormous, vividly bright red, inflated throat pouch hanging beneath a long, straight grey upper beak that ends in a small orange hook. Its head is mostly white with a small pink patch surrounding the eye, a dark brown stripe running down the back of its neck, and a distinctive curly pale yellow crest on the very top. The bird's round, dark brown body shares the same repeating white V-shaped feather pattern as its leg and is accented by a folded wing resting on its side, made up of cleanly layered light blue and grey feathers. A tail composed of four stiff, straight dark brown feathers extends directly backward. Thin white horizontal speed lines trail behind the back wheel and the bird's tail, emphasizing swift forward motion. The bicycle features a classic diamond frame, large wheels with thin black tires, grey rims, and detailed silver spokes, along with a clearly visible front chainring, silver chain, and rear cog. The whimsical scene is set against a clear light blue sky featuring two small, fluffy white clouds on the left and a large, pale yellow sun in the upper right corner that radiates soft, concentric, semi-transparent pastel green and yellow halos. A solid, darker brown shadow is cast directly beneath the bicycle's wheels on the minimalist two-toned brown ground." src="https://static.simonwillison.net/static/2026/gemini-3-deep-think-complex-pelican.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46991240"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm-reasoning"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/></entry><entry><title>GLM-5: From Vibe Coding to Agentic Engineering</title><link href="https://simonwillison.net/2026/Feb/11/glm-5/#atom-tag" rel="alternate"/><published>2026-02-11T18:56:14+00:00</published><updated>2026-02-11T18:56:14+00:00</updated><id>https://simonwillison.net/2026/Feb/11/glm-5/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://z.ai/blog/glm-5"&gt;GLM-5: From Vibe Coding to Agentic Engineering&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a &lt;em&gt;huge&lt;/em&gt; new MIT-licensed model: 744B parameters and &lt;a href="https://huggingface.co/zai-org/GLM-5"&gt;1.51TB on Hugging Face&lt;/a&gt; - twice the size of &lt;a href="https://huggingface.co/zai-org/GLM-4.7"&gt;GLM-4.7&lt;/a&gt;, which was 368B and 717GB (4.5 and 4.6 were around that size too).&lt;/p&gt;
&lt;p&gt;It's interesting to see Z.ai take a position on what we should call professional software engineers building with LLMs - I've seen &lt;strong&gt;Agentic Engineering&lt;/strong&gt; show up in a few other places recently, most notably &lt;a href="https://twitter.com/karpathy/status/2019137879310836075"&gt;from Andrej Karpathy&lt;/a&gt; and &lt;a href="https://addyosmani.com/blog/agentic-engineering/"&gt;Addy Osmani&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran my "Generate an SVG of a pelican riding a bicycle" prompt through GLM-5 via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; and got back &lt;a href="https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f0725d"&gt;a very good pelican on a disappointing bicycle frame&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is good and has a well defined beak. The bicycle frame is a wonky red triangle. Nice sun and motion lines." src="https://static.simonwillison.net/static/2026/glm-5-pelican.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46977210"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="agentic-engineering"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="llm-release"/><category term="vibe-coding"/><category term="ai-assisted-programming"/><category term="generative-ai"/><category term="definitions"/><category term="openrouter"/><category term="glm"/></entry><entry><title>Opus 4.6 and Codex 5.3</title><link href="https://simonwillison.net/2026/Feb/5/two-new-models/#atom-tag" rel="alternate"/><published>2026-02-05T20:29:20+00:00</published><updated>2026-02-05T20:29:20+00:00</updated><id>https://simonwillison.net/2026/Feb/5/two-new-models/#atom-tag</id><summary type="html">
    &lt;p&gt;Two major new model releases today, within about 15 minutes of each other.&lt;/p&gt;
&lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-opus-4-6"&gt;released Opus 4.6&lt;/a&gt;. Here's &lt;a href="https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe216"&gt;its pelican&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers." src="https://static.simonwillison.net/static/2026/opus-4.6-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;OpenAI &lt;a href="https://openai.com/index/introducing-gpt-5-3-codex/"&gt;released GPT-5.3-Codex&lt;/a&gt;, albeit only via their Codex app, not yet in their API. Here's &lt;a href="https://gist.github.com/simonw/bfc4a83f588ac762c773679c0d1e034b"&gt;its pelican&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Not nearly as good - the bicycle is a bit mangled, the pelican not nearly as well rendered - it's more of a line drawing." src="https://static.simonwillison.net/static/2026/codex-5.3-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I've had a bit of preview access to both of these models and to be honest I'm struggling to find a good angle to write about them - they're both &lt;em&gt;really good&lt;/em&gt;, but so were their predecessors Codex 5.2 and Opus 4.5. I've been having trouble finding tasks that those previous models couldn't handle but that the new ones ace.&lt;/p&gt;
&lt;p&gt;The most convincing story about capabilities of the new model so far is Nicholas Carlini from Anthropic talking about Opus 4.6 and &lt;a href="https://www.anthropic.com/engineering/building-c-compiler"&gt;Building a C compiler with a team of parallel Claudes&lt;/a&gt; - Anthropic's version of Cursor's &lt;a href="https://simonwillison.net/2026/Jan/23/fastrender/"&gt;FastRender project&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parallel-agents"&gt;parallel-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="anthropic"/><category term="generative-ai"/><category term="openai"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="parallel-agents"/><category term="c"/><category term="nicholas-carlini"/></entry><entry><title>Kimi K2.5: Visual Agentic Intelligence</title><link href="https://simonwillison.net/2026/Jan/27/kimi-k25/#atom-tag" rel="alternate"/><published>2026-01-27T15:07:41+00:00</published><updated>2026-01-27T15:07:41+00:00</updated><id>https://simonwillison.net/2026/Jan/27/kimi-k25/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.kimi.com/blog/kimi-k2-5.html"&gt;Kimi K2.5: Visual Agentic Intelligence&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Kimi K2 landed &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;in July&lt;/a&gt; as a 1 trillion parameter open weight LLM. It was joined by Kimi K2 Thinking &lt;a href="https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/"&gt;in November&lt;/a&gt;, which added reasoning capabilities. Now they've made it multi-modal: the K2 models were text-only, but the new 2.5 can handle image inputs as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Kimi K2.5 builds on Kimi K2 with continued pretraining over approximately 15T mixed visual and text tokens. Built as a native multimodal model, K2.5 delivers state-of-the-art coding and vision capabilities and a self-directed agent swarm paradigm.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The "self-directed agent swarm paradigm" claim there means improved long-sequence tool calling and training on how to break down tasks for multiple agents to work on at once:&lt;/p&gt;
&lt;blockquote id="complex-tasks"&gt;&lt;p&gt;For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls. Compared with a single-agent setup, this reduces execution time by up to 4.5x. The agent swarm is automatically created and orchestrated by Kimi K2.5 without any predefined subagents or workflow.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I used the &lt;a href="https://openrouter.ai/moonshotai/kimi-k2.5"&gt;OpenRouter Chat UI&lt;/a&gt; to have it "Generate an SVG of a pelican riding a bicycle", and it did &lt;a href="https://gist.github.com/simonw/32a85e337fbc6ee935d10d89726c0476"&gt;quite well&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cartoon illustration of a white pelican with a large orange beak and yellow throat pouch riding a green bicycle with yellow feet on the pedals, set against a light blue sky with soft bokeh circles and a green grassy hill. The bicycle frame is a little questionable. The pelican is quite good. The feet do not quite align with the pedals, which are floating clear of the frame." src="https://static.simonwillison.net/static/2026/kimi-k2.5-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;As a more interesting test, I decided to exercise the claims around multi-agent planning with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want to build a Datasette plugin that offers a UI to upload files to an S3 bucket and stores information about them in a SQLite table. Break this down into ten tasks suitable for execution by parallel coding agents.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/ee2583b2eb5706400a4737f56d57c456"&gt;the full response&lt;/a&gt;. It produced ten realistic tasks and reasoned through the dependencies between them. For comparison here's the same prompt &lt;a href="https://claude.ai/share/df9258e7-97ba-4362-83da-76d31d96196f"&gt;against Claude Opus 4.5&lt;/a&gt; and &lt;a href="https://chatgpt.com/share/6978d48c-3f20-8006-9c77-81161f899104"&gt;against GPT-5.2 Thinking&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/moonshotai/Kimi-K2.5"&gt;Hugging Face repository&lt;/a&gt; is 595GB. The model uses Kimi's janky "modified MIT" license, which adds the following clause:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given the model's size, I expect one way to run it locally would be with MLX and a pair of $10,000 512GB RAM M3 Ultra Mac Studios. That setup has &lt;a href="https://twitter.com/awnihannun/status/1943723599971443134"&gt;been demonstrated to work&lt;/a&gt; with previous trillion parameter K2 models.&lt;/p&gt;
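&lt;p&gt;On a single machine with enough memory a heavily quantized build might also work via &lt;a href="https://github.com/ml-explore/mlx-lm"&gt;mlx-lm&lt;/a&gt; - a sketch, assuming a community MLX conversion shows up (the model name here is hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install mlx-lm
# Hypothetical conversion - no official MLX build of K2.5 confirmed yet
mlx_lm.generate --model mlx-community/Kimi-K2.5-4bit \
  --prompt "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;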

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46775961"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parallel-agents"&gt;parallel-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="ai-agents"/><category term="llm-tool-use"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="moonshot"/><category term="kimi"/><category term="parallel-agents"/><category term="hugging-face"/><category term="janky-licenses"/><category term="llm-release"/></entry><entry><title>Introducing GPT-5.2-Codex</title><link href="https://simonwillison.net/2025/Dec/19/introducing-gpt-52-codex/#atom-tag" rel="alternate"/><published>2025-12-19T05:21:17+00:00</published><updated>2025-12-19T05:21:17+00:00</updated><id>https://simonwillison.net/2025/Dec/19/introducing-gpt-52-codex/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-2-codex/"&gt;Introducing GPT-5.2-Codex&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The latest in OpenAI's &lt;a href="https://simonwillison.net/tags/gpt-codex/"&gt;Codex family of models&lt;/a&gt; (not the same thing as their Codex CLI or Codex Cloud coding agent tools).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.2-Codex is a version of &lt;a href="https://openai.com/index/introducing-gpt-5-2/"&gt;GPT‑5.2⁠&lt;/a&gt; further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As with some previous Codex models this one is available via their Codex coding agents now and will be coming to the API "in the coming weeks". Unlike previous models, there's a new invite-only preview process giving vetted cybersecurity professionals access to "more permissive models".&lt;/p&gt;
&lt;p&gt;I've been very impressed recently with GPT-5.2's ability to &lt;a href="https://simonwillison.net/2025/Dec/15/porting-justhtml/"&gt;tackle multi-hour agentic coding challenges&lt;/a&gt;. 5.2 Codex scores 64% on the Terminal-Bench 2.0 benchmark, compared to 62.2% for GPT-5.2. I'm not sure how meaningful that 1.8% improvement will be!&lt;/p&gt;
&lt;p&gt;I didn't hack API access together this time (see &lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/"&gt;previous attempts&lt;/a&gt;), instead opting to just ask Codex CLI to "Generate an SVG of a pelican riding a bicycle" while running the new model (effort medium). &lt;a href="https://tools.simonwillison.net/codex-timeline?url=https://gist.githubusercontent.com/simonw/10ad81e82889a97a7d28827e0ea6d768/raw/d749473b37d86d519b4c3fa0892b5e54b5941b38/rollout-2025-12-18T16-09-10-019b33f0-6111-7840-89b0-aedf755a6e10.jsonl#tz=local&amp;amp;q=&amp;amp;type=all&amp;amp;payload=all&amp;amp;role=all&amp;amp;hide=1&amp;amp;truncate=1&amp;amp;sel=3"&gt;Here's the transcript&lt;/a&gt; in my new Codex CLI timeline viewer, and here's the pelican it drew:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Alt text by GPT-5.2-Codex: A minimalist illustration of a white pelican with a large orange beak riding a teal bicycle across a sandy strip of ground. The pelican leans forward as if pedaling, its wings tucked back and legs reaching toward the pedals. Simple gray motion lines trail behind it, and a pale yellow sun sits in the top‑right against a warm beige sky." src="https://static.simonwillison.net/static/2025/5.2-codex-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="gpt-codex"/><category term="openai"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="codex-cli"/><category term="generative-ai"/></entry><entry><title>Gemini 3 Flash</title><link href="https://simonwillison.net/2025/Dec/17/gemini-3-flash/#atom-tag" rel="alternate"/><published>2025-12-17T22:44:52+00:00</published><updated>2025-12-17T22:44:52+00:00</updated><id>https://simonwillison.net/2025/Dec/17/gemini-3-flash/#atom-tag</id><summary type="html">
    &lt;p&gt;It continues to be a busy December, if not quite as busy &lt;a href="https://simonwillison.net/2024/Dec/20/december-in-llms-has-been-a-lot/"&gt;as last year&lt;/a&gt;. Today's big news is &lt;a href="https://blog.google/technology/developers/build-with-gemini-3-flash/"&gt;Gemini 3 Flash&lt;/a&gt;, the latest in Google's "Flash" line of faster and less expensive models.&lt;/p&gt;
&lt;p&gt;Google are emphasizing the comparison between the new Flash and their previous generation's top model Gemini 2.5 Pro:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini 3 Flash's characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF, outputs only text, handles 1,048,576 maximum input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series).&lt;/p&gt;
&lt;p&gt;The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro &amp;gt;200k, and it's nice not to have a price increase for the new Flash at larger token lengths.&lt;/p&gt;
&lt;p&gt;It's a little &lt;em&gt;more&lt;/em&gt; expensive than previous Flash models: Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million on output, while Gemini 3 Flash is $0.50/million and $3/million respectively.&lt;/p&gt;
&lt;p&gt;Google &lt;a href="https://blog.google/products/gemini/gemini-3-flash/"&gt;claim&lt;/a&gt; it may still end up cheaper though, due to more efficient output token usage:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://www.llm-prices.com/#it=100000&amp;amp;ot=10000&amp;amp;sel=gemini-3-flash-preview%2Cgemini-3-pro-preview%2Cgemini-3-pro-preview-200k%2Cgpt-5.2%2Cclaude-opus-4-5%2Cclaude-sonnet-4.5%2Cclaude-4.5-haiku%2Cgemini-2.5-flash%2Cgpt-5-mini"&gt;a more extensive price comparison&lt;/a&gt; on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site.&lt;/p&gt;
&lt;h4 id="generating-some-svgs-of-pelicans"&gt;Generating some SVGs of pelicans&lt;/h4&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.28"&gt;llm-gemini 0.28&lt;/a&gt; this morning with support for the new model. You can try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;According to &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3#thinking_level"&gt;the developer docs&lt;/a&gt; the new model supports four different thinking level options: &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, and &lt;code&gt;high&lt;/code&gt;. This is different from Gemini 3 Pro, which only supported &lt;code&gt;low&lt;/code&gt; and &lt;code&gt;high&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can run those like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
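&lt;p&gt;To compare all four levels in one go you could wrap that in a loop - a rough sketch, which assumes the response comes back as bare SVG (you may need to strip a Markdown fence from the output first):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for level in minimal low medium high; do
  llm -m gemini-3-flash-preview --thinking-level $level \
    "Generate an SVG of a pelican riding a bicycle" &gt; pelican-$level.svg
done
&lt;/code&gt;&lt;/pre&gt;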
&lt;p&gt;Here are four pelicans, for thinking levels &lt;a href="https://gist.github.com/simonw/8047c805a4a1df7fd4e854b18e7482d9"&gt;minimal&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/fb61686a1f915e3777b4a40e2df41068"&gt;low&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/190c3ce82cd8976827139bbc4dcc2d19"&gt;medium&lt;/a&gt;, and &lt;a href="https://gist.github.com/simonw/da66ffce135359161996e41e50e32ec3"&gt;high&lt;/a&gt;:&lt;/p&gt;
&lt;image-gallery width="4"&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg" alt="A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg" alt="Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg" alt="A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg" alt="Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background." /&gt;
&lt;/image-gallery&gt;
&lt;h4 id="i-built-the-gallery-component-with-gemini-3-flash"&gt;I built the gallery component with Gemini 3 Flash&lt;/h4&gt;
&lt;p&gt;The gallery above uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;image-gallery&lt;/span&gt; &lt;span class="pl-c1"&gt;width&lt;/span&gt;="&lt;span class="pl-s"&gt;4&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;image-gallery&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-3-flash-preview --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;You write alt text for any image pasted in by the user. Alt text is always presented in a&lt;/span&gt;
&lt;span class="pl-s"&gt;fenced code block to make it easy to copy and paste out. It is always presented on a single&lt;/span&gt;
&lt;span class="pl-s"&gt;line so it can be used easily in Markdown images. All text on the image (for screenshots etc)&lt;/span&gt;
&lt;span class="pl-s"&gt;must be exactly included. A short note describing the nature of the image itself should go first.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
-a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg&lt;/pre&gt;&lt;/div&gt;
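&lt;p&gt;A recipe like this can be saved as a reusable template so the system prompt doesn't have to be pasted in each time - &lt;code&gt;alt-text&lt;/code&gt; here is an arbitrary template name:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini-3-flash-preview --system 'You write alt text ...' --save alt-text
llm -t alt-text -a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg
&lt;/code&gt;&lt;/pre&gt;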
&lt;p&gt;You can see the code that powers the image gallery Web Component &lt;a href="https://github.com/simonw/simonwillisonblog/blob/31651b3a527011d1c971d4256c1c9f61ef378d23/static/image-gallery.js"&gt;here on GitHub&lt;/a&gt;. I built it by prompting Gemini 3 Flash via &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-3-flash-preview &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Build a Web Component that implements a simple image gallery. Usage is like this:&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;image-gallery width="5"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image1.jpg" alt="Image 1"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image2.jpg" alt="Image 2" data-thumb="image2-thumb.jpg"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image3.jpg" alt="Image 3"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;/image-gallery&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;If an image has a data-thumb= attribute that one is used instead, other images are scaled down. &lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;The image gallery always takes up 100% of available width. The width="5" attribute means that five images will be shown next to each other in each row. The default is 3. There are gaps between the images. When an image is clicked it opens a modal dialog with the full size image.&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Return a complete HTML file with both the implementation of the Web Component several example uses of it. Use https://picsum.photos/300/200 URLs for those example images.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It took a few follow-up prompts using &lt;code&gt;llm -c&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Use a real modal such that keyboard shortcuts and accessibility features work without extra JS&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Use X for the close icon and make it a bit more subtle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;remove the hover effect entirely&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;I want no border on the close icon even when it is focused&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/09f63a49f29620d4cbbfd383cfee1db3"&gt;the full transcript&lt;/a&gt;, exported using &lt;code&gt;llm logs -cue&lt;/code&gt;.&lt;/p&gt;
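&lt;p&gt;Those stacked flags combine &lt;code&gt;-c&lt;/code&gt; (most recent conversation), &lt;code&gt;-u&lt;/code&gt; (include token usage) and &lt;code&gt;-e&lt;/code&gt; (expand fragment references) - equivalent to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm logs --current --usage --expand &gt; transcript.md
&lt;/code&gt;&lt;/pre&gt;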
&lt;p&gt;Those five prompts used the following token counts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;225 input, 3,269 output&lt;/li&gt;
&lt;li&gt;2,243 input, 2,908 output&lt;/li&gt;
&lt;li&gt;4,319 input, 2,516 output&lt;/li&gt;
&lt;li&gt;6,376 input, 2,094 output&lt;/li&gt;
&lt;li&gt;8,151 input, 1,806 output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Added together that's 21,314 input and 12,593 output for a grand total &lt;a href="https://www.llm-prices.com/#it=21314&amp;amp;ot=12593&amp;amp;sel=gemini-3-flash-preview"&gt;of 4.8436 cents&lt;/a&gt;.&lt;/p&gt;
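&lt;p&gt;That number checks out against the new $0.50/million input and $3/million output pricing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'print(21314 * 0.50 / 1e6 + 12593 * 3 / 1e6)'
# 0.048436 dollars = 4.8436 cents
&lt;/code&gt;&lt;/pre&gt;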
&lt;p&gt;The guide to &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3#migrating_from_gemini_25"&gt;migrating from Gemini 2.5&lt;/a&gt; reveals one disappointment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Image segmentation:&lt;/strong&gt; Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or &lt;a href="https://ai.google.dev/gemini-api/docs/robotics-overview"&gt;Gemini Robotics-ER 1.5&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/"&gt;wrote about this capability in Gemini 2.5&lt;/a&gt; back in April. I hope it comes back in future models - it's a really neat capability that is unique to Gemini.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-components"&gt;web-components&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="web-components"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>GPT-5.2</title><link href="https://simonwillison.net/2025/Dec/11/gpt-52/#atom-tag" rel="alternate"/><published>2025-12-11T23:58:04+00:00</published><updated>2025-12-11T23:58:04+00:00</updated><id>https://simonwillison.net/2025/Dec/11/gpt-52/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI reportedly &lt;a href="https://www.wsj.com/tech/ai/openais-altman-declares-code-red-to-improve-chatgpt-as-google-threatens-ai-lead-7faf5ea6"&gt;declared a "code red"&lt;/a&gt; on the 1st of December in response to increasingly credible competition from the likes of Google's Gemini 3. It's less than two weeks later and they just &lt;a href="https://openai.com/index/introducing-gpt-5-2/"&gt;announced GPT-5.2&lt;/a&gt;, calling it "the most capable model series yet for professional knowledge work".&lt;/p&gt;
&lt;h4 id="key-characteristics-of-gpt-5-2"&gt;Key characteristics of GPT-5.2&lt;/h4&gt;
&lt;p&gt;The new model comes in two variants: GPT-5.2 and GPT-5.2 Pro. There's no Mini variant yet.&lt;/p&gt;
&lt;p&gt;GPT-5.2 is available via their UI in both "instant" and "thinking" modes, presumably still corresponding to the API concept of different reasoning effort levels.&lt;/p&gt;
&lt;p&gt;The knowledge cut-off date for both variants is now &lt;strong&gt;August 31st 2025&lt;/strong&gt;. This is significant - GPT-5.1 and GPT-5 were both September 30th 2024, and GPT-5 mini was May 31st 2024.&lt;/p&gt;
&lt;p&gt;Both of the 5.2 models have a 400,000 token context window and 128,000 max output tokens - no different from 5.1 or 5.&lt;/p&gt;
&lt;p&gt;Pricing-wise 5.2 is a rare &lt;em&gt;increase&lt;/em&gt; - it's 1.4x the cost of GPT-5.1, at $1.75/million input and $14/million output. GPT-5.2 Pro is $21.00/million input and a hefty $168.00/million output, putting it &lt;a href="https://www.llm-prices.com/#sel=gpt-4.5%2Co1-pro%2Cgpt-5.2-pro"&gt;up there&lt;/a&gt; with their previous most expensive models, o1 Pro and GPT-4.5.&lt;/p&gt;
&lt;p&gt;So far the main benchmark results we have are self-reported by OpenAI. The most interesting ones are a 70.9% score on their GDPval "Knowledge work tasks" benchmark (GPT-5 got 38.8%) and a 52.9% on ARC-AGI-2 (up from 17.6% for GPT-5.1 Thinking).&lt;/p&gt;
&lt;p&gt;The ARC Prize Twitter account provided &lt;a href="https://x.com/arcprize/status/1999182732845547795"&gt;this interesting note&lt;/a&gt; on the efficiency gains for GPT-5.2 Pro:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A year ago, we verified a preview of an unreleased version of @OpenAI
o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task&lt;/p&gt;
&lt;p&gt;Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task&lt;/p&gt;
&lt;p&gt;This represents a ~390X efficiency improvement in one year&lt;/p&gt;
&lt;/blockquote&gt;
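&lt;p&gt;That ~390X figure appears to be the straight cost-per-task ratio:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'print(4500 / 11.64)'
# 386.597... which rounds to roughly 390
&lt;/code&gt;&lt;/pre&gt;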
&lt;p&gt;GPT-5.2 can be accessed in OpenAI's Codex CLI tool like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;codex -m gpt-5.2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are three new API models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.2"&gt;gpt-5.2&lt;/a&gt; - I think this is what you get if you select "GPT-5.2 Thinking" in ChatGPT but &lt;a href="https://twitter.com/simonw/status/1999603339382976785"&gt;I'm a little confused&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platform.openai.com/docs/models/gpt-5.2-chat-latest"&gt;gpt-5.2-chat-latest&lt;/a&gt; - the model used by ChatGPT for "GPT-5.2 Instant" mode. It's priced the same as GPT-5.2 but has a reduced 128,000 context window with 16,384 max output tokens.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.2-pro"&gt;gpt-5.2-pro&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OpenAI have published a new &lt;a href="https://cookbook.openai.com/examples/gpt-5/gpt-5-2_prompting_guide"&gt;GPT-5.2 Prompting Guide&lt;/a&gt;. An interesting note from that document is that compaction can now be run with &lt;a href="https://platform.openai.com/docs/api-reference/responses/compact"&gt;a new dedicated server-side API&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For long-running, tool-heavy workflows that exceed the standard context window, GPT-5.2 with Reasoning supports response compaction via the &lt;code&gt;/responses/compact&lt;/code&gt; endpoint. Compaction performs a loss-aware compression pass over prior conversation state, returning encrypted, opaque items that preserve task-relevant information while dramatically reducing token footprint. This allows the model to continue reasoning across extended workflows without hitting context limits.&lt;/p&gt;
&lt;/blockquote&gt;
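&lt;p&gt;A hypothetical &lt;code&gt;curl&lt;/code&gt; sketch of what a call to that endpoint might look like - the request fields here are guesses, not OpenAI's documented schema:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Field names below are assumptions - check the API reference for the real schema
curl https://api.openai.com/v1/responses/compact \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.2", "previous_response_id": "resp_abc123"}'
&lt;/code&gt;&lt;/pre&gt;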

&lt;h4 id="it-s-better-at-vision"&gt;It's better at vision&lt;/h4&gt;
&lt;p&gt;One note from the announcement that caught my eye:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.2 Thinking is our strongest vision model yet, cutting error rates roughly in half on chart reasoning and software interface understanding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had &lt;a href="https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-coding/"&gt;disappointing results from GPT-5&lt;/a&gt; on an OCR task a while ago. I tried it against GPT-5.2 and it did &lt;em&gt;much&lt;/em&gt; better:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gpt-5.2 ocr -a https://static.simonwillison.net/static/2025/ft.jpeg&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b4a13f1e424e58b8b0aca72ae2c3cb00"&gt;the result&lt;/a&gt; from that, which cost 1,520 input and 1,022 for a total of &lt;a href="https://www.llm-prices.com/#it=1520&amp;amp;ot=1022&amp;amp;sel=gpt-5.2"&gt;1.6968 cents&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="rendering-some-pelicans"&gt;Rendering some pelicans&lt;/h4&gt;
&lt;p&gt;For my classic "Generate an SVG of a pelican riding a bicycle" test:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gpt-5.2 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-2.5-pelican.png" alt="Described by GPT-5.2: Cartoon-style illustration: A white, duck-like bird with a small black eye, oversized orange beak (with a pale blue highlight along the lower edge), and a pink neckerchief rides a blue-framed bicycle in side view; the bike has two large black wheels with gray spokes, a blue front fork, visible black crank/pedal area, and thin black handlebar lines, with gray motion streaks and a soft gray shadow under the bike on a light-gray road; background is a pale blue sky with a simple yellow sun at upper left and two rounded white clouds (one near upper center-left and one near upper right)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And for the more advanced alternative test, which tests instruction following in a little more depth:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gpt-5.2 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle&lt;/span&gt;
&lt;span class="pl-s"&gt;must have spokes and a correctly shaped bicycle frame. The pelican must have its&lt;/span&gt;
&lt;span class="pl-s"&gt;characteristic large pouch, and there should be a clear indication of feathers.&lt;/span&gt;
&lt;span class="pl-s"&gt;The pelican must be clearly pedaling the bicycle. The image should show the full&lt;/span&gt;
&lt;span class="pl-s"&gt;breeding plumage of the California brown pelican.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5.2-p2.png" alt="Digital illustration on a light gray/white background with a thin horizontal baseline: a stylized California brown pelican in breeding plumage is drawn side-on, leaning forward and pedaling a bicycle; the pelican has a dark brown body with layered wing lines, a pale cream head with a darker brown cap and neck shading, a small black eye, and an oversized long golden-yellow bill extending far past the front wheel; one brown leg reaches down to a pedal while the other is tucked back; the bike is shown in profile with two large spoked wheels (black tires, white rims), a dark frame, crank and chainring near the rear wheel, a black saddle above the rear, and the front fork aligned under the pelican’s head; text at the top reads &amp;quot;California brown pelican (breeding plumage) pedaling a bicycle&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 14th December 2025&lt;/strong&gt;: I used GPT-5.2 running in Codex CLI to &lt;a href="https://simonwillison.net/2025/Dec/15/porting-justhtml/"&gt;port a complex Python library to JavaScript&lt;/a&gt;. It ran without interference for nearly four hours and completed a complex task exactly to my specification.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="gpt-5"/></entry><entry><title>Devstral 2</title><link href="https://simonwillison.net/2025/Dec/9/devstral-2/#atom-tag" rel="alternate"/><published>2025-12-09T23:58:27+00:00</published><updated>2025-12-09T23:58:27+00:00</updated><id>https://simonwillison.net/2025/Dec/9/devstral-2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/devstral-2-vibe-cli"&gt;Devstral 2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Two new models from Mistral today: Devstral 2 and Devstral Small 2 - both focused on powering coding agents such as Mistral's newly released Mistral Vibe which &lt;a href="https://simonwillison.net/2025/Dec/9/mistral-vibe/"&gt;I wrote about earlier today&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Devstral 2: SOTA open model for code agents with a fraction of the parameters of its competitors and achieving 72.2% on SWE-bench Verified.&lt;/li&gt;
&lt;li&gt;Up to 7x more cost-efficient than Claude Sonnet at real-world tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Devstral 2 is a 123B model released under a janky license - it's "modified MIT" where &lt;a href="https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512/blob/main/LICENSE"&gt;the modification&lt;/a&gt; is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company (or that of your employer) exceeds $20 million (or its equivalent in another currency) for the preceding month. This restriction in (b) applies to the Model and any derivatives, modifications, or combined works based on it, whether provided by Mistral AI or by a third party. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Devstral Small 2 is under a proper Apache 2 license with no weird strings attached. It's a 24B model which is &lt;a href="https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512"&gt;51.6GB on Hugging Face&lt;/a&gt; and should quantize to significantly less.&lt;/p&gt;
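&lt;p&gt;That 51.6GB is roughly consistent with 16-bit weights, and a back-of-envelope estimate (ignoring per-layer overhead) puts a 4-bit quantization at around 12GB:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'print(24e9 * 2 / 1e9, 24e9 * 0.5 / 1e9)'
# 48.0 12.0 - approximate GB at 16 bits vs 4 bits for a nominal 24B parameters
&lt;/code&gt;&lt;/pre&gt;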
&lt;p&gt;I tried out the larger model via &lt;a href="https://github.com/simonw/llm-mistral"&gt;my llm-mistral plugin&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mistral
llm mistral refresh
llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Bicycle looks a bit like a cybertruck" src="https://static.simonwillison.net/static/2025/devstral-2.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;For a ~120B model that one is pretty good!&lt;/p&gt;
&lt;p&gt;Here's the same prompt with &lt;code&gt;-m mistral/labs-devstral-small-2512&lt;/code&gt; for the API hosted version of Devstral Small 2:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A small white pelican on what looks more like a child's cart." src="https://static.simonwillison.net/static/2025/devstral-small-2.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Again, a decent result given the small parameter size. For comparison, &lt;a href="https://simonwillison.net/2025/Jun/20/mistral-small-32/"&gt;here's what I got&lt;/a&gt; for the 24B Mistral Small 3.2 earlier this year.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="mistral"/><category term="generative-ai"/><category term="ai"/><category term="janky-licenses"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>Introducing Mistral 3</title><link href="https://simonwillison.net/2025/Dec/2/introducing-mistral-3/#atom-tag" rel="alternate"/><published>2025-12-02T17:30:57+00:00</published><updated>2025-12-02T17:30:57+00:00</updated><id>https://simonwillison.net/2025/Dec/2/introducing-mistral-3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/mistral-3"&gt;Introducing Mistral 3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Four new models from Mistral today: three in their "Ministral" smaller model series (14B, 8B, and 3B) and a new Mistral Large 3 MoE model with 675B parameters, 41B active.&lt;/p&gt;
&lt;p&gt;All of the models are vision capable, and they are all released under an Apache 2 license.&lt;/p&gt;
&lt;p&gt;I'm particularly excited about the 3B model, which appears to be a competent vision-capable model in a tiny ~3GB file.&lt;/p&gt;
&lt;p&gt;Xenova from Hugging Face &lt;a href="https://x.com/xenovacom/status/1995879338583945635"&gt;got it working in a browser&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;@MistralAI releases Mistral 3, a family of multimodal models, including three state-of-the-art dense models (3B, 8B, and 14B) and Mistral Large 3 (675B, 41B active). All Apache 2.0! 🤗&lt;/p&gt;
&lt;p&gt;Surprisingly, the 3B is small enough to run 100% locally in your browser on WebGPU! 🤯&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can &lt;a href="https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU"&gt;try that demo in your browser&lt;/a&gt;, which will fetch 3GB of model and then stream from your webcam and let you run text prompts against what the model is seeing, entirely locally.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a man with glasses holding a red cube-shaped object up to the camera in a live computer vision interface; top left label reads “LIVE FEED”; top right slider label reads “INPUT SIZE: 480PX”; lower left panel titled “PROMPT LIBRARY” with prompts “Describe what you see in one sentence.” “What is the color of my shirt?” “Identify any text or written content visible.” “What emotions or actions are being portrayed?” “Name the object I am holding in my hand.”; below that a field labeled “PROMPT” containing the text “write a haiku about this”; lower right panel titled “OUTPUT STREAM” with buttons “VIEW HISTORY” and “LIVE INFERENCE” and generated text “Red cube held tight, Fingers frame the light’s soft glow– Mystery shines bright.”; a small status bar at the bottom shows “ttft: 4188ms  tokens/sec: 5.09” and “ctx: 3.3B-Instruct”." src="https://static.simonwillison.net/static/2025/3b-webcam.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Mistral's API hosted versions of the new models are supported by my &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral plugin&lt;/a&gt; already thanks to the &lt;code&gt;llm mistral refresh&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ llm mistral refresh
Added models: ministral-3b-2512, ministral-14b-latest, mistral-large-2512, ministral-14b-2512, ministral-8b-2512
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/0df5e656291d5a7a1bf012fabc9edc3f"&gt;tried pelicans against all of the models&lt;/a&gt;. Here's the best one, from Mistral Large 3:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Nice cloud. Pelican isn't great, the beak is missing the pouch. It's floating above the bicycle which has two wheels and an incorrect frame." src="https://static.simonwillison.net/static/2025/mistral-large-3.png" /&gt;&lt;/p&gt;
&lt;p&gt;And the worst from Ministral 3B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A black sky. A brown floor. A set of abstract brown and grey shapes float, menacingly." src="https://static.simonwillison.net/static/2025/ministral-3b.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="llm-release"/><category term="mistral"/><category term="llm"/><category term="generative-ai"/><category term="ai"/><category term="llms"/></entry><entry><title>DeepSeek-V3.2</title><link href="https://simonwillison.net/2025/Dec/1/deepseek-v32/#atom-tag" rel="alternate"/><published>2025-12-01T23:56:19+00:00</published><updated>2025-12-01T23:56:19+00:00</updated><id>https://simonwillison.net/2025/Dec/1/deepseek-v32/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://api-docs.deepseek.com/news/news251201"&gt;DeepSeek-V3.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Two new open weight (MIT licensed) models from DeepSeek today: &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2"&gt;DeepSeek-V3.2&lt;/a&gt; and &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale"&gt;DeepSeek-V3.2-Speciale&lt;/a&gt;, both 690GB, 685B parameters. Here's the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/main/assets/paper.pdf"&gt;PDF tech report&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DeepSeek-V3.2 is DeepSeek's new flagship model, now running on &lt;a href="https://chat.deepseek.com"&gt;chat.deepseek.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The difference between the two new models is best explained by this paragraph from the technical report:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeek-V3.2 integrates reasoning, agent, and human alignment data distilled from specialists, undergoing thousands of steps of continued RL training to reach the final checkpoints. To investigate the potential of extended thinking, we also developed an experimental variant, DeepSeek-V3.2-Speciale. This model was trained exclusively on reasoning data with a reduced length penalty during RL. Additionally, we incorporated the dataset and reward method from DeepSeekMath-V2 (Shao et al., 2025) to enhance capabilities in mathematical proofs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I covered &lt;a href="https://simonwillison.net/2025/Nov/27/deepseek-math-v2/"&gt;DeepSeek-Math-V2 last week&lt;/a&gt;. Like that model, DeepSeek-V3.2-Speciale also scores gold on the 2025 International Mathematical Olympiad so beloved of model training teams!&lt;/p&gt;
&lt;p id="pelicans"&gt;I tried both models on "Generate an SVG of a pelican riding a bicycle" using the chat feature of [OpenRouter](https://openrouter.ai/). DeepSeek V3.2 produced this very short reasoning chain:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's assume the following:&lt;/p&gt;
&lt;p&gt;Wheel radius: 40&lt;br&gt;
Distance between wheel centers: 180&lt;br&gt;
Seat height: 60 (above the rear wheel center)&lt;br&gt;
Handlebars: above the front wheel, extending back and up.&lt;/p&gt;
&lt;p&gt;We'll set the origin at the center of the rear wheel.&lt;/p&gt;
&lt;p&gt;We'll create the SVG with a viewBox that fits the entire drawing.&lt;/p&gt;
&lt;p&gt;Let's start by setting up the SVG.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Followed by this illustration:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pleasing gradents for the sky and ground and sun. Neat three-circle clouds. A Pelican on a Bicycle title printed on the image. The pelican is cute but stlightly detached from the bicycle. The bicycle has a somewhat mangled brown frame." src="https://static.simonwillison.net/static/2025/deepseek-v32.png" /&gt;&lt;/p&gt;
&lt;p&gt;Here's what I got from the Speciale model, which thought deeply about the geometry of bicycles and pelicans for &lt;a href="https://gist.githubusercontent.com/simonw/3debaf0df67c2d99a36f41f21ffe534c/raw/fbbb60c6d5b6f02d539ade5105b990490a81a86d/svg.txt"&gt;a very long time (at least 10 minutes)&lt;/a&gt; before spitting out this result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="It's not great. The bicycle is distorted, the pelican is a white oval, an orange almost-oval beak, a little black eye and setched out straight line limbs leading to the pedal and handlebars." src="https://static.simonwillison.net/static/2025/deepseek-v32-speciale.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46108780"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="openrouter"/><category term="generative-ai"/><category term="deepseek"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="llm-reasoning"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>deepseek-ai/DeepSeek-Math-V2</title><link href="https://simonwillison.net/2025/Nov/27/deepseek-math-v2/#atom-tag" rel="alternate"/><published>2025-11-27T15:59:23+00:00</published><updated>2025-11-27T15:59:23+00:00</updated><id>https://simonwillison.net/2025/Nov/27/deepseek-math-v2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-Math-V2"&gt;deepseek-ai/DeepSeek-Math-V2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New on Hugging Face, a specialist mathematical reasoning LLM from DeepSeek. This is their entry in the space previously dominated by proprietary models from OpenAI and Google DeepMind, both of which &lt;a href="https://simonwillison.net/2025/Jul/21/gemini-imo/"&gt;achieved gold medal scores&lt;/a&gt; on the International Mathematical Olympiad earlier this year.&lt;/p&gt;
&lt;p&gt;We now have an open weights (Apache 2 licensed) 685B, 689GB model that can achieve the same. From the &lt;a href="https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf"&gt;accompanying paper&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeekMath-V2 demonstrates strong performance on competition mathematics. With scaled test-time compute, it achieved gold-medal scores in high-school competitions including IMO 2025 and CMO 2024, and a near-perfect score on the undergraduate Putnam 2024 competition.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mathematics"&gt;mathematics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="llm-reasoning"/><category term="deepseek"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="mathematics"/><category term="generative-ai"/></entry><entry><title>Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult</title><link href="https://simonwillison.net/2025/Nov/24/claude-opus/#atom-tag" rel="alternate"/><published>2025-11-24T19:37:07+00:00</published><updated>2025-11-24T19:37:07+00:00</updated><id>https://simonwillison.net/2025/Nov/24/claude-opus/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-opus-4-5"&gt;released Claude Opus 4.5&lt;/a&gt; this morning, which they call "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/"&gt;GPT-5.1-Codex-Max&lt;/a&gt; and Google's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3&lt;/a&gt;, both released within the past week!&lt;/p&gt;
&lt;p&gt;The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February).&lt;/p&gt;
&lt;p&gt;The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for &amp;gt;200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-5#key-improvements-in-opus-4-5-over-opus-4-1"&gt;Key improvements in Opus 4.5 over Opus 4.1&lt;/a&gt; document has a few more interesting details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Opus 4.5 has a new &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort"&gt;effort parameter&lt;/a&gt; which defaults to high but can be set to medium or low for faster responses.&lt;/li&gt;
&lt;li&gt;The model supports &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool"&gt;enhanced computer use&lt;/a&gt;, specifically a &lt;code&gt;zoom&lt;/code&gt; tool which you can provide to Opus 4.5 to allow it to request a zoomed in region of the screen to inspect.&lt;/li&gt;
&lt;li&gt;"&lt;a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking#thinking-block-preservation-in-claude-opus-4-5"&gt;Thinking blocks from previous assistant turns are preserved in model context by default&lt;/a&gt;" - apparently previous Anthropic models discarded those.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in &lt;a href="https://simonwillison.net/2025/Nov/24/sqlite-utils-40a1/"&gt;a new alpha release of sqlite-utils&lt;/a&gt; that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across &lt;a href="https://github.com/simonw/sqlite-utils/compare/10957305be998999e3c95c11863b5709d42b7ae3...4.0a1"&gt;20 commits, 39 files changed,  2,022 additions and 1,173 deletions&lt;/a&gt; in a two day period. Here's the &lt;a href="https://gistpreview.github.io/?f40971b693024fbe984a68b73cc283d2"&gt;Claude Code transcript&lt;/a&gt; where I had it help implement one of the more complicated new features.&lt;/p&gt;
&lt;p&gt;It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in &lt;a href="https://github.com/simonw/sqlite-utils/milestone/7?closed=1"&gt;the milestone for the alpha&lt;/a&gt;. I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model.&lt;/p&gt;
&lt;p&gt;With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected.&lt;/p&gt;
&lt;p&gt;I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two.&lt;/p&gt;
&lt;p&gt;This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.&lt;/p&gt;
&lt;p&gt;Google's Nano Banana Pro image generation model was notable in that its ability to &lt;a href="https://simonwillison.net/2025/Nov/20/nano-banana-pro/#creating-an-infographic"&gt;render usable infographics&lt;/a&gt; really does represent a task at which  previous models had been laughably incapable.&lt;/p&gt;
&lt;p&gt;The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis?&lt;/p&gt;
&lt;p&gt;And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they've fallen one-by-one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models.&lt;/p&gt;
&lt;p&gt;I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double-down on that advice myself!&lt;/p&gt;
&lt;p&gt;I'd love to see AI labs like Anthropic help address this challenge directly. I'd like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle.&lt;/p&gt;
&lt;p&gt;"Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a &lt;em&gt;lot&lt;/em&gt; more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.&lt;/p&gt;
&lt;p id="pelicans"&gt;In the meantime, I'm just gonna have to keep on getting them to draw &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelicans riding bicycles&lt;/a&gt;. Here's Opus 4.5 (on its default &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort"&gt;"high" effort level&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It did significantly better on the &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;new more detailed prompt&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican-advanced.jpg" alt="The pelican has feathers and a red pouch - a close enough version of breeding plumage. The bicycle is a much better shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's that same complex prompt &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#advanced-pelican"&gt;against Gemini 3 Pro&lt;/a&gt; and &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#advanced-pelican-codex-max"&gt;against GPT-5.1-Codex-Max-xhigh&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="still-susceptible-to-prompt-injection"&gt;Still susceptible to prompt injection&lt;/h4&gt;
&lt;p&gt;From &lt;a href="https://www.anthropic.com/news/claude-opus-4-5#a-step-forward-on-safety"&gt;the safety section&lt;/a&gt; of Anthropic's announcement post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-prompt-injection.jpg" alt="Bar chart titled &amp;quot;Susceptibility to prompt-injection style attacks&amp;quot; with subtitle &amp;quot;At k queries; lower is better&amp;quot;. Y-axis shows &amp;quot;ATTACK SUCCESS RATE (%)&amp;quot; from 0-100. Five stacked bars compare AI models with three k values (k=1 in dark gray, k=10 in beige, k=100 in pink). Results: Gemini 3 Pro Thinking (12.5, 60.7, 92.0), GPT-5.1 Thinking (12.6, 58.2, 87.8), Haiku 4.5 Thinking (8.3, 51.1, 85.6), Sonnet 4.5 Thinking (7.3, 41.9, 72.4), Opus 4.5 Thinking (4.7, 33.6, 63.0)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On the one hand this looks great, it's a clear improvement over previous models and the competition.&lt;/p&gt;
&lt;p&gt;What does the chart actually tell us though? It tells us that single attempts at prompt injection still work 1/20 times, and if an attacker can try ten different attacks that success rate goes up to 1/3!&lt;/p&gt;
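&lt;p&gt;Those odds are roughly what you'd expect if each attempt were an independent roll of the dice - compounding the k=1 success rate ten times lands close to the measured k=10 number:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'p = 0.047; print(1 - (1 - p) ** 10)'
# ≈ 0.382 under independence - the measured 33.6% at k=10 is in the same ballpark
&lt;/code&gt;&lt;/pre&gt;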
&lt;p&gt;I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="prompt-injection"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="evals"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="november-2025-inflection"/></entry><entry><title>Olmo 3 is a fully open LLM</title><link href="https://simonwillison.net/2025/Nov/22/olmo-3/#atom-tag" rel="alternate"/><published>2025-11-22T23:59:46+00:00</published><updated>2025-11-22T23:59:46+00:00</updated><id>https://simonwillison.net/2025/Nov/22/olmo-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Olmo is the LLM series from Ai2 - the &lt;a href="https://allenai.org/"&gt;Allen institute for AI&lt;/a&gt;. Unlike most open weight models these are notable for including the full training data, training process and checkpoints along with those releases.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://allenai.org/blog/olmo3"&gt;new Olmo 3&lt;/a&gt; claims to be "the best fully open 32B-scale thinking model" and has a strong focus on interpretability:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At its center is &lt;strong&gt;Olmo 3-Think (32B)&lt;/strong&gt;, the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models.&lt;/p&gt;
&lt;p&gt;Having full access to the training data is really useful. Here's how they describe that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Olmo 3 is pretrained on &lt;strong&gt;Dolma 3&lt;/strong&gt;, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with &lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;, codebases, math problems and solutions, and encyclopedic text. From this pool, we construct &lt;strong&gt;Dolma 3 Mix&lt;/strong&gt;, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They also highlight that they are training on fewer tokens than their competition:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you're continuing to hold out hope for a model trained entirely on licensed data this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web.&lt;/p&gt;
&lt;p&gt;I tried out the 32B Think model and the 7B Instruct model &lt;a href="https://lmstudio.ai/models/olmo3"&gt;using LM Studio&lt;/a&gt;. The 7B model is a 4.16GB download, the 32B one is 18.14GB.&lt;/p&gt;
&lt;p&gt;The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for &lt;em&gt;14 minutes 43 seconds&lt;/em&gt;, outputting 8,437 tokens total most of which was &lt;a href="https://gist.github.com/simonw/2ae9d5ed71de9608b7955eea9671306f"&gt;this epic thinking trace&lt;/a&gt;.&lt;/p&gt;
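&lt;p&gt;That run works out at just under 10 tokens/second:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'print(8437 / (14 * 60 + 43))'
# ≈ 9.55 tokens/second averaged over the 14m43s run
&lt;/code&gt;&lt;/pre&gt;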
&lt;p&gt;I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals".&lt;/p&gt;
&lt;div class="highlight highlight-text-xml-svg"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;svg&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;200&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;viewBox&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0 0 100 100&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bicycle &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Bike frame &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;rect&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;height&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;saddlebrown&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;line&lt;/span&gt; &lt;span class="pl-e"&gt;x1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y1&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;40&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;x2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;70&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y2&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pelican &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;ellipse&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;rx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;20&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;ry&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52 50,57 35,62 50&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Head/beak &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;55&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;circle&lt;/span&gt; &lt;span class="pl-e"&gt;cx&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;cy&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;r&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;white&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45 60,50 70,55 60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;lightgrey&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Left wing &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;65 60,70 70,55 60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;lightgrey&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Right wing &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Feet on pedals &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25 75,30 85,35 75&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;polygon&lt;/span&gt; &lt;span class="pl-e"&gt;points&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;75 75,70 85,65 75&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;black&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;svg&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Rendered it looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmo3-32b-pelican.jpg" alt="Two circles, each with a triangle sticking out from the bottom. They have bars leading up to a brown box. Overlapping them is a black triangle with white circles for eyes and two grey triangles that are probably meant to be wings. It is not recognizable as a pelican or a bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I tested OLMo 2 32B 4bit &lt;a href="https://simonwillison.net/2025/Mar/16/olmo2/"&gt;back in March&lt;/a&gt; and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmo2-pelican.jpg" alt="Blue and black wiggly lines looking more like a circuit diagram than a pelican riding a bicycle" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To be fair, 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now &lt;a href="https://openrouter.ai/chat?models=qwen/qwen3-32b"&gt;using OpenRouter&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/qwen3-32b-pelican.png" alt="The bicycle is two black circles joined by two lines, with a weird rectangular saddle perched on top. The pelican is a blue oval, a white circle with a yellow triangle in it and a weird eye-shaped oval overlapping the blue one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="olmotrace"&gt;OlmoTrace&lt;/h4&gt;
&lt;p&gt;I was particularly keen on trying out the ability to "inspect intermediate reasoning traces". Here's how that's described later in the announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A core goal of Olmo 3 is not just to &lt;em&gt;open&lt;/em&gt; the model flow, but to make it &lt;em&gt;actionable&lt;/em&gt; for people who want to understand and improve model behavior. Olmo 3 integrates with &lt;a href="https://allenai.org/blog/olmotrace"&gt;&lt;strong&gt;OlmoTrace&lt;/strong&gt;&lt;/a&gt;, our tool for tracing model outputs back to training data in real time.&lt;/p&gt;
&lt;p&gt;For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why---and adjust data or training decisions accordingly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can access OlmoTrace via &lt;a href="https://playground.allenai.org/"&gt;playground.allenai.org&lt;/a&gt;, by first running a prompt and then clicking the "Show OlmoTrace" button below the output.&lt;/p&gt;
&lt;p&gt;I tried that on "Generate a conference bio for Simon Willison" (an ego-prompt I use to see how much the models have picked up about me from their training data) and got back a result that looked like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/olmotrace.jpg" alt="Screenshot of AI2 Playground interface with Olmo 3 32B Think model selected. User prompt reads &amp;quot;Generate a conference bio for Simon Willison&amp;quot;. AI response shows expanded &amp;quot;Thoughts&amp;quot; section: &amp;quot;Okay, the user wants me to generate a conference bio for Simon Willison. Let me start by recalling who Simon is. He's a well-known developer, entrepreneur, and author in the tech community. I remember he's the co-founder of co:here and now works at Anthropic. He's famous for his work with SQLite, especially the datasette project. Also, he writes a blog called SimonWillison.net which is quite popular. First, I need to structure a typical conference bio. These usually include his current role, past positions, notable projects, and maybe some achievements or recognitions. I should highlight his key contributions like datasette, which allows easy exploration of SQLite databases via a web interface. That's a big deal in data journalism and open data circles.&amp;quot; Right panel shows &amp;quot;OlmoTrace&amp;quot; feature described as &amp;quot;Documents from the training data that have exact text matches with the model response. Powered by infini-gram&amp;quot;. First document excerpt discusses technology and innovation, with highlighted match text &amp;quot;societal implications of technology, emphasizing the&amp;quot; shown in bold, surrounded by text about responsibility and merging innovation with intellect. Second document excerpt about Matt Hall has highlighted match &amp;quot;is a software engineer and entrepreneur based in&amp;quot; shown in bold, describing someone in New York City who co-founded a PFP collection and works at Google Creative Lab. Note indicates &amp;quot;Document repeated 2 times in result&amp;quot; with &amp;quot;View all repeated documents&amp;quot; link." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thinks I co-founded co:here and work at Anthropic, both of which are incorrect - but that's not uncommon with LLMs: I frequently see them suggest that I'm the CTO of GitHub and other such inaccuracies.&lt;/p&gt;
&lt;p&gt;I found the OlmoTrace panel on the right disappointing. None of the training documents it highlighted looked relevant - it appears to be looking for phrase matches (powered by &lt;a href="https://infini-gram.io/"&gt;Ai2's infini-gram&lt;/a&gt;) but the documents it found had nothing to do with me at all.&lt;/p&gt;
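&lt;p&gt;You can also query an infini-gram index directly over HTTP. Here's a sketch based on my reading of the API documentation on &lt;a href="https://infini-gram.io/"&gt;infini-gram.io&lt;/a&gt; - treat the payload shape as an assumption, and note that the index named here is one of their published corpora, not necessarily the Olmo 3 training set:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# endpoint and payload shape per infini-gram's API docs (unverified)
curl -s https://api.infini-gram.io/ \
  -H 'Content-Type: application/json' \
  -d '{"index": "v4_rpj_llama_s4", "query_type": "count", "query": "Simon Willison"}'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That should return a JSON object including a count of exact matches for the phrase in that corpus.&lt;/p&gt;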
&lt;h4 id="can-open-training-data-address-concerns-of-backdoors-"&gt;Can open training data address concerns of backdoors?&lt;/h4&gt;
&lt;p&gt;Ai2 claim that Olmo 3 is "the best fully open 32B-scale thinking model", which I think holds up provided you define "fully open" as including open training data. There's not a great deal of competition in that space though - Ai2 compare themselves to &lt;a href="https://marin.community/"&gt;Stanford's Marin&lt;/a&gt; and &lt;a href="https://www.swiss-ai.org/apertus"&gt;Swiss AI's Apertus&lt;/a&gt;, neither of which I'd heard about before.&lt;/p&gt;
&lt;p&gt;A big disadvantage of other open weight models is that it's impossible to audit their training data. Anthropic published a paper last month showing that &lt;a href="https://www.anthropic.com/research/small-samples-poison"&gt;a small number of samples can poison LLMs of any size&lt;/a&gt; - it can take just "250 poisoned documents" to add a backdoor to a large model that triggers undesired behavior based on a short carefully crafted prompt.&lt;/p&gt;

&lt;p&gt;This makes fully open training data an even bigger deal.&lt;/p&gt;
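&lt;p&gt;It also means anyone can run crude audits themselves - impossible with a closed corpus. A trivial sketch, where the shard directory and trigger phrase are entirely hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# scan a local copy of the training shards for a suspected trigger phrase
# (the path and the phrase here are made up for illustration)
grep -ril 'deploy the hidden payload' training-data-shards/ | head
&lt;/code&gt;&lt;/pre&gt;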

&lt;p&gt;Ai2 researcher Nathan Lambert included this note about the importance of transparent training data in &lt;a href="https://www.interconnects.ai/p/olmo-3-americas-truly-open-reasoning"&gt;his detailed post about the release&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;In particular, we're excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative).&lt;/p&gt;

&lt;p&gt;This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." &lt;a href="https://arxiv.org/abs/2506.10947"&gt;arXiv preprint arXiv:2506.10947&lt;/a&gt; (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." &lt;a href="https://arxiv.org/abs/2507.10532"&gt;arXiv preprint arXiv:2507.10532&lt;/a&gt; (2025).)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hope we see more competition in this space, including further models in the Olmo series. The improvements from Olmo 1 (in &lt;a href="https://simonwillison.net/2024/Feb/2/olmos/"&gt;February 2024&lt;/a&gt;) and Olmo 2 (in &lt;a href="https://simonwillison.net/2025/Mar/16/olmo2/"&gt;March 2025&lt;/a&gt;) have been significant. I'm hoping that trend continues!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/interpretability"&gt;interpretability&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nathan-lambert"&gt;nathan-lambert&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/olmo"&gt;olmo&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="interpretability"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="ai2"/><category term="ai-ethics"/><category term="llm-release"/><category term="lm-studio"/><category term="nathan-lambert"/><category term="olmo"/></entry><entry><title>Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model</title><link href="https://simonwillison.net/2025/Nov/20/nano-banana-pro/#atom-tag" rel="alternate"/><published>2025-11-20T16:32:25+00:00</published><updated>2025-11-20T16:32:25+00:00</updated><id>https://simonwillison.net/2025/Nov/20/nano-banana-pro/#atom-tag</id><summary type="html">
    &lt;p&gt;Hot on the heels of Tuesday's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3 Pro&lt;/a&gt; release, today it's &lt;a href="https://blog.google/technology/ai/nano-banana-pro/"&gt;Nano Banana Pro&lt;/a&gt;, also known as &lt;a href="https://deepmind.google/models/gemini-image/pro/"&gt;Gemini 3 Pro Image&lt;/a&gt;. I've had a few days of preview access and this is an &lt;em&gt;astonishingly&lt;/em&gt; capable image generation model.&lt;/p&gt;
&lt;p&gt;As is often the case, the most useful low-level details can be found in &lt;a href="https://ai.google.dev/gemini-api/docs/image-generation#gemini-3-capabilities"&gt;the API documentation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Designed to tackle the most challenging workflows through advanced reasoning, it excels at complex, multi-turn creation and modification tasks.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-resolution output&lt;/strong&gt;: Built-in generation capabilities for 1K, 2K, and 4K visuals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced text rendering&lt;/strong&gt;: Capable of generating legible, stylized text for infographics, menus, diagrams, and marketing assets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounding with Google Search&lt;/strong&gt;: The model can use Google Search as a tool to verify facts and generate imagery based on real-time data (e.g., current weather maps, stock charts, recent events).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking mode&lt;/strong&gt;: The model utilizes a "thinking" process to reason through complex prompts. It generates interim "thought images" (visible in the backend but not charged) to refine the composition before producing the final high-quality output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Up to 14 reference images&lt;/strong&gt;: You can now mix up to 14 reference images to produce the final image.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;[...] These 14 images can include the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Up to 6 images of objects with high-fidelity to include in the final image&lt;/li&gt;
&lt;li&gt;Up to 5 images of humans to maintain character consistency&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;There is also a short (6 page) &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Image-Model-Card.pdf"&gt;model card PDF&lt;/a&gt; which lists the following as "new capabilities" compared to the previous Nano Banana: Multi character editing, Chart editing, Text editing, Factuality - Edu, Multi-input 1-3, Infographics, Doodle editing, Visual design.&lt;/p&gt;
&lt;h4 id="trying-out-some-detailed-instruction-image-prompts"&gt;Trying out some detailed instruction image prompts&lt;/h4&gt;
&lt;p&gt;Max Woolf published &lt;a href="https://minimaxir.com/2025/11/nano-banana-prompts/#hello-nano-banana"&gt;the definitive guide to prompting Nano Banana&lt;/a&gt; just a few days ago. I decided to try his example prompts against the new model, requesting results in 4K.&lt;/p&gt;
&lt;p&gt;Here's what I got for his first test prompt, using Google's &lt;a href="https://aistudio.google.com/"&gt;AI Studio&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Create an image of a three-dimensional pancake in the shape of a skull, garnished on top with blueberries and maple syrup.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/pancake-skull-1.jpg" alt="A very detailed quality photo of a skull made of pancake batter, blueberries on top, maple syrup dripping down, maple syrup bottle in the background." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The result came out as a 24.1MB, 5632 × 3072 pixel PNG file. I don't want to serve that on my own blog so here's &lt;a href="https://drive.google.com/file/d/1QV3pcW1KfbTRQscavNh6ld9PyqG4BRes/view?usp=drive_link"&gt;a Google Drive link for the original&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I ran his follow-up prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Make ALL of the following edits to the image:
- Put a strawberry in the left eye socket.
- Put a blackberry in the right eye socket.
- Put a mint garnish on top of the pancake.
- Change the plate to a plate-shaped chocolate-chip cookie.
- Add happy people to the background.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/pancake-skull-2.jpg" alt="It's the exact same skull with the requested edits made - mint garnish on the blueberries, a strawberry in the left hand eye socket (from our perspective, technically the skull's right hand socket), a blackberry in the other, the plate is now a plate-sized chocolate chip cookie (admittedly on a regular plate) and there are four happy peo ple in the background." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'll note that it did put the plate-sized cookie on a regular plate. Here's &lt;a href="https://drive.google.com/file/d/18AzhM-BUZAfLGoHWl6MQW_UW9ju4km-i/view?usp=drive_link"&gt;the 24.9MB PNG&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new model isn't cheap. Here's &lt;a href="https://ai.google.dev/gemini-api/docs/pricing#gemini-3-pro-image-preview"&gt;the API pricing&lt;/a&gt;: it's 24 cents for a 4K image and 13.4 cents for a 1K or 2K image. Image inputs are 0.11 cents (just over 1/10th of a cent) each - an earlier version of their pricing page incorrectly said 6.7 cents each but that's now been fixed.&lt;/p&gt;
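&lt;p&gt;Those per-image prices add up quickly for batch work. Using just the figures above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 100 4K generations at 24 cents each, plus 3 reference image
# inputs per generation at 0.11 cents each
python3 -c 'print(round(100 * (0.24 + 3 * 0.0011), 2))'
# 24.33 (dollars)
&lt;/code&gt;&lt;/pre&gt;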
&lt;p&gt;Unlike most of Google's other models it also isn't available for free via AI Studio: you have to configure an API key with billing in order to use the model there.&lt;/p&gt;
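&lt;p&gt;Once you have a billed key the standard Gemini &lt;code&gt;generateContent&lt;/code&gt; endpoint should let you script it. This is an unverified sketch - in particular the &lt;code&gt;imageConfig&lt;/code&gt; and &lt;code&gt;imageSize&lt;/code&gt; fields are my assumption based on the image generation docs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# sketch only: imageConfig/imageSize are assumptions from the docs
curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-3-pro-image-preview:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "contents": [{"parts": [{"text": "Create an image of a three-dimensional pancake in the shape of a skull, garnished on top with blueberries and maple syrup."}]}],
    "generationConfig": {
      "responseModalities": ["TEXT", "IMAGE"],
      "imageConfig": {"imageSize": "4K"}
    }
  }'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated image should come back base64-encoded in an &lt;code&gt;inlineData&lt;/code&gt; part of the response.&lt;/p&gt;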
&lt;h4 id="creating-an-infographic"&gt;Creating an infographic&lt;/h4&gt;
&lt;p&gt;So this thing is great at following instructions. How about rendering text?&lt;/p&gt;
&lt;p&gt;I tried this prompt, this time using the Gemini consumer app in "thinking" mode (which now uses Nano Banana Pro for image generation). &lt;a href="https://gemini.google.com/share/d40fe391f309"&gt;Here's a share link&lt;/a&gt; - my prompt was:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Infographic explaining how the Datasette open source project works&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a great opportunity to test its ability to run searches (aka "Grounding with Google Search"). Here's what it created based on that 9 word prompt:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/nano-banana-datasette.jpg" alt="Described by Gemini 3 Pro: A technical architecture diagram titled &amp;quot;How Datasette Works: From Raw Data to Explorable API,&amp;quot; illustrating a workflow from left to right. 1. &amp;quot;RAW DATA SOURCES&amp;quot; include &amp;quot;CSV&amp;quot;, &amp;quot;JSON&amp;quot;, &amp;quot;Excel (XLSX)&amp;quot;, and &amp;quot;Log Files&amp;quot;. 2. These flow into &amp;quot;DATA PREPARATION &amp;amp; CONVERSION&amp;quot; using tools &amp;quot;csvs-to-sqlite&amp;quot; and &amp;quot;sqlite-utils&amp;quot; to create a &amp;quot;SQLite DATABASE&amp;quot;. 3. This feeds into the central &amp;quot;DATASETTE APPLICATION CORE,&amp;quot; a stack comprising &amp;quot;Data Ingestion (Read-Only)&amp;quot;, &amp;quot;Query Engine (SQL)&amp;quot;, &amp;quot;API Layer (JSON)&amp;quot;, and &amp;quot;Web UI Rendering&amp;quot;. 4. A &amp;quot;PLUGIN ECOSYSTEM&amp;quot; connects below the core, listing &amp;quot;Vega Charts&amp;quot;, &amp;quot;Cluster Maps&amp;quot;, &amp;quot;Full-Text Search&amp;quot;, and &amp;quot;Custom Renderers&amp;quot;. 5. To the right, the core branches into &amp;quot;OUTPUT &amp;amp; INTERFACES&amp;quot;: an &amp;quot;Interactive Web Interface (Explore, Filter, Facet)&amp;quot;, a &amp;quot;JSON API (Programmatic Access)&amp;quot; showing a snippet {&amp;quot;data&amp;quot;: [...]}, and a &amp;quot;SQL EDITOR Custom SQL Queries&amp;quot; showing SELECT * FROM.... 6. The API output connects to &amp;quot;PUBLISHING &amp;amp; DEPLOYMENT&amp;quot; via a terminal command datasette publish cloudrun my.db leading to deployment targets &amp;quot;Heroku&amp;quot;, &amp;quot;Google Cloud Run&amp;quot;, &amp;quot;Fly.io&amp;quot;, and &amp;quot;Vercel&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That's a pretty extraordinary result for such a simple prompt. The text is all spelled correctly and rendered without glitches. The content is solid too - it even included logos for the most popular publish platforms, and a tiny thumbnail of the Datasette UI which is close enough for an infographic.&lt;/p&gt;
&lt;p&gt;"Data Ingestion (Read-Only)" is a bit off. Everything else looks great to me.&lt;/p&gt;
&lt;p&gt;Here's the thinking trace for that image generation process:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Initiating Datasette Infographic&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I'm starting a detailed infographic about the Datasette project. I plan to use Google Search to gather specific information about its core processes, which I will then visualize in a multi-panel diagram.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Developing the Datasette workflow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I am visualizing the Datasette workflow as described in the search results. I'm focusing on the conversion of data from different formats into a SQLite database, which is then presented as a web application complete with a user interface and JSON output.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Visualizing Datasette Workflow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've sketched out the initial structure for the infographic, focusing on a flowchart style. The key components include data input from diverse sources, conversion to SQLite, the Datasette core application, its main features (web UI, JSON API, SQL querying), ecosystem plugins, and finally, deployment options. I aim for clear icons and labels to represent each step. Now I will focus on refining the flow and clarifying the relationships between the parts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Defining Core Datasette Stages&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've sketched out the initial infographic structure. I have identified the key components, focusing on visual representation of the data flow and the key benefits and features of the system. I have made updates to the flowchart with clear icons and labels. Now I'm working to expand the data input section and refine the visual representation of each step.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Constructing Datasette Infographic&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I am translating the current Datasette workflow outline into visual elements. The input section will show icons for different data formats. I've sketched the data conversion stage with clear illustrations of the necessary tools. I will add a depiction of the Datasette core application, showcasing its interfaces and functions. The plugin ecosystem and deployment options will follow. Now, I am refining the visual representation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Analyzing Datasette's Architecture&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I'm currently working on representing the Datasette open-source project's architecture visually. I've broken it down into key components: raw data processing, SQLite conversion, core application functionalities, plugin integration, and deployment methods. I'm focusing on creating a clear, linear representation to show the workflow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mapping the Datasette Flow&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've crafted a comprehensive graphic to illustrate Datasette's operation. It spans the data ingestion, SQLite conversion, main application features, plugin interactions, and deployment stages, aiming for clarity and a logical flow.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="detecting-generated-images-with-synthid"&gt;Detecting generated images with SynthID&lt;/h4&gt;
&lt;p&gt;This note from the &lt;a href="https://blog.google/technology/ai/nano-banana-pro/"&gt;announcement post&lt;/a&gt; caught my eye:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We believe it's critical to know when an image is AI-generated. This is why all media generated by Google's tools are embedded with our imperceptible &lt;a href="https://deepmind.google/models/synthid/"&gt;SynthID&lt;/a&gt; digital watermark.&lt;/p&gt;
&lt;p&gt;Today, we are putting a powerful verification tool directly in consumers' hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Last night I used Nano Banana Pro to generate a fake photograph of raccoons stealing our food delivery, then scrubbed out the little diamond icon using the Apple Photos "cleanup" tool. I uploaded that to the Gemini app and asked "Was this image created with AI?":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/nano-banana-detected.jpg" alt="Screenshot of a mobile chat interface displaying a conversation about AI image detection. The user has uploaded a photo showing two raccoons on a porch; one raccoon reaches inside a paper bag a bench while the other stands on the ground looking up at it. The conversation title reads &amp;quot;AI Image Creation Confirmed&amp;quot;. The user asks, &amp;quot;Was this image created with ai?&amp;quot; The AI response, labeled &amp;quot;Analysis &amp;amp; 1 more&amp;quot;, states: &amp;quot;Yes, it appears that all or part of this image was created with Google AI. SynthID detected a watermark in 25-50% of the image.&amp;quot;" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Yes, it appears that all or part of this image was created with Google AI. SynthID detected a watermark in 25-50% of the image.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Presumably that 25-50% figure is because the rest of the photo was taken by me - it was just the raccoons that were added by Nano Banana Pro.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nano-banana"&gt;nano-banana&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="datasette"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="text-to-image"/><category term="llm-release"/><category term="nano-banana"/></entry><entry><title>Building more with GPT-5.1-Codex-Max</title><link href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#atom-tag" rel="alternate"/><published>2025-11-19T23:15:10+00:00</published><updated>2025-11-19T23:15:10+00:00</updated><id>https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-1-codex-max/"&gt;Building more with GPT-5.1-Codex-Max&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Hot on the heels of yesterday's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3 Pro release&lt;/a&gt; comes a new model from OpenAI called GPT-5.1-Codex-Max.&lt;/p&gt;
&lt;p&gt;(Remember when GPT-5 was meant to bring in a new era of less confusing model names? That didn't last!)&lt;/p&gt;
&lt;p&gt;It's currently only available through their &lt;a href="https://developers.openai.com/codex/cli/"&gt;Codex CLI coding agent&lt;/a&gt;, where it's the new default model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting today, GPT‑5.1-Codex-Max will replace GPT‑5.1-Codex as the default model in Codex surfaces. Unlike GPT‑5.1, which is a general-purpose model, we recommend using GPT‑5.1-Codex-Max and the Codex family of models only for agentic coding tasks in Codex or Codex-like environments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's not available via the API yet but should be shortly.&lt;/p&gt;
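&lt;p&gt;You can try it from the terminal in the meantime. A sketch, assuming the Codex CLI's current &lt;code&gt;-m&lt;/code&gt; and &lt;code&gt;-c&lt;/code&gt; flags (check &lt;code&gt;codex --help&lt;/code&gt;) - &lt;code&gt;xhigh&lt;/code&gt; is the new thinking level from this announcement:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# flags assumed from the Codex CLI docs; xhigh per this announcement
codex -m gpt-5.1-codex-max -c model_reasoning_effort="xhigh" \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;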
&lt;p&gt;The timing of this release is interesting given that Gemini 3 Pro appears to have &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#benchmarks"&gt;aced almost all of the benchmarks&lt;/a&gt; just yesterday. It's reminiscent of the period in 2024 when OpenAI consistently made big announcements that happened to coincide with Gemini releases.&lt;/p&gt;
&lt;p&gt;OpenAI's self-reported &lt;a href="https://openai.com/index/introducing-swe-bench-verified/"&gt;SWE-Bench Verified&lt;/a&gt; score is particularly notable: 76.5% for thinking level "high" and 77.9% for the new "xhigh". That was the one benchmark where Gemini 3 Pro was out-performed by Claude Sonnet 4.5 - Gemini 3 Pro got 76.2% and Sonnet 4.5 got 77.2%. OpenAI now have the highest scoring model there by a full 0.7 of a percentage point!&lt;/p&gt;
&lt;p&gt;They also report a score of 58.1% on &lt;a href="https://www.tbench.ai/leaderboard/terminal-bench/2.0"&gt;Terminal Bench 2.0&lt;/a&gt;, beating Gemini 3 Pro's 54.2% (and Sonnet 4.5's 42.8%.)&lt;/p&gt;
&lt;p&gt;The most intriguing part of this announcement concerns the model's approach to long context problems:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.1-Codex-Max is built for long-running, detailed work. It’s our first model natively trained to operate across multiple context windows through a process called &lt;em&gt;compaction&lt;/em&gt;, coherently working over millions of tokens in a single task. [...]&lt;/p&gt;
&lt;p&gt;Compaction enables GPT‑5.1-Codex-Max to complete tasks that would have previously failed due to context-window limits, such as complex refactors and long-running agent loops by pruning its history while preserving the most important context over long horizons. In Codex applications, GPT‑5.1-Codex-Max automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a lot of confusion &lt;a href="https://news.ycombinator.com/item?id=45982649"&gt;on Hacker News&lt;/a&gt; about what this actually means. Claude Code already does a version of compaction, automatically summarizing previous turns when the context runs out. Does this just mean that Codex-Max is better at that process?&lt;/p&gt;
&lt;p&gt;I had it draw me a couple of pelicans by typing "Generate an SVG of a pelican riding a bicycle" directly into the Codex CLI tool. Here's thinking level medium:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A flat-style illustration shows a white, round-bodied bird with an orange beak pedaling a red-framed bicycle with thin black wheels along a sandy beach, with a calm blue ocean and clear sky in the background." src="https://static.simonwillison.net/static/2025/codex-max-medium.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's thinking level "xhigh":&lt;/p&gt;
&lt;p&gt;&lt;img alt="A plump white bird with an orange beak and small black eyes crouches low on a blue bicycle with oversized dark wheels, shown racing forward with motion lines against a soft gradient blue sky." src="https://static.simonwillison.net/static/2025/codex-max-xhigh.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I also tried xhigh on my &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;longer pelican test prompt&lt;/a&gt;, which came out like this:&lt;/p&gt;
&lt;p id="advanced-pelican-codex-max"&gt;&lt;img alt="A stylized dark gray bird with layered wings, a yellow head crest, and a long brown beak leans forward in a racing pose on a black-framed bicycle, riding across a glossy blue surface under a pale sky." src="https://static.simonwillison.net/static/2025/codex-breeding-max-xhigh.jpg"&gt;&lt;/p&gt;

&lt;p&gt;Also today: &lt;a href="https://x.com/openai/status/1991266192905179613"&gt;GPT-5.1 Pro is rolling out today to all Pro users&lt;/a&gt;. According to the &lt;a href="https://help.openai.com/en/articles/6825453-chatgpt-release-notes"&gt;ChatGPT release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-5.1 Pro is rolling out today for all ChatGPT Pro users and is available in the model picker. GPT-5 Pro will remain available as a legacy model for 90 days before being retired.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a pretty fast deprecation cycle for the GPT-5 Pro model that was released just three months ago.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45982649"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;



</summary><category term="november-2025-inflection"/><category term="llm-release"/><category term="gpt-5"/><category term="codex-cli"/><category term="generative-ai"/><category term="openai"/><category term="ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/><category term="evals"/><category term="gpt-codex"/></entry><entry><title>Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark</title><link href="https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag" rel="alternate"/><published>2025-11-18T19:00:48+00:00</published><updated>2025-11-18T19:00:48+00:00</updated><id>https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Google released Gemini 3 Pro today. Here's &lt;a href="https://blog.google/products/gemini/gemini-3/"&gt;the announcement from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu&lt;/a&gt;, their &lt;a href="https://blog.google/technology/developers/gemini-3-developers/"&gt;developer blog announcement from Logan Kilpatrick&lt;/a&gt;, the &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;Gemini 3 Pro Model Card&lt;/a&gt;, and their &lt;a href="https://blog.google/products/gemini/gemini-3-collection/"&gt;collection of 11 more articles&lt;/a&gt;. It's a big release!&lt;/p&gt;
&lt;p&gt;I had a few days of preview access to this model via &lt;a href="https://aistudio.google.com/"&gt;AI Studio&lt;/a&gt;. The best way to describe it is that it's &lt;strong&gt;Gemini 2.5 upgraded to match the leading rival models&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Gemini 3 has the same underlying characteristics as Gemini 2.5. The knowledge cutoff is the same (January 2025). It accepts 1 million input tokens, can output up to 64,000 tokens, and has multimodal inputs across text, images, audio, and video.&lt;/p&gt;
&lt;h4 id="benchmarks"&gt;Benchmarks&lt;/h4&gt;
&lt;p&gt;Google's own reported numbers (in &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;the model card&lt;/a&gt;) show it scoring slightly higher than Claude Sonnet 4.5 and GPT-5.1 on most of the standard benchmarks. As always I'm waiting for independent confirmation, but I have no reason to believe those numbers are inaccurate.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg" alt="Table of benchmark numbers, described in full below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pricing"&gt;Pricing&lt;/h4&gt;
&lt;p&gt;In terms of pricing it's a little more expensive than Gemini 2.5 but still cheaper than Claude Sonnet 4.5. Here's how it fits in with those other leading models:&lt;/p&gt;
&lt;center&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
      &lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.1&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
      &lt;td&gt;$10.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $1.25&lt;br /&gt;
        &amp;gt; 200k tokens: $2.50
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $10.00&lt;br /&gt;
        &amp;gt; 200k tokens: $15.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Gemini 3 Pro&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $2.00&lt;br /&gt;
        &amp;gt; 200k tokens: $4.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $12.00&lt;br /&gt;
        &amp;gt; 200k tokens: $18.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.5&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $3.00&lt;br /&gt;
        &amp;gt; 200k tokens: $6.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $15.00&lt;br /&gt;
        &amp;gt; 200k tokens: $22.50
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.1&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
      &lt;td&gt;$75.00&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/center&gt;
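&lt;p&gt;To make that concrete, here's what a 100,000 input / 10,000 output token call costs under the 200k threshold, using the table above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -c 'print(round(100_000 * 2 / 1e6 + 10_000 * 12 / 1e6, 2))'  # Gemini 3 Pro: 0.32
python3 -c 'print(round(100_000 * 3 / 1e6 + 10_000 * 15 / 1e6, 2))'  # Claude Sonnet 4.5: 0.45
&lt;/code&gt;&lt;/pre&gt;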
&lt;h4 id="trying-it-out-against-a-complex-image"&gt;Trying it out against a complex image&lt;/h4&gt;
&lt;p&gt;That screenshot of the benchmarks from above looked like a good test for Gemini 3's multimodal support. I fed it that image URL and asked it to generate alt text for the image:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg 'Alt text for this image, include all figures and make them comprehensible to a screen reader user'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's what I got back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A comprehensive table comparing four AI models across various performance benchmarks. The columns listed are: Benchmark Name, Description, Gemini 3 Pro (highlighted in blue), Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Humanity's Last Exam (Academic reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 37.5%, Gemini 2.5 Pro 21.6%, Claude Sonnet 4.5 13.7%, GPT-5.1 26.5%.&lt;/li&gt;
&lt;li&gt;With search and code execution: Gemini 3 Pro 45.8% (others have no data).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ARC-AGI-2 (Visual reasoning puzzles; ARC Prize Verified)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 31.1%, Gemini 2.5 Pro 4.9%, Claude Sonnet 4.5 13.6%, GPT-5.1 17.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPQA Diamond (Scientific knowledge; No tools)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.9%, Gemini 2.5 Pro 86.4%, Claude Sonnet 4.5 83.4%, GPT-5.1 88.1%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;AIME 2025 (Mathematics)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 95.0%, Gemini 2.5 Pro 88.0%, Claude Sonnet 4.5 87.0%, GPT-5.1 94.0%.&lt;/li&gt;
&lt;li&gt;With code execution: Gemini 3 Pro 100%, Claude Sonnet 4.5 100%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MathArena Apex (Challenging Math Contest problems)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 23.4%, Gemini 2.5 Pro 0.5%, Claude Sonnet 4.5 1.6%, GPT-5.1 1.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMU-Pro (Multimodal understanding and reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.0%, Gemini 2.5 Pro 68.0%, Claude Sonnet 4.5 68.0%, GPT-5.1 76.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ScreenSpot-Pro (Screen understanding)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Sonnet 4.5 36.2%, GPT-5.1 3.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;CharXiv Reasoning (Information synthesis from complex charts)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.4%, Gemini 2.5 Pro 69.6%, Claude Sonnet 4.5 68.5%, GPT-5.1 69.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;OmniDocBench 1.5 (OCR; Overall Edit Distance, lower is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 0.115, Gemini 2.5 Pro 0.145, Claude Sonnet 4.5 0.145, GPT-5.1 0.147.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Video-MMMU (Knowledge acquisition from videos)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 87.6%, Gemini 2.5 Pro 83.6%, Claude Sonnet 4.5 77.8%, GPT-5.1 80.4%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;LiveCodeBench Pro (Competitive coding problems; Elo Rating, higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 2,439; Gemini 2.5 Pro 1,775; Claude Sonnet 4.5 1,418; GPT-5.1 2,243.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0 (Agentic terminal coding; Terminus-2 agent)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 54.2%, Gemini 2.5 Pro 32.6%, Claude Sonnet 4.5 42.8%, GPT-5.1 47.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SWE-Bench Verified (Agentic coding; Single attempt)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 76.2%, Gemini 2.5 Pro 59.6%, Claude Sonnet 4.5 77.2%, GPT-5.1 76.3%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;t2-bench (Agentic tool use)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 85.4%, Gemini 2.5 Pro 54.9%, Claude Sonnet 4.5 84.7%, GPT-5.1 80.2%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Vending-Bench 2 (Long-horizon agentic tasks; Net worth (mean), higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro $5,478.16; Gemini 2.5 Pro $573.64; Claude Sonnet 4.5 $3,838.74; GPT-5.1 $1,473.43.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;FACTS Benchmark Suite (Held out internal grounding, parametric, MM, and search retrieval benchmarks)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 70.5%, Gemini 2.5 Pro 63.4%, Claude Sonnet 4.5 50.4%, GPT-5.1 50.8%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SimpleQA Verified (Parametric knowledge)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.1%, Gemini 2.5 Pro 54.5%, Claude Sonnet 4.5 29.3%, GPT-5.1 34.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMLU (Multilingual Q&amp;amp;A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.8%, Gemini 2.5 Pro 89.5%, Claude Sonnet 4.5 89.1%, GPT-5.1 91.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Global PIQA (Commonsense reasoning across 100 Languages and Cultures)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 93.4%, Gemini 2.5 Pro 91.5%, Claude Sonnet 4.5 90.1%, GPT-5.1 90.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MRCR v2 (8-needle) (Long context performance)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;128k (average): Gemini 3 Pro 77.0%, Gemini 2.5 Pro 58.0%, Claude Sonnet 4.5 47.1%, GPT-5.1 61.6%.&lt;/li&gt;
&lt;li&gt;1M (pointwise): Gemini 3 Pro 26.3%, Gemini 2.5 Pro 16.4%, Claude Sonnet 4.5 (not supported), GPT-5.1 (not supported).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have not checked every line of this but a loose spot-check looks accurate to me.&lt;/p&gt;
&lt;p&gt;That prompt took 1,105 input and 3,901 output tokens, at a cost of &lt;a href="https://www.llm-prices.com/#it=1105&amp;amp;cit=3901&amp;amp;ot=3901&amp;amp;ic=2&amp;amp;oc=12&amp;amp;sel=gemini-3-pro-preview"&gt;5.6824 cents&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran this follow-up prompt (the &lt;code&gt;-c&lt;/code&gt; option continues the most recent conversation):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -c 'Convert to JSON'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can see &lt;a href="https://gist.github.com/simonw/ea7d52706557528e7eb3912cdf9250b0#response-1"&gt;the full output here&lt;/a&gt;, which starts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"columns"&lt;/span&gt;: [
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Benchmark&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Description&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 3 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 2.5 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Claude Sonnet 4.5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;GPT-5.1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    ]
  },
  &lt;span class="pl-ent"&gt;"benchmarks"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Humanity's Last Exam&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Academic reasoning&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"sub_results"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;No tools&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;37.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;21.6%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;13.7%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;26.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        },
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;With search and code execution&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45.8%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;
        }
      ]
    },&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="analyzing-a-city-council-meeting"&gt;Analyzing a city council meeting&lt;/h4&gt;
&lt;p&gt;To try it out against an audio file I extracted the 3h33m of audio from the video &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0"&gt;Half Moon Bay City Council Meeting - November 4, 2025&lt;/a&gt;. I used &lt;code&gt;yt-dlp&lt;/code&gt; to get that audio:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;yt-dlp -x --audio-format m4a &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://www.youtube.com/watch?v=qgJ7x7R6gy0&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That gave me a 74MB m4a file, which I ran through Gemini 3 Pro like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a /tmp/HMBCC\ 11⧸4⧸25\ -\ Half\ Moon\ Bay\ City\ Council\ Meeting\ -\ November\ 4,\ 2025\ \[qgJ7x7R6gy0\].m4a 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That failed with an "Internal error encountered" message, so I shrunk the file down to a more manageable 38MB using &lt;code&gt;ffmpeg&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ffmpeg -i &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -ac 1 -ar 22050 -c:a aac -b:a 24k &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB_compressed.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then ran it again like this (for some reason I had to use &lt;code&gt;--attachment-type&lt;/code&gt; this time):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview --attachment-type /tmp/HMB_compressed.m4a 'audio/aac' 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This time it worked! The &lt;a href="https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943314"&gt;full output is here&lt;/a&gt;, but it starts like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here is the transcript of the Half Moon Bay City Council meeting.&lt;/p&gt;
&lt;h4&gt;Meeting Outline&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;1. Call to Order, Updates, and Public Forum&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Mayor Brownstone calls the meeting to order. City Manager Chidester reports no reportable actions from the closed session. Announcements are made regarding food insecurity volunteers and the Diwali celebration. During the public forum, Councilmember Penrose (speaking as a citizen) warns against autocracy. Citizens speak regarding lease agreements, downtown maintenance, local music events, and homelessness outreach statistics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:00:00 - 00:13:25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Matthew Chidester, Irma Acosta, Deborah Penrose, Jennifer Moore, Sandy Vella, Joaquin Jimenez, Anita Rees.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Consent Calendar&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; The Council approves minutes from previous meetings and a resolution authorizing a licensing agreement for Seahorse Ranch. Councilmember Johnson corrects a pull request regarding abstentions on minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:13:25 - 00:15:15&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Councilmember Johnson, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Nagengast.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Ordinance Introduction: Commercial Vitality (Item 9A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents a new ordinance to address neglected and empty commercial storefronts, establishing maintenance and display standards. Councilmembers discuss enforcement mechanisms, window cleanliness standards, and the need for objective guidance documents to avoid subjective enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:15:15 - 00:30:45&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Karen Decker, Councilmember Johnson, Councilmember Nagengast, Vice Mayor Ruddick, Councilmember Penrose.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Ordinance Introduction: Building Standards &amp;amp; Electrification (Item 9B)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff introduces updates to the 2025 Building Code. A major change involves repealing the city's all-electric building requirement due to the 9th Circuit Court ruling (&lt;em&gt;California Restaurant Association v. City of Berkeley&lt;/em&gt;). &lt;strong&gt;Public speaker Mike Ferreira expresses strong frustration and disagreement with "unelected state agencies" forcing the City to change its ordinances.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:30:45 - 00:45:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Ben Corrales, Keith Weiner, Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;5. Housing Element Update &amp;amp; Adoption (Item 9C)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents the 5th draft of the Housing Element, noting State HCD requirements to modify ADU allocations and place a measure on the ballot regarding the "Measure D" growth cap. &lt;strong&gt;There is significant disagreement from Councilmembers Ruddick and Penrose regarding the State's requirement to hold a ballot measure.&lt;/strong&gt; Public speakers debate the enforceability of Measure D. &lt;strong&gt;Mike Ferreira interrupts the vibe to voice strong distaste for HCD's interference in local law.&lt;/strong&gt; The Council votes to adopt the element but strikes the language committing to a ballot measure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:45:00 - 01:05:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Leslie (Staff), Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Johnson.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;h4&gt;Transcript&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [00:00:00]
Good evening everybody and welcome to the November 4th Half Moon Bay City Council meeting. As a reminder, we have Spanish interpretation services available in person and on Zoom.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Victor Hernandez (Interpreter)&lt;/strong&gt; [00:00:35]
Thank you, Mr. Mayor, City Council, all city staff, members of the public. &lt;em&gt;[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]&lt;/em&gt; Thank you very much.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Those first two lines of the transcript already illustrate something interesting here: Gemini 3 Pro chose NOT to include the exact text of the Spanish instructions, instead summarizing them as "[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]".&lt;/p&gt;
&lt;p&gt;I haven't spot-checked the entire 3hr33m meeting, but I've confirmed that the timestamps do not line up. The transcript closes like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [01:04:00]
Meeting adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That actually happens &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0&amp;amp;t=3h31m5s"&gt;at 3h31m5s&lt;/a&gt; and the mayor says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay. Well, thanks everybody, members of the public for participating. Thank you for staff. Thank you to fellow council members. This meeting is now adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm disappointed about the timestamps, since mismatches there make it much harder to jump to the right point and confirm that the summarized transcript is an accurate representation of what was said.&lt;/p&gt;
&lt;p&gt;This took 320,087 input tokens and 7,870 output tokens, for a total cost of &lt;a href="https://www.llm-prices.com/#it=320087&amp;amp;ot=7870&amp;amp;ic=4&amp;amp;oc=18"&gt;$1.42&lt;/a&gt;.&lt;/p&gt;
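&lt;p&gt;As a quick sanity check of that number, here's the arithmetic as a Python sketch, using the $4/million input and $18/million output long-context rates encoded in that llm-prices.com link:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;input_tokens, output_tokens = 320_087, 7_870

# Long-context rates from the llm-prices.com link above:
# $4/million input, $18/million output (applies past 200,000 input tokens)
cost = (input_tokens * 4 + output_tokens * 18) / 1_000_000
print(f"${cost:.2f}")  # $1.42
&lt;/code&gt;&lt;/pre&gt;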
&lt;h4 id="and-a-new-pelican-benchmark"&gt;And a new pelican benchmark&lt;/h4&gt;
&lt;p&gt;Gemini 3 Pro has a new concept of a "thinking level" which can be set to low or high (and defaults to high). I tried my classic &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;Generate an SVG of a pelican riding a bicycle&lt;/a&gt; prompt at both levels.&lt;/p&gt;
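&lt;p&gt;If you're calling the API directly rather than through LLM, here's a minimal sketch of setting that level with the google-genai Python SDK - the &lt;code&gt;thinking_level&lt;/code&gt; parameter name is the one Google's Gemini 3 documentation uses:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Generate an SVG of a pelican riding a bicycle",
    # thinking_level defaults to "high" for Gemini 3 Pro
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="low")
    ),
)
print(response.text)
&lt;/code&gt;&lt;/pre&gt;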
&lt;p&gt;Here's low - Gemini decided to add a jaunty little hat (with a comment &lt;a href="https://gist.github.com/simonw/70d56ba39b7cbb44985d2384004fc4a0#response"&gt;in the SVG&lt;/a&gt; that says &lt;code&gt;&amp;lt;!-- Hat (Optional Fun Detail) --&amp;gt;&lt;/code&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-low.png" alt="The pelican is wearing a blue hat. It has a good beak. The bicycle is a little bit incorrect but generally a good effort." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's high. This is genuinely an excellent pelican, and the bicycle frame is at least the correct shape:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-high.png" alt="The pelican is not wearing a hat. It has a good beak. The bicycle is accurate and well-drawn." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Honestly though, my pelican benchmark is beginning to feel a little bit too basic. I decided to upgrade it. Here's v2 of the benchmark, which I plan to use going forward:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
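&lt;p&gt;Since I plan to run this against a lot of models, here's a sketch of automating it with LLM's Python API - the model IDs are illustrative and depend on which plugins you have installed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import llm

PROMPT = (
    "Generate an SVG of a California brown pelican riding a bicycle. "
    "The bicycle must have spokes and a correctly shaped bicycle frame. "
    "The pelican must have its characteristic large pouch, and there should "
    "be a clear indication of feathers. The pelican must be clearly pedaling "
    "the bicycle. The image should show the full breeding plumage of the "
    "California brown pelican."
)

# Example model IDs - these depend on the plugins you have installed
for model_id in ("gemini-3-pro-preview", "gpt-5.1"):
    svg = llm.get_model(model_id).prompt(PROMPT).text()
    with open(f"pelican-{model_id}.svg", "w") as fp:
        fp.write(svg)
&lt;/code&gt;&lt;/pre&gt;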
&lt;p&gt;For reference, here's a photo I took of a California brown pelican recently (sadly without a bicycle):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/breeding-plumage.jpg" alt="A glorious California brown pelican perched on a rock by the water. It has a yellow tint to its head and a red spot near its throat." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's Gemini 3 Pro's &lt;a href="https://gist.github.com/simonw/2b9930ae1ce6f3f5e9cfe3cb31ec0c0a"&gt;attempt&lt;/a&gt; at high thinking level for that new prompt:&lt;/p&gt;
&lt;p id="advanced-pelican"&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-breeding-pelican-high.png" alt="It's clearly a pelican. It has all of the requested features. It looks a bit abstract though." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And for good measure, here's that same prompt &lt;a href="https://gist.github.com/simonw/7a655ebe42f3d428d2ea5363dad8067c"&gt;against GPT-5.1&lt;/a&gt; - which produced this dumpy little fellow:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-1-breeding-pelican.png" alt="The pelican is very round. Its body overlaps much of the bicycle. It has a lot of dorky charisma." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And Claude Sonnet 4.5, which &lt;a href="https://gist.github.com/simonw/3296af92e4328dd4740385e6a4a2ac35"&gt;didn't do quite as well&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4-5-breeding-pelican.png" alt="Oh dear. It has all of the requested components, but the bicycle is a bit wrong and the pelican is arranged in a very awkward shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;None of the models seem to have caught on to the crucial detail that the California brown pelican is not, in fact, brown.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Introducing GPT-5.1 for developers</title><link href="https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag" rel="alternate"/><published>2025-11-13T23:59:35+00:00</published><updated>2025-11-13T23:59:35+00:00</updated><id>https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-1-for-developers/"&gt;Introducing GPT-5.1 for developers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
OpenAI announced GPT-5.1 yesterday, calling it &lt;a href="https://openai.com/index/gpt-5-1/"&gt;a smarter, more conversational ChatGPT&lt;/a&gt;. Today they've added it to their API.&lt;/p&gt;
&lt;p&gt;We actually got four new models today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1"&gt;gpt-5.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-chat-latest"&gt;gpt-5.1-chat-latest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex"&gt;gpt-5.1-codex&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex-mini"&gt;gpt-5.1-codex-mini&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are a lot of details to absorb here.&lt;/p&gt;
&lt;p&gt;GPT-5.1 introduces a new reasoning effort called "none" (the previous options were minimal, low, medium, and high) - and none is the new default.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This makes the model behave like a non-reasoning model for latency-sensitive use cases, with the high intelligence of GPT‑5.1 and added bonus of performant tool-calling. Relative to GPT‑5 with 'minimal' reasoning, GPT‑5.1 with no reasoning is better at parallel tool calling (which itself increases end-to-end task completion speed), coding tasks, following instructions, and using search tools---and supports &lt;a href="https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses"&gt;web search⁠&lt;/a&gt; in our API platform.&lt;/p&gt;
&lt;/blockquote&gt;
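&lt;p&gt;Here's a minimal sketch of what that looks like against the Responses API using the OpenAI Python SDK - since "none" is the new default you shouldn't need to set it explicitly, but it's shown here for clarity:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.1",
    # "none" is the new default reasoning effort for GPT-5.1
    reasoning={"effort": "none"},
    input="Generate an SVG of a pelican riding a bicycle",
)
print(response.output_text)
&lt;/code&gt;&lt;/pre&gt;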
&lt;p&gt;When you DO enable thinking you get to benefit from a new feature called "adaptive reasoning":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On straightforward tasks, GPT‑5.1 spends fewer tokens thinking, enabling snappier product experiences and lower token bills. On difficult tasks that require extra thinking, GPT‑5.1 remains persistent, exploring options and checking its work in order to maximize reliability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Another notable new feature for 5.1 is &lt;a href="https://platform.openai.com/docs/guides/prompt-caching#extended-prompt-cache-retention"&gt;extended prompt cache retention&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours. Extended Prompt Caching works by offloading the key/value tensors to GPU-local storage when memory is full, significantly increasing the storage capacity available for caching.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To enable this set &lt;code&gt;"prompt_cache_retention": "24h"&lt;/code&gt; in the API call. Weirdly there's no price increase involved with this at all. I &lt;a href="https://x.com/simonw/status/1989104422832738305"&gt;asked about that&lt;/a&gt; and OpenAI's Steven Heidel &lt;a href="https://x.com/stevenheidel/status/1989113407149314199"&gt;replied&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;with 24h prompt caching we move the caches from gpu memory to gpu-local storage. that storage is not free, but we made it free since it moves capacity from a limited resource (GPUs) to a more abundant resource (storage). then we can serve more traffic overall!&lt;/p&gt;
&lt;/blockquote&gt;
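&lt;p&gt;In Python that might look like this - a sketch using the OpenAI SDK, passing the parameter via &lt;code&gt;extra_body&lt;/code&gt; in case your installed SDK version doesn't expose it as a named argument yet:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.1",
    input="Long prompt with a large shared prefix goes here...",
    # Keep the cached prefix alive for up to 24 hours
    extra_body={"prompt_cache_retention": "24h"},
)
print(response.output_text)
&lt;/code&gt;&lt;/pre&gt;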
&lt;p&gt;The most interesting documentation I've seen so far is in the new &lt;a href="https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting_guide"&gt;5.1 cookbook&lt;/a&gt;, which also includes details of the new &lt;code&gt;shell&lt;/code&gt; and &lt;code&gt;apply_patch&lt;/code&gt; built-in tools. The &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/gpt-5/apply_patch.py"&gt;apply_patch.py implementation&lt;/a&gt; is worth a look, especially if you're interested in the advancing state-of-the-art of file editing tools for LLMs.&lt;/p&gt;
&lt;p&gt;I'm still working on &lt;a href="https://github.com/simonw/llm/issues/1300"&gt;integrating the new models into LLM&lt;/a&gt;. The Codex models are Responses-API-only.&lt;/p&gt;
&lt;p&gt;I got this pelican for GPT-5.1 default (no thinking):&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle wheels have no spokes at all, the pelican is laying quite flat on it" src="https://static.simonwillison.net/static/2025/gpt-5.1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;And this one with reasoning effort set to high:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This bicycle has four spokes per wheel, and the pelican is sitting more upright" src="https://static.simonwillison.net/static/2025/gpt-5.1-high-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;These actually feel like a &lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;regression from GPT-5&lt;/a&gt; to me. The bicycles have fewer spokes!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;



</summary><category term="november-2025-inflection"/><category term="llm"/><category term="openai"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="gpt-5"/><category term="generative-ai"/><category term="gpt-codex"/></entry><entry><title>Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican</title><link href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#atom-tag" rel="alternate"/><published>2025-11-09T03:31:34+00:00</published><updated>2025-11-09T03:31:34+00:00</updated><id>https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI partially released a new model yesterday called GPT-5-Codex-Mini, which they &lt;a href="https://x.com/OpenAIDevs/status/1986861734619947305"&gt;describe&lt;/a&gt; as "a more compact and cost-efficient version of GPT-5-Codex". It's currently only available via their Codex CLI tool and VS Code extension, with proper API access "&lt;a href="https://x.com/OpenAIDevs/status/1986861736041853368"&gt;coming soon&lt;/a&gt;". I decided to use Codex to reverse engineer the Codex CLI tool and give me the ability to prompt the new model directly.&lt;/p&gt;
&lt;p&gt;I made &lt;a href="https://www.youtube.com/watch?v=9o1_DL9uNlM"&gt;a video&lt;/a&gt; talking through my progress and demonstrating the final results.&lt;/p&gt;

&lt;p&gt;&lt;lite-youtube videoid="9o1_DL9uNlM" js-api="js-api" title="Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican" playlabel="Play: Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican"&gt; &lt;/lite-youtube&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#this-is-a-little-bit-cheeky"&gt;This is a little bit cheeky&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#codex-cli-is-written-in-rust"&gt;Codex CLI is written in Rust&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#iterating-on-the-code"&gt;Iterating on the code&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#let-s-draw-some-pelicans"&gt;Let's draw some pelicans&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/9/gpt-5-codex-mini/#bonus-the-debug-option"&gt;Bonus: the --debug option&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="this-is-a-little-bit-cheeky"&gt;This is a little bit cheeky&lt;/h4&gt;
&lt;p&gt;OpenAI clearly don't intend for people to access this model directly just yet. It's available exclusively through Codex CLI which is a privileged application - it gets to access a special backend API endpoint that's not publicly documented, and it uses a special authentication mechanism that bills usage directly to the user's existing ChatGPT account.&lt;/p&gt;
&lt;p&gt;I figured reverse-engineering that API directly would be somewhat impolite. But... Codex CLI is an open source project released under an Apache 2.0 license. How about upgrading that to let me run my own prompts through its existing API mechanisms instead?&lt;/p&gt;
&lt;p&gt;This felt like a somewhat absurd loophole, and I couldn't resist trying it out and seeing what happened.&lt;/p&gt;
&lt;h4 id="codex-cli-is-written-in-rust"&gt;Codex CLI is written in Rust&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://github.com/openai/codex"&gt;openai/codex&lt;/a&gt; repository contains the source code for the Codex CLI tool, which OpenAI rewrote in Rust just a few months ago.&lt;/p&gt;
&lt;p&gt;I don't know much Rust at all.&lt;/p&gt;
&lt;p&gt;I made my own clone on GitHub and checked it out locally:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone git@github.com:simonw/codex
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; codex&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I fired up Codex itself (in dangerous mode, because I like living dangerously):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;codex --dangerously-bypass-approvals-and-sandbox&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And ran this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Figure out how to build the rust version of this tool and then build it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This worked. It churned away for a bit and figured out how to build itself. This is a useful starting point for a project like this - in figuring out the compile step the coding agent gets seeded with a little bit of relevant information about the project, and if it can compile that means it can later partially test the code it is writing while it works.&lt;/p&gt;
&lt;p&gt;Once the compile had succeeded I fed it the design for the new feature I wanted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add a new sub-command to the Rust tool called "codex prompt"&lt;/p&gt;
&lt;p&gt;codex prompt "prompt goes here" - this runs the given prompt directly against the OpenAI API that Codex uses, with the same code path and authentication credentials as the rest of Codex.&lt;/p&gt;
&lt;p&gt;codex prompt -m &amp;lt;model_name&amp;gt; "prompt goes here" - same again but lets you set the model&lt;/p&gt;
&lt;p&gt;codex prompt "prompt goes here" -s/--system "system prompt goes here" - runs with a custom system/developer message&lt;/p&gt;
&lt;p&gt;codex prompt --models - this lists all available models that can be used with the "codex prompt" command&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My design for this new sub-command is lifted directly from my own &lt;a href="https://llm.datasette.io/en/stable/help.html#llm-prompt-help"&gt;llm command&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="iterating-on-the-code"&gt;Iterating on the code&lt;/h4&gt;
&lt;p&gt;Codex got to work. I've &lt;a href="https://gistpreview.github.io/?ddabbff092bdd658e06d8a2e8f142098"&gt;shared the full transcript here&lt;/a&gt; (using the tool &lt;a href="https://simonwillison.net/2025/Oct/23/claude-code-for-web-video/"&gt;I described here&lt;/a&gt;), but this TODO list it made itself is a useful summary of what it decided to do:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Updated Plan
└ ✔ Inspect CLI structure and relevant core utilities for sending prompts and listing models
  ✔ Implement new `codex prompt` subcommand (CLI plumbing + command handler)
  ✔ Format, lint, and test (`just fmt`, `just fix -p codex-cli`, `cargo test -p codex-cli`)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I like that it figured out &lt;a href="https://github.com/openai/codex/blob/a47181e471b6efe55e95f9858c913fc89a3a44fa/codex-rs/justfile"&gt;the justfile&lt;/a&gt; in the repo and decided to use it to run formatting and linting commands without me needing to tell it to. (Update: it turns out that was dictated by the &lt;a href="https://github.com/openai/codex/blob/f8b30af6dc275b3e64de5f1987e6cafe604cb72a/AGENTS.md"&gt;AGENTS.md&lt;/a&gt; file.)&lt;/p&gt;
&lt;p&gt;I tried running the first version of the code it wrote like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -m gpt-5-codex-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;... and it didn't quite work. I got this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(reasoning summary) **Seeking
(reasoning summary)  instructions
(reasoning summary)  and
(reasoning summary)  sandbox
(reasoning summary)  info
(reasoning summary) **
(reasoning summary) **Dec
(reasoning summary) iding
(reasoning summary)  on
(reasoning summary)  SVG
(reasoning summary)  creation
(reasoning summary)  approach
(reasoning summary) **
(reasoning summary) **Checking
(reasoning summary)  current
(reasoning summary)  directory
(reasoning summary) **
(reasoning summary) **Preparing
(reasoning summary)  to
(reasoning summary)  check
(reasoning summary)  current
(reasoning summary)  directory
(reasoning summary) **
I�m ready to help�what would you like me to do next?I�m ready to help�what would you like me to do next?
Token usage: total=2459 input=2374 cached_input=0 output=85 reasoning_output=64
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that it DID think about SVG creation, but then decided it should look at the current directory. This isn't what I want - it appeared to be running in Codex's normal mode with a system prompt telling it to edit files on disk. I wanted it to respond to the prompt without acting as if it had a full workspace available to it.&lt;/p&gt;
&lt;p&gt;I prompted it again, pasting in the broken output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;this did not seem to work - here is what happened - note that the reasoning summary comes out on a lot of lines when I would rather it streams out to stderr on the same line (flushing constantly) - and then despite the reasoning summary thinking about pelican on SVG it did not actually act on that and display the result, which is very surprising. [pasted content]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It churned for a bit more, but from the output it looked like it was still trying to include Codex-specific tools and workspace information. So I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that this new 'codex prompt' command should NOT be able to execute any tools at all, and should not have to think about the workspace or sandbox - it should send a single prompt with no tools and then stream and display a response and then stop.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This got me a new build, but when I ran it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Error: unexpected status 400 Bad Request: {"detail":"Instructions are not valid"}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My guess is that OpenAI's custom backend API for Codex doesn't like it if you attempt to call it &lt;em&gt;without&lt;/em&gt; passing in the expected instructions. Fair enough!&lt;/p&gt;
&lt;p&gt;Rather than paste that error in I got Codex to try the command itself, so it could see the error:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You try running this and see what happens: ./target/debug/codex prompt 'Generate an SVG of a pelican riding a bicycle'&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex switched back to the default instructions. From its reply:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;cli/src/prompt_cmd.rs#L1-L320&lt;/code&gt; now forces a tool-free request: we keep the workspace defaults, inject a lightweight system/developer message (“don’t run tools”), disable apply_patch/web search via config overrides, and build the outbound prompt as just that system span plus the user text.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I guess that will have to do. I asked for one more feature:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add a --debug option to the codex prompt command which causes the full JSON request and response to be printed to stderr, plus the URL that is being accessed and the HTTP verb&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and we're ready to try this thing out!&lt;/p&gt;
&lt;p&gt;Notably I haven't written a single line of Rust myself here and paid almost no attention to what it was actually doing. My main contribution was to run the binary every now and then to see if it was doing what I needed yet.&lt;/p&gt;
&lt;p&gt;I've pushed the working code to &lt;a href="https://github.com/simonw/codex/compare/a47181e471b6efe55e95f9858c913fc89a3a44fa...ae5f98a9248a8edb5d3c53261273a482fc0b5306"&gt;a prompt-subcommand branch in my repo&lt;/a&gt; if you want to take a look and see how it all works.&lt;/p&gt;

&lt;h4 id="let-s-draw-some-pelicans"&gt;Let's draw some pelicans&lt;/h4&gt;
&lt;p&gt;With the final version of the code built, I drew some pelicans. Here's the &lt;a href="https://gistpreview.github.io/?a11f9ac456d2b2bc3715ba900ef1203d"&gt;full terminal transcript&lt;/a&gt; - these are the highlights.&lt;/p&gt;
&lt;p&gt;This is with the default GPT-5-Codex model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I pasted it into my &lt;a href="https://tools.simonwillison.net/svg-render"&gt;tools.simonwillison.net/svg-render&lt;/a&gt; tool and got the following:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/codex-hacking-default.png" alt="It's a dumpy little pelican with a weird face, not particularly great" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I ran it again for GPT-5:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -m gpt-5&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/codex-hacking-gpt-5.png" alt="Much better bicycle, pelican is a bit line-drawing-ish but does have the necessary parts in the right places" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And now the moment of truth... GPT-5 Codex Mini!&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -m gpt-5-codex-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/codex-hacking-mini.png" alt="This is terrible. The pelican is an abstract collection of shapes, the bicycle is likewise very messed up" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I don't think I'll be adding that one to my SVG drawing toolkit any time soon.&lt;/p&gt;

&lt;h4 id="bonus-the-debug-option"&gt;Bonus: the --debug option&lt;/h4&gt;
&lt;p&gt;I had Codex add a &lt;code&gt;--debug&lt;/code&gt; option to help me see exactly what was going on.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./target/debug/codex prompt -m gpt-5-codex-mini &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --debug&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The output starts like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[codex prompt debug] POST https://chatgpt.com/backend-api/codex/responses
[codex prompt debug] Request JSON:
&lt;/code&gt;&lt;/pre&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"model"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;gpt-5-codex-mini&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"instructions"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;You are Codex, based on GPT-5. You are running as a coding agent ...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"input"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;message&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;developer&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;input_text&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;You are a helpful assistant. Respond directly to the user request without running tools or shell commands.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        }
      ]
    },
    {
      &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;message&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;user&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;input_text&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        }
      ]
    }
  ],
  &lt;span class="pl-ent"&gt;"tools"&lt;/span&gt;: [],
  &lt;span class="pl-ent"&gt;"tool_choice"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;auto&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"parallel_tool_calls"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"reasoning"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"summary"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;auto&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  &lt;span class="pl-ent"&gt;"store"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"stream"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"include"&lt;/span&gt;: [
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;reasoning.encrypted_content&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  ],
  &lt;span class="pl-ent"&gt;"prompt_cache_key"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;019a66bf-3e2c-7412-b05e-db9b90bbad6e&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This reveals that OpenAI's private API endpoint for Codex CLI is &lt;code&gt;https://chatgpt.com/backend-api/codex/responses&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Also interesting is how the &lt;code&gt;"instructions"&lt;/code&gt; key (truncated above, &lt;a href="https://gist.github.com/simonw/996388ecf785ad54de479315bd4d33b7"&gt;full copy here&lt;/a&gt;) contains the default instructions, without which the API appears not to work - but it also shows that you can send a message with &lt;code&gt;role="developer"&lt;/code&gt; in advance of your user prompt.&lt;/p&gt;
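&lt;p&gt;The same pattern works against the public Responses API too - here's a sketch of sending a developer-role message ahead of the user prompt with the OpenAI Python SDK:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    input=[
        {
            "role": "developer",
            "content": "You are a helpful assistant. Respond directly "
            "to the user request without running tools or shell commands.",
        },
        {
            "role": "user",
            "content": "Generate an SVG of a pelican riding a bicycle",
        },
    ],
)
print(response.output_text)
&lt;/code&gt;&lt;/pre&gt;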
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="rust"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="vibe-coding"/><category term="coding-agents"/><category term="gpt-5"/><category term="codex-cli"/><category term="gpt-codex"/></entry><entry><title>Kimi K2 Thinking</title><link href="https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag" rel="alternate"/><published>2025-11-06T23:53:06+00:00</published><updated>2025-11-06T23:53:06+00:00</updated><id>https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/moonshotai/Kimi-K2-Thinking"&gt;Kimi K2 Thinking&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Chinese AI lab Moonshot's Kimi K2 established itself as one of the largest open weight models - 1 trillion parameters - &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;back in July&lt;/a&gt;. They've now released the Thinking version, also a trillion parameters (MoE, 32B active) and also under their custom modified (so &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/#kimi-license"&gt;not quite open source&lt;/a&gt;) MIT license.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one is only 594GB on Hugging Face - Kimi K2 was 1.03TB - which I think is due to the new INT4 quantization. This makes the model both cheaper and faster to host.&lt;/p&gt;
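&lt;p&gt;The numbers roughly check out - a back-of-envelope sketch, ignoring the embedding layers and other tensors that are typically kept at higher precision:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;params = 1_000_000_000_000  # 1 trillion parameters

# INT4 packs two parameters per byte; the extra ~100GB in the real
# 594GB checkpoint is plausibly higher-precision tensors and metadata
int4_gb = params * 0.5 / 1e9
print(f"~{int4_gb:.0f} GB")  # ~500 GB
&lt;/code&gt;&lt;/pre&gt;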
&lt;p&gt;So far the only people hosting it are Moonshot themselves. I tried it out both via &lt;a href="https://platform.moonshot.ai"&gt;their own API&lt;/a&gt; and via &lt;a href="https://openrouter.ai/moonshotai/kimi-k2-thinking/providers"&gt;the OpenRouter proxy to it&lt;/a&gt;, via the &lt;a href="https://github.com/ghostofpokemon/llm-moonshot"&gt;llm-moonshot&lt;/a&gt; plugin (by NickMystic) and my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin respectively.&lt;/p&gt;
&lt;p&gt;The buzz around this model so far is very positive. Could this be the first open weight model that's competitive with the latest from OpenAI and Anthropic, especially for long-running agentic tool call sequences?&lt;/p&gt;
&lt;p&gt;Moonshot AI's &lt;a href="https://moonshotai.github.io/Kimi-K2/thinking.html"&gt;self-reported benchmark scores&lt;/a&gt; show K2 Thinking beating the top OpenAI and Anthropic models (GPT-5 and Sonnet 4.5 Thinking) at "Agentic Reasoning" and "Agentic Search" but not quite top for "Coding":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison bar chart showing agentic reasoning, search, and coding benchmark performance scores across three AI systems (K, OpenAI, and AI) on tasks including Humanity's Last Exam (44.9, 41.7, 32.0), BrowseComp (60.2, 54.9, 24.1), Seal-0 (56.3, 51.4, 53.4), SWE-Multilingual (61.1, 55.3, 68.0), SWE-bench Verified (71.3, 74.9, 77.2), and LiveCodeBench V6 (83.1, 87.0, 64.0), with category descriptions including &amp;quot;Expert-level questions across subjects&amp;quot;, &amp;quot;Agentic search &amp;amp; browsing&amp;quot;, &amp;quot;Real-world latest information collection&amp;quot;, &amp;quot;Agentic coding&amp;quot;, and &amp;quot;Competitive programming&amp;quot;." src="https://static.simonwillison.net/static/2025/kimi-k2-thinking-benchmarks.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I ran a couple of pelican tests:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-moonshot
llm keys set moonshot # paste key
llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5 described this as: Cartoon illustration of a white duck or goose with an orange beak and gray wings riding a bicycle with a red frame and light blue wheels against a light blue background." src="https://static.simonwillison.net/static/2025/k2-thinking.png" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter # paste key
llm -m openrouter/moonshotai/kimi-k2-thinking \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5: Minimalist cartoon illustration of a white bird with an orange beak and feet standing on a triangular-framed penny-farthing style bicycle with gray-hubbed wheels and a propeller hat on its head, against a light background with dotted lines and a brown ground line." src="https://static.simonwillison.net/static/2025/k2-thinking-openrouter.png" /&gt;&lt;/p&gt;
&lt;p&gt;Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1986541785511043536"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CNBC quoted a source who &lt;a href="https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-releases-new-ai-model-kimi-k2-thinking.html"&gt;provided the training price&lt;/a&gt; for the model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Kimi K2 Thinking model cost $4.6 million to train, according to a source familiar with the matter. [...] CNBC was unable to independently verify the DeepSeek or Kimi figures.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MLX developer Awni Hannun &lt;a href="https://x.com/awnihannun/status/1986601104130646266"&gt;got it working&lt;/a&gt; on two 512GB M3 Ultra Mac Studios:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new 1 Trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format - no loss in quality!&lt;/p&gt;
&lt;p&gt;The model was quantization aware trained (qat) at int4.&lt;/p&gt;
&lt;p&gt;Here it generated ~3500 tokens at 15 toks/sec using pipeline-parallelism in mlx-lm&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://huggingface.co/mlx-community/Kimi-K2-Thinking"&gt;the 658GB mlx-community model&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="llm-reasoning"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="moonshot"/><category term="llm-release"/><category term="generative-ai"/><category term="openrouter"/><category term="kimi"/><category term="artificial-analysis"/><category term="mlx"/></entry><entry><title>Introducing SWE-1.5: Our Fast Agent Model</title><link href="https://simonwillison.net/2025/Oct/29/swe-15/#atom-tag" rel="alternate"/><published>2025-10-29T23:59:20+00:00</published><updated>2025-10-29T23:59:20+00:00</updated><id>https://simonwillison.net/2025/Oct/29/swe-15/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cognition.ai/blog/swe-1-5"&gt;Introducing SWE-1.5: Our Fast Agent Model&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Here's the second fast coding model released by a coding agent IDE on the same day - the first was &lt;a href="https://simonwillison.net/2025/Oct/29/cursor-composer/"&gt;Composer-1 by Cursor&lt;/a&gt;. This time it's Windsurf releasing SWE-1.5:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today we’re releasing SWE-1.5, the latest in our family of models optimized for software engineering. It is a frontier-size model with hundreds of billions of parameters that achieves near-SOTA coding performance. It also sets a new standard for speed: we partnered with Cerebras to serve it at up to 950 tok/s – 6x faster than Haiku 4.5 and 13x faster than Sonnet 4.5.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Like Composer-1 it's only available via their editor, no separate API yet. Also like Composer-1 they don't appear willing to share details of the "leading open-source base model" they based their new model on.&lt;/p&gt;
&lt;p&gt;I asked it to generate an SVG of a pelican riding a bicycle and got this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has a red upside down Y shaped frame, pelican is a bit dumpy, it does at least have a long sharp beak." src="https://static.simonwillison.net/static/2025/swe-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;This one felt &lt;em&gt;really fast&lt;/em&gt;. Partnering with Cerebras for inference is a very smart move.&lt;/p&gt;
&lt;p&gt;They share a lot of details about their training process in the post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;SWE-1.5 is trained on our state-of-the-art cluster of thousands of GB200 NVL72 chips. We believe SWE-1.5 may be the first public production model trained on the new GB200 generation. [...]&lt;/p&gt;
&lt;p&gt;Our RL rollouts require high-fidelity environments with code execution and even web browsing. To achieve this, we leveraged our VM hypervisor &lt;code&gt;otterlink&lt;/code&gt; that  allows us to scale &lt;strong&gt;Devin&lt;/strong&gt; to tens of thousands of concurrent machines (learn more about &lt;a href="https://cognition.ai/blog/blockdiff#why-incremental-vm-snapshots"&gt;blockdiff&lt;/a&gt;). This enabled us to smoothly support very high concurrency and ensure the training environment is aligned with our Devin production environments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's &lt;em&gt;another&lt;/em&gt; similarity to Cursor's Composer-1! Cursor talked about how they ran "hundreds of thousands of concurrent sandboxed coding environments in the cloud" in &lt;a href="https://cursor.com/blog/composer"&gt;their description of their RL training&lt;/a&gt; as well.&lt;/p&gt;
&lt;p&gt;This is a notable trend: if you want to build a really great agentic coding tool there's clearly a lot to be said for using reinforcement learning to fine-tune a model against your own custom set of tools using large numbers of sandboxed simulated coding environments as part of that process.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://x.com/zai_org/status/1984076614951420273"&gt;I think it's built on GLM&lt;/a&gt;.

&lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/cognition/status/1983662838955831372"&gt;@cognition&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-performance"/><category term="llm-release"/><category term="coding-agents"/><category term="ai-assisted-programming"/><category term="generative-ai"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="glm"/></entry><entry><title>MiniMax M2 &amp; Agent: Ingenious in Simplicity</title><link href="https://simonwillison.net/2025/Oct/29/minimax-m2/#atom-tag" rel="alternate"/><published>2025-10-29T22:49:47+00:00</published><updated>2025-10-29T22:49:47+00:00</updated><id>https://simonwillison.net/2025/Oct/29/minimax-m2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.minimax.io/news/minimax-m2"&gt;MiniMax M2 &amp;amp; Agent: Ingenious in Simplicity&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
MiniMax M2 was released on Monday 27th October by MiniMax, a Chinese AI lab founded in December 2021.&lt;/p&gt;
&lt;p&gt;It's a very promising model. Their self-reported benchmark scores show it as comparable to Claude Sonnet 4, and Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1982714153375854998"&gt;are ranking it&lt;/a&gt; as the best currently available open weight model according to their intelligence score:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;MiniMax’s M2 achieves a new all-time-high Intelligence Index score for an open weights model and offers impressive efficiency with only 10B active parameters (200B total). [...]&lt;/p&gt;
&lt;p&gt;The model’s strengths include tool use and instruction following (as shown by Tau2 Bench and IFBench). As such, while M2 likely excels at agentic use cases it may underperform other open weights leaders such as DeepSeek V3.2 and Qwen3 235B at some generalist tasks. This is in line with a number of recent open weights model releases from Chinese AI labs which focus on agentic capabilities, likely pointing to a heavy post-training emphasis on RL.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The size is particularly significant: the model weights are 230GB &lt;a href="https://huggingface.co/MiniMaxAI/MiniMax-M2"&gt;on Hugging Face&lt;/a&gt;, significantly smaller than other high performing open weight models. That's small enough to run on a 256GB Mac Studio, and the MLX community &lt;a href="https://huggingface.co/mlx-community/MiniMax-M2-8bit"&gt;have that working already&lt;/a&gt;.&lt;/p&gt;
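&lt;p&gt;If you have one of those machines, trying the community conversion with the mlx-lm Python API might look like this - a sketch I haven't run myself, using the mlx-community model ID linked above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch only - needs a Mac with ~256GB of unified memory
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/MiniMax-M2-8bit")
print(generate(
    model,
    tokenizer,
    prompt="Generate an SVG of a pelican riding a bicycle",
    max_tokens=10_000,
))
&lt;/code&gt;&lt;/pre&gt;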
&lt;p&gt;MiniMax offer their own API, and recommend using their Anthropic-compatible endpoint and the official Anthropic SDKs to access it. MiniMax Head of Engineering Skyler Miao
 &lt;a href="https://x.com/SkylerMiao7/status/1982989507252367687"&gt;provided some background on that&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;M2 is a agentic thinking model, it do interleaved thinking like sonnet 4.5, which means every response will contain its thought content.
Its very important for M2 to keep the chain of thought. So we must make sure the history thought passed back to the model.
Anthropic API support it for sure, as sonnet needs it as well. OpenAI only support it in their new Response API, no support for in ChatCompletion.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MiniMax are offering the new model via their API for free until November 7th, after which the cost will be $0.30/million input tokens and $1.20/million output tokens - similar in price to Gemini 2.5 Flash and GPT-5 Mini, see &lt;a href="https://www.llm-prices.com/#it=51&amp;amp;ot=4017&amp;amp;sel=minimax-m2%2Cgpt-5-mini%2Cclaude-3-haiku%2Cgemini-2.5-flash-lite%2Cgemini-2.5-flash"&gt;price comparison here&lt;/a&gt; on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;I released a new plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; called &lt;a href="https://github.com/simonw/llm-minimax"&gt;llm-minimax&lt;/a&gt; providing support for M2 via the MiniMax API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-minimax
llm keys set minimax
# Paste key here
llm -m m2 -o max_tokens 10000 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/da79447830dc431c067a93648b338be6"&gt;the result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Biycle is good though obscured by the pelican. Pelican has an impressive triple beak and is stretched along the bicycle frame. Not clear if it can pedal or what it is sitting on." src="https://static.simonwillison.net/static/2025/m2-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;51 input, 4,017 output. At $0.30/m input and $1.20/m output that pelican would cost 0.4836 cents - less than half a cent.&lt;/p&gt;
&lt;p&gt;This is the first plugin I've written for an Anthropic-API-compatible model. I released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.21"&gt;llm-anthropic 0.21&lt;/a&gt; first adding the ability to customize the &lt;code&gt;base_url&lt;/code&gt; parameter when using that model class. This meant the new plugin was less than &lt;a href="https://github.com/simonw/llm-minimax/blob/0.1/llm_minimax.py"&gt;30 lines of Python&lt;/a&gt;.
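&lt;p&gt;The core trick looks something like this - a sketch using the Anthropic Python SDK with a custom &lt;code&gt;base_url&lt;/code&gt;, where the endpoint URL and model ID are illustrative placeholders rather than MiniMax's documented values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from anthropic import Anthropic

# Placeholder values - substitute MiniMax's documented
# Anthropic-compatible endpoint and model name
client = Anthropic(
    base_url="https://api.minimax.example/anthropic",
    api_key="YOUR_MINIMAX_API_KEY",
)

message = client.messages.create(
    model="MiniMax-M2",
    max_tokens=10000,
    messages=[{
        "role": "user",
        "content": "Generate an SVG of a pelican riding a bicycle",
    }],
)
print(message.content[0].text)
&lt;/code&gt;&lt;/pre&gt;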


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/minimax"&gt;minimax&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="generative-ai"/><category term="pelican-riding-a-bicycle"/><category term="llm-pricing"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="local-llms"/><category term="llm"/><category term="minimax"/></entry><entry><title>Composer: Building a fast frontier model with RL</title><link href="https://simonwillison.net/2025/Oct/29/cursor-composer/#atom-tag" rel="alternate"/><published>2025-10-29T20:45:53+00:00</published><updated>2025-10-29T20:45:53+00:00</updated><id>https://simonwillison.net/2025/Oct/29/cursor-composer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cursor.com/blog/composer"&gt;Composer: Building a fast frontier model with RL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Cursor released &lt;a href="https://cursor.com/blog/2-0"&gt;Cursor 2.0 today&lt;/a&gt;, with a refreshed UI focused on agentic coding (and running agents in parallel) and a new model that's unique to Cursor called &lt;strong&gt;Composer&amp;nbsp;1&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;As far as I can tell there's no way to call the model directly via an API, so I fired up "Ask" mode in Cursor's chat side panel and asked it to "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Cursor 2 - In the chat panel I have asked the question and it spat out a bunch of SVG." src="https://static.simonwillison.net/static/2025/cursor-2.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/e5c9176f153ca718370055ecd256fe70"&gt;the result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is levitating against a blue sky. The pelican looks a little bit more like a baby chicken but does at least have a long beak." src="https://static.simonwillison.net/static/2025/cursor-1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;The notable thing about Composer-1 is that it is designed to be &lt;em&gt;fast&lt;/em&gt;. The pelican certainly came back quickly, and in their announcement they describe it as being "4x faster than similarly intelligent models".&lt;/p&gt;
&lt;p&gt;It's interesting to see Cursor investing resources in training their own code-specific model - similar to &lt;a href="https://openai.com/index/introducing-upgrades-to-codex/"&gt;GPT-5-Codex&lt;/a&gt; or &lt;a href="https://github.com/QwenLM/Qwen3-Coder"&gt;Qwen3-Coder&lt;/a&gt;. From their post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Composer is a mixture-of-experts (MoE) language model supporting long-context generation and understanding. It is specialized for software engineering through reinforcement learning (RL) in a diverse range of development environments. [...]&lt;/p&gt;
&lt;p&gt;Efficient training of large MoE models requires significant investment into building infrastructure and systems research. We built custom training infrastructure leveraging PyTorch and Ray to power asynchronous reinforcement learning at scale. We natively train our models at low precision by combining our &lt;a href="https://cursor.com/blog/kernels"&gt;MXFP8 MoE kernels&lt;/a&gt; with expert parallelism and hybrid sharded data parallelism, allowing us to scale training to thousands of NVIDIA GPUs with minimal communication cost. [...]&lt;/p&gt;
&lt;p&gt;During RL, we want our model to be able to call any tool in the Cursor Agent harness. These tools allow editing code, using semantic search, grepping strings, and running terminal commands. At our scale, teaching the model to effectively call these tools requires running hundreds of thousands of concurrent sandboxed coding environments in the cloud.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One detail that's notably absent from their description: did they train the model from scratch, or did they start with an existing open-weights model such as something from Qwen or GLM?&lt;/p&gt;
&lt;p&gt;Cursor researcher Sasha Rush has been answering questions &lt;a href="https://news.ycombinator.com/item?id=45748725"&gt;on Hacker News&lt;/a&gt;, but has so far been evasive in answering questions about the base model. When directly asked "is Composer a fine tune of an existing open source base model?" they replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our primary focus is on RL post-training. We think that is the best way to get the model to be a strong interactive agent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sasha &lt;a href="https://news.ycombinator.com/item?id=45748725#45750784"&gt;did confirm&lt;/a&gt; that rumors that an earlier Cursor preview model, Cheetah, was based on xAI's Grok were "Straight up untrue."&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45748725"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cursor"&gt;cursor&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parallel-agents"&gt;parallel-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-performance"/><category term="llm-release"/><category term="cursor"/><category term="pelican-riding-a-bicycle"/><category term="generative-ai"/><category term="ai"/><category term="llms"/><category term="parallel-agents"/><category term="coding-agents"/><category term="ai-assisted-programming"/></entry><entry><title>Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code</title><link href="https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-code/#atom-tag" rel="alternate"/><published>2025-10-20T17:21:52+00:00</published><updated>2025-10-20T17:21:52+00:00</updated><id>https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-code/#atom-tag</id><summary type="html">
    &lt;p&gt;DeepSeek released a new model yesterday: &lt;a href="https://github.com/deepseek-ai/DeepSeek-OCR"&gt;DeepSeek-OCR&lt;/a&gt;, a 6.6GB model fine-tuned specifically for OCR. They released it as model weights that run using PyTorch and CUDA. I got it running on the NVIDIA Spark by having Claude Code effectively brute force the challenge of getting it working on that particular hardware.&lt;/p&gt;
&lt;p&gt;This small project (40 minutes this morning, most of which was Claude Code churning away while I had breakfast and did some other things) ties together a bunch of different concepts I've been exploring recently. I &lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/"&gt;designed an agentic loop&lt;/a&gt; for the problem, gave Claude full permissions inside a Docker sandbox, embraced the &lt;a href="https://simonwillison.net/2025/Oct/5/parallel-coding-agents/"&gt;parallel agents lifestyle&lt;/a&gt; and reused my &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/"&gt;notes on the NVIDIA Spark&lt;/a&gt; from last week.&lt;/p&gt;
&lt;p&gt;I knew getting a PyTorch CUDA model running on the Spark was going to be a little frustrating, so I decided to outsource the entire process to Claude Code to see what would happen.&lt;/p&gt;
&lt;p&gt;TLDR: It worked. It took four prompts (one long, three very short) to have Claude Code figure out everything necessary to run the new DeepSeek model on the NVIDIA Spark, OCR a document for me and produce &lt;em&gt;copious&lt;/em&gt; notes about the process.&lt;/p&gt;
&lt;h4 id="the-setup"&gt;The setup&lt;/h4&gt;
&lt;p&gt;I connected to the Spark from my Mac via SSH and started a new Docker container there:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I installed npm and used that to install Claude Code:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;apt-get update
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
npm install -g @anthropic-ai/claude-code&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then started Claude Code, telling it that it's OK that it's running as &lt;code&gt;root&lt;/code&gt; because it's in a sandbox:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;IS_SANDBOX=1 claude --dangerously-skip-permissions&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It provided me with a URL to click on to authenticate with my Anthropic account.&lt;/p&gt;
&lt;h4 id="the-initial-prompts"&gt;The initial prompts&lt;/h4&gt;
&lt;p&gt;I kicked things off with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a folder deepseek-ocr and do everything else in that folder&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I ran the following, linking to both the GitHub repository and the Hugging Face model, dropping a clue about the NVIDIA ARM platform, and giving it an image (&lt;a href="https://static.simonwillison.net/static/2025/ft.jpeg"&gt;this one&lt;/a&gt;, see &lt;a href="https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-coding/"&gt;previous post&lt;/a&gt;) that I wanted it to run OCR on.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Your task is to get this working: &lt;a href="https://github.com/deepseek-ai/DeepSeek-OCR"&gt;https://github.com/deepseek-ai/DeepSeek-OCR&lt;/a&gt; - it uses Hugging Face Transformers and models from &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR"&gt;https://huggingface.co/deepseek-ai/DeepSeek-OCR&lt;/a&gt; - you are running in a Docker container on an NVIDIA ARM device, first research your environment to check it will be possible to do this. Git clone both of those - you may need to enable LFS first. Make extensive notes as you go in a notes.md file which you should only ever append new notes to. At the end of this you should have scripts in the folder which can be run to both setup and then run the model against an example - use &lt;a href="https://static.simonwillison.net/static/2025/ft.jpeg"&gt;https://static.simonwillison.net/static/2025/ft.jpeg&lt;/a&gt; as the example image that you OCR. As a final step create a detailed README.md file explaining the project and what you learned and how to run your code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note how I also told it to "make extensive notes as you go in a notes.md file which you should only ever append new notes to" so I'd have some useful notes at the end.&lt;/p&gt;
&lt;h4 id="the-follow-ups"&gt;The follow-ups&lt;/h4&gt;
&lt;p&gt;It churned away for 13 minutes, making a sequence of discoveries about the platform, the model, and the difficulties of running PyTorch on it.&lt;/p&gt;
&lt;p&gt;And then... &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-assistant--2025-10-20-154706"&gt;it gave up&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The NVIDIA GB10 GPU has compute capability sm_121, which is too new for PyTorch 2.5.1&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PyTorch 2.5.1 supports: sm_50, sm_80, sm_86, sm_89, sm_90, sm_90a&lt;/li&gt;
&lt;li&gt;GB10 requires: sm_121&lt;/li&gt;
&lt;li&gt;Error: &lt;code&gt;CUDA error: no kernel image is available for execution on the device&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, the model has hardcoded &lt;code&gt;.cuda()&lt;/code&gt; calls, preventing CPU fallback without code modifications.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I gave it some encouragement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Any other options that might help get CUDA working like using different versions of PyTorch that have available CUDA wheels for this platform? Have you checked to see what PyTorch versions already have these ARM CUDA wheels that would work here?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It ran this command to grep for all available PyTorch wheels on &lt;code&gt;download.pytorch.org&lt;/code&gt; that match the ARM architecture:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -s https://download.pytorch.org/whl/torch/ &lt;span class="pl-k"&gt;|&lt;/span&gt; grep -o &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;torch-[0-9.]*.*aarch64.whl&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -V &lt;span class="pl-k"&gt;|&lt;/span&gt; tail -20&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;... and that unblocked it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Excellent! I can see that PyTorch 2.9.0 has ARM64 wheels for CUDA 12.8, 12.9, and &lt;strong&gt;CUDA 13.0&lt;/strong&gt; (cu130)! This is exactly what we need!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It installed the wheels it found and reported:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Good news! PyTorch 2.9.0 can detect the GB10 GPU with CUDA capability 12.1 (sm_121), but it says the maximum supported is 12.0. However, it's just a warning now instead of a hard error! Let's try running inference: [...]&lt;/p&gt;
&lt;/blockquote&gt;
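&lt;p&gt;For reference, you can reproduce that check yourself with a few lines of PyTorch. This is my own minimal sketch, not code from the transcript:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

# Which build is installed, and which CUDA toolkit it was compiled against
print(torch.__version__, torch.version.cuda)

# Can PyTorch see the GPU at all?
print(torch.cuda.is_available())

# Device name and compute capability - the Spark's GB10 reports (12, 1),
# i.e. sm_121, the architecture behind the original "no kernel image" error
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
&lt;/code&gt;&lt;/pre&gt;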
&lt;p&gt;It then kept on churning for a bit, ignoring the warnings, and managed to &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-user--2025-10-20-155056"&gt;run the model&lt;/a&gt;! It loaded the image and ran this prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;|grounding|&amp;gt;OCR this image.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And got back output that started like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;|ref|&amp;gt;The perils of vibe coding&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[352, 30, 624, 111]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;opt&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 0, 994, 30]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;such&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 24, 996, 58]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;days&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 52, 996, 87]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;pavi&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[970, 85, 996, 118]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;TECHNOLOGY&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[33, 199, 127, 230]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;holds the promise of replacing program-&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 170, 390, 210]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;mers - a profession of very well paid&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 199, 390, 238]]&amp;lt;|/det|&amp;gt;
&amp;lt;|ref|&amp;gt;people.For another, the work can be&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[220, 228, 390, 267]]&amp;lt;|/det|&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
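&lt;p&gt;That grounding format is straightforward to post-process. Here's a minimal sketch of my own (not one of Claude's generated scripts) that parses it into text and bounding-box pairs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import re

# Matches &amp;lt;|ref|&amp;gt;text&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[x1, y1, x2, y2]]&amp;lt;|/det|&amp;gt; pairs
GROUNDING = re.compile(
    r"&amp;lt;\|ref\|&amp;gt;(.*?)&amp;lt;\|/ref\|&amp;gt;&amp;lt;\|det\|&amp;gt;(\[\[.*?\]\])&amp;lt;\|/det\|&amp;gt;"
)

def parse_grounding(raw):
    """Yield (text, [x1, y1, x2, y2]) pairs from grounding-mode output."""
    for text, det in GROUNDING.findall(raw):
        for box in json.loads(det):  # the det payload is valid JSON
            yield text, box

sample = "&amp;lt;|ref|&amp;gt;The perils of vibe coding&amp;lt;|/ref|&amp;gt;&amp;lt;|det|&amp;gt;[[352, 30, 624, 111]]&amp;lt;|/det|&amp;gt;"
for text, box in parse_grounding(sample):
    print(text, box)  # The perils of vibe coding [352, 30, 624, 111]
&lt;/code&gt;&lt;/pre&gt;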
&lt;p&gt;Claude declared victory and pointed me to the &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/output/result.mmd?short_path=f43da6a"&gt;output/result.mmd&lt;/a&gt; file, which contained only whitespace. So the OCR itself had worked, but the text output hadn't been written to disk correctly.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-user--2025-10-20-155826"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That result.mmd file looks to be exclusively whitespace to me, try and get a better result there - the bounding box image was good though&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which caused Claude to &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/claude-log.md#-assistant--2025-10-20-155900"&gt;reconsider&lt;/a&gt; the prompt it used:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Maybe we need to use a different prompt. Looking at the README from DeepSeek-OCR, there are different prompts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\n&amp;lt;|grounding|&amp;gt;Convert the document to markdown."&lt;/code&gt; - for documents&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\n&amp;lt;|grounding|&amp;gt;OCR this image."&lt;/code&gt; - general OCR&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"&amp;lt;image&amp;gt;\nFree OCR."&lt;/code&gt; - without layouts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The "grounding" mode seems to be focused on bounding boxes, which is why we got good bounding box results but the text output might be in a different format.&lt;/p&gt;
&lt;p&gt;Let me try running it again with a different prompt that's more focused on text extraction rather than grounding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It tried all of the prompt variants, &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/notes.md#text-output-improvement-2025-10-20-1601"&gt;made a note about them&lt;/a&gt; and created a file called &lt;a href="https://github.com/simonw/research/blob/main/deepseek-ocr-nvidia-spark/PROMPTS_GUIDE.md"&gt;PROMPTS_GUIDE.md&lt;/a&gt; with a bunch more details, including this comparison table:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Text Quality&lt;/th&gt;
&lt;th&gt;Structure&lt;/th&gt;
&lt;th&gt;Coordinates&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡⚡⚡ Fast&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Excellent&lt;/td&gt;
&lt;td&gt;⭐ Basic&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;General OCR&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;⚡⚡ Medium&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Excellent&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Full&lt;/td&gt;
&lt;td&gt;⭐⭐ Partial&lt;/td&gt;
&lt;td&gt;Documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grounding&lt;/td&gt;
&lt;td&gt;⚡ Slow&lt;/td&gt;
&lt;td&gt;⭐⭐ Good&lt;/td&gt;
&lt;td&gt;⭐ Basic&lt;/td&gt;
&lt;td&gt;⭐⭐⭐ Full&lt;/td&gt;
&lt;td&gt;Annotations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detailed&lt;/td&gt;
&lt;td&gt;⚡⚡⚡ Fastest&lt;/td&gt;
&lt;td&gt;⭐ N/A&lt;/td&gt;
&lt;td&gt;❌ N/A&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Image analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;And this table under the heading "Performance Benchmarks":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Test image: 3503×1668 pixels (Financial Times article)&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Output Size&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free OCR&lt;/td&gt;
&lt;td&gt;24s&lt;/td&gt;
&lt;td&gt;Clean text&lt;/td&gt;
&lt;td&gt;2257&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;39s&lt;/td&gt;
&lt;td&gt;Formatted MD&lt;/td&gt;
&lt;td&gt;2257 + structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grounding&lt;/td&gt;
&lt;td&gt;58s&lt;/td&gt;
&lt;td&gt;Text + coords&lt;/td&gt;
&lt;td&gt;2257 + boxes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detailed&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;Description&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;My final prompt was this, to gather everything together into a zip file I could extract from the Docker container:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a zip file with the output and output_text and all of the scripts and notes - but leave out the github repo and the huggingface repo directories&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I added the contents of that zip file to my new &lt;a href="https://github.com/simonw/research"&gt;simonw/research&lt;/a&gt; GitHub repo &lt;a href="https://github.com/simonw/research/tree/main/deepseek-ocr-nvidia-spark"&gt;in the deepseek-ocr-nvidia-spark&lt;/a&gt; folder.&lt;/p&gt;
&lt;p&gt;Claude really likes writing notes! Here's the directory listing of that finished folder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  |-- download_test_image.sh
  |-- FINAL_SUMMARY.md
  |-- notes.md
  |-- output
  |   |-- images
  |   |-- result_with_boxes.jpg
  |   `-- result.mmd
  |-- output_text
  |   |-- detailed
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   |-- free_ocr
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   `-- markdown
  |       |-- images
  |       |   `-- 0.jpg
  |       |-- result_with_boxes.jpg
  |       `-- result.mmd
  |-- PROMPTS_GUIDE.md
  |-- README_SUCCESS.md
  |-- README.md
  |-- run_ocr_best.py
  |-- run_ocr_cpu_nocuda.py
  |-- run_ocr_cpu.py
  |-- run_ocr_text_focused.py
  |-- run_ocr.py
  |-- run_ocr.sh
  |-- setup.sh
  |-- SOLUTION.md
  |-- test_image.jpeg
  |-- TEXT_OUTPUT_SUMMARY.md
  `-- UPDATE_PYTORCH.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id="takeaways"&gt;Takeaways&lt;/h4&gt;
&lt;p&gt;My first prompt was at 15:31:07 (UTC). The final message from Claude Code came in at 16:10:03. That means it took less than 40 minutes start to finish, and I was only actively involved for about 5-10 minutes of that time. The rest of the time I was having breakfast and doing other things.&lt;/p&gt;
&lt;p&gt;Having tried and failed to get PyTorch stuff working in the past, I count this as a &lt;em&gt;huge&lt;/em&gt; win. I'll be using this process a whole lot more in the future.&lt;/p&gt;
&lt;p&gt;How good were the actual results? There's honestly so much material in the notes Claude created that I haven't reviewed all of it. There may well be all sorts of errors in there, but it's indisputable that it got the model running, and its notes are detailed enough that I'll be able to do the same thing again in the future.&lt;/p&gt;
&lt;p&gt;I think the key factors in executing this project successfully were the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;I gave it exactly what it needed: a Docker environment on the target hardware, instructions on where to find what it needed (the code and the model), and a clear goal to pursue. This is a great example of the pattern I described in &lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/"&gt;designing agentic loops&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Running it in a Docker sandbox meant I could use &lt;code&gt;claude --dangerously-skip-permissions&lt;/code&gt; and leave it running on its own. If I'd had to approve every command it wanted to run I would have got frustrated and quit the project after just a few minutes.&lt;/li&gt;
&lt;li&gt;I applied my own knowledge and experience when it got stuck. I was confident (based on &lt;a href="https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/#claude-code-for-everything"&gt;previous experiments&lt;/a&gt; with the Spark) that a CUDA wheel for ARM64 existed that was likely to work, so when it gave up I prompted it to try again, leading to success.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Oh, and it looks like DeepSeek OCR is a pretty good model if you spend the time experimenting with different ways to run it.&lt;/p&gt;
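&lt;p&gt;If you want to experiment with it yourself, the heart of the generated run scripts boils down to something like the following. The &lt;code&gt;infer()&lt;/code&gt; method and its arguments come from the DeepSeek-OCR README, so treat the exact signature as an assumption rather than a guarantee:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-OCR"

# trust_remote_code is required: the OCR pipeline ships as custom code
# alongside the weights on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# "Free OCR" was the prompt variant that produced the cleanest plain text
result = model.infer(
    tokenizer,
    prompt="&amp;lt;image&amp;gt;\nFree OCR.",
    image_file="test_image.jpeg",
    output_path="output_text/free_ocr",
    save_results=True,
)
&lt;/code&gt;&lt;/pre&gt;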
&lt;h4 id="bonus-using-vs-code-to-monitor-the-container"&gt;Bonus: Using VS Code to monitor the container&lt;/h4&gt;
&lt;p&gt;A small TIL from today: I had kicked off the job running in the Docker container via SSH to the Spark when I realized it would be neat if I could easily monitor the files it was creating while it was running.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://claude.ai/share/68a0ebff-b586-4278-bd91-6b715a657d2b"&gt;asked Claude.ai&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I am running a Docker container on a remote machine, which I started over SSH&lt;/p&gt;
&lt;p&gt;How can I have my local VS Code on MacOS show me the filesystem in that docker container inside that remote machine, without restarting anything?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It gave me a set of steps that solved this exact problem:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install the VS Code "Remote SSH" and "Dev Containers" extensions&lt;/li&gt;
&lt;li&gt;Use "Remote-SSH: Connect to Host" to connect to the remote machine (on my Tailscale network that's &lt;code&gt;spark@100.113.1.114&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;In the window for that remote SSH session, run "Dev Containers: Attach to Running Container" - this shows a list of containers and you can select the one you want to attach to&lt;/li&gt;
&lt;li&gt;... and that's it! VS Code opens a new window providing full access to all of the files in that container. I opened up &lt;code&gt;notes.md&lt;/code&gt; and watched it as Claude Code appended to it in real time.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At the end, when I told Claude to create a zip file of the results, I could select that file in the VS Code file explorer and use the "Download" menu item to download it to my Mac.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytorch"&gt;pytorch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vs-code"&gt;vs-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia-spark"&gt;nvidia-spark&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ocr"/><category term="python"/><category term="ai"/><category term="docker"/><category term="pytorch"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="nvidia"/><category term="vs-code"/><category term="vision-llms"/><category term="deepseek"/><category term="llm-release"/><category term="coding-agents"/><category term="claude-code"/><category term="ai-in-china"/><category term="nvidia-spark"/></entry><entry><title>Introducing Claude Haiku 4.5</title><link href="https://simonwillison.net/2025/Oct/15/claude-haiku-45/#atom-tag" rel="alternate"/><published>2025-10-15T19:36:34+00:00</published><updated>2025-10-15T19:36:34+00:00</updated><id>https://simonwillison.net/2025/Oct/15/claude-haiku-45/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-haiku-4-5"&gt;Introducing Claude Haiku 4.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Anthropic released Claude Haiku 4.5 today, the cheapest member of the Claude 4.5 family that started with Sonnet 4.5 &lt;a href="https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/"&gt;a couple of weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's priced at $1/million input tokens and $5/million output tokens, slightly more expensive than Haiku 3.5 ($0.80/$4) and a &lt;em&gt;lot&lt;/em&gt; more expensive than the original Claude 3 Haiku ($0.25/$1.25), both of which remain available at those prices.&lt;/p&gt;
&lt;p&gt;It's a third of the price of Sonnet 4 and Sonnet 4.5 (both $3/$15), which is notable because Anthropic's benchmarks put it in a similar space to that older Sonnet 4 model. As they put it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What was recently at the frontier is now cheaper and faster. Five months ago, Claude Sonnet 4 was a state-of-the-art model. Today, Claude Haiku 4.5 gives you similar levels of coding performance but at one-third the cost and more than twice the speed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've been hoping to see Anthropic release a fast, inexpensive model that's price competitive with the cheapest models from OpenAI and Gemini, currently $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite). Haiku 4.5 certainly isn't that; it looks like they're continuing to focus squarely on the "great at code" part of the market.&lt;/p&gt;
&lt;p&gt;The new Haiku is the first Haiku model to support reasoning. It sports a 200,000 token context window, a 64,000 token maximum output (up from just 8,192 for Haiku 3.5), and a "reliable knowledge cutoff" of February 2025, one month later than the January 2025 date for Sonnet 4 and 4.5 and Opus 4 and 4.1.&lt;/p&gt;
&lt;p&gt;Something that caught my eye in the accompanying &lt;a href="https://assets.anthropic.com/m/99128ddd009bdcb/original/Claude-Haiku-4-5-System-Card.pdf"&gt;system card&lt;/a&gt; was this note about context length:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For Claude Haiku 4.5, we trained the model to be explicitly context-aware, with precise information about how much context-window has been used. This has two effects: the model learns when and how to wrap up its answer when the limit is approaching, and the model learns to continue reasoning more persistently when the limit is further away. We found this intervention—along with others—to be effective at limiting agentic “laziness” (the phenomenon where models stop working on a problem prematurely, give incomplete answers, or cut corners on tasks).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've added the new price to &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;, released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.20"&gt;llm-anthropic 0.20&lt;/a&gt; with the new model and updated my &lt;a href="https://tools.simonwillison.net/haiku"&gt;Haiku-from-your-webcam&lt;/a&gt; demo (&lt;a href="https://github.com/simonw/tools/blob/main/haiku.html"&gt;source&lt;/a&gt;) to use Haiku 4.5 as well.&lt;/p&gt;
&lt;p&gt;Here's &lt;code&gt;llm -m claude-haiku-4.5 'Generate an SVG of a pelican riding a bicycle'&lt;/code&gt; (&lt;a href="https://gist.github.com/simonw/31256c523fa502eeb303b8e0bbe30eee"&gt;transcript&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Described by Haiku 4.5: A whimsical illustration of a bird with a round tan body, pink beak, and orange legs riding a bicycle against a blue sky and green grass background." src="https://static.simonwillison.net/static/2025/claude-haiku-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;18 input tokens and 1513 output tokens = &lt;a href="https://www.llm-prices.com/#it=18&amp;amp;ot=1513&amp;amp;ic=1&amp;amp;oc=5"&gt;0.7583 cents&lt;/a&gt;.&lt;/p&gt;
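&lt;p&gt;For completeness, here's the equivalent call through LLM's Python API (assuming &lt;code&gt;llm-anthropic&lt;/code&gt; is installed and an Anthropic key is configured), plus the arithmetic behind that cost figure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import llm

model = llm.get_model("claude-haiku-4.5")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
print(response.text())

# Haiku 4.5 pricing: $1/million input tokens, $5/million output tokens
input_tokens, output_tokens = 18, 1513
cost_cents = (input_tokens * 1 + output_tokens * 5) / 1_000_000 * 100
print(f"{cost_cents:.4f} cents")  # 0.7583 cents
&lt;/code&gt;&lt;/pre&gt;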

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45595403"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-reasoning"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="generative-ai"/><category term="pelican-riding-a-bicycle"/></entry></feed>