<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llm</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llm.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-03T21:53:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>Gemini 3.1 Flash-Lite</title><link href="https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag" rel="alternate"/><published>2026-03-03T21:53:54+00:00</published><updated>2026-03-03T21:53:54+00:00</updated><id>https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Google's latest model is an update to their inexpensive Flash-Lite family. At $0.25/million tokens of input and $1.50/million output this is 1/8th the price of Gemini 3.1 Pro.&lt;/p&gt;
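&lt;p&gt;A quick check of that 1/8th figure, as a minimal sketch. It assumes the Gemini 3.1 Pro prices for prompts under 200,000 tokens ($2/million input, $12/million output); the dictionary and function names are made up for illustration.&lt;/p&gt;

```python
from fractions import Fraction

# Prices in USD per million tokens.
FLASH_LITE = {"input": Fraction("0.25"), "output": Fraction("1.5")}
# Assumed comparison point: Gemini 3.1 Pro for prompts under 200,000 tokens.
PRO_UNDER_200K = {"input": Fraction(2), "output": Fraction(12)}


def price_ratio(cheap, expensive):
    """Return the cheap/expensive price ratio for each token direction."""
    return {kind: cheap[kind] / expensive[kind] for kind in cheap}


ratios = price_ratio(FLASH_LITE, PRO_UNDER_200K)
for kind, ratio in ratios.items():
    print(kind, ratio)
```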
&lt;p&gt;It supports four different thinking levels, so I had it output &lt;a href="https://gist.github.com/simonw/99fb28dc11d0c24137d4ff8a33978a9e"&gt;four different pelicans&lt;/a&gt;:&lt;/p&gt;
&lt;div style="
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 8px;
    margin: 0 auto;
  "&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-minimal.png" alt="A minimalist vector-style illustration of a stylized bird riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;minimal&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-low.png" alt="A minimalist graphic of a light blue round bird with a single black dot for an eye, wearing a yellow backpack and riding a black bicycle on a flat grey line." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;low&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-medium.png" alt="A minimalist digital illustration of a light blue bird wearing a yellow backpack while riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;medium&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-high.png" alt="A minimal, stylized line drawing of a bird-like creature with a yellow beak riding a bicycle made of simple geometric lines." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;high&lt;/p&gt;
    &lt;/div&gt;
&lt;/div&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/></entry><entry><title>Gemini 3.1 Pro</title><link href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/#atom-tag" rel="alternate"/><published>2026-02-19T17:58:37+00:00</published><updated>2026-02-19T17:58:37+00:00</updated><id>https://simonwillison.net/2026/Feb/19/gemini-31-pro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/"&gt;Gemini 3.1 Pro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The first in the Gemini 3.1 series, priced the same as Gemini 3 Pro ($2/million input, $12/million output under 200,000 tokens, $4/$18 for 200,000 to 1,000,000). That's less than half the price of Claude Opus 4.6 with very similar benchmark scores to that model.&lt;/p&gt;
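&lt;p&gt;The tiered pricing can be sketched as a small calculation. This is an illustration only: the function name is hypothetical, and the rule that the higher tier applies once the input exceeds 200,000 tokens is my assumption rather than something confirmed here.&lt;/p&gt;

```python
def gemini_31_pro_cost(input_tokens, output_tokens):
    """Estimate USD cost from the per-million-token prices quoted above.

    Assumption: the higher tier applies when the input exceeds 200,000 tokens.
    """
    if input_tokens > 200_000:
        input_price, output_price = 4, 18
    else:
        input_price, output_price = 2, 12
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Two example workloads, one in each assumed tier.
print(gemini_31_pro_cost(100_000, 10_000))
print(gemini_31_pro_cost(500_000, 10_000))
```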
&lt;p&gt;They boast about its improved SVG animation performance compared to Gemini 3 Pro in the announcement!&lt;/p&gt;
&lt;p&gt;I tried "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221ugF9fBfLGxnNoe8_rLlluzo9NSPJDWuF%22%5D,%22action%22:%22open%22,%22userId%22:%22106366615678321494423%22,%22resourceKeys%22:%7B%7D%7D&amp;amp;usp=sharing"&gt;in Google AI Studio&lt;/a&gt; and it thought for 323.9 seconds (&lt;a href="https://gist.github.com/simonw/03a755865021739a3659943a22c125ba#thinking-trace"&gt;thinking trace here&lt;/a&gt;) before producing this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat-style illustration of a pelican wearing a blue and white baseball cap, riding a red bicycle with yellow-rimmed wheels along a road. The pelican has a large orange bill and a green scarf. A small fish peeks out of a brown basket on the handlebars. The background features a light blue sky with a yellow sun, white clouds, and green hills." src="https://static.simonwillison.net/static/2026/gemini-3.1-pro-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;It's good to see the legs clearly depicted on both sides of the frame (should &lt;a href="https://twitter.com/elonmusk/status/2023833496804839808"&gt;satisfy Elon&lt;/a&gt;), the fish in the basket is a nice touch and I appreciated this comment in &lt;a href="https://gist.github.com/simonw/03a755865021739a3659943a22c125ba#response"&gt;the SVG code&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Black Flight Feathers on Wing Tip --&amp;gt;
&amp;lt;path d="M 420 175 C 440 182, 460 187, 470 190 C 450 210, 430 208, 410 198 Z" fill="#374151" /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've &lt;a href="https://github.com/simonw/llm-gemini/issues/121"&gt;added&lt;/a&gt; the two new model IDs &lt;code&gt;gemini-3.1-pro-preview&lt;/code&gt; and &lt;code&gt;gemini-3.1-pro-preview-customtools&lt;/code&gt; to my &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini plugin&lt;/a&gt; for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;. That "custom tools" one is &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview#gemini-31-pro-preview-customtools"&gt;described here&lt;/a&gt; - apparently it may provide better tool performance than the default model in some situations.&lt;/p&gt;
&lt;p&gt;The model appears to be &lt;em&gt;incredibly&lt;/em&gt; slow right now - it took 104s to respond to a simple "hi" and a few of my other tests met "Error: This model is currently experiencing high demand. Spikes in demand are usually temporary. Please try again later." or "Error: Deadline expired before operation could complete" errors. I'm assuming that's just teething problems on launch day.&lt;/p&gt;
&lt;p&gt;It sounds like last week's &lt;a href="https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/"&gt;Deep Think release&lt;/a&gt; was our first exposure to the 3.1 family:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: In &lt;a href="https://simonwillison.net/2025/nov/13/training-for-pelicans-riding-bicycles/"&gt;What happens if AI labs train for pelicans riding bicycles?&lt;/a&gt; last November I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Google's Gemini Lead Jeff Dean &lt;a href="https://x.com/JeffDean/status/2024525132266688757"&gt;tweeted this video&lt;/a&gt; featuring an animated pelican riding a bicycle, plus a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.&lt;/p&gt;
&lt;video style="margin-bottom: 1em; max-width: 100%" poster="https://static.simonwillison.net/static/2026/gemini-animated-pelicans.jpg" muted controls preload="none"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-animated-pelicans.mp4" type="video/mp4"&gt;
&lt;/video&gt;

&lt;p&gt;I've been saying for a while that I wish AI labs would highlight things that their new models can do that their older models could not, so top marks to the Gemini team for this video.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: I used &lt;code&gt;llm-gemini&lt;/code&gt; to run my &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;more detailed Pelican prompt&lt;/a&gt;, with &lt;a href="https://gist.github.com/simonw/a3bdd4ec9476ba9e9ba7aa61b46d8296"&gt;this result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Flat-style illustration of a brown pelican riding a teal bicycle with dark blue-rimmed wheels against a plain white background. Unlike the previous image's white cartoon pelican, this pelican has realistic brown plumage with detailed feather patterns, a dark maroon head, yellow eye, and a large pink-tinged pouch bill. The bicycle is a simpler design without a basket, and the scene lacks the colorful background elements like the sun, clouds, road, hills, cap, and scarf from the first illustration, giving it a more minimalist feel." src="https://static.simonwillison.net/static/2026/gemini-3.1-pro-pelican-2.png" /&gt;&lt;/p&gt;
&lt;p&gt;From the SVG comments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Pouch Gradient (Breeding Plumage: Red to Olive/Green) --&amp;gt;
...
&amp;lt;!-- Neck Gradient (Breeding Plumage: Chestnut Nape, White/Yellow Front) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/><category term="svg"/></entry><entry><title>llm-gemini 0.29</title><link href="https://simonwillison.net/2026/Feb/19/llm-gemini/#atom-tag" rel="alternate"/><published>2026-02-19T17:57:25+00:00</published><updated>2026-02-19T17:57:25+00:00</updated><id>https://simonwillison.net/2026/Feb/19/llm-gemini/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.29"&gt;llm-gemini 0.29&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM plugin to access Google&amp;#x27;s Gemini family of models&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="gemini"/></entry><entry><title>Introducing Claude Sonnet 4.6</title><link href="https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag" rel="alternate"/><published>2026-02-17T23:58:58+00:00</published><updated>2026-02-17T23:58:58+00:00</updated><id>https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-sonnet-4-6"&gt;Introducing Claude Sonnet 4.6&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;November's Opus 4.5&lt;/a&gt; while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus models are $5/$25). Here's &lt;a href="https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf"&gt;the system card PDF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 has a "reliable knowledge cutoff" of August 2025, compared to Opus 4.6's May 2025 and Haiku 4.5's February 2025. Both Opus and Sonnet default to 200,000 max input tokens but can stretch to 1 million in beta and at a higher cost.&lt;/p&gt;
&lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.24"&gt;llm-anthropic 0.24&lt;/a&gt; with support for both Sonnet 4.6 and Opus 4.6. Claude Code &lt;a href="https://github.com/simonw/llm-anthropic/pull/65"&gt;did most of the work&lt;/a&gt; - the new models had a fiddly amount of extra details around adaptive thinking and no longer supporting prefixes, as described &lt;a href="https://platform.claude.com/docs/en/about-claude/models/migration-guide"&gt;in Anthropic's migration guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b185576a95e9321b441f0a4dfc0e297c"&gt;what I got&lt;/a&gt; from:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with llm-anthropic llm 'Generate an SVG of a pelican riding a bicycle' -m claude-sonnet-4.6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="The pelican has a jaunty top hat with a red band. There is a string between the upper and lower beaks for some reason. The bicycle frame is warped in the wrong way." src="https://static.simonwillison.net/static/2026/pelican-sonnet-4.6.png" /&gt;&lt;/p&gt;
&lt;p&gt;The SVG comments include:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Hat (fun accessory) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried a second time and also got a top hat. Sonnet 4.6 apparently loves top hats!&lt;/p&gt;
&lt;p&gt;For comparison, here's the pelican Opus 4.5 drew me &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;in November&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards. There is also something that looks a bit like an egg on the handlebars." src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's Anthropic's current best pelican, drawn by Opus 4.6 &lt;a href="https://simonwillison.net/2026/Feb/5/two-new-models/"&gt;on February 5th&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers." src="https://static.simonwillison.net/static/2026/opus-4.6-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Opus 4.6 produces the best pelican beak/pouch. I do think the top hat from Sonnet 4.6 is a nice touch though.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47050488"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="generative-ai"/><category term="pelican-riding-a-bicycle"/><category term="claude-code"/></entry><entry><title>llm-anthropic 0.24</title><link href="https://simonwillison.net/2026/Feb/17/llm-anthropic/#atom-tag" rel="alternate"/><published>2026-02-17T23:51:23+00:00</published><updated>2026-02-17T23:51:23+00:00</updated><id>https://simonwillison.net/2026/Feb/17/llm-anthropic/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.24"&gt;llm-anthropic 0.24&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM access to models by Anthropic, including the Claude series&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="claude"/><category term="anthropic"/></entry><entry><title>datasette-llm 0.1a0</title><link href="https://simonwillison.net/2026/Jan/22/datasette-llm/#atom-tag" rel="alternate"/><published>2026-01-22T17:10:06+00:00</published><updated>2026-01-22T17:10:06+00:00</updated><id>https://simonwillison.net/2026/Jan/22/datasette-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm/releases/tag/0.1a0"&gt;datasette-llm 0.1a0&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM integration plugin for other plugins to depend on&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="datasette"/><category term="llm"/></entry><entry><title>jordanhubbard/nanolang</title><link href="https://simonwillison.net/2026/Jan/19/nanolang/#atom-tag" rel="alternate"/><published>2026-01-19T23:58:56+00:00</published><updated>2026-01-19T23:58:56+00:00</updated><id>https://simonwillison.net/2026/Jan/19/nanolang/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jordanhubbard/nanolang"&gt;jordanhubbard/nanolang&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Plenty of people have mused about what a new programming language specifically designed to be used by LLMs might look like. Jordan Hubbard (&lt;a href="https://en.wikipedia.org/wiki/Jordan_Hubbard"&gt;co-founder of FreeBSD&lt;/a&gt;, with serious stints at Apple and NVIDIA) just released exactly that.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A minimal, LLM-friendly programming language with mandatory testing and unambiguous syntax.&lt;/p&gt;
&lt;p&gt;NanoLang transpiles to C for native performance while providing a clean, modern syntax optimized for both human readability and AI code generation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The syntax strikes me as an interesting mix between C, Lisp and Rust.&lt;/p&gt;
&lt;p&gt;I decided to see if an LLM could produce working code in it directly, given the necessary context. I started with this &lt;a href="https://github.com/jordanhubbard/nanolang/blob/main/MEMORY.md"&gt;MEMORY.md&lt;/a&gt; file, which begins:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; This file is designed specifically for Large Language Model consumption. It contains the essential knowledge needed to generate, debug, and understand NanoLang code. Pair this with &lt;code&gt;spec.json&lt;/code&gt; for complete language coverage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I ran that using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and &lt;a href="https://github.com/simonw/llm-anthropic"&gt;llm-anthropic&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m claude-opus-4.5 \
  -s https://raw.githubusercontent.com/jordanhubbard/nanolang/refs/heads/main/MEMORY.md \
  'Build me a mandelbrot fractal CLI tool in this language' \
  &amp;gt; /tmp/fractal.nano
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/7847f022566d11629ec2139f1d109fb8#mandelbrot-fractal-cli-tool-in-nano"&gt;resulting code&lt;/a&gt;... &lt;a href="https://gist.github.com/simonw/7847f022566d11629ec2139f1d109fb8?permalink_comment_id=5947465#gistcomment-5947465"&gt;did not compile&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I may have been too optimistic expecting a one-shot working program for a new language like this. So I ran a clone of the actual project, copied in my program and had Claude Code take a look at the failing compiler output.&lt;/p&gt;
&lt;p&gt;... and it worked! Claude happily grepped its way through the various &lt;code&gt;examples/&lt;/code&gt; and built me a working program.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gisthost.github.io/?9696da6882cb6596be6a9d5196e8a7a5/index.html"&gt;the Claude Code transcript&lt;/a&gt; - you can see it &lt;a href="https://gisthost.github.io/?9696da6882cb6596be6a9d5196e8a7a5/page-001.html#msg-2026-01-19T23-43-09-675Z"&gt;reading relevant examples here&lt;/a&gt; - and here's &lt;a href="https://gist.github.com/simonw/e7f3577adcfd392ab7fa23b1295d00f2"&gt;the finished code plus its output&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've suspected &lt;a href="https://simonwillison.net/2025/Nov/7/llms-for-new-programming-languages/"&gt;for a while&lt;/a&gt; that LLMs and coding agents might significantly reduce the friction involved in launching a new language. This result reinforces my opinion.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46684958"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/programming-languages"&gt;programming-languages&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;



</summary><category term="coding-agents"/><category term="ai-assisted-programming"/><category term="programming-languages"/><category term="claude-code"/><category term="generative-ai"/><category term="ai"/><category term="llms"/><category term="llm"/></entry><entry><title>llm-gemini 0.28.2</title><link href="https://simonwillison.net/2025/Dec/23/llm-gemini/#atom-tag" rel="alternate"/><published>2025-12-23T16:20:33+00:00</published><updated>2025-12-23T16:20:33+00:00</updated><id>https://simonwillison.net/2025/Dec/23/llm-gemini/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.28.2"&gt;llm-gemini 0.28.2&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM plugin to access Google&amp;#x27;s Gemini family of models&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="gemini"/></entry><entry><title>llm-gemini 0.28.1</title><link href="https://simonwillison.net/2025/Dec/18/llm-gemini/#atom-tag" rel="alternate"/><published>2025-12-18T17:57:28+00:00</published><updated>2025-12-18T17:57:28+00:00</updated><id>https://simonwillison.net/2025/Dec/18/llm-gemini/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.28.1"&gt;llm-gemini 0.28.1&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM plugin to access Google&amp;#x27;s Gemini family of models&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="gemini"/></entry><entry><title>Gemini 3 Flash</title><link href="https://simonwillison.net/2025/Dec/17/gemini-3-flash/#atom-tag" rel="alternate"/><published>2025-12-17T22:44:52+00:00</published><updated>2025-12-17T22:44:52+00:00</updated><id>https://simonwillison.net/2025/Dec/17/gemini-3-flash/#atom-tag</id><summary type="html">
    &lt;p&gt;It continues to be a busy December, if not quite as busy &lt;a href="https://simonwillison.net/2024/Dec/20/december-in-llms-has-been-a-lot/"&gt;as last year&lt;/a&gt;. Today's big news is &lt;a href="https://blog.google/technology/developers/build-with-gemini-3-flash/"&gt;Gemini 3 Flash&lt;/a&gt;, the latest in Google's "Flash" line of faster and less expensive models.&lt;/p&gt;
&lt;p&gt;Google are emphasizing the comparison between the new Flash and their previous generation's top model Gemini 2.5 Pro:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini 3 Flash's characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF, outputs only text, handles 1,048,576 maximum input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series).&lt;/p&gt;
&lt;p&gt;The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro &amp;gt;200k, and it's nice not to have a price increase for the new Flash at larger token lengths.&lt;/p&gt;
&lt;p&gt;It's a little &lt;em&gt;more&lt;/em&gt; expensive than previous Flash models - Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million on output, Gemini 3 Flash is $0.50/million and $3/million respectively.&lt;/p&gt;
&lt;p&gt;Google &lt;a href="https://blog.google/products/gemini/gemini-3-flash/"&gt;claim&lt;/a&gt; it may still end up cheaper though, due to more efficient output token usage:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://www.llm-prices.com/#it=100000&amp;amp;ot=10000&amp;amp;sel=gemini-3-flash-preview%2Cgemini-3-pro-preview%2Cgemini-3-pro-preview-200k%2Cgpt-5.2%2Cclaude-opus-4-5%2Cclaude-sonnet-4.5%2Cclaude-4.5-haiku%2Cgemini-2.5-flash%2Cgpt-5-mini"&gt;a more extensive price comparison&lt;/a&gt; on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site.&lt;/p&gt;
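&lt;p&gt;To make the Flash price difference concrete, here is a rough sketch for a hypothetical workload of 100,000 input tokens and 10,000 output tokens. The helper name is made up, and it ignores thinking tokens and any caching or batch discounts.&lt;/p&gt;

```python
def cost_usd(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD, given prices in USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 100,000 input tokens and 10,000 output tokens.
gemini_25_flash = cost_usd(100_000, 10_000, 0.30, 2.50)
gemini_3_flash = cost_usd(100_000, 10_000, 0.50, 3.00)

print(f"Gemini 2.5 Flash: ${gemini_25_flash:.3f}")
print(f"Gemini 3 Flash:   ${gemini_3_flash:.3f}")
```

&lt;p&gt;Whether the newer model ends up cheaper in practice then depends on how many output tokens each one spends on the same task.&lt;/p&gt;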
&lt;h4 id="generating-some-svgs-of-pelicans"&gt;Generating some SVGs of pelicans&lt;/h4&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.28"&gt;llm-gemini 0.28&lt;/a&gt; this morning with support for the new model. You can try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;According to &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3#thinking_level"&gt;the developer docs&lt;/a&gt; the new model supports four different thinking level options: &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, and &lt;code&gt;high&lt;/code&gt;. This is different from Gemini 3 Pro, which only supported &lt;code&gt;low&lt;/code&gt; and &lt;code&gt;high&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can run those like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
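&lt;p&gt;To try every level in one go, a small script could build one command per level. This sketch only prints the commands rather than running them, and the function name is hypothetical; it uses the same model ID and &lt;code&gt;--thinking-level&lt;/code&gt; option shown above.&lt;/p&gt;

```python
import shlex

LEVELS = ["minimal", "low", "medium", "high"]
PROMPT = "Generate an SVG of a pelican riding a bicycle"


def build_commands(model="gemini-3-flash-preview"):
    """Return one llm command (as an argument list) per thinking level."""
    return [
        ["llm", "-m", model, "--thinking-level", level, PROMPT]
        for level in LEVELS
    ]


for command in build_commands():
    # Printed rather than executed; pass each list to subprocess.run to run it.
    print(shlex.join(command))
```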
&lt;p&gt;Here are four pelicans, for thinking levels &lt;a href="https://gist.github.com/simonw/8047c805a4a1df7fd4e854b18e7482d9"&gt;minimal&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/fb61686a1f915e3777b4a40e2df41068"&gt;low&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/190c3ce82cd8976827139bbc4dcc2d19"&gt;medium&lt;/a&gt;, and &lt;a href="https://gist.github.com/simonw/da66ffce135359161996e41e50e32ec3"&gt;high&lt;/a&gt;:&lt;/p&gt;
&lt;image-gallery width="4"&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg" alt="A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg" alt="Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg" alt="A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg" alt="Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background." /&gt;
&lt;/image-gallery&gt;
&lt;h4 id="i-built-the-gallery-component-with-gemini-3-flash"&gt;I built the gallery component with Gemini 3 Flash&lt;/h4&gt;
&lt;p&gt;The gallery above uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;image-gallery&lt;/span&gt; &lt;span class="pl-c1"&gt;width&lt;/span&gt;="&lt;span class="pl-s"&gt;4&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;image-gallery&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-3-flash-preview --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;You write alt text for any image pasted in by the user. Alt text is always presented in a&lt;/span&gt;
&lt;span class="pl-s"&gt;fenced code block to make it easy to copy and paste out. It is always presented on a single&lt;/span&gt;
&lt;span class="pl-s"&gt;line so it can be used easily in Markdown images. All text on the image (for screenshots etc)&lt;/span&gt;
&lt;span class="pl-s"&gt;must be exactly included. A short note describing the nature of the image itself should go first.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
-a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can see the code that powers the image gallery Web Component &lt;a href="https://github.com/simonw/simonwillisonblog/blob/31651b3a527011d1c971d4256c1c9f61ef378d23/static/image-gallery.js"&gt;here on GitHub&lt;/a&gt;. I built it by prompting Gemini 3 Flash via &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-3-flash-preview &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Build a Web Component that implements a simple image gallery. Usage is like this:&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;image-gallery width="5"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image1.jpg" alt="Image 1"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image2.jpg" alt="Image 2" data-thumb="image2-thumb.jpg"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image3.jpg" alt="Image 3"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;/image-gallery&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;If an image has a data-thumb= attribute that one is used instead, other images are scaled down. &lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;The image gallery always takes up 100% of available width. The width="5" attribute means that five images will be shown next to each other in each row. The default is 3. There are gaps between the images. When an image is clicked it opens a modal dialog with the full size image.&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Return a complete HTML file with both the implementation of the Web Component several example uses of it. Use https://picsum.photos/300/200 URLs for those example images.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It took a few follow-up prompts using &lt;code&gt;llm -c&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Use a real modal such that keyboard shortcuts and accessibility features work without extra JS&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Use X for the close icon and make it a bit more subtle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;remove the hover effect entirely&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;I want no border on the close icon even when it is focused&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/09f63a49f29620d4cbbfd383cfee1db3"&gt;the full transcript&lt;/a&gt;, exported using &lt;code&gt;llm logs -cue&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Those five prompts took:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;225 input, 3,269 output&lt;/li&gt;
&lt;li&gt;2,243 input, 2,908 output&lt;/li&gt;
&lt;li&gt;4,319 input, 2,516 output&lt;/li&gt;
&lt;li&gt;6,376 input, 2,094 output&lt;/li&gt;
&lt;li&gt;8,151 input, 1,806 output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Added together that's 21,314 input and 12,593 output for a grand total &lt;a href="https://www.llm-prices.com/#it=21314&amp;amp;ot=12593&amp;amp;sel=gemini-3-flash-preview"&gt;of 4.8436 cents&lt;/a&gt;.&lt;/p&gt;
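&lt;p&gt;Here's a minimal sketch of that arithmetic. The token counts are copied from the list above. The prices of $0.50 per million input tokens and $3.00 per million output tokens are assumptions that do not appear in this post, though they do reproduce the linked total.&lt;/p&gt;

```python
# Token counts for the five prompts, as (input, output) pairs from the list above.
prompts = [
    (225, 3269),
    (2243, 2908),
    (4319, 2516),
    (6376, 2094),
    (8151, 1806),
]

# Assumed prices in dollars per million tokens (not stated in this post).
INPUT_PRICE = 0.50
OUTPUT_PRICE = 3.00

total_input = sum(i for i, _ in prompts)
total_output = sum(o for _, o in prompts)

# Cost in dollars, then converted to cents.
cost_dollars = (total_input * INPUT_PRICE + total_output * OUTPUT_PRICE) / 1_000_000
cost_cents = cost_dollars * 100

print(f"{total_input:,} input, {total_output:,} output")
print(f"{cost_cents:.4f} cents")
```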
&lt;p&gt;The guide to &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3#migrating_from_gemini_25"&gt;migrating from Gemini 2.5&lt;/a&gt; reveals one disappointment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Image segmentation:&lt;/strong&gt; Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or &lt;a href="https://ai.google.dev/gemini-api/docs/robotics-overview"&gt;Gemini Robotics-ER 1.5&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/"&gt;wrote about this capability in Gemini 2.5&lt;/a&gt; back in April. I hope it comes back in future models - it's a really neat capability that is unique to Gemini.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-components"&gt;web-components&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="gemini"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="generative-ai"/><category term="google"/><category term="web-components"/></entry><entry><title>llm-gemini 0.28</title><link href="https://simonwillison.net/2025/Dec/17/llm-gemini/#atom-tag" rel="alternate"/><published>2025-12-17T18:14:48+00:00</published><updated>2025-12-17T18:14:48+00:00</updated><id>https://simonwillison.net/2025/Dec/17/llm-gemini/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.28"&gt;llm-gemini 0.28&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM plugin to access Google&amp;#x27;s Gemini family of models&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="gemini"/></entry><entry><title>LLM 0.28</title><link href="https://simonwillison.net/2025/Dec/12/llm-028/#atom-tag" rel="alternate"/><published>2025-12-12T20:20:14+00:00</published><updated>2025-12-12T20:20:14+00:00</updated><id>https://simonwillison.net/2025/Dec/12/llm-028/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-28"&gt;LLM 0.28&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I released a new version of my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; Python library and CLI tool for interacting with Large Language Models. Highlights from the release notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New OpenAI models: &lt;code&gt;gpt-5.1&lt;/code&gt;, &lt;code&gt;gpt-5.1-chat-latest&lt;/code&gt;, &lt;code&gt;gpt-5.2&lt;/code&gt; and &lt;code&gt;gpt-5.2-chat-latest&lt;/code&gt;. &lt;a href="https://github.com/simonw/llm/issues/1300"&gt;#1300&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm/issues/1317"&gt;#1317&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;When fetching URLs as fragments using &lt;code&gt;llm -f URL&lt;/code&gt;, the request now includes a custom user-agent header: &lt;code&gt;llm/VERSION (https://llm.datasette.io/)&lt;/code&gt;. &lt;a href="https://github.com/simonw/llm/issues/1309"&gt;#1309&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fixed a bug where fragments were not correctly registered with their source when using &lt;code&gt;llm chat&lt;/code&gt;. Thanks, &lt;a href="https://github.com/grota"&gt;Giuseppe Rota&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1316"&gt;#1316&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fixed some file descriptor leak warnings. Thanks, &lt;a href="https://github.com/eedeebee"&gt;Eric Bloch&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/issues/1313"&gt;#1313&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Type annotations for the OpenAI Chat, AsyncChat and Completion &lt;code&gt;execute()&lt;/code&gt; methods. Thanks, &lt;a href="https://github.com/ar-jan"&gt;Arjan Mossel&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1315"&gt;#1315&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The project now uses &lt;code&gt;uv&lt;/code&gt; and dependency groups for development. See the updated &lt;a href="https://llm.datasette.io/en/stable/contributing.html"&gt;contributing documentation&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/issues/1318"&gt;#1318&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;That last bullet point about &lt;code&gt;uv&lt;/code&gt; relates to the dependency groups pattern I &lt;a href="https://til.simonwillison.net/uv/dependency-groups"&gt;wrote about in a recent TIL&lt;/a&gt;. I'm currently working through applying it to my other projects - the net result is that running the test suite is as simple as doing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/simonw/llm
cd llm
uv run pytest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The new &lt;code&gt;dev&lt;/code&gt; dependency group &lt;a href="https://github.com/simonw/llm/blob/0.28/pyproject.toml#L44-L69"&gt;defined in pyproject.toml&lt;/a&gt; is automatically installed by &lt;code&gt;uv run&lt;/code&gt; in a new virtual environment, which means everything needed to run &lt;code&gt;pytest&lt;/code&gt; is available without any extra commands.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="uv"/><category term="annotated-release-notes"/><category term="ai"/><category term="llms"/><category term="python"/><category term="generative-ai"/><category term="projects"/></entry><entry><title>llm 0.28</title><link href="https://simonwillison.net/2025/Dec/12/llm/#atom-tag" rel="alternate"/><published>2025-12-12T20:03:45+00:00</published><updated>2025-12-12T20:03:45+00:00</updated><id>https://simonwillison.net/2025/Dec/12/llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm/releases/tag/0.28"&gt;llm 0.28&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;Access large language models from the command-line&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/></entry><entry><title>GPT-5.2</title><link href="https://simonwillison.net/2025/Dec/11/gpt-52/#atom-tag" rel="alternate"/><published>2025-12-11T23:58:04+00:00</published><updated>2025-12-11T23:58:04+00:00</updated><id>https://simonwillison.net/2025/Dec/11/gpt-52/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI reportedly &lt;a href="https://www.wsj.com/tech/ai/openais-altman-declares-code-red-to-improve-chatgpt-as-google-threatens-ai-lead-7faf5ea6"&gt;declared a "code red"&lt;/a&gt; on the 1st of December in response to increasingly credible competition from the likes of Google's Gemini 3. It's less than two weeks later and they just &lt;a href="https://openai.com/index/introducing-gpt-5-2/"&gt;announced GPT-5.2&lt;/a&gt;, calling it "the most capable model series yet for professional knowledge work".&lt;/p&gt;
&lt;h4 id="key-characteristics-of-gpt-5-2"&gt;Key characteristics of GPT-5.2&lt;/h4&gt;
&lt;p&gt;The new model comes in two variants: GPT-5.2 and GPT-5.2 Pro. There's no Mini variant yet.&lt;/p&gt;
&lt;p&gt;GPT-5.2 is available via their UI in both "instant" and "thinking" modes, presumably still corresponding to the API concept of different reasoning effort levels.&lt;/p&gt;
&lt;p&gt;The knowledge cut-off date for both variants is now &lt;strong&gt;August 31st 2025&lt;/strong&gt;. This is significant - GPT 5.1 and 5 were both Sep 30, 2024 and GPT-5 mini was May 31, 2024.&lt;/p&gt;
&lt;p&gt;Both of the 5.2 models have a 400,000 token context window and 128,000 max output tokens - no different from 5.1 or 5.&lt;/p&gt;
&lt;p&gt;Pricing wise 5.2 is a rare &lt;em&gt;increase&lt;/em&gt; - it's 1.4x the cost of GPT 5.1, at $1.75/million input and $14/million output. GPT-5.2 Pro is $21.00/million input and a hefty $168.00/million output, putting it &lt;a href="https://www.llm-prices.com/#sel=gpt-4.5%2Co1-pro%2Cgpt-5.2-pro"&gt;up there&lt;/a&gt; with their previous most expensive models o1 Pro and GPT-4.5.&lt;/p&gt;
&lt;p&gt;So far the main benchmark results we have are self-reported by OpenAI. The most interesting ones are a 70.9% score on their GDPval "Knowledge work tasks" benchmark (GPT-5 got 38.8%) and a 52.9% on ARC-AGI-2 (up from 17.6% for GPT-5.1 Thinking).&lt;/p&gt;
&lt;p&gt;The ARC Prize Twitter account provided &lt;a href="https://x.com/arcprize/status/1999182732845547795"&gt;this interesting note&lt;/a&gt; on the efficiency gains for GPT-5.2 Pro:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A year ago, we verified a preview of an unreleased version of @OpenAI
o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task&lt;/p&gt;
&lt;p&gt;Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task&lt;/p&gt;
&lt;p&gt;This represents a ~390X efficiency improvement in one year&lt;/p&gt;
&lt;/blockquote&gt;
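&lt;p&gt;As a rough check, and assuming the "~390X" figure is simply the ratio of the two per-task costs quoted above, the division works out like this. It compares cost only and ignores the difference between the 88% and 90.5% scores.&lt;/p&gt;

```python
# Per-task cost estimates quoted by ARC Prize above.
o3_high_cost = 4500.00   # "est. $4.5k/task"
gpt_52_pro_cost = 11.64  # "$11.64/task"

# How many times cheaper the newer run was per task.
ratio = o3_high_cost / gpt_52_pro_cost

print(f"{ratio:.0f}x cheaper per task")  # prints "387x cheaper per task"
```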
&lt;p&gt;GPT-5.2 can be accessed in OpenAI's Codex CLI tool like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;codex -m gpt-5.2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are three new API models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.2"&gt;gpt-5.2&lt;/a&gt; - I think this is what you get if you select "GPT-5.2 Thinking" in ChatGPT but &lt;a href="https://twitter.com/simonw/status/1999603339382976785"&gt;I'm a little confused&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platform.openai.com/docs/models/gpt-5.2-chat-latest"&gt;gpt-5.2-chat-latest&lt;/a&gt; - the model used by ChatGPT for "GPT-5.2 Instant" mode. It's priced the same as GPT-5.2 but has a reduced 128,000 context window with 16,384 max output tokens.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.2-pro"&gt;gpt-5.2-pro&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OpenAI have published a new &lt;a href="https://cookbook.openai.com/examples/gpt-5/gpt-5-2_prompting_guide"&gt;GPT-5.2 Prompting Guide&lt;/a&gt;. An interesting note from that document is that compaction can now be run with &lt;a href="https://platform.openai.com/docs/api-reference/responses/compact"&gt;a new dedicated server-side API&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For long-running, tool-heavy workflows that exceed the standard context window, GPT-5.2 with Reasoning supports response compaction via the &lt;code&gt;/responses/compact&lt;/code&gt; endpoint. Compaction performs a loss-aware compression pass over prior conversation state, returning encrypted, opaque items that preserve task-relevant information while dramatically reducing token footprint. This allows the model to continue reasoning across extended workflows without hitting context limits.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4 id="it-s-better-at-vision"&gt;It's better at vision&lt;/h4&gt;
&lt;p&gt;One note from the announcement that caught my eye:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.2 Thinking is our strongest vision model yet, cutting error rates roughly in half on chart reasoning and software interface understanding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had &lt;a href="https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-coding/"&gt;disappointing results from GPT-5&lt;/a&gt; on an OCR task a while ago. I tried it against GPT-5.2 and it did &lt;em&gt;much&lt;/em&gt; better:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gpt-5.2 ocr -a https://static.simonwillison.net/static/2025/ft.jpeg&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b4a13f1e424e58b8b0aca72ae2c3cb00"&gt;the result&lt;/a&gt; from that, which cost 1,520 input and 1,022 output tokens for a total of &lt;a href="https://www.llm-prices.com/#it=1520&amp;amp;ot=1022&amp;amp;sel=gpt-5.2"&gt;1.6968 cents&lt;/a&gt;.&lt;/p&gt;
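&lt;p&gt;That figure can be reproduced from the GPT-5.2 prices quoted earlier in this post ($1.75/million input and $14/million output). This is a small sketch with illustrative variable names that also shows how much of the bill comes from output tokens.&lt;/p&gt;

```python
# Token counts for the OCR prompt above.
input_tokens = 1520
output_tokens = 1022

# GPT-5.2 prices in dollars per million tokens, as quoted earlier in this post.
INPUT_PRICE = 1.75
OUTPUT_PRICE = 14.00

# Work in cents: dollars per million tokens * tokens / 1,000,000 * 100.
input_cents = input_tokens * INPUT_PRICE / 1_000_000 * 100
output_cents = output_tokens * OUTPUT_PRICE / 1_000_000 * 100
total_cents = input_cents + output_cents

# Fraction of the total cost that comes from output tokens.
output_share = output_cents / total_cents

print(f"{total_cents:.4f} cents")
print(f"{output_share:.0%} of the cost is output tokens")
```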
&lt;h4 id="rendering-some-pelicans"&gt;Rendering some pelicans&lt;/h4&gt;
&lt;p&gt;For my classic "Generate an SVG of a pelican riding a bicycle" test:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gpt-5.2 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-2.5-pelican.png" alt="Described by GPT-5.2: Cartoon-style illustration: A white, duck-like bird with a small black eye, oversized orange beak (with a pale blue highlight along the lower edge), and a pink neckerchief rides a blue-framed bicycle in side view; the bike has two large black wheels with gray spokes, a blue front fork, visible black crank/pedal area, and thin black handlebar lines, with gray motion streaks and a soft gray shadow under the bike on a light-gray road; background is a pale blue sky with a simple yellow sun at upper left and two rounded white clouds (one near upper center-left and one near upper right)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And for the more advanced alternative test, which tests instruction following in a little more depth:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gpt-5.2 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle&lt;/span&gt;
&lt;span class="pl-s"&gt;must have spokes and a correctly shaped bicycle frame. The pelican must have its&lt;/span&gt;
&lt;span class="pl-s"&gt;characteristic large pouch, and there should be a clear indication of feathers.&lt;/span&gt;
&lt;span class="pl-s"&gt;The pelican must be clearly pedaling the bicycle. The image should show the full&lt;/span&gt;
&lt;span class="pl-s"&gt;breeding plumage of the California brown pelican.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5.2-p2.png" alt="Digital illustration on a light gray/white background with a thin horizontal baseline: a stylized California brown pelican in breeding plumage is drawn side-on, leaning forward and pedaling a bicycle; the pelican has a dark brown body with layered wing lines, a pale cream head with a darker brown cap and neck shading, a small black eye, and an oversized long golden-yellow bill extending far past the front wheel; one brown leg reaches down to a pedal while the other is tucked back; the bike is shown in profile with two large spoked wheels (black tires, white rims), a dark frame, crank and chainring near the rear wheel, a black saddle above the rear, and the front fork aligned under the pelican’s head; text at the top reads &amp;quot;California brown pelican (breeding plumage) pedaling a bicycle&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 14th December 2025&lt;/strong&gt;: I used GPT-5.2 running in Codex CLI to &lt;a href="https://simonwillison.net/2025/Dec/15/porting-justhtml/"&gt;port a complex Python library to JavaScript&lt;/a&gt;. It ran without interference for nearly four hours and completed a complex task exactly to my specification.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="openai"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="gpt-5"/><category term="generative-ai"/></entry><entry><title>Devstral 2</title><link href="https://simonwillison.net/2025/Dec/9/devstral-2/#atom-tag" rel="alternate"/><published>2025-12-09T23:58:27+00:00</published><updated>2025-12-09T23:58:27+00:00</updated><id>https://simonwillison.net/2025/Dec/9/devstral-2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/devstral-2-vibe-cli"&gt;Devstral 2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Two new models from Mistral today: Devstral 2 and Devstral Small 2 - both focused on powering coding agents such as Mistral's newly released Mistral Vibe, which &lt;a href="https://simonwillison.net/2025/Dec/9/mistral-vibe/"&gt;I wrote about earlier today&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Devstral 2: SOTA open model for code agents with a fraction of the parameters of its competitors and achieving 72.2% on SWE-bench Verified.&lt;/li&gt;
&lt;li&gt;Up to 7x more cost-efficient than Claude Sonnet at real-world tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Devstral 2 is a 123B model released under a janky license - it's "modified MIT" where &lt;a href="https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512/blob/main/LICENSE"&gt;the modification&lt;/a&gt; is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company (or that of your employer) exceeds $20 million (or its equivalent in another currency) for the preceding month. This restriction in (b) applies to the Model and any derivatives, modifications, or combined works based on it, whether provided by Mistral AI or by a third party. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Devstral Small 2 is under a proper Apache 2 license with no weird strings attached. It's a 24B model which is &lt;a href="https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512"&gt;51.6GB on Hugging Face&lt;/a&gt; and should quantize to significantly less.&lt;/p&gt;
&lt;p&gt;I tried out the larger model via &lt;a href="https://github.com/simonw/llm-mistral"&gt;my llm-mistral plugin&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mistral
llm mistral refresh
llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Bicycle looks a bit like a cybertruck" src="https://static.simonwillison.net/static/2025/devstral-2.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;For a ~120B model that one is pretty good!&lt;/p&gt;
&lt;p&gt;Here's the same prompt with &lt;code&gt;-m mistral/labs-devstral-small-2512&lt;/code&gt; for the API hosted version of Devstral Small 2:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A small white pelican on what looks more like a child's cart." src="https://static.simonwillison.net/static/2025/devstral-small-2.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Again, a decent result given the small parameter size. For comparison, &lt;a href="https://simonwillison.net/2025/Jun/20/mistral-small-32/"&gt;here's what I got&lt;/a&gt; for the 24B Mistral Small 3.2 earlier this year.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="mistral"/><category term="generative-ai"/><category term="ai"/><category term="janky-licenses"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/></entry><entry><title>Introducing Mistral 3</title><link href="https://simonwillison.net/2025/Dec/2/introducing-mistral-3/#atom-tag" rel="alternate"/><published>2025-12-02T17:30:57+00:00</published><updated>2025-12-02T17:30:57+00:00</updated><id>https://simonwillison.net/2025/Dec/2/introducing-mistral-3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/mistral-3"&gt;Introducing Mistral 3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Four new models from Mistral today: three in their "Ministral" smaller model series (14B, 8B, and 3B) and a new Mistral Large 3 MoE model with 675B parameters, 41B active.&lt;/p&gt;
&lt;p&gt;All of the models are vision capable, and they are all released under an Apache 2 license.&lt;/p&gt;
&lt;p&gt;I'm particularly excited about the 3B model, which appears to be a competent vision-capable model in a tiny ~3GB file.&lt;/p&gt;
&lt;p&gt;Xenova from Hugging Face &lt;a href="https://x.com/xenovacom/status/1995879338583945635"&gt;got it working in a browser&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;@MistralAI releases Mistral 3, a family of multimodal models, including three start-of-the-art dense models (3B, 8B, and 14B) and Mistral Large 3 (675B, 41B active). All Apache 2.0! 🤗&lt;/p&gt;
&lt;p&gt;Surprisingly, the 3B is small enough to run 100% locally in your browser on WebGPU! 🤯&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can &lt;a href="https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU"&gt;try that demo in your browser&lt;/a&gt;, which will fetch 3GB of model and then stream from your webcam and let you run text prompts against what the model is seeing, entirely locally.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a man with glasses holding a red cube-shaped object up to the camera in a live computer vision interface; top left label reads “LIVE FEED”; top right slider label reads “INPUT SIZE: 480PX”; lower left panel titled “PROMPT LIBRARY” with prompts “Describe what you see in one sentence.” “What is the color of my shirt?” “Identify any text or written content visible.” “What emotions or actions are being portrayed?” “Name the object I am holding in my hand.”; below that a field labeled “PROMPT” containing the text “write a haiku about this”; lower right panel titled “OUTPUT STREAM” with buttons “VIEW HISTORY” and “LIVE INFERENCE” and generated text “Red cube held tight, Fingers frame the light’s soft glow– Mystery shines bright.”; a small status bar at the bottom shows “ttft: 4188ms  tokens/sec: 5.09” and “ctx: 3.3B-Instruct”." src="https://static.simonwillison.net/static/2025/3b-webcam.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Mistral's API hosted versions of the new models are supported by my &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral plugin&lt;/a&gt; already thanks to the &lt;code&gt;llm mistral refresh&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ llm mistral refresh
Added models: ministral-3b-2512, ministral-14b-latest, mistral-large-2512, ministral-14b-2512, ministral-8b-2512
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/0df5e656291d5a7a1bf012fabc9edc3f"&gt;tried pelicans against all of the models&lt;/a&gt;. Here's the best one, from Mistral Large 3:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Nice cloud. Pelican isn't great, the beak is missing the pouch. It's floating above the bicycle which has two wheels and an incorrect frame." src="https://static.simonwillison.net/static/2025/mistral-large-3.png" /&gt;&lt;/p&gt;
&lt;p&gt;And the worst from Ministral 3B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A black sky. A brown floor. A set of abstract brown and grey shapes float, menacingly." src="https://static.simonwillison.net/static/2025/ministral-3b.png" /&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="vision-llms"/><category term="llm-release"/><category term="mistral"/><category term="llm"/><category term="generative-ai"/><category term="ai"/><category term="llms"/></entry><entry><title>llm-templates-fabric 0.2.1</title><link href="https://simonwillison.net/2025/Nov/26/llm-templates-fabric/#atom-tag" rel="alternate"/><published>2025-11-26T04:56:15+00:00</published><updated>2025-11-26T04:56:15+00:00</updated><id>https://simonwillison.net/2025/Nov/26/llm-templates-fabric/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-templates-fabric/releases/tag/0.2.1"&gt;llm-templates-fabric 0.2.1&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;Load LLM templates from Fabric&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/></entry><entry><title>llm-anthropic 0.23</title><link href="https://simonwillison.net/2025/Nov/25/llm-anthropic/#atom-tag" rel="alternate"/><published>2025-11-25T05:26:34+00:00</published><updated>2025-11-25T05:26:34+00:00</updated><id>https://simonwillison.net/2025/Nov/25/llm-anthropic/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.23"&gt;llm-anthropic 0.23&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New plugin release adding support for Claude Opus 4.5, including the new &lt;code&gt;thinking_effort&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-anthropic
llm -m claude-opus-4.5 -o thinking_effort low 'muse on pelicans'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This took longer to release than I had hoped because it was blocked on Anthropic shipping &lt;a href="https://github.com/anthropics/anthropic-sdk-python/releases/tag/v0.75.0"&gt;0.75.0&lt;/a&gt; of their Python library with support for thinking effort.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="generative-ai"/><category term="projects"/><category term="ai"/><category term="llms"/></entry><entry><title>llm-anthropic 0.23</title><link href="https://simonwillison.net/2025/Nov/25/llm-anthropic-2/#atom-tag" rel="alternate"/><published>2025-11-25T05:20:15+00:00</published><updated>2025-11-25T05:20:15+00:00</updated><id>https://simonwillison.net/2025/Nov/25/llm-anthropic-2/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.23"&gt;llm-anthropic 0.23&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM access to models by Anthropic, including the Claude series&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="anthropic"/><category term="claude"/></entry><entry><title>llm-anthropic 0.22</title><link href="https://simonwillison.net/2025/Nov/19/llm-anthropic/#atom-tag" rel="alternate"/><published>2025-11-19T15:15:36+00:00</published><updated>2025-11-19T15:15:36+00:00</updated><id>https://simonwillison.net/2025/Nov/19/llm-anthropic/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.22"&gt;llm-anthropic 0.22&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM access to models by Anthropic, including the Claude series&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="claude"/><category term="anthropic"/></entry><entry><title>llm-gemini 0.27</title><link href="https://simonwillison.net/2025/Nov/18/llm-gemini/#atom-tag" rel="alternate"/><published>2025-11-18T23:00:40+00:00</published><updated>2025-11-18T23:00:40+00:00</updated><id>https://simonwillison.net/2025/Nov/18/llm-gemini/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.27"&gt;llm-gemini 0.27&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New release of my LLM plugin for Google's Gemini models:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Support for nested schemas in Pydantic, thanks &lt;a href="https://github.com/billpugh"&gt;Bill Pugh&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-gemini/pull/107"&gt;#107&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Now tests against Python 3.14.&lt;/li&gt;
&lt;li&gt;Support for YouTube URLs as attachments and the &lt;code&gt;media_resolution&lt;/code&gt; option. Thanks, &lt;a href="https://github.com/shuane"&gt;Duane Milne&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-gemini/pull/112"&gt;#112&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New model: &lt;code&gt;gemini-3-pro-preview&lt;/code&gt;. &lt;a href="https://github.com/simonw/llm-gemini/issues/113"&gt;#113&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The YouTube URL feature is particularly neat, taking advantage of &lt;a href="https://ai.google.dev/gemini-api/docs/video-understanding#youtube"&gt;this API feature&lt;/a&gt;. I used it against the &lt;a href="https://simonwillison.net/2025/Nov/18/google-antigravity/"&gt;Google Antigravity launch video&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini-3-pro-preview \
 -a 'https://www.youtube.com/watch?v=nTOVIGsqCuY' \
 'Summary, with detailed notes about what this thing is and how it differs from regular VS Code, then a complete detailed transcript with timestamps'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9f30318ab47e0d177b4b523bb71d9540"&gt;the result&lt;/a&gt;. A spot-check of the timestamps against points in the video shows them to be exactly right.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/youtube"&gt;youtube&lt;/a&gt;&lt;/p&gt;



</summary><category term="gemini"/><category term="llm"/><category term="generative-ai"/><category term="projects"/><category term="ai"/><category term="llms"/><category term="youtube"/></entry><entry><title>llm-gemini 0.27</title><link href="https://simonwillison.net/2025/Nov/18/llm-gemini-2/#atom-tag" rel="alternate"/><published>2025-11-18T22:56:18+00:00</published><updated>2025-11-18T22:56:18+00:00</updated><id>https://simonwillison.net/2025/Nov/18/llm-gemini-2/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.27"&gt;llm-gemini 0.27&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM plugin to access Google&amp;#x27;s Gemini family of models&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="gemini"/></entry><entry><title>Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark</title><link href="https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag" rel="alternate"/><published>2025-11-18T19:00:48+00:00</published><updated>2025-11-18T19:00:48+00:00</updated><id>https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Google released Gemini 3 Pro today. Here's &lt;a href="https://blog.google/products/gemini/gemini-3/"&gt;the announcement from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu&lt;/a&gt;, their &lt;a href="https://blog.google/technology/developers/gemini-3-developers/"&gt;developer blog announcement from Logan Kilpatrick&lt;/a&gt;, the &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;Gemini 3 Pro Model Card&lt;/a&gt;, and their &lt;a href="https://blog.google/products/gemini/gemini-3-collection/"&gt;collection of 11 more articles&lt;/a&gt;. It's a big release!&lt;/p&gt;
&lt;p&gt;I had a few days of preview access to this model via &lt;a href="https://aistudio.google.com/"&gt;AI Studio&lt;/a&gt;. The best way to describe it is that it's &lt;strong&gt;Gemini 2.5 upgraded to match the leading rival models&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Gemini 3 has the same underlying characteristics as Gemini 2.5. The knowledge cutoff is the same (January 2025). It accepts 1 million input tokens, can output up to 64,000 tokens, and has multimodal inputs across text, images, audio, and video.&lt;/p&gt;
&lt;h4 id="benchmarks"&gt;Benchmarks&lt;/h4&gt;
&lt;p&gt;Google's own reported numbers (in &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;the model card&lt;/a&gt;) show it scoring slightly higher than Claude Sonnet 4.5 and GPT-5.1 on most of the standard benchmarks. As always I'm waiting for independent confirmation, but I have no reason to believe those numbers are inaccurate.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg" alt="Table of benchmark numbers, described in full below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pricing"&gt;Pricing&lt;/h4&gt;
&lt;p&gt;In terms of pricing it's a little more expensive than Gemini 2.5 but still cheaper than Claude Sonnet 4.5. Here's how it fits in with those other leading models:&lt;/p&gt;
&lt;center&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
      &lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.1&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
      &lt;td&gt;$10.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $1.25&lt;br /&gt;
        &amp;gt; 200k tokens: $2.50
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $10.00&lt;br /&gt;
        &amp;gt; 200k tokens: $15.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Gemini 3 Pro&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $2.00&lt;br /&gt;
        &amp;gt; 200k tokens: $4.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $12.00&lt;br /&gt;
        &amp;gt; 200k tokens: $18.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.5&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $3.00&lt;br /&gt;
        &amp;gt; 200k tokens: $6.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $15.00&lt;br /&gt;
        &amp;gt; 200k tokens: $22.50
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.1&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
      &lt;td&gt;$75.00&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/center&gt;
&lt;h4 id="trying-it-out-against-a-complex-image"&gt;Trying it out against a complex image&lt;/h4&gt;
&lt;p&gt;That screenshot of the benchmarks from above looked like a good test for Gemini 3's multimodal support. I fed it that image URL and asked it to generate alt text for the image:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg 'Alt text for this image, include all figures and make them comprehensible to a screen reader user'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's what I got back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A comprehensive table comparing four AI models across various performance benchmarks. The columns listed are: Benchmark Name, Description, Gemini 3 Pro (highlighted in blue), Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Humanity's Last Exam (Academic reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 37.5%, Gemini 2.5 Pro 21.6%, Claude Sonnet 4.5 13.7%, GPT-5.1 26.5%.&lt;/li&gt;
&lt;li&gt;With search and code execution: Gemini 3 Pro 45.8% (others have no data).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ARC-AGI-2 (Visual reasoning puzzles; ARC Prize Verified)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 31.1%, Gemini 2.5 Pro 4.9%, Claude Sonnet 4.5 13.6%, GPT-5.1 17.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPQA Diamond (Scientific knowledge; No tools)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.9%, Gemini 2.5 Pro 86.4%, Claude Sonnet 4.5 83.4%, GPT-5.1 88.1%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;AIME 2025 (Mathematics)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 95.0%, Gemini 2.5 Pro 88.0%, Claude Sonnet 4.5 87.0%, GPT-5.1 94.0%.&lt;/li&gt;
&lt;li&gt;With code execution: Gemini 3 Pro 100%, Claude Sonnet 4.5 100%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MathArena Apex (Challenging Math Contest problems)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 23.4%, Gemini 2.5 Pro 0.5%, Claude Sonnet 4.5 1.6%, GPT-5.1 1.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMU-Pro (Multimodal understanding and reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.0%, Gemini 2.5 Pro 68.0%, Claude Sonnet 4.5 68.0%, GPT-5.1 76.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ScreenSpot-Pro (Screen understanding)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Sonnet 4.5 36.2%, GPT-5.1 3.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;CharXiv Reasoning (Information synthesis from complex charts)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.4%, Gemini 2.5 Pro 69.6%, Claude Sonnet 4.5 68.5%, GPT-5.1 69.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;OmniDocBench 1.5 (OCR; Overall Edit Distance, lower is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 0.115, Gemini 2.5 Pro 0.145, Claude Sonnet 4.5 0.145, GPT-5.1 0.147.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Video-MMMU (Knowledge acquisition from videos)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 87.6%, Gemini 2.5 Pro 83.6%, Claude Sonnet 4.5 77.8%, GPT-5.1 80.4%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;LiveCodeBench Pro (Competitive coding problems; Elo Rating, higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 2,439; Gemini 2.5 Pro 1,775; Claude Sonnet 4.5 1,418; GPT-5.1 2,243.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0 (Agentic terminal coding; Terminus-2 agent)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 54.2%, Gemini 2.5 Pro 32.6%, Claude Sonnet 4.5 42.8%, GPT-5.1 47.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SWE-Bench Verified (Agentic coding; Single attempt)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 76.2%, Gemini 2.5 Pro 59.6%, Claude Sonnet 4.5 77.2%, GPT-5.1 76.3%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;t2-bench (Agentic tool use)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 85.4%, Gemini 2.5 Pro 54.9%, Claude Sonnet 4.5 84.7%, GPT-5.1 80.2%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Vending-Bench 2 (Long-horizon agentic tasks; Net worth (mean), higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro $5,478.16; Gemini 2.5 Pro $573.64; Claude Sonnet 4.5 $3,838.74; GPT-5.1 $1,473.43.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;FACTS Benchmark Suite (Held out internal grounding, parametric, MM, and search retrieval benchmarks)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 70.5%, Gemini 2.5 Pro 63.4%, Claude Sonnet 4.5 50.4%, GPT-5.1 50.8%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SimpleQA Verified (Parametric knowledge)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.1%, Gemini 2.5 Pro 54.5%, Claude Sonnet 4.5 29.3%, GPT-5.1 34.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMLU (Multilingual Q&amp;amp;A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.8%, Gemini 2.5 Pro 89.5%, Claude Sonnet 4.5 89.1%, GPT-5.1 91.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Global PIQA (Commonsense reasoning across 100 Languages and Cultures)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 93.4%, Gemini 2.5 Pro 91.5%, Claude Sonnet 4.5 90.1%, GPT-5.1 90.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MRCR v2 (8-needle) (Long context performance)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;128k (average): Gemini 3 Pro 77.0%, Gemini 2.5 Pro 58.0%, Claude Sonnet 4.5 47.1%, GPT-5.1 61.6%.&lt;/li&gt;
&lt;li&gt;1M (pointwise): Gemini 3 Pro 26.3%, Gemini 2.5 Pro 16.4%, Claude Sonnet 4.5 (not supported), GPT-5.1 (not supported).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have not checked every line of this but a loose spot-check looks accurate to me.&lt;/p&gt;
&lt;p&gt;That prompt took 1,105 input and 3,901 output tokens, at a cost of &lt;a href="https://www.llm-prices.com/#it=1105&amp;amp;cit=3901&amp;amp;ot=3901&amp;amp;ic=2&amp;amp;oc=12&amp;amp;sel=gemini-3-pro-preview"&gt;5.6824 cents&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran this follow-up prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -c 'Convert to JSON'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can see &lt;a href="https://gist.github.com/simonw/ea7d52706557528e7eb3912cdf9250b0#response-1"&gt;the full output here&lt;/a&gt;, which starts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"columns"&lt;/span&gt;: [
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Benchmark&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Description&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 3 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 2.5 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Claude Sonnet 4.5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;GPT-5.1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    ]
  },
  &lt;span class="pl-ent"&gt;"benchmarks"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Humanity's Last Exam&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Academic reasoning&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"sub_results"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;No tools&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;37.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;21.6%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;13.7%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;26.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        },
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;With search and code execution&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45.8%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;
        }
      ]
    },&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="analyzing-a-city-council-meeting"&gt;Analyzing a city council meeting&lt;/h4&gt;
&lt;p&gt;To try it out against an audio file I extracted the 3h33m of audio from the video &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0"&gt;Half Moon Bay City Council Meeting - November 4, 2025&lt;/a&gt;. I used &lt;code&gt;yt-dlp&lt;/code&gt; to get that audio:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;yt-dlp -x --audio-format m4a &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://www.youtube.com/watch?v=qgJ7x7R6gy0&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That gave me a 74MB m4a file, which I ran through Gemini 3 Pro like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a /tmp/HMBCC\ 11⧸4⧸25\ -\ Half\ Moon\ Bay\ City\ Council\ Meeting\ -\ November\ 4,\ 2025\ \[qgJ7x7R6gy0\].m4a 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That failed with an "Internal error encountered" message, so I shrank the file down to a more manageable 38MB using &lt;code&gt;ffmpeg&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ffmpeg -i &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -ac 1 -ar 22050 -c:a aac -b:a 24k &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB_compressed.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
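&lt;p&gt;A rough sanity check on that 38MB figure, sketched in Python. It assumes the encoder hits the 24k average bitrate target across the full 3h33m and that container overhead is negligible; the function name is illustrative only.&lt;/p&gt;

```python
# Rough size estimate for audio encoded at a target average bitrate.
# Assumption: the encoder hits the target exactly and container
# overhead is ignored, so treat the result as a ballpark figure.
def estimated_size_mb(hours, minutes, bitrate_kbps):
    duration_seconds = hours * 3600 + minutes * 60
    total_bits = duration_seconds * bitrate_kbps * 1000
    return total_bits / 8 / 1_000_000


# 3h33m of audio at the 24k bitrate passed to ffmpeg via -b:a
size_mb = estimated_size_mb(hours=3, minutes=33, bitrate_kbps=24)
print(f"{size_mb:.1f} MB")  # prints "38.3 MB"
```

&lt;p&gt;The estimate lands close to the size of the compressed file, which suggests the bitrate setting accounts for most of the reduction.&lt;/p&gt;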
&lt;p&gt;Then I ran it again like this (for some reason I had to use &lt;code&gt;--attachment-type&lt;/code&gt; this time):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview --attachment-type /tmp/HMB_compressed.m4a 'audio/aac' 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This time it worked! The &lt;a href="https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943314"&gt;full output is here&lt;/a&gt;, but it starts like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here is the transcript of the Half Moon Bay City Council meeting.&lt;/p&gt;
&lt;h4&gt;Meeting Outline&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;1. Call to Order, Updates, and Public Forum&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Mayor Brownstone calls the meeting to order. City Manager Chidester reports no reportable actions from the closed session. Announcements are made regarding food insecurity volunteers and the Diwali celebration. During the public forum, Councilmember Penrose (speaking as a citizen) warns against autocracy. Citizens speak regarding lease agreements, downtown maintenance, local music events, and homelessness outreach statistics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:00:00 - 00:13:25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Matthew Chidester, Irma Acosta, Deborah Penrose, Jennifer Moore, Sandy Vella, Joaquin Jimenez, Anita Rees.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Consent Calendar&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; The Council approves minutes from previous meetings and a resolution authorizing a licensing agreement for Seahorse Ranch. Councilmember Johnson corrects a pull request regarding abstentions on minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:13:25 - 00:15:15&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Councilmember Johnson, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Nagengast.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Ordinance Introduction: Commercial Vitality (Item 9A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents a new ordinance to address neglected and empty commercial storefronts, establishing maintenance and display standards. Councilmembers discuss enforcement mechanisms, window cleanliness standards, and the need for objective guidance documents to avoid subjective enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:15:15 - 00:30:45&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Karen Decker, Councilmember Johnson, Councilmember Nagengast, Vice Mayor Ruddick, Councilmember Penrose.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Ordinance Introduction: Building Standards &amp;amp; Electrification (Item 9B)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff introduces updates to the 2025 Building Code. A major change involves repealing the city's all-electric building requirement due to the 9th Circuit Court ruling (&lt;em&gt;California Restaurant Association v. City of Berkeley&lt;/em&gt;). &lt;strong&gt;Public speaker Mike Ferreira expresses strong frustration and disagreement with "unelected state agencies" forcing the City to change its ordinances.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:30:45 - 00:45:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Ben Corrales, Keith Weiner, Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;5. Housing Element Update &amp;amp; Adoption (Item 9C)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents the 5th draft of the Housing Element, noting State HCD requirements to modify ADU allocations and place a measure on the ballot regarding the "Measure D" growth cap. &lt;strong&gt;There is significant disagreement from Councilmembers Ruddick and Penrose regarding the State's requirement to hold a ballot measure.&lt;/strong&gt; Public speakers debate the enforceability of Measure D. &lt;strong&gt;Mike Ferreira interrupts the vibe to voice strong distaste for HCD's interference in local law.&lt;/strong&gt; The Council votes to adopt the element but strikes the language committing to a ballot measure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:45:00 - 01:05:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Leslie (Staff), Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Johnson.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;h4&gt;Transcript&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [00:00:00]
Good evening everybody and welcome to the November 4th Half Moon Bay City Council meeting. As a reminder, we have Spanish interpretation services available in person and on Zoom.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Victor Hernandez (Interpreter)&lt;/strong&gt; [00:00:35]
Thank you, Mr. Mayor, City Council, all city staff, members of the public. &lt;em&gt;[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]&lt;/em&gt; Thank you very much.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Those first two lines of the transcript already illustrate something interesting here: Gemini 3 Pro chose NOT to include the exact text of the Spanish instructions, instead summarizing them as "[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]".&lt;/p&gt;
&lt;p&gt;I haven't spot-checked the entire 3h33m meeting, but I've confirmed that the timestamps do not line up. The transcript closes like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [01:04:00]
Meeting adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That actually happens &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0&amp;amp;t=3h31m5s"&gt;at 3h31m5s&lt;/a&gt; and the mayor says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay. Well, thanks everybody, members of the public for participating. Thank you for staff. Thank you to fellow council members. This meeting is now adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm disappointed about the timestamps, since mismatches there make it much harder to jump to the right point and confirm that the summarized transcript is an accurate representation of what was said.&lt;/p&gt;
&lt;p&gt;This took 320,087 input tokens and 7,870 output tokens, for a total cost of &lt;a href="https://www.llm-prices.com/#it=320087&amp;amp;ot=7870&amp;amp;ic=4&amp;amp;oc=18"&gt;$1.42&lt;/a&gt;.&lt;/p&gt;
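&lt;p&gt;As a quick sanity check on that figure, here's a minimal sketch of the arithmetic. The per-million-token prices are assumptions read off the &lt;code&gt;ic=4&lt;/code&gt; and &lt;code&gt;oc=18&lt;/code&gt; parameters in the link above, and the function name is just illustrative:&lt;/p&gt;

```python
# Token counts reported in the post.
INPUT_TOKENS = 320_087
OUTPUT_TOKENS = 7_870

# Assumed prices in USD per million tokens, taken from the ic=4 and
# oc=18 parameters in the llm-prices.com link.
INPUT_PRICE_PER_MILLION = 4.00
OUTPUT_PRICE_PER_MILLION = 18.00


def cost_usd(tokens, price_per_million):
    """Return the dollar cost of `tokens` at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


input_cost = cost_usd(INPUT_TOKENS, INPUT_PRICE_PER_MILLION)
output_cost = cost_usd(OUTPUT_TOKENS, OUTPUT_PRICE_PER_MILLION)
total = input_cost + output_cost

# prints: input $1.2803 + output $0.1417 = $1.42
print(f"input ${input_cost:.4f} + output ${output_cost:.4f} = ${total:.2f}")
```

&lt;p&gt;Almost all of the cost comes from the input side, which makes sense for a multi-hour audio file that produced a comparatively short summarized transcript.&lt;/p&gt;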
&lt;h4 id="and-a-new-pelican-benchmark"&gt;And a new pelican benchmark&lt;/h4&gt;
&lt;p&gt;Gemini 3 Pro has a new concept of a "thinking level" which can be set to low or high (and defaults to high). I tried my classic &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;Generate an SVG of a pelican riding a bicycle&lt;/a&gt; prompt at both levels.&lt;/p&gt;
&lt;p&gt;Here's low - Gemini decided to add a jaunty little hat (with a comment &lt;a href="https://gist.github.com/simonw/70d56ba39b7cbb44985d2384004fc4a0#response"&gt;in the SVG&lt;/a&gt; that says &lt;code&gt;&amp;lt;!-- Hat (Optional Fun Detail) --&amp;gt;&lt;/code&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-low.png" alt="The pelican is wearing a blue hat. It has a good beak. The bicycle is a little bit incorrect but generally a good effort." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's high. This is genuinely an excellent pelican, and the bicycle frame is at least the correct shape:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-high.png" alt="The pelican is not wearing a hat. It has a good beak. The bicycle is accurate and well-drawn." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Honestly though, my pelican benchmark is beginning to feel a little bit too basic. I decided to upgrade it. Here's v2 of the benchmark, which I plan to use going forward:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For reference, here's a photo I took of a California brown pelican recently (sadly without a bicycle):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/breeding-plumage.jpg" alt="A glorious California brown pelican perched on a rock by the water. It has a yellow tint to its head and a red spot near its throat." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's Gemini 3 Pro's &lt;a href="https://gist.github.com/simonw/2b9930ae1ce6f3f5e9cfe3cb31ec0c0a"&gt;attempt&lt;/a&gt; at high thinking level for that new prompt:&lt;/p&gt;
&lt;p id="advanced-pelican"&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-breeding-pelican-high.png" alt="It's clearly a pelican. It has all of the requested features. It looks a bit abstract though." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And for good measure, here's that same prompt &lt;a href="https://gist.github.com/simonw/7a655ebe42f3d428d2ea5363dad8067c"&gt;against GPT-5.1&lt;/a&gt; - which produced this dumpy little fellow:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-1-breeding-pelican.png" alt="The pelican is very round. Its body overlaps much of the bicycle. It has a lot of dorky charisma." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And Claude Sonnet 4.5, which &lt;a href="https://gist.github.com/simonw/3296af92e4328dd4740385e6a4a2ac35"&gt;didn't do quite as well&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4-5-breeding-pelican.png" alt="Oh dear. It has all of the requested components, but the bicycle is a bit wrong and the pelican is arranged in a very awkward shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;None of the models seem to have caught on to the crucial detail that the California brown pelican is not, in fact, brown.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="gemini"/><category term="llm"/><category term="llm-reasoning"/><category term="pelican-riding-a-bicycle"/><category term="llm-pricing"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="google"/><category term="generative-ai"/></entry><entry><title>datasette-demo-for-llm-accountant 0.1a0</title><link href="https://simonwillison.net/2025/Nov/18/datasette-demo-for-llm-accountant/#atom-tag" rel="alternate"/><published>2025-11-18T01:27:10+00:00</published><updated>2025-11-18T01:27:10+00:00</updated><id>https://simonwillison.net/2025/Nov/18/datasette-demo-for-llm-accountant/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-demo-for-llm-accountant/releases/tag/0.1a0"&gt;datasette-demo-for-llm-accountant 0.1a0&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;Demo app for datasette-llm-accountant&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>datasette-llm-accountant 0.1a0</title><link href="https://simonwillison.net/2025/Nov/18/datasette-llm-accountant/#atom-tag" rel="alternate"/><published>2025-11-18T01:23:03+00:00</published><updated>2025-11-18T01:23:03+00:00</updated><id>https://simonwillison.net/2025/Nov/18/datasette-llm-accountant/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm-accountant/releases/tag/0.1a0"&gt;datasette-llm-accountant 0.1a0&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;LLM accounting for Datasette&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>llm-anthropic 0.22</title><link href="https://simonwillison.net/2025/Nov/15/llm-anthropic-022/#atom-tag" rel="alternate"/><published>2025-11-15T20:48:38+00:00</published><updated>2025-11-15T20:48:38+00:00</updated><id>https://simonwillison.net/2025/Nov/15/llm-anthropic-022/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.22"&gt;llm-anthropic 0.22&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New release of my &lt;code&gt;llm-anthropic&lt;/code&gt; plugin:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Support for Claude's new &lt;a href="https://claude.com/blog/structured-outputs-on-the-claude-developer-platform"&gt;structured outputs&lt;/a&gt; feature for Sonnet 4.5 and Opus 4.1. &lt;a href="https://github.com/simonw/llm-anthropic/issues/54"&gt;#54&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support for the &lt;a href="https://docs.claude.com/en/docs/agents-and-tools/tool-use/web-search-tool"&gt;web search tool&lt;/a&gt; using &lt;code&gt;-o web_search 1&lt;/code&gt; - thanks &lt;a href="https://github.com/nmpowell"&gt;Nick Powell&lt;/a&gt; and &lt;a href="https://github.com/statico"&gt;Ian Langworth&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-anthropic/issues/30"&gt;#30&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The plugin previously powered &lt;a href="https://llm.datasette.io/en/stable/schemas.html"&gt;LLM schemas&lt;/a&gt; using &lt;a href="https://github.com/simonw/llm-anthropic/blob/0.22/llm_anthropic.py#L692-L700"&gt;this tool-call based workaround&lt;/a&gt;. That code is still used for Anthropic's older models.&lt;/p&gt;
&lt;p&gt;I also figured out &lt;code&gt;uv&lt;/code&gt; recipes for running the plugin's test suite in an isolated environment, which are now &lt;a href="https://github.com/simonw/llm-anthropic/blob/0.22/Justfile"&gt;baked into the new Justfile&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="generative-ai"/><category term="projects"/><category term="ai"/><category term="llms"/><category term="uv"/><category term="python"/></entry><entry><title>Introducing GPT-5.1 for developers</title><link href="https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag" rel="alternate"/><published>2025-11-13T23:59:35+00:00</published><updated>2025-11-13T23:59:35+00:00</updated><id>https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-1-for-developers/"&gt;Introducing GPT-5.1 for developers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI announced GPT-5.1 yesterday, calling it &lt;a href="https://openai.com/index/gpt-5-1/"&gt;a smarter, more conversational ChatGPT&lt;/a&gt;. Today they've added it to their API.&lt;/p&gt;
&lt;p&gt;We actually got four new models today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1"&gt;gpt-5.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-chat-latest"&gt;gpt-5.1-chat-latest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex"&gt;gpt-5.1-codex&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex-mini"&gt;gpt-5.1-codex-mini&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are a lot of details to absorb here.&lt;/p&gt;
&lt;p&gt;GPT-5.1 introduces a new reasoning effort level called "none" (the previous options were minimal, low, medium, and high) - and "none" is the new default.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This makes the model behave like a non-reasoning model for latency-sensitive use cases, with the high intelligence of GPT‑5.1 and added bonus of performant tool-calling. Relative to GPT‑5 with 'minimal' reasoning, GPT‑5.1 with no reasoning is better at parallel tool calling (which itself increases end-to-end task completion speed), coding tasks, following instructions, and using search tools---and supports &lt;a href="https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses"&gt;web search⁠&lt;/a&gt; in our API platform.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When you DO enable thinking you get to benefit from a new feature called "adaptive reasoning":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On straightforward tasks, GPT‑5.1 spends fewer tokens thinking, enabling snappier product experiences and lower token bills. On difficult tasks that require extra thinking, GPT‑5.1 remains persistent, exploring options and checking its work in order to maximize reliability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Another notable new feature for 5.1 is &lt;a href="https://platform.openai.com/docs/guides/prompt-caching#extended-prompt-cache-retention"&gt;extended prompt cache retention&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours. Extended Prompt Caching works by offloading the key/value tensors to GPU-local storage when memory is full, significantly increasing the storage capacity available for caching.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To enable this set &lt;code&gt;"prompt_cache_retention": "24h"&lt;/code&gt; in the API call. Weirdly there's no price increase involved with this at all. I &lt;a href="https://x.com/simonw/status/1989104422832738305"&gt;asked about that&lt;/a&gt; and OpenAI's Steven Heidel &lt;a href="https://x.com/stevenheidel/status/1989113407149314199"&gt;replied&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;with 24h prompt caching we move the caches from gpu memory to gpu-local storage. that storage is not free, but we made it free since it moves capacity from a limited resource (GPUs) to a more abundant resource (storage). then we can serve more traffic overall!&lt;/p&gt;
&lt;/blockquote&gt;
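&lt;p&gt;To show where that setting goes, here's a hedged sketch of a JSON request body built in Python. Only the &lt;code&gt;gpt-5.1&lt;/code&gt; model name and the &lt;code&gt;prompt_cache_retention&lt;/code&gt; value come from the text above; the &lt;code&gt;input&lt;/code&gt; field and its contents are assumptions for illustration, and nothing is sent over the network:&lt;/p&gt;

```python
import json

# Hypothetical request body. The model name and the
# prompt_cache_retention value are from the post; the "input" key and
# its text are illustrative assumptions, so check the API reference
# for the exact shape your endpoint expects.
payload = {
    "model": "gpt-5.1",
    "input": "Summarize this long, frequently reused document...",
    "prompt_cache_retention": "24h",
}

# Serialize the body as it would appear in an HTTP request.
body = json.dumps(payload, indent=2)
print(body)
```

&lt;p&gt;Extended retention should pay off most when many requests share the same long prefix, since that prefix is what gets cached.&lt;/p&gt;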
&lt;p&gt;The most interesting documentation I've seen so far is in the new &lt;a href="https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting_guide"&gt;5.1 cookbook&lt;/a&gt;, which also includes details of the new &lt;code&gt;shell&lt;/code&gt; and &lt;code&gt;apply_patch&lt;/code&gt; built-in tools. The &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/gpt-5/apply_patch.py"&gt;apply_patch.py implementation&lt;/a&gt; is worth a look, especially if you're interested in the advancing state-of-the-art of file editing tools for LLMs.&lt;/p&gt;
&lt;p&gt;I'm still working on &lt;a href="https://github.com/simonw/llm/issues/1300"&gt;integrating the new models into LLM&lt;/a&gt;. The Codex models are Responses-API-only.&lt;/p&gt;
&lt;p&gt;I got this pelican for GPT-5.1 default (no thinking):&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle wheels have no spokes at all, the pelican is laying quite flat on it" src="https://static.simonwillison.net/static/2025/gpt-5.1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;And this one with reasoning effort set to high:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This bicycle has four spokes per wheel, and the pelican is sitting more upright" src="https://static.simonwillison.net/static/2025/gpt-5.1-high-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;These actually feel like a &lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;regression from GPT-5&lt;/a&gt; to me. The bicycles have fewer spokes!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;&lt;/p&gt;



</summary><category term="november-2025-inflection"/><category term="llm"/><category term="openai"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="ai"/><category term="llms"/><category term="llm-release"/><category term="gpt-5"/><category term="generative-ai"/><category term="gpt-codex"/></entry><entry><title>datasette-llm-usage 0.1a2</title><link href="https://simonwillison.net/2025/Nov/8/datasette-llm-usage/#atom-tag" rel="alternate"/><published>2025-11-08T01:41:13+00:00</published><updated>2025-11-08T01:41:13+00:00</updated><id>https://simonwillison.net/2025/Nov/8/datasette-llm-usage/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-llm-usage/releases/tag/0.1a2"&gt;datasette-llm-usage 0.1a2&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;Track usage of LLM tokens in a SQLite table&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="llm"/><category term="datasette"/></entry><entry><title>datasette-enrichments-llm 0.1a2</title><link href="https://simonwillison.net/2025/Nov/8/datasette-enrichments-llm/#atom-tag" rel="alternate"/><published>2025-11-08T01:39:12+00:00</published><updated>2025-11-08T01:39:12+00:00</updated><id>https://simonwillison.net/2025/Nov/8/datasette-enrichments-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-enrichments-llm/releases/tag/0.1a2"&gt;datasette-enrichments-llm 0.1a2&lt;/a&gt;&lt;/p&gt;
    &lt;p&gt;Enrich data by prompting LLMs&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="datasette"/><category term="llm"/></entry><entry><title>Kimi K2 Thinking</title><link href="https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag" rel="alternate"/><published>2025-11-06T23:53:06+00:00</published><updated>2025-11-06T23:53:06+00:00</updated><id>https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/moonshotai/Kimi-K2-Thinking"&gt;Kimi K2 Thinking&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Chinese AI lab Moonshot's Kimi K2 established itself as one of the largest open weight models - 1 trillion parameters - &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;back in July&lt;/a&gt;. They've now released the Thinking version, also a trillion parameters (MoE, 32B active) and also under their custom modified (so &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/#kimi-license"&gt;not quite open source&lt;/a&gt;) MIT license.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one is only 594GB on Hugging Face - Kimi K2 was 1.03TB - which I think is due to the new INT4 quantization. This makes the model both cheaper and faster to host.&lt;/p&gt;
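&lt;p&gt;A back-of-the-envelope check supports that guess. This sketch assumes exactly one trillion parameters all stored at the same width, and it assumes the earlier release used about 8 bits per parameter, which is my assumption and not something stated above:&lt;/p&gt;

```python
PARAMETERS = 1_000_000_000_000  # "1 trillion parameters", per the post


def weights_size_gb(parameters, bits_per_parameter):
    """Approximate raw weight size in decimal gigabytes."""
    total_bytes = parameters * bits_per_parameter / 8
    return total_bytes / 1_000_000_000


# INT4 is 4 bits per parameter; 8 bits is an assumed baseline.
int4_gb = weights_size_gb(PARAMETERS, 4)
eight_bit_gb = weights_size_gb(PARAMETERS, 8)

# prints: INT4: 500 GB, 8-bit: 1000 GB
print(f"INT4: {int4_gb:.0f} GB, 8-bit: {eight_bit_gb:.0f} GB")
```

&lt;p&gt;Those idealized numbers land a little under the 594GB and 1.03TB figures, which is what you'd expect if some tensors are kept at higher precision, but the roughly 2x ratio matches.&lt;/p&gt;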
&lt;p&gt;So far the only people hosting it are Moonshot themselves. I tried it out both via &lt;a href="https://platform.moonshot.ai"&gt;their own API&lt;/a&gt; and via &lt;a href="https://openrouter.ai/moonshotai/kimi-k2-thinking/providers"&gt;the OpenRouter proxy to it&lt;/a&gt;, using the &lt;a href="https://github.com/ghostofpokemon/llm-moonshot"&gt;llm-moonshot&lt;/a&gt; plugin (by NickMystic) and my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin respectively.&lt;/p&gt;
&lt;p&gt;The buzz around this model so far is very positive. Could this be the first open weight model that's competitive with the latest from OpenAI and Anthropic, especially for long-running agentic tool call sequences?&lt;/p&gt;
&lt;p&gt;Moonshot AI's &lt;a href="https://moonshotai.github.io/Kimi-K2/thinking.html"&gt;self-reported benchmark scores&lt;/a&gt; show K2 Thinking beating the top OpenAI and Anthropic models (GPT-5 and Sonnet 4.5 Thinking) at "Agentic Reasoning" and "Agentic Search" but not quite top for "Coding":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison bar chart showing agentic reasoning, search, and coding benchmark performance scores across three AI systems (K, OpenAI, and AI) on tasks including Humanity's Last Exam (44.9, 41.7, 32.0), BrowseComp (60.2, 54.9, 24.1), Seal-0 (56.3, 51.4, 53.4), SWE-Multilingual (61.1, 55.3, 68.0), SWE-bench Verified (71.3, 74.9, 77.2), and LiveCodeBench V6 (83.1, 87.0, 64.0), with category descriptions including &amp;quot;Expert-level questions across subjects&amp;quot;, &amp;quot;Agentic search &amp;amp; browsing&amp;quot;, &amp;quot;Real-world latest information collection&amp;quot;, &amp;quot;Agentic coding&amp;quot;, and &amp;quot;Competitive programming&amp;quot;." src="https://static.simonwillison.net/static/2025/kimi-k2-thinking-benchmarks.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I ran a couple of pelican tests:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-moonshot
llm keys set moonshot # paste key
llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5 described this as: Cartoon illustration of a white duck or goose with an orange beak and gray wings riding a bicycle with a red frame and light blue wheels against a light blue background." src="https://static.simonwillison.net/static/2025/k2-thinking.png" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter # paste key
llm -m openrouter/moonshotai/kimi-k2-thinking \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5: Minimalist cartoon illustration of a white bird with an orange beak and feet standing on a triangular-framed penny-farthing style bicycle with gray-hubbed wheels and a propeller hat on its head, against a light background with dotted lines and a brown ground line." src="https://static.simonwillison.net/static/2025/k2-thinking-openrouter.png" /&gt;&lt;/p&gt;
&lt;p&gt;Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1986541785511043536"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CNBC quoted a source who &lt;a href="https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-releases-new-ai-model-kimi-k2-thinking.html"&gt;provided the training price&lt;/a&gt; for the model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Kimi K2 Thinking model cost $4.6 million to train, according to a source familiar with the matter. [...] CNBC was unable to independently verify the DeepSeek or Kimi figures.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MLX developer Awni Hannun &lt;a href="https://x.com/awnihannun/status/1986601104130646266"&gt;got it working&lt;/a&gt; on two 512GB M3 Ultra Mac Studios:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new 1 Trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format - no loss in quality!&lt;/p&gt;
&lt;p&gt;The model was quantization aware trained (qat) at int4.&lt;/p&gt;
&lt;p&gt;Here it generated ~3500 tokens at 15 toks/sec using pipeline-parallelism in mlx-lm&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://huggingface.co/mlx-community/Kimi-K2-Thinking"&gt;the 658GB mlx-community model&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm"/><category term="llm-reasoning"/><category term="pelican-riding-a-bicycle"/><category term="ai"/><category term="ai-in-china"/><category term="llms"/><category term="moonshot"/><category term="llm-release"/><category term="generative-ai"/><category term="openrouter"/><category term="kimi"/><category term="artificial-analysis"/><category term="mlx"/></entry></feed>