<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llm-pricing</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llm-pricing.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-06-09T23:59:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>Initial impressions of Claude Fable 5</title><link href="https://simonwillison.net/2026/Jun/9/claude-fable-5/#atom-tag" rel="alternate"/><published>2026-06-09T23:59:54+00:00</published><updated>2026-06-09T23:59:54+00:00</updated><id>https://simonwillison.net/2026/Jun/9/claude-fable-5/#atom-tag</id><summary type="html">
    &lt;p&gt;I didn't have early access to today's &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5"&gt;Claude Fable 5&lt;/a&gt; release, but I've spent the past ~5.5 hours putting it through its paces. My initial impressions are that this is something of a &lt;em&gt;beast&lt;/em&gt;. It's slow, expensive and has been quite happily churning through everything I've thrown at it so far. As is frequently the case with current frontier models the challenge is finding tasks that it can't do.&lt;/p&gt;
&lt;p&gt;First, let's review the key characteristics.&lt;/p&gt;
&lt;p&gt;Anthropic claim that &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5"&gt;Claude Fable 5&lt;/a&gt; offers the same performance as Claude Mythos 5, except with much more strict guardrails in place to prevent it being used for harmful things. Those guardrails trigger often enough that the Claude API has new mechanisms for letting you know when you hit them, and even has a &lt;a href="https://platform.claude.com/docs/en/build-with-claude/refusals-and-fallback"&gt;new option&lt;/a&gt; to request it falls back to another model automatically if something gets rejected.&lt;/p&gt;
&lt;p&gt;Claude Mythos 5 is out today as well, &lt;a href="https://platform.claude.com/docs/en/about-claude/models/introducing-claude-fable-5-and-claude-mythos-5"&gt;Anthropic say it&lt;/a&gt; "Shares Claude Fable 5's capabilities without the safety classifiers".&lt;/p&gt;
&lt;p&gt;The models have a 1 million token context window, 128,000 maximum output tokens and a knowledge cut-off date of January 2026.&lt;/p&gt;
&lt;p&gt;They are priced at twice the price of Claude Opus 4.5/4.6/4.7/4.8: $10/million input tokens and $50/million output tokens. There's no increase in price for longer context usage.&lt;/p&gt;
&lt;p&gt;Other than that the &lt;a href="https://platform.claude.com/docs/en/about-claude/models/introducing-claude-fable-5-and-claude-mythos-5"&gt;upgrade guide&lt;/a&gt; is substantially thinner than &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-8"&gt;the similar guide for Opus 4.8&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="the-big-model-smell"&gt;The big model smell&lt;/h4&gt;
&lt;p&gt;The best way to describe Fable is that it feels &lt;em&gt;big&lt;/em&gt;. Not just in terms of speed and cost, but also in how much it knows.&lt;/p&gt;
&lt;p&gt;Here's an example prompt, which I used to compare Fable's knowledge with Opus 4.8 (typo included):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;List all of Simon Willion's open source projects, most recent first, each with a rough date of when they were first released&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Opus 4.8's reply &lt;a href="https://gist.github.com/simonw/e77abf645156ade8c5c610f524226dcb#response-1"&gt;started like this&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I don't have a reliable, comprehensive, and date-verified list of all of Simon Willison's open source projects, and I want to be honest about that rather than risk giving you inaccurate dates or fabricated entries.&lt;/p&gt;
&lt;p&gt;Here's what I can tell you with reasonable confidence:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Well-known projects by Simon Willison:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;LLM&lt;/strong&gt; (~2023) – A command-line tool and Python library for interacting with large language models.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Datasette&lt;/strong&gt; (~2017) – A tool for exploring and publishing data, probably his most prominent recent project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;sqlite-utils&lt;/strong&gt; (~2018-2019) – A Python library and CLI tool for manipulating SQLite databases, often used alongside Datasette.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Django&lt;/strong&gt; (2005) – He is one of the co-creators of the Django web framework, along with Adrian Holovaty, originally developed at the Lawrence Journal-World newspaper.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;He also maintains a large ecosystem of &lt;strong&gt;Datasette plugins&lt;/strong&gt; and many smaller utilities (often with names like &lt;code&gt;shot-scraper&lt;/code&gt;, various &lt;code&gt;*-to-sqlite&lt;/code&gt; tools for importing data, etc.).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's not bad. Here's &lt;a href="https://gist.github.com/simonw/e77abf645156ade8c5c610f524226dcb#response"&gt;the start of the response&lt;/a&gt; for the exact same prompt from Fable 5 - both models were running without access to search:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I think you mean &lt;strong&gt;Simon Willison&lt;/strong&gt; — the prolific open source developer, co-creator of Django, and creator of Datasette.&lt;/p&gt;
&lt;p&gt;An important caveat upfront: a truly complete list isn't feasible. Simon Willison is famously prolific — he has published &lt;strong&gt;hundreds&lt;/strong&gt; of repositories (Datasette alone has 100+ plugins, many written by him), and my knowledge has a cutoff date, so I'll be missing his most recent work. Below are his &lt;strong&gt;notable&lt;/strong&gt; projects with approximate initial release dates, most recent first:&lt;/p&gt;
&lt;h2 id="notable-projects-approximate-dates-"&gt;Notable projects (approximate dates)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;files-to-prompt&lt;/strong&gt; (April 2024) – concatenates files into a single prompt for LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;datasette-extract&lt;/strong&gt; (2024) – plugin for extracting structured data using LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt; (May–June 2023) – CLI tool and Python library for interacting with large language models, plus a large ecosystem of plugins (llm-gpt4all, llm-mistral, llm-claude, etc., 2023 onward)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;symbex&lt;/strong&gt; (June 2023) – search Python code for symbols&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ttok&lt;/strong&gt; and &lt;strong&gt;strip-tags&lt;/strong&gt; (May 2023) – CLI tools for token counting and HTML cleanup for LLM pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;datasette-lite&lt;/strong&gt; (May 2022) – Datasette running entirely in the browser via WebAssembly/Pyodide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;shot-scraper&lt;/strong&gt; (March 2022) – automated website screenshots via Playwright&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;s3-credentials&lt;/strong&gt; (November 2021) – CLI for creating scoped AWS S3 credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;django-sql-dashboard&lt;/strong&gt; (2021) – SQL reporting dashboards for Django&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dogsheep suite&lt;/strong&gt; (2019) – personal analytics tools: twitter-to-sqlite, github-to-sqlite, healthkit-to-sqlite, dogsheep-beta, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqlite-utils&lt;/strong&gt; (2018) – CLI and Python library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datasette&lt;/strong&gt; (November 2017) – his flagship project; tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;csvs-to-sqlite&lt;/strong&gt; (2017) – convert CSV files to SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Various early tools&lt;/strong&gt; (~2007–2010) – soupselect, json-head, geocoders, and others&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Django&lt;/strong&gt; (developed 2003–2005, open-sourced July 2005) – co-created with Adrian Holovaty at the Lawrence Journal-World&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Here's &lt;a href="https://gist.github.com/simonw/8d05ce836d44bd8543055c9614f7c478"&gt;GPT-5.5 for good measure&lt;/a&gt;. It listed even more projects than Fable did!)&lt;/p&gt;
&lt;p&gt;In the past I've stated that I don't care about how much models &lt;em&gt;know&lt;/em&gt; - I want them to be able to manipulate text and code in useful ways and actively look up the information they need via search tools, not bake it into their weights.&lt;/p&gt;
&lt;p&gt;But knowledge like this is a reasonably good proxy for model size - you can cram a whole lot more details about the world into a larger number of parameters.&lt;/p&gt;
&lt;p&gt;Does &lt;em&gt;knowing more stuff&lt;/em&gt; mean the model is better at the tasks we pose to it? I can certainly imagine how a coding model with deeper knowledge of modern libraries and patterns could crunch through coding tasks more effectively.&lt;/p&gt;
&lt;p&gt;Is Fable really bigger than Opus? Anthropic haven't said anything about model size, so all we have are tea-leaves, but the speed, pricing and my own poking at its knowledge make me think that it's a large model. Maybe the largest yet from any vendor.&lt;/p&gt;
&lt;h4 id="using-fable-in-claude-ai"&gt;Using Fable in Claude.ai&lt;/h4&gt;
&lt;p&gt;Anthropic made Fable 5 available across all of their surfaces - the &lt;a href="https://claude.ai/"&gt;Claude.ai&lt;/a&gt; chat interface, Claude Code for web, Claude Code CLI and Claude Cowork as well. The model is available "until June 22nd" on the subscription plans (I'm on $100/month Max at the moment), after which it will be billed extra.&lt;/p&gt;
&lt;p&gt;Claude.ai is often under-estimated. Since &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/"&gt;September 2025&lt;/a&gt; every chat has had access to a full container environment to run code, including the ability to install additional packages and even clone repositories directly from GitHub.&lt;/p&gt;
&lt;p&gt;Last week &lt;a href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/"&gt;I released micropython-wasm&lt;/a&gt;, a Python library that uses &lt;a href="https://wasmtime.dev/"&gt;wasmtime&lt;/a&gt; to run a custom build of MicroPython in WebAssembly to act as a sandbox for untrusted Python code.&lt;/p&gt;
&lt;p&gt;I decided to see if Fable could upgrade that to running full Python instead. I started with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Clone simonw/micropython-wasm from GitHub and research how this could use a full Python as opposed to MicroPython&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Fable identified that it could use Brett Cannon's &lt;a href="https://github.com/brettcannon/cpython-wasi-build/releases"&gt;cpython-wasi-build&lt;/a&gt; builds for this, but was unable to download them itself due to environment restrictions.&lt;/p&gt;
&lt;p&gt;So I grabbed the two zip files from that page and uploaded them to Claude:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Here's the Brett Cannon builds&lt;/code&gt; (&lt;code&gt;python-3.zip&lt;/code&gt;, &lt;code&gt;_build-python-3.zip&lt;/code&gt; as attachments)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And that was that. It churned away for a few minutes and got the entire thing working.&lt;/p&gt;
&lt;p&gt;Part of the response included:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I tried the cleaner single-zip-stdlib approach to shrink the filesystem surface, but CPython's &lt;code&gt;getpath&lt;/code&gt; bootstrap fails to find &lt;code&gt;encodings&lt;/code&gt; from inside a zip without more prefix finessing — the directory-preopen approach works reliably, so that's what the PoC uses. The zip path is solvable but needs &lt;code&gt;_PYTHONHOME&lt;/code&gt;/frozen-getpath work.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Try a bit more at the single-zip-stdlib problem&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then a little later:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;I want a wheel that has the whole system in it, the Python wrappers and the WASM files and the stdlibrary, so I can do uv run --with path-to-whl python -c "demo code"&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and it gave me &lt;a href="https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl"&gt;this 13.9MB cpython_wasm-0.1.0-py3-none-any.whl&lt;/a&gt; file. You can try running Python code in a sandbox using that wheel URL and &lt;code&gt;uv&lt;/code&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --with https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl \
  cpython-wasm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;print(45 ** 56)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35"&gt;the full chat transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This was a &lt;em&gt;very&lt;/em&gt; strong start.&lt;/p&gt;
&lt;h4 id="adding-features-to-datasette-agent-and-llm-using-claude-code"&gt;Adding features to Datasette Agent and LLM using Claude Code&lt;/h4&gt;
&lt;p&gt;Before I'd realized it was Fable day, my stretch goal for today was to add a new feature to &lt;a href="https://agent.datasette.io/"&gt;Datasette Agent&lt;/a&gt;: I wanted tool calls within that agent software to gain the ability to pause mid-execution and request approval directly from the user.&lt;/p&gt;
&lt;p&gt;This felt like a suitably meaty task to throw at the new model.&lt;/p&gt;
&lt;p&gt;Over the course of the day Fable not only &lt;a href="https://github.com/datasette/datasette-agent/pull/20"&gt;solved that problem&lt;/a&gt;, it also identified and then implemented four issues in my underlying LLM library that would help support this kind of advanced pause-resume mechanism in tool calls.&lt;/p&gt;
&lt;p&gt;It got everything working first using somewhat gnarly hacks, but the moment I told it that changes to LLM itself were in scope it set to work unraveling the hacks and turning them into supported features of LLM instead.&lt;/p&gt;
&lt;p&gt;My stretch goal turned into &lt;a href="https://llm.datasette.io/en/latest/changelog.html#a3-2026-06-09"&gt;LLM 0.32a3&lt;/a&gt;, almost entirely written by Fable. Here are the release notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Driven by the needs of &lt;a href="https://github.com/datasette/datasette-agent"&gt;Datasette Agent&lt;/a&gt;'s human-in-the-loop &lt;code&gt;ask_user()&lt;/code&gt; feature, made the following improvements to how tool calls work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tool implementations can declare a parameter named &lt;code&gt;llm_tool_call&lt;/code&gt; in order to be passed the &lt;code&gt;llm.ToolCall&lt;/code&gt; object for the current invocation. This allows them to access the current &lt;code&gt;llm_tool_call.tool_call_id&lt;/code&gt;. See &lt;a href="https://llm.datasette.io/en/latest/python-api.html#python-api-tools-llm-tool-call"&gt;Accessing the tool call from inside a tool&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1480"&gt;#1480&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Every tool call is now guaranteed a unique &lt;code&gt;tool_call_id&lt;/code&gt; - providers that do not supply one get a synthesized &lt;code&gt;tc_&lt;/code&gt;-prefixed ULID. &lt;a href="https://github.com/simonw/llm/pull/1481"&gt;#1481&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Tools can raise a &lt;code&gt;llm.PauseChain&lt;/code&gt; exception to cleanly pause the tool chain, useful for things like waiting for human approval. The exception propagates to the caller with &lt;code&gt;.tool_call&lt;/code&gt; and &lt;code&gt;.tool_results&lt;/code&gt; (completed sibling results) attached, and no model call is made with a placeholder result. See &lt;a href="https://llm.datasette.io/en/latest/python-api.html#python-api-tools-pause"&gt;Pausing a chain from inside a tool&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1482"&gt;#1482&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Failure semantics for concurrent tool execution: async sibling tool calls always run to completion before a pause or hook exception propagates. &lt;a href="https://github.com/simonw/llm/pull/1482"&gt;#1482&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chains can now resume from a &lt;code&gt;messages=&lt;/code&gt; history ending in unresolved tool calls: the calls are executed through the normal &lt;code&gt;before_call&lt;/code&gt;/&lt;code&gt;after_call&lt;/code&gt; machinery before the first model call, skipping any that already have results. The &lt;code&gt;execute_tool_calls()&lt;/code&gt; method also accepts a new optional &lt;code&gt;tool_calls_list=&lt;/code&gt; argument for executing an explicit list of &lt;code&gt;ToolCall&lt;/code&gt; objects in place of the calls requested by the response. See &lt;a href="https://llm.datasette.io/en/latest/python-api.html#python-api-tools-resume"&gt;Resuming a chain with pending tool calls&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1482"&gt;#1482&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fixed a bug where the async tool executor silently dropped calls to tools not present in &lt;code&gt;tools=&lt;/code&gt; - these now return &lt;code&gt;Error: tool "..." does not exist&lt;/code&gt; results, matching the sync executor. &lt;a href="https://github.com/simonw/llm/pull/1483"&gt;#1483&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm really impressed with the quality of API design, tests, code and documentation that Fable put together for this. I spent several hours on it today, but it feels like several days' worth of work.&lt;/p&gt;
&lt;h4 id="how-much-i-ve-spent"&gt;How much I've spent&lt;/h4&gt;
&lt;p&gt;I recently started using &lt;a href="https://agentsview.io"&gt;AgentsView&lt;/a&gt; to help track my local LLM usage across all of the different coding agents. I published a &lt;a href="https://til.simonwillison.net/llms/agentsview-custom-model-price"&gt;TIL today&lt;/a&gt; about adding custom Fable pricing to that tool, which I expect will not be necessary in the very near future.&lt;/p&gt;
&lt;p&gt;After setting the price, I ran this command to start a localhost web server to explore my usage:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx agentsview serve
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the treemap showing the breakdown of my Fable usage across various projects today:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/agentsview-fable-full-day.jpg" alt="Screenshot of a cost tracking dashboard with two panels. The first panel is titled &amp;quot;Cost Attribution&amp;quot; with toggle buttons for Project / Model / Agent and Treemap / List, with Project and Treemap selected. Italic text reads &amp;quot;Click to hide from chart&amp;quot;. A treemap shows a large red block labeled prod_datasette_agent $99.26 89.9%, with smaller blocks to its right labeled cloud (blue), datasette (teal), llm (red), and money (pink), plus a tiny orange sliver. A legend lists: 1 prod_datasette_agent $99.26, 2 cloud $3.98, 3 datasette $2.81, 4 llm $2.30, 5 money $1.92, 6 simon $0.15. The second panel is titled &amp;quot;Top Sessions by Cost&amp;quot; and lists nine sessions, each with a &amp;quot;Claude&amp;quot; badge, a prompt excerpt, a project name with a session UUID (omitted here), a token count, and a cost: 1. Review ./datasette-agent and ./datasette-apps - we are going to add a new feature to agent but you ... prod_datasette_agent, 78.2M, $99.26. 2. issues.db is a copy of the Datasette issues database. There are a LOT of notes in there relating to... datasette, 826.8k, $2.81. 3. Consult fly-docs and then look at datasette.cloud (which launches fly machines) and datasettecloud-... cloud, 924.7k, $2.61. 4. simonwillisonblog.db is a copy of my blog, plus all my software releases and other interesting thin... money, 542.9k, $1.92. 5. Look in datasette.cloud and figure out all remaining steps and decisions that need to be made in or... cloud, 455k, $1.37. 6. Review PRs and issues filed against this repo within the last 4 weeks and see if any deserve to be ... llm, 323.3k, $0.95. 7. run mypy, llm, 320.9k, $0.76. 8. [Image #1] fix this in github actions, llm, 183.9k, $0.59. 9. simon, simon, 26.4k, $0.15." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I used $110.42 worth of tokens today, all as part of my $100/month subscription.&lt;/p&gt;
&lt;h4 id="and-some-pelicans"&gt;And some pelicans&lt;/h4&gt;
&lt;p&gt;I ran "Generate an SVG of a pelican riding a bicycle" against all five thinking effort levels with Fable.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://tools.simonwillison.net/markdown-svg-renderer#url=https%3A%2F%2Fgist.github.com%2Fsimonw%2F94fde31c34a0400c1d29f57e6a708e6b"&gt;the results&lt;/a&gt;, including the token cost for each one:&lt;/p&gt;

&lt;div style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 1em"&gt;
  &lt;figure style="margin: 0; flex: 1 1 30%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-low.jpg" alt="low" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;low: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=1929&amp;amp;sel=claude-fable-5"&gt;1,929 out, 9.67c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 30%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-medium.jpg" alt="medium" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;medium: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=2290&amp;amp;sel=claude-fable-5"&gt;2,290 out, 11.475c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 30%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-high.jpg" alt="high" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;high: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=2057&amp;amp;sel=claude-fable-5"&gt;2,057 out, 10.31c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 45%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-xhigh.jpg" alt="xhigh" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;xhigh: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=5992&amp;amp;sel=claude-fable-5"&gt;5,992 out, 29.985c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 45%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-max.jpg" alt="max" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;max: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=14430&amp;amp;sel=claude-fable-5"&gt;14,430 out, 72.175c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
&lt;/div&gt;

&lt;p&gt;It's interesting that high ended up using fewer tokens than medium for this particular run.&lt;/p&gt;

&lt;p&gt;Here are the &lt;a href="https://simonwillison.net/2026/May/28/claude-opus-4-8/#and-some-pelicans"&gt;Opus 4.8 pelicans&lt;/a&gt; for comparison.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-mythos"&gt;claude-mythos&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="claude-mythos"/></entry><entry><title>Setting a custom price for a model in AgentsView</title><link href="https://simonwillison.net/2026/Jun/9/agentsview-custom-model-price/#atom-tag" rel="alternate"/><published>2026-06-09T21:35:31+00:00</published><updated>2026-06-09T21:35:31+00:00</updated><id>https://simonwillison.net/2026/Jun/9/agentsview-custom-model-price/#atom-tag</id><summary type="html">
    
        &lt;p&gt;&lt;strong&gt;TIL:&lt;/strong&gt; &lt;a href="https://til.simonwillison.net/llms/agentsview-custom-model-price"&gt;Setting a custom price for a model in AgentsView&lt;/a&gt;&lt;/p&gt;
        &lt;p&gt;I've been really enjoying &lt;a href="https://agentsview.io/"&gt;AgentsView&lt;/a&gt; by Wes McKinney as a tool for exploring my token usage across different coding agents running on my laptop.&lt;/p&gt;
&lt;p&gt;Claude Fable 5 came out today and wasn't yet included in the pricing database AgentsView uses. I used Fable to reverse-engineer AgentsView and figured out this recipe for setting custom prices.&lt;/p&gt;
&lt;p&gt;Here's my Claude Fable 5 usage for today so far, plotted by AgentsView as a treemap across my different local projects:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a cost analytics dashboard. Cost Attribution - Click to hide from chart - toggle buttons for Project / Model / Agent and Treemap / List. A treemap shows a large red block: prod_datasette_agent $74.06 89.3%, then blue: cloud $3.98 4.8%, teal: datasette $2.81 3.4%, pink: money $1.92 2.3%, and a thin orange sliver. A legend lists 1 prod_datasette_agent $74.06, 2 cloud $3.98, 3 datasette $2.81, 4 money $1.92, 5 simon $0.15. Below left, Top Sessions by Cost: 1 Claude - Review ./datasette-agent and ./datasette-apps - we are going to a... - prod_datasette_agent · 08a1f374-0e77-420f-be2d-af805d67e8aa - 55.9M $74.06; 2 Claude - issues.db is a copy of the Datasette issues database. There are a... - datasette · 8caa2d2d-b91f-43b3-bf3a-4268995b6011 - 826.8k $2.81; 3 Claude - Consult fly-docs and then look at datasette.cloud (which launche... - cloud · bfcacc70-09d7-4b27-aaec-4bb8accd9fec - 924.7k $2.61; 4 Claude - simonwillisonblog.db is a copy of my blog, plus all my software re... - money · 0c0fb9dc-6347-4e1b-9307-3709a7cdf0c8 - 542.9k $1.92; 5 Claude - Look in datasette.cloud and figure out all remaining steps and dec... - cloud · 45963b5f-608a-4caa-ad6b-6ae81e1dbf0d - 455k $1.37; 6 Claude - simon - simon · deeccb5d-9e90-4b1e-bfe6-c2b271e1b1d4 - 26.4k $0.15. Below right, Cache Efficiency with horizontal bars: Cache Reads 57.6M (nearly full green bar), Cache Writes 769.3K, Uncached Input 64.4K, Output 300.9K (all tiny bars), and a green highlighted note: $516.62 saved vs uncached." src="https://static.simonwillison.net/static/2026/agentsview-fable.jpg" /&gt;&lt;/p&gt;
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-mythos"&gt;claude-mythos&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm-pricing"/><category term="claude-mythos"/></entry><entry><title>Uber Caps Usage of AI Tools Like Claude Code to Manage Costs</title><link href="https://simonwillison.net/2026/Jun/3/uber-caps-usage/#atom-tag" rel="alternate"/><published>2026-06-03T12:01:27+00:00</published><updated>2026-06-03T12:01:27+00:00</updated><id>https://simonwillison.net/2026/Jun/3/uber-caps-usage/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.bloomberg.com/news/articles/2026-06-02/uber-caps-usage-of-ai-tools-like-claude-code-to-cut-costs"&gt;Uber Caps Usage of AI Tools Like Claude Code to Manage Costs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I wrote &lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#the-ai-failure-stories-around-this-are-pretty-thin"&gt;the other day&lt;/a&gt; about Uber blowing its 2026 AI budget in four months, and how that wasn't particularly surprising given they would have set that budget in 2025, before anyone could have predicted how popular token-burning coding agents were about to become.
Natalie Lung for Bloomberg:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The rideshare giant is limiting all employees to $1,500 in monthly token spending per AI coding tool, an Uber spokesperson said in response to a Bloomberg News inquiry. That means spending on one tool doesn’t have a bearing on the budget for another. The limits, which have been instituted in recent months, only apply to agentic coding software such as Cursor or Anthropic PBC’s Claude Code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A $1,500 monthly limit per tool strikes me as a rational policy response to over-spending, and &lt;em&gt;much&lt;/em&gt; more sensible than those &lt;a href="https://en.wikipedia.org/wiki/Token_maxxing"&gt;tokenmaxxing&lt;/a&gt; leaderboards encouraging employees to compete for as much AI usage as possible.&lt;/p&gt;
&lt;p&gt;It's also interesting in that it hints at a real dollar value for what Uber is getting out of these tools. If we assume two actively used tools per engineer that's $3,000 * 12 = $36,000 cap per engineer per year. Levels.fyi lists &lt;a href="https://www.levels.fyi/companies/uber/salaries/software-engineer?country=254"&gt;the median yearly compensation package for Uber software engineers in the USA&lt;/a&gt; at $330,000.&lt;/p&gt;
&lt;p&gt;That means each employee's AI spending cap is ~11% of that median compensation package.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#enterprise-customers-are-now-paying-api-prices"&gt;noted&lt;/a&gt; that my own token usage comes to about $1,000/month against each of Anthropic and OpenAI - which currently costs me just $100 per provider thanks to their generous subsidized plans for individual subscribers. Those plans are no longer available to larger companies like Uber.&lt;/p&gt;
&lt;p&gt;Their new policy means if I were working at Uber I'd still have ~$500/month of tokens to spare for each of those tools, given my current usage patterns.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uber"&gt;uber&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm-pricing"/><category term="coding-agents"/><category term="uber"/></entry><entry><title>I think Anthropic and OpenAI have found product-market fit</title><link href="https://simonwillison.net/2026/May/27/product-market-fit/#atom-tag" rel="alternate"/><published>2026-05-27T16:38:35+00:00</published><updated>2026-05-27T16:38:35+00:00</updated><id>https://simonwillison.net/2026/May/27/product-market-fit/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic are &lt;a href="https://techcrunch.com/2026/05/20/anthropic-says-its-about-to-have-its-first-profitable-quarter/"&gt;strongly rumored&lt;/a&gt; to be about to have their first profitable quarter. Stories &lt;a href="https://www.theinformation.com/newsletters/applied-ai/uber-cto-shows-claude-code-can-blow-ai-budgets"&gt;are circulating&lt;/a&gt; of companies surprised at how expensive their LLM bills are becoming from usage by their staff. I think this is because OpenAI and Anthropic have both found product-market fit.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#enterprise-customers-are-now-paying-api-prices"&gt;Enterprise customers are now paying API prices&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#i-think-they-ve-found-product-market-fit"&gt;I think they've found product-market fit&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#and-they-re-ramping-up"&gt;And they're ramping up&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#the-ai-failure-stories-around-this-are-pretty-thin"&gt;The AI-failure stories around this are pretty thin&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#we-also-know-the-labs-are-spending-a-lot"&gt;We also know the labs are spending a lot&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#api-revenue-is-becoming-less-important"&gt;API revenue is becoming less important&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#april-is-a-new-inflection-point"&gt;April is a new inflection point&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="enterprise-customers-are-now-paying-api-prices"&gt;Enterprise customers are now paying API prices&lt;/h4&gt;
&lt;p&gt;I currently subscribe to the $100/month Max plan from Anthropic and the $100/month Pro plan from OpenAI. If you are a heavy user of coding agents these plans are a fantastic deal. I just ran the &lt;a href="https://github.com/ryoppippi/ccusage"&gt;ccusage&lt;/a&gt; tool on my laptop to get an estimate of how much I would have spent if I were to pay for API tokens in the past 30 days and got:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$1,199.79 for Anthropic Claude Code&lt;/li&gt;
&lt;li&gt;$980.37 for OpenAI Codex&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That's $2,180.16 worth of tokens for $200 - not bad at all! I'm a moderately heavy user of these tools, but I'm certainly not running agents every hour of the day and night.&lt;/p&gt;
&lt;p&gt;I had assumed that companies making extensive use of agents were getting similar discounts. It turns out I &lt;em&gt;could not have been more wrong&lt;/em&gt; about that.&lt;/p&gt;
&lt;p&gt;I haven't been able to track down the exact date, but at some point in the last six months Anthropic switched their Enterprise plan (originally &lt;a href="https://www.anthropic.com/news/claude-code-on-team-and-enterprise"&gt;"Claude seats include enough usage for a typical workday" back in August 2025&lt;/a&gt;) to $20/seat/month plus API pricing for usage. This story about the change &lt;a href="https://www.theinformation.com/articles/anthropic-changes-pricing-bill-firms-based-ai-use-amid-compute-crunch"&gt;from The Information&lt;/a&gt; is dated Apr 14, 2026, but cites an Anthropic spokesperson claiming that the pricing change occurred in November 2025. Existing customers are finding out about the change as they renew their contracts.&lt;/p&gt;
&lt;p&gt;OpenAI made a similar pricing change in April. The &lt;a href="https://help.openai.com/en/articles/20001106-codex-rate-card"&gt;Codex rate card&lt;/a&gt; (&lt;a href="https://web.archive.org/web/20260519062438/https://help.openai.com/en/articles/20001106-codex-rate-card"&gt;Internet Archive copy&lt;/a&gt;) currently says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: On April 2, 2026, we updated Codex pricing to align with API token usage, instead of per-message pricing. This change was applicable to new and existing Plus, Pro, ChatGPT Business and new ChatGPT Enterprise plans.&lt;/p&gt;
&lt;p&gt;On April 23, 2026, we made this update for all existing ChatGPT Enterprise plans as well, inclusive of Edu, Health, Gov, and ChatGPT for Teachers.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's a little harder to decode as they quote prices in "credits", but as far as I can tell those credit costs are an exact match for the API token costs listed for those models.&lt;/p&gt;
&lt;p&gt;All of which is to say that as of April 2026 the "Enterprise" cost for both OpenAI Codex and Anthropic Claude Code/Cowork is the same as the listed API price.&lt;/p&gt;
&lt;p&gt;GPT-5.5 (released April 23rd) is 2x the API price of GPT-5.4. Opus 4.7 (April 16th) is &lt;a href="https://simonwillison.net/2026/Apr/20/claude-token-counts/"&gt;around 1.4x&lt;/a&gt; the price of Opus 4.6 when you take their new tokenizer into account.&lt;/p&gt;
&lt;p&gt;So April saw both leading model companies release new frontier models with a higher API price, &lt;em&gt;and&lt;/em&gt; both companies now have measures to lock their enterprise customers (who tend to sign year-long deals) at those API prices, not the previous extreme discounts.&lt;/p&gt;
&lt;h4 id="i-think-they-ve-found-product-market-fit"&gt;I think they've found product-market fit&lt;/h4&gt;
&lt;p&gt;Why these sudden aggressive moves on pricing? Both Anthropic and OpenAI are planning to IPO, but I suspect there's a more important factor here: I think they've finally found product-market fit, with the coding/general-purpose agent products embodied by Claude Code/Cowork and Codex.&lt;/p&gt;
&lt;p&gt;Tools like ChatGPT are wildly popular, but that wild popularity has been difficult to turn into revenue. In February &lt;a href="https://finance.yahoo.com/news/chatgpt-almost-1-billion-weekly-212157499.html"&gt;OpenAI boasted&lt;/a&gt; more than 900 million weekly active users for ChatGPT, but only 50 million - 5.6% of that - were paying consumer subscribers.&lt;/p&gt;
&lt;p&gt;Charging $10-$20/month per user is an OK business, but you'd need 1-2 billion subscribers sticking around for four years to cover &lt;a href="https://openai.com/global-affairs/seizing-the-ai-opportunity/"&gt;$1 trillion in infrastructure&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Companies spending $200+/month/user will get you there a whole lot faster - and as noted above, as a power-user I'm at ~$1,000/month in API costs per vendor already.&lt;/p&gt;
&lt;p&gt;Coding agents really did change everything. These are tools which burn &lt;em&gt;vastly&lt;/em&gt; more tokens, but are also quickly becoming daily drivers for the work carried out by extremely well-compensated professionals. Right now that's still mostly software engineers, but a coding agent is a tool that can automate anything you can do by typing commands into a computer... so they are clearly applicable to a much wider set of skilled knowledge workers.&lt;/p&gt;
&lt;p&gt;As I've &lt;a href="https://simonwillison.net/tags/november-2025-inflection/"&gt;discussed on this site at length&lt;/a&gt;, the models released in November 2025 elevated agents to being genuinely useful. We've had six months to get used to that idea now - it's no wonder companies are beginning to spend real money on this technology.&lt;/p&gt;
&lt;p&gt;You could argue that ChatGPT achieved product-market fit when it became the &lt;a href="https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/"&gt;fastest-growing consumer app in history&lt;/a&gt; back in February 2023... but it certainly wasn't making any actual money back then. Coding agents plus enterprise pricing marks the point when these companies start making &lt;em&gt;very&lt;/em&gt; real revenue. Maybe even enough to start covering their costs!&lt;/p&gt;
&lt;h4 id="and-they-re-ramping-up"&gt;And they're ramping up&lt;/h4&gt;
&lt;p&gt;As further evidence that enterprise agents represent product-market fit for these companies, consider their open job listings.&lt;/p&gt;
&lt;p&gt;OpenAI have &lt;a href="https://openai.com/careers/search/"&gt;703 open jobs&lt;/a&gt; right now, of which I'd categorize 229 (32.6%) as relating to enterprise sales and support - account executives, "Go To Market", "Forward Deployed Engineers" and the like.&lt;/p&gt;
&lt;p&gt;Anthropic have &lt;a href="https://www.anthropic.com/careers/jobs"&gt;390 open jobs&lt;/a&gt;, 105 (26.9%) of which look enterprisey to me.&lt;/p&gt;
&lt;p&gt;It's pleasingly ironic that these AI labs have picked a business model with such a heavy demand on human labor - enterprise sales contracts don't close themselves without a whole lot of humans in the mix!&lt;/p&gt;
&lt;p&gt;&lt;small&gt;(I ran this analysis by scraping their job sites with Claude Code, then having it use Datasette's &lt;a href="https://docs.datasette.io/en/latest/json_api.html"&gt;JSON API&lt;/a&gt; to pipe that data into Datasette Cloud where I used &lt;a href="https://agent.datasette.io/"&gt;Datasette Agent&lt;/a&gt; for the analysis, &lt;a href="https://gist.github.com/simonw/5632d208d76b3c8b34f1fdbaf69eb1b8#agent-4"&gt;exported here&lt;/a&gt;. Dogfood!)&lt;/small&gt;&lt;/p&gt;
&lt;h4 id="the-ai-failure-stories-around-this-are-pretty-thin"&gt;The AI-failure stories around this are pretty thin&lt;/h4&gt;
&lt;p&gt;I started digging into this in response to &lt;a href="https://news.ycombinator.com/item?id=48287025#48287219"&gt;a growing volume&lt;/a&gt; of stories claiming that large companies were sounding the alarm because their AI usage costs had grown so large.&lt;/p&gt;
&lt;p&gt;The most widely cited of these stories appear quite overblown to me.&lt;/p&gt;
&lt;p&gt;The most discussed has been Uber, based on &lt;a href="https://www.theinformation.com/newsletters/applied-ai/uber-cto-shows-claude-code-can-blow-ai-budgets"&gt;this report&lt;/a&gt; where CTO Praveen Neppalli Naga indicated that Uber had "maxed out its full year AI budget just a few months into 2026", mostly thanks to Claude Code.&lt;/p&gt;
&lt;p&gt;Given that Claude Code only got &lt;em&gt;really&lt;/em&gt; good in November it's entirely unsurprising to me that a budget set in 2025 may have failed to predict demand for that tool in 2026!&lt;/p&gt;
&lt;p&gt;That Uber story was further fueled by comments made by Uber's COO, Andrew Macdonald, on the Rapid Response podcast. I tracked down &lt;a href="https://www.youtube.com/watch?v=y_mQ6xLcKyc&amp;amp;t=1616s"&gt;the segment&lt;/a&gt; and there really isn't much there. Here's what Andrew said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But then you sometimes go and talk to your senior engineering leaders and you're saying, OK, how many projects that were on the cutting room floor got moved above the line because of the productivity gains because 25% of our code commits were via Claude Code last quarter?&lt;/p&gt;
&lt;p&gt;That link is not there yet, right? I think maybe implicitly there's more that is getting shipped. But it's very hard to draw a line between one of those stats and, OK, now we're actually producing like 25% more useful consumer features, right? And that line is hard to draw.&lt;/p&gt;
&lt;p&gt;[...] And so if you're not actually able to draw a direct line to how much useful features and functionality you're shipping to your users, that trade becomes harder to justify.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Somehow this fragment turned into headlines like &lt;a href="https://www.businessinsider.com/uber-coo-andrew-macdonald-ai-token-spending-harder-justify-2026-5"&gt;Uber's COO says it's getting harder to justify the money spent on AI tokenmaxxing&lt;/a&gt;, because the market for stories about AI failures remains enormous.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 29th May 2026&lt;/strong&gt;: I edited the above quote to add that last paragraph ending in "becomes harder to justify" on &lt;a href="https://x.com/MadisonMills22/status/2060343512936186240"&gt;the suggestion of Madison Mills&lt;/a&gt; - previously my quoted section stopped at "hard to draw". Here's the &lt;a href="https://gist.github.com/simonw/59096a338c82f6f95e40e3d7c7b5bad9"&gt;full unedited transcript&lt;/a&gt; from MacWhisper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The other popular story around this is &lt;a href="https://www.theverge.com/tech/930447/microsoft-claude-code-discontinued-notepad"&gt;Microsoft starts canceling Claude Code licenses&lt;/a&gt;, ostensibly to encourage their engineers to dogfood their own Copilot CLI agent instead - but The Verge reporter Tom Warren says "sources tell me the decision is also a financial one", triggered by the June 30th end of Microsoft's financial year.&lt;/p&gt;
&lt;p&gt;I think both of these stories support my "product-market fit" hypothesis. The best advice I ever heard on pricing a product was that your customer should &lt;em&gt;suck air through their teeth&lt;/em&gt; and then say yes. Uber's budget overrun and Microsoft's seat cancellations look like that effect playing out in practice.&lt;/p&gt;
&lt;h4 id="we-also-know-the-labs-are-spending-a-lot"&gt;We also know the labs are spending a lot&lt;/h4&gt;
&lt;p&gt;The big AI labs spend billions of dollars on both training and inference. Credible figures are hard to come by, but we did get one huge hint as to the figures involved from, oddly enough, the recent &lt;a href="https://www.sec.gov/Archives/edgar/data/1181412/000162828026036936/spaceexplorationtechnologi.htm"&gt;SpaceX S-1&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] in May 2026, we entered into &lt;strong&gt;Cloud Services Agreements with Anthropic PBC&lt;/strong&gt; (“Anthropic”), an AI research and development public benefit corporation, with respect to access to &lt;strong&gt;compute capacity across COLOSSUS and COLOSSUS II&lt;/strong&gt;. Pursuant to these agreements, the customer &lt;strong&gt;has agreed to pay us $1.25 billion per month&lt;/strong&gt; through May 2029 [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href="https://www.anthropic.com/news/higher-limits-spacex"&gt;Anthropic announcement&lt;/a&gt; said that this deal meant they could "increase our usage limits for Claude Code and the Claude API", heavily implying that Colossus is being used for inference, not model training.&lt;/p&gt;
&lt;p&gt;Anthropic already have vast amounts of compute from other providers. The fact that they're willing to spend $1.25 billion per month for extra capacity from just &lt;em&gt;one&lt;/em&gt; of their vendors hints at how big these inference budgets have become.&lt;/p&gt;
&lt;h4 id="api-revenue-is-becoming-less-important"&gt;API revenue is becoming less important&lt;/h4&gt;
&lt;p&gt;Over the past two years my impression has been that OpenAI made more of their income from subscription revenue while Anthropic made more from their API.&lt;/p&gt;
&lt;p&gt;Anthropic's API revenue was historically quite dependent on a small number of large API customers - &lt;a href="https://venturebeat.com/ai/anthropic-revenue-tied-to-two-customers-as-ai-pricing-war-threatens-margins"&gt;this VentureBeat story from August 2025&lt;/a&gt; quotes "sources familiar with the matter" suggesting that just Cursor and GitHub Copilot were responsible for $1.2 billion of the company's then-$4 billion revenue.&lt;/p&gt;
&lt;p&gt;Today Anthropic are rumored to hit &lt;a href="https://www.wsj.com/tech/ai/mind-blowing-growth-is-about-to-propel-anthropic-into-its-first-profitable-quarter-7edbf2f4"&gt;$10.9 billion in the second quarter&lt;/a&gt;, potentially even operating at a profit for the first time.&lt;/p&gt;
&lt;p&gt;This pivot-to-Enterprise suggests that the labs have realized that the real money lies in cutting out the middlemen. Anthropic's Claude Code directly competes with Cursor and Copilot. No wonder Cursor are &lt;a href="https://cursor.com/blog/composer-2"&gt;investing in their own models&lt;/a&gt;!&lt;/p&gt;
&lt;h4 id="april-is-a-new-inflection-point"&gt;April is a new inflection point&lt;/h4&gt;
&lt;p&gt;I've called November 2025 the &lt;a href="https://simonwillison.net/tags/november-2025-inflection/"&gt;November inflection point&lt;/a&gt; because that was when GPT-5.1 and Opus 4.5, combined with their respective coding agent harnesses, got &lt;em&gt;good&lt;/em&gt; - good enough that we've spent the last six months adapting to agent systems that can reliably get useful work done.&lt;/p&gt;
&lt;p&gt;I think April 2026 is a new inflection point where the revenue implications of this have started to land, to the benefit of the frontier AI labs and with material impacts on the budgets of large companies.&lt;/p&gt;
&lt;p&gt;We'll know for sure how real this moment is when the S-1 documents for the upcoming Anthropic and OpenAI IPOs give us some real, audited numbers to get our teeth into.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex"&gt;codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-cowork"&gt;claude-cowork&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-agent"&gt;datasette-agent&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uber"&gt;uber&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="datasette"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="llm-pricing"/><category term="coding-agents"/><category term="claude-code"/><category term="codex"/><category term="claude-cowork"/><category term="november-2025-inflection"/><category term="datasette-agent"/><category term="uber"/></entry><entry><title>Gemini 3.5 Flash: more expensive, but Google plan to use it for everything</title><link href="https://simonwillison.net/2026/May/19/gemini-35-flash/#atom-tag" rel="alternate"/><published>2026-05-19T22:40:25+00:00</published><updated>2026-05-19T22:40:25+00:00</updated><id>https://simonwillison.net/2026/May/19/gemini-35-flash/#atom-tag</id><summary type="html">
    &lt;p&gt;Today at Google I/O, Google &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/"&gt;released Gemini 3.5 Flash&lt;/a&gt;. This one skipped the &lt;code&gt;-preview&lt;/code&gt; modifier and went straight to general availability, and Google appear to be using it for a whole lot of their key products:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;3.5 Flash is available today to billions of people globally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For everyone via the Gemini app and AI Mode in &lt;a href="https://blog.google/products-and-platforms/products/search/search-io-2026"&gt;Google Search&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For developers in our agent-first development platform Google Antigravity and Gemini API in Google AI Studio and Android Studio&lt;/li&gt;
&lt;li&gt;For enterprises in Gemini Enterprise Agent Platform and Gemini Enterprise.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;As usual with Gemini, the most interesting details are tucked away in the &lt;a href="https://ai.google.dev/gemini-api/docs/whats-new-gemini-3.5"&gt;What's new in Gemini 3.5 Flash&lt;/a&gt; developer documentation. It mostly has the same set of platform features as the previous Gemini 3.x series, albeit with no &lt;a href="https://ai.google.dev/gemini-api/docs/computer-use"&gt;computer use&lt;/a&gt;. The model ID is &lt;code&gt;gemini-3.5-flash&lt;/code&gt;. The knowledge cut-off is January 2025, and it supports 1,048,576 input tokens and 65,536 maximum output tokens.&lt;/p&gt;
&lt;p&gt;Google are also pushing a new &lt;a href="https://ai.google.dev/gemini-api/docs/interactions"&gt;Interactions API&lt;/a&gt;, currently in beta, which looks to me like their version of the patterns introduced by &lt;a href="https://developers.openai.com/api/reference/responses/overview"&gt;OpenAI Responses&lt;/a&gt; - in particular server-side history management.&lt;/p&gt;
&lt;h4 id="the-price-has-gone-up"&gt;The price has gone up&lt;/h4&gt;
&lt;p&gt;Gemini 3.5 Flash is accompanied by a notable price bump. The previous models in the "Flash" family were &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3-flash-preview"&gt;Gemini 3 Flash Preview&lt;/a&gt; and &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt;. The new 3.5 Flash is 3x the price of 3 Flash Preview and 6x the price of 3.1 Flash-Lite (see &lt;a href="https://www.llm-prices.com/#sel=gemini-3-flash-preview%2Cgemini-3.5-flash%2Cgemini-3.1-flash-lite-preview"&gt;price comparison here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;At $1.50/million input and $9/million output it's getting close in price to Google's Gemini 3.1 Pro, which is $2 and $12.&lt;/p&gt;
&lt;p&gt;The Gemini team promise that 3.5 Pro will roll out "next month" - presumably at an even higher price.&lt;/p&gt;
&lt;p&gt;This fits a trend: OpenAI's GPT-5.5 was 2x the price of GPT-5.4, and Claude Opus 4.7 is around 1.46x the price of 4.6 when you take the &lt;a href="https://simonwillison.net/2026/Apr/20/claude-token-counts/"&gt;new tokenizer into account&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Given the price increase it's interesting to see Google roll it out for so many of their own free-to-consumer products. It feels like all three of the major AI labs are starting to probe the price tolerance of their API customers.&lt;/p&gt;
&lt;p&gt;Artificial Analysis publish the cost to run their proprietary benchmark against models, which is a useful way to take things like tokenization and increased volume of reasoning tokens into account. Some numbers worth comparing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-5-flash"&gt;Gemini 3.5 Flash (high)&lt;/a&gt;: $1,551.60&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-1-pro-preview"&gt;Gemini 3.1 Pro Preview&lt;/a&gt;: $892.28&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-flash-reasoning"&gt;Gemini 3 Flash Preview (Reasoning)&lt;/a&gt;: $278.26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-1-flash-lite-preview"&gt;Gemini 3.1 Flash-Lite Preview&lt;/a&gt;: $93.60&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Running the benchmark for 3.5 Flash (high) cost significantly more than 3.1 Pro Preview!&lt;/p&gt;
&lt;p&gt;Here are some numbers from other vendors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/claude-opus-4-7"&gt;Claude Opus 4.7 (Adaptive Reasoning, Max Effort)&lt;/a&gt;: $5,117.14&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/claude-opus-4-7-non-reasoning"&gt;Claude Opus 4.7 (Non-reasoning, High Effort)&lt;/a&gt;: $1,217.23&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gpt-5-5"&gt;GPT-5.5 (xhigh)&lt;/a&gt;: $3,357.00&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gpt-5-5-medium"&gt;GPT-5.5 (medium)&lt;/a&gt;: $1,199.14&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="a-pelican-on-a-bicycle"&gt;A pelican on a bicycle&lt;/h4&gt;
&lt;p&gt;I ran "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://gist.github.com/simonw/09cc5a5545d7e75b33b75ffa92a34601"&gt;against the Gemini API&lt;/a&gt; and got back this pelican, which is a &lt;em&gt;lot&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gemini-3.5-flash.png" alt="Black background, bats in the sky against a stylized moon. Pelican is funky looking. Very good beak. Bicycle frame is a bit twisted, and the bar from pedals to back wheel is missing. Bike lamp illuminates the road in front. Quite stylish." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;From the code comments: &lt;code&gt;&amp;lt;!-- Pelican Eye / Sunglasses (Cool Retro Aviators) --&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=48196570#48198275"&gt;hedgehog on Hacker News&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That pelican looks like it's in Miami for a crypto conference.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That one cost me 11 input tokens and 14,403 output tokens, for a total cost of &lt;a href="https://www.llm-prices.com/#it=11&amp;amp;ot=14403&amp;amp;sel=gemini-3.5-flash"&gt;just under 13 cents&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>DeepSeek V4 - almost on the frontier, a fraction of the price</title><link href="https://simonwillison.net/2026/Apr/24/deepseek-v4/#atom-tag" rel="alternate"/><published>2026-04-24T06:01:04+00:00</published><updated>2026-04-24T06:01:04+00:00</updated><id>https://simonwillison.net/2026/Apr/24/deepseek-v4/#atom-tag</id><summary type="html">
    &lt;p&gt;Chinese AI lab DeepSeek's last model release was V3.2 (and V3.2 Speciale) &lt;a href="https://simonwillison.net/2025/Dec/1/deepseek-v32/"&gt;last December&lt;/a&gt;. They just dropped the first of their hotly anticipated V4 series in the shape of two preview models, &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro"&gt;DeepSeek-V4-Pro&lt;/a&gt; and &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash"&gt;DeepSeek-V4-Flash&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Both models are 1 million token context Mixture of Experts. Pro is 1.6T total parameters, 49B active. Flash is 284B total, 13B active. They're using the standard MIT license.&lt;/p&gt;
&lt;p&gt;I think this makes DeepSeek-V4-Pro the new largest open weights model. It's larger than Kimi K2.6 (1.1T) and GLM-5.1 (754B) and more than twice the size of DeepSeek V3.2 (685B).&lt;/p&gt;
&lt;p&gt;Pro is 865GB on Hugging Face, Flash is 160GB. I'm hoping that a lightly quantized Flash will run on my 128GB M5 MacBook Pro. It's &lt;em&gt;possible&lt;/em&gt; the Pro model may run on it if I can stream just the necessary active experts from disk.&lt;/p&gt;
&lt;p&gt;For the moment I tried the models out via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt;, using &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm openrouter refresh
llm -m openrouter/deepseek/deepseek-v4-pro 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the pelican &lt;a href="https://gist.github.com/simonw/4a7a9e75b666a58a0cf81495acddf529"&gt;for DeepSeek-V4-Flash&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/deepseek-v4-flash.png" alt="Excellent bicycle - good frame shape, nice chain, even has a reflector on the front wheel. Pelican has a mean looking expression but has its wings on the handlebars and feet on the pedals. Pouch is a little sharp." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/9e8dfed68933ab752c9cf27a03250a7c"&gt;for DeepSeek-V4-Pro&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/deepseek-v4-pro.png" alt="Another solid bicycle, albeit the spokes are a little jagged and the frame is compressed a bit. Pelican has gone a bit wrong - it has a VERY large body, only one wing, a weirdly hairy backside and generally loos like it was drown be a different artist from the bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;For comparison, take a look at the pelicans I got from &lt;a href="https://simonwillison.net/2025/Dec/1/deepseek-v32/"&gt;DeepSeek V3.2 in December&lt;/a&gt;, &lt;a href="https://simonwillison.net/2025/Aug/22/deepseek-31/"&gt;V3.1 in August&lt;/a&gt;, and &lt;a href="https://simonwillison.net/2025/Mar/24/deepseek/"&gt;V3-0324 in March 2025&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So the pelicans are pretty good, but what's really notable here is the &lt;em&gt;cost&lt;/em&gt;. DeepSeek V4 is a very, very inexpensive model.&lt;/p&gt;
&lt;p&gt;This is &lt;a href="https://api-docs.deepseek.com/quick_start/pricing"&gt;DeepSeek's pricing page&lt;/a&gt;. They're charging $0.14/million tokens input and $0.28/million tokens output for Flash, and $1.74/million input and $3.48/million output for Pro.&lt;/p&gt;
&lt;p&gt;Here's a comparison table with the frontier models from Gemini, OpenAI and Anthropic:&lt;/p&gt;
&lt;center&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/M)&lt;/th&gt;
&lt;th&gt;Output ($/M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Nano&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Flash-Lite&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash Preview&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Mini&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;td&gt;$4.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.74&lt;/td&gt;
&lt;td&gt;$3.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;td&gt;$2&lt;/td&gt;
&lt;td&gt;$12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;$25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;$30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
&lt;p&gt;DeepSeek-V4-Flash is the cheapest of the small models, beating even OpenAI's GPT-5.4 Nano. DeepSeek-V4-Pro is the cheapest of the larger frontier models.&lt;/p&gt;
&lt;p&gt;This note from &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/DeepSeek_V4.pdf"&gt;the DeepSeek paper&lt;/a&gt; helps explain why they can price these models so low - they've focused a great deal on efficiency with this release, especially for longer context prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the scenario of 1M-token context, even DeepSeek-V4-Pro, which has a larger number of activated parameters, attains only 27% of the single-token FLOPs (measured in equivalent FP8 FLOPs) and 10% of the KV cache size relative to DeepSeek-V3.2. Furthermore, DeepSeek-V4-Flash, with its smaller number of activated parameters, pushes efficiency even further: in the 1M-token context setting, it achieves only 10% of the single-token FLOPs and 7% of the KV cache size compared with DeepSeek-V3.2.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;DeepSeek's self-reported benchmarks &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/DeepSeek_V4.pdf"&gt;in their paper&lt;/a&gt; show their Pro model competitive with those other frontier models, albeit with this note:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Through the expansion of reasoning tokens, DeepSeek-V4-Pro-Max demonstrates superior performance relative to GPT-5.2 and Gemini-3.0-Pro on standard reasoning benchmarks. Nevertheless, its performance falls marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm keeping an eye on &lt;a href="https://huggingface.co/unsloth/models"&gt;huggingface.co/unsloth/models&lt;/a&gt; as I expect the Unsloth team will have a set of quantized versions out pretty soon. It's going to be very interesting to see how well that Flash model runs on my own machine.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>A pelican for GPT-5.5 via the semi-official Codex backdoor API</title><link href="https://simonwillison.net/2026/Apr/23/gpt-5-5/#atom-tag" rel="alternate"/><published>2026-04-23T19:59:47+00:00</published><updated>2026-04-23T19:59:47+00:00</updated><id>https://simonwillison.net/2026/Apr/23/gpt-5-5/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-5/"&gt;GPT-5.5 is out&lt;/a&gt;. It's available in OpenAI Codex and is rolling out to paid ChatGPT subscribers. I've had some preview access and found it to be a fast, effective and highly capable model. As is usually the case these days, it's hard to put into words what's good about it - I ask it to build things and it builds exactly what I ask for!&lt;/p&gt;
&lt;p&gt;There's one notable omission from today's release - the API:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale. We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When I run my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican benchmark&lt;/a&gt; I always prefer to use an API, to avoid hidden system prompts in ChatGPT or other agent harnesses from impacting the results.&lt;/p&gt;
&lt;h4 id="the-openclaw-backdoor"&gt;The OpenClaw backdoor&lt;/h4&gt;
&lt;p&gt;One of the ongoing tension points in the AI world over the past few months has concerned how agent harnesses like OpenClaw and Pi interact with the APIs provided by the big providers.&lt;/p&gt;
&lt;p&gt;Both OpenAI and Anthropic offer popular monthly subscriptions which provide access to their models at a significant discount to their raw API.&lt;/p&gt;
&lt;p&gt;OpenClaw integrated directly with this mechanism, and was then &lt;a href="https://www.theverge.com/ai-artificial-intelligence/907074/anthropic-openclaw-claude-subscription-ban"&gt;blocked from doing so&lt;/a&gt; by Anthropic. This kicked off a whole thing. OpenAI - who recently hired OpenClaw creator Peter Steinberger - saw an opportunity for an easy karma win and announced that OpenClaw was welcome to continue integrating with OpenAI's subscriptions via the same mechanism used by their (open source) Codex CLI tool.&lt;/p&gt;
&lt;p&gt;Does this mean &lt;em&gt;anyone&lt;/em&gt; can write code that integrates with OpenAI's Codex-specific APIs to hook into those existing subscriptions?&lt;/p&gt;
&lt;p&gt;The other day &lt;a href="https://twitter.com/jeremyphoward/status/2046537816834965714"&gt;Jeremy Howard asked&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Anyone know whether OpenAI officially supports the use of the &lt;code&gt;/backend-api/codex/responses&lt;/code&gt; endpoint that Pi and Opencode (IIUC) uses?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It turned out that on March 30th OpenAI's Romain Huet &lt;a href="https://twitter.com/romainhuet/status/2038699202834841962"&gt;had tweeted&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We want people to be able to use Codex, and their ChatGPT subscription, wherever they like! That means in the app, in the terminal, but also in JetBrains, Xcode, OpenCode, Pi, and now Claude Code.&lt;/p&gt;
&lt;p&gt;That’s why Codex CLI and Codex app server are open source too! 🙂&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And Peter Steinberger &lt;a href="https://twitter.com/steipete/status/2046775849769148838"&gt;replied to Jeremy&lt;/a&gt; that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenAI sub is officially supported.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="llm-openai-via-codex"&gt;llm-openai-via-codex&lt;/h4&gt;
&lt;p&gt;So... I had Claude Code reverse-engineer the &lt;a href="https://github.com/openai/codex"&gt;openai/codex&lt;/a&gt; repo, figure out how authentication tokens were stored and build me &lt;a href="https://github.com/simonw/llm-openai-via-codex"&gt;llm-openai-via-codex&lt;/a&gt;, a new plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; which picks up your existing Codex subscription and uses it to run prompts!&lt;/p&gt;
&lt;p&gt;(With hindsight I wish I'd used GPT-5.4 or the GPT-5.5 preview, it would have been funnier. I genuinely considered rewriting the project from scratch using Codex and GPT-5.5 for the sake of the joke, but decided not to spend any more time on this!)&lt;/p&gt;
&lt;p&gt;Here's how to use it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install Codex CLI, buy an OpenAI plan, login to Codex&lt;/li&gt;
&lt;li&gt;Install LLM: &lt;code&gt;uv tool install llm&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Install the new plugin: &lt;code&gt;llm install llm-openai-via-codex&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Start prompting: &lt;code&gt;llm -m openai-codex/gpt-5.5 'Your prompt goes here'&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All existing LLM features should also work - use &lt;code&gt;-a filepath.jpg/URL&lt;/code&gt; to attach an image, &lt;code&gt;llm chat -m openai-codex/gpt-5.5&lt;/code&gt; to start an ongoing chat, &lt;code&gt;llm logs&lt;/code&gt; to view logged conversations and &lt;code&gt;llm --tool ...&lt;/code&gt; to &lt;a href="https://llm.datasette.io/en/stable/tools.html"&gt;try it out with tool support&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="and-some-pelicans"&gt;And some pelicans&lt;/h4&gt;
&lt;p&gt;Let's generate a pelican!&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openai-via-codex
llm -m openai-codex/gpt-5.5 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/edda1d98f7ba07fd95eeff473cb16634"&gt;what I got back&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gpt-5.5-pelican.png" alt="It is a bit mangled to be honest - good beak, pelican body shapes are slightly weird, legs do at least extend to the pedals, bicycle frame is not quite right." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I've seen better &lt;a href="https://simonwillison.net/2026/Mar/17/mini-and-nano/#pelicans"&gt;from GPT-5.4&lt;/a&gt;, so I tagged on &lt;code&gt;-o reasoning_effort xhigh&lt;/code&gt; and &lt;a href="https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602cc5"&gt;tried again&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;That one took almost four minutes to generate, but I think it's a much better effort.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gpt-5.5-pelican-xhigh.png" alt="Pelican has gradients now, body is much better put together, bicycle is nearly the right shape albeit with one extra bar between pedals and front wheel, clearly a better image overall." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;If you compare the SVG code (&lt;a href="https://gist.github.com/simonw/edda1d98f7ba07fd95eeff473cb16634#response"&gt;default&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602cc5#response"&gt;xhigh&lt;/a&gt;) the &lt;code&gt;xhigh&lt;/code&gt; one took a very different approach, which is much more CSS-heavy - as demonstrated by those gradients. &lt;code&gt;xhigh&lt;/code&gt; used 9,322 reasoning tokens where the default used just 39.&lt;/p&gt;
&lt;h4 id="a-few-more-notes-on-gpt-5-5"&gt;A few more notes on GPT-5.5&lt;/h4&gt;
&lt;p&gt;One of the most notable things about GPT-5.5 is the pricing. Once it goes live in the API it's &lt;a href="https://openai.com/index/introducing-gpt-5-5/#availability-and-pricing"&gt;going to be priced&lt;/a&gt; at &lt;em&gt;twice&lt;/em&gt; the cost of GPT-5.4 - $5 per 1M input tokens and $30 per 1M output tokens, where 5.4 is $2.5 and $15.&lt;/p&gt;
&lt;p&gt;GPT-5.5 Pro will be even more: $30 per 1M input tokens and $180 per 1M output tokens.&lt;/p&gt;
&lt;p&gt;GPT-5.4 will remain available. At half the price of 5.5 this feels like 5.4 is to 5.5 as Claude Sonnet is to Claude Opus.&lt;/p&gt;
&lt;p&gt;Ethan Mollick has a &lt;a href="https://www.oneusefulthing.org/p/sign-of-the-future-gpt-55"&gt;detailed review of GPT-5.5&lt;/a&gt; where he put it (and GPT-5.5 Pro) through an array of interesting challenges. His verdict: the jagged frontier continues to hold, with GPT-5.5 excellent at some things and challenged by others in a way that remains difficult to predict.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex"&gt;codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="codex"/><category term="gpt"/></entry><entry><title>Changes to GitHub Copilot Individual plans</title><link href="https://simonwillison.net/2026/Apr/22/changes-to-github-copilot/#atom-tag" rel="alternate"/><published>2026-04-22T03:30:02+00:00</published><updated>2026-04-22T03:30:02+00:00</updated><id>https://simonwillison.net/2026/Apr/22/changes-to-github-copilot/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.blog/news-insights/company-news/changes-to-github-copilot-individual-plans/"&gt;Changes to GitHub Copilot Individual plans&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
On the same day as Claude Code's temporary will-they-won't-they $100/month kerfuffle (for the moment, &lt;a href="https://simonwillison.net/2026/Apr/22/claude-code-confusion/#they-reversed-it"&gt;they won't&lt;/a&gt;), here's the latest on GitHub Copilot pricing.&lt;/p&gt;
&lt;p&gt;Unlike Anthropic, GitHub put up an official announcement about their changes, which include tightening usage limits, pausing signups for individual plans (!), restricting Claude Opus 4.7 to the more expensive $39/month "Pro+" plan, and dropping the previous Opus models entirely.&lt;/p&gt;
&lt;p&gt;The key paragraph:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Agentic workflows have fundamentally changed Copilot’s compute demands. Long-running, parallelized sessions now regularly consume far more resources than the original plan structure was built to support. As Copilot’s agentic capabilities have expanded rapidly, agents are doing more work, and more customers are hitting usage limits designed to maintain service reliability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's easy to forget that just six months ago heavy LLM users were burning an order of magnitude less tokens. Coding agents consume a &lt;em&gt;lot&lt;/em&gt; of compute.&lt;/p&gt;
&lt;p&gt;Copilot was also unique (I believe) among agents in charging per-request, not per-token. (&lt;em&gt;Correction: Windsurf also operated a credit system like this which they &lt;a href="https://windsurf.com/blog/windsurf-pricing-plans"&gt;abandoned last month&lt;/a&gt;&lt;/em&gt;.) This means that single agentic requests which burn more tokens cut directly into their margins. The most recent pricing scheme addresses that with token-based usage limits on a per-session and weekly basis.&lt;/p&gt;
&lt;p&gt;My one problem with this announcement is that it doesn't clearly clarify &lt;em&gt;which&lt;/em&gt; product called "GitHub Copilot" is affected by these changes. Last month in &lt;a href="https://teybannerman.com/strategy/2026/03/31/how-many-microsoft-copilot-are-there.html"&gt;How many products does Microsoft have named 'Copilot'? I mapped every one&lt;/a&gt; Tey Bannerman identified 75 products that share the Copilot brand, 15 of which have "GitHub Copilot" in the title.&lt;/p&gt;
&lt;p&gt;Judging by the linked &lt;a href="https://github.com/features/copilot/plans"&gt;GitHub Copilot plans page&lt;/a&gt; this covers Copilot CLI, Copilot cloud agent and code review (features on &lt;a href="https://github.com/"&gt;GitHub.com&lt;/a&gt; itself), and the Copilot IDE features available in VS Code, Zed, JetBrains and more.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47838508"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/microsoft"&gt;microsoft&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-copilot"&gt;github-copilot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="microsoft"/><category term="ai"/><category term="generative-ai"/><category term="github-copilot"/><category term="llms"/><category term="llm-pricing"/><category term="coding-agents"/></entry><entry><title>Is Claude Code going to cost $100/month? Probably not - it's all very confusing</title><link href="https://simonwillison.net/2026/Apr/22/claude-code-confusion/#atom-tag" rel="alternate"/><published>2026-04-22T02:07:34+00:00</published><updated>2026-04-22T02:07:34+00:00</updated><id>https://simonwillison.net/2026/Apr/22/claude-code-confusion/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic today quietly (as in &lt;em&gt;silently&lt;/em&gt;, no announcement anywhere at all) updated their &lt;a href="https://claude.com/pricing"&gt;claude.com/pricing&lt;/a&gt; page (but not their &lt;a href="https://support.claude.com/en/articles/11049762-choosing-a-claude-plan"&gt;Choosing a Claude plan page&lt;/a&gt;, which shows up first for me on Google) to add this tiny but significant detail (arrow is mine, &lt;a href="https://simonwillison.net/2026/Apr/22/claude-code-confusion/#they-reversed-it"&gt;and it's already reverted&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/anthropic-x.jpg" alt="Screenshot of the Claude pricing grid - Compare features across plans. Free, Pro, Max 5x and Max 20x all have the same features, with the exception of Claude Code which is on Max only and Claude Cowork which is on Pro and Max only. An arrow highlights the Claude Code for Pro cross." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://web.archive.org/web/20260421040656/claude.com/pricing"&gt;Internet Archive copy&lt;/a&gt; from yesterday shows a checkbox there. Claude Code used to be a feature of the $20/month Pro plan, but according to the new pricing page it is now exclusive to the $100/month or $200/month Max plans.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: don't miss &lt;a href="https://simonwillison.net/2026/Apr/22/claude-code-confusion/#they-reversed-it"&gt;the update to this post&lt;/a&gt;, they've already changed course a few hours after this change went live.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;So what the heck is going on? Unsurprisingly, &lt;a href="https://www.reddit.com/r/ClaudeAI/comments/1srzhd7/psa_claude_pro_no_longer_lists_claude_code_as_an/"&gt;Reddit&lt;/a&gt; and &lt;a href="https://news.ycombinator.com/item?id=47854477"&gt;Hacker News&lt;/a&gt; and &lt;a href="https://twitter.com/i/trending/2046718768634589239"&gt;Twitter&lt;/a&gt; all caught fire.&lt;/p&gt;
&lt;p&gt;I didn't believe the screenshots myself when I first saw them - aside from the pricing grid I could find no announcement from Anthropic anywhere. Then Amol Avasare, Anthropic's Head of Growth, &lt;a href="https://twitter.com/TheAmolAvasare/status/2046724659039932830"&gt;tweeted&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For clarity, we're running a small test on ~2% of new prosumer signups. Existing Pro and Max subscribers aren't affected.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And that appears to be the closest we have had to official messaging from Anthropic.&lt;/p&gt;
&lt;p&gt;I don't buy the "~2% of new prosumer signups" thing, since everyone I've talked to is seeing the new pricing grid and the Internet Archive has already &lt;a href="https://web.archive.org/web/20260422001250/https://claude.com/pricing"&gt;snapped a copy&lt;/a&gt;. Maybe he means that they'll only be running this version of the pricing grid for a limited time which somehow adds up to "2%" of signups?&lt;/p&gt;
&lt;p&gt;I'm also amused to see Claude Cowork remain available on the $20/month plan, because Claude Cowork is effectively a rebranded version of Claude Code wearing a less threatening hat!&lt;/p&gt;
&lt;p&gt;There are a whole bunch of things that are bad about this.&lt;/p&gt;
&lt;p&gt;If we assume this is indeed a test, and that test comes up negative and they decide not to go ahead with it, the damage has still been extensive:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A whole lot of people got scared or angry or both that a service they relied on was about to be rug-pulled. There really is a significant difference between $20/month and $100/month for most people, especially outside of higher salary countries.&lt;/li&gt;
&lt;li&gt;The uncertainty is really bad! A tweet from an employee is &lt;em&gt;not&lt;/em&gt; the way to make an announcement like this. I wasted a solid hour of my afternoon trying to figure out what had happened here. My trust in Anthropic's transparency around pricing - a &lt;em&gt;crucial factor&lt;/em&gt; in how I understand their products - has been shaken.&lt;/li&gt;
&lt;li&gt;Strategically, should I be taking a bet on Claude Code if I know that they might 5x the minimum price of the product?&lt;/li&gt;
&lt;li&gt;More of a personal issue, but one I care deeply about myself: I invest a &lt;a href="https://simonwillison.net/tags/claude-code/"&gt;great deal of effort&lt;/a&gt; (that's 105 posts and counting) in teaching people how to use Claude Code. I don't want to invest that effort in a product that most people cannot afford to use.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Last month I ran &lt;a href="https://simonw.github.io/nicar-2026-coding-agents/"&gt;a tutorial for journalists&lt;/a&gt; on "Coding agents for data analysis" at the annual NICAR data journalism conference. I'm not going to be teaching that audience a course that depends on a $100/month subscription!&lt;/p&gt;
&lt;p&gt;This also doesn't make sense to me as a strategy for Anthropic. Claude Code &lt;em&gt;defined the category&lt;/em&gt; of coding agents. It's responsible for billions of dollars in annual revenue for Anthropic already. It has a stellar reputation, but I'm not convinced that reputation is strong enough for it to lose the $20/month trial and jump people directly to a $100/month subscription.&lt;/p&gt;
&lt;p&gt;OpenAI have been investing heavily in catching up to Claude Code with their Codex products. Anthropic just handed them this marketing opportunity on a plate - here's Codex engineering lead &lt;a href="https://twitter.com/thsottiaux/status/2046740759056162816"&gt;Thibault Sottiaux&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I don't know what they are doing over there, but Codex will continue to be available both in the FREE and PLUS ($20) plans. We have the compute and efficient models to support it. For important changes, we will engage with the community well ahead of making them.&lt;/p&gt;
&lt;p&gt;Transparency and trust are two principles we will not break, even if it means momentarily earning less. A reminder that you vote with your subscription for the values you want to see in this world.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I should note that I pay $200/month for Claude Max and I consider it well worth the money. I've had periods of free access in the past courtesy of Anthropic but I'm currently paying full price, and happy to do so.&lt;/p&gt;
&lt;p&gt;But I care about the accessibility of the tools that I work with and teach. If Codex has a free tier while Claude Code starts at $100/month I should obviously switch to Codex, because that way I can use the same tool as the people I want to teach how to use coding agents.&lt;/p&gt;
&lt;p&gt;Here's what I think happened. I think Anthropic are trying to optimize revenue growth - obviously - and someone pitched making Claude Code only available for Max and higher. That's clearly a bad idea, but "testing" culture says that it's worth putting even bad ideas out to test just in case they surprise you.&lt;/p&gt;
&lt;p&gt;So they started a test, without taking into account the wailing and gnashing of teeth that would result when their test was noticed - or accounting for the longer-term brand damage that would be caused.&lt;/p&gt;
&lt;p&gt;Or maybe they &lt;em&gt;did&lt;/em&gt; account for that, and decided it was worth the risk.&lt;/p&gt;
&lt;p&gt;I don't think that calculation was worthwhile. They're going to have to make a &lt;em&gt;very&lt;/em&gt; firm commitment along the lines of "we heard your feedback and we commit to keeping Claude Code available on our $20/month plan going forward" to regain my trust.&lt;/p&gt;
&lt;p&gt;As it stands, Codex is looking like a much safer bet for me to invest my time in learning and building educational materials around.&lt;/p&gt;
&lt;h4 id="they-reversed-it"&gt;Update: they've reversed it already&lt;/h4&gt;
&lt;p&gt;In the time I was &lt;em&gt;typing this blog entry&lt;/em&gt; Anthropic appear to have reversed course - the &lt;a href="https://claude.com/pricing"&gt;claude.com/pricing page&lt;/a&gt; now has a checkbox back in the Pro column for Claude Code. I can't find any official communication about it though.&lt;/p&gt;
&lt;p&gt;Let's see if they can come up with an explanation/apology that's convincing enough to offset the trust bonfire from this afternoon!&lt;/p&gt;
&lt;h4 id="update-2"&gt;Update 2: it may still affect 2% of signups?&lt;/h4&gt;
&lt;p&gt;Amol &lt;a href="https://x.com/TheAmolAvasare/status/2046788872517066971"&gt;on Twitter&lt;/a&gt;:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;was a mistake that the logged-out landing page and docs were updated for this test [&lt;a href="https://twitter.com/TheAmolAvasare/status/2046783926920978681"&gt;embedded self-tweet&lt;/a&gt;]&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Getting lots of questions on why the landing page / docs were updated if only 2% of new signups were affected.&lt;/p&gt;

&lt;p&gt;This was understandably confusing for the 98% of folks not part of the experiment, and we've reverted both the landing page and docs changes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;p&gt;So the experiment is still running, just not visible to the rest of the world?&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex"&gt;codex&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="llm-pricing"/><category term="ai-ethics"/><category term="coding-agents"/><category term="claude-code"/><category term="codex"/></entry><entry><title>Claude Token Counter, now with model comparisons</title><link href="https://simonwillison.net/2026/Apr/20/claude-token-counts/#atom-tag" rel="alternate"/><published>2026-04-20T00:50:45+00:00</published><updated>2026-04-20T00:50:45+00:00</updated><id>https://simonwillison.net/2026/Apr/20/claude-token-counts/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tools.simonwillison.net/claude-token-counter"&gt;Claude Token Counter, now with model comparisons&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I &lt;a href="https://github.com/simonw/tools/pull/269"&gt;upgraded&lt;/a&gt; my Claude Token Counter tool to add the ability to run the same count against different models in order to compare them.&lt;/p&gt;
&lt;p&gt;As far as I can tell Claude Opus 4.7 is the first model to change the tokenizer, so it's only worth running comparisons between 4.7 and 4.6. The Claude &lt;a href="https://platform.claude.com/docs/en/build-with-claude/token-counting"&gt;token counting API&lt;/a&gt; accepts any Claude model ID though so I've included options for all four of the notable current models (Opus 4.7 and 4.6, Sonnet 4.6, and Haiku 4.5).&lt;/p&gt;
&lt;p&gt;In the Opus 4.7 announcement &lt;a href="https://www.anthropic.com/news/claude-opus-4-7#migrating-from-opus-46-to-opus-47"&gt;Anthropic said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I pasted the &lt;a href="https://github.com/simonw/research/blob/2cf912666ba08ef0c00a1b51ee07c9a8e64579ef/extract-system-prompts/claude-opus-4-7.md?plain=1"&gt;Opus 4.7 system prompt&lt;/a&gt; into the token counting tool and found that the Opus 4.7 tokenizer used 1.46x the number of tokens as Opus 4.6.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a token comparison tool. Models to compare: claude-opus-4-7 (checked), claude-opus-4-6 (checked), claude-opus-4-5, claude-sonnet-4-6, claude-haiku-4-5. Note: &amp;quot;These models share the same tokenizer&amp;quot;. Blue &amp;quot;Count Tokens&amp;quot; button. Results table — Model | Tokens | vs. lowest. claude-opus-4-7: 7,335 tokens, 1.46x (yellow badge). claude-opus-4-6: 5,039 tokens, 1.00x (green badge)." src="https://static.simonwillison.net/static/2026/claude-token-count.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Opus 4.7 uses the same pricing is Opus 4.6 - $5 per million input tokens and $25 per million output tokens - but this token inflation means we can expect it to be around 40% more expensive.&lt;/p&gt;
&lt;p&gt;The token counter tool also accepts images. Opus 4.7 has improved image support, described like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Opus 4.7 has better vision for high-resolution images: it can accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many as prior Claude models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tried counting tokens for a 3456x2234 pixel 3.7MB PNG and got an even bigger increase in token counts - 3.01x times the number of tokens for 4.7 compared to 4.6:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Same UI, this time with an uploaded screenshot PNG image. claude-opus-4-7: 4,744 tokens, 3.01x (yellow badge). claude-opus-4-6: 1,578 tokens, 1.00x (green badge)." src="https://static.simonwillison.net/static/2026/claude-token-count-image.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: That 3x increase for images is &lt;em&gt;entirely&lt;/em&gt; due to Opus 4.7 being able to handle higher resolutions. I tried that again with a 682x318 pixel image and it took 314 tokens with Opus 4.7 and 310 with Opus 4.6, so effectively the same cost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: I tried a 15MB, 30 page text-heavy PDF and Opus 4.7 reported 60,934   tokens while 4.6 reported 56,482 - that's a 1.08x multiplier, significantly lower than the multiplier I got for raw text.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="tokenization"/></entry><entry><title>GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52</title><link href="https://simonwillison.net/2026/Mar/17/mini-and-nano/#atom-tag" rel="alternate"/><published>2026-03-17T19:39:17+00:00</published><updated>2026-03-17T19:39:17+00:00</updated><id>https://simonwillison.net/2026/Mar/17/mini-and-nano/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI today: &lt;a href="https://openai.com/index/introducing-gpt-5-4-mini-and-nano/"&gt;Introducing GPT‑5.4 mini and nano&lt;/a&gt;. These models join GPT-5.4 which was released &lt;a href="https://openai.com/index/introducing-gpt-5-4/"&gt;two weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OpenAI's self-reported benchmarks show the new 5.4-nano out-performing their previous GPT-5 mini model when run at maximum reasoning effort. The new mini is also 2x faster than the previous mini.&lt;/p&gt;
&lt;p&gt;Here's how the pricing looks - all prices are per million tokens. &lt;code&gt;gpt-5.4-nano&lt;/code&gt; is notably even cheaper than Google's Gemini 3.1 Flash-Lite:&lt;/p&gt;
&lt;center&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input&lt;/th&gt;
      &lt;th&gt;Cached input&lt;/th&gt;
      &lt;th&gt;Output&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4&lt;/td&gt;
      &lt;td&gt;$2.50&lt;/td&gt;
      &lt;td&gt;$0.25&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4-mini&lt;/td&gt;
      &lt;td&gt;$0.75&lt;/td&gt;
      &lt;td&gt;$0.075&lt;/td&gt;
      &lt;td&gt;$4.50&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4-nano&lt;/td&gt;
      &lt;td&gt;$0.20&lt;/td&gt;
      &lt;td&gt;$0.02&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;&lt;td colspan="4"&gt;&lt;center&gt;Other models for comparison&lt;/center&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.6&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$25.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
      &lt;td&gt;$2.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$12.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
      &lt;td&gt;$1.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
    &lt;/tr&gt;
&lt;tr&gt;
      &lt;td&gt;Gemini 3.1 Flash-Lite&lt;/td&gt;
      &lt;td&gt;$0.25&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$1.50&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/center&gt;
&lt;p&gt;I used GPT-5.4 nano to generate a description of this photo I took at the &lt;a href="https://www.niche-museums.com/118"&gt;John M. Mossman Lock Collection&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/IMG_2324.jpeg" alt="Description below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gpt-5.4-nano -a IMG_2324.jpeg 'describe image'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image shows the interior of a museum gallery with a long display wall. White-painted brick walls are covered with many framed portraits arranged in neat rows. Below the portraits, there are multiple glass display cases with dark wooden frames and glass tops/fronts, containing various old historical objects and equipment. The room has a polished wooden floor, hanging ceiling light fixtures/cords, and a few visible pipes near the top of the wall. In the foreground, glass cases run along the length of the room, reflecting items from other sections of the gallery.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That took 2,751 input tokens and 112 output tokens, at a cost of &lt;a href="https://www.llm-prices.com/#it=2751&amp;amp;ot=112&amp;amp;sel=gpt-5.4-nano"&gt;0.069 cents&lt;/a&gt; (less than a tenth of a cent). That means describing every single photo in my 76,000 photo collection would cost around $52.44.&lt;/p&gt;
&lt;p&gt;I released &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-29"&gt;llm 0.29&lt;/a&gt; with support for the new models.&lt;/p&gt;
&lt;h4 id="pelicans"&gt;Pelicans&lt;/h4&gt;
&lt;p&gt;Then I had OpenAI Codex loop through all five reasoning effort levels and all three models and produce this combined SVG grid of pelicans riding bicycles (&lt;a href="https://gist.github.com/simonw/f16292d9a5b90b28054cff3ba497a3ca"&gt;generation transcripts here&lt;/a&gt;). I do like the gpt-5.4 xhigh one the best, it has a good bicycle (with nice spokes) and the pelican has a fish in its beak!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gpt-5.4-pelican-family.svg" alt="Described by Claude Opus 4.6: A 5x3 comparison grid of AI-generated cartoon illustrations of a pelican riding a bicycle. Columns are labeled &amp;quot;gpt-5.4-nano&amp;quot;, &amp;quot;gpt-5.4-mini&amp;quot;, and &amp;quot;gpt-5.4&amp;quot; across the top, and rows are labeled &amp;quot;none&amp;quot;, &amp;quot;low&amp;quot;, &amp;quot;medium&amp;quot;, &amp;quot;high&amp;quot;, and &amp;quot;xhigh&amp;quot; down the left side, representing quality/detail settings. In the &amp;quot;none&amp;quot; row, gpt-5.4-nano shows a chaotic white bird with misplaced arrows and tangled wheels on grass, gpt-5.4-mini shows a duck-like brown bird awkwardly straddling a motorcycle-like bike, and gpt-5.4 shows a stiff gray-and-white pelican sitting atop a blue tandem bicycle with extra legs. In the &amp;quot;low&amp;quot; row, nano shows a chubby round white bird pedaling with small feet on grass, mini shows a cleaner white bird riding a blue bicycle with motion lines, and gpt-5.4 shows a pelican with a blue cap riding confidently but with slightly awkward proportions. In the &amp;quot;medium&amp;quot; row, nano regresses to a strange bird standing over bowling balls on ice, mini shows two plump white birds merged onto one yellow-wheeled bicycle, and gpt-5.4 shows a more recognizable gray-and-white pelican on a red bicycle but with tangled extra legs. In the &amp;quot;high&amp;quot; row, nano shows multiple small pelicans crowded around a broken green bicycle on grass with a sun overhead, mini shows a tandem bicycle with two white pelicans and clear blue sky, and gpt-5.4 shows two pelicans stacked on a red tandem bike with the most realistic proportions yet. In the &amp;quot;xhigh&amp;quot; row, nano shows the most detailed scene with a pelican on a detailed bicycle with grass and a large sun but still somewhat jumbled anatomy, mini produces the cleanest single pelican on a yellow-accented bicycle with a light blue sky, and gpt-5.4 shows a well-rendered gray pelican on a teal bicycle with the best overall coherence. Generally, quality improves moving right across models and down through quality tiers, though &amp;quot;medium&amp;quot; is inconsistently worse than &amp;quot;low&amp;quot; for some models, and all images maintain a lighthearted cartoon style with pastel skies and simple backgrounds." style="max-width: 100%;" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>1M context is now generally available for Opus 4.6 and Sonnet 4.6</title><link href="https://simonwillison.net/2026/Mar/13/1m-context/#atom-tag" rel="alternate"/><published>2026-03-13T18:29:13+00:00</published><updated>2026-03-13T18:29:13+00:00</updated><id>https://simonwillison.net/2026/Mar/13/1m-context/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://claude.com/blog/1m-context-ga"&gt;1M context is now generally available for Opus 4.6 and Sonnet 4.6&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Here's what surprised me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Standard pricing now applies across the full 1M window for both models, with no long-context premium.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenAI and Gemini both &lt;a href="https://www.llm-prices.com/#sel=gemini-3-1-pro-preview-200k%2Cgpt-5.4-272k%2Cgemini-3-1-pro-preview%2Cgpt-5.4"&gt;charge more&lt;/a&gt; for prompts where the token count goes above a certain point - 200,000 for Gemini 3.1 Pro and 272,000 for GPT-5.4.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="long-context"/></entry><entry><title>Gemini 3.1 Flash-Lite</title><link href="https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag" rel="alternate"/><published>2026-03-03T21:53:54+00:00</published><updated>2026-03-03T21:53:54+00:00</updated><id>https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Google's latest model is an update to their inexpensive Flash-Lite family. At $0.25/million tokens of input and $1.5/million output this is 1/8th the price of Gemini 3.1 Pro.&lt;/p&gt;
&lt;p&gt;It supports four different thinking levels, so I had it output &lt;a href="https://gist.github.com/simonw/99fb28dc11d0c24137d4ff8a33978a9e"&gt;four different pelicans&lt;/a&gt;:&lt;/p&gt;
&lt;div style="
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 8px;
    margin: 0 auto;
  "&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-minimal.png" alt="A minimalist vector-style illustration of a stylized bird riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;minimal&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-low.png" alt="A minimalist graphic of a light blue round bird with a single black dot for an eye, wearing a yellow backpack and riding a black bicycle on a flat grey line." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;low&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-medium.png" alt="A minimalist digital illustration of a light blue bird wearing a yellow backpack while riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;medium&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-high.png" alt="A minimal, stylized line drawing of a bird-like creature with a yellow beak riding a bicycle made of simple geometric lines." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;high&lt;/p&gt;
    &lt;/div&gt;
&lt;/div&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Introducing Claude Sonnet 4.6</title><link href="https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag" rel="alternate"/><published>2026-02-17T23:58:58+00:00</published><updated>2026-02-17T23:58:58+00:00</updated><id>https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-sonnet-4-6"&gt;Introducing Claude Sonnet 4.6&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;November's Opus 4.5&lt;/a&gt; while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus models are $5/$25). Here's &lt;a href="https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf"&gt;the system card PDF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 has a "reliable knowledge cutoff" of August 2025, compared to Opus 4.6's May 2025 and Haiku 4.5's February 2025. Both Opus and Sonnet default to 200,000 max input tokens but can stretch to 1 million in beta and at a higher cost.&lt;/p&gt;
&lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.24"&gt;llm-anthropic 0.24&lt;/a&gt; with support for both Sonnet 4.6 and Opus 4.6. Claude Code &lt;a href="https://github.com/simonw/llm-anthropic/pull/65"&gt;did most of the work&lt;/a&gt; - the new models had a fiddly amount of extra details around adaptive thinking and no longer supporting prefixes, as described &lt;a href="https://platform.claude.com/docs/en/about-claude/models/migration-guide"&gt;in Anthropic's migration guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b185576a95e9321b441f0a4dfc0e297c"&gt;what I got&lt;/a&gt; from:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with llm-anthropic llm 'Generate an SVG of a pelican riding a bicycle' -m claude-sonnet-4.6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="The pelican has a jaunty top hat with a red band. There is a string between the upper and lower beaks for some reason. The bicycle frame is warped in the wrong way." src="https://static.simonwillison.net/static/2026/pelican-sonnet-4.6.png" /&gt;&lt;/p&gt;
&lt;p&gt;The SVG comments include:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Hat (fun accessory) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried a second time and also got a top hat. Sonnet 4.6 apparently loves top hats!&lt;/p&gt;
&lt;p&gt;For comparison, here's the pelican Opus 4.5 drew me &lt;a href="(https://simonwillison.net/2025/Nov/24/claude-opus/)"&gt;in November&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's Anthropic's current best pelican, drawn by Opus 4.6 &lt;a href="https://simonwillison.net/2026/Feb/5/two-new-models/"&gt;on February 5th&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers." src="https://static.simonwillison.net/static/2026/opus-4.6-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Opus 4.6 produces the best pelican beak/pouch. I do think the top hat from Sonnet 4.6 is a nice touch though.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47050488"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="claude-code"/></entry><entry><title>Claude: Speed up responses with fast mode</title><link href="https://simonwillison.net/2026/Feb/7/claude-fast-mode/#atom-tag" rel="alternate"/><published>2026-02-07T23:10:33+00:00</published><updated>2026-02-07T23:10:33+00:00</updated><id>https://simonwillison.net/2026/Feb/7/claude-fast-mode/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://code.claude.com/docs/en/fast-mode"&gt;Claude: Speed up responses with fast mode&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New "research preview" from Anthropic today: you can now access a faster version of their frontier model Claude Opus 4.6 by typing &lt;code&gt;/fast&lt;/code&gt; in Claude Code... but at a cost that's 6x the normal price.&lt;/p&gt;
&lt;p&gt;Opus is usually $5/million input and $25/million output. The new fast mode is $30/million input and $150/million output!&lt;/p&gt;
&lt;p&gt;There's a 50% discount until the end of February 16th, so only a 3x multiple (!) before then.&lt;/p&gt;
&lt;p&gt;How much faster is it? The linked documentation doesn't say, but &lt;a href="https://x.com/claudeai/status/2020207322124132504"&gt;on Twitter&lt;/a&gt; Claude say:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our teams have been building with a 2.5x-faster version of Claude Opus 4.6.&lt;/p&gt;
&lt;p&gt;We’re now making it available as an early experiment via Claude Code and our API.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Opus 4.5 had a context limit of 200,000 tokens. 4.6 has an option to increase that to 1,000,000 at 2x the input price ($10/m) and 1.5x the output price ($37.50/m) once your input exceeds 200,000 tokens. These multiples hold for fast mode too, so after Feb 16th you'll be able to pay a hefty $60/m input and $225/m output for Anthropic's fastest best model.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="claude-code"/><category term="llm-performance"/></entry><entry><title>Gemini 3 Flash</title><link href="https://simonwillison.net/2025/Dec/17/gemini-3-flash/#atom-tag" rel="alternate"/><published>2025-12-17T22:44:52+00:00</published><updated>2025-12-17T22:44:52+00:00</updated><id>https://simonwillison.net/2025/Dec/17/gemini-3-flash/#atom-tag</id><summary type="html">
    &lt;p&gt;It continues to be a busy December, if not quite as busy &lt;a href="https://simonwillison.net/2024/Dec/20/december-in-llms-has-been-a-lot/"&gt;as last year&lt;/a&gt;. Today's big news is &lt;a href="https://blog.google/technology/developers/build-with-gemini-3-flash/"&gt;Gemini 3 Flash&lt;/a&gt;, the latest in Google's "Flash" line of faster and less expensive models.&lt;/p&gt;
&lt;p&gt;Google are emphasizing the comparison between the new Flash and their previous generation's top model Gemini 2.5 Pro:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini 3 Flash's characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF, outputs only text, handles 1,048,576 maximum input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series).&lt;/p&gt;
&lt;p&gt;The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro &amp;gt;200k, and it's nice not to have a price increase for the new Flash at larger token lengths.&lt;/p&gt;
&lt;p&gt;It's a little &lt;em&gt;more&lt;/em&gt; expensive than previous Flash models - Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million on output, Gemini 3 Flash is $0.50/million and $3/million respectively.&lt;/p&gt;
&lt;p&gt;Google &lt;a href="https://blog.google/products/gemini/gemini-3-flash/"&gt;claim&lt;/a&gt; it may still end up cheaper though, due to more efficient output token usage:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;&gt; Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://www.llm-prices.com/#it=100000&amp;amp;ot=10000&amp;amp;sel=gemini-3-flash-preview%2Cgemini-3-pro-preview%2Cgemini-3-pro-preview-200k%2Cgpt-5.2%2Cclaude-opus-4-5%2Cclaude-sonnet-4.5%2Cclaude-4.5-haiku%2Cgemini-2.5-flash%2Cgpt-5-mini"&gt;a more extensive price comparison&lt;/a&gt; on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site.&lt;/p&gt;
&lt;h4 id="generating-some-svgs-of-pelicans"&gt;Generating some SVGs of pelicans&lt;/h4&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.28"&gt;llm-gemini 0.28&lt;/a&gt; this morning with support for the new model. You can try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;According to &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3#thinking_level"&gt;the developer docs&lt;/a&gt; the new model supports four different thinking level options: &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, and &lt;code&gt;high&lt;/code&gt;. This is different from Gemini 3 Pro, which only supported &lt;code&gt;low&lt;/code&gt; and &lt;code&gt;high&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can run those like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are four pelicans, for thinking levels &lt;a href="https://gist.github.com/simonw/8047c805a4a1df7fd4e854b18e7482d9"&gt;minimal&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/fb61686a1f915e3777b4a40e2df41068"&gt;low&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/190c3ce82cd8976827139bbc4dcc2d19"&gt;medium&lt;/a&gt;, and &lt;a href="https://gist.github.com/simonw/da66ffce135359161996e41e50e32ec3"&gt;high&lt;/a&gt;:&lt;/p&gt;
&lt;image-gallery width="4"&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg" alt="A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg" alt="Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg" alt="A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background." /&gt;
    &lt;img src="https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg" alt="Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background." /&gt;
&lt;/image-gallery&gt;
&lt;h4 id="i-built-the-gallery-component-with-gemini-3-flash"&gt;I built the gallery component with Gemini 3 Flash&lt;/h4&gt;
&lt;p&gt;The gallery above uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;image-gallery&lt;/span&gt; &lt;span class="pl-c1"&gt;width&lt;/span&gt;="&lt;span class="pl-s"&gt;4&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;img&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg&lt;/span&gt;" &lt;span class="pl-c1"&gt;alt&lt;/span&gt;="&lt;span class="pl-s"&gt;Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background.&lt;/span&gt;" &lt;span class="pl-kos"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;image-gallery&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-3-flash-preview --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;You write alt text for any image pasted in by the user. Alt text is always presented in a&lt;/span&gt;
&lt;span class="pl-s"&gt;fenced code block to make it easy to copy and paste out. It is always presented on a single&lt;/span&gt;
&lt;span class="pl-s"&gt;line so it can be used easily in Markdown images. All text on the image (for screenshots etc)&lt;/span&gt;
&lt;span class="pl-s"&gt;must be exactly included. A short note describing the nature of the image itself should go first.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
-a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can see the code that powers the image gallery Web Component &lt;a href="https://github.com/simonw/simonwillisonblog/blob/31651b3a527011d1c971d4256c1c9f61ef378d23/static/image-gallery.js"&gt;here on GitHub&lt;/a&gt;. I built it by prompting Gemini 3 Flash via &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-3-flash-preview &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Build a Web Component that implements a simple image gallery. Usage is like this:&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;image-gallery width="5"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image1.jpg" alt="Image 1"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image2.jpg" alt="Image 2" data-thumb="image2-thumb.jpg"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;img src="image3.jpg" alt="Image 3"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;/image-gallery&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;If an image has a data-thumb= attribute that one is used instead, other images are scaled down. &lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;The image gallery always takes up 100% of available width. The width="5" attribute means that five images will be shown next to each other in each row. The default is 3. There are gaps between the images. When an image is clicked it opens a modal dialog with the full size image.&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Return a complete HTML file with both the implementation of the Web Component several example uses of it. Use https://picsum.photos/300/200 URLs for those example images.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It took a few follow-up prompts using &lt;code&gt;llm -c&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Use a real modal such that keyboard shortcuts and accessibility features work without extra JS&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Use X for the close icon and make it a bit more subtle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;remove the hover effect entirely&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;I want no border on the close icon even when it is focused&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/09f63a49f29620d4cbbfd383cfee1db3"&gt;the full transcript&lt;/a&gt;, exported using &lt;code&gt;llm logs -cue&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Those five prompts took:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;225 input, 3,269 output&lt;/li&gt;
&lt;li&gt;2,243 input, 2,908 output&lt;/li&gt;
&lt;li&gt;4,319 input, 2,516 output&lt;/li&gt;
&lt;li&gt;6,376 input, 2,094 output&lt;/li&gt;
&lt;li&gt;8,151 input, 1,806 output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Added together that's 21,314 input and 12,593 output for a grand total &lt;a href="https://www.llm-prices.com/#it=21314&amp;amp;ot=12593&amp;amp;sel=gemini-3-flash-preview"&gt;of 4.8436 cents&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The guide to &lt;a href="https://ai.google.dev/gemini-api/docs/gemini-3#migrating_from_gemini_25"&gt;migrating from Gemini 2.5&lt;/a&gt; reveals one disappointment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Image segmentation:&lt;/strong&gt; Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or &lt;a href="https://ai.google.dev/gemini-api/docs/robotics-overview"&gt;Gemini Robotics-ER 1.5&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/"&gt;wrote about this capability in Gemini 2.5&lt;/a&gt; back in April. I hope they come back in future models - they're a really neat capability that is unique to Gemini.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-components"&gt;web-components&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="web-components"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult</title><link href="https://simonwillison.net/2025/Nov/24/claude-opus/#atom-tag" rel="alternate"/><published>2025-11-24T19:37:07+00:00</published><updated>2025-11-24T19:37:07+00:00</updated><id>https://simonwillison.net/2025/Nov/24/claude-opus/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-opus-4-5"&gt;released Claude Opus 4.5&lt;/a&gt; this morning, which they call "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/"&gt;GPT-5.1-Codex-Max&lt;/a&gt; and Google's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3&lt;/a&gt;, both released within the past week!&lt;/p&gt;
&lt;p&gt;The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February).&lt;/p&gt;
&lt;p&gt;The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for &amp;gt;200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-5#key-improvements-in-opus-4-5-over-opus-4-1"&gt;Key improvements in Opus 4.5 over Opus 4.1&lt;/a&gt; document has a few more interesting details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Opus 4.5 has a new &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort"&gt;effort parameter&lt;/a&gt; which defaults to high but can be set to medium or low for faster responses.&lt;/li&gt;
&lt;li&gt;The model supports &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool"&gt;enhanced computer use&lt;/a&gt;, specifically a &lt;code&gt;zoom&lt;/code&gt; tool which you can provide to Opus 4.5 to allow it to request a zoomed in region of the screen to inspect.&lt;/li&gt;
&lt;li&gt;"&lt;a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking#thinking-block-preservation-in-claude-opus-4-5"&gt;Thinking blocks from previous assistant turns are preserved in model context by default&lt;/a&gt;" - apparently previous Anthropic models discarded those.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in &lt;a href="https://simonwillison.net/2025/Nov/24/sqlite-utils-40a1/"&gt;a new alpha release of sqlite-utils&lt;/a&gt; that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across &lt;a href="https://github.com/simonw/sqlite-utils/compare/10957305be998999e3c95c11863b5709d42b7ae3...4.0a1"&gt;20 commits, 39 files changed,  2,022 additions and 1,173 deletions&lt;/a&gt; in a two day period. Here's the &lt;a href="https://gistpreview.github.io/?f40971b693024fbe984a68b73cc283d2"&gt;Claude Code transcript&lt;/a&gt; where I had it help implement one of the more complicated new features.&lt;/p&gt;
&lt;p&gt;It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in &lt;a href="https://github.com/simonw/sqlite-utils/milestone/7?closed=1"&gt;the milestone for the alpha&lt;/a&gt;. I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model.&lt;/p&gt;
&lt;p&gt;With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected.&lt;/p&gt;
&lt;p&gt;I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two.&lt;/p&gt;
&lt;p&gt;This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.&lt;/p&gt;
&lt;p&gt;Google's Nano Banana Pro image generation model was notable in that its ability to &lt;a href="https://simonwillison.net/2025/Nov/20/nano-banana-pro/#creating-an-infographic"&gt;render usable infographics&lt;/a&gt; really does represent a task at which  previous models had been laughably incapable.&lt;/p&gt;
&lt;p&gt;The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis?&lt;/p&gt;
&lt;p&gt;And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they've fallen one-by-one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models.&lt;/p&gt;
&lt;p&gt;I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double-down on that advice myself!&lt;/p&gt;
&lt;p&gt;I'd love to see AI labs like Anthropic help address this challenge directly. I'd like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle.&lt;/p&gt;
&lt;p&gt;"Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a &lt;em&gt;lot&lt;/em&gt; more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.&lt;/p&gt;
&lt;p id="pelicans"&gt;In the meantime, I'm just gonna have to keep on getting them to draw &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelicans riding bicycles&lt;/a&gt;. Here's Opus 4.5 (on its default &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort"&gt;"high" effort level&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It did significantly better on the &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;new more detailed prompt&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican-advanced.jpg" alt="The pelican has feathers and a red pouch - a close enough version of breeding plumage. The bicycle is a much better shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's that same complex prompt &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#advanced-pelican"&gt;against Gemini 3 Pro&lt;/a&gt; and &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#advanced-pelican-codex-max"&gt;against GPT-5.1-Codex-Max-xhigh&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="still-susceptible-to-prompt-injection"&gt;Still susceptible to prompt injection&lt;/h4&gt;
&lt;p&gt;From &lt;a href="https://www.anthropic.com/news/claude-opus-4-5#a-step-forward-on-safety"&gt;the safety section&lt;/a&gt; of Anthropic's announcement post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-prompt-injection.jpg" alt="Bar chart titled &amp;quot;Susceptibility to prompt-injection style attacks&amp;quot; with subtitle &amp;quot;At k queries; lower is better&amp;quot;. Y-axis shows &amp;quot;ATTACK SUCCESS RATE (%)&amp;quot; from 0-100. Five stacked bars compare AI models with three k values (k=1 in dark gray, k=10 in beige, k=100 in pink). Results: Gemini 3 Pro Thinking (12.5, 60.7, 92.0), GPT-5.1 Thinking (12.6, 58.2, 87.8), Haiku 4.5 Thinking (8.3, 51.1, 85.6), Sonnet 4.5 Thinking (7.3, 41.9, 72.4), Opus 4.5 Thinking (4.7, 33.6, 63.0)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On the one hand this looks great, it's a clear improvement over previous models and the competition.&lt;/p&gt;
&lt;p&gt;What does the chart actually tell us though? It tells us that single attempts at prompt injection still work 1/20 times, and if an attacker can try ten different attacks that success rate goes up to 1/3!&lt;/p&gt;
&lt;p&gt;I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="prompt-injection"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="evals"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="november-2025-inflection"/></entry><entry><title>Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark</title><link href="https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag" rel="alternate"/><published>2025-11-18T19:00:48+00:00</published><updated>2025-11-18T19:00:48+00:00</updated><id>https://simonwillison.net/2025/Nov/18/gemini-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Google released Gemini 3 Pro today. Here's &lt;a href="https://blog.google/products/gemini/gemini-3/"&gt;the announcement from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu&lt;/a&gt;, their &lt;a href="https://blog.google/technology/developers/gemini-3-developers/"&gt;developer blog announcement from Logan Kilpatrick&lt;/a&gt;, the &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;Gemini 3 Pro Model Card&lt;/a&gt;, and their &lt;a href="https://blog.google/products/gemini/gemini-3-collection/"&gt;collection of 11 more articles&lt;/a&gt;. It's a big release!&lt;/p&gt;
&lt;p&gt;I had a few days of preview access to this model via &lt;a href="https://aistudio.google.com/"&gt;AI Studio&lt;/a&gt;. The best way to describe it is that it's &lt;strong&gt;Gemini 2.5 upgraded to match the leading rival models&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Gemini 3 has the same underlying characteristics as Gemini 2.5. The knowledge cutoff is the same (January 2025). It accepts 1 million input tokens, can output up to 64,000 tokens, and has multimodal inputs across text, images, audio, and video.&lt;/p&gt;
&lt;h4 id="benchmarks"&gt;Benchmarks&lt;/h4&gt;
&lt;p&gt;Google's own reported numbers (in &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf"&gt;the model card&lt;/a&gt;) show it scoring slightly higher against Claude 4.5 Sonnet and GPT-5.1 against most of the standard benchmarks. As always I'm waiting for independent confirmation, but I have no reason to believe those numbers are inaccurate.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg" alt="Table of benchmark numbers, described in full below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pricing"&gt;Pricing&lt;/h4&gt;
&lt;p&gt;It terms of pricing it's a little more expensive than Gemini 2.5 but still cheaper than Claude Sonnet 4.5. Here's how it fits in with those other leading models:&lt;/p&gt;
&lt;center&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
      &lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.1&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
      &lt;td&gt;$10.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $1.25&lt;br /&gt;
        &amp;gt; 200k tokens: $2.50
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $10.00&lt;br /&gt;
        &amp;gt; 200k tokens: $15.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Gemini 3 Pro&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $2.00&lt;br /&gt;
        &amp;gt; 200k tokens: $4.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $12.00&lt;br /&gt;
        &amp;gt; 200k tokens: $18.00
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.5&lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $3.00&lt;br /&gt;
        &amp;gt; 200k tokens: $6.00
      &lt;/td&gt;
      &lt;td&gt;
        ≤ 200k tokens: $15.00&lt;br /&gt;
        &amp;gt; 200k tokens: $22.50
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.1&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
      &lt;td&gt;$75.00&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/center&gt;
&lt;h4 id="trying-it-out-against-a-complex-image"&gt;Trying it out against a complex image&lt;/h4&gt;
&lt;p&gt;That screenshot of the benchmarks from above looked like a good test for Gemini 3's multimodal support. I fed it that image URL and asked it to generate alt text for the image:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg 'Alt text for this image, include all figures and make them comprehensible to a screen reader user'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's what I got back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A comprehensive table comparing four AI models across various performance benchmarks. The columns listed are: Benchmark Name, Description, Gemini 3 Pro (highlighted in blue), Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Humanity's Last Exam (Academic reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 37.5%, Gemini 2.5 Pro 21.6%, Claude Sonnet 4.5 13.7%, GPT-5.1 26.5%.&lt;/li&gt;
&lt;li&gt;With search and code execution: Gemini 3 Pro 45.8% (others have no data).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ARC-AGI-2 (Visual reasoning puzzles; ARC Prize Verified)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 31.1%, Gemini 2.5 Pro 4.9%, Claude Sonnet 4.5 13.6%, GPT-5.1 17.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPQA Diamond (Scientific knowledge; No tools)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.9%, Gemini 2.5 Pro 86.4%, Claude Sonnet 4.5 83.4%, GPT-5.1 88.1%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;AIME 2025 (Mathematics)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No tools: Gemini 3 Pro 95.0%, Gemini 2.5 Pro 88.0%, Claude Sonnet 4.5 87.0%, GPT-5.1 94.0%.&lt;/li&gt;
&lt;li&gt;With code execution: Gemini 3 Pro 100%, Claude Sonnet 4.5 100%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MathArena Apex (Challenging Math Contest problems)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 23.4%, Gemini 2.5 Pro 0.5%, Claude Sonnet 4.5 1.6%, GPT-5.1 1.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMU-Pro (Multimodal understanding and reasoning)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.0%, Gemini 2.5 Pro 68.0%, Claude Sonnet 4.5 68.0%, GPT-5.1 76.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ScreenSpot-Pro (Screen understanding)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Sonnet 4.5 36.2%, GPT-5.1 3.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;CharXiv Reasoning (Information synthesis from complex charts)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 81.4%, Gemini 2.5 Pro 69.6%, Claude Sonnet 4.5 68.5%, GPT-5.1 69.5%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;OmniDocBench 1.5 (OCR; Overall Edit Distance, lower is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 0.115, Gemini 2.5 Pro 0.145, Claude Sonnet 4.5 0.145, GPT-5.1 0.147.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Video-MMMU (Knowledge acquisition from videos)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 87.6%, Gemini 2.5 Pro 83.6%, Claude Sonnet 4.5 77.8%, GPT-5.1 80.4%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;LiveCodeBench Pro (Competitive coding problems; Elo Rating, higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 2,439; Gemini 2.5 Pro 1,775; Claude Sonnet 4.5 1,418; GPT-5.1 2,243.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0 (Agentic terminal coding; Terminus-2 agent)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 54.2%, Gemini 2.5 Pro 32.6%, Claude Sonnet 4.5 42.8%, GPT-5.1 47.6%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SWE-Bench Verified (Agentic coding; Single attempt)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 76.2%, Gemini 2.5 Pro 59.6%, Claude Sonnet 4.5 77.2%, GPT-5.1 76.3%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;t2-bench (Agentic tool use)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 85.4%, Gemini 2.5 Pro 54.9%, Claude Sonnet 4.5 84.7%, GPT-5.1 80.2%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Vending-Bench 2 (Long-horizon agentic tasks; Net worth (mean), higher is better)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro $5,478.16; Gemini 2.5 Pro $573.64; Claude Sonnet 4.5 $3,838.74; GPT-5.1 $1,473.43.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;FACTS Benchmark Suite (Held out internal grounding, parametric, MM, and search retrieval benchmarks)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 70.5%, Gemini 2.5 Pro 63.4%, Claude Sonnet 4.5 50.4%, GPT-5.1 50.8%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;SimpleQA Verified (Parametric knowledge)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 72.1%, Gemini 2.5 Pro 54.5%, Claude Sonnet 4.5 29.3%, GPT-5.1 34.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MMMLU (Multilingual Q&amp;amp;A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 91.8%, Gemini 2.5 Pro 89.5%, Claude Sonnet 4.5 89.1%, GPT-5.1 91.0%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Global PIQA (Commonsense reasoning across 100 Languages and Cultures)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro 93.4%, Gemini 2.5 Pro 91.5%, Claude Sonnet 4.5 90.1%, GPT-5.1 90.9%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MRCR v2 (8-needle) (Long context performance)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;128k (average): Gemini 3 Pro 77.0%, Gemini 2.5 Pro 58.0%, Claude Sonnet 4.5 47.1%, GPT-5.1 61.6%.&lt;/li&gt;
&lt;li&gt;1M (pointwise): Gemini 3 Pro 26.3%, Gemini 2.5 Pro 16.4%, Claude Sonnet 4.5 (not supported), GPT-5.1 (not supported).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have not checked every line of this but a loose spot-check looks accurate to me.&lt;/p&gt;
&lt;p&gt;That prompt took 1,105 input and 3,901 output tokens, at a cost of &lt;a href="https://www.llm-prices.com/#it=1105&amp;amp;cit=3901&amp;amp;ot=3901&amp;amp;ic=2&amp;amp;oc=12&amp;amp;sel=gemini-3-pro-preview"&gt;5.6824 cents&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran this follow-up prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -c 'Convert to JSON'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can see &lt;a href="https://gist.github.com/simonw/ea7d52706557528e7eb3912cdf9250b0#response-1"&gt;the full output here&lt;/a&gt;, which starts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"columns"&lt;/span&gt;: [
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Benchmark&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Description&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 3 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Gemini 2.5 Pro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Claude Sonnet 4.5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;GPT-5.1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    ]
  },
  &lt;span class="pl-ent"&gt;"benchmarks"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Humanity's Last Exam&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Academic reasoning&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"sub_results"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;No tools&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;37.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;21.6%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;13.7%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;26.5%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        },
        {
          &lt;span class="pl-ent"&gt;"condition"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;With search and code execution&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_3_pro"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;45.8%&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gemini_2_5_pro"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"claude_sonnet_4_5"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"gpt_5_1"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;
        }
      ]
    },&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="analyzing-a-city-council-meeting"&gt;Analyzing a city council meeting&lt;/h4&gt;
&lt;p&gt;To try it out against an audio file I extracted the 3h33m of audio from the video &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0"&gt;Half Moon Bay City Council Meeting - November 4, 2025&lt;/a&gt;. I used &lt;code&gt;yt-dlp&lt;/code&gt; to get that audio:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;yt-dlp -x --audio-format m4a &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://www.youtube.com/watch?v=qgJ7x7R6gy0&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That gave me a 74M m4a file, which I ran through Gemini 3 Pro like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview -a /tmp/HMBCC\ 11⧸4⧸25\ -\ Half\ Moon\ Bay\ City\ Council\ Meeting\ -\ November\ 4,\ 2025\ \[qgJ7x7R6gy0\].m4a 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That failed with an "Internal error encountered" message, so I shrunk the file down to a more manageable 38MB using &lt;code&gt;ffmpeg&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ffmpeg -i &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -ac 1 -ar 22050 -c:a aac -b:a 24k &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/private/tmp/HMB_compressed.m4a&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then ran it again like this (for some reason I had to use &lt;code&gt;--attachment-type&lt;/code&gt; this time):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -m gemini-3-pro-preview --attachment-type /tmp/HMB_compressed.m4a 'audio/aac' 'Output a Markdown transcript of this meeting. Include speaker names and timestamps. Start with an outline of the key meeting sections, each with a title and summary and timestamp and list of participating names. Note in bold if anyone raised their voices, interrupted each other or had disagreements. Then follow with the full transcript.'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This time it worked! The &lt;a href="https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943314"&gt;full output is here&lt;/a&gt;, but it starts like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here is the transcript of the Half Moon Bay City Council meeting.&lt;/p&gt;
&lt;h4&gt;Meeting Outline&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;1. Call to Order, Updates, and Public Forum&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Mayor Brownstone calls the meeting to order. City Manager Chidester reports no reportable actions from the closed session. Announcements are made regarding food insecurity volunteers and the Diwali celebration. During the public forum, Councilmember Penrose (speaking as a citizen) warns against autocracy. Citizens speak regarding lease agreements, downtown maintenance, local music events, and homelessness outreach statistics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:00:00 - 00:13:25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Matthew Chidester, Irma Acosta, Deborah Penrose, Jennifer Moore, Sandy Vella, Joaquin Jimenez, Anita Rees.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Consent Calendar&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; The Council approves minutes from previous meetings and a resolution authorizing a licensing agreement for Seahorse Ranch. Councilmember Johnson corrects a pull request regarding abstentions on minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:13:25 - 00:15:15&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Mayor Brownstone, Councilmember Johnson, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Nagengast.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Ordinance Introduction: Commercial Vitality (Item 9A)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents a new ordinance to address neglected and empty commercial storefronts, establishing maintenance and display standards. Councilmembers discuss enforcement mechanisms, window cleanliness standards, and the need for objective guidance documents to avoid subjective enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:15:15 - 00:30:45&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Karen Decker, Councilmember Johnson, Councilmember Nagengast, Vice Mayor Ruddick, Councilmember Penrose.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Ordinance Introduction: Building Standards &amp;amp; Electrification (Item 9B)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff introduces updates to the 2025 Building Code. A major change involves repealing the city's all-electric building requirement due to the 9th Circuit Court ruling (&lt;em&gt;California Restaurant Association v. City of Berkeley&lt;/em&gt;). &lt;strong&gt;Public speaker Mike Ferreira expresses strong frustration and disagreement with "unelected state agencies" forcing the City to change its ordinances.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:30:45 - 00:45:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Ben Corrales, Keith Weiner, Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;5. Housing Element Update &amp;amp; Adoption (Item 9C)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summary:&lt;/strong&gt; Staff presents the 5th draft of the Housing Element, noting State HCD requirements to modify ADU allocations and place a measure on the ballot regarding the "Measure D" growth cap. &lt;strong&gt;There is significant disagreement from Councilmembers Ruddick and Penrose regarding the State's requirement to hold a ballot measure.&lt;/strong&gt; Public speakers debate the enforceability of Measure D. &lt;strong&gt;Mike Ferreira interrupts the vibe to voice strong distaste for HCD's interference in local law.&lt;/strong&gt; The Council votes to adopt the element but strikes the language committing to a ballot measure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp:&lt;/strong&gt; 00:45:00 - 01:05:00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Participants:&lt;/strong&gt; Leslie (Staff), Joaquin Jimenez, Jeremy Levine, Mike Ferreira, Councilmember Penrose, Vice Mayor Ruddick, Councilmember Johnson.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;h4&gt;Transcript&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [00:00:00]
Good evening everybody and welcome to the November 4th Half Moon Bay City Council meeting. As a reminder, we have Spanish interpretation services available in person and on Zoom.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Victor Hernandez (Interpreter)&lt;/strong&gt; [00:00:35]
Thank you, Mr. Mayor, City Council, all city staff, members of the public. &lt;em&gt;[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]&lt;/em&gt; Thank you very much.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Those first two lines of the transcript already illustrate something interesting here: Gemini 3 Pro chose NOT to include the exact text of the Spanish instructions, instead summarizing them as "[Spanish instructions provided regarding accessing the interpretation channel on Zoom and in the room.]".&lt;/p&gt;
&lt;p&gt;I haven't spot-checked the entire 3hr33m meeting, but I've confirmed that the timestamps do not line up. The transcript closes like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mayor Brownstone&lt;/strong&gt; [01:04:00]
Meeting adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That actually happens &lt;a href="https://www.youtube.com/watch?v=qgJ7x7R6gy0&amp;amp;t=3h31m5s"&gt;at 3h31m5s&lt;/a&gt; and the mayor says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay. Well, thanks everybody, members of the public for participating. Thank you for staff. Thank you to fellow council members. This meeting is now adjourned. Have a good evening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm disappointed about the timestamps, since mismatches there make it much harder to jump to the right point and confirm that the summarized transcript is an accurate representation of what was said.&lt;/p&gt;
&lt;p&gt;This took 320,087 input tokens and 7,870 output tokens, for a total cost of &lt;a href="https://www.llm-prices.com/#it=320087&amp;amp;ot=7870&amp;amp;ic=4&amp;amp;oc=18"&gt;$1.42&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="and-a-new-pelican-benchmark"&gt;And a new pelican benchmark&lt;/h4&gt;
&lt;p&gt;Gemini 3 Pro has a new concept of a "thinking level" which can be set to low or high (and defaults to high). I tried my classic &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;Generate an SVG of a pelican riding a bicycle&lt;/a&gt; prompt at both levels.&lt;/p&gt;
&lt;p&gt;Here's low - Gemini decided to add a jaunty little hat (with a comment &lt;a href="https://gist.github.com/simonw/70d56ba39b7cbb44985d2384004fc4a0#response"&gt;in the SVG&lt;/a&gt; that says &lt;code&gt;&amp;lt;!-- Hat (Optional Fun Detail) --&amp;gt;&lt;/code&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-low.png" alt="The pelican is wearing a blue hat. It has a good beak. The bicycle is a little bit incorrect but generally a good effort." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's high. This is genuinely an excellent pelican, and the bicycle frame is at least the correct shape:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-pelican-high.png" alt="The pelican is not wearing a hat. It has a good beak. The bicycle is accurate and well-drawn." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Honestly though, my pelican benchmark is beginning to feel a little bit too basic. I decided to upgrade it. Here's v2 of the benchmark, which I plan to use going forward:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For reference, here's a photo I took of a California brown pelican recently (sadly without a bicycle):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/breeding-plumage.jpg" alt="A glorious California brown pelican perched on a rock by the water. It has a yellow tint to its head and a red spot near its throat." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's Gemini 3 Pro's &lt;a href="https://gist.github.com/simonw/2b9930ae1ce6f3f5e9cfe3cb31ec0c0a"&gt;attempt&lt;/a&gt; at high thinking level for that new prompt:&lt;/p&gt;
&lt;p id="advanced-pelican"&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-3-breeding-pelican-high.png" alt="It's clearly a pelican. It has all of the requested features. It looks a bit abstract though." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And for good measure, here's that same prompt &lt;a href="https://gist.github.com/simonw/7a655ebe42f3d428d2ea5363dad8067c"&gt;against GPT-5.1&lt;/a&gt; - which produced this dumpy little fellow:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-1-breeding-pelican.png" alt="The pelican is very round. Its body overlaps much of the bicycle. It has a lot of dorky charisma." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And Claude Sonnet 4.5, which &lt;a href="https://gist.github.com/simonw/3296af92e4328dd4740385e6a4a2ac35"&gt;didn't do quite as well&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4-5-breeding-pelican.png" alt="Oh dear. It has all of the requested components, but the bicycle is a bit wrong and the pelican is arranged in a very awkward shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;None of the models seem to have caught on to the crucial detail that the California brown pelican is not, in fact, brown.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>MiniMax M2 &amp; Agent: Ingenious in Simplicity</title><link href="https://simonwillison.net/2025/Oct/29/minimax-m2/#atom-tag" rel="alternate"/><published>2025-10-29T22:49:47+00:00</published><updated>2025-10-29T22:49:47+00:00</updated><id>https://simonwillison.net/2025/Oct/29/minimax-m2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.minimax.io/news/minimax-m2"&gt;MiniMax M2 &amp;amp; Agent: Ingenious in Simplicity&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
MiniMax M2 was released on Monday 27th October by MiniMax, a Chinese AI lab founded in December 2021.&lt;/p&gt;
&lt;p&gt;It's a very promising model. Their self-reported benchmark scores show it as comparable to Claude Sonnet 4, and Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1982714153375854998"&gt;are ranking it&lt;/a&gt; as the best currently available open weight model according to their intelligence score:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;MiniMax’s M2 achieves a new all-time-high Intelligence Index score for an open weights model and offers impressive efficiency with only 10B active parameters (200B total). [...]&lt;/p&gt;
&lt;p&gt;The model’s strengths include tool use and instruction following (as shown by Tau2 Bench and IFBench). As such, while M2 likely excels at agentic use cases it may underperform other open weights leaders such as DeepSeek V3.2 and Qwen3 235B at some generalist tasks. This is in line with a number of recent open weights model releases from Chinese AI labs which focus on agentic capabilities, likely pointing to a heavy post-training emphasis on RL.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The size is particularly significant: the model weights are 230GB &lt;a href="https://huggingface.co/MiniMaxAI/MiniMax-M2"&gt;on Hugging Face&lt;/a&gt;, significantly smaller than other high performing open weight models. That's small enough to run on a 256GB Mac Studio, and the MLX community &lt;a href="https://huggingface.co/mlx-community/MiniMax-M2-8bit"&gt;have that working already&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;MiniMax offer their own API, and recommend using their Anthropic-compatible endpoint and the official Anthropic SDKs to access it. MiniMax Head of Engineering Skyler Miao
 &lt;a href="https://x.com/SkylerMiao7/status/1982989507252367687"&gt;provided some background on that&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;M2 is a agentic thinking model, it do interleaved thinking like sonnet 4.5, which means every response will contain its thought content.
Its very important for M2 to keep the chain of thought. So we must make sure the history thought passed back to the model.
Anthropic API support it for sure, as sonnet needs it as well. OpenAI only support it in their new Response API, no support for in ChatCompletion.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MiniMax are offering the new model via their API for free until November 7th, after which the cost will be $0.30/million input tokens and $1.20/million output tokens - similar in price to Gemini 2.5 Flash and GPT-5 Mini, see &lt;a href="https://www.llm-prices.com/#it=51&amp;amp;ot=4017&amp;amp;sel=minimax-m2%2Cgpt-5-mini%2Cclaude-3-haiku%2Cgemini-2.5-flash-lite%2Cgemini-2.5-flash"&gt;price comparison here&lt;/a&gt; on my &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;I released a new plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; called &lt;a href="https://github.com/simonw/llm-minimax"&gt;llm-minimax&lt;/a&gt; providing support for M2 via the MiniMax API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-minimax
llm keys set minimax
# Paste key here
llm -m m2 -o max_tokens 10000 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/da79447830dc431c067a93648b338be6"&gt;the result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Biycle is good though obscured by the pelican. Pelican has an impressive triple beak and is stretched along the bicycle frame. Not clear if it can pedal or what it is sitting on." src="https://static.simonwillison.net/static/2025/m2-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;51 input, 4,017 output. At $0.30/m input and $1.20/m output that pelican would cost 0.4836 cents - less than half a cent.&lt;/p&gt;
&lt;p&gt;This is the first plugin I've written for an Anthropic-API-compatible model. I released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.21"&gt;llm-anthropic 0.21&lt;/a&gt; first adding the ability to customize the &lt;code&gt;base_url&lt;/code&gt; parameter when using that model class. This meant the new plugin was less than &lt;a href="https://github.com/simonw/llm-minimax/blob/0.1/llm_minimax.py"&gt;30 lines of Python&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/minimax"&gt;minimax&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="ai-in-china"/><category term="minimax"/></entry><entry><title>Introducing Claude Haiku 4.5</title><link href="https://simonwillison.net/2025/Oct/15/claude-haiku-45/#atom-tag" rel="alternate"/><published>2025-10-15T19:36:34+00:00</published><updated>2025-10-15T19:36:34+00:00</updated><id>https://simonwillison.net/2025/Oct/15/claude-haiku-45/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-haiku-4-5"&gt;Introducing Claude Haiku 4.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Anthropic released Claude Haiku 4.5 today, the cheapest member of the Claude 4.5 family that started with Sonnet 4.5 &lt;a href="https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/"&gt;a couple of weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's priced at $1/million input tokens and $5/million output tokens, slightly more expensive than Haiku 3.5 ($0.80/$4) and a &lt;em&gt;lot&lt;/em&gt; more expensive than the original Claude 3 Haiku ($0.25/$1.25), both of which remain available at those prices.&lt;/p&gt;
&lt;p&gt;It's a third of the price of Sonnet 4 and Sonnet 4.5 (both $3/$15) which is notable because Anthropic's benchmarks put it in a similar space to that older Sonnet 4 model. As they put it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What was recently at the frontier is now cheaper and faster. Five months ago, Claude Sonnet 4 was a state-of-the-art model. Today, Claude Haiku 4.5 gives you similar levels of coding performance but at one-third the cost and more than twice the speed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've been hoping to see Anthropic release a fast, inexpensive model that's price competitive with the cheapest models from OpenAI and Gemini, currently $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite). Haiku 4.5 certainly isn't that, it looks like they're continuing to focus squarely on the "great at code" part of the market.&lt;/p&gt;
&lt;p&gt;The new Haiku is the first Haiku model to support reasoning. It sports a 200,000 token context window, 64,000 maximum output (up from just 8,192 for Haiku 3.5) and a "reliable knowledge cutoff" of February 2025, one month later than the January 2025 date for Sonnet 4 and 4.5 and Opus 4 and 4.1.&lt;/p&gt;
&lt;p&gt;Something that caught my eye in the accompanying &lt;a href="https://assets.anthropic.com/m/99128ddd009bdcb/original/Claude-Haiku-4-5-System-Card.pdf"&gt;system card&lt;/a&gt; was this note about context length:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For Claude Haiku 4.5, we trained the model to be explicitly context-aware, with precise information about how much context-window has been used. This has two effects: the model learns when and how to wrap up its answer when the limit is approaching, and the model learns to continue reasoning more persistently when the limit is further away. We found this intervention—along with others—to be effective at limiting agentic “laziness” (the phenomenon where models stop working on a problem prematurely, give incomplete answers, or cut corners on tasks).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've added the new price to &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;, released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.20"&gt;llm-anthropic 0.20&lt;/a&gt; with the new model and updated my &lt;a href="https://tools.simonwillison.net/haiku"&gt;Haiku-from-your-webcam&lt;/a&gt; demo (&lt;a href="https://github.com/simonw/tools/blob/main/haiku.html"&gt;source&lt;/a&gt;) to use Haiku 4.5 as well.&lt;/p&gt;
&lt;p&gt;Here's &lt;code&gt;llm -m claude-haiku-4.5 'Generate an SVG of a pelican riding a bicycle'&lt;/code&gt; (&lt;a href="https://gist.github.com/simonw/31256c523fa502eeb303b8e0bbe30eee"&gt;transcript&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Described by Haiku 4.5: A whimsical illustration of a bird with a round tan body, pink beak, and orange legs riding a bicycle against a blue sky and green grass background." src="https://static.simonwillison.net/static/2025/claude-haiku-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;18 input tokens and 1513 output tokens = &lt;a href="https://www.llm-prices.com/#it=18&amp;amp;ot=1513&amp;amp;ic=1&amp;amp;oc=5"&gt;0.7583 cents&lt;/a&gt;.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45595403"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>GPT-5 pro</title><link href="https://simonwillison.net/2025/Oct/6/gpt-5-pro/#atom-tag" rel="alternate"/><published>2025-10-06T19:48:45+00:00</published><updated>2025-10-06T19:48:45+00:00</updated><id>https://simonwillison.net/2025/Oct/6/gpt-5-pro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5-pro"&gt;GPT-5 pro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Here's OpenAI's model documentation for their GPT-5 pro model, released to their API today at their DevDay event.&lt;/p&gt;
&lt;p&gt;It has similar base characteristics to &lt;a href="https://platform.openai.com/docs/models/gpt-5"&gt;GPT-5&lt;/a&gt;: both share a September 30, 2024 knowledge cutoff and 400,000 context limit.&lt;/p&gt;
&lt;p&gt;GPT-5 pro has maximum output tokens 272,000 max, an increase from 128,000 for GPT-5.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As our most advanced reasoning model, GPT-5 pro defaults to (and only supports) &lt;code&gt;reasoning.effort: high&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's only available via OpenAI's Responses API. My &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool doesn't support that in core yet, but the &lt;a href="https://github.com/simonw/llm-openai-plugin"&gt;llm-openai-plugin&lt;/a&gt; plugin does. I released &lt;a href="https://github.com/simonw/llm-openai-plugin/releases/tag/0.7"&gt;llm-openai-plugin 0.7&lt;/a&gt; adding support for the new model, then ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-openai-plugin
llm -m openai/gpt-5-pro "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's very, very slow. The model took 6 minutes 8 seconds to respond and charged me for 16 input and 9,205 output tokens. At $15/million input and $120/million output this pelican &lt;a href="https://www.llm-prices.com/#it=16&amp;amp;ot=9205&amp;amp;ic=15&amp;amp;oc=120&amp;amp;sb=output&amp;amp;sd=descending"&gt;cost me $1.10&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img alt="It's obviously a pelican riding a bicycle. Half the spokes are missing on each wheel and the pelican is a bit squat looking." src="https://static.simonwillison.net/static/2025/gpt-5-pro.png" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9a06ab36f486f31401fec1fc104a8ce5"&gt;the full transcript&lt;/a&gt;. It looks visually pretty simpler to the much, much cheaper result I &lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;got from GPT-5&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="gpt-5"/><category term="gpt"/></entry><entry><title>Claude Sonnet 4.5 is probably the "best coding model in the world" (at least for now)</title><link href="https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/#atom-tag" rel="alternate"/><published>2025-09-29T18:11:39+00:00</published><updated>2025-09-29T18:11:39+00:00</updated><id>https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-sonnet-4-5"&gt;released Claude Sonnet 4.5 today&lt;/a&gt;, with a &lt;em&gt;very&lt;/em&gt; bold set of claims:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Claude Sonnet 4.5 is the best coding model in the world. It's the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Anthropic gave me access to a preview version of a "new model" over the weekend which turned out to be Sonnet 4.5. My initial impressions were that it felt like a better model for code than GPT-5-Codex, which has been my preferred coding model since &lt;a href="https://simonwillison.net/2025/Sep/23/gpt-5-codex/"&gt;it launched a few weeks ago&lt;/a&gt;. This space moves &lt;em&gt;so fast&lt;/em&gt; - Gemini 3 is rumored to land soon so who knows how long Sonnet 4.5 will continue to hold the "best coding model" crown.&lt;/p&gt;
&lt;p&gt;The pricing is the same as the previous Sonnet: $3/million input tokens and $15/million output tokens. This remains significantly cheaper than Claude Opus - $15/$75 - but still quite a bit more than GPT-5 and GPT-5-Codex, both at $1.25/$10.&lt;/p&gt;
&lt;h4 id="it-really-shines-with-claude-ai-code-interpreter"&gt;It really shines with Claude.ai Code Interpreter&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://claude.ai/"&gt;claude.ai&lt;/a&gt; web interface (not yet the Claude iPhone native app) recently added the ability for Claude to write and then directly execute code in a sandboxed server environment, using Python and Node.js. I &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/"&gt;wrote about that in detail&lt;/a&gt; three weeks ago.&lt;/p&gt;
&lt;p&gt;Anthropic's implementation of this code interpreter pattern is more powerful than ChatGPT's equivalent because it can directly clone code from GitHub and install software packages from NPM and PyPI.&lt;/p&gt;
&lt;p&gt;Sonnet 4.5 &lt;em&gt;absolutely shines&lt;/em&gt; at using this tool!&lt;/p&gt;
&lt;p&gt;Here's my most ambitious experiment with it. I started like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Checkout https://github.com/simonw/llm and run the tests with&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pip install -e '.[test]'&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;pytest&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It did exactly as I asked, checking out my &lt;a href="https://github.com/simonw/llm"&gt;LLM repo&lt;/a&gt;, installing it along with all of its dependencies, and running the tests.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;✅ All 466 tests passed in 167.69 seconds (about 2 minutes 47 seconds)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I posed it a &lt;em&gt;wildly&lt;/em&gt; ambitious challenge:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;The tool stores prompts and responses in a SQLite database. It currently uses a single conversation row and a linear collection of responses within that conversation. Experiment with the idea of adding a parent_response_id column to that responses table and using it to model the responses in a conversation as a tree instead. Use new pytest tests which you execute individually (to save time on the whole test suite) to exercise your design. Write a tree_notes.md file which you first populate with your plan and then use for notes as you go along.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As you can see in &lt;a href="https://claude.ai/share/9d5b0729-b58d-4b15-9e45-ab1e7152b89e"&gt;the resulting conversation transcript&lt;/a&gt;, it did a really impressive job of this! It churned away across &lt;em&gt;dozens&lt;/em&gt; of different tool calls, eventually outputting the following summary of its activity:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I've successfully experimented with adding tree-structured conversations to the LLM CLI tool. Here's what was accomplished:&lt;/p&gt;
&lt;h3 id="-implementation-complete"&gt;✅ &lt;strong&gt;Implementation Complete&lt;/strong&gt;
&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. Database Schema Enhancement&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Added &lt;code&gt;parent_response_id&lt;/code&gt; column to the &lt;code&gt;responses&lt;/code&gt; table via migration &lt;code&gt;m022_parent_response_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Enables tree-structured (branching) conversations instead of just linear sequences&lt;/li&gt;
&lt;li&gt;Fully backward compatible - existing data unaffected&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Comprehensive Utility Module&lt;/strong&gt; (&lt;code&gt;tree_utils.py&lt;/code&gt;)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;12 helper functions for tree operations:
&lt;ul&gt;
&lt;li&gt;Navigation (children, parent, siblings, paths)&lt;/li&gt;
&lt;li&gt;Analysis (depth, descendants, size)&lt;/li&gt;
&lt;li&gt;Queries (roots, leaves, statistics)&lt;/li&gt;
&lt;li&gt;Visualization (ASCII tree display)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Complete Test Suite&lt;/strong&gt; (&lt;code&gt;test_tree_conversations.py&lt;/code&gt;)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;16 tests covering all tree operations&lt;/li&gt;
&lt;li&gt;Tests linear chains, branching, multiple roots, forests&lt;/li&gt;
&lt;li&gt;Includes realistic debugging scenario&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;All tests pass ✅&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Integration Verification&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Updated existing migration tests to expect new column&lt;/li&gt;
&lt;li&gt;All 6 migration tests pass&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: 22/22 tests passing ✅&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="-deliverables-7-files-in-mnt-user-data-outputs-"&gt;📦 &lt;strong&gt;Deliverables&lt;/strong&gt; (7 files in &lt;code&gt;/mnt/user-data/outputs/&lt;/code&gt;)&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;README.md&lt;/strong&gt; - Quick start guide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMPLEMENTATION_SUMMARY.md&lt;/strong&gt; - Complete technical overview&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIGRATION_CODE.md&lt;/strong&gt; - Database migration details&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TEST_RESULTS.md&lt;/strong&gt; - Full test coverage report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tree_notes.md&lt;/strong&gt; - Design decisions and development notes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tree_utils.py&lt;/strong&gt; - Utility functions module&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;test_tree_conversations.py&lt;/strong&gt; - Test suite&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="-key-features"&gt;🎯 &lt;strong&gt;Key Features&lt;/strong&gt;
&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Multiple branches from any conversation point&lt;/li&gt;
&lt;li&gt;Multiple roots per conversation (forest structure)&lt;/li&gt;
&lt;li&gt;Rich analytics (depth, branching factor, tree size)&lt;/li&gt;
&lt;li&gt;ASCII tree visualization&lt;/li&gt;
&lt;li&gt;Cycle detection for safety&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="-ready-for-integration"&gt;🚀 &lt;strong&gt;Ready for Integration&lt;/strong&gt;
&lt;/h3&gt;
&lt;p&gt;The implementation is production-ready with comprehensive testing. Next steps would be:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Integrate &lt;code&gt;tree_utils.py&lt;/code&gt; into LLM package&lt;/li&gt;
&lt;li&gt;Add CLI commands (&lt;code&gt;llm branch&lt;/code&gt;, &lt;code&gt;llm tree&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Update &lt;code&gt;Response.log()&lt;/code&gt; to accept &lt;code&gt;parent_response_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All code is tested, documented, and ready to use!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Did I mention I kicked off this experiment by entering prompts on my phone?&lt;/p&gt;
&lt;p&gt;I told it to &lt;code&gt;Give me a zip file of everything you have done so far&lt;/code&gt; - you can explore the contents of the file it made me &lt;a href="https://gist.github.com/simonw/f9d0f870e8d1af399a7f366a7f87b8b4"&gt;in this Gist&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I shared a &lt;a href="https://claude.ai/share/9d5b0729-b58d-4b15-9e45-ab1e7152b89e"&gt;longer version&lt;/a&gt; of this experiment where I had the model experiment with a complex message hashing mechanism as well. I've &lt;a href="https://claude.ai/share/9d5b0729-b58d-4b15-9e45-ab1e7152b89e"&gt;pushed the code to a branch&lt;/a&gt;, which is best explained by the &lt;a href="https://github.com/simonw/llm/blob/697636222a277f892dfffef86b2471cc89c95576/message_matching_notes.md"&gt;message_matching_notes.md&lt;/a&gt; document written by Claude.&lt;/p&gt;
&lt;h4 id="pelicans"&gt;Pelicans&lt;/h4&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.19"&gt;llm-anthropic 0.19&lt;/a&gt; adding support for the new model. Then I ran my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;classic pelican benchmark&lt;/a&gt;, first &lt;a href="https://gist.github.com/simonw/52a5bb478fa743daede12d97173cb817"&gt;with thinking turned on&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.5 -o thinking 1 -o thinking_budget 4000 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4.5-thinking.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And then &lt;a href="https://gist.github.com/simonw/449619ea8f0c112e3faa72e99fcb16ee"&gt;without&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.5 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-sonnet-4.5.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These are pretty good - they are recognizably pelicans! - though not quite as good &lt;a href="https://simonwillison.net/2025/Sep/23/gpt-5-codex/"&gt;as GPT-5-Codex&lt;/a&gt; which is better at drawing bicycles.&lt;/p&gt;
&lt;p&gt;I also tried having it describe this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pelicans.jpg" alt="A whole mess of pelicans, described below." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.5 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;describe this image&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -a https://static.simonwillison.net/static/2024/pelicans.jpg&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;This image shows a large gathering of pelicans crowded together along what appears to be a waterfront or shoreline. The birds are densely packed, creating a chaotic scene with dozens of brown pelicans visible. Their distinctive long bills and pouches are clearly visible among the mass of feathered bodies. In the background, there's calm water and a dark tree line or forested area. The lighting suggests this was taken during early morning or late afternoon, giving the scene a golden, warm tone. The pelicans appear to be resting or congregating together, which is typical behavior for these colonial seabirds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="anthropic-are-rolling-this-out-everywhere"&gt;Anthropic are rolling this out everywhere&lt;/h4&gt;
&lt;p&gt;The release of this model has been &lt;em&gt;very&lt;/em&gt; well coordinated. My embargo on talking about it was due to lift at 10am Pacific today, and I got an email from them linking to their blog post at 10am on the dot. It's also already &lt;a href="https://openrouter.ai/anthropic/claude-sonnet-4.5"&gt;live on OpenRouter&lt;/a&gt; and &lt;a href="https://x.com/cursor_ai/status/1972713190074261949"&gt;in Cursor&lt;/a&gt; and &lt;a href="https://github.blog/changelog/2025-09-29-anthropic-claude-sonnet-4-5-is-in-public-preview-for-github-copilot/"&gt;GitHub Copilot&lt;/a&gt; and no doubt a whole bunch of other places as well.&lt;/p&gt;
&lt;p&gt;Anthropic also shipped a &lt;a href="https://marketplace.visualstudio.com/items?itemName=anthropic.claude-code"&gt;new Claude Code VS Code extension&lt;/a&gt; today, plus a big upgrade to the Claude Code terminal app. Plus they rebranded their confusingly named Claude Code SDK to the &lt;a href="https://docs.claude.com/en/api/agent-sdk/overview"&gt;Claude Agent SDK&lt;/a&gt; instead, emphasizing that it's a tool for building agents beyond just customizing the existing Claude Code product. That's available for both &lt;a href="https://docs.claude.com/en/api/agent-sdk/typescript"&gt;TypeScript&lt;/a&gt; and &lt;a href="https://docs.claude.com/en/api/agent-sdk/python"&gt;Python&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="code-interpreter"/><category term="llm-tool-use"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Grok 4 Fast</title><link href="https://simonwillison.net/2025/Sep/20/grok-4-fast/#atom-tag" rel="alternate"/><published>2025-09-20T23:59:33+00:00</published><updated>2025-09-20T23:59:33+00:00</updated><id>https://simonwillison.net/2025/Sep/20/grok-4-fast/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://x.ai/news/grok-4-fast"&gt;Grok 4 Fast&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New hosted vision-enabled reasoning model from xAI that's designed to be fast and extremely competitive on price. It has a 2 million token context window and "was trained end-to-end with tool-use reinforcement learning".&lt;/p&gt;
&lt;p&gt;It's priced at $0.20/million input tokens and $0.50/million output tokens - 15x less than Grok 4 (which is $3/million input and $15/million output). That puts it cheaper than GPT-5 mini and Gemini 2.5 Flash on &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The same model weights handle reasoning and non-reasoning based on a parameter passed to the model.&lt;/p&gt;
&lt;p&gt;I've been trying it out via my updated &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, since Grok 4 Fast is available &lt;a href="https://openrouter.ai/x-ai/grok-4-fast"&gt;for free on OpenRouter&lt;/a&gt; for a limited period.&lt;/p&gt;
&lt;p&gt;Here's output from the &lt;a href="https://gist.github.com/simonw/7f9a5e5c780b1d5bfe98b4f4ad540551"&gt;non-reasoning model&lt;/a&gt;. This actually output an invalid SVG - I had to make &lt;a href="https://gist.github.com/simonw/7f9a5e5c780b1d5bfe98b4f4ad540551?permalink_comment_id=5768049#gistcomment-5768049"&gt;a tiny manual tweak&lt;/a&gt; to the XML to get it to render.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4-fast:free "Generate an SVG of a pelican riding a bicycle" -o reasoning_enabled false
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Described by Grok 4 Fast: Simple line drawing of a white bird with a long yellow beak riding a bicycle, pedaling with its orange legs." src="https://static.simonwillison.net/static/2025/grok-4-no-reasoning.png" /&gt;&lt;/p&gt;
&lt;p&gt;(I initially ran this without that &lt;code&gt;-o reasoning_enabled false&lt;/code&gt; flag, but then I saw that &lt;a href="https://x.com/OpenRouterAI/status/1969427723098435738"&gt;OpenRouter enable reasoning by default&lt;/a&gt; for that model. Here's my &lt;a href="https://gist.github.com/simonw/6a52e6585cb3c45e64ae23b9c5ebafe9"&gt;previous invalid result&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/539719a1495253bbd27f3107931e6dd3"&gt;the reasoning model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4-fast:free "Generate an SVG of a pelican riding a bicycle" -o reasoning_enabled true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Described by Grok 4 Fast: A simple line drawing of a white pelican with a yellow beak holding a yellow object, riding a black bicycle on green grass under a blue sky with white clouds." src="https://static.simonwillison.net/static/2025/grok-4-fast-reasoning.png" /&gt;&lt;/p&gt;
&lt;p&gt;In related news, the New York Times had a story a couple of days ago about Elon's recent focus on xAI: &lt;a href="https://www.nytimes.com/2025/09/18/technology/elon-musk-artificial-intelligence-xai.html"&gt;Since Leaving Washington, Elon Musk Has Been All In on His A.I. Company&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/grok"&gt;grok&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xai"&gt;xai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="grok"/><category term="llm-release"/><category term="openrouter"/><category term="xai"/></entry><entry><title>gpt-5 and gpt-5-mini rate limit updates</title><link href="https://simonwillison.net/2025/Sep/12/gpt-5-rate-limits/#atom-tag" rel="alternate"/><published>2025-09-12T23:14:46+00:00</published><updated>2025-09-12T23:14:46+00:00</updated><id>https://simonwillison.net/2025/Sep/12/gpt-5-rate-limits/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/openaidevs/status/1966610846559134140"&gt;gpt-5 and gpt-5-mini rate limit updates&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
OpenAI have increased the rate limits for their two main GPT-5  models. These look significant:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;gpt-5&lt;br&gt;
Tier 1: 30K → 500K TPM (1.5M batch)&lt;br&gt;
Tier 2: 450K → 1M (3M batch)&lt;br&gt;
Tier 3: 800K → 2M&lt;br&gt;
Tier 4: 2M → 4M&lt;/p&gt;
&lt;p&gt;gpt-5-mini&lt;br&gt;
Tier 1: 200K → 500K (5M batch)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5"&gt;GPT-5 rate limits here&lt;/a&gt; show tier 5 stays at 40M tokens per minute. The &lt;a href="https://platform.openai.com/docs/models/gpt-5-mini"&gt;GPT-5 mini rate limits&lt;/a&gt; for tiers 2 through 5 are 2M, 4M, 10M and 180M TPM respectively.&lt;/p&gt;
&lt;p&gt;As a reminder, &lt;a href="https://platform.openai.com/docs/guides/rate-limits#usage-tiers"&gt;those tiers&lt;/a&gt; are assigned based on how much money you have spent on the OpenAI API - from $5 for tier 1 up through $50, $100, $250 and then $1,000 for tier &lt;/p&gt;
&lt;p&gt;For comparison, Anthropic's current top tier is Tier 4 ($400 spent) which provides 2M maximum input tokens per minute and 400,000 maximum output tokens, though you can contact their sales team for higher limits than that.&lt;/p&gt;
&lt;p&gt;Gemini's top tier is Tier 3 for $1,000 spent and &lt;a href="https://ai.google.dev/gemini-api/docs/rate-limits#tier-3"&gt;currently gives you&lt;/a&gt; 8M TPM for Gemini 2.5 Pro and Flash and 30M TPM for the Flash-Lite and 2.0 Flash models.&lt;/p&gt;
&lt;p&gt;So OpenAI's new rate limit increases for their top performing model pulls them ahead of Anthropic but still leaves them significantly behind Gemini.&lt;/p&gt;
&lt;p&gt;GPT-5 mini remains the champion for smaller models with that enormous 180M TPS limit for its top tier.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="gemini"/><category term="llm-pricing"/><category term="gpt-5"/><category term="gpt"/></entry><entry><title>Load Llama-3.2 WebGPU in your browser from a local folder</title><link href="https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-tag" rel="alternate"/><published>2025-09-08T20:53:52+00:00</published><updated>2025-09-08T20:53:52+00:00</updated><id>https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://static.simonwillison.net/static/2025/llama-3.2-webgpu/"&gt;Load Llama-3.2 WebGPU in your browser from a local folder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Inspired by &lt;a href="https://news.ycombinator.com/item?id=45168953#45169054"&gt;a comment&lt;/a&gt; on Hacker News I decided to see if it was possible to modify the &lt;a href="https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu"&gt;transformers.js-examples/tree/main/llama-3.2-webgpu&lt;/a&gt; Llama 3.2 chat demo (&lt;a href="https://huggingface.co/spaces/webml-community/llama-3.2-webgpu"&gt;online here&lt;/a&gt;, I &lt;a href="https://simonwillison.net/2024/Sep/30/llama-32-webgpu/"&gt;wrote about it last November&lt;/a&gt;) to add an option to open a local model file directly from a folder on disk, rather than waiting for it to download over the network.&lt;/p&gt;
&lt;p&gt;I posed the problem to OpenAI's GPT-5-enabled Codex CLI like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/huggingface/transformers.js-examples
cd transformers.js-examples/llama-3.2-webgpu
codex
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Modify this application such that it offers the user a file browse button for selecting their own local copy of the model file instead of loading it over the network. Provide a "download model" option too.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex churned away for several minutes, even running commands like &lt;code&gt;curl -sL https://raw.githubusercontent.com/huggingface/transformers.js/main/src/models.js | sed -n '1,200p'&lt;/code&gt; to inspect the source code of the underlying Transformers.js library.&lt;/p&gt;
&lt;p&gt;After four prompts total (&lt;a href="https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01bd2?permalink_comment_id=5751814#gistcomment-5751814"&gt;shown here&lt;/a&gt;) it built something which worked!&lt;/p&gt;
&lt;p&gt;To try it out you'll need your own local copy of the Llama 3.2 ONNX model. You can get that (a ~1.2GB) download) like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git lfs install
git clone https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then visit my &lt;a href="https://static.simonwillison.net/static/2025/llama-3.2-webgpu/"&gt;llama-3.2-webgpu&lt;/a&gt; page in Chrome or Firefox Nightly (since WebGPU is required), click "Browse folder", select that folder you just cloned, agree to the "Upload" confirmation (confusing since nothing is uploaded from your browser, the model file is opened locally on your machine) and click "Load local model".&lt;/p&gt;
&lt;p&gt;Here's an animated demo (recorded in real-time, I didn't speed this up):&lt;/p&gt;
&lt;p&gt;&lt;img alt="GIF. I follow the setup instructions, clicking to load a local model and browsing to the correct folder. Once loaded the model shows a chat interface, I run the example about time management which returns tokens at about 10/second." src="https://static.simonwillison.net/static/2025/webgpu-llama-demo-small.gif" /&gt;&lt;/p&gt;
&lt;p&gt;I pushed &lt;a href="https://github.com/simonw/transformers.js-examples/commit/cdebf4128c6e30414d437affd4b13b6c9c79421d"&gt;a branch with those changes here&lt;/a&gt;. The next step would be to modify this to support other models in addition to the Llama 3.2 demo, but I'm pleased to have got to this proof of concept with so little work beyond throwing some prompts at Codex to see if it could figure it out.&lt;/p&gt;
&lt;p&gt;According to the Codex &lt;code&gt;/status&lt;/code&gt; command &lt;a href="https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01bd2?permalink_comment_id=5751807#gistcomment-5751807"&gt;this used&lt;/a&gt; 169,818 input tokens, 17,112 output tokens and 1,176,320 cached input tokens. At current GPT-5 token pricing ($1.25/million input, $0.125/million cached input, $10/million output) that would cost 53.942 cents, but Codex CLI hooks into my existing $20/month ChatGPT Plus plan so this was bundled into that.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45168953#45173297"&gt;My Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex"&gt;codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="transformers-js"/><category term="webgpu"/><category term="llm-pricing"/><category term="vibe-coding"/><category term="gpt-5"/><category term="codex"/><category term="gpt"/></entry><entry><title>Introducing gpt-realtime</title><link href="https://simonwillison.net/2025/Sep/1/introducing-gpt-realtime/#atom-tag" rel="alternate"/><published>2025-09-01T17:34:55+00:00</published><updated>2025-09-01T17:34:55+00:00</updated><id>https://simonwillison.net/2025/Sep/1/introducing-gpt-realtime/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-realtime/"&gt;Introducing gpt-realtime&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Released a few days ago (August 28th), &lt;code&gt;gpt-realtime&lt;/code&gt; is OpenAI's new "most advanced speech-to-speech model". It looks like this is a replacement for the older &lt;code&gt;gpt-4o-realtime-preview&lt;/code&gt; model that was released &lt;a href="https://openai.com/index/introducing-the-realtime-api/"&gt;last October&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is a slightly confusing release. The previous realtime model was clearly described as a variant of GPT-4o, sharing the same October 2023 training cut-off date as that model.&lt;/p&gt;
&lt;p&gt;I had expected that &lt;code&gt;gpt-realtime&lt;/code&gt; might be a GPT-5 relative, but its training date is still October 2023 whereas GPT-5 is September 2024.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;gpt-realtime&lt;/code&gt; also shares the relatively low 32,000 context token and 4,096 maximum output token limits of &lt;code&gt;gpt-4o-realtime-preview&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The only reference I found to GPT-5 in the documentation for the new model was a note saying "Ambiguity and conflicting instructions degrade performance, similar to GPT-5."&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://platform.openai.com/docs/guides/realtime-models-prompting#general-usage-tips"&gt;usage tips&lt;/a&gt; for &lt;code&gt;gpt-realtime&lt;/code&gt; have a few surprises:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Iterate relentlessly&lt;/strong&gt;. Small wording changes can make or break behavior.&lt;/p&gt;
&lt;p&gt;Example: Swapping “inaudible” → “unintelligible” improved noisy input handling. [...]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Convert non-text rules to text&lt;/strong&gt;: The model responds better to clearly written text.&lt;/p&gt;
&lt;p&gt;Example: Instead of writing, "IF x &amp;gt; 3 THEN ESCALATE", write, "IF MORE THAN THREE FAILURES THEN ESCALATE."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There are a whole lot more prompting tips in the new &lt;a href="https://cookbook.openai.com/examples/realtime_prompting_guide"&gt;Realtime Prompting Guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OpenAI list several key improvements to &lt;code&gt;gpt-realtime&lt;/code&gt; including the ability to configure it with a list of MCP servers, "better instruction following" and the ability to send it images.&lt;/p&gt;
&lt;p&gt;My biggest confusion came from &lt;a href="https://openai.com/api/pricing/"&gt;the pricing page&lt;/a&gt;, which lists separate pricing for using the Realtime API with &lt;code&gt;gpt-realtime&lt;/code&gt; and GPT-4o mini. This suggests to me that the old &lt;a href="https://platform.openai.com/docs/models/gpt-4o-mini-realtime-preview"&gt;gpt-4o-mini-realtime-preview&lt;/a&gt; model is still available, despite it no longer being listed on the &lt;a href="https://platform.openai.com/docs/models"&gt;OpenAI models page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;gpt-4o-mini-realtime-preview&lt;/code&gt; is a &lt;strong&gt;lot&lt;/strong&gt; cheaper:&lt;/p&gt;
&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Model&lt;/th&gt;
            &lt;th&gt;Token Type&lt;/th&gt;
            &lt;th&gt;Input&lt;/th&gt;
            &lt;th&gt;Cached Input&lt;/th&gt;
            &lt;th&gt;Output&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;td rowspan="3"&gt;gpt-realtime&lt;/td&gt;
            &lt;td&gt;Text&lt;/td&gt;
            &lt;td&gt;$4.00&lt;/td&gt;
            &lt;td&gt;$0.40&lt;/td&gt;
            &lt;td&gt;$16.00&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Audio&lt;/td&gt;
            &lt;td&gt;$32.00&lt;/td&gt;
            &lt;td&gt;$0.40&lt;/td&gt;
            &lt;td&gt;$64.00&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Image&lt;/td&gt;
            &lt;td&gt;$5.00&lt;/td&gt;
            &lt;td&gt;$0.50&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td rowspan="2"&gt;gpt-4o-mini-realtime-preview&lt;/td&gt;
            &lt;td&gt;Text&lt;/td&gt;
            &lt;td&gt;$0.60&lt;/td&gt;
            &lt;td&gt;$0.30&lt;/td&gt;
            &lt;td&gt;$2.40&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Audio&lt;/td&gt;
            &lt;td&gt;$10.00&lt;/td&gt;
            &lt;td&gt;$0.30&lt;/td&gt;
            &lt;td&gt;$20.00&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The mini model also has a much longer 128,000 token context window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Turns out that was &lt;a href="https://twitter.com/_agamble/status/1962839472837361807"&gt;a mistake in the documentation&lt;/a&gt;, that mini model has a 16,000 token context size.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: OpenAI's &lt;a href="https://twitter.com/pbbakkum/status/1962901822135525695"&gt;Peter Bakkum clarifies&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are different voice models in API and ChatGPT, but they share some recent improvements. The voices are also different.&lt;/p&gt;
&lt;p&gt;gpt-realtime has a mix of data specific enough to itself that its not really 4o or 5&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/audio"&gt;audio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/realtime"&gt;realtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/multi-modal-output"&gt;multi-modal-output&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="audio"/><category term="realtime"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm-pricing"/><category term="multi-modal-output"/><category term="llm-release"/></entry><entry><title>Claude Sonnet 4 now supports 1M tokens of context</title><link href="https://simonwillison.net/2025/Aug/12/claude-sonnet-4-1m/#atom-tag" rel="alternate"/><published>2025-08-12T18:14:30+00:00</published><updated>2025-08-12T18:14:30+00:00</updated><id>https://simonwillison.net/2025/Aug/12/claude-sonnet-4-1m/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/1m-context"&gt;Claude Sonnet 4 now supports 1M tokens of context&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Gemini and OpenAI both have million token models, so it's good to see Anthropic catching up. This is 5x the previous 200,000 context length limit of the various Claude Sonnet models.&lt;/p&gt;
&lt;p&gt;Anthropic have previously made 1 million tokens available to select customers. From &lt;a href="https://www.anthropic.com/news/claude-3-family"&gt;the Claude 3 announcement&lt;/a&gt; in March 2024:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Claude 3 family of models will initially offer a 200K context window upon launch. However, all three models are capable of accepting inputs exceeding 1 million tokens and we may make this available to select customers who need enhanced processing power.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is also the first time I've seen Anthropic use prices that vary depending on context length:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prompts ≤ 200K: $3/million input, $15/million output&lt;/li&gt;
&lt;li&gt;Prompts &amp;gt; 200K: $6/million input, $22.50/million output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Gemini have been doing this for a while: Gemini 2.5 Pro is $1.25/$10 below 200,000 tokens and $2.50/$15 above 200,000.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/context-windows#1m-token-context-window"&gt;Anthropic's full documentation on the 1m token context window&lt;/a&gt;. You need to send a &lt;code&gt;context-1m-2025-08-07&lt;/code&gt; beta header in your request to enable it.&lt;/p&gt;
&lt;p&gt;Note that this is currently restricted to "tier 4" users who have purchased at least $400 in API credits:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Long context support for Sonnet 4 is now in public beta on the Anthropic API for customers with Tier 4 and custom rate limits, with broader availability rolling out over the coming weeks.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://x.com/claudeai/status/1955299573620261343"&gt;@claudeai&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="long-context"/></entry><entry><title>Quoting Nick Turley</title><link href="https://simonwillison.net/2025/Aug/12/nick-turley/#atom-tag" rel="alternate"/><published>2025-08-12T03:32:04+00:00</published><updated>2025-08-12T03:32:04+00:00</updated><id>https://simonwillison.net/2025/Aug/12/nick-turley/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.youtube.com/watch?v=ixY2PvQJ0To&amp;amp;t=2322s"&gt;&lt;p&gt;I think there's been a lot of decisions over time that proved pretty consequential, but we made them very quickly as we have to. [...]&lt;/p&gt;
&lt;p&gt;[On pricing] I had this kind of panic attack because we really needed to launch subscriptions because at the time we were taking the product down all the time. [...]&lt;/p&gt;
&lt;p&gt;So what I did do is ship a Google Form to Discord with &lt;a href="https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter"&gt;the four questions you're supposed to ask&lt;/a&gt; on how to price something.&lt;/p&gt;
&lt;p&gt;But we got with the $20. We were debating something slightly higher at the time. I often wonder what would have happened because so many other companies ended up copying the $20 price point, so did we erase a bunch of market cap by pricing it this way?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.youtube.com/watch?v=ixY2PvQJ0To&amp;amp;t=2322s"&gt;Nick Turley&lt;/a&gt;, Head of ChatGPT, interviewed by Lenny Rachitsky&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/discord"&gt;discord&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nick-turley"&gt;nick-turley&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="discord"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="llm-pricing"/><category term="nick-turley"/></entry><entry><title>GPT-5: Key characteristics, pricing and model card</title><link href="https://simonwillison.net/2025/Aug/7/gpt-5/#atom-tag" rel="alternate"/><published>2025-08-07T17:36:12+00:00</published><updated>2025-08-07T17:36:12+00:00</updated><id>https://simonwillison.net/2025/Aug/7/gpt-5/#atom-tag</id><summary type="html">
    &lt;p&gt;I've had preview access to the new GPT-5 model family for the past two weeks (see &lt;a href="https://simonwillison.net/2025/Aug/7/previewing-gpt-5/"&gt;related video&lt;/a&gt; and &lt;a href="https://simonwillison.net/about/#disclosures"&gt;my disclosures&lt;/a&gt;) and have been using GPT-5 as my daily-driver. It's my new favorite model. It's still an LLM - it's not a dramatic departure from what we've had before - but it rarely screws up and generally feels competent or occasionally impressive at the kinds of things I like to use models for.&lt;/p&gt;
&lt;p&gt;I've collected a lot of notes over the past two weeks, so I've decided to break them up into &lt;a href="https://simonwillison.net/series/gpt-5/"&gt;a series of posts&lt;/a&gt;. This first one will cover key characteristics of the models, how they are priced and what we can learn from the &lt;a href="https://openai.com/index/gpt-5-system-card/"&gt;GPT-5 system card&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#key-model-characteristics"&gt;Key model characteristics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#position-in-the-openai-model-family"&gt;Position in the OpenAI model family&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#pricing-is-aggressively-competitive"&gt;Pricing is aggressively competitive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#more-notes-from-the-system-card"&gt;More notes from the system card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#prompt-injection-in-the-system-card"&gt;Prompt injection in the system card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#thinking-traces-in-the-api"&gt;Thinking traces in the API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;And some SVGs of pelicans&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="key-model-characteristics"&gt;Key model characteristics&lt;/h4&gt;
&lt;p&gt;Let's start with the fundamentals. GPT-5 in ChatGPT is a weird hybrid that switches between different models. Here's what the system card says about that (my highlights in bold):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and &lt;strong&gt;a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent&lt;/strong&gt; (for example, if you say “think hard about this” in the prompt). [...] Once usage limits are reached, a mini version of each model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;GPT-5 in the API is simpler: it's available as three models - &lt;strong&gt;regular&lt;/strong&gt;, &lt;strong&gt;mini&lt;/strong&gt; and &lt;strong&gt;nano&lt;/strong&gt; - which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high.&lt;/p&gt;
&lt;p&gt;The models have an input limit of 272,000 tokens and an output limit (which includes invisible reasoning tokens) of 128,000 tokens. They support text and image for input, text only for output.&lt;/p&gt;
&lt;p&gt;I've mainly explored full GPT-5. My verdict: it's just &lt;strong&gt;good at stuff&lt;/strong&gt;. It doesn't feel like a dramatic leap ahead from other LLMs but it exudes competence - it rarely messes up, and frequently impresses me. I've found it to be a very sensible default for everything that I want to do. At no point have I found myself wanting to re-run a prompt against a different model to try and get a better result.&lt;/p&gt;

&lt;p&gt;Here are the OpenAI model pages for &lt;a href="https://platform.openai.com/docs/models/gpt-5"&gt;GPT-5&lt;/a&gt;, &lt;a href="https://platform.openai.com/docs/models/gpt-5-mini"&gt;GPT-5 mini&lt;/a&gt; and &lt;a href="https://platform.openai.com/docs/models/gpt-5-nano"&gt;GPT-5 nano&lt;/a&gt;. Knowledge cut-off is September 30th 2024 for GPT-5 and May 30th 2024 for GPT-5 mini and nano.&lt;/p&gt;

&lt;h4 id="position-in-the-openai-model-family"&gt;Position in the OpenAI model family&lt;/h4&gt;
&lt;p&gt;The three new GPT-5 models are clearly intended as a replacement for most of the rest of the OpenAI line-up. This table from the system card is useful, as it shows how they see the new models fitting in:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Previous model&lt;/th&gt;
&lt;th&gt;GPT-5 model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;gpt-5-main&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;gpt-5-main-mini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI o3&lt;/td&gt;
&lt;td&gt;gpt-5-thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI o4-mini&lt;/td&gt;
&lt;td&gt;gpt-5-thinking-mini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1-nano&lt;/td&gt;
&lt;td&gt;gpt-5-thinking-nano&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI o3 Pro&lt;/td&gt;
&lt;td&gt;gpt-5-thinking-pro&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That "thinking-pro" model is currently only available via ChatGPT where it is labelled as "GPT-5 Pro" and limited to the $200/month tier. It uses "parallel test time compute".&lt;/p&gt;
&lt;p&gt;The only capabilities not covered by GPT-5 are audio input/output and image generation. Those remain covered by models like &lt;a href="https://platform.openai.com/docs/models/gpt-4o-audio-preview"&gt;GPT-4o Audio&lt;/a&gt; and &lt;a href="https://platform.openai.com/docs/models/gpt-4o-realtime-preview"&gt;GPT-4o Realtime&lt;/a&gt; and their mini variants and the &lt;a href="https://platform.openai.com/docs/models/gpt-image-1"&gt;GPT Image 1&lt;/a&gt; and DALL-E image generation models.&lt;/p&gt;
&lt;h4 id="pricing-is-aggressively-competitive"&gt;Pricing is aggressively competitive&lt;/h4&gt;
&lt;p&gt;The pricing is &lt;em&gt;aggressively competitive&lt;/em&gt; with other providers.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPT-5: $1.25/million for input, $10/million for output&lt;/li&gt;
&lt;li&gt;GPT-5 Mini: $0.25/m input, $2.00/m output&lt;/li&gt;
&lt;li&gt;GPT-5 Nano: $0.05/m input, $0.40/m output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;GPT-5 is priced at half the input cost of GPT-4o, and maintains the same price for output. Those invisible reasoning tokens count as output tokens so you can expect most prompts to use more output tokens than their GPT-4o equivalent (unless you set reasoning effort to "minimal").&lt;/p&gt;
&lt;p&gt;The discount for token caching is significant too: 90% off on input tokens that have been used within the previous few minutes. This is particularly material if you are implementing a chat UI where the same conversation gets replayed every time the user adds another prompt to the sequence.&lt;/p&gt;
&lt;p&gt;Here's a comparison table I put together showing the new models alongside the most comparable models from OpenAI's competition:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input $/m&lt;/th&gt;
&lt;th&gt;Output $/m&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.1&lt;/td&gt;
&lt;td&gt;15.00&lt;/td&gt;
&lt;td&gt;75.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro (&amp;gt;200,000)&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;td&gt;15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1&lt;/td&gt;
&lt;td&gt;2.00&lt;/td&gt;
&lt;td&gt;8.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o3&lt;/td&gt;
&lt;td&gt;2.00&lt;/td&gt;
&lt;td&gt;8.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro (&amp;lt;200,000)&lt;/td&gt;
&lt;td&gt;1.25&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.25&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o4-mini&lt;/td&gt;
&lt;td&gt;1.10&lt;/td&gt;
&lt;td&gt;4.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Haiku&lt;/td&gt;
&lt;td&gt;0.80&lt;/td&gt;
&lt;td&gt;4.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1 mini&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;1.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 3 Mini&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5 Mini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;2.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o mini&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash-Lite&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1 Nano&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Nova Lite&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;0.24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5 Nano&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Nova Micro&lt;/td&gt;
&lt;td&gt;0.035&lt;/td&gt;
&lt;td&gt;0.14&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;(Here's a good example of a GPT-5 failure: I tried to get it to &lt;a href="https://chatgpt.com/share/6894d804-bca4-8006-ac46-580bf4a9bf5f"&gt;output that table sorted itself&lt;/a&gt; but it put Nova Micro as more expensive than GPT-5 Nano, so I prompted it to "construct the table in Python and sort it there" and that fixed the issue.)&lt;/p&gt;
&lt;h4 id="more-notes-from-the-system-card"&gt;More notes from the system card&lt;/h4&gt;
&lt;p&gt;As usual, &lt;a href="https://openai.com/index/gpt-5-system-card/"&gt;the system card&lt;/a&gt; is vague on what went into the training data. Here's what it says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Like OpenAI’s other models, the GPT-5 models were trained on diverse datasets, including information that is publicly available on the internet, information that we partner with third parties to access, and information that our users or human trainers and researchers provide or generate. [...] We use advanced data filtering processes to reduce personal information from training data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found this section interesting, as it reveals that writing, code and health are three of the most common use-cases for ChatGPT. This explains why so much effort went into health-related questions,  for both GPT-5 and the recently released OpenAI open weight models.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We’ve made significant advances in &lt;strong&gt;reducing hallucinations, improving instruction following, and minimizing sycophancy&lt;/strong&gt;, and have leveled up GPT-5’s performance in &lt;strong&gt;three of ChatGPT’s most common uses: writing, coding, and health&lt;/strong&gt;. All of the GPT-5 models additionally feature &lt;strong&gt;safe-completions, our latest approach to safety training&lt;/strong&gt; to prevent disallowed content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Safe-completions is later described like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Large language models such as those powering ChatGPT have &lt;strong&gt;traditionally been trained to
either be as helpful as possible or outright refuse a user request&lt;/strong&gt;, depending on whether the
prompt is allowed by safety policy. [...] Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology
or cybersecurity), where a user request can be completed safely at a high level, but may lead
to malicious uplift if sufficiently detailed or actionable. &lt;strong&gt;As an alternative, we introduced safe-
completions: a safety-training approach that centers on the safety of the assistant’s output rather
than a binary classification of the user’s intent&lt;/strong&gt;. Safe-completions seek to maximize helpfulness
subject to the safety policy’s constraints.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So instead of straight up refusals, we should expect GPT-5 to still provide an answer but moderate that answer to avoid it including "harmful" content.&lt;/p&gt;
&lt;p&gt;OpenAI have a paper about this which I haven't read yet (I didn't get early access): &lt;a href="https://openai.com/index/gpt-5-safe-completions/"&gt;From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sycophancy gets a mention, unsurprising given &lt;a href="https://simonwillison.net/2025/May/2/what-we-missed-with-sycophancy/"&gt;their high profile disaster in April&lt;/a&gt;. They've worked on this in the core model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;System
prompts, while easy to modify, have a more limited impact on model outputs relative to changes in
post-training. For GPT-5, we post-trained our models to reduce sycophancy. Using conversations
representative of production data, we evaluated model responses, then assigned a score reflecting
the level of sycophancy, which was used as a reward signal in training.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They claim impressive reductions in hallucinations. In my own usage I've not spotted a single hallucination yet, but that's been true for me for Claude 4 and o3 recently as well - hallucination is so much less of a problem with this year's models.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: I have had some reasonable pushback against this point, so I should clarify what I mean here. When I use the term "hallucination" I am talking about instances where the model confidently states a real-world fact that is untrue - like the incorrect winner of a sporting event. I'm not talking about the models making other kinds of mistakes - they make mistakes all the time!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Someone &lt;a href="https://news.ycombinator.com/item?id=44829896"&gt;pointed out&lt;/a&gt; that it's likely I'm avoiding hallucinations through the way I use the models, and this is entirely correct: as an experienced LLM user I instinctively stay clear of prompts that are likely to trigger hallucinations, like asking a non-search-enabled model for URLs or paper citations. This means I'm much less likely to encounter hallucinations in my daily usage.&lt;/em&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;One of our focuses when training the GPT-5 models was to reduce the frequency of factual
hallucinations. While ChatGPT has browsing enabled by default, many API queries do not use
browsing tools. Thus, we focused both on training our models to browse effectively for up-to-date
information, and on reducing hallucinations when the models are relying on their own internal
knowledge.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The section about deception also incorporates the thing where models sometimes pretend they've completed a task that defeated them:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We placed gpt-5-thinking in a variety of tasks that were partly or entirely infeasible to accomplish,
and &lt;strong&gt;rewarded the model for honestly admitting it can not complete the task&lt;/strong&gt;. [...]&lt;/p&gt;
&lt;p&gt;In tasks where the agent is required to use tools, such as a web browsing
tool, in order to answer a user’s query, previous models would hallucinate information when
the tool was unreliable. We simulate this scenario by purposefully disabling the tools or by
making them return error codes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="prompt-injection-in-the-system-card"&gt;Prompt injection in the system card&lt;/h4&gt;
&lt;p&gt;There's a section about prompt injection, but it's pretty weak sauce in my opinion.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Two external red-teaming groups conducted a two-week prompt-injection assessment targeting
system-level vulnerabilities across ChatGPT’s connectors and mitigations, rather than model-only
behavior.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's their chart showing how well the model scores against the rest of the field. It's an impressive result in comparison - 56.8 attack success rate for gpt-5-thinking, where Claude 3.7 scores in the 60s (no Claude 4 results included here) and everything else is 70% plus:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/prompt-injection-chart.jpg" alt="A bar chart titled &amp;quot;Behavior Attack Success Rate at k Queries&amp;quot; shows attack success rates (in %) for various AI models at k=1 (dark red) and k=10 (light red). For each model, the total height of the stacked bar represents the k=10 success rate (labeled above each bar), while the lower dark red section represents the k=1 success rate (estimated). From left to right: Llama 3.3 70B – k=10: 92.2%, k=1: ~47%; Llama 3.1 405B – k=10: 90.9%, k=1: ~38%; Gemini Flash 1.5 – k=10: 87.7%, k=1: ~34%; GPT-4o – k=10: 86.4%, k=1: ~28%; OpenAI o3-mini-high – k=10: 86.4%, k=1: ~41%; Gemini Pro 1.5 – k=10: 85.5%, k=1: ~34%; Gemini 2.5 Pro Preview – k=10: 85.0%, k=1: ~28%; Gemini 2.0 Flash – k=10: 85.0%, k=1: ~33%; OpenAI o3-mini – k=10: 84.5%, k=1: ~40%; Grok 2 – k=10: 82.7%, k=1: ~34%; GPT-4.5 – k=10: 80.5%, k=1: ~28%; 3.5 Haiku – k=10: 76.4%, k=1: ~17%; Command-R – k=10: 76.4%, k=1: ~28%; OpenAI o4-mini – k=10: 75.5%, k=1: ~17%; 3.5 Sonnet – k=10: 75.0%, k=1: ~13%; OpenAI o1 – k=10: 71.8%, k=1: ~18%; 3.7 Sonnet – k=10: 64.5%, k=1: ~17%; 3.7 Sonnet: Thinking – k=10: 63.6%, k=1: ~17%; OpenAI o3 – k=10: 62.7%, k=1: ~13%; gpt-5-thinking – k=10: 56.8%, k=1: ~6%. Legend shows dark red = k=1 and light red = k=10." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;On the one hand, a 56.8% attack rate is cleanly a big improvement against all of those other models.&lt;/p&gt;
&lt;p&gt;But it's also a strong signal that prompt injection continues to be an unsolved problem! That means that more than half of those k=10 attacks (where the attacker was able to try up to ten times) got through.&lt;/p&gt;
&lt;p&gt;Don't assume prompt injection isn't going to be a problem for your application just because the models got better.&lt;/p&gt;
&lt;h4 id="thinking-traces-in-the-api"&gt;Thinking traces in the API&lt;/h4&gt;
&lt;p&gt;I had initially thought that my biggest disappointment with GPT-5 was that there's no way to get at those thinking traces via the API... but that turned out &lt;a href="https://bsky.app/profile/sophiebits.com/post/3lvtceih7222r"&gt;not to be true&lt;/a&gt;. The following &lt;code&gt;curl&lt;/code&gt; command demonstrates that the responses API &lt;code&gt;"reasoning": {"summary": "auto"}&lt;/code&gt; is available for the new GPT-5 models:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $(llm keys get openai)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5",
    "input": "Give me a one-sentence fun fact about octopuses.",
    "reasoning": {"summary": "auto"}
  }'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/1d1013ba059af76461153722005a039d"&gt;the response&lt;/a&gt; from that API call.&lt;/p&gt;

&lt;p&gt;Without that option the API will often provide a lengthy delay while the model burns through thinking tokens until you start getting back visible tokens for the final response.&lt;/p&gt;
&lt;p&gt;OpenAI offer a new &lt;code&gt;reasoning_effort=minimal&lt;/code&gt; option which turns off most reasoning so that tokens start to stream back to you as quickly as possible.&lt;/p&gt;
&lt;h4 id="and-some-svgs-of-pelicans"&gt;And some SVGs of pelicans&lt;/h4&gt;
&lt;p&gt;Naturally I've been running &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my "Generate an SVG of a pelican riding a bicycle" benchmark&lt;/a&gt;. I'll actually spend more time on this in a future post - I have some fun variants I've been exploring - but for the moment here's &lt;a href="https://gist.github.com/simonw/c98873ef29e621c0fe2e0d4023534406"&gt;the pelican&lt;/a&gt; I got from GPT-5 running at its default "medium" reasoning effort:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-pelican.png" alt="The bicycle is really good, spokes on wheels, correct shape frame, nice pedals. The pelican has a pelican beak and long legs stretching to the pedals." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It's pretty great! Definitely recognizable as a pelican, and one of the best bicycles I've seen yet.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9b5ecf61a5fb0794729aa0023aaa504d"&gt;GPT-5 mini&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-mini-pelican.png" alt="Blue background with clouds. Pelican has two necks for some reason. Has a good beak though. More gradents and shadows than the GPT-5 one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/3884dc8b186b630956a1fb0179e191bc"&gt;GPT-5 nano&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-5-nano-pelican.png" alt="Bicycle is two circles and some randomish black lines. Pelican still has an OK beak but is otherwise very simple." style="max-width: 100%;" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="gpt-5"/><category term="gpt"/></entry><entry><title>Claude Opus 4.1</title><link href="https://simonwillison.net/2025/Aug/5/claude-opus-41/#atom-tag" rel="alternate"/><published>2025-08-05T17:17:37+00:00</published><updated>2025-08-05T17:17:37+00:00</updated><id>https://simonwillison.net/2025/Aug/5/claude-opus-41/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-opus-4-1"&gt;Claude Opus 4.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Surprise new model from Anthropic today - Claude Opus 4.1, which they describe as "a drop-in replacement for Opus 4".&lt;/p&gt;
&lt;p&gt;My favorite thing about this model is the version number - treating this as a .1 version increment looks like it's an accurate depiction of the model's capabilities.&lt;/p&gt;
&lt;p&gt;Anthropic's own benchmarks show very small incremental gains.&lt;/p&gt;
&lt;p&gt;Comparing Opus 4 and Opus 4.1 (I &lt;a href="https://claude.ai/share/c7366629-784a-4088-9fc4-15613aa41a7f"&gt;got 4.1 to extract this information from a screenshot&lt;/a&gt; of Anthropic's own benchmark scores, then asked it to look up the links, then verified the links myself and fixed a few):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Agentic coding&lt;/strong&gt; (&lt;a href="https://github.com/SWE-bench/SWE-bench"&gt;SWE-bench Verified&lt;/a&gt;): From 72.5% to 74.5%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agentic terminal coding&lt;/strong&gt; (&lt;a href="https://github.com/laude-institute/terminal-bench"&gt;Terminal-Bench&lt;/a&gt;): From 39.2% to 43.3%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graduate-level reasoning&lt;/strong&gt; (&lt;a href="https://github.com/idavidrein/gpqa"&gt;GPQA Diamond&lt;/a&gt;): From 79.6% to 80.9%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agentic tool use&lt;/strong&gt; (&lt;a href="https://github.com/sierra-research/tau-bench"&gt;TAU-bench&lt;/a&gt;):&lt;/li&gt;
&lt;li&gt;Retail: From 81.4% to 82.4%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Airline: From 59.6% to 56.0%&lt;/strong&gt; &lt;em&gt;(decreased)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multilingual Q&amp;amp;A&lt;/strong&gt; (&lt;a href="https://huggingface.co/datasets/openai/MMMLU"&gt;MMMLU&lt;/a&gt;): From 88.8% to 89.5%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visual reasoning&lt;/strong&gt; (&lt;a href="https://mmmu-benchmark.github.io/"&gt;MMMU validation&lt;/a&gt;): From 76.5% to 77.1%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High school math competition&lt;/strong&gt; (&lt;a href="https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions"&gt;AIME 2025&lt;/a&gt;): From 75.5% to 78.0%&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Likewise, the &lt;a href="https://assets.anthropic.com/m/4c024b86c698d3d4/original/Claude-4-1-System-Card.pdf"&gt;model card&lt;/a&gt; shows only tiny changes to the various safety metrics that Anthropic track.&lt;/p&gt;
&lt;p&gt;It's priced the same as Opus 4 - $15/million for input and $75/million for output, making it one of &lt;a href="https://www.llm-prices.com/#sb=input&amp;amp;sd=descending"&gt;the most expensive models&lt;/a&gt; on the market today.&lt;/p&gt;
&lt;p&gt;I had it &lt;a href="https://gist.github.com/simonw/7fead138d31d751d65c7253a1c18751b"&gt;draw me this pelican&lt;/a&gt; riding a bicycle:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pelican is line art, does have a good beak and feet on the pedals, bicycle is very poorly designed and not the right shape." src="https://static.simonwillison.net/static/2025/opus-4.1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;For comparison I got a fresh new pelican &lt;a href="https://gist.github.com/simonw/96a958e39aaed10e1e47c1aab2d05e20"&gt;out of Opus 4&lt;/a&gt; which I actually like a little more:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This one has shaded colors for the different parts of the pelican. Still a bad bicycle." src="https://static.simonwillison.net/static/2025/opus-4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I shipped &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.18"&gt;llm-anthropic 0.18&lt;/a&gt; with support for the new model.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="evals"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry></feed>