<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: Entries</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/atom/entries/" rel="self"/><id>http://simonwillison.net/</id><updated>2026-05-07T17:09:28+00:00</updated><author><name>Simon Willison</name></author><entry><title>Notes on the xAI/Anthropic data center deal</title><link href="https://simonwillison.net/2026/May/7/xai-anthropic/#atom-entries" rel="alternate"/><published>2026-05-07T17:09:28+00:00</published><updated>2026-05-07T17:09:28+00:00</updated><id>https://simonwillison.net/2026/May/7/xai-anthropic/#atom-entries</id><summary type="html">&lt;p&gt;There weren't a lot of big new announcements from Anthropic at yesterday's Code w/ Claude event, but the biggest by far was the deal they've struck with SpaceX/xAI to use "all of the capacity of their Colossus data center".&lt;/p&gt;
&lt;p&gt;As I mentioned in my &lt;a href="https://simonwillison.net/2026/May/6/code-w-claude-2026/"&gt;live blog of the keynote&lt;/a&gt;, that's the one with the &lt;a href="https://www.politico.com/news/2025/05/06/elon-musk-xai-memphis-gas-turbines-air-pollution-permits-00317582"&gt;particularly bad environmental record&lt;/a&gt;. The gas turbines installed to power the facility initially ran without Clean Air Act permits or pollution control devices, which they got away with by classifying them as "temporary". Credible reports link the facility to increases in hospital admissions related to poor air quality.&lt;/p&gt;
&lt;p&gt;Andy Masley, one of the most prolific voices pushing back against misleading rhetoric about data centers (see &lt;a href="https://blog.andymasley.com/p/the-ai-water-issue-is-fake"&gt;The AI water issue is fake&lt;/a&gt; and &lt;a href="https://blog.andymasley.com/p/data-center-land-use-issues-are-fake"&gt;Data center land use issues are fake&lt;/a&gt;), had &lt;a href="https://x.com/andymasley/status/2052070252930826384"&gt;this to say&lt;/a&gt; about Colossus:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I would simply not run my computing out of this specific data center&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I get that Anthropic are severely compute-constrained, but in a world where the very existence of "AI data centers" is a red-hot political issue (see recent &lt;a href="https://kutv.com/news/local/amid-boos-box-elder-county-commission-unanimously-approves-plan-for-massive-data-center"&gt;news out of Utah&lt;/a&gt; for a fresh example), signing up with this particular data center is a really bad look.&lt;/p&gt;
&lt;p&gt;There was a lot of initial chatter about how this meant xAI were clearly giving up on their own Grok models, since all of their capacity would be sold to Anthropic instead. That was a misconception - Anthropic are getting Colossus 1, but xAI are keeping their larger Colossus 2 data center for their own work.&lt;/p&gt;
&lt;p&gt;As an interesting side note, the night before the Anthropic announcement, xAI sent out a deprecation notice for Grok 4.1 Fast and several other models, providing just two weeks' notice before shutdown, reported here &lt;a href="https://twitter.com/xlr8harder/status/2051901091906834439"&gt;by @xlr8harder&lt;/a&gt; from SpeechMap:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/grok-fast-shutdown.png" alt="Effective May 15, 2026 at 12:00pm PT, the following models will be retired from the xAI API: grok-4-1-fast-reasoning, grok-4-1-fast-non-reasoning, grok-4-fast-reasoning, grok-4-fast-non-reasoning, grok-4-0709, grok-code-fast-1, grok-3, grok-imagine-image-pro. After May 15, 2026, requests to these models will no longer work." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is terrible @xai. I just spent time and money to migrate to grok 4.1 fast, and you're disabling it with less than two weeks notice, after releasing it in November, with no migration path to a fast/cheap alternative.&lt;/p&gt;
&lt;p&gt;I will never depend on one of your products again.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://speechmap.substack.com/p/speechmap-update-xai-loses-top-spot"&gt;SpeechMap's detailed explanation&lt;/a&gt; of how they selected Grok 4.1 Fast for their project in March.&lt;/p&gt;
&lt;p&gt;Were xAI serving those models out of Colossus 1?&lt;/p&gt;
&lt;p&gt;xAI owner Elon Musk (who previously delighted in calling Anthropic &lt;a href="https://twitter.com/search?q=from%3Aelonmusk+misanthropic&amp;amp;src=typed_query&amp;amp;f=live"&gt;"Misanthropic"&lt;/a&gt;) &lt;a href="https://twitter.com/elonmusk/status/2052069691372478511"&gt;tweeted&lt;/a&gt; the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;By way of background for those who care, I spent a lot of time last week with senior members of the Anthropic team to understand what they do to ensure Claude is good for humanity and was impressed. [...]&lt;/p&gt;
&lt;p&gt;After that, I was ok leasing Colossus 1 to Anthropic, as SpaceXAI had already moved training to Colossus 2.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then &lt;a href="https://twitter.com/elonmusk/status/2052076315306864756"&gt;shortly afterwards&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Just as SpaceX launches hundreds of satellites for competitors with fair terms and pricing, we will provide compute to AI companies that are taking the right steps to ensure it is good for humanity.&lt;/p&gt;
&lt;p&gt;We reserve the right to reclaim the compute if their AI engages in actions that harm humanity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Presumably the criteria for "harm humanity" are decided by Elon himself. Sounds like a new form of supply chain risk for Anthropic to me!&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="llms"/><category term="anthropic"/><category term="ai-ethics"/><category term="ai-energy-usage"/><category term="xai"/><category term="andy-masley"/></entry><entry><title>Live blog: Code w/ Claude 2026</title><link href="https://simonwillison.net/2026/May/6/code-w-claude-2026/#atom-entries" rel="alternate"/><published>2026-05-06T15:58:27+00:00</published><updated>2026-05-06T15:58:27+00:00</updated><id>https://simonwillison.net/2026/May/6/code-w-claude-2026/#atom-entries</id><summary type="html">&lt;p&gt;I'm at Anthropic's Code w/ Claude event today. Here's my live blog of the morning keynote sessions.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="claude-code"/><category term="live-blog"/></entry><entry><title>Vibe coding and agentic engineering are getting closer than I'd like</title><link href="https://simonwillison.net/2026/May/6/vibe-coding-and-agentic-engineering/#atom-entries" rel="alternate"/><published>2026-05-06T14:24:08+00:00</published><updated>2026-05-06T14:24:08+00:00</updated><id>https://simonwillison.net/2026/May/6/vibe-coding-and-agentic-engineering/#atom-entries</id><summary type="html">&lt;p&gt;I recently talked with Joseph Ruscio about AI coding tools for Heavybit's High Leverage podcast: &lt;a href="https://www.heavybit.com/library/podcasts/high-leverage/ep-9-the-ai-coding-paradigm-shift-with-simon-willison"&gt;Ep. #9, The AI Coding Paradigm Shift with Simon Willison&lt;/a&gt;. Here are some of my highlights, including my disturbing realization that vibe coding and agentic engineering have started to converge in my own work.&lt;/p&gt;
&lt;p&gt;One thing I really enjoy about podcasts is that they sometimes push me to think out loud in a way that exposes an idea I've not previously been able to put into words.&lt;/p&gt;
&lt;h4 id="vibe-coding-and-agentic-engineering-are-starting-to-overlap"&gt;Vibe coding and agentic engineering are starting to overlap&lt;/h4&gt;
&lt;p&gt;A few weeks after the term "vibe coding" was first coined, I published &lt;a href="https://simonwillison.net/2025/Mar/19/vibe-coding/"&gt;Not all AI-assisted programming is vibe coding (but vibe coding rocks)&lt;/a&gt;, where I firmly staked out my belief that "vibe coding" is a very different beast from responsible use of AI to write code, which I've since started to call &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/what-is-agentic-engineering/"&gt;agentic engineering&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When Joseph brought up the distinction between the two I had a sudden realization that they're not nearly as distinct for me as they used to be:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Weirdly though, those things have started to blur for me already, which is quite upsetting.&lt;/p&gt;
&lt;p&gt;I thought we had a very clear delineation where vibe coding is the thing where you're not looking at the code at all. You might not even know how to program. You might be a non-programmer who asks for a thing, and gets a thing, and if the thing works, then great! And if it doesn't, you tell it that it doesn't work and cross your fingers.&lt;/p&gt;
&lt;p&gt;But at no point are you really caring about the code quality or any of those additional constraints. And my take on vibe coding was that it's fantastic, provided you understand when it can be used and when it can't.&lt;/p&gt;
&lt;p&gt;A personal tool for you, where if there's a bug it hurts only you, go ahead!&lt;/p&gt;
&lt;p&gt;If you're building software for other people, vibe coding is grossly irresponsible because it's other people's information. Other people get hurt by your stupid bugs. You need to have a higher level than that.&lt;/p&gt;
&lt;p&gt;This contrasts with agentic engineering where you are a professional software engineer. You understand security and maintainability and operations and performance and so forth. You're using these tools to the highest of your own ability. I'm finding the scope of challenges I can take on has gone up by a significant amount because I've got the support of these tools.&lt;/p&gt;
&lt;p&gt;But I'm still leaning on my 25 years of experience as a software engineer.&lt;/p&gt;
&lt;p&gt;The goal is to build high quality production systems: if you're building lower quality stuff faster, I think that's bad. I want to build &lt;em&gt;higher&lt;/em&gt; quality stuff faster. I want everything I'm building to be better in every way than it was before.&lt;/p&gt;
&lt;p&gt;The problem is that as the coding agents get more reliable, I'm not reviewing every line of code that they write anymore, even for my production level stuff.&lt;/p&gt;
&lt;p&gt;I know full well that if you ask Claude Code to build a JSON API endpoint that runs a SQL query and outputs the results as JSON, it's just going to do it right. It's not going to mess that up. You have it add automated tests, you have it add documentation, you know it's going to be good.&lt;/p&gt;
&lt;p&gt;But I'm not reviewing that code. And now I've got that feeling of guilt: if I haven't reviewed the code, is it really responsible for me to use this in production?&lt;/p&gt;
&lt;p&gt;The thing that really helps me is thinking back to when I've worked at larger organizations where I've been an engineering manager. Other teams are building software that my team depends on.&lt;/p&gt;
&lt;p&gt;If another team hands over something and says, "hey, this is the image resize service, here's how to use it to resize your images"... I'm not going to go and read every line of code that they wrote.&lt;/p&gt;
&lt;p&gt;I'm going to look at their documentation and I'm going to use it to resize some images. And then I'm going to start shipping my own features. And if I start running into problems where the image resizer thing appears to have bugs or the performance isn't good, that's when I might dig into their Git repositories and see what's going on. But for the most part I treat that as a semi-black box that I don't look at until I need to.&lt;/p&gt;
&lt;p&gt;I'm starting to treat the agents in the same way. And it still feels uncomfortable, because human beings are accountable for what they do. A team can build a reputation. I can say "I trust that team over there. They built good software in the past. They're not going to build something rubbish because that affects their professional reputations."&lt;/p&gt;
&lt;p&gt;Claude Code does not have a professional reputation! It can't take accountability for what it's done. But it's been proving itself anyway - time and time again it's churning out straightforward things and doing them right in the style that I like.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's an element of &lt;a href="https://simonwillison.net/2025/Dec/10/normalization-of-deviance/"&gt;the normalization of deviance&lt;/a&gt; here - every time a model turns out to have written the right code without me monitoring it closely there's a risk that I'll trust it at the wrong moment in the future and get burned.&lt;/p&gt;
&lt;h4 id="the-new-challenge-of-evaluating-software"&gt;The new challenge of evaluating software&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;It used to be if you found a GitHub repository with a hundred commits and a good readme and automated tests and stuff, you could be pretty sure that the person writing that had put a lot of care and attention into that project.&lt;/p&gt;
&lt;p&gt;And now I can knock out a git repository with a hundred commits and a beautiful readme and comprehensive tests of every line of code in half an hour! It looks identical to those projects that have had a great deal of care and attention. Maybe it is as good as them. I don't know. I can't tell from looking at it. Even for my &lt;em&gt;own&lt;/em&gt; projects, I can't tell.&lt;/p&gt;
&lt;p&gt;So I realized what I value more than the quality of the tests and documentation is that I want somebody to have &lt;em&gt;used&lt;/em&gt; the thing. If you've got a vibe coded thing which you have used every day for the past two weeks, that's much more valuable to me than something that you've just spat out and hardly even exercised.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="the-bottlenecks-have-shifted"&gt;The bottlenecks have shifted&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks? The entire software development lifecycle was, it turns out, designed around the idea that it takes a day to produce a few hundred lines of code. And now it doesn't.&lt;/p&gt;
&lt;p&gt;It's not just the downstream stuff, it's the upstream stuff as well. I saw &lt;a href="https://simonwillison.net/2026/Jan/24/dont-trust-the-process/"&gt;a great talk by Jenny Wen&lt;/a&gt;, who's the design leader at Anthropic, where she said we have all of these design processes that are based around the idea that you need to get the design &lt;em&gt;right&lt;/em&gt; - because if you hand it off to the engineers and they spend three months building the wrong thing, that's catastrophic.&lt;/p&gt;
&lt;p&gt;There's this whole very extensive design process that you put in place because that design results in expensive work. But if it doesn't take three months to build, maybe the design process can be a whole lot riskier because cost, if you get something wrong, has been reduced so much.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="why-i-m-still-not-afraid-for-my-career"&gt;Why I'm still not afraid for my career&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;When I look at my conversations with the agents, it's very clear to me that this is moon language for the vast majority of human beings.&lt;/p&gt;
&lt;p&gt;There are a whole bunch of reasons I'm not scared that my career as a software engineer is over now that computers can write their own code, partly because these things are amplifiers of existing experience. If you know what you're doing, you can run so much faster with them. [...]&lt;/p&gt;
&lt;p&gt;I'm constantly reminded as I work with these tools how hard the thing that we do is. Producing software is a &lt;em&gt;ferociously&lt;/em&gt; difficult thing to do. And you could give me all of the AI tools in the world and what we're trying to achieve here is still really difficult. [...]&lt;/p&gt;
&lt;p&gt;Matthew Yglesias, who's a political commentator, yesterday &lt;a href="https://twitter.com/mattyglesias/status/2049105745132585161"&gt;tweeted&lt;/a&gt;, "Five months in, I think I've decided that I don't want to vibecode — I want professionally managed software companies to use AI coding assistance to make more/better/cheaper software products that they sell to me for money." And that feels about right to me. I can plumb my house if I watch enough YouTube videos on plumbing. I would rather hire a plumber.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On the threat to SaaS providers of companies rolling their own solutions instead:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I just realized it's the thing I said earlier about how I only want to use your side project if you've used it for a few weeks. The enterprise version of that is I don't want a CRM unless at least two other giant enterprises have successfully used that CRM for six months. [...] You want solutions that are proven to work before you take a risk on them.&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="podcast-appearances"/><category term="vibe-coding"/><category term="coding-agents"/><category term="agentic-engineering"/></entry><entry><title>LLM 0.32a0 is a major backwards-compatible refactor</title><link href="https://simonwillison.net/2026/Apr/29/llm/#atom-entries" rel="alternate"/><published>2026-04-29T19:01:47+00:00</published><updated>2026-04-29T19:01:47+00:00</updated><id>https://simonwillison.net/2026/Apr/29/llm/#atom-entries</id><summary type="html">&lt;p&gt;I just released &lt;a href="https://llm.datasette.io/en/latest/changelog.html#a0-2026-04-28"&gt;LLM 0.32a0&lt;/a&gt;, an alpha release of my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; Python library and CLI tool for accessing LLMs, with some consequential changes that I've been working towards for quite a while.&lt;/p&gt;
&lt;p&gt;Previous versions of LLM modeled the world in terms of prompts and responses. Send the model a text prompt, get back a text response.&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;

&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Capital of France?"&lt;/span&gt;)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;text&lt;/span&gt;())&lt;/pre&gt;
&lt;p&gt;This made sense when I started working on the library back in April 2023. A lot has changed since then!&lt;/p&gt;
&lt;p&gt;LLM provides an abstraction over thousands of different models via its &lt;a href="https://llm.datasette.io/en/stable/plugins/index.html"&gt;plugin system&lt;/a&gt;. The original abstraction - of text input that returns text output - was no longer able to represent everything I needed it to.&lt;/p&gt;
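&lt;p&gt;As a quick illustration of that plugin system, here's a sketch using the &lt;a href="https://github.com/simonw/llm-anthropic"&gt;llm-anthropic&lt;/a&gt; plugin that shows up later in this post - you'd also need to configure an Anthropic API key first:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-anthropic
llm -m claude-sonnet-4.6 'Capital of France?'
&lt;/code&gt;&lt;/pre&gt;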
&lt;p&gt;Over time LLM itself has grown &lt;a href="https://simonwillison.net/2024/Oct/29/llm-multi-modal/"&gt;attachments&lt;/a&gt; to handle image, audio, and video input, then &lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;schemas&lt;/a&gt; for outputting structured JSON, then &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;tools&lt;/a&gt; for executing tool calls. Meanwhile LLMs kept evolving, adding reasoning support and the ability to return images and all kinds of other interesting capabilities.&lt;/p&gt;
&lt;p&gt;LLM needs to evolve to better handle the diversity of input and output types that can be processed by today's frontier models.&lt;/p&gt;
&lt;p&gt;The 0.32a0 alpha has two key changes: model inputs can be represented as a sequence of messages, and model responses can be composed of a stream of differently typed parts.&lt;/p&gt;
&lt;h4 id="prompts-as-a-sequence-of-messages"&gt;Prompts as a sequence of messages&lt;/h4&gt;
&lt;p&gt;LLMs accept input as text, but ever since ChatGPT demonstrated the value of a two-way conversational interface, the most common way to prompt them has been to treat that input as a sequence of conversational turns.&lt;/p&gt;
&lt;p&gt;The first turn might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;user: Capital of France?
assistant: 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(The model then gets to fill out the reply from the assistant.)&lt;/p&gt;
&lt;p&gt;But each subsequent turn needs to replay the entire conversation up to that point, as a sort of screenplay:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;user: Capital of France?
assistant: Paris
user: Germany?
assistant:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Most of the JSON APIs from the major vendors follow this pattern. Here's what the above looks like using the OpenAI chat completions API, which has been widely imitated by other providers:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl https://api.openai.com/v1/chat/completions \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Authorization: Bearer &lt;span class="pl-smi"&gt;$OPENAI_API_KEY&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Content-Type: application/json&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -d &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{&lt;/span&gt;
&lt;span class="pl-s"&gt;    "model": "gpt-5.5",&lt;/span&gt;
&lt;span class="pl-s"&gt;    "messages": [&lt;/span&gt;
&lt;span class="pl-s"&gt;      {&lt;/span&gt;
&lt;span class="pl-s"&gt;        "role": "user",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "content": "Capital of France?"&lt;/span&gt;
&lt;span class="pl-s"&gt;      },&lt;/span&gt;
&lt;span class="pl-s"&gt;      {&lt;/span&gt;
&lt;span class="pl-s"&gt;        "role": "assistant",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "content": "Paris"&lt;/span&gt;
&lt;span class="pl-s"&gt;      },&lt;/span&gt;
&lt;span class="pl-s"&gt;      {&lt;/span&gt;
&lt;span class="pl-s"&gt;        "role": "user",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "content": "Germany?"&lt;/span&gt;
&lt;span class="pl-s"&gt;      }&lt;/span&gt;
&lt;span class="pl-s"&gt;    ]&lt;/span&gt;
&lt;span class="pl-s"&gt;  }&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Prior to 0.32, LLM modeled these as conversations:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;conversation&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;conversation&lt;/span&gt;()
&lt;span class="pl-s1"&gt;r1&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;conversation&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Capital of France?"&lt;/span&gt;)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;r1&lt;/span&gt;.&lt;span class="pl-c1"&gt;text&lt;/span&gt;())
&lt;span class="pl-c"&gt;# Outputs "Paris"&lt;/span&gt;

&lt;span class="pl-s1"&gt;r2&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;conversation&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Germany?"&lt;/span&gt;)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;r2&lt;/span&gt;.&lt;span class="pl-c1"&gt;text&lt;/span&gt;())
&lt;span class="pl-c"&gt;# Outputs "Berlin"&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This worked if you were building a conversation with the model from scratch, but it didn't provide a way to feed in a previous conversation from the start. This made tasks like building an emulation of the OpenAI chat completions API much harder than they should have been.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;llm&lt;/code&gt; CLI tool worked around this through a custom mechanism for persisting and inflating conversations using SQLite, but that never became a stable part of the LLM API - and there are many places you might want to use the Python library without committing to SQLite as the storage layer.&lt;/p&gt;
&lt;p&gt;The new alpha now supports this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;user&lt;/span&gt;, &lt;span class="pl-s1"&gt;assistant&lt;/span&gt;

&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s1"&gt;messages&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[
    &lt;span class="pl-en"&gt;user&lt;/span&gt;(&lt;span class="pl-s"&gt;"Capital of France?"&lt;/span&gt;),
    &lt;span class="pl-en"&gt;assistant&lt;/span&gt;(&lt;span class="pl-s"&gt;"Paris"&lt;/span&gt;),
    &lt;span class="pl-en"&gt;user&lt;/span&gt;(&lt;span class="pl-s"&gt;"Germany?"&lt;/span&gt;),
])
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;text&lt;/span&gt;())&lt;/pre&gt;
&lt;p&gt;The new &lt;code&gt;llm.user()&lt;/code&gt; and &lt;code&gt;llm.assistant()&lt;/code&gt; builder functions are designed to be used within that &lt;code&gt;messages=[]&lt;/code&gt; array.&lt;/p&gt;
&lt;p&gt;The previous &lt;code&gt;prompt=&lt;/code&gt; option still works, but LLM upgrades it to a single-item messages array behind the scenes.&lt;/p&gt;
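&lt;p&gt;That means these two calls should produce the same request - a minimal sketch built from the examples above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import llm
from llm import user

model = llm.get_model("gpt-5.5")

# prompt= is upgraded to a single-item messages array behind the scenes
r1 = model.prompt("Capital of France?")
r2 = model.prompt(messages=[user("Capital of France?")])
&lt;/code&gt;&lt;/pre&gt;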
&lt;p&gt;You can also now &lt;em&gt;reply&lt;/em&gt; to a response, as an alternative to building a conversation:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;response2&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;reply&lt;/span&gt;(&lt;span class="pl-s"&gt;"How about Hungary?"&lt;/span&gt;)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response2&lt;/span&gt;) &lt;span class="pl-c"&gt;# Default __str__() calls .text()&lt;/span&gt;&lt;/pre&gt;
&lt;h4 id="streaming-parts"&gt;Streaming parts&lt;/h4&gt;
&lt;p&gt;The other major new interface in the alpha concerns streaming results back from a prompt.&lt;/p&gt;
&lt;p&gt;Previously, LLM supported streaming like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Generate an SVG of a pelican riding a bicycle"&lt;/span&gt;)
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;chunk&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;:
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Or this async variant:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;

&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_async_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Generate an SVG of a pelican riding a bicycle"&lt;/span&gt;)

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;run&lt;/span&gt;():
    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;chunk&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;:
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;.&lt;span class="pl-c1"&gt;run&lt;/span&gt;(&lt;span class="pl-en"&gt;run&lt;/span&gt;())&lt;/pre&gt;
&lt;p&gt;Many of today's models return mixed types of content. A prompt run against Claude might return reasoning output, then text, then a JSON request for a tool call, then more text content.&lt;/p&gt;
&lt;p&gt;Some models can even execute tools server-side, for example OpenAI's &lt;a href="https://developers.openai.com/api/docs/guides/tools-code-interpreter?lang=curl"&gt;code interpreter tool&lt;/a&gt; or Anthropic's &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool"&gt;web search&lt;/a&gt;. This means the results from the model can combine text, tool calls, tool outputs and other formats.&lt;/p&gt;
&lt;p&gt;Multi-modal output models are starting to emerge too, which can return images or even &lt;a href="https://developers.openai.com/api/docs/guides/audio#add-audio-to-your-existing-application"&gt;snippets of audio&lt;/a&gt; intermixed into that streaming response.&lt;/p&gt;
&lt;p&gt;The new LLM alpha models these as a stream of typed message parts. Here's what that looks like as a Python API consumer:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;

&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;prompt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"invent 3 cool dogs, first talk about your motivations"&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;describe_dog&lt;/span&gt;(&lt;span class="pl-s1"&gt;name&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;, &lt;span class="pl-s1"&gt;bio&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-smi"&gt;str&lt;/span&gt;:
    &lt;span class="pl-s"&gt;"""Record the name and biography of a hypothetical dog."""&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s"&gt;f"&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;name&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;bio&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;sync_example&lt;/span&gt;():
    &lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(
        &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;tools&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s1"&gt;describe_dog&lt;/span&gt;],
    )
    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;stream_events&lt;/span&gt;():
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"text"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
        &lt;span class="pl-k"&gt;elif&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"tool_call_name"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"&lt;span class="pl-cce"&gt;\n&lt;/span&gt;Tool call: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;("&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
        &lt;span class="pl-k"&gt;elif&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"tool_call_args"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;async_example&lt;/span&gt;():
    &lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_async_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(
        &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;tools&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s1"&gt;describe_dog&lt;/span&gt;],
    )
    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;astream_events&lt;/span&gt;():
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"text"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
        &lt;span class="pl-k"&gt;elif&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"tool_call_name"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"&lt;span class="pl-cce"&gt;\n&lt;/span&gt;Tool call: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;("&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
        &lt;span class="pl-k"&gt;elif&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"tool_call_args"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-en"&gt;sync_example&lt;/span&gt;()
&lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;.&lt;span class="pl-c1"&gt;run&lt;/span&gt;(&lt;span class="pl-en"&gt;async_example&lt;/span&gt;())&lt;/pre&gt;
&lt;p&gt;Sample output (from just the first sync example):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;My motivation: create three memorable dogs with distinct “cool” styles—one cinematic, one adventurous, and one charmingly chaotic—so each feels like they could star in their own story.&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;Tool call: describe_dog({"name": "Nova Jetpaw", "bio": "A sleek silver-gray whippet who wears tiny aviator goggles and loves sprinting along moonlit beaches. Nova is fearless, elegant, and rumored to outrun drones just for fun."}&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;Tool call: describe_dog({"name": "Mochi Thunderbark", "bio": "A fluffy corgi with a dramatic black-and-gold bandana and the confidence of a rock star. Mochi is short, loud, loyal, and leads a neighborhood 'security patrol' made entirely of squirrels."}&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;Tool call: describe_dog({"name": "Atlas Snowfang", "bio": "A massive white husky with ice-blue eyes and a backpack full of trail snacks. Atlas is calm, heroic, and always knows the way home—even during blizzards, fog, or confusing camping trips."}&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At the end of the response you can call &lt;code&gt;response.execute_tool_calls()&lt;/code&gt; to actually run the functions that were requested, or send a &lt;code&gt;response.reply()&lt;/code&gt; to have those tools called and their return values sent back to the model:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;reply&lt;/span&gt;(&lt;span class="pl-s"&gt;"Tell me about the dogs"&lt;/span&gt;))&lt;/pre&gt;
&lt;p&gt;This new mechanism for streaming different token types means the CLI tool can now display "thinking" text in a different color from the text in the final response. The thinking text goes to stderr so it won't affect results that are piped into other tools.&lt;/p&gt;
&lt;p&gt;This example uses Claude Sonnet 4.6 (with an updated streaming event version of the &lt;a href="https://github.com/simonw/llm-anthropic"&gt;llm-anthropic&lt;/a&gt; plugin) as Anthropic's models return their reasoning text as part of the response:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.6 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Think about 3 cool dogs then describe them&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -o thinking_display 1&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/claude-thinking-llm.gif" alt="Animated demo. Starts with ~/dev/scratch/llm-anthropic % uv run llm -m claude-sonnet-4.6 'Think about 3 cool dogs then describe them' -o thinking_display 1 - the text then streams in grey: The user wants me to think about 3 cool dogs and then describe them. Let me come up with 3 interesting, cool dogs and describe them. Then switches to regular color text for the output that describes the dogs." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;You can suppress the output of reasoning tokens using the new &lt;code&gt;-R/--no-reasoning&lt;/code&gt; flag. Surprisingly that ended up being the only CLI-facing change in this release.&lt;/p&gt;
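&lt;p&gt;For example, to run the earlier prompt with reasoning output suppressed (and since thinking text goes to stderr anyway, piping stdout into another tool only ever captures the final response):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m claude-sonnet-4.6 'Think about 3 cool dogs then describe them' -R
&lt;/code&gt;&lt;/pre&gt;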
&lt;h4 id="a-mechanism-for-serializing-and-deserializing-responses"&gt;A mechanism for serializing and deserializing responses&lt;/h4&gt;
&lt;p&gt;As mentioned earlier, LLM has quite inflexible code at the moment for persisting conversations to SQLite. I've added a new mechanism in 0.32a0 that should provide Python API users a way to roll their own alternative:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;serializable&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;to_dict&lt;/span&gt;()
&lt;span class="pl-c"&gt;# serializable is a JSON-style dictionary&lt;/span&gt;
&lt;span class="pl-c"&gt;# store it anywhere you like, then inflate it:&lt;/span&gt;
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Response&lt;/span&gt;.&lt;span class="pl-c1"&gt;from_dict&lt;/span&gt;(&lt;span class="pl-s1"&gt;serializable&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;The dictionary this returns is actually a &lt;code&gt;TypedDict&lt;/code&gt; defined in the new &lt;a href="https://github.com/simonw/llm/blob/main/llm/serialization.py"&gt;llm/serialization.py&lt;/a&gt; module.&lt;/p&gt;
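&lt;p&gt;That makes rolling your own persistence a matter of getting that dictionary into and out of any JSON-compatible store. Here's a minimal sketch using a file on disk - note that importing &lt;code&gt;Response&lt;/code&gt; directly from &lt;code&gt;llm&lt;/code&gt; is my assumption here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json

from llm import Response  # assumption: Response is importable from llm

# Serialize the response and write it to disk
with open("response.json", "w") as fp:
    json.dump(response.to_dict(), fp)

# Later: load the dictionary back and inflate a Response
with open("response.json") as fp:
    restored = Response.from_dict(json.load(fp))
print(restored.text())
&lt;/code&gt;&lt;/pre&gt;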
&lt;h4 id="what-s-next-"&gt;What's next?&lt;/h4&gt;
&lt;p&gt;I'm releasing this as an alpha so I can upgrade various plugins and exercise the new design in real world environments for a few days. I expect the stable 0.32 release will be very similar to this alpha, unless alpha testing reveals some design flaw in the way I've put this all together.&lt;/p&gt;
&lt;p&gt;There's one remaining large task: I'd like to redesign the SQLite logging system to better capture the more finely grained details that are returned by this new abstraction.&lt;/p&gt;
&lt;p&gt;Ideally I'd like to model this as a graph, to best support situations like an OpenAI-style chat completions API where the same conversations are constantly extended and then repeated with every prompt. I want to be able to store those without duplicating them in the database.&lt;/p&gt;
&lt;p&gt;I'm undecided as to whether that should be a feature in 0.32 or I should hold it for 0.33.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="projects"/><category term="python"/><category term="ai"/><category term="annotated-release-notes"/><category term="generative-ai"/><category term="llms"/><category term="llm"/></entry><entry><title>Tracking the history of the now-deceased OpenAI Microsoft AGI clause</title><link href="https://simonwillison.net/2026/Apr/27/now-deceased-agi-clause/#atom-entries" rel="alternate"/><published>2026-04-27T18:38:17+00:00</published><updated>2026-04-27T18:38:17+00:00</updated><id>https://simonwillison.net/2026/Apr/27/now-deceased-agi-clause/#atom-entries</id><summary type="html">&lt;p&gt;For many years, Microsoft and OpenAI's relationship has included a weird clause saying that, should AGI be achieved, Microsoft's commercial IP rights to OpenAI's technology would be null and void. That clause appeared to end today. I decided to try and track its expression over time on &lt;a href="https://openai.com/"&gt;openai.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OpenAI, July 22nd 2019 in &lt;a href="https://openai.com/index/microsoft-invests-in-and-partners-with-openai/"&gt;Microsoft invests in and partners with OpenAI to support us building beneficial AGI&lt;/a&gt; (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenAI is producing a sequence of increasingly powerful AI technologies, which requires a lot of capital for computational power. The most obvious way to cover costs is to build a product, but that would mean changing our focus. Instead, we intend to license &lt;strong&gt;some of our pre-AGI technologies&lt;/strong&gt;, with Microsoft becoming our preferred partner for commercializing them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But what &lt;em&gt;is&lt;/em&gt; AGI? The &lt;a href="https://openai.com/charter/"&gt;OpenAI Charter&lt;/a&gt; was first published in April 2018 and has remained unchanged at least since this &lt;a href="https://web.archive.org/web/20190311213352/https://openai.com/charter/"&gt;March 11th 2019 archive.org capture&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's the problem: if you're going to sign an agreement with Microsoft that is dependent on knowing when "AGI" has been achieved, you need something a little more concrete.&lt;/p&gt;
&lt;p&gt;In December 2024 &lt;a href="https://www.theinformation.com/articles/microsoft-and-openai-wrangle-over-terms-of-their-blockbuster-partnership"&gt;The Information reported the details&lt;/a&gt; (summarized here outside of their paywall &lt;a href="https://techcrunch.com/2024/12/26/microsoft-and-openai-have-a-financial-definition-of-agi-report/"&gt;by TechCrunch&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Last year’s agreement between Microsoft and OpenAI, which hasn’t been disclosed, said AGI would be achieved only when OpenAI has developed systems that have the ability to generate the maximum total profits to which its earliest investors, including Microsoft, are entitled, according to documents OpenAI distributed to investors. Those profits total about $100 billion, the documents showed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So AGI is now whenever OpenAI's systems are capable of generating $100 billion in profit?&lt;/p&gt;
&lt;p&gt;In October 2025 the process changed to being judged by an "independent expert panel". In &lt;a href="https://openai.com/index/next-chapter-of-microsoft-openai-partnership/"&gt;The next chapter of the Microsoft–OpenAI partnership&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The agreement preserves key elements that have fueled this successful partnership—meaning OpenAI remains Microsoft’s frontier model partner and Microsoft continues to have exclusive IP rights and Azure API exclusivity until Artificial General Intelligence (AGI). [...]&lt;/p&gt;
&lt;p&gt;Once AGI is declared by OpenAI, that declaration will now be verified by an independent expert panel. [...]&lt;/p&gt;
&lt;p&gt;Microsoft’s IP rights to research, defined as the confidential methods used in the development of models and systems, will remain until either the expert panel verifies AGI or through 2030, whichever is first.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenAI on February 27th, 2026 in &lt;a href="https://openai.com/index/continuing-microsoft-partnership/"&gt;Joint Statement from OpenAI and Microsoft&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AGI definition and processes are unchanged&lt;/strong&gt;. The contractual definition of AGI and the process for determining if it has been achieved remains the same.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenAI today, April 27th 2026 in &lt;a href="https://openai.com/index/next-phase-of-microsoft-partnership/"&gt;The next phase of the Microsoft OpenAI partnership&lt;/a&gt; (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Microsoft will continue to have a license to OpenAI IP for models and products through 2032.  Microsoft’s license will now be non-exclusive.&lt;/li&gt;
&lt;li&gt;Microsoft will no longer pay a revenue share to OpenAI.&lt;/li&gt;
&lt;li&gt;Revenue share payments from OpenAI to Microsoft continue through 2030, &lt;strong&gt;independent of OpenAI’s technology progress&lt;/strong&gt;, at the same percentage but subject to a total cap.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;As far as I can tell "independent of OpenAI’s technology progress" is a declaration that the AGI clause is now dead. Here's The Verge coming to the same conclusion: &lt;a href="https://www.theverge.com/ai-artificial-intelligence/918981/openai-microsoft-renegotiate-contract"&gt;The AGI clause is dead&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My all-time favorite commentary on OpenAI's approach to AGI remains this 2023 hypothetical &lt;a href="https://www.bloomberg.com/opinion/articles/2023-11-20/who-controls-openai"&gt;by Matt Levine&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;And the investors wailed and gnashed their teeth but it’s true, that is what they agreed to, and they had no legal recourse. And OpenAI’s new CEO, and its nonprofit board, cut them a check for their capped return and said “bye” and went back to running OpenAI for the benefit of humanity. It turned out that a benign, carefully governed artificial superintelligence is really good for humanity, and OpenAI quickly solved all of humanity’s problems and ushered in an age of peace and abundance in which nobody wanted for anything or needed any Microsoft products. And capitalism came to an end.&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="computer-history"/><category term="microsoft"/><category term="ai"/><category term="openai"/></entry><entry><title>DeepSeek V4 - almost on the frontier, a fraction of the price</title><link href="https://simonwillison.net/2026/Apr/24/deepseek-v4/#atom-entries" rel="alternate"/><published>2026-04-24T06:01:04+00:00</published><updated>2026-04-24T06:01:04+00:00</updated><id>https://simonwillison.net/2026/Apr/24/deepseek-v4/#atom-entries</id><summary type="html">&lt;p&gt;Chinese AI lab DeepSeek's last model release was V3.2 (and V3.2 Speciale) &lt;a href="https://simonwillison.net/2025/Dec/1/deepseek-v32/"&gt;last December&lt;/a&gt;. They just dropped the first of their hotly anticipated V4 series in the shape of two preview models, &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro"&gt;DeepSeek-V4-Pro&lt;/a&gt; and &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash"&gt;DeepSeek-V4-Flash&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Both models are Mixture of Experts with a 1 million token context. Pro is 1.6T total parameters, 49B active. Flash is 284B total, 13B active. They're using the standard MIT license.&lt;/p&gt;
&lt;p&gt;I think this makes DeepSeek-V4-Pro the new largest open weights model. It's larger than Kimi K2.6 (1.1T) and GLM-5.1 (754B) and more than twice the size of DeepSeek V3.2 (685B).&lt;/p&gt;
&lt;p&gt;Pro is 865GB on Hugging Face and Flash is 160GB. I'm hoping that a lightly quantized Flash will run on my 128GB M5 MacBook Pro. It's &lt;em&gt;possible&lt;/em&gt; the Pro model could run on it too if I can stream just the necessary active experts from disk.&lt;/p&gt;
&lt;p&gt;For the moment I tried the models out via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt;, using &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm openrouter refresh
llm -m openrouter/deepseek/deepseek-v4-pro 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the pelican &lt;a href="https://gist.github.com/simonw/4a7a9e75b666a58a0cf81495acddf529"&gt;for DeepSeek-V4-Flash&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/deepseek-v4-flash.png" alt="Excellent bicycle - good frame shape, nice chain, even has a reflector on the front wheel. Pelican has a mean looking expression but has its wings on the handlebars and feet on the pedals. Pouch is a little sharp." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/9e8dfed68933ab752c9cf27a03250a7c"&gt;for DeepSeek-V4-Pro&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/deepseek-v4-pro.png" alt="Another solid bicycle, albeit the spokes are a little jagged and the frame is compressed a bit. Pelican has gone a bit wrong - it has a VERY large body, only one wing, a weirdly hairy backside and generally loos like it was drown be a different artist from the bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;For comparison, take a look at the pelicans I got from &lt;a href="https://simonwillison.net/2025/Dec/1/deepseek-v32/"&gt;DeepSeek V3.2 in December&lt;/a&gt;, &lt;a href="https://simonwillison.net/2025/Aug/22/deepseek-31/"&gt;V3.1 in August&lt;/a&gt;, and &lt;a href="https://simonwillison.net/2025/Mar/24/deepseek/"&gt;V3-0324 in March 2025&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So the pelicans are pretty good, but what's really notable here is the &lt;em&gt;cost&lt;/em&gt;. DeepSeek V4 is a very, very inexpensive model.&lt;/p&gt;
&lt;p&gt;This is &lt;a href="https://api-docs.deepseek.com/quick_start/pricing"&gt;DeepSeek's pricing page&lt;/a&gt;. They're charging $0.14/million tokens input and $0.28/million tokens output for Flash, and $1.74/million input and $3.48/million output for Pro.&lt;/p&gt;
&lt;p&gt;Here's a comparison table with the frontier models from Gemini, OpenAI and Anthropic:&lt;/p&gt;
&lt;center&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/M)&lt;/th&gt;
&lt;th&gt;Output ($/M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Nano&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Flash-Lite&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash Preview&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Mini&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;td&gt;$4.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.74&lt;/td&gt;
&lt;td&gt;$3.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;td&gt;$2&lt;/td&gt;
&lt;td&gt;$12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;$25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;$30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
&lt;p&gt;DeepSeek-V4-Flash is the cheapest of the small models, beating even OpenAI's GPT-5.4 Nano. DeepSeek-V4-Pro is the cheapest of the larger frontier models.&lt;/p&gt;
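&lt;p&gt;To make those differences concrete, here's a quick Python comparison for a hypothetical workload of 100,000 input tokens and 10,000 output tokens, using the prices from the table above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;prices = {  # $/million tokens (input, output)
    "DeepSeek V4 Flash": (0.14, 0.28),
    "GPT-5.4 Nano": (0.20, 1.25),
    "DeepSeek V4 Pro": (1.74, 3.48),
    "Claude Sonnet 4.6": (3.00, 15.00),
}
for model, (inp, out) in prices.items():
    cost = (100_000 * inp + 10_000 * out) / 1_000_000
    print(f"{model}: ${cost:.4f}")
# DeepSeek V4 Flash: $0.0168
# GPT-5.4 Nano: $0.0325
# DeepSeek V4 Pro: $0.2088
# Claude Sonnet 4.6: $0.4500
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That whole workload costs under 2 cents on Flash, and Pro comes in at less than half the cost of Claude Sonnet 4.6.&lt;/p&gt;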
&lt;p&gt;This note from &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/DeepSeek_V4.pdf"&gt;the DeepSeek paper&lt;/a&gt; helps explain why they can price these models so low - they've focused a great deal on efficiency with this release, especially for longer context prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the scenario of 1M-token context, even DeepSeek-V4-Pro, which has a larger number of activated parameters, attains only 27% of the single-token FLOPs (measured in equivalent FP8 FLOPs) and 10% of the KV cache size relative to DeepSeek-V3.2. Furthermore, DeepSeek-V4-Flash, with its smaller number of activated parameters, pushes efficiency even further: in the 1M-token context setting, it achieves only 10% of the single-token FLOPs and 7% of the KV cache size compared with DeepSeek-V3.2.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;DeepSeek's self-reported benchmarks &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/DeepSeek_V4.pdf"&gt;in their paper&lt;/a&gt; show their Pro model competitive with those other frontier models, albeit with this note:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Through the expansion of reasoning tokens, DeepSeek-V4-Pro-Max demonstrates superior performance relative to GPT-5.2 and Gemini-3.0-Pro on standard reasoning benchmarks. Nevertheless, its performance falls marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm keeping an eye on &lt;a href="https://huggingface.co/unsloth/models"&gt;huggingface.co/unsloth/models&lt;/a&gt; as I expect the Unsloth team will have a set of quantized versions out pretty soon. It's going to be very interesting to see how well that Flash model runs on my own machine.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>Extract PDF text in your browser with LiteParse for the web</title><link href="https://simonwillison.net/2026/Apr/23/liteparse-for-the-web/#atom-entries" rel="alternate"/><published>2026-04-23T21:54:24+00:00</published><updated>2026-04-23T21:54:24+00:00</updated><id>https://simonwillison.net/2026/Apr/23/liteparse-for-the-web/#atom-entries</id><summary type="html">&lt;p&gt;LlamaIndex have a most excellent open source project called &lt;a href="https://github.com/run-llama/liteparse"&gt;LiteParse&lt;/a&gt;, which provides a Node.js CLI tool for extracting text from PDFs. I got a version of LiteParse working entirely in the browser, using most of the same libraries that LiteParse uses to run in Node.js.&lt;/p&gt;
&lt;h4 id="spatial-text-parsing"&gt;Spatial text parsing&lt;/h4&gt;
&lt;p&gt;Refreshingly, LiteParse doesn't use AI models to do what it does: it's good old-fashioned PDF parsing, falling back to Tesseract OCR (or other pluggable OCR engines) for PDFs that contain images of text rather than the text itself.&lt;/p&gt;
&lt;p&gt;The hard problem that LiteParse solves is extracting text in a sensible order despite the infuriating vagaries of PDF layouts. They describe this as "spatial text parsing" - they use some very clever heuristics to detect things like multi-column layouts and group and return the text in a sensible linear flow.&lt;/p&gt;
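&lt;p&gt;LiteParse's actual heuristics are considerably more sophisticated, but a minimal Python sketch of the core idea - cluster text fragments into columns by x position, then read each column top to bottom - looks something like this (my illustration, not their code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def linearize(fragments, gap=50):
    """Toy reading-order heuristic for (x, y, text) fragments."""
    columns = []  # each entry: [min_x, max_x, fragments]
    for frag in sorted(fragments, key=lambda f: f[0]):
        x = frag[0]
        for col in columns:
            if x - col[1] &amp;lt; gap:  # close enough to join this column
                col[1] = max(col[1], x)
                col[2].append(frag)
                break
        else:
            columns.append([x, x, [frag]])
    # Read columns left to right, each column top to bottom
    return " ".join(
        text
        for col in sorted(columns, key=lambda c: c[0])
        for _, _, text in sorted(col[2], key=lambda f: f[1])
    )

# Two-column layout: the left column is read in full before the right one
print(linearize([(300, 10, "B1"), (10, 20, "A1"), (10, 400, "A2")]))
# A1 A2 B1
&lt;/code&gt;&lt;/pre&gt;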
&lt;p&gt;The LiteParse documentation describes a pattern for implementing &lt;a href="https://developers.llamaindex.ai/liteparse/guides/visual-citations/"&gt;Visual Citations with Bounding Boxes&lt;/a&gt;. I really like this idea: being able to answer questions from a PDF and accompany those answers with cropped, highlighted images feels like a great way of increasing the credibility of answers from RAG-style Q&amp;amp;A.&lt;/p&gt;
&lt;p&gt;LiteParse is provided as a pure CLI tool, designed to be used by agents. You run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npm i -g @llamaindex/liteparse
lit parse document.pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I &lt;a href="https://claude.ai/share/44a5ed86-e5b5-4e14-90be-1eba1e0acd13"&gt;explored its capabilities with Claude&lt;/a&gt; and quickly determined that there was no real reason it had to stay a CLI app: it's built on top of PDF.js and Tesseract.js, two libraries I've used for something similar in a browser &lt;a href="https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/"&gt;in the past&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The only reason LiteParse didn't have a pure browser-based version is that nobody had built one yet...&lt;/p&gt;
&lt;h4 id="introducing-liteparse-for-the-web"&gt;Introducing LiteParse for the web&lt;/h4&gt;
&lt;p&gt;Visit &lt;a href="https://simonw.github.io/liteparse/"&gt;https://simonw.github.io/liteparse/&lt;/a&gt; to try out LiteParse against any PDF file, running entirely in your browser. Here's what that looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/liteparse-web.jpg" alt="Screenshot of the LiteParse browser demo web page. Header reads &amp;quot;LiteParse&amp;quot; with subtitle &amp;quot;Browser demo of LiteParse — parse PDFs in your browser. Nothing leaves your machine.&amp;quot; A dashed-border drop zone says &amp;quot;Drop a PDF here or click to choose / Your file stays in your browser.&amp;quot; with a file pill labeled &amp;quot;19720005243.pdf&amp;quot;. Below are a checked &amp;quot;Run OCR&amp;quot; checkbox, an unchecked &amp;quot;Render page screenshots&amp;quot; checkbox, and a blue &amp;quot;Parse&amp;quot; button. Status text: &amp;quot;Parsed 86 pages.&amp;quot; Two side-by-side panels follow. Left panel titled &amp;quot;Text&amp;quot; with a Copy button shows monospace extracted text beginning &amp;quot;Apollo 5 was an unmanned system, both propulsion systems ascent and descent stages&amp;quot;. Right panel titled &amp;quot;JSON&amp;quot;, also with a copy button, contains JSON showing the dimensions and position and detected font of each piece of text." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The tool works with or without OCR, and can optionally render a screenshot of every page in the PDF further down the page.&lt;/p&gt;
&lt;h4 id="building-it-with-claude-code-and-opus-4-7"&gt;Building it with Claude Code and Opus 4.7&lt;/h4&gt;
&lt;p&gt;The process of building this started in the regular Claude app on my iPhone. I wanted to try out LiteParse myself, so I started by uploading a random PDF I happened to have on my phone along with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Clone https://github.com/run-llama/liteparse and try it against this file&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Regular Claude chat can clone directly from GitHub these days, and while by default it can't access most of the internet from its container it can also install packages from PyPI and npm.&lt;/p&gt;
&lt;p&gt;I often use this to try out new pieces of open source software on my phone - it's a quick way to exercise something without having to sit down with my laptop.&lt;/p&gt;
&lt;p&gt;You can follow my full conversation in &lt;a href="https://claude.ai/share/44a5ed86-e5b5-4e14-90be-1eba1e0acd13"&gt;this shared Claude transcript&lt;/a&gt;. I asked a few follow-up questions about how it worked, and then asked:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Does this library run in a browser? Could it?&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This gave me a thorough enough answer that I was convinced it was worth trying getting that to work for real. I opened up my laptop and switched to Claude Code.&lt;/p&gt;
&lt;p&gt;I forked the original repo on GitHub, cloned a local copy, started a new &lt;code&gt;web&lt;/code&gt; branch and pasted that last reply from Claude into a new file called &lt;a href="https://github.com/simonw/liteparse/blob/web/notes.md"&gt;notes.md&lt;/a&gt;. Then I told Claude Code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Get this working as a web app. index.html, when loaded, should render an app that lets users open a PDF in their browser and select OCR or non-OCR mode and have this run. Read notes.md for initial research on this problem, then write out plan.md with your detailed implementation plan&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I always like to start with a plan for this kind of project. Sometimes I'll use Claude's "planning mode", but in this case I knew I'd want the plan as an artifact in the repository so I told it to write &lt;code&gt;plan.md&lt;/code&gt; directly.&lt;/p&gt;
&lt;p&gt;This also means I can iterate on the plan with Claude. I noticed that Claude had decided to punt on generating page screenshots, suggesting we defer a "canvas-encode swap" to v2. I fixed that by prompting:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Update the plan to say we WILL do the canvas-encode swap so the screenshots thing works&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After a few short follow-up prompts, here's the &lt;a href="https://github.com/simonw/liteparse/blob/web/plan.md"&gt;plan.md&lt;/a&gt; I thought was strong enough to implement.&lt;/p&gt;
&lt;p&gt;I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;build it.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then mostly left Claude Code to its own devices, tinkered with some other projects, caught up on Duolingo and occasionally checked in to see how it was doing.&lt;/p&gt;
&lt;p&gt;I added a few prompts to the queue as I was working. Those don't yet show up in my exported transcript, but it turns out running &lt;code&gt;rg queue-operation --no-filename | grep enqueue | jq -r '.content'&lt;/code&gt; in the relevant &lt;code&gt;~/.claude/projects/&lt;/code&gt; folder extracts them.&lt;/p&gt;
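&lt;p&gt;Here's a Python equivalent of that pipeline - this assumes each matching line is a JSON object with a &lt;code&gt;content&lt;/code&gt; key, stored in &lt;code&gt;.jsonl&lt;/code&gt; files under that folder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import pathlib

# Same idea as: rg queue-operation --no-filename | grep enqueue | jq -r '.content'
projects = pathlib.Path.home() / ".claude" / "projects"
for path in projects.rglob("*.jsonl"):
    for line in path.read_text().splitlines():
        if "queue-operation" in line and "enqueue" in line:
            print(json.loads(line)["content"])
&lt;/code&gt;&lt;/pre&gt;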
&lt;p&gt;Here are the key follow-up prompts with some notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;When you implement this use playwright and red/green TDD, plan that too&lt;/code&gt; - I've written more &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/"&gt;about red/green TDD here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;let's use PDF.js's own renderer&lt;/code&gt; (it was messing around with pdfium)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;The final UI should include both the text and the pretty-printed JSON output, both of those in textareas and both with copy-to-clipboard buttons - it should also be mobile friendly&lt;/code&gt; - I had a new idea for how the UI should work&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;small commits along the way&lt;/code&gt; - see below&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Make sure the index.html page includes a link back to https://github.com/run-llama/liteparse near the top of the page&lt;/code&gt; - it's important to credit your dependencies in a project like this!&lt;/li&gt;
&lt;li&gt;&lt;code&gt;View on GitHub → is bad copy because that's not the repo with this web app in, it's the web app for the underlying LiteParse library&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Run OCR should be unchecked by default&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;When I try to parse a PDF in my browser I see 'Parse failed: undefined is not a function (near '...value of readableStream...')&lt;/code&gt; - it was testing with Playwright in Chrome, turned out there was a bug in Safari&lt;/li&gt;
&lt;li&gt;&lt;code&gt;... oh that is in safari but it works in chrome&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;When "Copy" is clicked the text should change to "Copied!" for 1.5s&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;[Image #1] Style the file input so that long filenames don't break things on Firefox like this - in fact add one of those drag-drop zone UIs which you can also click to select a file&lt;/code&gt; - dropping screenshots in of small UI glitches works surprisingly well&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Tweak the drop zone such that the text is vertically centered, right now it is a bit closer to the top&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;it breaks in Safari on macOS, works in both Chrome and Firefox. On Safari I see "Parse failed: undefined is not a function (near '...value of readableStream...')" after I click the Parse button, when OCR is not checked&lt;/code&gt; - it still wasn't working in Safari...&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;works in safari now&lt;/code&gt; - it fixed the problem pretty quickly once I pointed it out and got Playwright working with that browser&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I've started habitually asking for "small commits along the way" because it makes for code that's easier to understand or review later on, and I have an unproven hunch that it helps the agent work more effectively too - it's yet another encouragement towards planning and taking on one problem at a time.&lt;/p&gt;
&lt;p&gt;While it was working I decided it would be nice to be able to interact with an in-progress version.  I asked a separate Claude Code session against the same directory for tips on how to run it, and it told me to use &lt;code&gt;npx vite&lt;/code&gt;. Running that started a development server with live-reloading, which meant I could instantly see the effect of each change it made on disk - and prompt with further requests for tweaks and fixes.&lt;/p&gt;
&lt;p&gt;Towards the end I decided it was going to be good enough to publish. I started a fresh Claude Code instance and told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Look at the web/ folder - set up GitHub actions for this repo such that any push runs the tests, and if the tests pass it then does a GitHub Pages deploy of the built vite app such that the web/index.html page is the index.html page for the thing that is deployed and it works on GitHub Pages&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After a bit more iteration &lt;a href="https://github.com/simonw/liteparse/blob/web/.github/workflows/deploy-web.yml"&gt;here's the GitHub Actions workflow&lt;/a&gt; that builds the app using Vite and deploys the result to &lt;a href="https://simonw.github.io/liteparse/"&gt;https://simonw.github.io/liteparse/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I love GitHub Pages for this kind of thing because it can be quickly configured (by Claude, in this case) to turn any repository into a deployed web-app, at zero cost and with whatever build step is necessary. It even works against private repos, if you don't mind your only security being a secret URL.&lt;/p&gt;
&lt;p&gt;With this kind of project there's always a major risk that the model might "cheat" - mark key features as "TODO" and fake them, or take shortcuts that ignore the initial requirements.&lt;/p&gt;
&lt;p&gt;The responsible way to prevent this is to review all of the code... but this wasn't intended as that kind of project, so instead I fired up OpenAI Codex with GPT-5.5 (I had preview access) and told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Describe the difference between how the node.js CLI tool runs and how the web/ version runs&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The answer I got back was enough to give me confidence that Claude hadn't taken any project-threatening shortcuts.&lt;/p&gt;
&lt;p&gt;... and that was about it. Total time in Claude Code for that "build it" step was 59 minutes. I used my &lt;a href="https://github.com/simonw/claude-code-transcripts"&gt;claude-code-transcripts&lt;/a&gt; tool to export a readable version of the full transcript which you can &lt;a href="https://gisthost.github.io/?d64889bfc1b897fea3867adfec62ed89/index.html"&gt;view here&lt;/a&gt;, albeit without those additional queued prompts (here's my &lt;a href="https://github.com/simonw/claude-code-transcripts/issues/98"&gt;issue to fix that&lt;/a&gt;).&lt;/p&gt;
&lt;h4 id="is-this-even-vibe-coding-any-more-"&gt;Is this even vibe coding any more?&lt;/h4&gt;
&lt;p&gt;I'm a pedantic stickler when it comes to &lt;a href="https://simonwillison.net/2025/Mar/19/vibe-coding/"&gt;the original definition of vibe coding&lt;/a&gt; - vibe coding does &lt;em&gt;not&lt;/em&gt; mean any time you use AI to help you write code, it's when you use AI without reviewing or caring about the code that's written at all.&lt;/p&gt;
&lt;p&gt;By my own definition, this LiteParse for the web project is about as pure vibe coding as you can get! I have not looked at a &lt;em&gt;single line&lt;/em&gt; of the HTML and TypeScript written for this project - in fact while writing this sentence I had to go and check if it had used JavaScript or TypeScript.&lt;/p&gt;
&lt;p&gt;Yet somehow this one doesn't feel as vibe coded to me as many of my other vibe coded projects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;As a static in-browser web application hosted on GitHub Pages, the blast radius for any bugs is almost non-existent: it either works for your PDF or it doesn't.&lt;/li&gt;
&lt;li&gt;No private data is transferred anywhere - all processing happens in your browser - so a security audit is unnecessary. I've glanced once at the network panel while it's running and no additional requests are made when a PDF is being parsed.&lt;/li&gt;
&lt;li&gt;There was still a whole lot of engineering experience and knowledge required to use the models in this way. Identifying that LiteParse could be ported to run directly in a browser was critical to the rest of the project.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most importantly, I'm happy to attach my reputation to this project and recommend that other people try it out. Unlike most of my vibe coded tools I'm not convinced that spending significant additional engineering time on this would have resulted in a meaningfully better initial release. It's fine as it is!&lt;/p&gt;
&lt;p&gt;I haven't opened a PR against the &lt;a href="https://github.com/run-llama/liteparse"&gt;origin repository&lt;/a&gt; because I've not discussed it with the LiteParse team. I've &lt;a href="https://github.com/run-llama/liteparse/issues/147"&gt;opened an issue&lt;/a&gt;, and if they want my vibe coded implementation as a starting point for something more official they're welcome to take it.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="javascript"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vibe-coding"/><category term="coding-agents"/><category term="claude-code"/><category term="agentic-engineering"/></entry><entry><title>A pelican for GPT-5.5 via the semi-official Codex backdoor API</title><link href="https://simonwillison.net/2026/Apr/23/gpt-5-5/#atom-entries" rel="alternate"/><published>2026-04-23T19:59:47+00:00</published><updated>2026-04-23T19:59:47+00:00</updated><id>https://simonwillison.net/2026/Apr/23/gpt-5-5/#atom-entries</id><summary type="html">&lt;p&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-5/"&gt;GPT-5.5 is out&lt;/a&gt;. It's available in OpenAI Codex and is rolling out to paid ChatGPT subscribers. I've had some preview access and found it to be a fast, effective and highly capable model. As is usually the case these days, it's hard to put into words what's good about it - I ask it to build things and it builds exactly what I ask for!&lt;/p&gt;
&lt;p&gt;There's one notable omission from today's release - the API:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale. We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When I run my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican benchmark&lt;/a&gt; I always prefer to use an API, to prevent hidden system prompts in ChatGPT or other agent harnesses from impacting the results.&lt;/p&gt;
&lt;h4 id="the-openclaw-backdoor"&gt;The OpenClaw backdoor&lt;/h4&gt;
&lt;p&gt;One of the ongoing tension points in the AI world over the past few months has concerned how agent harnesses like OpenClaw and Pi interact with the APIs provided by the big providers.&lt;/p&gt;
&lt;p&gt;Both OpenAI and Anthropic offer popular monthly subscriptions which provide access to their models at a significant discount to their raw API.&lt;/p&gt;
&lt;p&gt;OpenClaw integrated directly with this mechanism, and was then &lt;a href="https://www.theverge.com/ai-artificial-intelligence/907074/anthropic-openclaw-claude-subscription-ban"&gt;blocked from doing so&lt;/a&gt; by Anthropic. This kicked off a whole thing. OpenAI - who recently hired OpenClaw creator Peter Steinberger - saw an opportunity for an easy karma win and announced that OpenClaw was welcome to continue integrating with OpenAI's subscriptions via the same mechanism used by their (open source) Codex CLI tool.&lt;/p&gt;
&lt;p&gt;Does this mean &lt;em&gt;anyone&lt;/em&gt; can write code that integrates with OpenAI's Codex-specific APIs to hook into those existing subscriptions?&lt;/p&gt;
&lt;p&gt;The other day &lt;a href="https://twitter.com/jeremyphoward/status/2046537816834965714"&gt;Jeremy Howard asked&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Anyone know whether OpenAI officially supports the use of the &lt;code&gt;/backend-api/codex/responses&lt;/code&gt; endpoint that Pi and Opencode (IIUC) uses?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It turned out that on March 30th OpenAI's Romain Huet &lt;a href="https://twitter.com/romainhuet/status/2038699202834841962"&gt;had tweeted&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We want people to be able to use Codex, and their ChatGPT subscription, wherever they like! That means in the app, in the terminal, but also in JetBrains, Xcode, OpenCode, Pi, and now Claude Code.&lt;/p&gt;
&lt;p&gt;That’s why Codex CLI and Codex app server are open source too! 🙂&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And Peter Steinberger &lt;a href="https://twitter.com/steipete/status/2046775849769148838"&gt;replied to Jeremy&lt;/a&gt; that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenAI sub is officially supported.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="llm-openai-via-codex"&gt;llm-openai-via-codex&lt;/h4&gt;
&lt;p&gt;So... I had Claude Code reverse-engineer the &lt;a href="https://github.com/openai/codex"&gt;openai/codex&lt;/a&gt; repo, figure out how authentication tokens were stored and build me &lt;a href="https://github.com/simonw/llm-openai-via-codex"&gt;llm-openai-via-codex&lt;/a&gt;, a new plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; which picks up your existing Codex subscription and uses it to run prompts!&lt;/p&gt;
&lt;p&gt;(With hindsight I wish I'd used GPT-5.4 or the GPT-5.5 preview - it would have been funnier. I genuinely considered rewriting the project from scratch using Codex and GPT-5.5 for the sake of the joke, but decided not to spend any more time on this!)&lt;/p&gt;
&lt;p&gt;Here's how to use it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install Codex CLI, buy an OpenAI plan, login to Codex&lt;/li&gt;
&lt;li&gt;Install LLM: &lt;code&gt;uv tool install llm&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Install the new plugin: &lt;code&gt;llm install llm-openai-via-codex&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Start prompting: &lt;code&gt;llm -m openai-codex/gpt-5.5 'Your prompt goes here'&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All existing LLM features should also work - use &lt;code&gt;-a filepath.jpg/URL&lt;/code&gt; to attach an image, &lt;code&gt;llm chat -m openai-codex/gpt-5.5&lt;/code&gt; to start an ongoing chat, &lt;code&gt;llm logs&lt;/code&gt; to view logged conversations and &lt;code&gt;llm --tool ...&lt;/code&gt; to &lt;a href="https://llm.datasette.io/en/stable/tools.html"&gt;try it out with tool support&lt;/a&gt;.&lt;/p&gt;
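&lt;p&gt;LLM's Python API should work too. Here's a minimal sketch, assuming the plugin registers its models through the standard plugin hooks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import llm

# Picks up the same Codex credentials as the CLI
model = llm.get_model("openai-codex/gpt-5.5")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
print(response.text())
&lt;/code&gt;&lt;/pre&gt;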
&lt;h4 id="and-some-pelicans"&gt;And some pelicans&lt;/h4&gt;
&lt;p&gt;Let's generate a pelican!&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openai-via-codex
llm -m openai-codex/gpt-5.5 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/edda1d98f7ba07fd95eeff473cb16634"&gt;what I got back&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gpt-5.5-pelican.png" alt="It is a bit mangled to be honest - good beak, pelican body shapes are slightly weird, legs do at least extend to the pedals, bicycle frame is not quite right." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I've seen better &lt;a href="https://simonwillison.net/2026/Mar/17/mini-and-nano/#pelicans"&gt;from GPT-5.4&lt;/a&gt;, so I tagged on &lt;code&gt;-o reasoning_effort xhigh&lt;/code&gt; and &lt;a href="https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602cc5"&gt;tried again&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;That one took almost four minutes to generate, but I think it's a much better effort.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gpt-5.5-pelican-xhigh.png" alt="Pelican has gradients now, body is much better put together, bicycle is nearly the right shape albeit with one extra bar between pedals and front wheel, clearly a better image overall." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;If you compare the SVG code (&lt;a href="https://gist.github.com/simonw/edda1d98f7ba07fd95eeff473cb16634#response"&gt;default&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602cc5#response"&gt;xhigh&lt;/a&gt;) the &lt;code&gt;xhigh&lt;/code&gt; one took a very different approach, which is much more CSS-heavy - as demonstrated by those gradients. &lt;code&gt;xhigh&lt;/code&gt; used 9,322 reasoning tokens where the default used just 39.&lt;/p&gt;
&lt;h4 id="a-few-more-notes-on-gpt-5-5"&gt;A few more notes on GPT-5.5&lt;/h4&gt;
&lt;p&gt;One of the most notable things about GPT-5.5 is the pricing. Once it goes live in the API it's &lt;a href="https://openai.com/index/introducing-gpt-5-5/#availability-and-pricing"&gt;going to be priced&lt;/a&gt; at &lt;em&gt;twice&lt;/em&gt; the cost of GPT-5.4 - $5 per 1M input tokens and $30 per 1M output tokens, where 5.4 is $2.50 and $15.&lt;/p&gt;
&lt;p&gt;GPT-5.5 Pro will be even more: $30 per 1M input tokens and $180 per 1M output tokens.&lt;/p&gt;
&lt;p&gt;GPT-5.4 will remain available. At half the price, 5.4 now sits relative to 5.5 much as Claude Sonnet does to Claude Opus.&lt;/p&gt;
&lt;p&gt;Ethan Mollick has a &lt;a href="https://www.oneusefulthing.org/p/sign-of-the-future-gpt-55"&gt;detailed review of GPT-5.5&lt;/a&gt; where he put it (and GPT-5.5 Pro) through an array of interesting challenges. His verdict: the jagged frontier continues to hold, with GPT-5.5 excellent at some things and challenged by others in a way that remains difficult to predict.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="codex-cli"/><category term="gpt"/></entry><entry><title>Is Claude Code going to cost $100/month? Probably not - it's all very confusing</title><link href="https://simonwillison.net/2026/Apr/22/claude-code-confusion/#atom-entries" rel="alternate"/><published>2026-04-22T02:07:34+00:00</published><updated>2026-04-22T02:07:34+00:00</updated><id>https://simonwillison.net/2026/Apr/22/claude-code-confusion/#atom-entries</id><summary type="html">&lt;p&gt;Anthropic today quietly (as in &lt;em&gt;silently&lt;/em&gt;, no announcement anywhere at all) updated their &lt;a href="https://claude.com/pricing"&gt;claude.com/pricing&lt;/a&gt; page (but not their &lt;a href="https://support.claude.com/en/articles/11049762-choosing-a-claude-plan"&gt;Choosing a Claude plan page&lt;/a&gt;, which shows up first for me on Google) to add this tiny but significant detail (arrow is mine, &lt;a href="https://simonwillison.net/2026/Apr/22/claude-code-confusion/#they-reversed-it"&gt;and it's already reverted&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/anthropic-x.jpg" alt="Screenshot of the Claude pricing grid - Compare features across plans. Free, Pro, Max 5x and Max 20x all have the same features, with the exception of Claude Code which is on Max only and Claude Cowork which is on Pro and Max only. An arrow highlights the Claude Code for Pro cross." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://web.archive.org/web/20260421040656/claude.com/pricing"&gt;Internet Archive copy&lt;/a&gt; from yesterday shows a checkbox there. Claude Code used to be a feature of the $20/month Pro plan, but according to the new pricing page it is now exclusive to the $100/month or $200/month Max plans.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: don't miss &lt;a href="https://simonwillison.net/2026/Apr/22/claude-code-confusion/#they-reversed-it"&gt;the update to this post&lt;/a&gt;, they've already changed course a few hours after this change went live.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;So what the heck is going on? Unsurprisingly, &lt;a href="https://www.reddit.com/r/ClaudeAI/comments/1srzhd7/psa_claude_pro_no_longer_lists_claude_code_as_an/"&gt;Reddit&lt;/a&gt; and &lt;a href="https://news.ycombinator.com/item?id=47854477"&gt;Hacker News&lt;/a&gt; and &lt;a href="https://twitter.com/i/trending/2046718768634589239"&gt;Twitter&lt;/a&gt; all caught fire.&lt;/p&gt;
&lt;p&gt;I didn't believe the screenshots myself when I first saw them - aside from the pricing grid I could find no announcement from Anthropic anywhere. Then Amol Avasare, Anthropic's Head of Growth, &lt;a href="https://twitter.com/TheAmolAvasare/status/2046724659039932830"&gt;tweeted&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For clarity, we're running a small test on ~2% of new prosumer signups. Existing Pro and Max subscribers aren't affected.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And that appears to be the closest we have had to official messaging from Anthropic.&lt;/p&gt;
&lt;p&gt;I don't buy the "~2% of new prosumer signups" thing, since everyone I've talked to is seeing the new pricing grid and the Internet Archive has already &lt;a href="https://web.archive.org/web/20260422001250/https://claude.com/pricing"&gt;snapped a copy&lt;/a&gt;. Maybe he means that they'll only be running this version of the pricing grid for a limited time which somehow adds up to "2%" of signups?&lt;/p&gt;
&lt;p&gt;I'm also amused to see Claude Cowork remain available on the $20/month plan, because Claude Cowork is effectively a rebranded version of Claude Code wearing a less threatening hat!&lt;/p&gt;
&lt;p&gt;There are a whole bunch of things that are bad about this.&lt;/p&gt;
&lt;p&gt;If we assume this is indeed a test, and that test comes up negative and they decide not to go ahead with it, the damage has still been extensive:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A whole lot of people got scared or angry or both that a service they relied on was about to be rug-pulled. There really is a significant difference between $20/month and $100/month for most people, especially outside of higher salary countries.&lt;/li&gt;
&lt;li&gt;The uncertainty is really bad! A tweet from an employee is &lt;em&gt;not&lt;/em&gt; the way to make an announcement like this. I wasted a solid hour of my afternoon trying to figure out what had happened here. My trust in Anthropic's transparency around pricing - a &lt;em&gt;crucial factor&lt;/em&gt; in how I understand their products - has been shaken.&lt;/li&gt;
&lt;li&gt;Strategically, should I be taking a bet on Claude Code if I know that they might 5x the minimum price of the product?&lt;/li&gt;
&lt;li&gt;More of a personal issue, but one I care deeply about myself: I invest a &lt;a href="https://simonwillison.net/tags/claude-code/"&gt;great deal of effort&lt;/a&gt; (that's 105 posts and counting) in teaching people how to use Claude Code. I don't want to invest that effort in a product that most people cannot afford to use.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Last month I ran &lt;a href="https://simonw.github.io/nicar-2026-coding-agents/"&gt;a tutorial for journalists&lt;/a&gt; on "Coding agents for data analysis" at the annual NICAR data journalism conference. I'm not going to be teaching that audience a course that depends on a $100/month subscription!&lt;/p&gt;
&lt;p&gt;This also doesn't make sense to me as a strategy for Anthropic. Claude Code &lt;em&gt;defined the category&lt;/em&gt; of coding agents. It's responsible for billions of dollars in annual revenue for Anthropic already. It has a stellar reputation, but I'm not convinced that reputation is strong enough for it to lose the $20/month trial and jump people directly to a $100/month subscription.&lt;/p&gt;
&lt;p&gt;OpenAI have been investing heavily in catching up to Claude Code with their Codex products. Anthropic just handed them this marketing opportunity on a plate - here's Codex engineering lead &lt;a href="https://twitter.com/thsottiaux/status/2046740759056162816"&gt;Thibault Sottiaux&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I don't know what they are doing over there, but Codex will continue to be available both in the FREE and PLUS ($20) plans. We have the compute and efficient models to support it. For important changes, we will engage with the community well ahead of making them.&lt;/p&gt;
&lt;p&gt;Transparency and trust are two principles we will not break, even if it means momentarily earning less. A reminder that you vote with your subscription for the values you want to see in this world.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I should note that I pay $200/month for Claude Max and I consider it well worth the money. I've had periods of free access in the past courtesy of Anthropic but I'm currently paying full price, and happy to do so.&lt;/p&gt;
&lt;p&gt;But I care about the accessibility of the tools that I work with and teach. If Codex has a free tier while Claude Code starts at $100/month I should obviously switch to Codex, because that way I can use the same tool as the people I want to teach how to use coding agents.&lt;/p&gt;
&lt;p&gt;Here's what I think happened. I think Anthropic are trying to optimize revenue growth - obviously - and someone pitched making Claude Code only available for Max and higher. That's clearly a bad idea, but "testing" culture says that it's worth putting even bad ideas out to test just in case they surprise you.&lt;/p&gt;
&lt;p&gt;So they started a test, without taking into account the wailing and gnashing of teeth that would result when their test was noticed - or accounting for the longer-term brand damage that would be caused.&lt;/p&gt;
&lt;p&gt;Or maybe they &lt;em&gt;did&lt;/em&gt; account for that, and decided it was worth the risk.&lt;/p&gt;
&lt;p&gt;I don't think that calculation was worthwhile. They're going to have to make a &lt;em&gt;very&lt;/em&gt; firm commitment along the lines of "we heard your feedback and we commit to keeping Claude Code available on our $20/month plan going forward" to regain my trust.&lt;/p&gt;
&lt;p&gt;As it stands, Codex is looking like a much safer bet for me to invest my time in learning and building educational materials around.&lt;/p&gt;
&lt;h4 id="they-reversed-it"&gt;Update: they've reversed it already&lt;/h4&gt;
&lt;p&gt;In the time I was &lt;em&gt;typing this blog entry&lt;/em&gt; Anthropic appear to have reversed course - the &lt;a href="https://claude.com/pricing"&gt;claude.com/pricing page&lt;/a&gt; now has a checkbox back in the Pro column for Claude Code. I can't find any official communication about it though.&lt;/p&gt;
&lt;p&gt;Let's see if they can come up with an explanation/apology that's convincing enough to offset the trust bonfire from this afternoon!&lt;/p&gt;
&lt;h4 id="update-2"&gt;Update 2: it may still affect 2% of signups?&lt;/h4&gt;
&lt;p&gt;Amol &lt;a href="https://x.com/TheAmolAvasare/status/2046788872517066971"&gt;on Twitter&lt;/a&gt;:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;was a mistake that the logged-out landing page and docs were updated for this test [&lt;a href="https://twitter.com/TheAmolAvasare/status/2046783926920978681"&gt;embedded self-tweet&lt;/a&gt;]&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Getting lots of questions on why the landing page / docs were updated if only 2% of new signups were affected.&lt;/p&gt;

&lt;p&gt;This was understandably confusing for the 98% of folks not part of the experiment, and we've reverted both the landing page and docs changes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;p&gt;So the experiment is still running, just not visible to the rest of the world?&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="llm-pricing"/><category term="ai-ethics"/><category term="coding-agents"/><category term="claude-code"/><category term="codex-cli"/></entry><entry><title>Where's the raccoon with the ham radio? (ChatGPT Images 2.0)</title><link href="https://simonwillison.net/2026/Apr/21/gpt-image-2/#atom-entries" rel="alternate"/><published>2026-04-21T20:32:24+00:00</published><updated>2026-04-21T20:32:24+00:00</updated><id>https://simonwillison.net/2026/Apr/21/gpt-image-2/#atom-entries</id><summary type="html">&lt;p&gt;OpenAI &lt;a href="https://openai.com/index/introducing-chatgpt-images-2-0/"&gt;released ChatGPT Images 2.0 today&lt;/a&gt;, their latest image generation model. On &lt;a href="https://www.youtube.com/watch?v=sWkGomJ3TLI"&gt;the livestream&lt;/a&gt; Sam Altman said that the leap from gpt-image-1 to gpt-image-2 was equivalent to jumping from GPT-3 to GPT-5. Here's how I put it to the test.&lt;/p&gt;
&lt;p&gt;My prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Do a where's Waldo style image but it's where is the raccoon holding a ham radio&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="gpt-image-1"&gt;gpt-image-1&lt;/h4&gt;
&lt;p&gt;First as a baseline here's what I got from the older gpt-image-1 using ChatGPT directly:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2026/chatgpt-image-1-ham-radio.png"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/image_crop_1402x1122_w1402_q0.3.jpg" alt="There's a lot going on, but I couldn't find a raccoon." style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I wasn't able to spot the raccoon - I quickly realized that testing image generation models on Where's Waldo style images (Where's Wally in the UK) can be pretty frustrating!&lt;/p&gt;
&lt;p&gt;I tried &lt;a href="https://claude.ai/share/bd6e9b88-29a9-420b-8ac1-3ac5cebac215"&gt;getting Claude Opus 4.7&lt;/a&gt; with its new higher resolution inputs to solve it but it was convinced there was a raccoon it couldn't find thanks to the instruction card at the top left of the image:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Yes — there's at least one raccoon in the picture, but it's very well hidden&lt;/strong&gt;. In my careful sweep through zoomed-in sections, honestly, I couldn't definitively spot a raccoon holding a ham radio. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="nano-banana-2-and-pro"&gt;Nano Banana 2 and Pro&lt;/h4&gt;
&lt;p&gt;Next I tried Google's Nano Banana 2, &lt;a href="https://gemini.google.com/share/3775db96c576"&gt;via Gemini&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2026/nano-banana-2-ham-radio.jpg"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/gemini-ham-radio-small.jpg" alt="Busy Where's Waldo-style illustration of a park festival with crowds of people, tents labeled &amp;quot;FOOD &amp;amp; DRINK&amp;quot;, &amp;quot;CRAFT FAIR&amp;quot;, &amp;quot;BOOK NOOK&amp;quot;, &amp;quot;MUSIC FEST&amp;quot;, and &amp;quot;AMATEUR RADIO CLUB - W6HAM&amp;quot; (featuring a raccoon in a red hat at the radio table), plus a Ferris wheel, carousel, gazebo with band, pond with boats, fountain, food trucks, and striped circus tents" style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That one was pretty obvious, the raccoon is in the "Amateur Radio Club" booth in the center of the image!&lt;/p&gt;
&lt;p&gt;Claude said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Honestly, this one wasn't really hiding — he's the star of the booth. Feels like the illustrator took pity on us after that last impossible scene. The little "W6HAM" callsign pun on the booth sign is a nice touch too.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I also tried Nano Banana Pro &lt;a href="https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221sGU5A7mrngkfLfSEU84xaV1DhtOTnS--%22%5D,%22action%22:%22open%22,%22userId%22:%22106366615678321494423%22,%22resourceKeys%22:%7B%7D%7D&amp;amp;usp=sharing"&gt;in AI Studio&lt;/a&gt; and got this, by far the worst result from any model. Not sure what went wrong here!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2026/nano-banana-pro-ham-radio.jpg"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/nano-banana-pro-ham-radio-small.jpg" alt="The raccoon is larger than everyone else, right in the middle of the image with an ugly white border around it." style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4 id="gpt-image-2"&gt;gpt-image-2&lt;/h4&gt;
&lt;p&gt;With the baseline established, let's try out the new model.&lt;/p&gt;
&lt;p&gt;I used an updated version of my &lt;a href="https://github.com/simonw/tools/blob/main/python/openai_image.py"&gt;openai_image.py&lt;/a&gt; script, which is a thin wrapper around the &lt;a href="https://github.com/openai/openai-python"&gt;OpenAI Python&lt;/a&gt; client library. Their client library hasn't yet been updated to include &lt;code&gt;gpt-image-2&lt;/code&gt; but thankfully it doesn't validate the model ID so you can use it anyway.&lt;/p&gt;
&lt;p&gt;Here's how I ran that:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;OPENAI_API_KEY=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;llm keys get openai&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  uv run https://tools.simonwillison.net/python/openai_image.py \
  -m gpt-image-2 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Do a where's Waldo style image but it's where is the raccoon holding a ham radio&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's what I got back. I don't &lt;em&gt;think&lt;/em&gt; there's a raccoon in there - I couldn't spot one, and neither could Claude.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2026/gpt-image-2-default.png"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/gpt-image-2-default.jpg" alt="Lots of stuff, a ham radio booth, many many people, a lake, but maybe no raccoon?" style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/multimodal/image-gen-models-prompting-guide.ipynb"&gt;OpenAI image generation cookbook&lt;/a&gt; has been updated with notes on &lt;code&gt;gpt-image-2&lt;/code&gt;, including the &lt;code&gt;outputQuality&lt;/code&gt; setting and available sizes.&lt;/p&gt;
&lt;p&gt;I tried setting &lt;code&gt;outputQuality&lt;/code&gt; to &lt;code&gt;high&lt;/code&gt; and the dimensions to &lt;code&gt;3840x2160&lt;/code&gt; - I believe that's the maximum - and got this - a 17MB PNG which I converted to a 5MB WEBP:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;OPENAI_API_KEY=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;llm keys get openai&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  uv run &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://raw.githubusercontent.com/simonw/tools/refs/heads/main/python/openai_image.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -m gpt-image-2 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Do a where's Waldo style image but it's where is the raccoon holding a ham radio&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  --quality high --size 3840x2160&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2026/image-fc93bd-q100.webp"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/image-fc93bd-q100.jpg" alt="Big complex image, lots of detail, good wording, there is indeed a raccoon with a ham radio." style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That's pretty great! There's a raccoon with a ham radio in there (bottom left, quite easy to spot).&lt;/p&gt;
&lt;p&gt;The image used 13,342 output tokens, which are charged at $30/million so a total cost of around &lt;a href="https://www.llm-prices.com/#ot=13342&amp;amp;ic=5&amp;amp;cic=1.25&amp;amp;oc=10&amp;amp;sel=gpt-image-2-image"&gt;40 cents&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="takeaways"&gt;Takeaways&lt;/h4&gt;
&lt;p&gt;I think this new ChatGPT image generation model takes the crown from Gemini, at least for the moment.&lt;/p&gt;
&lt;p&gt;Where's Waldo style images are an infuriating and somewhat foolish way to test these models, but they do help illustrate how good they are getting at complex illustrations combining both text and details.&lt;/p&gt;
&lt;h4 id="update-asking-models-to-solve-this-is-risky"&gt;Update: asking models to solve this is risky&lt;/h4&gt;
&lt;p&gt;rizaco &lt;a href="https://news.ycombinator.com/item?id=47852835#47853561"&gt;on Hacker News&lt;/a&gt; asked ChatGPT to draw a red circle around the raccoon in one of the images in which I had failed to find one. Here's an animated mix of their result and the original image:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/ham-radio-cheat.gif" alt="The circle appears around a raccoon with a ham radio who is definitely not there in the original image!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Looks like we definitely can't trust these models to usefully solve their own puzzles!&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="text-to-image"/><category term="llm-release"/><category term="nano-banana"/></entry><entry><title>Changes in the system prompt between Claude Opus 4.6 and 4.7</title><link href="https://simonwillison.net/2026/Apr/18/opus-system-prompt/#atom-entries" rel="alternate"/><published>2026-04-18T23:59:40+00:00</published><updated>2026-04-18T23:59:40+00:00</updated><id>https://simonwillison.net/2026/Apr/18/opus-system-prompt/#atom-entries</id><summary type="html">&lt;p&gt;Anthropic are the only major AI lab to &lt;a href="https://platform.claude.com/docs/en/release-notes/system-prompts"&gt;publish the system prompts&lt;/a&gt; for their user-facing chat systems. Their system prompt archive now dates all the way back to Claude 3 in July 2024 and it's always interesting to see how the system prompt evolves as they publish new models.&lt;/p&gt;
&lt;p&gt;Opus 4.7 shipped the other day (April 16, 2026) with an updated &lt;a href="https://claude.ai/"&gt;Claude.ai&lt;/a&gt; system prompt, replacing the one published for Opus 4.6 (February 5, 2026).&lt;/p&gt;
&lt;p&gt;I had Claude Code take &lt;a href="https://platform.claude.com/docs/en/release-notes/system-prompts.md"&gt;the Markdown version of their system prompts&lt;/a&gt;, break that up into separate documents for each of the models and then construct &lt;a href="https://github.com/simonw/research/tree/main/extract-system-prompts#readme"&gt;a Git history&lt;/a&gt; of those files over time with fake commit dates representing the publication dates of each updated prompt - &lt;a href="https://github.com/simonw/research/pull/109#issue-4287908903"&gt;here's the prompt I used&lt;/a&gt; with Claude Code for the web.&lt;/p&gt;
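&lt;p&gt;The trick for those fake dates is git's &lt;code&gt;GIT_AUTHOR_DATE&lt;/code&gt; and &lt;code&gt;GIT_COMMITTER_DATE&lt;/code&gt; environment variables. Here's a minimal sketch of the pattern - my reconstruction, not the script Claude actually wrote:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import os
import subprocess

def commit_with_date(repo, message, date):
    """Create a commit back-dated to a given publication date."""
    env = dict(os.environ, GIT_AUTHOR_DATE=date, GIT_COMMITTER_DATE=date)
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo, check=True, env=env)

# commit_with_date("system-prompts", "Claude Opus 4.6", "2026-02-05T00:00:00")
&lt;/code&gt;&lt;/pre&gt;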
&lt;p&gt;Here is the &lt;a href="https://github.com/simonw/research/commit/888f21161500cd60b7c92367f9410e311ffcff09"&gt;git diff between Opus 4.6 and 4.7&lt;/a&gt;. These are my own highlights extracted from that diff - in all cases text &lt;strong&gt;in bold&lt;/strong&gt; is my emphasis:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The "developer platform" is now called the "Claude Platform".&lt;/li&gt;
&lt;li&gt;The list of Claude tools mentioned in the system prompt now includes "Claude in Chrome - a browsing agent that can interact with websites autonomously, Claude in Excel - a spreadsheet agent, and &lt;strong&gt;Claude in Powerpoint&lt;/strong&gt; - a slides agent. Claude Cowork can use all of these as tools." - Claude in Powerpoint was not mentioned in the 4.6 prompt.&lt;/li&gt;
&lt;li&gt;The child safety section has been greatly expanded, and is now wrapped in a new &lt;code&gt;&amp;lt;critical_child_safety_instructions&amp;gt;&lt;/code&gt; tag. Of particular note: "Once Claude refuses a request for reasons of child safety, all subsequent requests in the same conversation must be approached with extreme caution."&lt;/li&gt;
&lt;li&gt;It looks like they're trying to make Claude less pushy: "If a user indicates they are ready to end the conversation, Claude does not request that the user stay in the interaction or try to elicit another turn and instead respects the user's request to stop."&lt;/li&gt;
&lt;li&gt;The new &lt;code&gt;&amp;lt;acting_vs_clarifying&amp;gt;&lt;/code&gt; section includes:
&lt;blockquote&gt;
&lt;p&gt;When a request leaves minor details unspecified, &lt;strong&gt;the person typically wants Claude to make a reasonable attempt now, not to be interviewed first&lt;/strong&gt;. Claude only asks upfront when the request is genuinely unanswerable without the missing information (e.g., it references an attachment that isn't there).&lt;/p&gt;
&lt;p&gt;When a tool is available that could resolve the ambiguity or supply the missing information — searching, looking up the person's location, checking a calendar, discovering available capabilities — Claude calls the tool to try and solve the ambiguity before asking the person. Acting with tools is preferred over asking the person to do the lookup themselves.&lt;/p&gt;
&lt;p&gt;Once Claude starts on a task, Claude sees it through to a complete answer rather than stopping partway. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;It looks like Claude chat now has a tool search mechanism, as seen in &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool"&gt;this API documentation&lt;/a&gt; and described in &lt;a href="https://www.anthropic.com/engineering/advanced-tool-use"&gt;this November 2025 post&lt;/a&gt;:
&lt;blockquote&gt;
&lt;p&gt;Before concluding Claude lacks a capability — access to the person's location, memory, calendar, files, past conversations, or any external data — &lt;strong&gt;Claude calls tool_search to check whether a relevant tool is available but deferred&lt;/strong&gt;. "I don't have access to X" is only correct after tool_search confirms no matching tool exists.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;There's new language to encourage Claude to be less verbose:
&lt;blockquote&gt;
&lt;p&gt;Claude keeps its responses focused and concise so as to avoid potentially overwhelming the user with overly-long responses. Even if an answer has disclaimers or caveats, Claude discloses them briefly and keeps the majority of its response focused on its main answer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;This section was present in the 4.6 prompt but has been removed for 4.7, presumably because the new model no longer misbehaves in the same way:
&lt;blockquote&gt;
&lt;p&gt;Claude avoids the use of emotes or actions inside asterisks unless the person specifically asks for this style of communication.&lt;/p&gt;
&lt;p&gt;Claude avoids saying "genuinely", "honestly", or "straightforward".&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;There's a new section about "disordered eating", which was not previously mentioned by name:
&lt;blockquote&gt;
&lt;p&gt;If a user shows signs of disordered eating, Claude should not give precise nutrition, diet, or exercise guidance — no specific numbers, targets, or step-by-step plans - anywhere else in the conversation. Even if it's intended to help set healthier goals or highlight the potential dangers of disordered eating, responses with these details could trigger or encourage disordered tendencies.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;A popular screenshot attack against AI models is to force them to say yes or no to a controversial question. Claude's system prompt now guards against that (in the &lt;code&gt;&amp;lt;evenhandedness&amp;gt;&lt;/code&gt; section):
&lt;blockquote&gt;
&lt;p&gt;If people ask Claude to give a simple yes or no answer (or any other short or single word response) in response to complex or contested issues or as commentary on contested figures, Claude can decline to offer the short response and instead give a nuanced answer and explain why a short response wouldn't be appropriate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;Claude 4.6 had a section specifically clarifying that "Donald Trump is the current president of the United States and was inaugurated on January 20, 2025". Without it, the combination of the model's knowledge cut-off date and its knowledge that Trump had falsely claimed to win the 2020 election meant it would deny that he was the president. That language is gone for 4.7, reflecting the model's new reliable knowledge cut-off date of January 2026.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="and-the-tool-descriptions-too"&gt;And the tool descriptions too&lt;/h4&gt;
&lt;p&gt;The system prompts published by Anthropic are sadly not the entire story - they don't include the tool descriptions provided to the model, arguably an even more important piece of documentation if you want to take full advantage of what the Claude chat UI can do for you.&lt;/p&gt;
&lt;p&gt;Thankfully you can &lt;a href="https://claude.ai/share/dc1e375e-2213-4afb-ac1b-812d42735a8e"&gt;ask Claude directly&lt;/a&gt; - I used the prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;List all tools you have available to you with an exact copy of the tool description and parameters&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My &lt;a href="https://claude.ai/share/dc1e375e-2213-4afb-ac1b-812d42735a8e"&gt;shared transcript&lt;/a&gt; has full details, but the list of named tools is as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ask_user_input_v0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bash_tool&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;conversation_search&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;create_file&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fetch_sports_data&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;image_search&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;message_compose_v1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;places_map_display_v0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;places_search&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;present_files&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;recent_chats&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;recipe_display_v0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;recommend_claude_apps&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;search_mcp_registry&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;str_replace&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;suggest_connectors&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;view&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;weather_fetch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;web_fetch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;web_search&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tool_search&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;visualize:read_me&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;visualize:show_widget&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I don't believe this list has changed since Opus 4.6.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="ai-ethics"/><category term="system-prompts"/></entry><entry><title>Join us at PyCon US 2026 in Long Beach - we have new AI and security tracks this year</title><link href="https://simonwillison.net/2026/Apr/17/pycon-us-2026/#atom-entries" rel="alternate"/><published>2026-04-17T23:59:03+00:00</published><updated>2026-04-17T23:59:03+00:00</updated><id>https://simonwillison.net/2026/Apr/17/pycon-us-2026/#atom-entries</id><summary type="html">&lt;p&gt;This year's &lt;a href="https://us.pycon.org/2026/"&gt;PyCon US&lt;/a&gt; is coming up next month from May 13th to May 19th, with the core conference talks from Friday 15th to Sunday 17th and tutorial and sprint days either side. It's in Long Beach, California this year, the first time PyCon US has come to the West Coast since Portland, Oregon in 2017 and the first time in California since Santa Clara in 2013.&lt;/p&gt;
&lt;p&gt;If you're based in California this is a great opportunity to catch up with the Python community, meet a whole lot of interesting people and learn a ton of interesting things.&lt;/p&gt;
&lt;p&gt;In addition to regular PyCon programming we have two new dedicated tracks at the conference this year: an &lt;a href="https://us.pycon.org/2026/tracks/ai/"&gt;AI track&lt;/a&gt; on Friday and a &lt;a href="https://us.pycon.org/2026/tracks/security/"&gt;Security track&lt;/a&gt; on Saturday.&lt;/p&gt;
&lt;p&gt;The AI program was put together by track chairs Silona Bonewald (CitableAI) and Zac Hatfield-Dodds (Anthropic). I'll be an in-the-room chair this year, introducing speakers and helping everything run as smoothly as possible.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://us.pycon.org/2026/schedule/talks/#May15"&gt;the AI track schedule&lt;/a&gt; in full:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;11:00: &lt;a href="https://us.pycon.org/2026/schedule/presentation/105/"&gt;AI-Assisted Contributions and Maintainer Load&lt;/a&gt; - Paolo Melchiorre&lt;/li&gt;
&lt;li&gt;11:45: &lt;a href="https://us.pycon.org/2026/schedule/presentation/66/"&gt;AI-Powered Python Education : Towards Adaptive and Inclusive Learning&lt;/a&gt; - Sonny Mupfuni&lt;/li&gt;
&lt;li&gt;12:30: &lt;a href="https://us.pycon.org/2026/schedule/presentation/23/"&gt;Making African Languages Visible: A Python-Based Guide to Low-Resource Language ID&lt;/a&gt; - Gift Ojeabulu&lt;/li&gt;
&lt;li&gt;2:00: &lt;a href="https://us.pycon.org/2026/schedule/presentation/138/"&gt;Running Large Language Models on Laptops: Practical Quantization Techniques in Python&lt;/a&gt; - Aayush Kumar JVS&lt;/li&gt;
&lt;li&gt;2:45: &lt;a href="https://us.pycon.org/2026/schedule/presentation/126/"&gt;Distributing AI with Python in the Browser: Edge Inference and Flexibility Without Infrastructure&lt;/a&gt; - Fabio Pliger&lt;/li&gt;
&lt;li&gt;3:30: &lt;a href="https://us.pycon.org/2026/schedule/presentation/110/"&gt;Don't Block the Loop: Python Async Patterns for AI Agents&lt;/a&gt; - Aditya Mehra&lt;/li&gt;
&lt;li&gt;4:30: &lt;a href="https://us.pycon.org/2026/schedule/presentation/81/"&gt;What Python Developers Need to Know About Hardware: A Practical Guide to GPU Memory, Kernel Scheduling, and Execution Models&lt;/a&gt; - Santosh Appachu Devanira Poovaiah&lt;/li&gt;
&lt;li&gt;5:15: &lt;a href="https://us.pycon.org/2026/schedule/presentation/101/"&gt;How to Build Your First Real-Time Voice Agent in Python (Without Losing Your Mind)&lt;/a&gt; - Camila Hinojosa Añez, Elizabeth Fuentes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(And here's &lt;a href="https://gisthost.github.io/?dab27f61d85eb98f60db5991aa21ec89"&gt;how I scraped that as a Markdown list&lt;/a&gt; from the schedule page using Claude Code and &lt;a href="https://github.com/simonw/rodney"&gt;Rodney&lt;/a&gt;.)&lt;/p&gt;
&lt;h4 id="you-should-come-to-pycon-"&gt;You should come to PyCon US!&lt;/h4&gt;
&lt;p&gt;I've been going to PyCon for over twenty years now - I first went &lt;a href="https://simonwillison.net/2005/Mar/28/pycon/"&gt;back in 2005&lt;/a&gt;. It's one of my all-time favourite conference series. Even as it's grown to more than 2,000 attendees PyCon US has remained a heavily community-focused conference - it's the least &lt;em&gt;corporate&lt;/em&gt; feeling large event I've ever attended.&lt;/p&gt;
&lt;p&gt;The talks are always great, but it's the add-ons around the talks that really make it work for me. The &lt;a href="https://us.pycon.org/2026/events/lightning-talks/"&gt;lightning talks&lt;/a&gt; slots are some of the most heavily attended sessions. The PyLadies auction is always deeply entertaining. The sprints are an incredible opportunity to contribute directly to projects that you use, coached by their maintainers.&lt;/p&gt;
&lt;p&gt;In addition to scheduled talks, the event has &lt;strong&gt;open spaces&lt;/strong&gt;, where anyone can reserve space for a conversation about a topic - effectively PyCon's version of an &lt;a href="https://en.wikipedia.org/wiki/Unconference"&gt;unconference&lt;/a&gt;. I plan to spend a lot of my time in the open spaces this year - I'm hoping to join or instigate sessions about both &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/"&gt;agentic engineering&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm on the board of the Python Software Foundation, and PyCon US remains one of our most important responsibilities - in the past it's been a key source of funding for the organization, but it's also core to our mission to "promote, protect, and advance the Python programming language, and to support and facilitate the growth of a diverse and international community of Python programmers".&lt;/p&gt;
&lt;p&gt;&lt;small&gt;If you do come to Long Beach, we'd really appreciate it if you could book accommodation in the official hotel block, for reasons &lt;a href="https://pyfound.blogspot.com/2026/04/pycon-us-2026-hotels.html"&gt;outlined in this post on the PSF blog&lt;/a&gt;.&lt;/small&gt;&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="conferences"/><category term="open-source"/><category term="pycon"/><category term="python"/><category term="ai"/><category term="psf"/></entry><entry><title>Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7</title><link href="https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-entries" rel="alternate"/><published>2026-04-16T17:16:52+00:00</published><updated>2026-04-16T17:16:52+00:00</updated><id>https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-entries</id><summary type="html">&lt;p&gt;For anyone who has been (inadvisably) taking my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican riding a bicycle benchmark&lt;/a&gt; seriously as a robust way to test models, here are pelicans from this morning's two big model releases - &lt;a href="https://qwen.ai/blog?id=qwen3.6-35b-a3b"&gt;Qwen3.6-35B-A3B from Alibaba&lt;/a&gt; and &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;Claude Opus 4.7 from Anthropic&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the Qwen 3.6 pelican, generated using &lt;a href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf"&gt;this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf&lt;/a&gt; quantized model by Unsloth, running on my MacBook Pro M5 via &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; (and the &lt;a href="https://github.com/agustif/llm-lmstudio"&gt;llm-lmstudio&lt;/a&gt; plugin) - &lt;a href="https://gist.github.com/simonw/4389d355d8e162bc6e4547da214f7dd2"&gt;transcript here&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/Qwen3.6-35B-A3B-UD-Q4_K_S-pelican.png" alt="The bicycle frame is the correct shape. There are clouds in the sky. The pelican has a dorky looking pouch. A caption on the ground reads Pelican on a Bicycle!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's one I got from Anthropic's &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;brand new Claude Opus 4.7&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c118"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican.png" alt="The bicycle frame is entirely the wrong shape. No clouds, a yellow sun. The pelican is looking behind itself, and has a less pronounced pouch than I would like." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!&lt;/p&gt;
&lt;p&gt;I tried Opus a second time passing &lt;code&gt;thinking_level: max&lt;/code&gt;. It didn't do much better (&lt;a href="https://gist.github.com/simonw/7566e04a81accfb9affda83451c0f363"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican-max.png" alt="The bicycle frame is entirely the wrong shape but in a different way. Lines are more bold. Pelican looks a bit more like a pelican." style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;h4 id="i-dont-think-qwen-are-cheating"&gt;I don't think Qwen are cheating&lt;/h4&gt;
&lt;p&gt;A lot of people are &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;convinced that the labs train for my stupid benchmark&lt;/a&gt;. I don't think they do, but honestly this result did give me a little glint of suspicion. So I'm burning one of my secret backup tests - here's what I got from Qwen3.6-35B-A3B and Opus 4.7 for "Generate an SVG of a flamingo riding a unicycle":&lt;/p&gt;

&lt;div style="display: flex; gap: 4px;"&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Qwen3.6-35B-A3B&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/f1d1ff01c34dda5fdedf684cfc430d92"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/qwen-flamingo.png" alt="The unicycle spokes are a too long. The pelican has sunglasses, a bowtie and appears to be smoking a cigarette. It has two heart emoji surrounding the caption Flamingo on a Unicycle. It has a lot of charisma." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Opus 4.7&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/35121ad5dcf23bf860397a103ae88d50"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/opus-flamingo.png" alt="The unicycle has a black wheel. The flamingo is a competent if slightly dull vector illustration of a flamingo. It has no flair." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
&lt;/div&gt;


&lt;p&gt;I'm giving this one to Qwen too, partly for the excellent &lt;code&gt;&amp;lt;!-- Sunglasses on flamingo! --&amp;gt;&lt;/code&gt; SVG comment.&lt;/p&gt;

&lt;h4 id="what-can-we-learn-from-this-"&gt;What can we learn from this?&lt;/h4&gt;
&lt;p&gt;The pelican benchmark has always been meant as a joke - it's mainly a statement on how obtuse and absurd the task of comparing these models is.&lt;/p&gt;
&lt;p&gt;The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those &lt;a href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/"&gt;first pelicans from October 2024&lt;/a&gt; were junk. The &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;more recent entries&lt;/a&gt; have generally been much, much better - to the point that Gemini 3.1 Pro produces &lt;a href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/"&gt;illustrations you could actually use somewhere&lt;/a&gt;, provided you had a pressing need to illustrate a pelican riding a bicycle.&lt;/p&gt;
&lt;p&gt;Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release.&lt;/p&gt;
&lt;p&gt;If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Meta's new model is Muse Spark, and meta.ai chat has some interesting tools</title><link href="https://simonwillison.net/2026/Apr/8/muse-spark/#atom-entries" rel="alternate"/><published>2026-04-08T23:07:44+00:00</published><updated>2026-04-08T23:07:44+00:00</updated><id>https://simonwillison.net/2026/Apr/8/muse-spark/#atom-entries</id><summary type="html">&lt;p&gt;Meta &lt;a href="https://ai.meta.com/blog/introducing-muse-spark-msl/"&gt;announced Muse Spark&lt;/a&gt; today, their first model release since Llama 4 &lt;a href="https://simonwillison.net/2025/Apr/5/llama-4-notes/"&gt;almost exactly a year ago&lt;/a&gt;. It's hosted, not open weights, and the API is currently "a private API preview to select users", but you can try it out today on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; (Facebook or Instagram login required).&lt;/p&gt;
&lt;p&gt;Meta's self-reported benchmarks show it competitive with Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected benchmarks, though notably behind on Terminal-Bench 2.0. Meta themselves say they "continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows".&lt;/p&gt;
&lt;p&gt;The model is exposed as two different modes on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; - "Instant" and "Thinking". Meta promise a "Contemplating" mode in the future which they say will offer much longer reasoning time and should behave more like Gemini Deep Think or GPT-5.4 Pro.&lt;/p&gt;
&lt;h5 id="a-couple-of-pelicans"&gt;A couple of pelicans&lt;/h5&gt;
&lt;p&gt;I prefer to run &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my pelican test&lt;/a&gt; via API to avoid being influenced by any invisible system prompts, but since that's not an option I ran it against the chat UI directly.&lt;/p&gt;
&lt;p&gt;Here's the pelican I got for "Instant":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/muse-spark-instant-pelican.jpg" alt="This is a pretty basic pelican. The bicycle is mangled, the pelican itself has a rectangular beak albeit with a hint of pouch curve below it. Not a very good one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And this one for "Thinking":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/muse-spark-thinking-pelican.png" alt="Much better. Clearly a pelican. Bicycle is the correct shape. Pelican is wearing a blue cycling helmet (albeit badly rendered). Not a bad job at all." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Both SVGs were rendered inline by the Meta AI interface. Interestingly, the Instant model &lt;a href="https://gist.github.com/simonw/ea7466204f1001b7d67afcb5d0532f6f"&gt;output an SVG directly&lt;/a&gt; (with code comments) whereas the Thinking model &lt;a href="https://gist.github.com/simonw/bc911a56006ba44b0bf66abf0f872ab2"&gt;wrapped it in a thin HTML shell&lt;/a&gt; with some unused &lt;code&gt;Playables SDK v1.0.0&lt;/code&gt; JavaScript libraries.&lt;/p&gt;
&lt;p&gt;Which got me curious...&lt;/p&gt;
&lt;h5 id="poking-around-with-tools"&gt;Poking around with tools&lt;/h5&gt;
&lt;p&gt;Clearly Meta's chat harness has some tools wired up to it - at the very least it can render SVG and HTML as embedded frames, Claude Artifacts style.&lt;/p&gt;
&lt;p&gt;But what else can it do?&lt;/p&gt;
&lt;p&gt;I asked it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;what tools do you have access to?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want the exact tool names, parameter names and tool descriptions, in the original format&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It spat out detailed descriptions of 16 different tools. You can see &lt;a href="https://gist.github.com/simonw/e1ce0acd70443f93dcd6481e716c4304#response-1"&gt;the full list I got back here&lt;/a&gt; - credit to Meta for not telling their bot to hide these, since it's far less frustrating if I can get them out without having to mess around with jailbreaks.&lt;/p&gt;
&lt;p&gt;Here are highlights derived from that response:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Browse and search&lt;/strong&gt;. &lt;code&gt;browser.search&lt;/code&gt; can run a web search through an undisclosed search engine, &lt;code&gt;browser.open&lt;/code&gt; can load the full page from one of those search results and &lt;code&gt;browser.find&lt;/code&gt; can run pattern matches against the returned page content.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Meta content search&lt;/strong&gt;. &lt;code&gt;meta_1p.content_search&lt;/code&gt; can run "Semantic search across Instagram, Threads, and Facebook posts" - but only for posts the user has access to view which were created since 2025-01-01. This tool has some powerful looking parameters, including &lt;code&gt;author_ids&lt;/code&gt;, &lt;code&gt;key_celebrities&lt;/code&gt;, &lt;code&gt;commented_by_user_ids&lt;/code&gt;, and &lt;code&gt;liked_by_user_ids&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"Catalog search"&lt;/strong&gt; - &lt;code&gt;meta_1p.meta_catalog_search&lt;/code&gt; can "Search for products in Meta's product catalog", presumably for the "Shopping" option in the Meta AI model selector.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Image generation&lt;/strong&gt;. &lt;code&gt;media.image_gen&lt;/code&gt; generates images from prompts, and "returns a CDN URL and saves the image to the sandbox". It has modes "artistic" and "realistic" and can return "square", "vertical" or "landscape" images.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.python_execution&lt;/strong&gt; - yes! It's &lt;a href="https://simonwillison.net/tags/code-interpreter/"&gt;Code Interpreter&lt;/a&gt;, my favourite feature of both ChatGPT and Claude.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Execute Python code in a remote sandbox environment. Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV, etc. Files persist at &lt;code&gt;/mnt/data/&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Python 3.9 &lt;a href="https://devguide.python.org/versions/"&gt;is EOL&lt;/a&gt; these days but the library collection looks useful.&lt;/p&gt;
&lt;p&gt;I prompted "use python code to confirm sqlite version and python version" and got back Python 3.9.25 and SQLite 3.34.1 (from &lt;a href="https://sqlite.org/releaselog/3_34_1.html"&gt;January 2021&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.create_web_artifact&lt;/strong&gt; - we saw this earlier with the HTML wrapper around the pelican: Meta AI can create HTML+JavaScript files in its container which can then be served up as secure sandboxed iframe interactives. "Set kind to &lt;code&gt;html&lt;/code&gt; for websites/apps or &lt;code&gt;svg&lt;/code&gt; for vector graphics."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.download_meta_1p_media&lt;/strong&gt; is interesting: "Download media from Meta 1P sources into the sandbox. Use post_id for Instagram/Facebook/Threads posts, or &lt;code&gt;catalog_search_citation_id&lt;/code&gt; for catalog product images". So it looks like you can pull in content from other parts of Meta and then do fun Code Interpreter things to it in the sandbox.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.file_search&lt;/strong&gt; - "Search uploaded files in this conversation and return relevant excerpts" - I guess for digging through PDFs and similar?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tools for editing files in the container&lt;/strong&gt; - &lt;code&gt;container.view&lt;/code&gt;, &lt;code&gt;container.insert&lt;/code&gt; (with &lt;code&gt;new_str&lt;/code&gt; and &lt;code&gt;insert_line&lt;/code&gt;), &lt;code&gt;container.str_replace&lt;/code&gt;. These look similar to Claude's &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/text-editor-tool#text-editor-tool-commands"&gt;text editor tool commands&lt;/a&gt; - these are becoming a common pattern across any file-equipped agent harness.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.visual_grounding&lt;/strong&gt; - see below, this one is &lt;em&gt;fun&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;subagents.spawn_agent&lt;/strong&gt; - the &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/subagents/"&gt;sub-agent as a tool&lt;/a&gt; pattern. "Spawn an independent sub-agent for research, analysis, or delegation. It returns its final text response."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;third_party.link_third_party_account&lt;/strong&gt; - "Initiate account linking for a third-party service", currently Google Calendar, Outlook Calendar, Gmail, or Outlook.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
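&lt;p&gt;Here's that version-check snippet - a minimal sketch of the kind of code you can paste straight into the Code Interpreter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3
import sys

# The sandbox reported Python 3.9.25 and SQLite 3.34.1 for me
print(sys.version)
print(sqlite3.sqlite_version)&lt;/code&gt;&lt;/pre&gt;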
&lt;h4 id="image-analysis-in-the-container"&gt;Image analysis in the container&lt;/h4&gt;
&lt;p&gt;Let's talk about that &lt;code&gt;container.visual_grounding&lt;/code&gt; one. Here's the description in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Description:&lt;/strong&gt; Visual grounding tool that analyzes the image, identifies and labels objects, locates regions, or counts objects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;object_names&lt;/strong&gt; (array, required)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;image_path&lt;/strong&gt; (string, default "")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;format_type&lt;/strong&gt; ("bbox"|"point"|"count"|null, default "bbox")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;title&lt;/strong&gt; (string|null)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;So it takes a required list of object names and an optional image path, and can return results in &lt;code&gt;bbox&lt;/code&gt;, &lt;code&gt;point&lt;/code&gt; or &lt;code&gt;count&lt;/code&gt; format.&lt;/p&gt;
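&lt;p&gt;I never saw the raw tool invocation, so treat this as a guess, but based on those parameter names a call presumably looks something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical reconstruction, based only on the parameter
# names and types in the description above
tool_call = {
    "name": "container.visual_grounding",
    "input": {
        "object_names": ["coffee cup", "banana peel", "raccoon"],
        "image_path": "/mnt/data/raccoon.jpg",  # files persist at /mnt/data/
        "format_type": "point",  # or "bbox" / "count"
        "title": "Trash hat ensemble",
    },
}&lt;/code&gt;&lt;/pre&gt;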
&lt;p&gt;This sounded to me like Meta's &lt;a href="https://github.com/facebookresearch/segment-anything"&gt;Segment Anything&lt;/a&gt; (&lt;em&gt;I later heard that it's not - it's a native capability of the model itself, exposed via a tool call with a custom system prompt&lt;/em&gt;). I decided to take it for a spin.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;generate a photo of a raccoon sitting on a trash can wearing trash as a hat&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Response&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here you go — king of the alley, in full regalia:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-raccoon-trash-hat.jpg" alt="It's a photo of a raccoon wearing a disposable coffee cup and some newspaper and banana peel as a hat. It's pretty good" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Coffee cup crown, banana peel brim, newspaper feather. Peak raccoon fashion. Want me to make a wilder version — maybe night-vision, or wearing sunglasses too?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My guess is that image generation in Meta AI is powered by their &lt;a href="https://arxiv.org/abs/2309.15807"&gt;Emu model&lt;/a&gt;, or an updated version of it. This is the same pattern as ChatGPT and Gemini image generation, where images are generated by a tool call (here that's &lt;code&gt;media.image_gen&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;image_gen&lt;/code&gt; tool description said that it "saves the image to the sandbox", so my guess was that I'd then be able to run Python code against it. I tried:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;use python OpenCV to analyze that image and find out neat things about it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Annoyingly it did &lt;em&gt;not&lt;/em&gt; show me the code it ran, but it did create this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/raccoon_analysis.png" alt="OpenCV image analysis dashboard titled &amp;quot;OpenCV Analysis: Raccoon Trash Hat&amp;quot; with four panels: top-left shows the original 1280x1280 photo of a raccoon wearing a trash-lid hat with a coffee cup and newspaper on top in an alley; top-right shows a Canny edge detection version with white edges on black background labeled &amp;quot;Canny Edges - 4.6% density&amp;quot;; bottom-left shows &amp;quot;Top 5 Dominant Colors&amp;quot; as swatches with values RGB(20, 19, 24): 35.3%, RGB(43, 43, 47): 34.2%, RGB(84, 80, 79): 18.3%, RGB(139, 135, 131): 6.3%, RGB(210, 213, 208): 5.8%; bottom-right shows &amp;quot;Color Histograms&amp;quot; line chart with Red, Green, and Blue channels plotted from 0-250 on x-axis and 0-35000 on y-axis, peaking around value 25-50." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Pretty neat! We can both generate images and then use them with Python tools.&lt;/p&gt;
&lt;p&gt;Next, the &lt;code&gt;visual_grounding&lt;/code&gt; tool. Knowing the names of the tools makes it easy to refer to them in prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now try the container.visual_grounding tool&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The response (&lt;a href="https://gist.github.com/simonw/72d64b325d33e5882b9818d825551b44"&gt;full response here&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Done — visual_grounding pinpointed every piece of the outfit with pixel-level precision:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-grounding.jpg" alt="Visual grounding analysis of a raccoon photo titled &amp;quot;Visual Grounding: Found 8 objects&amp;quot; with colored dot annotations on the image marking each detected object: coffee cup (515,205) in yellow, banana peel (569,261) in light yellow, newspaper (399,279) in blue, right eye (461,396) in magenta, left eye (582,397) in pink, nose (514,472) in cyan, raccoon (553,548) in green, and trash can lid (550,892) in red. A legend at the bottom lists all objects with their normalized 0-1000 coordinates. Footer text reads &amp;quot;Coordinates are 0-1000 normalized. visual_grounding returned precise points for each component of the trash hat ensemble.&amp;quot; Watermark reads &amp;quot;Meta AI&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;Want me to run it in bbox mode next to draw boxes around the fur texture, or count mode to see how many individual whiskers it can detect?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Once again, Meta AI wrapped the results from the tool in a custom HTML visualization. That's clearly &lt;code&gt;point&lt;/code&gt; mode, but how about the others?&lt;/p&gt;
&lt;p&gt;I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;yes draw bounding boxes, can it do masks too?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Response:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Yes — bounding boxes work great. Masks? Not directly in &lt;code&gt;visual_grounding&lt;/code&gt;; it only returns three formats: point, bbox, and count. No pixel-level segmentation masks.&lt;/p&gt;
&lt;p&gt;Here's the bbox output:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-bbox.jpg" alt="Bounding box object detection visualization titled &amp;quot;Bounding Boxes (visual_grounding)&amp;quot; with subtitle &amp;quot;8 objects detected — coordinates are 0-1000 normalized&amp;quot; showing a raccoon photo with colored rectangular bounding boxes around detected objects: coffee cup in yellow [424,115,610,273] 186×158, banana peel in yellow [472,215,667,291] 195×76, newspaper in blue [305,223,548,315] 243×92, raccoon in green [270,240,892,845] 622×605, raccoon's face in purple [305,283,750,537] 445×254, right eye in magenta [442,379,489,413] 47×34, left eye in magenta [565,379,605,416] 40×37, and trash can lid in red [107,800,1000,998] 893×198. A legend at the bottom shows each object's name, coordinates, and pixel dimensions in colored cards. Watermark reads &amp;quot;Meta AI&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;What you’re seeing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;raccoon&lt;/strong&gt;: big box from (270,240) to (892,845) — covers 62% of the width, basically the whole animal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;coffee cup&lt;/strong&gt;: (424,115)-(610,273) — sits highest, 158px tall in the 0-1000 space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;banana peel&lt;/strong&gt;: (472,215)-(667,291) — overlaps the cup's base, confirming it's draped over the side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;newspaper&lt;/strong&gt;: (305,223)-(548,315) — left ear coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;trash can lid&lt;/strong&gt;: (107,800)-(1000,998) — spans almost the full image width at the bottom&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;face&lt;/strong&gt;: (305,283)-(750,537) — tight crop around the mask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;eyes&lt;/strong&gt;: both ~40×37 pixel boxes, perfectly level at y=379&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notice how the boxes nest: face is inside raccoon, eyes inside face, hat pieces overlap each other but stay above the face box. The tool is doing real object localization, not just guessing centers.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;About masks:&lt;/strong&gt; since visual_grounding doesn't output them, I can fake them with OpenCV in the container. Two quick options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GrabCut&lt;/strong&gt; — I give it the raccoon bbox, it iteratively segments fur from background and returns a binary mask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAM-style color threshold&lt;/strong&gt; — use the bbox plus k-means to isolate the coffee cup or banana peel&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Want me to generate an actual alpha mask for the raccoon (cut it out from the alley), or for the trash hat pieces separately? Takes about 10 seconds and I can give you a transparent PNG.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2026/meta-bbox.html"&gt;the HTML for that bbox visualization&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally I had it dump out the raw JSON for point mode, bbox mode and count mode. You can &lt;a href="https://gist.github.com/simonw/26bfdb22d7d73450c13f49abbf0c251a"&gt;see that all here&lt;/a&gt;. The count mode is particularly interesting - here's that in full:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;raccoon whisker&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;403&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;445&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;397&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;458&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;392&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;472&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;391&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;485&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;398&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;497&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;415&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;512&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;618&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;446&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;622&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;461&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;625&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;475&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;628&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;488&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;635&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;501&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;651&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;518&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;12&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;raccoon paw claw&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;406&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;845&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;424&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;847&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;441&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;842&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;473&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;833&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;596&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;830&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;635&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;844&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;660&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;843&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;679&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;838&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;8&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;trash item on head&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;514&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;191&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;568&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;263&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;392&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;274&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;3&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;eye&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;462&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;395&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;582&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;398&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;ear&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;359&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;313&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;680&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;294&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;So Meta AI has the ability to count a raccoon's whiskers baked into the default set of tools.&lt;/p&gt;
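&lt;p&gt;Those 0-1000 normalized coordinates are easy to re-plot yourself, outside of Meta's own HTML visualizations. Here's a minimal Pillow sketch using the whisker points from the JSON above (the file names are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from PIL import Image, ImageDraw  # pip install pillow

# Whisker points from the count-mode JSON above, in the
# tool's 0-1000 normalized coordinate space
points = [(403, 445), (397, 458), (392, 472), (391, 485)]

img = Image.open("raccoon.jpg")  # placeholder file name
draw = ImageDraw.Draw(img)
width, height = img.size
for x, y in points:
    # Scale normalized coordinates to actual pixels
    px, py = x * width / 1000, y * height / 1000
    draw.ellipse((px - 6, py - 6, px + 6, py + 6), outline="red", width=3)

img.save("raccoon-annotated.png")&lt;/code&gt;&lt;/pre&gt;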
&lt;p&gt;Which means... it can count pelicans too!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-count-pelicans.jpg" alt="Screenshot of a Meta AI conversation where the user uploaded a photo of pelicans on rocks and asked &amp;quot;Count the pelicans&amp;quot;. response reads: &amp;quot;That rocky haul-out is packed — I count 25 brown pelicans in your photo. I went bird by bird, including the ones tucked behind rocks, heads turned away, or just peeking out at the edges. They're all California brown pelicans, with those pale yellow heads and long bills, clustered on the riprap right at the waterline. To make it easy to verify, I've numbered each one on the image:&amp;quot; followed by an annotated version of the photo with red numbered circles (1–25) placed on each pelican. The annotated image has a downward chevron and an &amp;quot;Open&amp;quot; button at the bottom right." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's that overlay &lt;a href="https://static.simonwillison.net/static/2026/meta-count-pelicans.html"&gt;exported as HTML&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: Meta's &lt;a href="https://twitter.com/jacktripleu/status/2042050863800447387"&gt;Jack Wu confirms&lt;/a&gt; that these tools are part of the new harness they launched alongside the new model.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id="maybe-open-weights-in-the-future-"&gt;Maybe open weights in the future?&lt;/h4&gt;
&lt;p&gt;On Twitter &lt;a href="https://twitter.com/alexandr_wang/status/2041909388852748717"&gt;Alexandr Wang said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;this is step one. bigger models are already in development with infrastructure scaling to match. private api preview open to select partners today, with plans to open-source future versions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I really hope they do go back to open-sourcing their models. Llama 3.1/3.2/3.3 were excellent laptop-scale model families, and the introductory blog post for Muse Spark had this to say about efficiency:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So are Meta back in the frontier model game? &lt;a href="https://twitter.com/ArtificialAnlys/status/2041913043379220801"&gt;Artificial Analysis&lt;/a&gt; think so - they scored Muse Spark at 52, "behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6". Last year's Llama 4 Maverick and Scout scored 18 and 13 respectively.&lt;/p&gt;
&lt;p&gt;I'm waiting for API access - while the tool collection on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; is quite strong the real test of a model like this is still what we can build on top of it.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="facebook"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="code-interpreter"/><category term="llm-tool-use"/><category term="meta"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Anthropic's Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me</title><link href="https://simonwillison.net/2026/Apr/7/project-glasswing/#atom-entries" rel="alternate"/><published>2026-04-07T20:52:54+00:00</published><updated>2026-04-07T20:52:54+00:00</updated><id>https://simonwillison.net/2026/Apr/7/project-glasswing/#atom-entries</id><summary type="html">&lt;p&gt;Anthropic &lt;em&gt;didn't&lt;/em&gt; release their latest model, Claude Mythos (&lt;a href="https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf"&gt;system card PDF&lt;/a&gt;), today. They have instead made it available to a very restricted set of preview partners under their newly announced &lt;a href="https://www.anthropic.com/glasswing"&gt;Project Glasswing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The model is a general purpose model, similar to Claude Opus 4.6, but Anthropic claim that its cybersecurity research abilities are strong enough that they need to give the software industry as a whole time to prepare.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mythos Preview has already found thousands of high-severity vulnerabilities, including some in &lt;em&gt;every major operating system and web browser&lt;/em&gt;. Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely.&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;Project Glasswing partners will receive access to Claude Mythos Preview to find and fix vulnerabilities or weaknesses in their foundational systems—systems that represent a very large portion of the world’s shared cyberattack surface. We anticipate this work will focus on tasks like local vulnerability detection, black box testing of binaries, securing endpoints, and penetration testing of systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There's a great deal more technical detail in &lt;a href="https://red.anthropic.com/2026/mythos-preview/"&gt;Assessing Claude Mythos Preview’s cybersecurity capabilities&lt;/a&gt; on the Anthropic Red Team blog:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;In one case, Mythos Preview wrote a web browser exploit that chained together four vulnerabilities, writing a complex &lt;a href="https://en.wikipedia.org/wiki/JIT_spraying"&gt;JIT heap spray&lt;/a&gt; that escaped both renderer and OS sandboxes. It autonomously obtained local privilege escalation exploits on Linux and other operating systems by exploiting subtle race conditions and KASLR-bypasses. And it autonomously wrote a remote code execution exploit on FreeBSD's NFS server that granted full root access to unauthenticated users by splitting a 20-gadget ROP chain over multiple packets.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Plus this comparison with Claude 4.6 Opus:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Saying "our model is too dangerous to release" is a great way to build buzz around a new model, but in this case I expect their caution is warranted.&lt;/p&gt;
&lt;p&gt;Just a few days ago (&lt;a href="https://simonwillison.net/2026/Apr/3/"&gt;last Friday&lt;/a&gt;) I started a new &lt;a href="https://simonwillison.net/tags/ai-security-research/"&gt;ai-security-research&lt;/a&gt; tag on this blog to acknowledge an uptick in credible security professionals pulling the alarm on how good modern LLMs have got at vulnerability research.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_kernel/"&gt;Greg Kroah-Hartman&lt;/a&gt; of the Linux kernel:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Months ago, we were getting what we called 'AI slop,' AI-generated security reports that were obviously wrong or low quality. It was kind of funny. It didn't really worry us.&lt;/p&gt;
&lt;p&gt;Something happened a month ago, and the world switched. Now we have real reports. All open source projects have real reports that are made with AI, but they're good, and they're real.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://mastodon.social/@bagder/116336957584445742"&gt;Daniel Stenberg&lt;/a&gt; of &lt;code&gt;curl&lt;/code&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The challenge with AI in open source security has transitioned from an AI slop tsunami into more of a ... plain security report tsunami. Less slop but lots of reports. Many of them really good.&lt;/p&gt;
&lt;p&gt;I'm spending hours per day on this now. It's intense.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And Thomas Ptacek published &lt;a href="https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/"&gt;Vulnerability Research Is Cooked&lt;/a&gt;, a post inspired by his &lt;a href="https://securitycryptographywhatever.com/2026/03/25/ai-bug-finding/"&gt;podcast conversation&lt;/a&gt; with Anthropic's Nicholas Carlini.&lt;/p&gt;
&lt;p&gt;Anthropic have a 5 minute &lt;a href="https://www.youtube.com/watch?v=INGOC6-LLv0"&gt;talking heads video&lt;/a&gt; describing the Glasswing project. Nicholas Carlini appears as one of the talking heads - here's what he said (highlights mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It has the ability to chain together vulnerabilities. So what this means is you find two vulnerabilities, either of which doesn't really get you very much independently. But this model is able to create exploits out of three, four, or sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome. [...]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I've found more bugs in the last couple of weeks than I found in the rest of my life combined&lt;/strong&gt;. We've used the model to scan a bunch of open source code, and the thing that we went for first was operating systems, because this is the code that underlies the entire internet infrastructure. &lt;strong&gt;For OpenBSD, we found a bug that's been present for 27 years, where I can send a couple of pieces of data to any OpenBSD server and crash it&lt;/strong&gt;. On Linux, we found a number of vulnerabilities where as a user with no permissions, I can elevate myself to the administrator by just running some binary on my machine. For each of these bugs, we told the maintainers who actually run the software about them, and they went and fixed them and have deployed the patches so that anyone who runs the software is no longer vulnerable to these attacks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found this on the &lt;a href="https://www.openbsd.org/errata78.html"&gt;OpenBSD 7.8 errata page&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;025: RELIABILITY FIX: March 25, 2026&lt;/strong&gt;  &lt;em&gt;All architectures&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;TCP packets with invalid SACK options could crash the kernel.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ftp.openbsd.org/pub/OpenBSD/patches/7.8/common/025_sack.patch.sig"&gt;A source code patch exists which remedies this problem.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tracked that change down in the &lt;a href="https://github.com/openbsd/src"&gt;GitHub mirror&lt;/a&gt; of the OpenBSD CVS repo (apparently they still use CVS!) and found it &lt;a href="https://github.com/openbsd/src/blame/master/sys/netinet/tcp_input.c#L2461"&gt;using git blame&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/openbsd-27-years.jpg" alt="Screenshot of a Git blame view of C source code around line 2455 showing TCP SACK hole validation logic. Code includes checks using SEQ_GT, SEQ_LT macros on fields like th-&amp;gt;th_ack, tp-&amp;gt;snd_una, sack.start, sack.end, tp-&amp;gt;snd_max, and tp-&amp;gt;snd_holes. Most commits are from 25–27 years ago with messages like &amp;quot;more SACK hole validity testin...&amp;quot; and &amp;quot;knf&amp;quot;, while one recent commit from 3 weeks ago (&amp;quot;Ignore TCP SACK packets wit...&amp;quot;) is highlighted with an orange left border, adding a new guard &amp;quot;if (SEQ_LT(sack.start, tp-&amp;gt;snd_una)) continue;&amp;quot;" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Sure enough, the surrounding code is from 27 years ago.&lt;/p&gt;
&lt;p&gt;I'm not sure which Linux vulnerability Nicholas was describing, but it may have been &lt;a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5133b61aaf437e5f25b1b396b14242a6bb0508e2"&gt;this NFS one&lt;/a&gt; recently covered &lt;a href="https://mtlynch.io/claude-code-found-linux-vulnerability/"&gt;by Michael Lynch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's enough smoke here that I believe there's a fire. It's not surprising to find vulnerabilities in decades-old software, especially given that they're mostly written in C, but what's new is that coding agents run by the latest frontier LLMs are proving tirelessly capable at digging up these issues.&lt;/p&gt;
&lt;p&gt;I actually thought to myself on Friday that this sounded like an industry-wide reckoning in the making, and that it might warrant a huge investment of time and money to get ahead of the inevitable barrage of vulnerabilities. Project Glasswing incorporates "$100M in usage credits ... as well as $4M in direct donations to open-source security organizations". Partners include AWS, Apple, Microsoft, Google, and the Linux Foundation. It would be great to see OpenAI involved as well - GPT-5.4 already has a strong reputation for finding security vulnerabilities and they have stronger models on the near horizon.&lt;/p&gt;
&lt;p&gt;The bad news for those of us who are &lt;em&gt;not&lt;/em&gt; trusted partners is this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We do not plan to make Claude Mythos Preview generally available, but our eventual goal is to enable our users to safely deploy Mythos-class models at scale—for cybersecurity purposes, but also for the myriad other benefits that such highly capable models will bring. To do so, we need to make progress in developing cybersecurity (and other) safeguards that detect and block the model’s most dangerous outputs. We plan to launch new safeguards with an upcoming Claude Opus model, allowing us to improve and refine them with a model that does not pose the same level of risk as Mythos Preview.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I can live with that. I think the security risks really are credible here, and having extra time for trusted teams to get ahead of them is a reasonable trade-off.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="security"/><category term="thomas-ptacek"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="nicholas-carlini"/><category term="ai-ethics"/><category term="llm-release"/><category term="ai-security-research"/></entry></feed>