<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: november-2025-inflection</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/november-2025-inflection.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-13T03:44:34+00:00</updated><author><name>Simon Willison</name></author><entry><title>Shopify/liquid: Performance: 53% faster parse+render, 61% fewer allocations</title><link href="https://simonwillison.net/2026/Mar/13/liquid/#atom-tag" rel="alternate"/><published>2026-03-13T03:44:34+00:00</published><updated>2026-03-13T03:44:34+00:00</updated><id>https://simonwillison.net/2026/Mar/13/liquid/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Shopify/liquid/pull/2056"&gt;Shopify/liquid: Performance: 53% faster parse+render, 61% fewer allocations&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;PR from Shopify CEO Tobias Lütke against Liquid, Shopify's open source Ruby template engine that was somewhat inspired by Django when Tobi first created it &lt;a href="https://simonwillison.net/2005/Nov/6/liquid/"&gt;back in 2005&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Tobi found dozens of new performance micro-optimizations using a variant of &lt;a href="https://github.com/karpathy/autoresearch"&gt;autoresearch&lt;/a&gt;, Andrej Karpathy's new system for having a coding agent run hundreds of semi-autonomous experiments to find new effective techniques for training &lt;a href="https://github.com/karpathy/nanochat"&gt;nanochat&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Tobi's implementation started two days ago with this &lt;a href="https://github.com/Shopify/liquid/blob/2543fdc1a101f555db208fb0deeb2e3bf1ae9e36/auto/autoresearch.md"&gt;autoresearch.md&lt;/a&gt; prompt file and an &lt;a href="https://github.com/Shopify/liquid/blob/2543fdc1a101f555db208fb0deeb2e3bf1ae9e36/auto/autoresearch.sh"&gt;autoresearch.sh&lt;/a&gt; script for the agent to run to execute the test suite and report on benchmark scores.&lt;/p&gt;
&lt;p&gt;The PR now lists &lt;a href="https://github.com/Shopify/liquid/pull/2056/commits"&gt;93 commits&lt;/a&gt; from around 120 automated experiments. The PR description lists what worked in detail - some examples:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Replaced StringScanner tokenizer with &lt;code&gt;String#byteindex&lt;/code&gt;.&lt;/strong&gt; Single-byte &lt;code&gt;byteindex&lt;/code&gt; searching is ~40% faster than regex-based &lt;code&gt;skip_until&lt;/code&gt;. This alone reduced parse time by ~12%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pure-byte &lt;code&gt;parse_tag_token&lt;/code&gt;.&lt;/strong&gt; Eliminated the costly &lt;code&gt;StringScanner#string=&lt;/code&gt; reset that was called for every &lt;code&gt;{% %}&lt;/code&gt; token (878 times). Manual byte scanning for tag name + markup extraction is faster than resetting and re-scanning via StringScanner. [...]&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cached small integer &lt;code&gt;to_s&lt;/code&gt;.&lt;/strong&gt; Pre-computed frozen strings for 0-999 avoid 267 &lt;code&gt;Integer#to_s&lt;/code&gt; allocations per render.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
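&lt;p&gt;As a rough illustration - my own sketch, not Liquid's actual code - the byte-scanning and integer-caching techniques look something like this in Ruby:&lt;/p&gt;

```ruby
# Hypothetical sketches of two of the listed optimizations - not Liquid's code.

# Byte-oriented tag scanning: the PR uses String#byteindex (Ruby 3.2+);
# String#index is the codepoint-based equivalent shown here for portability.
# Either way, a single-character search beats a regex-based skip_until.
def next_tag_start(template, from)
  template.index("{", from)
end

# Cached small-integer to_s: pre-frozen strings for 0..999 mean rendering an
# integer in that range reuses one object instead of allocating a new String.
INT_CACHE = (0..999).map { |i| i.to_s.freeze }.freeze

def cached_to_s(n)
  n.is_a?(Integer) && (0..999).cover?(n) ? INT_CACHE[n] : n.to_s
end
```

&lt;p&gt;The cache trades roughly 1,000 small frozen strings held permanently for zero per-render allocations on the hot path.&lt;/p&gt;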
&lt;p&gt;This all added up to a 53% improvement on benchmarks - truly impressive for a codebase that's been tweaked by hundreds of contributors over 20 years.&lt;/p&gt;
&lt;p&gt;I think this illustrates a number of interesting ideas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Having a robust test suite - in this case 974 unit tests - is a &lt;em&gt;massive unlock&lt;/em&gt; for working with coding agents. This kind of research effort would not be possible without first having a tried and tested suite of tests.&lt;/li&gt;
&lt;li&gt;The autoresearch pattern - where an agent brainstorms a multitude of potential improvements and then experiments with them one at a time - is really effective.&lt;/li&gt;
&lt;li&gt;If you provide an agent with a benchmarking script, "make it faster" becomes an actionable goal.&lt;/li&gt;
&lt;li&gt;CEOs can code again! Tobi has always been more hands-on than most, but this is a much more significant contribution than anyone would expect from the leader of a company with 7,500+ employees. I've seen this pattern play out a lot over the past few months: coding agents make it feasible for people in high-interruption roles to productively work with code again.&lt;/li&gt;
&lt;/ul&gt;
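&lt;p&gt;That benchmarking script can be as simple as something that prints a single comparable number. Here's a hypothetical Ruby sketch - not Shopify's actual &lt;code&gt;autoresearch.sh&lt;/code&gt; - of the kind of harness that turns "make it faster" into a measurable target:&lt;/p&gt;

```ruby
require "benchmark"

# Hypothetical harness. The only contract the agent needs is:
# "run this script, get a number back, make the number smaller".
def parse_and_render(template)
  # Stand-in for the real Liquid parse+render under test.
  template.gsub(/\{\{\s*(\w+)\s*\}\}/) { $1.upcase }
end

def score(iterations = 2_000)
  Benchmark.realtime do
    iterations.times { parse_and_render("Hello {{ name }}, order {{ id }}") }
  end
end

puts format("parse+render: %.4fs", score)
```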
&lt;p&gt;Here's Tobi's &lt;a href="https://github.com/tobi"&gt;GitHub contribution graph&lt;/a&gt; for the past year, showing a significant uptick following that &lt;a href="https://simonwillison.net/tags/november-2025-inflection/"&gt;November 2025 inflection point&lt;/a&gt; when coding agents got really good.&lt;/p&gt;
&lt;p&gt;&lt;img alt="1,658 contributions in the last year - scattered lightly through Jun, Aug, Sep, Oct and Nov and then picking up significantly in Dec, Jan, and Feb." src="https://static.simonwillison.net/static/2026/tobi-contribs.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;He used &lt;a href="https://github.com/badlogic/pi-mono"&gt;Pi&lt;/a&gt; as the coding agent and released a new &lt;a href="https://github.com/davebcn87/pi-autoresearch"&gt;pi-autoresearch&lt;/a&gt; plugin in collaboration with David Cortés, which maintains state in an &lt;code&gt;autoresearch.jsonl&lt;/code&gt; file &lt;a href="https://github.com/Shopify/liquid/blob/3182b7c1b3758b0f5fe2d0fcc71a48bbcb11c946/autoresearch.jsonl"&gt;like this one&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/tobi/status/2032212531846971413"&gt;@tobi&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rails"&gt;rails&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ruby"&gt;ruby&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tobias-lutke"&gt;tobias-lutke&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="performance"/><category term="rails"/><category term="ruby"/><category term="ai"/><category term="andrej-karpathy"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="coding-agents"/><category term="agentic-engineering"/><category term="november-2025-inflection"/><category term="tobias-lutke"/></entry><entry><title>Perhaps not Boring Technology after all</title><link href="https://simonwillison.net/2026/Mar/9/not-so-boring/#atom-tag" rel="alternate"/><published>2026-03-09T13:37:45+00:00</published><updated>2026-03-09T13:37:45+00:00</updated><id>https://simonwillison.net/2026/Mar/9/not-so-boring/#atom-tag</id><summary type="html">
    &lt;p&gt;A recurring concern I've seen regarding LLMs for programming is that they will push our technology choices towards the tools that are best represented in their training data, making it harder for new, better tools to break through the noise.&lt;/p&gt;
&lt;p&gt;This was certainly the case a couple of years ago, when asking models for help with Python or JavaScript appeared to give much better results than questions about less widely used languages.&lt;/p&gt;
&lt;p&gt;With &lt;a href="https://simonwillison.net/tags/november-2025-inflection/"&gt;the latest models&lt;/a&gt; running in good coding agent harnesses I'm not sure this continues to hold up.&lt;/p&gt;
&lt;p&gt;I'm seeing excellent results with my &lt;a href="https://simonwillison.net/2026/Feb/17/chartroom-and-datasette-showboat/"&gt;brand new tools&lt;/a&gt; where I start by prompting "use uvx showboat --help / rodney --help / chartroom --help to learn about these tools" - the context length of these new models is long enough that they can consume quite a lot of documentation before they start working on a problem.&lt;/p&gt;
&lt;p&gt;Drop a coding agent into &lt;em&gt;any&lt;/em&gt; existing codebase that uses libraries and tools that are too private or too new to feature in the training data and my experience is that it works &lt;em&gt;just fine&lt;/em&gt; - the agent will consult enough of the existing examples to understand patterns, then iterate and test its own output to fill in the gaps.&lt;/p&gt;
&lt;p&gt;This is a surprising result. I thought coding agents would prove to be the ultimate embodiment of the &lt;a href="https://boringtechnology.club"&gt;Choose Boring Technology&lt;/a&gt; approach, but in practice they don't seem to be affecting my technology choices in that way at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: A few follow-on thoughts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The issue of what technology LLMs &lt;em&gt;recommend&lt;/em&gt; is a separate one. &lt;a href="https://amplifying.ai/research/claude-code-picks"&gt;What Claude Code &lt;em&gt;Actually&lt;/em&gt; Chooses&lt;/a&gt; is an interesting recent study in which Edwin Ong and Alex Vikati prompted Claude Code over 2,000 times and found a strong bias towards build-over-buy, but also identified a preferred technical stack, with GitHub Actions, Stripe, and shadcn/ui seeing a "near monopoly" in their respective categories. For the sake of this post my interest is in what happens when the human makes a technology choice that differs from those preferred by the model harness.&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://simonwillison.net/tags/skills/"&gt;Skills&lt;/a&gt; mechanism that is being rapidly embraced by most coding agent tools is super-relevant here. We are already seeing projects release official skills to help agents use them - here are examples from &lt;a href="https://github.com/remotion-dev/skills"&gt;Remotion&lt;/a&gt;, &lt;a href="https://github.com/supabase/agent-skills"&gt;Supabase&lt;/a&gt;, &lt;a href="https://github.com/vercel-labs/agent-skills"&gt;Vercel&lt;/a&gt;, and &lt;a href="https://github.com/prisma/skills"&gt;Prisma&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/boring-technology"&gt;boring-technology&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="boring-technology"/><category term="coding-agents"/><category term="agentic-engineering"/><category term="november-2025-inflection"/></entry><entry><title>Quoting Donald Knuth</title><link href="https://simonwillison.net/2026/Mar/3/donald-knuth/#atom-tag" rel="alternate"/><published>2026-03-03T23:59:04+00:00</published><updated>2026-03-03T23:59:04+00:00</updated><id>https://simonwillison.net/2026/Mar/3/donald-knuth/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf"&gt;&lt;p&gt;Shock! Shock! I learned yesterday that an open problem I'd been working on for several weeks had just been solved by Claude Opus 4.6 - Anthropic's hybrid reasoning model that had been released three weeks earlier! It seems that I'll have to revise my opinions about "generative AI" one of these days. What a joy it is to learn not only that my conjecture has a nice solution but also to celebrate this dramatic advance in automatic deduction and creative problem solving.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf"&gt;Donald Knuth&lt;/a&gt;, Claude's Cycles&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/donald-knuth"&gt;donald-knuth&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-reasoning"/><category term="november-2025-inflection"/><category term="donald-knuth"/></entry><entry><title>An AI agent coding skeptic tries AI agent coding, in excessive detail</title><link href="https://simonwillison.net/2026/Feb/27/ai-agent-coding-in-excessive-detail/#atom-tag" rel="alternate"/><published>2026-02-27T20:43:41+00:00</published><updated>2026-02-27T20:43:41+00:00</updated><id>https://simonwillison.net/2026/Feb/27/ai-agent-coding-in-excessive-detail/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://minimaxir.com/2026/02/ai-agent-coding/"&gt;An AI agent coding skeptic tries AI agent coding, in excessive detail&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Another in the genre of "OK, coding agents got good in November" posts, this one is by Max Woolf and is very much worth your time. He describes a sequence of coding agent projects, each more ambitious than the last - starting with simple YouTube metadata scrapers and eventually evolving to this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It would be arrogant to port Python's &lt;a href="https://scikit-learn.org/stable/"&gt;scikit-learn&lt;/a&gt; — the gold standard of data science and machine learning libraries — to Rust with all the features that implies.&lt;/p&gt;
&lt;p&gt;But that's unironically a good idea so I decided to try and do it anyways. With the use of agents, I am now developing &lt;code&gt;rustlearn&lt;/code&gt; (extreme placeholder name), a Rust crate that implements not only the fast implementations of the standard machine learning algorithms such as &lt;a href="https://en.wikipedia.org/wiki/Logistic_regression"&gt;logistic regression&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/K-means_clustering"&gt;k-means clustering&lt;/a&gt;, but also includes the fast implementations of the algorithms above: the same three step pipeline I describe above still works even with the more simple algorithms to beat scikit-learn's implementations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Max also captures the frustration of trying to explain how good the models have got to an existing skeptical audience:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The real annoying thing about Opus 4.6/Codex 5.3 is that it’s impossible to publicly say “Opus 4.5 (and the models that came after it) are an order of magnitude better than coding LLMs released just months before it” without sounding like an AI hype booster clickbaiting, but it’s the counterintuitive truth to my personal frustration. I have been trying to break this damn model by giving it complex tasks that would take me months to do by myself despite my coding pedigree but Opus and Codex keep doing them correctly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A throwaway remark in this post inspired me to &lt;a href="https://github.com/simonw/research/tree/main/rust-wordcloud#readme"&gt;ask Claude Code to build a Rust word cloud CLI tool&lt;/a&gt;, which it happily did.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/max-woolf"&gt;max-woolf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="rust"/><category term="max-woolf"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="coding-agents"/><category term="agentic-engineering"/><category term="november-2025-inflection"/></entry><entry><title>Quoting Andrej Karpathy</title><link href="https://simonwillison.net/2026/Feb/26/andrej-karpathy/#atom-tag" rel="alternate"/><published>2026-02-26T19:03:27+00:00</published><updated>2026-02-26T19:03:27+00:00</updated><id>https://simonwillison.net/2026/Feb/26/andrej-karpathy/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/karpathy/status/2026731645169185220"&gt;&lt;p&gt;It is hard to communicate how much programming has changed due to AI in the last 2 months: not gradually and over time in the "progress as usual" way, but specifically this last December. There are a number of asterisks but imo coding agents basically didn’t work before December and basically work since - the models have significantly higher quality, long-term coherence and tenacity and they can power through large and long tasks, well past enough that it is extremely disruptive to the default programming workflow. [...]&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/karpathy/status/2026731645169185220"&gt;Andrej Karpathy&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="andrej-karpathy"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="coding-agents"/><category term="agentic-engineering"/><category term="november-2025-inflection"/></entry><entry><title>I vibe coded my dream macOS presentation app</title><link href="https://simonwillison.net/2026/Feb/25/present/#atom-tag" rel="alternate"/><published>2026-02-25T16:46:19+00:00</published><updated>2026-02-25T16:46:19+00:00</updated><id>https://simonwillison.net/2026/Feb/25/present/#atom-tag</id><summary type="html">
    &lt;p&gt;I gave a talk this weekend at Social Science FOO Camp in Mountain View. The event was a classic unconference format where anyone could present a talk without needing to propose it in advance. I grabbed a slot for a talk I titled "The State of LLMs, February 2026 edition", subtitle "It's all changed since November!". I vibe coded a custom macOS app for the presentation the night before.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/state-of-llms.jpg" alt="A sticky note on a board at FOO Camp. It reads: The state of LLMs, Feb 2026 edition - it's all changed since November! Simon Willison - the card is littered with names of new models: Qwen 3.5, DeepSeek 3.2, Sonnet 4.6, Kimi K2.5, GLM5, Opus 4.5/4.6, Gemini 3.1 Pro, Codex 5.3. The card next to it says Why do Social Scientists think they need genetics? Bill January (it's not all because of AI)" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I've written about the last twelve months of development in LLMs in &lt;a href="https://simonwillison.net/2023/Dec/31/ai-in-2023/"&gt;December 2023&lt;/a&gt;, &lt;a href="https://simonwillison.net/2024/Dec/31/llms-in-2024/"&gt;December 2024&lt;/a&gt; and &lt;a href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/"&gt;December 2025&lt;/a&gt;. I also presented &lt;a href="https://simonwillison.net/2025/Jun/6/six-months-in-llms/"&gt;The last six months in LLMs, illustrated by pelicans on bicycles&lt;/a&gt; at the AI Engineer World’s Fair in June 2025. This was my first time dropping the time covered to just three months, which neatly illustrates how much the space keeps accelerating and felt appropriate given the &lt;a href="https://simonwillison.net/2026/Jan/4/inflection/"&gt;November 2025 inflection point&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;(I further illustrated this acceleration by wearing a Gemini 3 sweater to the talk, which I was given a couple of weeks ago and is already out-of-date &lt;a href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/"&gt;thanks to Gemini 3.1&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;I always like to have at least one gimmick in any talk I give, based on the STAR moment principle I &lt;a href="https://simonwillison.net/2019/Dec/10/better-presentations/"&gt;learned at Stanford&lt;/a&gt; - include Something They'll Always Remember to try and help your talk stand out.&lt;/p&gt;
&lt;p&gt;For this talk I had two gimmicks. I built the first part of the talk around coding agent assisted data analysis of Kākāpō breeding season (which meant I got to &lt;a href="https://simonwillison.net/2026/Feb/8/kakapo-mug/"&gt;show off my mug&lt;/a&gt;), then did a quick tour of some new pelicans riding bicycles before ending with the reveal that the entire presentation had been presented using a new macOS app I had vibe coded in ~45 minutes the night before the talk.&lt;/p&gt;
&lt;h4 id="present-app"&gt;Present.app&lt;/h4&gt;
&lt;p&gt;The app is called &lt;strong&gt;Present&lt;/strong&gt; - literally the first name I thought of. It's built using Swift and SwiftUI and weighs in at 355KB, or &lt;a href="https://github.com/simonw/present/releases/tag/0.1a0"&gt;76KB compressed&lt;/a&gt;. Swift apps are tiny!&lt;/p&gt;
&lt;p&gt;It may have been quick to build but the combined set of features is something I've wanted for &lt;em&gt;years&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I usually use Keynote for presentations, but sometimes I like to mix things up by presenting using a sequence of web pages. I do this by loading up a browser window with a tab for each page, then clicking through those tabs in turn while I talk.&lt;/p&gt;
&lt;p&gt;This works great, but comes with a very scary disadvantage: if the browser crashes I've just lost my entire deck!&lt;/p&gt;
&lt;p&gt;I always have the URLs in a notes file, so I can click back to that and launch them all manually if I need to, but it's not something I'd like to deal with in the middle of a talk.&lt;/p&gt;
&lt;p&gt;This was &lt;a href="https://gisthost.github.io/?639d3c16dcece275af50f028b32480c7/page-001.html#msg-2026-02-21T05-53-43-395Z"&gt;my starting prompt&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Build a SwiftUI app for giving presentations where every slide is a URL. The app starts as a window with a webview on the right and a UI on the left for adding, removing and reordering the sequence of URLs. Then you click Play in a menu and the app goes full screen and the left and right keys switch between URLs&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That produced a plan. You can see &lt;a href="https://gisthost.github.io/?bfbc338977ceb71e298e4d4d5ac7d63c"&gt;the transcript that implemented that plan here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In Present a talk is an ordered sequence of URLs, with a sidebar UI for adding, removing and reordering those URLs. That's the entirety of the editing experience.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/present.jpg" alt="Screenshot of a macOS app window titled &amp;quot;Present&amp;quot; showing Google Image search results for &amp;quot;kakapo&amp;quot;. A web view shows a Google image search with thumbnail photos of kākāpō parrots with captions. A sidebar on the left shows a numbered list of URLs, mostly from simonwillison.net and static.simonwillison.net, with item 4 (https://www.google.com/search?...) highlighted in blue." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;When you select the "Play" option in the menu (or hit Cmd+Shift+P) the app switches to full screen mode. Left and right arrow keys navigate back and forth, and you can bump the font size up and down or scroll the page if you need to. Hit Escape when you're done.&lt;/p&gt;
&lt;p&gt;Crucially, Present saves your URLs automatically any time you make a change. If the app crashes you can start it back up again and restore your presentation state.&lt;/p&gt;
&lt;p&gt;You can also save presentations as a &lt;code&gt;.txt&lt;/code&gt; file (literally a newline-delimited sequence of URLs) and load them back up again later.&lt;/p&gt;
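&lt;p&gt;The file format really is as simple as it sounds - a round-trip looks like this (my own Ruby sketch of the format, not the app's Swift code):&lt;/p&gt;

```ruby
require "tempfile"

# Newline-delimited URLs: the entire persistence format of a Present deck.
urls = [
  "https://simonwillison.net/",
  "https://static.simonwillison.net/static/2026/present.jpg",
]

# Save: join with newlines and write.
file = Tempfile.new(["talk", ".txt"])
file.write(urls.join("\n"))
file.flush

# Load: read the lines back, stripping the trailing newlines.
loaded = File.readlines(file.path, chomp: true)
```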
&lt;h4 id="remote-controlled-via-my-phone"&gt;Remote controlled via my phone&lt;/h4&gt;
&lt;p&gt;Getting the initial app working took so little time that I decided to get more ambitious.&lt;/p&gt;
&lt;p&gt;It's neat having a remote control for a presentation...&lt;/p&gt;
&lt;p&gt;So I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add a web server which listens on 0.0.0.0:9123 - the web server serves a single mobile-friendly page with prominent left and right buttons - clicking those buttons switches the slide left and right - there is also a button to start presentation mode or stop depending on the mode it is in.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have &lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; on my laptop and my phone, which means I don't have to worry about Wi-Fi networks blocking access between the two devices. My phone can access &lt;code&gt;http://100.122.231.116:9123/&lt;/code&gt; directly from anywhere in the world and control the presentation running on my laptop.&lt;/p&gt;
&lt;p&gt;It took a few more iterative prompts to get to the final interface, which looked like this:&lt;/p&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="https://static.simonwillison.net/static/2026/present-mobile.jpg" alt="Mobile phone web browser app with large buttons, Slide 4/31 at the top, Prev, Next and Start buttons, a thin bar with a up/down scroll icon and text size + and - buttons and the current slide URL at the bottom." style="max-width: 80%;" /&gt;&lt;/p&gt;
&lt;p&gt;There's a slide indicator at the top, prev and next buttons, a nice big "Start" button and buttons for adjusting the font size.&lt;/p&gt;
&lt;p&gt;The most complex feature is that thin bar next to the start button. That's a touch-enabled scroll bar - you can slide your finger up and down on it to scroll the currently visible web page up and down on the screen.&lt;/p&gt;
&lt;p&gt;It's &lt;em&gt;very&lt;/em&gt; clunky but it works just well enough to solve the problem of a page loading with its most interesting content below the fold.&lt;/p&gt;
&lt;h4 id="learning-from-the-code"&gt;Learning from the code&lt;/h4&gt;
&lt;p&gt;I'd already &lt;a href="https://github.com/simonw/present"&gt;pushed the code to GitHub&lt;/a&gt; (with a big "This app was vibe coded [...] I make no promises other than it worked on my machine!" disclaimer) when I realized I should probably take a look at the code.&lt;/p&gt;
&lt;p&gt;I used this as an opportunity to document a recent pattern I've been using: asking the model to present a linear walkthrough of the entire codebase. Here's the resulting &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/linear-walkthroughs/"&gt;Linear walkthroughs&lt;/a&gt; pattern in my ongoing &lt;a href="https://simonwillison.net/2026/Feb/23/agentic-engineering-patterns/"&gt;Agentic Engineering Patterns guide&lt;/a&gt;, including the prompt I used.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/present/blob/main/walkthrough.md"&gt;resulting walkthrough document&lt;/a&gt; is genuinely useful. It turns out Claude Code decided to implement the web server for the remote control feature &lt;a href="https://github.com/simonw/present/blob/main/walkthrough.md#request-routing"&gt;using socket programming without a library&lt;/a&gt;! Here's the minimal HTTP parser it used for routing:&lt;/p&gt;
&lt;div class="highlight highlight-source-swift"&gt;&lt;pre&gt;    &lt;span class="pl-k"&gt;private&lt;/span&gt; &lt;span class="pl-en"&gt;func&lt;/span&gt; route&lt;span class="pl-kos"&gt;(&lt;/span&gt;_ raw&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-smi"&gt;String&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-smi"&gt;String&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;firstLine&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; raw&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;components&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;separatedBy&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;\r&lt;/span&gt;&lt;span class="pl-s"&gt;\n&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;first &lt;span class="pl-c1"&gt;??&lt;/span&gt; &lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;
        &lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;parts&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; firstLine&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;split&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;separator&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt; &lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
        &lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; parts&lt;span class="pl-kos"&gt;.&lt;/span&gt;count &lt;span class="pl-c1"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;2&lt;/span&gt; &lt;span class="pl-c1"&gt;?&lt;/span&gt; &lt;span class="pl-en"&gt;String&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-en"&gt;parts&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-k"&gt;:&lt;/span&gt; &lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;/&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;

        &lt;span class="pl-k"&gt;switch&lt;/span&gt; path &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;case&lt;/span&gt; &lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;/next&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt;
            state&lt;span class="pl-c1"&gt;&lt;span class="pl-c1"&gt;?&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;goToNext&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
            &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-en"&gt;jsonResponse&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;ok&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
        &lt;span class="pl-k"&gt;case&lt;/span&gt; &lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;/prev&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt;
            state&lt;span class="pl-c1"&gt;&lt;span class="pl-c1"&gt;?&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;goToPrevious&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
            &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-en"&gt;jsonResponse&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;ok&lt;/span&gt;&lt;span class="pl-s"&gt;"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
&lt;span class="pl-kos"&gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Using GET requests for state changes like that opens up some fun CSRF vulnerabilities. For this particular application I don't really care.&lt;/p&gt;
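&lt;p&gt;If I did care, the cheap fix would be to require a pre-shared token on the state-changing routes, so a GET fired by some other page on the network can't drive the slides. Here's a minimal sketch of that check in Python - the routes match the Swift snippet above, but the token and helper function are purely illustrative, not the app's actual code:&lt;/p&gt;

```python
from urllib.parse import urlparse, parse_qs

# Illustrative pre-shared secret - a real version could generate this at
# startup and print it to the console for the remote control to use.
SECRET = "correct-horse-battery-staple"

def allowed(path: str) -> bool:
    """Allow state-changing routes only when ?token= matches the secret."""
    parsed = urlparse(path)
    if parsed.path not in ("/next", "/prev"):
        return True  # non-mutating routes need no token
    query = parse_qs(parsed.query)
    return query.get("token") == [SECRET]

print(allowed("/next"))   # blocked: no token supplied
print(allowed("/next?token=correct-horse-battery-staple"))   # allowed
```

&lt;p&gt;The more standard answer is to require POST for anything that mutates state, but for a single-user presentation remote a token in the URL keeps the curl-friendly workflow intact.&lt;/p&gt;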
&lt;h4 id="expanding-our-horizons"&gt;Expanding our horizons&lt;/h4&gt;
&lt;p&gt;Vibe coding stories like this are ten a penny these days. I think this one is worth sharing for a few reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Swift, a language I don't know, was absolutely the right choice here. I wanted a full screen app that embedded web content and could be controlled over the network. Swift had everything I needed.&lt;/li&gt;
&lt;li&gt;When I finally did look at the code it was simple, straightforward and did exactly what I needed and not an inch more.&lt;/li&gt;
&lt;li&gt;This solved a real problem for me. I've always wanted a good way to serve a presentation as a sequence of pages, and now I have exactly that.&lt;/li&gt;
&lt;li&gt;I didn't have to open Xcode even once!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This doesn't mean native Mac developers are obsolete. I still used a whole bunch of my own accumulated technical knowledge (and the fact that I'd already installed Xcode and the like) to get this result, and someone who knew what they were doing could have built a far better solution in the same amount of time.&lt;/p&gt;
&lt;p&gt;It's a neat illustration of how those of us with software engineering experience can expand our horizons in fun and interesting directions. I'm no longer afraid of Swift! Next time I need a small, personal macOS app I know that it's achievable with our existing set of tools.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/swift"&gt;swift&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="macos"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="vibe-coding"/><category term="swift"/><category term="agentic-engineering"/><category term="november-2025-inflection"/></entry><entry><title>The A.I. Disruption We’ve Been Waiting for Has Arrived</title><link href="https://simonwillison.net/2026/Feb/18/the-ai-disruption/#atom-tag" rel="alternate"/><published>2026-02-18T17:07:31+00:00</published><updated>2026-02-18T17:07:31+00:00</updated><id>https://simonwillison.net/2026/Feb/18/the-ai-disruption/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.nytimes.com/2026/02/18/opinion/ai-software.html?unlocked_article_code=1.NFA.UkLv.r-XczfzYRdXJ&amp;amp;smid=url-share"&gt;The A.I. Disruption We’ve Been Waiting for Has Arrived&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New opinion piece from Paul Ford in the New York Times. Unsurprisingly for a piece by Paul, it's packed with quoteworthy snippets, but a few stood out for me in particular.&lt;/p&gt;
&lt;p&gt;Paul describes the &lt;a href="https://simonwillison.net/2026/Jan/4/inflection/"&gt;November moment&lt;/a&gt; that so many other programmers have observed, and highlights Claude Code's ability to revive old side projects:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Claude Code] was always a helpful coding assistant, but in November it suddenly got much better, and ever since I’ve been knocking off side projects that had sat in folders for a decade or longer. It’s fun to see old ideas come to life, so I keep a steady flow. Maybe it adds up to a half-hour a day of my time, and an hour of Claude’s.&lt;/p&gt;
&lt;p&gt;November was, for me and many others in tech, a great surprise. Before, A.I. coding tools were often useful, but halting and clumsy. Now, the bot can run for a full hour and make whole, designed websites and apps that may be flawed, but credible. I spent an entire session of therapy talking about it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And as the former CEO of a respected consultancy firm (Postlight) he's well positioned to evaluate the potential impact:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When you watch a large language model slice through some horrible, expensive problem — like migrating data from an old platform to a modern one — you feel the earth shifting. I was the chief executive of a software services firm, which made me a professional software cost estimator. When I rebooted my messy personal website a few weeks ago, I realized: I would have paid $25,000 for someone else to do this. When a friend asked me to convert a large, thorny data set, I downloaded it, cleaned it up and made it pretty and easy to explore. In the past I would have charged $350,000.&lt;/p&gt;
&lt;p&gt;That last price is full 2021 retail — it implies a product manager, a designer, two engineers (one senior) and four to six months of design, coding and testing. Plus maintenance. Bespoke software is joltingly expensive. Today, though, when the stars align and my prompts work out, I can do hundreds of thousands of dollars worth of work for fun (fun for me) over weekends and evenings, for the price of the Claude $200-a-month plan.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He also neatly captures the inherent community tension involved in exploring this technology:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All of the people I love hate this stuff, and all the people I hate love it. And yet, likely because of the same personality flaws that drew me to technology in the first place, I am annoyingly excited.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/new-york-times"&gt;new-york-times&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-ford"&gt;paul-ford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/careers"&gt;careers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-blue"&gt;deep-blue&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;



</summary><category term="new-york-times"/><category term="paul-ford"/><category term="careers"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="ai-ethics"/><category term="coding-agents"/><category term="claude-code"/><category term="deep-blue"/><category term="november-2025-inflection"/></entry><entry><title>LLM predictions for 2026, shared with Oxide and Friends</title><link href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#atom-tag" rel="alternate"/><published>2026-01-08T19:42:13+00:00</published><updated>2026-01-08T19:42:13+00:00</updated><id>https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#atom-tag</id><summary type="html">
    &lt;p&gt;I joined a recording of the Oxide and Friends podcast on Tuesday to talk about 1, 3 and 6 year predictions for the tech industry. This is my second appearance on their annual predictions episode; you can see &lt;a href="https://simonwillison.net/2025/Jan/10/ai-predictions/"&gt;my predictions from January 2025 here&lt;/a&gt;. Here's &lt;a href="https://oxide-and-friends.transistor.fm/episodes/predictions-2026"&gt;the page for this year's episode&lt;/a&gt;, with options to listen in all of your favorite podcast apps or &lt;a href="https://www.youtube.com/watch?v=lVDhQMiAbR8"&gt;directly on YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Bryan Cantrill started the episode by declaring that he's never been so unsure about what's coming in the next year. I share that uncertainty - the significant advances in coding agents in just the last two months have left me certain that things will change, but unclear as to what those changes will be.&lt;/p&gt;
&lt;p&gt;Here are the predictions I shared in the episode.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#1-year-it-will-become-undeniable-that-llms-write-good-code"&gt;1 year: It will become undeniable that LLMs write good code&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#1-year-we-re-finally-going-to-solve-sandboxing"&gt;1 year: We're finally going to solve sandboxing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#1-year-a-challenger-disaster-for-coding-agent-security"&gt;1 year: A "Challenger disaster" for coding agent security&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#1-year-k-k-p-parrots-will-have-an-outstanding-breeding-season"&gt;1 year: Kākāpō parrots will have an outstanding breeding season&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#3-years-the-coding-agents-jevons-paradox-for-software-engineering-will-resolve-one-way-or-the-other"&gt;3 years: the coding agents Jevons paradox for software engineering will resolve, one way or the other&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#3-years-someone-will-build-a-new-browser-using-mainly-ai-assisted-coding-and-it-won-t-even-be-a-surprise"&gt;3 years: Someone will build a new browser using mainly AI-assisted coding and it won't even be a surprise&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#6-years-typing-code-by-hand-will-go-the-way-of-punch-cards"&gt;6 years: Typing code by hand will go the way of punch cards&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="1-year-it-will-become-undeniable-that-llms-write-good-code"&gt;1 year: It will become undeniable that LLMs write good code &lt;a href="https://www.youtube.com/watch?v=lVDhQMiAbR8&amp;amp;t=1167s" class="predictions-video-link"&gt;▶ 19:27&lt;/a&gt;&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;I think that there are still people out there who are convinced that LLMs cannot write good code. Those people are in for a very nasty shock in 2026. I do not think it will be possible to get to the end of even the next three months while still holding on to that idea that the code they write is all junk and it's likely any decent human programmer will write better code than they will.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In 2023, saying that LLMs write garbage code was entirely correct. For most of 2024 that stayed true. In 2025 that changed, but you could be forgiven for continuing to hold out. In 2026 the quality of LLM-generated code will become impossible to deny.&lt;/p&gt;
&lt;p&gt;I base this on my own experience - I've spent more time exploring &lt;a href="https://simonwillison.net/tags/ai-assisted-programming/"&gt;AI-assisted programming&lt;/a&gt; than most.&lt;/p&gt;
&lt;p&gt;The key change in 2025 (see &lt;a href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-year-of-reasoning-"&gt;my overview for the year&lt;/a&gt;) was the introduction of "reasoning models" trained specifically against code using Reinforcement Learning. The major labs spent a full year competing with each other on who could get the best code capabilities from their models, and that problem turns out to be perfectly attuned to RL since code challenges come with built-in verifiable success conditions.&lt;/p&gt;
&lt;p&gt;Since Claude Opus 4.5 and GPT-5.2 came out in November and December respectively, the amount of code I've written by hand has dropped to a single digit percentage of my overall output. The same is true for many other expert programmers I know.&lt;/p&gt;
&lt;p&gt;At this point if you continue to argue that LLMs write useless code you're damaging your own credibility.&lt;/p&gt;
&lt;h4 id="1-year-we-re-finally-going-to-solve-sandboxing"&gt;1 year: We're finally going to solve sandboxing &lt;a href="https://www.youtube.com/watch?v=lVDhQMiAbR8&amp;amp;t=1205s" class="predictions-video-link"&gt;▶ 20:05&lt;/a&gt;&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;I think this year is the year we're going to solve sandboxing. I want to run code other people have written on my computing devices without it destroying my computing devices if it's malicious or has bugs. [...] It's crazy that it's 2026 and I still &lt;code&gt;pip install&lt;/code&gt; random code and then execute it in a way that it can steal all of my data and delete all my files. [...] I don't want to run a piece of code on any of my devices that somebody else wrote outside of sandbox ever again.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This isn't just about LLMs, but it becomes even more important now that so many more people are writing code, often without knowing what they're doing. Sandboxing is also a key part of the battle against prompt injection.&lt;/p&gt;
&lt;p&gt;We have a &lt;em&gt;lot&lt;/em&gt; of promising technologies in play already for this - containers and WebAssembly being the two I'm most optimistic about. There's real commercial value involved in solving this problem. The pieces are there; what's needed is UX work to reduce the friction in using them productively and securely.&lt;/p&gt;
&lt;h4 id="1-year-a-challenger-disaster-for-coding-agent-security"&gt;1 year: A "Challenger disaster" for coding agent security &lt;a href="https://www.youtube.com/watch?v=lVDhQMiAbR8&amp;amp;t=1281s" class="predictions-video-link"&gt;▶ 21:21&lt;/a&gt;&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;I think we're due a Challenger disaster with respect to coding agent security[...] I think so many people, myself included, are running these coding agents practically as root, right? We're letting them do all of this stuff. And every time I do it, my computer doesn't get wiped. I'm like, "oh, it's fine". [...] The worst version of this is the worm - a prompt injection worm which infects people's computers and adds itself to the Python or NPM packages that person has access to.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I used this as an opportunity to promote my favourite recent essay about AI security, &lt;a href="https://embracethered.com/blog/posts/2025/the-normalization-of-deviance-in-ai/"&gt;the Normalization of Deviance in AI&lt;/a&gt; by Johann Rehberger.&lt;/p&gt;
&lt;p&gt;The Normalization of Deviance describes the phenomenon where people and organizations get used to operating in an unsafe manner because nothing bad has happened to them yet, which can result in enormous problems (like the 1986 Challenger disaster) when their luck runs out.&lt;/p&gt;
&lt;p&gt;Every six months I predict that a headline-grabbing prompt injection attack is coming soon, and every six months it doesn't happen. This is my most recent version of that prediction!&lt;/p&gt;
&lt;h4 id="1-year-k-k-p-parrots-will-have-an-outstanding-breeding-season"&gt;1 year: Kākāpō parrots will have an outstanding breeding season &lt;a href="https://www.youtube.com/watch?v=lVDhQMiAbR8&amp;amp;t=3006s" class="predictions-video-link"&gt;▶ 50:06&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;(I dropped this one in to lighten the mood after a discussion of the deep sense of existential dread that many programmers are feeling right now!)&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I think that Kākāpō parrots in New Zealand are going to have an outstanding breeding season. The reason I think this is that the Rimu trees are in fruit right now. There's only 250 of them, and they only breed if the Rimu trees have a good fruiting. The Rimu trees have been terrible since 2019, but this year the Rimu trees were all blooming. There are researchers saying that all 87 females of breeding age might lay an egg. And for a species with only 250 remaining parrots that's great news.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(I just &lt;a href="https://en.wikipedia.org/wiki/K%C4%81k%C4%81p%C5%8D#Population_timeline"&gt;checked Wikipedia&lt;/a&gt; and I was right about the parrot numbers but wrong about the last good breeding season; apparently 2022 was a good year too.)&lt;/p&gt;
&lt;p&gt;In a year with precious little in the way of good news I am utterly delighted to share this story. Here's more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://blog.doc.govt.nz/2025/06/27/kakapo-breeding-season-2026/"&gt;Kākāpō breeding season 2026&lt;/a&gt; introduction from the Department of Conservation from June 2025 .&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.auckland.ac.nz/en/news/2025/12/03/bumper-breeding-season-for-kakapo-on-the-cards.html"&gt;Bumper breeding season for kākāpō on the cards&lt;/a&gt; - 3rd December 2025, University of Auckland.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I don't often use AI-generated images on this blog, but the Kākāpō image the Oxide team created for this episode is just &lt;em&gt;perfect&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/oxide-kakapo.jpg" alt="A beautiful green Kākāpō surrounded by candles gazes into a crystal ball" style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;h4 id="3-years-the-coding-agents-jevons-paradox-for-software-engineering-will-resolve-one-way-or-the-other"&gt;3 years: the coding agents Jevons paradox for software engineering will resolve, one way or the other &lt;a href="https://www.youtube.com/watch?v=lVDhQMiAbR8&amp;amp;t=3277s" class="predictions-video-link"&gt;▶ 54:37&lt;/a&gt;&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;We will find out if the &lt;a href="https://en.wikipedia.org/wiki/Jevons_paradox"&gt;Jevons paradox&lt;/a&gt; saves our careers or not. This is a big question that anyone who's a software engineer has right now: we are driving the cost of actually producing working code down to a fraction of what it used to cost. Does that mean that our careers are completely devalued and we all have to learn to live on a tenth of our incomes, or does it mean that the demand for software, for custom software goes up by a factor of 10 and now our skills are even &lt;em&gt;more&lt;/em&gt; valuable because you can hire me and I can build you 10 times the software I used to be able to? I think by three years we will know for sure which way that one went.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The quote says it all. There are two ways this coding agents thing could go: it could turn out software engineering skills are devalued, or it could turn out we're more valuable and effective than ever before.&lt;/p&gt;
&lt;p&gt;I'm crossing my fingers for the latter! So far it feels to me like it's working out that way.&lt;/p&gt;

&lt;h4 id="3-years-someone-will-build-a-new-browser-using-mainly-ai-assisted-coding-and-it-won-t-even-be-a-surprise"&gt;3 years: Someone will build a new browser using mainly AI-assisted coding and it won't even be a surprise &lt;a href="https://www.youtube.com/watch?v=lVDhQMiAbR8&amp;amp;t=3913s" class="predictions-video-link"&gt;▶ 65:13&lt;/a&gt;&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;I think somebody will have built a full web browser mostly using AI assistance, and it won't even be surprising. Rolling a new web browser is one of the most complicated software projects I can imagine[...] the cheat code is the conformance suites. If there are existing tests it'll get so much easier.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A common complaint today from AI coding skeptics is that LLMs are fine for toy projects but can't be used for anything large and serious.&lt;/p&gt;
&lt;p&gt;I think within 3 years that will be comprehensively proven incorrect, to the point that it won't even be controversial anymore.&lt;/p&gt;
&lt;p&gt;I picked a web browser here because so much of the work building a browser involves writing code that has to conform to an enormous and daunting selection of both formal tests and informal websites-in-the-wild.&lt;/p&gt;
&lt;p&gt;Coding agents are &lt;em&gt;really good&lt;/em&gt; at tasks where you can define a concrete goal and then set them to work iterating in that direction.&lt;/p&gt;
&lt;p&gt;A web browser is the most ambitious project I can think of that leans into those capabilities.&lt;/p&gt;

&lt;h4 id="6-years-typing-code-by-hand-will-go-the-way-of-punch-cards"&gt;6 years: Typing code by hand will go the way of punch cards &lt;a href="https://www.youtube.com/watch?v=lVDhQMiAbR8&amp;amp;t=4839s" class="predictions-video-link"&gt;▶ 80:39&lt;/a&gt;&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;I think the job of being paid money to type code into a computer will go the same way as punching punch cards [...] in six years time, I do not think anyone will be paid just to do the thing where you type the code. I think software engineering will still be an enormous career. I just think the software engineers won't be spending multiple hours of their day in a text editor typing out syntax.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The more time I spend on AI-assisted programming the less afraid I am for my job, because it turns out building software - especially at the rate it's now possible to build - still requires enormous skill, experience and depth of understanding.&lt;/p&gt;
&lt;p&gt;The skills are changing though! Being able to read a detailed specification and transform it into lines of code is the thing that's being automated away. What's left is everything else, and the more time I spend working with coding agents the larger that "everything else" becomes.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/predictions"&gt;predictions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sandboxing"&gt;sandboxing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kakapo"&gt;kakapo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/oxide"&gt;oxide&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bryan-cantrill"&gt;bryan-cantrill&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/conformance-suites"&gt;conformance-suites&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/browser-challenge"&gt;browser-challenge&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deep-blue"&gt;deep-blue&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="predictions"/><category term="sandboxing"/><category term="ai"/><category term="kakapo"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="oxide"/><category term="bryan-cantrill"/><category term="coding-agents"/><category term="conformance-suites"/><category term="browser-challenge"/><category term="deep-blue"/><category term="november-2025-inflection"/></entry><entry><title>The November 2025 inflection point</title><link href="https://simonwillison.net/2026/Jan/4/inflection/#atom-tag" rel="alternate"/><published>2026-01-04T23:21:42+00:00</published><updated>2026-01-04T23:21:42+00:00</updated><id>https://simonwillison.net/2026/Jan/4/inflection/#atom-tag</id><summary type="html">
    &lt;p&gt;It genuinely feels to me like GPT-5.2 and Opus 4.5 in November represent an inflection point - one of those moments where the models get incrementally better in a way that tips across an invisible capability line where suddenly a whole bunch of much harder coding problems open up.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-4"&gt;claude-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="claude-4"/><category term="gpt-5"/><category term="november-2025-inflection"/></entry><entry><title>I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in 4.5 hours</title><link href="https://simonwillison.net/2025/Dec/15/porting-justhtml/#atom-tag" rel="alternate"/><published>2025-12-15T23:58:38+00:00</published><updated>2025-12-15T23:58:38+00:00</updated><id>https://simonwillison.net/2025/Dec/15/porting-justhtml/#atom-tag</id><summary type="html">
    &lt;p&gt;I &lt;a href="https://simonwillison.net/2025/Dec/14/justhtml/"&gt;wrote about JustHTML yesterday&lt;/a&gt; - Emil Stenström's project to build a new standards compliant HTML5 parser in pure Python code using coding agents running against the comprehensive html5lib-tests testing library. Last night, purely out of curiosity, I decided to try &lt;strong&gt;porting JustHTML from Python to JavaScript&lt;/strong&gt; with the least amount of effort possible, using Codex CLI and GPT-5.2. It worked beyond my expectations.&lt;/p&gt;
&lt;h4 id="tl-dr"&gt;TL;DR&lt;/h4&gt;
&lt;p&gt;I built &lt;a href="https://github.com/simonw/justjshtml"&gt;simonw/justjshtml&lt;/a&gt;, a dependency-free HTML5 parsing library in JavaScript which passes 9,200 tests from the html5lib-tests suite and imitates the API design of Emil's JustHTML library.&lt;/p&gt;
&lt;p&gt;It took two initial prompts and a few tiny follow-ups. &lt;a href="https://simonwillison.net/2025/Dec/11/gpt-52/"&gt;GPT-5.2&lt;/a&gt; running in &lt;a href="https://github.com/openai/codex"&gt;Codex CLI&lt;/a&gt; ran uninterrupted for several hours, burned through 1,464,295 input tokens, 97,122,176 cached input tokens and 625,563 output tokens, and ended up producing 9,000 lines of fully tested JavaScript across 43 commits.&lt;/p&gt;
&lt;p&gt;Time elapsed from project idea to finished library: about 4 hours, during which I also bought and decorated a Christmas tree with family and watched the latest Knives Out movie.&lt;/p&gt;
&lt;h4 id="some-background"&gt;Some background&lt;/h4&gt;
&lt;p&gt;One of the most important contributions of the HTML5 specification ten years ago was the way it precisely specified how &lt;em&gt;invalid&lt;/em&gt; HTML should be parsed. The world is full of invalid documents and having a specification that covers those means browsers can treat them in the same way - there's no more "undefined behavior" to worry about when building parsing software.&lt;/p&gt;
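&lt;p&gt;A toy illustration of what "specified recovery" means: HTML5 says a new &lt;code&gt;p&lt;/code&gt; start tag implicitly closes any paragraph that is still open, so &lt;code&gt;&amp;lt;p&amp;gt;one&amp;lt;p&amp;gt;two&amp;lt;/code&gt; must parse to two sibling paragraphs in every conforming parser. This Python sketch hard-codes just that one rule - the real algorithm is a sprawling state machine, and none of this resembles JustHTML's actual code:&lt;/p&gt;

```python
import re

# One HTML5 recovery rule, hard-coded: a <p> start tag closes any
# paragraph that is already open, because paragraphs never nest.
TOKEN = re.compile(r"<(/?)p>|([^<]+)")

def paragraphs(html: str) -> list[str]:
    """Collect the text of each paragraph from a flat run of <p> tags."""
    out, current = [], None
    for closing, text in TOKEN.findall(html):
        if text:
            if current is not None:
                current.append(text)
        elif closing:   # explicit </p> ends the paragraph
            current = None
        else:           # <p> - implicitly closes the previous paragraph
            current = []
            out.append(current)
    return ["".join(p) for p in out]

print(paragraphs("<p>one<p>two"))       # ['one', 'two']
print(paragraphs("<p>a</p><p>b</p>"))   # ['a', 'b']
```

&lt;p&gt;Interoperability testing for this is essentially thousands of input/expected-tree pairs like these, one for every such rule and edge case.&lt;/p&gt;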
&lt;p&gt;Unsurprisingly, those invalid parsing rules are pretty complex! The free online book &lt;a href="https://htmlparser.info/"&gt;Idiosyncrasies of the HTML parser&lt;/a&gt; by Simon Pieters is an excellent deep dive into this topic, in particular &lt;a href="https://htmlparser.info/parser/"&gt;Chapter 3. The HTML parser&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python &lt;a href="https://github.com/html5lib/html5lib-python"&gt;html5lib&lt;/a&gt; project started the &lt;a href="https://github.com/html5lib/html5lib-tests"&gt;html5lib-tests&lt;/a&gt; repository with a set of implementation-independent tests. These have since become the gold standard for interoperability testing of HTML5 parsers, and are used by projects such as &lt;a href="https://github.com/servo/servo"&gt;Servo&lt;/a&gt;, which relied on them to help build &lt;a href="https://github.com/servo/html5ever"&gt;html5ever&lt;/a&gt;, a "high-performance browser-grade HTML5 parser" written in Rust.&lt;/p&gt;
&lt;p&gt;Emil Stenström's &lt;a href="https://github.com/EmilStenstrom/justhtml"&gt;JustHTML&lt;/a&gt; project is a pure-Python implementation of an HTML5 parser that passes the full html5lib-tests suite. Emil &lt;a href="https://friendlybit.com/python/writing-justhtml-with-coding-agents/"&gt;spent a couple of months&lt;/a&gt; working on this as a side project, deliberately picking a problem with a comprehensive existing test suite to see how far he could get with coding agents.&lt;/p&gt;
&lt;p&gt;At one point he had the agents rewrite it based on a close inspection of the Rust html5ever library. I don't know how much of this was direct translation versus inspiration (here's Emil's &lt;a href="https://news.ycombinator.com/item?id=46264195#46267059"&gt;commentary on that&lt;/a&gt;) - his project has 1,215 commits in total, so it appears to have included a huge amount of iteration, not just a straight port.&lt;/p&gt;
&lt;p&gt;My project &lt;strong&gt;is&lt;/strong&gt; a straight port. I instructed Codex CLI to build a JavaScript version of Emil's Python code.&lt;/p&gt;
&lt;h4 id="the-process-in-detail"&gt;The process in detail&lt;/h4&gt;
&lt;p&gt;I started with a bit of mise en place. I checked out two repos and created an empty third directory for the new project:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/dev
git clone https://github.com/EmilStenstrom/justhtml
git clone https://github.com/html5lib/html5lib-tests
mkdir justjshtml
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; justjshtml&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I started Codex CLI for GPT-5.2 like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;codex --yolo -m gpt-5.2&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That &lt;code&gt;--yolo&lt;/code&gt; flag is a shortcut for &lt;code&gt;--dangerously-bypass-approvals-and-sandbox&lt;/code&gt;, which is every bit as dangerous as it sounds.&lt;/p&gt;
&lt;p&gt;My first prompt told Codex to inspect the existing code and use it to build a specification for the new JavaScript library:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;We are going to create a JavaScript port of ~/dev/justhtml - an HTML parsing library that passes the full ~/dev/html5lib-tests test suite. It is going to have a similar API to the Python library but in JavaScript. It will have no dependencies other than raw JavaScript, hence it will work great in the browser and node.js and other environments. Start by reading ~/dev/justhtml and designing the user-facing API for the new library - create a spec.md containing your plan.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I reviewed the spec, which included a set of proposed milestones, and told it to add another:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Add an early step to the roadmap that involves an initial version that parses a simple example document that is valid and returns the right results. Then add and commit the spec.md file.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/justjshtml/blob/19b8eb1f2ca80f428a3c40862d5ec05d36e5166b/spec.md"&gt;the resulting spec.md file&lt;/a&gt;. My request for that initial version became "Milestone 0.5" which looked like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Milestone 0.5 — End-to-end smoke parse (single valid document)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implement the smallest end-to-end slice so the public API is real early:
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;new JustHTML("&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;/head&amp;gt;&amp;lt;body&amp;gt;&amp;lt;p&amp;gt;Hello&amp;lt;/p&amp;gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;")&lt;/code&gt; returns a tree with the expected tag structure and text nodes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;doc.toText()&lt;/code&gt; returns &lt;code&gt;"Hello"&lt;/code&gt; and &lt;code&gt;doc.errors&lt;/code&gt; is empty for this valid input.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;scripts/smoke.js&lt;/code&gt; (no deps) that runs the example and asserts the expected structure/output.&lt;/li&gt;
&lt;li&gt;Gate: &lt;code&gt;node scripts/smoke.js&lt;/code&gt; passes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Implement Milestone 0.5&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And off it went. The resulting code appeared to work so I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Configure GitHub Actions test.yml to run that on every commit, then commit and push&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At this point I ran &lt;code&gt;git init&lt;/code&gt;, created a private GitHub repository for the project, and set up the local directory to push to that remote.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/justjshtml/commit/48497cbd05584c88af757a0736a2ee447468b6b3"&gt;that initial push&lt;/a&gt;. Then I told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;OK do the rest, commit and push often&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And that was almost it!&lt;/p&gt;
&lt;p&gt;I set my laptop to not fall asleep and left it to its own devices while we went off to buy a Christmas tree. The "commit and push often" meant I could monitor its progress on my phone by refreshing &lt;a href="https://github.com/simonw/justjshtml/commits/main"&gt;the commit log on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I was running this against my $20/month ChatGPT Plus account, which has a five-hour token allowance window for Codex CLI. That ran out at 6:35pm and Codex paused, so I waited until the reset point at 7:14pm and typed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Keep going&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At 9:30pm it declared itself done with the following summary message:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;• What’s Done

  - html5lib-tests now run in CI and pass: tokenizer 6810/6810, tree 1770/1782
    (12 skipped: #script-on), serializer 230/230, encoding 82/83 (1 skipped:
    scripted) (.github/workflows/test.yml:1).
  - Public API expanded: stream() (src/stream.js:1), CSS selectors query()/
    matches() (src/selector.js:1), and toMarkdown() (src/markdown.js:1), all
    exported via src/index.js:1.
  - Added no-deps unit scripts for selectors/stream/markdown (scripts/run-
    selector-tests.js:1, scripts/run-stream-tests.js:1, scripts/run-markdown-
    tests.js:1) and wired them into CI.
  - Fixed TreeBuilder so &amp;lt;br&amp;gt; no longer closes &amp;lt;p&amp;gt; (src/treebuilder.js:647).
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a finishing touch, I had it add a playground interface so I could try out the new library in my browser. I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Add a playground.html in the top level folder which loads the necessary ES modules from ./src/... and implements the exact same functionality as seen on https://tools.simonwillison.net/justhtml but using the JavaScript library instead of Pyodide&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It fetched my &lt;a href="https://tools.simonwillison.net/justhtml"&gt;existing JustHTML playground page&lt;/a&gt; (&lt;a href="https://simonwillison.net/2025/Dec/14/justhtml/#first-impressions-of-justhtml"&gt;described here&lt;/a&gt;) using &lt;code&gt;curl&lt;/code&gt; and built a new &lt;code&gt;playground.html&lt;/code&gt; file that loaded the new JavaScript code instead. This worked &lt;em&gt;perfectly&lt;/em&gt;.&lt;/p&gt;
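&lt;p&gt;Because the library is plain ES modules with no build step, a playground page can import it directly in the browser. A minimal sketch of that wiring - the entry point path and export name here are my assumptions for illustration, not the exact playground code:&lt;/p&gt;

```html
<!-- Hypothetical wiring: ./src/index.js and the JustHTML export name
     are assumptions based on the spec, not the actual playground code -->
<script type="module">
  import { JustHTML } from "./src/index.js";

  const doc = new JustHTML("<p>Hello</p>");
  console.log(doc.toText()); // per the spec's Milestone 0.5, "Hello"
</script>
```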

&lt;p&gt;I enabled GitHub Pages for my still-private repo which meant I could access the new playground at this URL:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://simonw.github.io/justjshtml/playground.html"&gt;https://simonw.github.io/justjshtml/playground.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/justjshtml-playground.jpg" alt="Screenshot of JustJSHTML Playground web application. Header reads &amp;quot;JustJSHTML Playground&amp;quot; with subtitle &amp;quot;A dependency-free JavaScript HTML5 parser - GitHub&amp;quot;. Below is a status bar showing &amp;quot;JavaScript Environment&amp;quot; with a green &amp;quot;Ready&amp;quot; badge. The main input area has &amp;quot;Paste HTML&amp;quot; and &amp;quot;Fetch from URL&amp;quot; buttons, with a text area containing HTML code: &amp;quot;&amp;lt;!DOCTYPE html&amp;gt; &amp;lt;html&amp;gt; &amp;lt;head&amp;gt; &amp;lt;title&amp;gt;Example Page&amp;lt;/title&amp;gt; &amp;lt;/head&amp;gt; &amp;lt;body&amp;gt; &amp;lt;header&amp;gt; &amp;lt;nav&amp;gt; &amp;lt;ul&amp;gt;&amp;quot;. A &amp;quot;Playground Mode&amp;quot; section shows buttons for &amp;quot;CSS Selector Query&amp;quot;, &amp;quot;Pretty Print HTML&amp;quot;, &amp;quot;Tree Structure&amp;quot;, &amp;quot;Stream Events&amp;quot;, &amp;quot;Extract Text&amp;quot;, and &amp;quot;To Markdown&amp;quot; (highlighted in purple). Below is a text field labeled &amp;quot;CSS Selector (optional - leave empty for whole document):&amp;quot; with placeholder &amp;quot;e.g., article, main, .content (or leave empty)&amp;quot; and a green &amp;quot;Convert to Markdown&amp;quot; button. The Output section has a teal header with &amp;quot;Whole document&amp;quot; badge and displays converted markdown: &amp;quot;Example Page&amp;quot; followed by &amp;quot;- [Home](/)&amp;quot; &amp;quot;- [About](/about)&amp;quot; &amp;quot;- [Contact](/contact)&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;All it needed now was some documentation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Add a comprehensive README with full usage instructions including attribution plus how this was built plus how to use in in HTML plus how to use it in Node.js&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can &lt;a href="https://github.com/simonw/justjshtml/blob/f3a33fdb29bf97846fd017185edc8cf82783032e/README.md"&gt;read the result here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are now at eight prompts total, running for just over four hours, and I've decorated for Christmas and watched &lt;a href="https://en.wikipedia.org/wiki/Wake_Up_Dead_Man"&gt;Wake Up Dead Man&lt;/a&gt; on Netflix.&lt;/p&gt;
&lt;p&gt;According to Codex CLI:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Token usage: total=2,089,858 input=1,464,295 (+ 97,122,176 cached) output=625,563 (reasoning 437,010)&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My &lt;a href="https://www.llm-prices.com/#it=2089858&amp;amp;cit=97122176&amp;amp;ot=625563&amp;amp;sel=gpt-5.2"&gt;llm-prices.com calculator&lt;/a&gt; estimates that at $29.41 if I were paying for those tokens at API prices, but they were included in my $20/month ChatGPT Plus subscription so the actual extra cost to me was zero.&lt;/p&gt;
&lt;h4 id="what-can-we-learn-from-this-"&gt;What can we learn from this?&lt;/h4&gt;
&lt;p&gt;I'm sharing this project because I think it demonstrates a bunch of interesting things about the state of LLMs in December 2025.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Frontier LLMs really can perform complex, multi-hour tasks with hundreds of tool calls and minimal supervision. I used GPT-5.2 for this but I have no reason to believe that Claude Opus 4.5 or Gemini 3 Pro would not be able to achieve the same thing - the only reason I haven't tried is that I don't want to burn another 4 hours of time and several million tokens on more runs.&lt;/li&gt;
&lt;li&gt;If you can reduce a problem to a robust test suite you can set a coding agent loop loose on it with a high degree of confidence that it will eventually succeed. I called this &lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/"&gt;designing the agentic loop&lt;/a&gt; a few months ago. I think it's the key skill to unlocking the potential of LLMs for complex tasks.&lt;/li&gt;
&lt;li&gt;Porting entire open source libraries from one language to another via a coding agent works extremely well.&lt;/li&gt;
&lt;li&gt;Code is so cheap it's practically free. Code that &lt;em&gt;works&lt;/em&gt; continues to carry a cost, but that cost has plummeted now that coding agents can check their work as they go.&lt;/li&gt;
&lt;li&gt;We haven't even &lt;em&gt;begun&lt;/em&gt; to unpack the etiquette and ethics around this style of development. Is it responsible and appropriate to churn out a direct port of a library like this in a few hours while watching a movie? What would it take for code built like this to be trusted in production?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'll end with some open questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does this library represent a legal violation of copyright of either the Rust library or the Python one?&lt;/li&gt;
&lt;li&gt;Even if this is legal, is it ethical to build a library in this way?&lt;/li&gt;
&lt;li&gt;Does this format of development hurt the open source ecosystem?&lt;/li&gt;
&lt;li&gt;Can I even assert copyright over this, given how much of the work was produced by the LLM?&lt;/li&gt;
&lt;li&gt;Is it responsible to publish software libraries built in this way?&lt;/li&gt;
&lt;li&gt;How much better would this library be if an expert team hand crafted it over the course of several months?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Update 11th January 2026&lt;/strong&gt;: I originally ended this post with just these open questions, but I've now provided &lt;a href="https://simonwillison.net/2026/Jan/11/answers/"&gt;my own answers to the questions&lt;/a&gt; in a new post.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="html"/><category term="javascript"/><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="gpt-5"/><category term="codex-cli"/><category term="november-2025-inflection"/></entry><entry><title>Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult</title><link href="https://simonwillison.net/2025/Nov/24/claude-opus/#atom-tag" rel="alternate"/><published>2025-11-24T19:37:07+00:00</published><updated>2025-11-24T19:37:07+00:00</updated><id>https://simonwillison.net/2025/Nov/24/claude-opus/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;a href="https://www.anthropic.com/news/claude-opus-4-5"&gt;released Claude Opus 4.5&lt;/a&gt; this morning, which they call "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/"&gt;GPT-5.1-Codex-Max&lt;/a&gt; and Google's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3&lt;/a&gt;, both released within the past week!&lt;/p&gt;
&lt;p&gt;The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February).&lt;/p&gt;
&lt;p&gt;The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for &amp;gt;200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-5#key-improvements-in-opus-4-5-over-opus-4-1"&gt;Key improvements in Opus 4.5 over Opus 4.1&lt;/a&gt; document has a few more interesting details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Opus 4.5 has a new &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort"&gt;effort parameter&lt;/a&gt; which defaults to high but can be set to medium or low for faster responses.&lt;/li&gt;
&lt;li&gt;The model supports &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool"&gt;enhanced computer use&lt;/a&gt;, specifically a &lt;code&gt;zoom&lt;/code&gt; tool which you can provide to Opus 4.5 to allow it to request a zoomed in region of the screen to inspect.&lt;/li&gt;
&lt;li&gt;"&lt;a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking#thinking-block-preservation-in-claude-opus-4-5"&gt;Thinking blocks from previous assistant turns are preserved in model context by default&lt;/a&gt;" - apparently previous Anthropic models discarded those.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in &lt;a href="https://simonwillison.net/2025/Nov/24/sqlite-utils-40a1/"&gt;a new alpha release of sqlite-utils&lt;/a&gt; that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across &lt;a href="https://github.com/simonw/sqlite-utils/compare/10957305be998999e3c95c11863b5709d42b7ae3...4.0a1"&gt;20 commits, 39 files changed, 2,022 additions and 1,173 deletions&lt;/a&gt; in a two-day period. Here's the &lt;a href="https://gistpreview.github.io/?f40971b693024fbe984a68b73cc283d2"&gt;Claude Code transcript&lt;/a&gt; where I had it help implement one of the more complicated new features.&lt;/p&gt;
&lt;p&gt;It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in &lt;a href="https://github.com/simonw/sqlite-utils/milestone/7?closed=1"&gt;the milestone for the alpha&lt;/a&gt;. I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model.&lt;/p&gt;
&lt;p&gt;With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected.&lt;/p&gt;
&lt;p&gt;I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two.&lt;/p&gt;
&lt;p&gt;This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.&lt;/p&gt;
&lt;p&gt;Google's Nano Banana Pro image generation model was notable in that its ability to &lt;a href="https://simonwillison.net/2025/Nov/20/nano-banana-pro/#creating-an-infographic"&gt;render usable infographics&lt;/a&gt; really does represent a task at which previous models had been laughably incapable.&lt;/p&gt;
&lt;p&gt;The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis?&lt;/p&gt;
&lt;p&gt;And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they've fallen one by one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models.&lt;/p&gt;
&lt;p&gt;I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double down on that advice myself!&lt;/p&gt;
&lt;p&gt;I'd love to see AI labs like Anthropic help address this challenge directly. I'd like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle.&lt;/p&gt;
&lt;p&gt;"Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a &lt;em&gt;lot&lt;/em&gt; more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.&lt;/p&gt;
&lt;p id="pelicans"&gt;In the meantime, I'm just gonna have to keep on getting them to draw &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelicans riding bicycles&lt;/a&gt;. Here's Opus 4.5 (on its default &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort"&gt;"high" effort level&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It did significantly better on the &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;new more detailed prompt&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican-advanced.jpg" alt="The pelican has feathers and a red pouch - a close enough version of breeding plumage. The bicycle is a much better shape." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's that same complex prompt &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#advanced-pelican"&gt;against Gemini 3 Pro&lt;/a&gt; and &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#advanced-pelican-codex-max"&gt;against GPT-5.1-Codex-Max-xhigh&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="still-susceptible-to-prompt-injection"&gt;Still susceptible to prompt injection&lt;/h4&gt;
&lt;p&gt;From &lt;a href="https://www.anthropic.com/news/claude-opus-4-5#a-step-forward-on-safety"&gt;the safety section&lt;/a&gt; of Anthropic's announcement post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-opus-4.5-prompt-injection.jpg" alt="Bar chart titled &amp;quot;Susceptibility to prompt-injection style attacks&amp;quot; with subtitle &amp;quot;At k queries; lower is better&amp;quot;. Y-axis shows &amp;quot;ATTACK SUCCESS RATE (%)&amp;quot; from 0-100. Five stacked bars compare AI models with three k values (k=1 in dark gray, k=10 in beige, k=100 in pink). Results: Gemini 3 Pro Thinking (12.5, 60.7, 92.0), GPT-5.1 Thinking (12.6, 58.2, 87.8), Haiku 4.5 Thinking (8.3, 51.1, 85.6), Sonnet 4.5 Thinking (7.3, 41.9, 72.4), Opus 4.5 Thinking (4.7, 33.6, 63.0)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On the one hand this looks great, it's a clear improvement over previous models and the competition.&lt;/p&gt;
&lt;p&gt;What does the chart actually tell us, though? It tells us that a single prompt injection attempt still works about 1 time in 20, and if an attacker can try ten different attacks the success rate goes up to about 1 in 3!&lt;/p&gt;
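&lt;p&gt;A quick back-of-the-envelope check on how attempts compound: if each attempt were an independent draw at the k=1 rate, repeated attacks would succeed even more often than the chart shows. The independence assumption is mine - the measured k=10 figure of 33.6% is lower than the independent-draws estimate, suggesting repeated attacks overlap in what they try:&lt;/p&gt;

```javascript
// Probability of at least one successful prompt injection in k attempts,
// assuming (my assumption) each attempt is an independent draw at the
// reported k=1 success rate for Opus 4.5.
const p = 0.047; // k=1 attack success rate from Anthropic's chart

const atLeastOne = (k) => 1 - (1 - p) ** k;

console.log(atLeastOne(10).toFixed(3));  // ~0.382 under independence
console.log(atLeastOne(100).toFixed(3)); // ~0.992 - near certainty
```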
&lt;p&gt;I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="prompt-injection"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="evals"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="november-2025-inflection"/></entry><entry><title>Building more with GPT-5.1-Codex-Max</title><link href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#atom-tag" rel="alternate"/><published>2025-11-19T23:15:10+00:00</published><updated>2025-11-19T23:15:10+00:00</updated><id>https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-1-codex-max/"&gt;Building more with GPT-5.1-Codex-Max&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Hot on the heels of yesterday's &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3 Pro release&lt;/a&gt; comes a new model from OpenAI called GPT-5.1-Codex-Max.&lt;/p&gt;
&lt;p&gt;(Remember when GPT-5 was meant to bring in a new era of less confusing model names? That didn't last!)&lt;/p&gt;
&lt;p&gt;It's currently only available through their &lt;a href="https://developers.openai.com/codex/cli/"&gt;Codex CLI coding agent&lt;/a&gt;, where it's the new default model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting today, GPT‑5.1-Codex-Max will replace GPT‑5.1-Codex as the default model in Codex surfaces. Unlike GPT‑5.1, which is a general-purpose model, we recommend using GPT‑5.1-Codex-Max and the Codex family of models only for agentic coding tasks in Codex or Codex-like environments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's not available via the API yet but should be shortly.&lt;/p&gt;
&lt;p&gt;The timing of this release is interesting given that Gemini 3 Pro appears to have &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#benchmarks"&gt;aced almost all of the benchmarks&lt;/a&gt; just yesterday. It's reminiscent of the period in 2024 when OpenAI consistently made big announcements that happened to coincide with Gemini releases.&lt;/p&gt;
&lt;p&gt;OpenAI's self-reported &lt;a href="https://openai.com/index/introducing-swe-bench-verified/"&gt;SWE-Bench Verified&lt;/a&gt; score is particularly notable: 76.5% for thinking level "high" and 77.9% for the new "xhigh". That was the one benchmark where Gemini 3 Pro was out-performed by Claude Sonnet 4.5 - Gemini 3 Pro got 76.2% and Sonnet 4.5 got 77.2%. OpenAI now have the highest scoring model there by a full 0.7 of a percentage point!&lt;/p&gt;
&lt;p&gt;They also report a score of 58.1% on &lt;a href="https://www.tbench.ai/leaderboard/terminal-bench/2.0"&gt;Terminal Bench 2.0&lt;/a&gt;, beating Gemini 3 Pro's 54.2% (and Sonnet 4.5's 42.8%.)&lt;/p&gt;
&lt;p&gt;The most intriguing part of this announcement concerns the model's approach to long context problems:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑5.1-Codex-Max is built for long-running, detailed work. It’s our first model natively trained to operate across multiple context windows through a process called &lt;em&gt;compaction&lt;/em&gt;, coherently working over millions of tokens in a single task. [...]&lt;/p&gt;
&lt;p&gt;Compaction enables GPT‑5.1-Codex-Max to complete tasks that would have previously failed due to context-window limits, such as complex refactors and long-running agent loops by pruning its history while preserving the most important context over long horizons. In Codex applications, GPT‑5.1-Codex-Max automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a lot of confusion &lt;a href="https://news.ycombinator.com/item?id=45982649"&gt;on Hacker News&lt;/a&gt; about what this actually means. Claude Code already does a version of compaction, automatically summarizing previous turns when the context runs out. Does this just mean that Codex-Max is better at that process?&lt;/p&gt;
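&lt;p&gt;My mental model of compaction - and this is a guess at the general shape, not a description of what OpenAI actually trained the model to do - is a loop that folds the oldest turns into a summary whenever the transcript approaches the window limit:&lt;/p&gt;

```javascript
// Naive compaction sketch (my assumption about the general idea, not
// OpenAI's mechanism): when the transcript exceeds a token budget,
// repeatedly fold the two oldest entries into one summary entry.
function compact(turns, budget, summarize) {
  const size = (ts) => ts.reduce((n, t) => n + t.tokens, 0);
  while (size(turns) > budget && turns.length > 1) {
    const [a, b, ...rest] = turns;
    turns = [summarize(a, b), ...rest];
  }
  return turns;
}

// Toy summarizer: pretend summaries shrink content 4x.
const summarize = (a, b) => ({
  role: "summary",
  text: `[summary of: ${a.text} | ${b.text}]`,
  tokens: Math.ceil((a.tokens + b.tokens) / 4),
});

const turns = [
  { role: "user", text: "turn 1", tokens: 500 },
  { role: "assistant", text: "turn 2", tokens: 700 },
  { role: "user", text: "turn 3", tokens: 300 },
];
const out = compact(turns, 800, summarize);
console.log(out.length); // 2 - oldest two turns folded into a summary
```

&lt;p&gt;The interesting claim in the announcement is that the model was &lt;em&gt;trained&lt;/em&gt; to work coherently across these boundaries, rather than having compaction bolted on by the harness.&lt;/p&gt;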
&lt;p&gt;I had it draw me a couple of pelicans by typing "Generate an SVG of a pelican riding a bicycle" directly into the Codex CLI tool. Here's thinking level medium:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A flat-style illustration shows a white, round-bodied bird with an orange beak pedaling a red-framed bicycle with thin black wheels along a sandy beach, with a calm blue ocean and clear sky in the background." src="https://static.simonwillison.net/static/2025/codex-max-medium.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's thinking level "xhigh":&lt;/p&gt;
&lt;p&gt;&lt;img alt="A plump white bird with an orange beak and small black eyes crouches low on a blue bicycle with oversized dark wheels, shown racing forward with motion lines against a soft gradient blue sky." src="https://static.simonwillison.net/static/2025/codex-max-xhigh.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I also tried xhigh on my &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;longer pelican test prompt&lt;/a&gt;, which came out like this:&lt;/p&gt;
&lt;p id="advanced-pelican-codex-max"&gt;&lt;img alt="A stylized dark gray bird with layered wings, a yellow head crest, and a long brown beak leans forward in a racing pose on a black-framed bicycle, riding across a glossy blue surface under a pale sky." src="https://static.simonwillison.net/static/2025/codex-breeding-max-xhigh.jpg"&gt;&lt;/p&gt;

&lt;p&gt;Also today: &lt;a href="https://x.com/openai/status/1991266192905179613"&gt;GPT-5.1 Pro is rolling out today to all Pro users&lt;/a&gt;. According to the &lt;a href="https://help.openai.com/en/articles/6825453-chatgpt-release-notes"&gt;ChatGPT release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-5.1 Pro is rolling out today for all ChatGPT Pro users and is available in the model picker. GPT-5 Pro will remain available as a legacy model for 90 days before being retired.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a pretty fast deprecation cycle for the GPT-5 Pro model that was released just three months ago.&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45982649"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="evals"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="gpt-5"/><category term="codex-cli"/><category term="gpt-codex"/><category term="november-2025-inflection"/></entry><entry><title>Introducing GPT-5.1 for developers</title><link href="https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag" rel="alternate"/><published>2025-11-13T23:59:35+00:00</published><updated>2025-11-13T23:59:35+00:00</updated><id>https://simonwillison.net/2025/Nov/13/gpt-51/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-5-1-for-developers/"&gt;Introducing GPT-5.1 for developers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI announced GPT-5.1 yesterday, calling it &lt;a href="https://openai.com/index/gpt-5-1/"&gt;a smarter, more conversational ChatGPT&lt;/a&gt;. Today they've added it to their API.&lt;/p&gt;
&lt;p&gt;We actually got four new models today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1"&gt;gpt-5.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-chat-latest"&gt;gpt-5.1-chat-latest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex"&gt;gpt-5.1-codex&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-5.1-codex-mini"&gt;gpt-5.1-codex-mini&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are a lot of details to absorb here.&lt;/p&gt;
&lt;p&gt;GPT-5.1 introduces a new reasoning effort called "none" (the previous options were minimal, low, medium, and high) - and none is the new default.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This makes the model behave like a non-reasoning model for latency-sensitive use cases, with the high intelligence of GPT‑5.1 and added bonus of performant tool-calling. Relative to GPT‑5 with 'minimal' reasoning, GPT‑5.1 with no reasoning is better at parallel tool calling (which itself increases end-to-end task completion speed), coding tasks, following instructions, and using search tools---and supports &lt;a href="https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses"&gt;web search⁠&lt;/a&gt; in our API platform.&lt;/p&gt;
&lt;/blockquote&gt;
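&lt;p&gt;As a rough sketch, a request opting into the new effort level might look like this - the exact parameter shape is my assumption based on how reasoning effort is usually passed to the Responses API, and the input text is illustrative:&lt;/p&gt;

```python
import json

# Hypothetical Responses API request body using the new "none"
# reasoning effort; the input text here is made up for illustration.
payload = {
    "model": "gpt-5.1",
    "input": "Summarize this support ticket in one sentence.",
    "reasoning": {"effort": "none"},
}
print(json.dumps(payload, indent=2))
```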
&lt;p&gt;When you DO enable thinking you get to benefit from a new feature called "adaptive reasoning":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On straightforward tasks, GPT‑5.1 spends fewer tokens thinking, enabling snappier product experiences and lower token bills. On difficult tasks that require extra thinking, GPT‑5.1 remains persistent, exploring options and checking its work in order to maximize reliability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Another notable new feature for 5.1 is &lt;a href="https://platform.openai.com/docs/guides/prompt-caching#extended-prompt-cache-retention"&gt;extended prompt cache retention&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours. Extended Prompt Caching works by offloading the key/value tensors to GPU-local storage when memory is full, significantly increasing the storage capacity available for caching.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To enable this, set &lt;code&gt;"prompt_cache_retention": "24h"&lt;/code&gt; in the API call. Weirdly there's no price increase involved with this at all. I &lt;a href="https://x.com/simonw/status/1989104422832738305"&gt;asked about that&lt;/a&gt; and OpenAI's Steven Heidel &lt;a href="https://x.com/stevenheidel/status/1989113407149314199"&gt;replied&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;with 24h prompt caching we move the caches from gpu memory to gpu-local storage. that storage is not free, but we made it free since it moves capacity from a limited resource (GPUs) to a more abundant resource (storage). then we can serve more traffic overall!&lt;/p&gt;
&lt;/blockquote&gt;
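&lt;p&gt;Here's a minimal sketch of that option in a request body - the other fields are illustrative, but the &lt;code&gt;prompt_cache_retention&lt;/code&gt; key is the one documented above:&lt;/p&gt;

```python
import json

# Sketch of enabling 24-hour extended prompt cache retention.
# Cache hits require a repeated prefix, so a long, stable
# system-style preamble is the part that benefits most.
payload = {
    "model": "gpt-5.1",
    "input": "Long, stable system preamble followed by the user question",
    "prompt_cache_retention": "24h",
}
print(json.dumps(payload, indent=2))
```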
&lt;p&gt;The most interesting documentation I've seen so far is in the new &lt;a href="https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting_guide"&gt;5.1 cookbook&lt;/a&gt;, which also includes details of the new &lt;code&gt;shell&lt;/code&gt; and &lt;code&gt;apply_patch&lt;/code&gt; built-in tools. The &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/gpt-5/apply_patch.py"&gt;apply_patch.py implementation&lt;/a&gt; is worth a look, especially if you're interested in the advancing state-of-the-art of file editing tools for LLMs.&lt;/p&gt;
&lt;p&gt;I'm still working on &lt;a href="https://github.com/simonw/llm/issues/1300"&gt;integrating the new models into LLM&lt;/a&gt;. The Codex models are Responses-API-only.&lt;/p&gt;
&lt;p&gt;I got this pelican for GPT-5.1 default (no thinking):&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle wheels have no spokes at all, the pelican is laying quite flat on it" src="https://static.simonwillison.net/static/2025/gpt-5.1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;And this one with reasoning effort set to high:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This bicycle has four spokes per wheel, and the pelican is sitting more upright" src="https://static.simonwillison.net/static/2025/gpt-5.1-high-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;These actually feel like a &lt;a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans"&gt;regression from GPT-5&lt;/a&gt; to me. The bicycles have fewer spokes!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-codex"&gt;gpt-codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/november-2025-inflection"&gt;november-2025-inflection&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="gpt-5"/><category term="gpt-codex"/><category term="november-2025-inflection"/></entry></feed>