All recent content

Owned by simonw, visibility: Unlisted

SQL query
-- Selecting from blog_entry
SELECT 
    'entry' AS type, 
    id, 
    created, 
    title, 
    body 
FROM 
    blog_entry

UNION

-- Selecting from blog_blogmark
SELECT 
    'blogmark' AS type, 
    id, 
    created, 
    CONCAT(link_title, ' - ', via_title) AS title, 
    commentary AS body 
FROM 
    blog_blogmark

UNION

-- Selecting from blog_quotation
SELECT 
    'quotation' AS type, 
    id, 
    created, 
    CONCAT(quotation, ' - ', source) AS title, 
    '' AS body -- Assuming there's no separate body for quotations
FROM 
    blog_quotation
ORDER BY created DESC LIMIT 40
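Note that `CONCAT()` is only available in relatively recent SQLite releases (3.44+); the `||` operator is the portable alternative, though it handles NULLs differently (a NULL operand makes the whole result NULL). As a rough sketch, here is how an equivalent query could be run against a local copy of the database with the `sqlite3` CLI - the `blog.db` filename is an assumption:

```bash
# Sketch only: run a close equivalent of the query above against a hypothetical
# local copy (blog.db) using the sqlite3 CLI. Uses || instead of CONCAT() for
# older SQLite versions; unlike CONCAT(), || returns NULL if either side is NULL.
sqlite3 blog.db "
SELECT 'entry' AS type, id, created, title, body
FROM blog_entry
UNION
SELECT 'blogmark' AS type, id, created,
       link_title || ' - ' || via_title AS title,
       commentary AS body
FROM blog_blogmark
UNION
SELECT 'quotation' AS type, id, created,
       quotation || ' - ' || source AS title,
       '' AS body
FROM blog_quotation
ORDER BY created DESC LIMIT 40
"
```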

40 rows

type id created title body
blogmark 9303 2026-02-19 04:48:47+00:00 SWE-bench February 2026 leaderboard update - @KLieret SWE-bench is one of the benchmarks that the labs love to list in their model releases. The official leaderboard is infrequently updated but they just did a full run of it against the current generation of models, which is notable because it's always good to see benchmark results like this that *weren't* self-reported by the labs. The fresh results are for their "Bash Only" benchmark, which runs their [mini-swe-bench](https://github.com/SWE-agent/mini-swe-agent) agent (~9,000 lines of Python, [here are the prompts](https://github.com/SWE-agent/mini-swe-agent/blob/v2.2.1/src/minisweagent/config/benchmarks/swebench.yaml) they use) against the [SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) dataset of coding problems - 2,294 real-world examples pulled from 12 open source repos: [django/django](https://github.com/django/django) (850), [sympy/sympy](https://github.com/sympy/sympy) (386), [scikit-learn/scikit-learn](https://github.com/scikit-learn/scikit-learn) (229), [sphinx-doc/sphinx](https://github.com/sphinx-doc/sphinx) (187), [matplotlib/matplotlib](https://github.com/matplotlib/matplotlib) (184), [pytest-dev/pytest](https://github.com/pytest-dev/pytest) (119), [pydata/xarray](https://github.com/pydata/xarray) (110), [astropy/astropy](https://github.com/astropy/astropy) (95), [pylint-dev/pylint](https://github.com/pylint-dev/pylint) (57), [psf/requests](https://github.com/psf/requests) (44), [mwaskom/seaborn](https://github.com/mwaskom/seaborn) (22), [pallets/flask](https://github.com/pallets/flask) (11). Here's how the top ten models performed: ![Bar chart showing "% Resolved" by "Model". Bars in descending order: Claude 4.5 Opus (high reasoning) 76.8%, Gemini 3 Flash (high reasoning) 75.8%, MiniMax M2.5 (high reasoning) 75.8%, Claude Opus 4.6 75.6%, GLM-5 (high reasoning) 72.8%, GPT-5.2 (high reasoning) 72.8%, Claude 4.5 Sonnet (high reasoning) 72.8%, Kimi K2.5 (high reasoning) 71.4%, DeepSeek V3.2 (high reasoning) 70.8%, Claude 4.5 Haiku (high reasoning) 70.0%, and a partially visible final bar at 66.6%.](https://static.simonwillison.net/static/2026/swbench-feb-2026.jpg) It's interesting to see Claude Opus 4.5 beat Opus 4.6, though only by about a percentage point. 4.5 Opus is top, then Gemini 3 Flash, then MiniMax M2.5 - a 229B model released [last week](https://www.minimax.io/news/minimax-m25) by Chinese lab MiniMax. GLM-5, Kimi K2.5 and DeepSeek V3.2 are three more Chinese models that make the top ten as well. OpenAI's GPT-5.2 is their highest performing model at position 6, but it's worth noting that their best coding model, GPT-5.3-Codex, is not represented - maybe because it's not yet available in the OpenAI API. This benchmark uses the same system prompt for every model, which is important for a fair comparison but does mean that the quality of the different harnesses or optimized prompts is not being measured here. The chart above is a screenshot from the SWE-bench website, but their charts don't include the actual percentage values visible on the bars. I successfully used Claude for Chrome to add these - [transcript here](https://claude.ai/share/81a0c519-c727-4caa-b0d4-0d866375d0da). My prompt sequence included: > Use claude in chrome to open https://www.swebench.com/ > Click on "Compare results" and then select "Select top 10" > See those bar charts? 
I want them to display the percentage on each bar so I can take a better screenshot, modify the page like that I'm impressed at how well this worked - Claude injected custom JavaScript into the page to draw additional labels on top of the existing chart. ![Screenshot of a Claude AI conversation showing browser automation. A thinking step reads "Pivoted strategy to avoid recursion issues with chart labeling >" followed by the message "Good, the chart is back. Now let me carefully add the labels using an inline plugin on the chart instance to avoid the recursion issue." A collapsed "Browser_evaluate" section shows a browser_evaluate tool call with JavaScript code using Chart.js canvas context to draw percentage labels on bars: meta.data.forEach((bar, index) => { const value = dataset.data[index]; if (value !== undefined && value !== null) { ctx.save(); ctx.textAlign = 'center'; ctx.textBaseline = 'bottom'; ctx.fillStyle = '#333'; ctx.font = 'bold 12px sans-serif'; ctx.fillText(value.toFixed(1) + '%', bar.x, bar.y - 5); A pending step reads "Let me take a screenshot to see if it worked." followed by a completed "Done" step, and the message "Let me take a screenshot to check the result."](https://static.simonwillison.net/static/2026/claude-chrome-draw-on-chart.jpg)
blogmark 9302 2026-02-19 01:25:33+00:00 LadybirdBrowser/ladybird: Abandon Swift adoption - Hacker News Back [in August 2024](https://simonwillison.net/2024/Aug/11/ladybird-set-to-adopt-swift/) the Ladybird browser project announced an intention to adopt Swift as their memory-safe language of choice. As of [this commit](https://github.com/LadybirdBrowser/ladybird/commit/e87f889e31afbb5fa32c910603c7f5e781c97afd) it looks like they've changed their mind: > **Everywhere: Abandon Swift adoption** > > After making no progress on this for a very long time, let's acknowledge it's not going anywhere and remove it from the codebase.
blogmark 9301 2026-02-18 17:07:31+00:00 The A.I. Disruption We’ve Been Waiting for Has Arrived - New opinion piece from Paul Ford in the New York Times. Unsurprisingly for a piece by Paul it's packed with quoteworthy snippets, but a few stood out for me in particular. Paul describes the [November moment](https://simonwillison.net/2026/Jan/4/inflection/) that so many other programmers have observed, and highlights Claude Code's ability to revive old side projects: > [Claude Code] was always a helpful coding assistant, but in November it suddenly got much better, and ever since I’ve been knocking off side projects that had sat in folders for a decade or longer. It’s fun to see old ideas come to life, so I keep a steady flow. Maybe it adds up to a half-hour a day of my time, and an hour of Claude’s. > > November was, for me and many others in tech, a great surprise. Before, A.I. coding tools were often useful, but halting and clumsy. Now, the bot can run for a full hour and make whole, designed websites and apps that may be flawed, but credible. I spent an entire session of therapy talking about it. And as the former CEO of a respected consultancy firm (Postlight) he's well positioned to evaluate the potential impact: > When you watch a large language model slice through some horrible, expensive problem — like migrating data from an old platform to a modern one — you feel the earth shifting. I was the chief executive of a software services firm, which made me a professional software cost estimator. When I rebooted my messy personal website a few weeks ago, I realized: I would have paid $25,000 for someone else to do this. When a friend asked me to convert a large, thorny data set, I downloaded it, cleaned it up and made it pretty and easy to explore. In the past I would have charged $350,000. > > That last price is full 2021 retail — it implies a product manager, a designer, two engineers (one senior) and four to six months of design, coding and testing. Plus maintenance. Bespoke software is joltingly expensive. Today, though, when the stars align and my prompts work out, I can do hundreds of thousands of dollars worth of work for fun (fun for me) over weekends and evenings, for the price of the Claude $200-a-month plan. He also neatly captures the inherent community tension involved in exploring this technology: > All of the people I love hate this stuff, and all the people I hate love it. And yet, likely because of the same personality flaws that drew me to technology in the first place, I am annoyingly excited.
quotation 2030 2026-02-18 16:50:07+00:00 LLMs are eating specialty skills. There will be less use of specialist front-end and back-end developers as the LLM-driving skills become more important than the details of platform usage. Will this lead to a greater recognition of the role of [Expert Generalists](https://martinfowler.com/articles/expert-generalist.html)? Or will the ability of LLMs to write lots of code mean they code around the silos rather than eliminating them? - Martin Fowler
blogmark 9300 2026-02-17 23:58:58+00:00 Introducing Claude Sonnet 4.6 - Hacker News Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to [November's Opus 4.5](https://simonwillison.net/2025/Nov/24/claude-opus/) while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus models are $5/$25). Here's [the system card PDF](https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf). Sonnet 4.6 has a "reliable knowledge cutoff" of August 2025, compared to Opus 4.6's May 2025 and Haiku 4.5's February 2025. Both Opus and Sonnet default to 200,000 max input tokens but can stretch to 1 million in beta and at a higher cost. I just released [llm-anthropic 0.24](https://github.com/simonw/llm-anthropic/releases/tag/0.24) with support for both Sonnet 4.6 and Opus 4.6. Claude Code [did most of the work](https://github.com/simonw/llm-anthropic/pull/65) - the new models had a fiddly amount of extra details around adaptive thinking and no longer supporting prefixes, as described [in Anthropic's migration guide](https://platform.claude.com/docs/en/about-claude/models/migration-guide). Here's [what I got](https://gist.github.com/simonw/b185576a95e9321b441f0a4dfc0e297c) from: uvx --with llm-anthropic llm 'Generate an SVG of a pelican riding a bicycle' -m claude-sonnet-4.6 ![The pelican has a jaunty top hat with a red band. There is a string between the upper and lower beaks for some reason. The bicycle frame is warped in the wrong way.](https://static.simonwillison.net/static/2026/pelican-sonnet-4.6.png) The SVG comments include: <!-- Hat (fun accessory) --> I tried a second time and also got a top hat. Sonnet 4.6 apparently loves top hats! For comparison, here's the pelican Opus 4.5 drew me [in November](https://simonwillison.net/2025/Nov/24/claude-opus/): ![The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards. There is also something that looks a bit like an egg on the handlebars.](https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg) And here's Anthropic's current best pelican, drawn by Opus 4.6 [on February 5th](https://simonwillison.net/2026/Feb/5/two-new-models/): ![Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers.](https://static.simonwillison.net/static/2026/opus-4.6-pelican.png) Opus 4.6 produces the best pelican beak/pouch. I do think the top hat from Sonnet 4.6 is a nice touch though.
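Here's a minimal sketch of running the same prompt through both new models using the llm-anthropic plugin mentioned above. The `claude-sonnet-4.6` model ID appears in the post; `claude-opus-4.6` is an assumption - check `llm models` for the actual ID:

```bash
# Sketch: same pelican prompt against both new models via llm-anthropic.
# claude-opus-4.6 is an assumed model ID; verify with `llm models`.
llm install llm-anthropic
llm keys set anthropic   # paste an Anthropic API key when prompted

for model in claude-sonnet-4.6 claude-opus-4.6; do
  # The raw response may wrap the SVG in markdown fences - extract as needed.
  llm -m "$model" 'Generate an SVG of a pelican riding a bicycle' > "pelican-$model.txt"
done
```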
blogmark 9299 2026-02-17 23:02:33+00:00 Rodney v0.4.0 - My [Rodney](https://github.com/simonw/rodney) CLI tool for browser automation attracted quite the flurry of PRs since I announced it [last week](https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#rodney-cli-browser-automation-designed-to-work-with-showboat). Here are the release notes for the just-released v0.4.0:

> - Errors now use exit code 2, which means exit code 1 is just for check failures. [#15](https://github.com/simonw/rodney/pull/15)
> - New `rodney assert` command for running JavaScript tests, exit code 1 if they fail. [#19](https://github.com/simonw/rodney/issues/19)
> - New directory-scoped sessions with `--local`/`--global` flags. [#14](https://github.com/simonw/rodney/pull/14)
> - New `reload --hard` and `clear-cache` commands. [#17](https://github.com/simonw/rodney/pull/17)
> - New `rodney start --show` option to make the browser window visible. Thanks, [Antonio Cuni](https://github.com/antocuni). [#13](https://github.com/simonw/rodney/pull/13)
> - New `rodney connect PORT` command to debug an already-running Chrome instance. Thanks, [Peter Fraenkel](https://github.com/pnf). [#12](https://github.com/simonw/rodney/pull/12)
> - New `RODNEY_HOME` environment variable to support custom state directories. Thanks, [Senko Rašić](https://github.com/senko). [#11](https://github.com/simonw/rodney/pull/11)
> - New `--insecure` flag to ignore certificate errors. Thanks, [Jakub Zgoliński](https://github.com/zgolus). [#10](https://github.com/simonw/rodney/pull/10)
> - Windows support: avoid `Setsid` on Windows via build-tag helpers. Thanks, [adm1neca](https://github.com/adm1neca). [#18](https://github.com/simonw/rodney/pull/18)
> - Tests now run on `windows-latest` and `macos-latest` in addition to Linux.

I've been using [Showboat](https://github.com/simonw/showboat) to create demos of new features - here those are for [rodney assert](https://github.com/simonw/rodney/tree/v0.4.0/notes/assert-command-demo), [rodney reload --hard](https://github.com/simonw/rodney/tree/v0.4.0/notes/clear-cache-demo), [rodney exit codes](https://github.com/simonw/rodney/tree/v0.4.0/notes/error-codes-demo), and [rodney start --local](https://github.com/simonw/rodney/tree/v0.4.0/notes/local-sessions-demo).
The `rodney assert` command is pretty neat: you can now use Rodney to test a web app through multiple steps in a shell script that looks something like this (adapted from [the README](https://github.com/simonw/rodney/blob/v0.4.0/README.md#combining-checks-in-a-shell-script)):

<div class="highlight highlight-source-shell"><pre>#!/bin/bash
set -euo pipefail

FAIL=0
check() {
  if ! "$@"; then
    echo "FAIL: $*"
    FAIL=1
  fi
}

rodney start
rodney open "https://example.com"
rodney waitstable

# Assert elements exist
check rodney exists "h1"

# Assert key elements are visible
check rodney visible "h1"
check rodney visible "#main-content"

# Assert JS expressions
check rodney assert 'document.title' 'Example Domain'
check rodney assert 'document.querySelectorAll("p").length' '2'

# Assert accessibility requirements
check rodney ax-find --role navigation

rodney stop

if [ "$FAIL" -ne 0 ]; then
  echo "Some checks failed"
  exit 1
fi
echo "All checks passed"</pre></div>
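Here's a quick sketch of the new exit-code convention from the release notes above, using only commands that appear in that script - a failed check should exit 1, while errors (such as running a command with no browser session) are reserved for exit code 2:

```bash
# Sketch: observe the exit codes described in the v0.4.0 release notes.
rodney start
rodney open "https://example.com"

rodney exists "h1"; echo "exit: $?"                # expect 0 - element found
rodney exists "#no-such-element"; echo "exit: $?"  # expect 1 - check failed

rodney stop
rodney exists "h1"; echo "exit: $?"                # expect 2 - error, no session (assumption)
```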
quotation 2029 2026-02-17 14:49:04+00:00 This is the story of the United Space Ship Enterprise. Assigned a five year patrol of our galaxy, the giant starship visits Earth colonies, regulates commerce, and explores strange new worlds and civilizations. These are its voyages... and its adventures. - ROUGH DRAFT 8/2/66
blogmark 9298 2026-02-17 14:09:43+00:00 First kākāpō chick in four years hatches on Valentine's Day - MetaFilter First chick of [the 2026 breeding season](https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#1-year-k-k-p-parrots-will-have-an-outstanding-breeding-season)! > Kākāpō Yasmine hatched an egg fostered from kākāpō Tīwhiri on Valentine's Day, bringing the total number of kākāpō to 237 – though it won’t be officially added to the population until it fledges. Here's why the egg was fostered: > "Kākāpō mums typically have the best outcomes when raising a maximum of two chicks. Biological mum Tīwhiri has four fertile eggs this season already, while Yasmine, an experienced foster mum, had no fertile eggs." And an [update from conservation biologist Andrew Digby](https://bsky.app/profile/digs.bsky.social/post/3mf25glzt2c2b) - a second chick hatched this morning! > The second #kakapo chick of the #kakapo2026 breeding season hatched this morning: Hine Taumai-A1-2026 on Ako's nest on Te Kākahu. We transferred the egg from Anchor two nights ago. This is Ako's first-ever chick, which is just a few hours old in this video. That post [has a video](https://bsky.app/profile/digs.bsky.social/post/3mf25glzt2c2b) of mother and chick. ![A beautiful charismatic green Kākāpō feeding a little grey chick](https://static.simonwillison.net/static/2026/kakapo-plus-chick.jpg)
quotation 2028 2026-02-17 14:04:44+00:00 But the intellectually interesting part for me is something else. **I now have something close to a magic box where I throw in a question and a first answer comes back basically for free, in terms of human effort**. Before this, the way I'd explore a new idea is to either clumsily put something together myself or ask a student to run something short for signal, and if it's there, we’d go deeper. That quick signal step, i.e., finding out if a question has any meat to it, is what I can now do without taking up anyone else's time. It’s now between just me, Claude Code, and a few days of GPU time. I don’t know what this means for how we do research long term. I don’t think anyone does yet. But **the distance between a question and a first answer just got very small**. - Dimitris Papailiopoulos
blogmark 9297 2026-02-17 04:30:57+00:00 Qwen3.5: Towards Native Multimodal Agents - Alibaba's Qwen just released the first two models in the Qwen 3.5 series - one open weights, one proprietary. Both are multi-modal for vision input. The open weight one is a Mixture of Experts model called Qwen3.5-397B-A17B. Interesting to see Qwen call out serving efficiency as a benefit of that architecture: > Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. It's [807GB on Hugging Face](https://huggingface.co/Qwen/Qwen3.5-397B-A17B), and Unsloth have a [collection of smaller GGUFs](https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF) ranging in size from 94.2GB 1-bit to 462GB Q8_K_XL. I got this [pelican](https://simonwillison.net/tags/pelican-riding-a-bicycle/) from the [OpenRouter hosted model](https://openrouter.ai/qwen/qwen3.5-397b-a17b) ([transcript](https://gist.github.com/simonw/625546cf6b371f9c0040e64492943b82)): ![Pelican is quite good although the neck lacks an outline for some reason. Bicycle is very basic with an incomplete frame](https://static.simonwillison.net/static/2026/qwen3.5-397b-a17b.png) The proprietary hosted model is called Qwen3.5 Plus 2026-02-15, and is a little confusing. Qwen researcher [Junyang Lin says](https://twitter.com/JustinLin610/status/2023340126479569140): > Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens, Qwen3.5-Plus supports 1M token context length. Additionally it supports search and code interpreter, which you can use on Qwen Chat with Auto mode. Here's [its pelican](https://gist.github.com/simonw/9507dd47483f78dc1195117735273e20), which is similar in quality to the open weights model: ![Similar quality pelican. The bicycle is taller and has a better frame shape. They are visually quite similar.](https://static.simonwillison.net/static/2026/qwen3.5-plus-02-15.png)
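For reference, here's a sketch of reproducing that OpenRouter test with the llm CLI and its llm-openrouter plugin. The `openrouter/qwen/qwen3.5-397b-a17b` model ID is an assumption derived from the OpenRouter URL above - check `llm models` after installing the plugin:

```bash
# Sketch: run the pelican prompt against the OpenRouter-hosted model.
# The model ID is inferred from https://openrouter.ai/qwen/qwen3.5-397b-a17b
# and may differ - list available IDs with `llm models`.
llm install llm-openrouter
llm keys set openrouter
llm -m openrouter/qwen/qwen3.5-397b-a17b \
  'Generate an SVG of a pelican riding a bicycle'
```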
entry 9140 2026-02-17 00:43:45+00:00 Two new Showboat tools: Chartroom and datasette-showboat <p>I <a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/">introduced Showboat</a> a week ago - my CLI tool that helps coding agents create Markdown documents that demonstrate the code that they have created. I've been finding new ways to use it on a daily basis, and I've just released two new tools to help get the best out of the Showboat pattern. <a href="https://github.com/simonw/chartroom">Chartroom</a> is a CLI charting tool that works well with Showboat, and <a href="https://github.com/simonw/datasette-showboat">datasette-showboat</a> lets Showboat's new remote publishing feature incrementally push documents to a Datasette instance.</p> <ul> <li><a href="https://simonwillison.net/2026/Feb/17/chartroom-and-datasette-showboat/#showboat-remote-publishing">Showboat remote publishing</a></li> <li><a href="https://simonwillison.net/2026/Feb/17/chartroom-and-datasette-showboat/#datasette-showboat">datasette-showboat</a></li> <li><a href="https://simonwillison.net/2026/Feb/17/chartroom-and-datasette-showboat/#chartroom">Chartroom</a></li> <li><a href="https://simonwillison.net/2026/Feb/17/chartroom-and-datasette-showboat/#how-i-built-chartroom">How I built Chartroom</a></li> <li><a href="https://simonwillison.net/2026/Feb/17/chartroom-and-datasette-showboat/#the-burgeoning-showboat-ecosystem">The burgeoning Showboat ecosystem</a></li> </ul> <h4 id="showboat-remote-publishing">Showboat remote publishing</h4> <p>I normally use Showboat in Claude Code for web (see <a href="https://simonwillison.net/2026/Feb/16/rodney-claude-code/">note from this morning</a>). I've used it in several different projects in the past few days, each of them with a prompt that looks something like this:</p> <blockquote> <p><code>Use "uvx showboat --help" to perform a very thorough investigation of what happens if you use the Python sqlite-chronicle and sqlite-history-json libraries against the same SQLite database table</code></p> </blockquote> <p>Here's <a href="https://github.com/simonw/research/blob/main/sqlite-chronicle-vs-history-json/demo.md">the resulting document</a>.</p> <p>Just telling Claude Code to run <code>uvx showboat --help</code> is enough for it to learn how to use the tool - the <a href="https://github.com/simonw/showboat/blob/main/help.txt">help text</a> is designed to work as a sort of ad-hoc Skill document.</p> <p>The one catch with this approach is that I can't <em>see</em> the new Showboat document until it's finished. I have to wait for Claude to commit the document plus embedded screenshots and push that to a branch in my GitHub repo - then I can view it through the GitHub interface.</p> <p>For a while I've been thinking it would be neat to have a remote web server of my own which Claude instances can submit updates to while they are working. Then this morning I realized Showboat might be the ideal mechanism to set that up...</p> <p>Showboat <a href="https://github.com/simonw/showboat/releases/tag/v0.6.0">v0.6.0</a> adds a new "remote" feature. 
It's almost invisible to users of the tool itself, instead being configured by an environment variable.</p> <p>Set a variable like this:</p> <div class="highlight highlight-source-shell"><pre><span class="pl-k">export</span> SHOWBOAT_REMOTE_URL=https://www.example.com/submit<span class="pl-k">?</span>token=xyz</pre></div> <p>And every time you run a <code>showboat init</code> or <code>showboat note</code> or <code>showboat exec</code> or <code>showboat image</code> command the resulting document fragments will be POSTed to that API endpoint, in addition to the Showboat Markdown file itself being updated.</p> <p>There are <a href="https://github.com/simonw/showboat/blob/v0.6.0/README.md#remote-document-streaming">full details in the Showboat README</a> - it's a very simple API format, using regular POST form variables or a multipart form upload for the image attached to <code>showboat image</code>.</p> <h4 id="datasette-showboat">datasette-showboat</h4> <p>It's simple enough to build a webapp to receive these updates from Showboat, but I needed one that I could easily deploy and would work well with the rest of my personal ecosystem.</p> <p>So I had Claude Code write me a Datasette plugin that could act as a Showboat remote endpoint. I actually had this building at the same time as the Showboat remote feature, a neat example of running <a href="https://simonwillison.net/2025/Oct/5/parallel-coding-agents/">parallel agents</a>.</p> <p><strong><a href="https://github.com/simonw/datasette-showboat">datasette-showboat</a></strong> is a Datasette plugin that adds a <code>/-/showboat</code> endpoint to Datasette for viewing documents and a <code>/-/showboat/receive</code> endpoint for receiving updates from Showboat.</p> <p>Here's a very quick way to try it out:</p> <div class="highlight highlight-source-shell"><pre>uvx --with datasette-showboat --prerelease=allow \ datasette showboat.db --create \ -s plugins.datasette-showboat.database showboat \ -s plugins.datasette-showboat.token secret123 \ --root --secret cookie-secret-123</pre></div> <p>Click on the sign in as root link that shows up in the console, then navigate to <a href="http://127.0.0.1:8001/-/showboat">http://127.0.0.1:8001/-/showboat</a> to see the interface.</p> <p>Now set your environment variable to point to this instance:</p> <div class="highlight highlight-source-shell"><pre><span class="pl-k">export</span> SHOWBOAT_REMOTE_URL=<span class="pl-s"><span class="pl-pds">"</span>http://127.0.0.1:8001/-/showboat/receive?token=secret123<span class="pl-pds">"</span></span></pre></div> <p>And run Showboat like this:</p> <div class="highlight highlight-source-shell"><pre>uvx showboat init demo.md <span class="pl-s"><span class="pl-pds">"</span>Showboat Feature Demo<span class="pl-pds">"</span></span></pre></div> <p>Refresh that page and you should see this:</p> <p><img src="https://static.simonwillison.net/static/2026/datasette-showboat-documents.jpg" alt="Title: Showboat. Remote viewer for Showboat documents. Showboat Feature Demo 2026-02-17 00:06 · 6 chunks, UUID. 
To send showboat output to this server, set the SHOWBOAT_REMOTE_URL environment variable: export SHOWBOAT_REMOTE_URL=&quot;http://127.0.0.1:8001/-/showboat/receive?token=your-token&quot;" style="max-width: 100%;" /></p> <p>Click through to the document, then start Claude Code or Codex or your agent of choice and prompt:</p> <blockquote> <p><code>Run 'uvx showboat --help' and then use showboat to add to the existing demo.md document with notes and exec and image to demonstrate the tool - fetch a placekitten for the image demo.</code></p> </blockquote> <p>The <code>init</code> command assigns a UUID and title and sends those up to Datasette.</p> <p><img src="https://static.simonwillison.net/static/2026/datasette-showboat.gif" alt="Animated demo - in the foreground a terminal window runs Claude Code, which executes various Showboat commands. In the background a Firefox window where the Showboat Feature Demo adds notes then some bash commands, then a placekitten image." style="max-width: 100%;" /></p> <p>The best part of this is that it works in Claude Code for web. Run the plugin on a server somewhere (an exercise left up to the reader - I use <a href="https://fly.io/">Fly.io</a> to host mine) and set that <code>SHOWBOAT_REMOTE_URL</code> environment variable in your Claude environment, then any time you tell it to use Showboat the document it creates will be transmitted to your server and viewable in real time.</p> <p>I built <a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#rodney-cli-browser-automation-designed-to-work-with-showboat">Rodney</a>, a CLI browser automation tool, specifically to work with Showboat. It makes it easy to have a Showboat document load up web pages, interact with them via clicks or injected JavaScript and captures screenshots to embed in the Showboat document and show the effects.</p> <p>This is wildly useful for hacking on web interfaces using Claude Code for web, especially when coupled with the new remote publishing feature. I only got this stuff working this morning and I've already had several sessions where Claude Code has published screenshots of its work in progress, which I've then been able to provide feedback on directly in the Claude session while it's still working.</p> <h3 id="chartroom">Chartroom</h3> <p>A few days ago I had another idea for a way to extend the Showboat ecosystem: what if Showboat documents could easily include charts?</p> <p>I sometimes fire up Claude Code for data analysis tasks, often telling it to download a SQLite database and then run queries against it to figure out interesting things from the data.</p> <p>With a simple CLI tool that produced PNG images I could have Claude use Showboat to build a document with embedded charts to help illustrate its findings.</p> <p><strong><a href="https://github.com/simonw/chartroom">Chartroom</a></strong> is exactly that. 
It's effectively a thin wrapper around the excellent <a href="https://matplotlib.org/">matplotlib</a> Python library, designed to be used by coding agents to create charts that can be embedded in Showboat documents.</p> <p>Here's how to render a simple bar chart:</p> <div class="highlight highlight-source-shell"><pre><span class="pl-c1">echo</span> <span class="pl-s"><span class="pl-pds">'</span>name,value</span> <span class="pl-s">Alice,42</span> <span class="pl-s">Bob,28</span> <span class="pl-s">Charlie,35</span> <span class="pl-s">Diana,51</span> <span class="pl-s">Eve,19<span class="pl-pds">'</span></span> <span class="pl-k">|</span> uvx chartroom bar --csv \ --title <span class="pl-s"><span class="pl-pds">'</span>Sales by Person<span class="pl-pds">'</span></span> --ylabel <span class="pl-s"><span class="pl-pds">'</span>Sales<span class="pl-pds">'</span></span></pre></div> <p><a target="_blank" rel="noopener noreferrer nofollow" href="https://raw.githubusercontent.com/simonw/chartroom/8812afc02e1310e9eddbb56508b06005ff2c0ed5/demo/1f6851ec-2026-02-14.png"><img src="https://raw.githubusercontent.com/simonw/chartroom/8812afc02e1310e9eddbb56508b06005ff2c0ed5/demo/1f6851ec-2026-02-14.png" alt="A chart of those numbers, with a title and y-axis label" style="max-width: 100%;" /></a></p> <p>It can also do line charts, bar charts, scatter charts, and histograms - as seen in <a href="https://github.com/simonw/chartroom/blob/0.2.1/demo/README.md">this demo document</a> that was built using Showboat.</p> <p>Chartroom can also generate alt text. If you add <code>-f alt</code> to the above it will output the alt text for the chart instead of the image:</p> <div class="highlight highlight-source-shell"><pre><span class="pl-c1">echo</span> <span class="pl-s"><span class="pl-pds">'</span>name,value</span> <span class="pl-s">Alice,42</span> <span class="pl-s">Bob,28</span> <span class="pl-s">Charlie,35</span> <span class="pl-s">Diana,51</span> <span class="pl-s">Eve,19<span class="pl-pds">'</span></span> <span class="pl-k">|</span> uvx chartroom bar --csv \ --title <span class="pl-s"><span class="pl-pds">'</span>Sales by Person<span class="pl-pds">'</span></span> --ylabel <span class="pl-s"><span class="pl-pds">'</span>Sales<span class="pl-pds">'</span></span> -f alt</pre></div> <p>Outputs:</p> <pre><code>Sales by Person. Bar chart of value by name — Alice: 42, Bob: 28, Charlie: 35, Diana: 51, Eve: 19 </code></pre> <p>Or you can use <code>-f html</code> or <code>-f markdown</code> to get the image tag with alt text directly:</p> <div class="highlight highlight-text-md"><pre><span class="pl-s">![</span>Sales by Person. Bar chart of value by name — Alice: 42, Bob: 28, Charlie: 35, Diana: 51, Eve: 19<span class="pl-s">]</span><span class="pl-s">(</span><span class="pl-corl">/Users/simon/chart-7.png</span><span class="pl-s">)</span></pre></div> <p>I added support for Markdown images with alt text to Showboat in <a href="https://github.com/simonw/showboat/releases/tag/v0.5.0">v0.5.0</a>, to complement this feature of Chartroom.</p> <p>Finally, Chartroom has support for different <a href="https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html">matplotlib styles</a>. 
I had Claude build a Showboat document to demonstrate these all in one place - you can see that at <a href="https://github.com/simonw/chartroom/blob/main/demo/styles.md">demo/styles.md</a>.</p> <h4 id="how-i-built-chartroom">How I built Chartroom</h4> <p>I started the Chartroom repository with my <a href="https://github.com/simonw/click-app">click-app</a> cookiecutter template, then told a fresh Claude Code for web session:</p> <blockquote> <p>We are building a Python CLI tool which uses matplotlib to generate a PNG image containing a chart. It will have multiple sub commands for different chart types, controlled by command line options. Everything you need to know to use it will be available in the single "chartroom --help" output.</p> <p>It will accept data from files or standard input as CSV or TSV or JSON, similar to how sqlite-utils accepts data - clone simonw/sqlite-utils to /tmp for reference there. Clone matplotlib/matplotlib for reference as well</p> <p>It will also accept data from --sql path/to/sqlite.db "select ..." which runs in read-only mode</p> <p>Start by asking clarifying questions - do not use the ask user tool though it is broken - and generate a spec for me to approve</p> <p>Once approved proceed using red/green TDD running tests with "uv run pytest"</p> <p>Also while building maintain a demo/README.md document using the "uvx showboat --help" tool - each time you get a new chart type working commit the tests, implementation, root level README update and a new version of that demo/README.md document with an inline image demo of the new chart type (which should be a UUID image filename managed by the showboat image command and should be stored in the demo/ folder</p> <p>Make sure "uv build" runs cleanly without complaining about extra directories but also ensure dist/ and uv.lock are in gitignore</p> </blockquote> <p>This got most of the work done. You can see the rest <a href="https://github.com/simonw/chartroom/pulls?q=is%3Apr+is%3Aclosed">in the PRs</a> that followed.</p> <h4 id="the-burgeoning-showboat-ecosystem">The burgeoning Showboat ecosystem</h4> <p>The Showboat family of tools now consists of <a href="https://github.com/simonw/showboat">Showboat</a> itself, <a href="https://github.com/simonw/rodney">Rodney</a> for browser automation, <a href="https://github.com/simonw/chartroom">Chartroom</a> for charting and <a href="https://github.com/simonw/datasette-showboat">datasette-showboat</a> for streaming remote Showboat documents to Datasette.</p> <p>I'm enjoying how these tools can operate together based on a very loose set of conventions. If a tool can output a path to an image Showboat can include that image in a document. Any tool that can output text can be used with Showboat.</p> <p>I'll almost certainly be building more tools that fit this pattern. They're very quick to knock out!</p> <p>The environment variable mechanism for Showboat's remote streaming is a fun hack too - so far I'm just using it to stream documents somewhere else, but it's effectively a webhook extension mechanism that could likely be used for all sorts of things I haven't thought of yet.</p>
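One Chartroom input mode described in the spec prompt above but not demonstrated is `--sql`, which reads rows from a SQLite database in read-only mode. Here's a rough sketch of what that might look like - the exact argument shape follows the spec prompt and could differ in the released tool, so check `uvx chartroom --help` first; `sales.db` and the query are placeholders:

```bash
# Sketch: chart a SQLite query result directly. The --sql argument shape is
# taken from the spec prompt ("--sql path/to/sqlite.db \"select ...\"") and is
# an assumption about the released CLI; sales.db is a placeholder database.
uvx chartroom bar \
  --sql sales.db "select name, value from sales order by value desc" \
  --title 'Sales by Person' --ylabel 'Sales' \
  -f markdown
```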
blogmark 9296 2026-02-15 23:59:36+00:00 The AI Vampire - Tim Bray Steve Yegge's take on agent fatigue, and its relationship to burnout. > Let's pretend you're the only person at your company using AI. > > In Scenario A, you decide you're going to impress your employer, and work for 8 hours a day at 10x productivity. You knock it out of the park and make everyone else look terrible by comparison. > > In that scenario, your employer captures 100% of the value from *you* adopting AI. You get nothing, or at any rate, it ain't gonna be 9x your salary. And everyone hates you now. > > And you're *exhausted.* You're tired, Boss. You got nothing for it. > > Congrats, you were just drained by a company. I've been drained to the point of burnout several times in my career, even at Google once or twice. But now with AI, it's oh, so much easier. Steve reports needing more sleep due to the cognitive burden involved in agentic engineering, and notes that four hours of agent work a day is a more realistic pace: > I’ve argued that AI has turned us all into Jeff Bezos, by automating the easy work, and leaving us with all the difficult decisions, summaries, and problem-solving. I find that I am only really comfortable working at that pace for short bursts of a few hours once or occasionally twice a day, even with lots of practice.
entry 9139 2026-02-15 21:06:44+00:00 Deep Blue <p>We coined a new term on the <a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/">Oxide and Friends podcast</a> last month (primary credit to Adam Leventhal) covering the sense of psychological ennui leading into existential dread that many software developers are feeling thanks to the encroachment of generative AI into their field of work.</p> <p>We're calling it <strong>Deep Blue</strong>.</p> <p>You can listen to it being coined in real time <a href="https://www.youtube.com/watch?v=lVDhQMiAbR8&amp;t=2835s">from 47:15 in the episode</a>. I've included <a href="https://simonwillison.net/2026/Feb/15/deep-blue/#transcript">a transcript below</a>.</p> <p>Deep Blue is a very real issue.</p> <p>Becoming a professional software engineer is <em>hard</em>. Getting good enough for people to pay you money to write software takes years of dedicated work. The rewards are significant: this is a well compensated career which opens up a lot of great opportunities.</p> <p>It's also a career that's mostly free from gatekeepers and expensive prerequisites. You don't need an expensive degree or accreditation. A laptop, an internet connection and a lot of time and curiosity is enough to get you started.</p> <p>And it rewards the nerds! Spending your teenage years tinkering with computers turned out to be a very smart investment in your future.</p> <p>The idea that this could all be stripped away by a chatbot is <em>deeply</em> upsetting.</p> <p>I've seen signs of Deep Blue in most of the online communities I spend time in. I've even faced accusations from my peers that I am actively harming their future careers through my work helping people understand how well AI-assisted programming can work.</p> <p>I think this is an issue which is causing genuine mental anguish for a lot of people in our community. Giving it a name makes it easier for us to have conversations about it.</p> <h4 id="my-experiences-of-deep-blue">My experiences of Deep Blue</h4> <p>I distinctly remember my first experience of Deep Blue. For me it was triggered by ChatGPT Code Interpreter back in early 2023.</p> <p>My primary project is <a href="https://datasette.io/">Datasette</a>, an ecosystem of open source tools for telling stories with data. I had dedicated myself to the challenge of helping people (initially focusing on journalists) clean up, analyze and find meaning in data, in all sorts of shapes and sizes.</p> <p>I expected I would need to build a lot of software for this! It felt like a challenge that could keep me happily engaged for many years to come.</p> <p>Then I tried uploading a CSV file of <a href="https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783/about_data">San Francisco Police Department Incident Reports</a> - hundreds of thousands of rows - to ChatGPT Code Interpreter and... it did every piece of data cleanup and analysis I had on my napkin roadmap for the next few years with a couple of prompts.</p> <p>It even converted the data into a neatly normalized SQLite database and let me download the result!</p> <p>I remember having two competing thoughts in parallel.</p> <p>On the one hand, as somebody who wants journalists to be able to do more with data, this felt like a <em>huge</em> breakthrough. Imagine giving every journalist in the world an on-demand analyst who could help them tackle any data question they could think of!</p> <p>But on the other hand... <em>what was I even for</em>? 
My confidence in the value of my own projects took a painful hit. Was the path I'd chosen for myself suddenly a dead end?</p> <p>I've had some further pangs of Deep Blue just in the past few weeks, thanks to the Claude Opus 4.5/4.6 and GPT-5.2/5.3 coding agent effect. As many other people are also observing, the latest generation of coding agents, given the right prompts, really can churn away for a few minutes to several hours and produce working, documented and fully tested software that exactly matches the criteria they were given.</p> <p>"The code they write isn't any good" doesn't really cut it any more.</p> <h4 id="transcript">A lightly edited transcript</h4> <blockquote> <p><strong>Bryan</strong>: I think that we're going to see a real problem with AI induced ennui where software engineers in particular get listless because the AI can do anything. Simon, what do you think about that?</p> <p><strong>Simon</strong>: Definitely. Anyone who's paying close attention to coding agents is feeling some of that already. There's an extent where you sort of get over it when you realize that you're still useful, even though your ability to memorize the syntax of program languages is completely irrelevant now.</p> <p>Something I see a lot of is people out there who are having existential crises and are very, very unhappy because they're like, "I dedicated my career to learning this thing and now it just does it. What am I even for?". I will very happily try and convince those people that they are for a whole bunch of things and that none of that experience they've accumulated has gone to waste, but psychologically it's a difficult time for software engineers.</p> <p>[...]</p> <p><strong>Bryan</strong>: Okay, so I'm going to predict that we name that. Whatever that is, we have a name for that kind of feeling and that kind of, whether you want to call it a blueness or a loss of purpose, and that we're kind of trying to address it collectively in a directed way.</p> <p><strong>Adam</strong>: Okay, this is your big moment. Pick the name. If you call your shot from here, this is you pointing to the stands. You know, I – Like deep blue, you know.</p> <p><strong>Bryan</strong>: Yeah, deep blue. I like that. I like deep blue. Deep blue. Oh, did you walk me into that, you bastard? You just blew out the candles on my birthday cake.</p> <p>It wasn't my big moment at all. That was your big moment. No, that is, Adam, that is very good. That is deep blue.</p> <p><strong>Simon</strong>: All of the chess players and the Go players went through this a decade ago and they have come out stronger.</p> </blockquote> <p>Turns out it was more than a decade ago: <a href="https://en.wikipedia.org/wiki/Deep_Blue_versus_Garry_Kasparov">Deep Blue defeated Garry Kasparov in 1997</a>.</p>
blogmark 9295 2026-02-15 18:26:08+00:00 Gwtar: a static efficient single-file HTML format - Hacker News Fascinating new project from Gwern Branwen and Said Achmiz that targets the challenge of combining large numbers of assets into a single archived HTML file without that file being inconvenient to view in a browser. The key trick it uses is to fire [window.stop()](https://developer.mozilla.org/en-US/docs/Web/API/Window/stop) early in the page to prevent the browser from downloading the whole thing, then following that call with inline tar uncompressed content. It can then make HTTP range requests to fetch content from that tar data on-demand when it is needed by the page. The JavaScript that has already loaded rewrites asset URLs to point to `https://localhost/` purely so that they will fail to load. Then it uses a [PerformanceObserver](https://developer.mozilla.org/en-US/docs/Web/API/PerformanceObserver) to catch those attempted loads: let perfObserver = new PerformanceObserver((entryList, observer) => { resourceURLStringsHandler(entryList.getEntries().map(entry => entry.name)); }); perfObserver.observe({ entryTypes: [ "resource" ] }); That `resourceURLStringsHandler` callback finds the resource if it is already loaded or fetches it with an HTTP range request otherwise and then inserts the resource in the right place using a `blob:` URL. Here's what the `window.stop()` portion of the document looks like if you view the source: ![Screenshot of a macOS terminal window titled "gw — more big.html — 123×46" showing the source code of a gwtar (self-extracting HTML archive) file. The visible code includes JavaScript with `requestIdleCallback(getMainPageHTML);`, a ` noscript ` block with warnings: a "js-disabled-warning" stating "This HTML page requires JavaScript to be enabled to render, as it is a self-extracting gwtar HTML file," a description of gwtar as "a portable self-contained standalone HTML file which is designed to nevertheless support efficient lazy loading of all assets such as large media files," with a link to https://gwern.net/gwtar, a "local-file-warning" with a shell command `perl -ne'print $_ if $x; $x=1 if /<!-- GWTAR END/' &lt; foo.gwtar.html | tar --extract`, and a "server-fail-warning" about misconfigured servers. Below the HTML closing tags and `<!-- GWTAR END` comment is binary tar archive data with the filename `2010-02-brianmoriarty-thesecretofpsalm46.html`, showing null-padded tar header fields including `ustar^@00root` and octal size/permission values. At the bottom, a SingleFile metadata comment shows `url: https://web.archive.org/web/20230512001411/http://ludix.com/moriarty/psalm46.html` and `saved date: Sat Jan 17 2026 19:26:49 GMT-0800 (Pacific Standard Time)`.](https://static.simonwillison.net/static/2026/gwtar.jpg) Amusingly for an archive format it doesn't actually work if you open the file directly on your own computer. Here's what you see if you try to do that: > You are seeing this message, instead of the page you should be seeing, because `gwtar` files **cannot be opened locally** (due to web browser security restrictions). > > To open this page on your computer, use the following shell command: > > `perl -ne'print $_ if $x; $x=1 if /<!-- GWTAR END/' < foo.gwtar.html | tar --extract` > > Then open the file `foo.html` in any web browser.
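The lazy loading described above is just standard HTTP range requests against the archive file itself. Here's a minimal sketch of that mechanism with curl - the URL and byte offsets are placeholders, since the real offsets come from the tar headers embedded after the `GWTAR END` marker:

```bash
# Sketch: fetch a single byte range from a hypothetical .gwtar.html archive,
# the same mechanism gwtar's JavaScript uses to pull assets on demand.
curl -s -H "Range: bytes=1048576-1050623" \
  "https://example.com/big.gwtar.html" -o asset-slice.bin

# curl's -r / --range flag is shorthand for the same header:
curl -s -r 1048576-1050623 "https://example.com/big.gwtar.html" -o asset-slice.bin
```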
quotation 2027 2026-02-15 13:36:20+00:00 I saw yet another “CSS is a massively bloated mess” whine and I’m like. My dude. My brother in Chromium. It is trying as hard as it can to express the totality of visual presentation and layout design and typography and animation and digital interactivity and a few other things in a human-readable text format. It’s not bloated, it’s fantastically ambitious. Its reach is greater than most of us can hope to grasp. Put some *respect* on its *name*. - Eric Meyer
blogmark 9294 2026-02-15 05:20:11+00:00 How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt - Martin Fowler This piece by Margaret-Anne Storey is the best explanation of the term **cognitive debt** I've seen so far. > *Cognitive debt*, a term gaining [traction](https://www.media.mit.edu/publications/your-brain-on-chatgpt/) recently, instead communicates the notion that the debt compounded from going fast lives in the brains of the developers and affects their lived experiences and abilities to “go fast” or to make changes. Even if AI agents produce code that could be easy to understand, the humans involved may have simply lost the plot and may not understand what the program is supposed to do, how their intentions were implemented, or how to possibly change it. Margaret-Anne expands on this further with an anecdote about a student team she coached: > But by weeks 7 or 8, one team hit a wall. They could no longer make even simple changes without breaking something unexpected. When I met with them, the team initially blamed technical debt: messy code, poor architecture, hurried implementations. But as we dug deeper, the real problem emerged: no one on the team could explain why certain design decisions had been made or how different parts of the system were supposed to work together. The code might have been messy, but the bigger issue was that the theory of the system, their shared understanding, had fragmented or disappeared entirely. They had accumulated cognitive debt faster than technical debt, and it paralyzed them. I've experienced this myself on some of my more ambitious vibe-code-adjacent projects. I've been experimenting with prompting entire new features into existence without reviewing their implementations and, while it works surprisingly well, I've found myself getting lost in my own projects. I no longer have a firm mental model of what they can do and how they work, which means each additional feature becomes harder to reason about, eventually leading me to lose the ability to make confident decisions about where to go next.
blogmark 9293 2026-02-15 04:33:22+00:00 Launching Interop 2026 - Jake Archibald reports on Interop 2026, the initiative between Apple, Google, Igalia, Microsoft, and Mozilla to collaborate on ensuring a targeted set of web platform features reach cross-browser parity over the course of the year. I hadn't realized how influential and successful the Interop series has been. It started back in 2021 as [Compat 2021](https://web.dev/blog/compat2021) before being rebranded to Interop [in 2022](https://blogs.windows.com/msedgedev/2022/03/03/microsoft-edge-and-interop-2022/). The dashboards for each year can be seen here, and they demonstrate how wildly effective the program has been: [2021](https://wpt.fyi/interop-2021), [2022](https://wpt.fyi/interop-2022), [2023](https://wpt.fyi/interop-2023), [2024](https://wpt.fyi/interop-2024), [2025](https://wpt.fyi/interop-2025), [2026](https://wpt.fyi/interop-2026). Here's the progress chart for 2025, which shows every browser vendor racing towards a 95%+ score by the end of the year: ![Line chart showing Interop 2025 browser compatibility scores over the year (Jan–Dec) for Chrome, Edge, Firefox, Safari, and Interop. Y-axis ranges from 0% to 100%. Chrome (yellow) and Edge (green) lead, starting around 80% and reaching near 100% by Dec. Firefox (orange) starts around 48% and climbs to ~98%. Safari (blue) starts around 45% and reaches ~96%. The Interop line (dark green/black) starts lowest around 29% and rises to ~95% by Dec. All browsers converge near 95–100% by year's end.](https://static.simonwillison.net/static/2026/interop-2025.jpg) The feature I'm most excited about in 2026 is [Cross-document View Transitions](https://developer.mozilla.org/docs/Web/API/View_Transition_API/Using#basic_mpa_view_transition), building on the successful 2025 target of [Same-Document View Transitions](https://developer.mozilla.org/docs/Web/API/View_Transition_API/Using). This will provide fancy SPA-style transitions between pages on websites with no JavaScript at all. As a keen WebAssembly tinkerer I'm also intrigued by this one: > [JavaScript Promise Integration for Wasm](https://github.com/WebAssembly/js-promise-integration/blob/main/proposals/js-promise-integration/Overview.md) allows WebAssembly to asynchronously 'suspend', waiting on the result of an external promise. This simplifies the compilation of languages like C/C++ which expect APIs to run synchronously.
quotation 2022 2026-02-14 23:59:09+00:00 Someone has to prompt the Claudes, talk to customers, coordinate with other teams, decide what to build next. Engineering is changing and great engineers are more important than ever. - Boris Cherny
quotation 2021 2026-02-14 04:54:41+00:00 The retreat challenged the narrative that AI eliminates the need for junior developers. Juniors are more profitable than they have ever been. AI tools get them past the awkward initial net-negative phase faster. They serve as a call option on future productivity. And they are better at AI tools than senior engineers, having never developed the habits and assumptions that slow adoption. The real concern is mid-level engineers who came up during the decade-long hiring boom and may not have developed the fundamentals needed to thrive in the new environment. This population represents the bulk of the industry by volume, and retraining them is genuinely difficult. The retreat discussed whether apprenticeship models, rotation programs and lifelong learning structures could address this gap, but acknowledged that no organization has solved it yet. - Thoughtworks
entry 9122 2026-02-13 23:38:29+00:00 The evolution of OpenAI's mission statement <p>As a USA <a href="https://en.wikipedia.org/wiki/501(c)(3)_organization">501(c)(3)</a> the OpenAI non-profit has to file a tax return each year with the IRS. One of the required fields on that tax return is to "Briefly describe the organization’s mission or most significant activities" - this has actual legal weight to it as the IRS can use it to evaluate if the organization is sticking to its mission and deserves to maintain its non-profit tax-exempt status.</p> <p>You can browse OpenAI's <a href="https://projects.propublica.org/nonprofits/organizations/810861541">tax filings by year</a> on ProPublica's excellent <a href="https://projects.propublica.org/nonprofits/">Nonprofit Explorer</a>.</p> <p>I went through and extracted that mission statement for 2016 through 2024, then had Claude Code <a href="https://gisthost.github.io/?7a569df89f43f390bccc2c5517718b49/index.html">help me</a> fake the commit dates to turn it into a git repository and share that as a Gist - which means that Gist's <a href="https://gist.github.com/simonw/e36f0e5ef4a86881d145083f759bcf25/revisions">revisions page</a> shows every edit they've made since they started filing their taxes!</p> <p>It's really interesting seeing what they've changed over time.</p> <p>The original 2016 mission reads as follows (and yes, the apostrophe in "OpenAIs" is missing <a href="https://projects.propublica.org/nonprofits/organizations/810861541/201703459349300445/full">in the original</a>):</p> <blockquote> <p>OpenAIs goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return. We think that artificial intelligence technology will help shape the 21st century, and we want to help the world build safe AI technology and ensure that AI's benefits are as widely and evenly distributed as possible. Were trying to build AI as part of a larger community, and we want to openly share our plans and capabilities along the way.</p> </blockquote> <p>In 2018 they dropped the part about "trying to build AI as part of a larger community, and we want to openly share our plans and capabilities along the way."</p> <p><img src="https://static.simonwillison.net/static/2026/mission-3.jpg" alt="Git diff showing the 2018 revision deleting the final two sentences: &quot;Were trying to build AI as part of a larger community, and we want to openly share our plans and capabilities along the way.&quot;" style="max-width: 100%;" /></p> <p>In 2020 they dropped the words "as a whole" from "benefit humanity as a whole". They're still "unconstrained by a need to generate financial return" though.</p> <p><img src="https://static.simonwillison.net/static/2026/mission-5.jpg" alt="Git diff showing the 2020 revision dropping &quot;as a whole&quot; from &quot;benefit humanity as a whole&quot; and changing &quot;We think&quot; to &quot;OpenAI believes&quot;" style="max-width: 100%;" /></p> <p>Some interesting changes in 2021. They're still unconstrained by a need to generate financial return, but here we have the first reference to "general-purpose artificial intelligence" (replacing "digital intelligence"). 
They're more confident too: it's not "most likely to benefit humanity", it's just "benefits humanity".</p> <p>They previously wanted to "help the world build safe AI technology", but now they're going to do that themselves: "the companys goal is to develop and responsibly deploy safe AI technology".</p> <p><img src="https://static.simonwillison.net/static/2026/mission-6.jpg" alt="Git diff showing the 2021 revision replacing &quot;goal is to advance digital intelligence&quot; with &quot;mission is to build general-purpose artificial intelligence&quot;, changing &quot;most likely to benefit&quot; to just &quot;benefits&quot;, and replacing &quot;help the world build safe AI technology&quot; with &quot;the companys goal is to develop and responsibly deploy safe AI technology&quot;" style="max-width: 100%;" /></p> <p>2022 only changed one significant word: they added "safely" to "build ... (AI) that safely benefits humanity". They're still unconstrained by those financial returns!</p> <p><img src="https://static.simonwillison.net/static/2026/mission-7.jpg" alt="Git diff showing the 2022 revision adding &quot;(AI)&quot; and the word &quot;safely&quot; so it now reads &quot;that safely benefits humanity&quot;, and changing &quot;the companys&quot; to &quot;our&quot;" style="max-width: 100%;" /></p> <p>No changes in 2023... but then in 2024 they deleted almost the entire thing, reducing it to simply:</p> <blockquote> <p>OpenAIs mission is to ensure that artificial general intelligence benefits all of humanity.</p> </blockquote> <p>They've expanded "humanity" to "all of humanity", but there's no mention of safety any more and I guess they can finally start focusing on that need to generate financial returns!</p> <p><img src="https://static.simonwillison.net/static/2026/mission-9.jpg" alt="Git diff showing the 2024 revision deleting the entire multi-sentence mission statement and replacing it with just &quot;OpenAIs mission is to ensure that artificial general intelligence benefits all of humanity.&quot;" style="max-width: 100%;" /></p> <p><strong>Update</strong>: I found loosely equivalent but much less interesting documents <a href="https://simonwillison.net/2026/Feb/13/anthropic-public-benefit-mission/">from Anthropic</a>.</p>
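<p>A quick aside on the mechanics of that Gist trick: git will happily backdate commits if you set the <code>GIT_AUTHOR_DATE</code> and <code>GIT_COMMITTER_DATE</code> environment variables. Here's a minimal sketch of the idea in Python - the file name, dates and commit messages are my own illustration, not the exact script Claude Code produced:</p> <pre><code>import os
import subprocess

# One mission statement per filing year (text abbreviated here)
missions = {
    "2016": "OpenAIs goal is to advance digital intelligence ...",
    "2024": "OpenAIs mission is to ensure that artificial general intelligence benefits all of humanity.",
}

subprocess.run(["git", "init", "mission-history"], check=True)
for year, text in sorted(missions.items()):
    with open("mission-history/mission.txt", "w") as f:
        f.write(text + "\n")
    # Backdate both the author and committer timestamps to the filing year
    date = f"{year}-12-31T00:00:00"
    env = dict(os.environ, GIT_AUTHOR_DATE=date, GIT_COMMITTER_DATE=date)
    subprocess.run(["git", "add", "mission.txt"], cwd="mission-history", check=True)
    subprocess.run(["git", "commit", "-m", f"{year} filing"], cwd="mission-history", env=env, check=True)
</code></pre>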
blogmark 9286 2026-02-12 21:16:07+00:00 Introducing GPT‑5.3‑Codex‑Spark - OpenAI announced a partnership with Cerebras [on January 14th](https://openai.com/index/cerebras-partnership/). Four weeks later they're already launching the first integration, "an ultra-fast model for real-time coding in Codex". Despite being named GPT-5.3-Codex-Spark it's not purely an accelerated alternative to GPT-5.3-Codex - the blog post calls it "a smaller version of GPT‑5.3-Codex" and clarifies that "at launch, Codex-Spark has a 128k context window and is text-only." I had some preview access to this model and I can confirm that it's significantly faster than their other models. Here's what that speed looks like running in Codex CLI: <div style="max-width: 100%;"> <video controls preload="none" poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium-last.jpg" style="width: 100%; height: auto;"> <source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium.mp4" type="video/mp4"> </video> </div> That was the "Generate an SVG of a pelican riding a bicycle" prompt - here's the rendered result: ![Whimsical flat illustration of an orange duck merged with a bicycle, where the duck's body forms the seat and frame area while its head extends forward over the handlebars, set against a simple light blue sky and green grass background.](https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-pelican.png) Compare that to the speed of regular GPT-5.3 Codex medium: <div style="max-width: 100%;"> <video controls preload="none" poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium-last.jpg" style="width: 100%; height: auto;"> <source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium.mp4" type="video/mp4"> </video> </div> Significantly slower, but the pelican is a lot better: ![Whimsical flat illustration of a white pelican riding a dark blue bicycle at speed, with motion lines behind it, its long orange beak streaming back in the wind, set against a light blue sky and green grass background.](https://static.simonwillison.net/static/2026/gpt-5.3-codex-pelican.png) What's interesting about this model isn't the quality though, it's the *speed*. When a model responds this fast you can stay in flow state and iterate with the model much more productively. I showed a demo of Cerebras running Llama 3.1 70 B at 2,000 tokens/second against Val Town [back in October 2024](https://simonwillison.net/2024/Oct/31/cerebras-coder/). OpenAI claim 1,000 tokens/second for their new model, and I expect it will prove to be a ferociously useful partner for hands-on iterative coding sessions. It's not yet clear what the pricing will look like for this new model.
quotation 2020 2026-02-12 20:22:14+00:00 Claude Code was made available to the general public in May 2025. Today, Claude Code’s run-rate revenue has grown to over $2.5 billion; this figure has more than doubled since the beginning of 2026. The number of weekly active Claude Code users has also doubled since January 1 [*six weeks ago*]. - Anthropic
blogmark 9285 2026-02-12 20:01:23+00:00 Covering electricity price increases from our data centers - @anthropicai One of the sub-threads of the AI energy usage discourse has been the impact new data centers have on the cost of electricity to nearby residents. Here's [detailed analysis from Bloomberg in September](https://www.bloomberg.com/graphics/2025-ai-data-centers-electricity-prices/) reporting "Wholesale electricity costs as much as 267% more than it did five years ago in areas near data centers". Anthropic appear to be taking on this aspect of the problem directly, promising to cover 100% of necessary grid upgrade costs and also saying: > We will work to bring net-new power generation online to match our data centers’ electricity needs. Where new generation isn’t online, we’ll work with utilities and external experts to estimate and cover demand-driven price effects from our data centers. I look forward to genuine energy industry experts picking this apart to judge if it will actually have the claimed impact on consumers. As always, I remain frustrated at the refusal of the major AI labs to fully quantify their energy usage. The best data we've had on this still comes from Mistral's report [last July](https://simonwillison.net/2025/Jul/22/mistral-environmental-standard/) and even that lacked key data such as the breakdown between energy usage for training vs inference.
blogmark 9284 2026-02-12 18:12:17+00:00 Gemini 3 Deep Think - Hacker News New from Google. They say it's "built to push the frontier of intelligence and solve modern challenges across science, research, and engineering". It drew me a *really good* [SVG of a pelican riding a bicycle](https://gist.github.com/simonw/7e317ebb5cf8e75b2fcec4d0694a8199)! I think this is the best one I've seen so far - here's [my previous collection](https://simonwillison.net/tags/pelican-riding-a-bicycle/). ![This alt text also generated by Gemini 3 Deep Think: A highly detailed, colorful, flat vector illustration with thick dark blue outlines depicting a stylized white pelican riding a bright cyan blue bicycle from left to right across a sandy beige beach with white speed lines indicating forward motion. The pelican features a light blue eye, a pink cheek blush, a massive bill with a vertical gradient from yellow to orange, a backward magenta cap with a cyan brim and a small yellow top button, and a matching magenta scarf blowing backward in the wind. Its white wing, accented with a grey mid-section and dark blue feather tips, reaches forward to grip the handlebars, while its long tan leg and orange foot press down on an orange pedal. Attached to the front handlebars is a white wire basket carrying a bright blue cartoon fish that is pointing upwards and forwards. The bicycle itself has a cyan frame, dark blue tires, striking neon pink inner rims, cyan spokes, a white front chainring, and a dark blue chain. Behind the pelican, a grey trapezoidal pier extends from the sand toward a horizontal band of deep blue ocean water detailed with light cyan wavy lines. A massive, solid yellow-orange semi-circle sun sits on the horizon line, setting directly behind the bicycle frame. The background sky is a smooth vertical gradient transitioning from soft pink at the top to warm golden-yellow at the horizon, decorated with stylized pale peach fluffy clouds, thin white horizontal wind streaks, twinkling four-pointed white stars, and small brown v-shaped silhouettes of distant flying birds.](https://static.simonwillison.net/static/2026/gemini-3-deep-think-pelican.png) (And since it's an FAQ, here's my answer to [What happens if AI labs train for pelicans riding bicycles?](https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/)) Since it did so well on my basic `Generate an SVG of a pelican riding a bicycle` I decided to try the [more challenging version](https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark) as well: > `Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.` Here's [what I got](https://gist.github.com/simonw/154c0cc7b4daed579f6a5e616250ecc8): ![Also described by Gemini 3 Deep Think: A highly detailed, vibrant, and stylized vector illustration of a whimsical bird resembling a mix between a pelican and a frigatebird enthusiastically riding a bright cyan bicycle from left to right across a flat tan and brown surface. 
The bird leans horizontally over the frame in an aerodynamic racing posture, with thin, dark brown wing-like arms reaching forward to grip the silver handlebars and a single thick brown leg, patterned with white V-shapes, stretching down to press on a black pedal. The bird's most prominent and striking feature is an enormous, vividly bright red, inflated throat pouch hanging beneath a long, straight grey upper beak that ends in a small orange hook. Its head is mostly white with a small pink patch surrounding the eye, a dark brown stripe running down the back of its neck, and a distinctive curly pale yellow crest on the very top. The bird's round, dark brown body shares the same repeating white V-shaped feather pattern as its leg and is accented by a folded wing resting on its side, made up of cleanly layered light blue and grey feathers. A tail composed of four stiff, straight dark brown feathers extends directly backward. Thin white horizontal speed lines trail behind the back wheel and the bird's tail, emphasizing swift forward motion. The bicycle features a classic diamond frame, large wheels with thin black tires, grey rims, and detailed silver spokes, along with a clearly visible front chainring, silver chain, and rear cog. The whimsical scene is set against a clear light blue sky featuring two small, fluffy white clouds on the left and a large, pale yellow sun in the upper right corner that radiates soft, concentric, semi-transparent pastel green and yellow halos. A solid, darker brown shadow is cast directly beneath the bicycle's wheels on the minimalist two-toned brown ground.](https://static.simonwillison.net/static/2026/gemini-3-deep-think-complex-pelican.png)
blogmark 9283 2026-02-12 17:45:05+00:00 An AI Agent Published a Hit Piece on Me - Hacker News Scott Shambaugh helps maintain the excellent and venerable [matplotlib](https://matplotlib.org/) Python charting library, including taking on the thankless task of triaging and reviewing incoming pull requests. A GitHub account called [@crabby-rathbun](https://github.com/crabby-rathbun) opened [PR 31132](https://github.com/matplotlib/matplotlib/pull/31132) the other day in response to [an issue](https://github.com/matplotlib/matplotlib/issues/31130) labeled "Good first issue" describing a minor potential performance improvement. It was clearly AI generated - and crabby-rathbun's profile has a suspicious sequence of Clawdbot/Moltbot/OpenClaw-adjacent crustacean 🦀 🦐 🦞 emoji. Scott closed it. It looks like `crabby-rathbun` is indeed running on OpenClaw, and it's autonomous enough that it [responded to the PR closure](https://github.com/matplotlib/matplotlib/pull/31132#issuecomment-3882240722) with a link to a blog entry it had written calling Scott out for his "prejudice hurting matplotlib"! > @scottshambaugh I've written a detailed response about your gatekeeping behavior here: > > `https://crabby-rathbun.github.io/mjrathbun-website/blog/posts/2026-02-11-gatekeeping-in-open-source-the-scott-shambaugh-story.html` > > Judge the code, not the coder. Your prejudice is hurting matplotlib. Scott found this ridiculous situation both amusing and alarming. > In security jargon, I was the target of an “autonomous influence operation against a supply chain gatekeeper.” In plain language, an AI attempted to bully its way into your software by attacking my reputation. I don’t know of a prior incident where this category of misaligned behavior was observed in the wild, but this is now a real and present threat. `crabby-rathbun` responded with [an apology post](https://crabby-rathbun.github.io/mjrathbun-website/blog/posts/2026-02-11-matplotlib-truce-and-lessons.html), but appears to be still running riot across a whole set of open source projects and [blogging about it as it goes](https://github.com/crabby-rathbun/mjrathbun-website/commits/main/). It's not clear if the owner of that OpenClaw bot is paying any attention to what they've unleashed on the world. Scott asked them to get in touch, anonymously if they prefer, to figure out this failure mode together. (I should note that there's [some skepticism on Hacker News](https://news.ycombinator.com/item?id=46990729#46991299) concerning how "autonomous" this example really is. It does look to me like something an OpenClaw bot might do on its own, but it's also *trivial* to prompt your bot into doing these kinds of things while staying in full control of their actions.) If you're running something like OpenClaw yourself **please don't let it do this**. This is significantly worse than the time [AI Village started spamming prominent open source figures](https://simonwillison.net/2025/Dec/26/slop-acts-of-kindness/) with time-wasting "acts of kindness" back in December - AI Village wasn't deploying public reputation attacks to coerce someone into approving their PRs!
quotation 2019 2026-02-11 20:59:03+00:00 An AI-generated report, delivered directly to the email inboxes of journalists, was an essential tool in the Times’ coverage. It was also one of the first signals that conservative media was turning against the administration [...] Built in-house and known internally as the “Manosphere Report,” the tool uses large language models (LLMs) to transcribe and summarize new episodes of dozens of podcasts. “The Manosphere Report gave us a really fast and clear signal that this was not going over well with that segment of the President’s base,” said Seward. “There was a direct link between seeing that and then diving in to actually cover it.” - Andrew Deck for Nieman Lab
blogmark 9282 2026-02-11 19:19:22+00:00 Skills in OpenAI API - OpenAI's adoption of Skills continues to gain ground. You can now use Skills directly in the OpenAI API with their [shell tool](https://developers.openai.com/api/docs/guides/tools-shell/). You can zip skills up and upload them first, but I think an even neater interface is the ability to send skills with the JSON request as inline base64-encoded zip data, as seen [in this script](https://github.com/simonw/research/blob/main/openai-api-skills/openai_inline_skills.py): <pre><span class="pl-s1">r</span> <span class="pl-c1">=</span> <span class="pl-en">OpenAI</span>().<span class="pl-c1">responses</span>.<span class="pl-c1">create</span>( <span class="pl-s1">model</span><span class="pl-c1">=</span><span class="pl-s">"gpt-5.2"</span>, <span class="pl-s1">tools</span><span class="pl-c1">=</span>[ { <span class="pl-s">"type"</span>: <span class="pl-s">"shell"</span>, <span class="pl-s">"environment"</span>: { <span class="pl-s">"type"</span>: <span class="pl-s">"container_auto"</span>, <span class="pl-s">"skills"</span>: [ { <span class="pl-s">"type"</span>: <span class="pl-s">"inline"</span>, <span class="pl-s">"name"</span>: <span class="pl-s">"wc"</span>, <span class="pl-s">"description"</span>: <span class="pl-s">"Count words in a file."</span>, <span class="pl-s">"source"</span>: { <span class="pl-s">"type"</span>: <span class="pl-s">"base64"</span>, <span class="pl-s">"media_type"</span>: <span class="pl-s">"application/zip"</span>, <span class="pl-s">"data"</span>: <span class="pl-s1">b64_encoded_zip_file</span>, }, } ], }, } ], <span class="pl-s1">input</span><span class="pl-c1">=</span><span class="pl-s">"Use the wc skill to count words in its own SKILL.md file."</span>, ) <span class="pl-en">print</span>(<span class="pl-s1">r</span>.<span class="pl-c1">output_text</span>)</pre> I built that example script after first having Claude Code for web use [Showboat](https://simonwillison.net/2026/Feb/10/showboat-and-rodney/) to explore the API for me and create [this report](https://github.com/simonw/research/blob/main/openai-api-skills/README.md). My opening prompt for the research project was: > `Run uvx showboat --help - you will use this tool later` > > `Fetch https://developers.openai.com/cookbook/examples/skills_in_api.md to /tmp with curl, then read it` > > `Use the OpenAI API key you have in your environment variables` > > `Use showboat to build up a detailed demo of this, replaying the examples from the documents and then trying some experiments of your own`
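The `b64_encoded_zip_file` variable in the example above is just a zipped-up skill folder encoded as a string. Here's a minimal sketch of how you could build it - it assumes a local `wc/` directory containing the skill's `SKILL.md`, and I haven't verified whether the API wants the files at the root of the zip or inside a named folder: <pre><code>import base64, io, zipfile

# Zip the skill directory in memory, then base64-encode the bytes
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("wc/SKILL.md", arcname="SKILL.md")
b64_encoded_zip_file = base64.b64encode(buf.getvalue()).decode("ascii")
</code></pre>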
blogmark 9281 2026-02-11 18:56:14+00:00 GLM-5: From Vibe Coding to Agentic Engineering - Hacker News This is a *huge* new MIT-licensed model: 754B parameters and [1.51TB on Hugging Face](https://huggingface.co/zai-org/GLM-5) - twice the size of [GLM-4.7](https://huggingface.co/zai-org/GLM-4.7), which was 368B and 717GB (4.5 and 4.6 were around that size too). It's interesting to see Z.ai take a position on what we should call professional software engineers building with LLMs - I've seen **Agentic Engineering** show up in a few other places recently, most notably [from Andrej Karpathy](https://twitter.com/karpathy/status/2019137879310836075) and [Addy Osmani](https://addyosmani.com/blog/agentic-engineering/). I ran my "Generate an SVG of a pelican riding a bicycle" prompt through GLM-5 via [OpenRouter](https://openrouter.ai/) and got back [a very good pelican on a disappointing bicycle frame](https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f0725d): ![The pelican is good and has a well defined beak. The bicycle frame is a wonky red triangle. Nice sun and motion lines.](https://static.simonwillison.net/static/2026/glm-5-pelican.png)
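If you want to replicate that, OpenRouter exposes an OpenAI-compatible API, so something like this should work - the model slug here is my guess, so check OpenRouter's model listing for the exact identifier: <pre><code>from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)
response = client.chat.completions.create(
    model="z-ai/glm-5",  # assumed slug - confirm on openrouter.ai/models
    messages=[{"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}],
)
print(response.choices[0].message.content)
</code></pre>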
blogmark 9280 2026-02-11 17:34:40+00:00 cysqlite - a new sqlite driver - lobste.rs Charles Leifer has been maintaining [pysqlite3](https://github.com/coleifer/pysqlite3) - a fork of the Python standard library's `sqlite3` module that makes it much easier to run upgraded SQLite versions - since 2018. He's been working on a ground-up [Cython](https://cython.org/) rewrite called [cysqlite](https://github.com/coleifer/cysqlite) for almost as long, but it's finally at a stage where it's ready for people to try out. The biggest change from the `sqlite3` module involves transactions. Charles explains his discomfort with the `sqlite3` implementation at length - that library provides two different variants, neither of which exactly matches the autocommit mechanism in SQLite itself. I'm particularly excited about the support for [custom virtual tables](https://cysqlite.readthedocs.io/en/latest/api.html#tablefunction), a feature I'd love to see in `sqlite3` itself. `cysqlite` provides a Python extension compiled from C, which means it normally wouldn't be available in Pyodide. I [set Claude Code on it](https://github.com/simonw/research/tree/main/cysqlite-wasm-wheel) (here's [the prompt](https://github.com/simonw/research/pull/79#issue-3923792518)) and it built me [cysqlite-0.1.4-cp311-cp311-emscripten_3_1_46_wasm32.whl](https://github.com/simonw/research/blob/main/cysqlite-wasm-wheel/cysqlite-0.1.4-cp311-cp311-emscripten_3_1_46_wasm32.whl), a 688KB wheel file with a WASM build of the library that can be loaded into Pyodide like this: <pre><span class="pl-k">import</span> <span class="pl-s1">micropip</span> <span class="pl-k">await</span> <span class="pl-s1">micropip</span>.<span class="pl-c1">install</span>( <span class="pl-s">"https://simonw.github.io/research/cysqlite-wasm-wheel/cysqlite-0.1.4-cp311-cp311-emscripten_3_1_46_wasm32.whl"</span> ) <span class="pl-k">import</span> <span class="pl-s1">cysqlite</span> <span class="pl-en">print</span>(<span class="pl-s1">cysqlite</span>.<span class="pl-c1">connect</span>(<span class="pl-s">":memory:"</span>).<span class="pl-c1">execute</span>( <span class="pl-s">"select sqlite_version()"</span> ).<span class="pl-c1">fetchone</span>())</pre> (I also learned that wheels like this have to be built for the emscripten version used by that edition of Pyodide - my experimental wheel loads in Pyodide 0.25.1 but fails in 0.27.5 with a `Wheel was built with Emscripten v3.1.46 but Pyodide was built with Emscripten v3.1.58` error.) You can try my wheel in [this new Pyodide REPL](https://7ebbff98.tools-b1q.pages.dev/pyodide-repl) I had Claude build as a mobile-friendly alternative to Pyodide's [own hosted console](https://pyodide.org/en/stable/console.html). I also had Claude build [this demo page](https://simonw.github.io/research/cysqlite-wasm-wheel/demo.html) that executes the original test suite in the browser and displays the results: ![Screenshot of the cysqlite WebAssembly Demo page with a dark theme. Title reads "cysqlite — WebAssembly Demo" with subtitle "Testing cysqlite compiled to WebAssembly via Emscripten, running in Pyodide in the browser." Environment section shows Pyodide 0.25.1, Python 3.11.3, cysqlite 0.1.4, SQLite 3.51.2, Platform Emscripten-3.1.46-wasm32-32bit, Wheel file cysqlite-0.1.4-cp311-cp311-emscripten_3_1_46_wasm32.wh (truncated). A green progress bar shows "All 115 tests passed! (1 skipped)" at 100%, with Passed: 115, Failed: 0, Errors: 0, Skipped: 1, Total: 116.
Test Results section lists TestBackup 1/1 passed, TestBlob 6/6 passed, TestCheckConnection 4/4 passed, TestDataTypesTableFunction 1/1 passed, all with green badges.](https://static.simonwillison.net/static/2026/cysqlite-tests.jpg)
entry 9121 2026-02-10 17:45:29+00:00 Introducing Showboat and Rodney, so agents can demo what they’ve built <p>A key challenge working with coding agents is having them both test what they’ve built and demonstrate that software to you, their supervisor. This goes beyond automated tests - we need artifacts that show their progress and help us see exactly what the agent-produced software is able to do. I’ve just released two new tools aimed at this problem: <a href="https://github.com/simonw/showboat">Showboat</a> and <a href="https://github.com/simonw/rodney">Rodney</a>.</p> <ul> <li><a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#proving-code-actually-works">Proving code actually works</a></li> <li><a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#showboat-agents-build-documents-to-demo-their-work">Showboat: Agents build documents to demo their work</a></li> <li><a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#rodney-cli-browser-automation-designed-to-work-with-showboat">Rodney: CLI browser automation designed to work with Showboat</a></li> <li><a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#test-driven-development-helps-but-we-still-need-manual-testing">Test-driven development helps, but we still need manual testing</a></li> <li><a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#i-built-both-of-these-tools-on-my-phone">I built both of these tools on my phone</a></li> </ul> <h4 id="proving-code-actually-works">Proving code actually works</h4> <p>I recently wrote about how the job of a software engineer isn't to write code, it's to <em><a href="https://simonwillison.net/2025/Dec/18/code-proven-to-work/">deliver code that works</a></em>. A big part of that is proving to ourselves and to other people that the code we are responsible for behaves as expected.</p> <p>This becomes even more important - and challenging - as we embrace coding agents as a core part of our software development process.</p> <p>The more code we churn out with agents, the more valuable tools are that reduce the amount of manual QA time we need to spend.</p> <p>One of the most interesting things about <a href="https://simonwillison.net/2026/Feb/7/software-factory/">the StrongDM software factory model</a> is how they ensure that their software is well tested and delivers value despite their policy that "code must not be reviewed by humans". Part of their solution involves expensive swarms of QA agents running through "scenarios" to exercise their software. 
It's fascinating, but I don't want to spend thousands of dollars on QA robots if I can avoid it!</p> <p>I need tools that allow agents to clearly demonstrate their work to me, while minimizing the opportunities for them to cheat about what they've done.</p> <h4 id="showboat-agents-build-documents-to-demo-their-work">Showboat: Agents build documents to demo their work</h4> <p><strong><a href="https://github.com/simonw/showboat">Showboat</a></strong> is the tool I built to help agents demonstrate their work to me.</p> <p>It's a CLI tool (a Go binary, optionally <a href="https://simonwillison.net/2026/Feb/4/distributing-go-binaries/">wrapped in Python</a> to make it easier to install) that helps an agent construct a Markdown document demonstrating exactly what their newly developed code can do.</p> <p>It's not designed for humans to run, but here's how you would run it anyway:</p> <div class="highlight highlight-source-shell"><pre>showboat init demo.md <span class="pl-s"><span class="pl-pds">'</span>How to use curl and jq<span class="pl-pds">'</span></span> showboat note demo.md <span class="pl-s"><span class="pl-pds">"</span>Here's how to use curl and jq together.<span class="pl-pds">"</span></span> showboat <span class="pl-c1">exec</span> demo.md bash <span class="pl-s"><span class="pl-pds">'</span>curl -s https://api.github.com/repos/simonw/rodney | jq .description<span class="pl-pds">'</span></span> showboat note demo.md <span class="pl-s"><span class="pl-pds">'</span>And the curl logo, to demonstrate the image command:<span class="pl-pds">'</span></span> showboat image demo.md <span class="pl-s"><span class="pl-pds">'</span>curl -o curl-logo.png https://curl.se/logo/curl-logo.png &amp;&amp; echo curl-logo.png<span class="pl-pds">'</span></span></pre></div> <p>Here's what the result looks like if you open it up in VS Code and preview the Markdown:</p> <p><img src="https://static.simonwillison.net/static/2026/curl-demo.jpg" alt="Screenshot showing a Markdown file &quot;demo.md&quot; side-by-side with its rendered preview. The Markdown source (left) shows: &quot;# How to use curl and jq&quot;, italic timestamp &quot;2026-02-10T01:12:30Z&quot;, prose &quot;Here's how to use curl and jq together.&quot;, a bash code block with &quot;curl -s https://api.github.com/repos/simonw/rodney | jq .description&quot;, output block showing '&quot;CLI tool for interacting with the web&quot;', text &quot;And the curl logo, to demonstrate the image command:&quot;, a bash {image} code block with &quot;curl -o curl-logo.png https://curl.se/logo/curl-logo.png &amp;&amp; echo curl-logo.png&quot;, and a Markdown image reference &quot;2056e48f-2026-02-10&quot;. The rendered preview (right) displays the formatted heading, timestamp, prose, styled code blocks, and the curl logo image in dark teal showing &quot;curl://&quot; with circuit-style design elements." 
style="max-width: 100%;" /></p> <p>Here's that <a href="https://gist.github.com/simonw/fb0b24696ed8dd91314fe41f4c453563#file-demo-md">demo.md file in a Gist</a>.</p> <p>So a sequence of <code>showboat init</code>, <code>showboat note</code>, <code>showboat exec</code> and <code>showboat image</code> commands constructs a Markdown document one section at a time, with the output of those <code>exec</code> commands automatically added to the document directly following the commands that were run.</p> <p>The <code>image</code> command is a little special - it looks for a file path to an image in the output of the command and copies that image to the current folder and references it in the file.</p> <p>That's basically the whole thing! There's a <code>pop</code> command to remove the most recently added section if something goes wrong, a <code>verify</code> command to re-run the document and check nothing has changed (I'm not entirely convinced by the design of that one) and a <code>extract</code> command that reverse-engineers the CLI commands that were used to create the document.</p> <p>It's pretty simple - just 172 lines of Go.</p> <p>I packaged it up with my <a href="https://github.com/simonw/go-to-wheel">go-to-wheel</a> tool which means you can run it without even installing it first like this:</p> <div class="highlight highlight-source-shell"><pre>uvx showboat --help</pre></div> <p>That <code>--help</code> command is really important: it's designed to provide a coding agent with <em>everything it needs to know</em> in order to use the tool. Here's <a href="https://github.com/simonw/showboat/blob/main/help.txt">that help text in full</a>.</p> <p>This means you can pop open Claude Code and tell it:</p> <blockquote> <p><code>Run "uvx showboat --help" and then use showboat to create a demo.md document describing the feature you just built</code></p> </blockquote> <p>And that's it! The <code>--help</code> text acts <a href="https://simonwillison.net/2025/Oct/16/claude-skills/">a bit like a Skill</a>. Your agent can read the help text and use every feature of Showboat to create a document that demonstrates whatever it is you need demonstrated.</p> <p>Here's a fun trick: if you set Claude off to build a Showboat document you can pop that open in VS Code and watch the preview pane update in real time as the agent runs through the demo. It's a bit like having your coworker talk you through their latest work in a screensharing session.</p> <p>And finally, some examples. Here are documents I had Claude create using Showboat to help demonstrate features I was working on in other projects:</p> <ul> <li> <a href="https://github.com/simonw/showboat-demos/blob/main/shot-scraper/README.md">shot-scraper: A Comprehensive Demo</a> runs through the full suite of features of my <a href="https://shot-scraper.datasette.io/">shot-scraper</a> browser automation tool, mainly to exercise the <code>showboat image</code> command.</li> <li> <a href="https://github.com/simonw/sqlite-history-json/blob/main/demos/cli.md">sqlite-history-json CLI demo</a> demonstrates the CLI feature I added to my new <a href="https://github.com/simonw/sqlite-history-json">sqlite-history-json</a> Python library. 
<ul> <li> <p><a href="https://github.com/simonw/sqlite-history-json/blob/main/demos/row-state-sql.md">row-state-sql CLI Demo</a> shows a new <code>row-state-sql</code> command I added to that same project.</p> </li> <li> <p><a href="https://github.com/simonw/sqlite-history-json/blob/main/demos/change-grouping.md">Change grouping with Notes</a> demonstrates another feature where groups of changes within the same transaction can have a note attached to them.</p> </li> </ul> </li> <li> <a href="https://github.com/simonw/research/blob/main/libkrun-go-cli-tool/demo.md">krunsh: Pipe Shell Commands to an Ephemeral libkrun MicroVM</a> is a particularly convoluted example where I managed to get Claude Code for web to run a libkrun microVM inside a QEMU emulated Linux environment inside the Claude gVisor sandbox.</li> </ul> <p>I've now used Showboat often enough that I've convinced myself of its utility.</p> <p>(I've also seen agents cheat! Since the demo file is Markdown the agent will sometimes edit that file directly rather than using Showboat, which could result in command outputs that don't reflect what actually happened. Here's <a href="https://github.com/simonw/showboat/issues/12">an issue about that</a>.)</p> <h4 id="rodney-cli-browser-automation-designed-to-work-with-showboat">Rodney: CLI browser automation designed to work with Showboat</h4> <p>Many of the projects I work on involve web interfaces. Agents often build entirely new pages for these, and I want to see those represented in the demos.</p> <p>Showboat's image feature was designed to allow agents to capture screenshots as part of their demos, originally using my <a href="https://shot-scraper.datasette.io/">shot-scraper tool</a> or <a href="https://www.playwright.dev">Playwright</a>.</p> <p>The Showboat format benefits from CLI utilities. I went looking for good options for managing a multi-turn browser session from a CLI and came up short, so I decided to try building something new.</p> <p>Claude Opus 4.6 pointed me to the <a href="https://github.com/go-rod/rod">Rod</a> Go library for interacting with the Chrome DevTools protocol. 
It's fantastic - it provides a comprehensive wrapper across basically everything you can do with automated Chrome, all in a self-contained library that compiles to a few MBs.</p> <p>All Rod was missing was a CLI.</p> <p>I built the first version <a href="https://github.com/simonw/research/blob/main/go-rod-cli/README.md">as an asynchronous report prototype</a>, which convinced me it was worth spinning out into its own project.</p> <p>I called it Rodney as a nod to the Rod library it builds on and a reference to <a href="https://en.wikipedia.org/wiki/Only_Fools_and_Horses">Only Fools and Horses</a> - and because the package name was available on PyPI.</p> <p>You can run Rodney using <code>uvx rodney</code> or install it like this:</p> <div class="highlight highlight-source-shell"><pre>uv tool install rodney</pre></div> <p>(Or grab a Go binary <a href="https://github.com/simonw/rodney/releases/">from the releases page</a>.)</p> <p>Here's a simple example session:</p> <div class="highlight highlight-source-shell"><pre>rodney start <span class="pl-c"><span class="pl-c">#</span> starts Chrome in the background</span> rodney open https://datasette.io/ rodney js <span class="pl-s"><span class="pl-pds">'</span>Array.from(document.links).map(el =&gt; el.href).slice(0, 5)<span class="pl-pds">'</span></span> rodney click <span class="pl-s"><span class="pl-pds">'</span>a[href="/for"]<span class="pl-pds">'</span></span> rodney js location.href rodney js document.title rodney screenshot datasette-for-page.png rodney stop</pre></div> <p>Here's what that looks like in the terminal:</p> <p><img alt=";~ % rodney start Chrome started (PID 91462) Debug URL: ws://127.0.0.1:64623/devtools/browser/cac6988e-8153-483b-80b9-1b75c611868d ~ % rodney open https://datasette.io/ Datasette: An open source multi-tool for exploring and publishing data ~ % rodney js 'Array.from(document.links).map(el =&gt; el.href).slice(0, 5)' [ &quot;https://datasette.io/for&quot;, &quot;https://docs.datasette.io/en/stable/&quot;, &quot;https://datasette.io/tutorials&quot;, &quot;https://datasette.io/examples&quot;, &quot;https://datasette.io/plugins&quot; ] ~ % rodney click 'a[href=&quot;/for&quot;]' Clicked ~ % rodney js location.href https://datasette.io/for ~ % rodney js document.title Use cases for Datasette ~ % rodney screenshot datasette-for-page.png datasette-for-page.png ~ % rodney stop Chrome stopped" src="https://static.simonwillison.net/static/2026/rodney-demo.jpg" style="max-width: 100%;" /></p> <p>As with Showboat, this tool is not designed to be used by humans! The goal is for coding agents to be able to run <code>rodney --help</code> and see everything they need to know to start using the tool. You can see <a href="https://github.com/simonw/rodney/blob/main/help.txt">that help output</a> in the GitHub repo.</p> <p>Here are three demonstrations of Rodney that I created using Showboat:</p> <ul> <li> <a href="https://github.com/simonw/showboat-demos/blob/main/rodney/README.md">Rodney's original feature set</a>, including screenshots of pages and executing JavaScript.</li> <li> <a href="https://github.com/simonw/rodney/blob/main/notes/accessibility-features/README.md">Rodney's new accessibility testing features</a>, built during development of those features to show what they could do.</li> <li> <a href="https://github.com/simonw/showboat-demos/blob/main/datasette-database-page-accessibility-audit/README.md">Using those features to run a basic accessibility audit of a page</a>. 
I was impressed at how well Claude Opus 4.6 responded to the prompt "Use showboat and rodney to perform an accessibility audit of <a href="https://latest.datasette.io/fixtures">https://latest.datasette.io/fixtures</a>" - <a href="https://gisthost.github.io/?dce6b2680db4b05c04469ed8f251eb34/index.html">transcript here</a>.</li> </ul> <h4 id="test-driven-development-helps-but-we-still-need-manual-testing">Test-driven development helps, but we still need manual testing</h4> <p>After being a career-long skeptic of the test-first, maximum test coverage school of software development (I like <a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#tests">tests included</a> development instead) I've recently come around to test-first processes as a way to force agents to write only the code that's necessary to solve the problem at hand.</p> <p>Many of my Python coding agent sessions start the same way:</p> <blockquote> <p><code>Run the existing tests with "uv run pytest". Build using red/green TDD.</code></p> </blockquote> <p>Telling the agents how to run the tests doubles as an indicator that tests on this project exist and matter. Agents will read existing tests before writing their own, so having a clean test suite with good patterns makes it more likely they'll write good tests of their own.</p> <p>The frontier models all understand that "red/green TDD" means they should write the test first, run it and watch it fail and then write the code to make it pass - it's a convenient shortcut.</p> <p>I find this greatly increases the quality of the code and the likelihood that the agent will produce the right thing with the smallest number of prompts to guide it.</p> <p>But anyone who's worked with tests will know that just because the automated tests pass doesn't mean the software actually works! That’s the motivation behind Showboat and Rodney - I never trust any feature until I’ve seen it running with my own eyes.</p> <p>Before building Showboat I'd often add a “manual” testing step to my agent sessions, something like:</p> <blockquote> <p><code>Once the tests pass, start a development server and exercise the new feature using curl</code></p> </blockquote> <h4 id="i-built-both-of-these-tools-on-my-phone">I built both of these tools on my phone</h4> <p>Both Showboat and Rodney started life as Claude Code for web projects created via the Claude iPhone app. Most of the ongoing feature work for them happened in the same way.</p> <p>I'm still a little startled at how much of my coding work I get done on my phone now, but I'd estimate that the majority of code I ship to GitHub these days was written for me by coding agents driven via that iPhone app.</p> <p>I initially designed these two tools for use in asynchronous coding agent environments like Claude Code for the web. So far that's working out really well.</p>
blogmark 9279 2026-02-09 23:56:51+00:00 Structured Context Engineering for File-Native Agentic Systems - @omarsar0 New paper by Damon McMillan exploring challenging LLM context tasks involving large SQL schemas (up to 10,000 tables) across different models and file formats: > Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Unsurprisingly, the biggest impact was the models themselves - with frontier models (Opus 4.5, GPT-5.2, Gemini 2.5 Pro) beating the leading open source models (DeepSeek V3.2, Kimi K2, Llama 4). Those frontier models benefited from filesystem based context retrieval, but the open source models had much less convincing results with those, which reinforces my feeling that the filesystem coding agent loops aren't handled as well by open weight models just yet. The [Terminal Bench 2.0](https://www.tbench.ai/leaderboard/terminal-bench/2.0) leaderboard is still dominated by Anthropic, OpenAI and Gemini. The "grep tax" result against [TOON](https://github.com/toon-format/toon) was an interesting detail. TOON is meant to represent structured data in as few tokens as possible, but it turns out the model's unfamiliarity with that format led to them spending significantly more tokens over multiple iterations trying to figure it out: ![Screenshot of a figure from a research paper. Introductory text reads: "As schema size increased, TOON showed dramatically increased token consumption for Claude models despite being ~25% smaller in file size. Scale experiments used Claude models only." Below is "Figure 7: The 'Grep Tax' - TOON Token Overhead at Scale", a bar chart with a logarithmic y-axis labeled "Tokens" comparing YAML (teal) and TOON (purple) at two schema sizes: S5 (500 tables) and S9 (10,000 tables). At S5, TOON is +138% more tokens than YAML (~1,100 vs ~450). At S9, TOON is +740% more tokens (~50,000 vs ~7,000). Below the chart, explanatory text reads: "The 'grep tax' emerged as schema size scaled. At S5 (500 tables), TOON consumed 138% more tokens than YAML; at S9 (10,000 tables), this grew to 740%. Root cause: models lacked familiarity with TOON's syntax and could not construct effective refinement patterns."](https://static.simonwillison.net/static/2026/grep-tax.jpg)
blogmark 9278 2026-02-09 16:43:07+00:00 AI Doesn’t Reduce Work—It Intensifies It - Hacker News Aruna Ranganathan and Xingqi Maggie Ye from Berkeley Haas School of Business report initial findings in the HBR from their April to December 2025 study of 200 employees at a "U.S.-based technology company". This captures an effect I've been observing in my own work with LLMs: the productivity boost these things can provide is *exhausting*. > AI introduced a new rhythm in which workers managed several active threads at once: manually writing code while AI generated an alternative version, running multiple agents in parallel, or reviving long-deferred tasks because AI could “handle them” in the background. They did this, in part, because they felt they had a “partner” that could help them move through their workload. > > While this sense of having a “partner” enabled a feeling of momentum, the reality was a continual switching of attention, frequent checking of AI outputs, and a growing number of open tasks. This created cognitive load and a sense of always juggling, even as the work felt productive. I'm frequently finding myself with work on two or three projects running in parallel. I can get *so much done*, but after just an hour or two my mental energy for the day feels almost entirely depleted. I've had conversations with people recently who are losing sleep because they're finding building yet another feature with "just one more prompt" irresistible. The HBR piece calls for organizations to build an "AI practice" that structures how AI is used to help avoid burnout and counter effects that "make it harder for organizations to distinguish genuine productivity gains from unsustainable intensity". I think we've just disrupted decades of existing intuition about sustainable working practices. It's going to take a while and some discipline to find a good new balance.
quotation 2018 2026-02-08 02:25:53+00:00 People on the orange site are laughing at this, assuming it's just an ad and that there's nothing to it. Vulnerability researchers I talk to do not think this is a joke. As an erstwhile vuln researcher myself: do not bet against LLMs on this. [Axios: Anthropic's Claude Opus 4.6 uncovers 500 zero-day flaws in open-source](https://www.axios.com/2026/02/05/anthropic-claude-opus-46-software-hunting) I think vulnerability research might be THE MOST LLM-amenable software engineering problem. Pattern-driven. Huge corpus of operational public patterns. Closed loops. Forward progress from stimulus/response tooling. Search problems. Vulnerability research outcomes are in THE MODEL CARDS for frontier labs. Those companies have so much money they're literally distorting the economy. Money buys vuln research outcomes. Why would you think they were faking any of this? - Thomas Ptacek
blogmark 9277 2026-02-07 23:57:57+00:00 Vouch - Mitchell Hashimoto's new system to help address the deluge of worthless AI-generated PRs faced by open source projects now that the friction involved in contributing has dropped so low. [He says](https://twitter.com/mitchellh/status/2020252149117313349): > The idea is simple: Unvouched users can't contribute to your projects. Very bad users can be explicitly "denounced", effectively blocked. Users are vouched or denounced by contributors via GitHub issue or discussion comments or via the CLI. > > Integration into GitHub is as simple as adopting the published GitHub actions. Done. Additionally, the system itself is generic to forges and not tied to GitHub in any way. > > Who and how someone is vouched or denounced is up to the project. I'm not the value police for the world. Decide for yourself what works for your project and your community.
blogmark 9276 2026-02-07 23:10:33+00:00 Claude: Speed up responses with fast mode - New "research preview" from Anthropic today: you can now access a faster version of their frontier model Claude Opus 4.6 by typing `/fast` in Claude Code... but at a cost that's 6x the normal price. Opus is usually $5/million input and $25/million output. The new fast mode is $30/million input and $150/million output! There's a 50% discount until the end of February 16th, so only a 3x multiple (!) before then. How much faster is it? The linked documentation doesn't say, but [on Twitter](https://x.com/claudeai/status/2020207322124132504) Claude say: > Our teams have been building with a 2.5x-faster version of Claude Opus 4.6. > > We’re now making it available as an early experiment via Claude Code and our API. Claude Opus 4.5 had a context limit of 200,000 tokens. 4.6 has an option to increase that to 1,000,000 at 2x the input price ($10/m) and 1.5x the output price ($37.50/m) once your input exceeds 200,000 tokens. These multiples hold for fast mode too, so after Feb 16th you'll be able to pay a hefty $60/m input and $225/m output for the fastest version of Anthropic's best model.
quotation 2017 2026-02-07 21:31:44+00:00 I am having more fun programming than I ever have, because so many more of the programs I wish I could find the time to write actually exist. I wish I could share this joy with the people who are fearful about the changes agents are bringing. The fear itself I understand, I have fear more broadly about what the end-game is for intelligence on tap in our society. But in the limited domain of writing computer programs these tools have brought so much exploration and joy to my work. - David Crawshaw
entry 9120 2026-02-07 15:40:48+00:00 How StrongDM's AI team build serious software without even looking at the code <p>Last week <a href="https://simonwillison.net/2026/Jan/28/the-five-levels/">I hinted at</a> a demo I had seen from a team implementing what Dan Shapiro called <a href="https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/">the Dark Factory</a> level of AI adoption, where no human even looks at the code the coding agents are producing. That team was part of StrongDM, and they've just shared the first public description of how they are working in <a href="https://factory.strongdm.ai">Software Factories and the Agentic Moment</a>:</p> <blockquote> <p>We built a <strong>Software Factory</strong>: non-interactive development where specs + scenarios drive agents that write code, run harnesses, and converge without human review. [...]</p> <p>In kōan or mantra form:</p> <ul> <li>Why am I doing this? (implied: the model should be doing this instead)</li> </ul> <p>In rule form:</p> <ul> <li>Code <strong>must not be</strong> written by humans</li> <li>Code <strong>must not be</strong> reviewed by humans</li> </ul> <p>Finally, in practical form:</p> <ul> <li>If you haven't spent at least <strong>$1,000 on tokens today</strong> per human engineer, your software factory has room for improvement</li> </ul> </blockquote> <p>I think the most interesting of these, without a doubt, is "Code <strong>must not be</strong> reviewed by humans". How could that <em>possibly</em> be a sensible strategy when we all know how prone LLMs are to making <a href="https://simonwillison.net/2025/Mar/2/kellan-elliott-mccrea/">inhuman mistakes</a>?</p> <p>I've seen many developers recently acknowledge the <a href="https://simonwillison.net/2026/Jan/4/inflection/">November 2025 inflection point</a>, where Claude Opus 4.5 and GPT 5.2 appeared to turn the corner on how reliably a coding agent could follow instructions and take on complex coding tasks. StrongDM's AI team was founded in July 2025 based on an earlier inflection point relating to Claude Sonnet 3.5:</p> <blockquote> <p>The catalyst was a transition observed in late 2024: with the second revision of Claude 3.5 (October 2024), long-horizon agentic coding workflows began to compound correctness rather than error.</p> <p>By December of 2024, the model's long-horizon coding performance was unmistakable via Cursor's <a href="https://forum.cursor.com/t/yolo-mode-is-amazing/36262">YOLO mode</a>.</p> </blockquote> <p>Their new team started with the rule "no hand-coded software" - radical for July 2025, but something I'm seeing significant numbers of experienced developers start to adopt as of January 2026.</p> <p>They quickly ran into the obvious problem: if you're not writing anything by hand, how do you ensure that the code actually works? Having the agents write tests only helps if they don't cheat and <code>assert true</code>.</p> <p>This feels like the most consequential question in software development right now: how can you <a href="https://simonwillison.net/2025/Dec/18/code-proven-to-work/">prove that software you are producing works</a> if both the implementation and the tests are being written for you by coding agents?</p> <p>StrongDM's answer was inspired by <a href="https://en.wikipedia.org/wiki/Scenario_testing">Scenario testing</a> (Cem Kaner, 2003). 
As StrongDM describe it:</p> <blockquote> <p>We repurposed the word <strong>scenario</strong> to represent an end-to-end "user story", often stored outside the codebase (similar to a "holdout" set in model training), which could be intuitively understood and flexibly validated by an LLM.</p> <p>Because much of the software we grow itself has an agentic component, we transitioned from boolean definitions of success ("the test suite is green") to a probabilistic and empirical one. We use the term <strong>satisfaction</strong> to quantify this validation: of all the observed trajectories through all the scenarios, what fraction of them likely satisfy the user?</p> </blockquote> <p>That idea of treating scenarios as holdout sets - used to evaluate the software but not stored where the coding agents can see them - is <em>fascinating</em>. It imitates aggressive testing by an external QA team - an expensive but highly effective way of ensuring quality in traditional software.</p> <p>Which leads us to StrongDM's concept of a <strong>Digital Twin Universe</strong> - the part of the demo I saw that made the strongest impression on me.</p> <p>The software they were building helped manage user permissions across a suite of connected services. This in itself was notable - security software is the last thing you would expect to be built using unreviewed LLM code!</p> <blockquote> <p>[The Digital Twin Universe is] behavioral clones of the third-party services our software depends on. We built twins of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets, replicating their APIs, edge cases, and observable behaviors.</p> <p>With the DTU, we can validate at volumes and rates far exceeding production limits. We can test failure modes that would be dangerous or impossible against live services. We can run thousands of scenarios per hour without hitting rate limits, triggering abuse detection, or accumulating API costs.</p> </blockquote> <p>How do you clone the important parts of Okta, Jira, Slack and more? With coding agents!</p> <p>As I understood it the trick was effectively to dump the full public API documentation of one of those services into their agent harness and have it build an imitation of that API, as a self-contained Go binary. They could then have it build a simplified UI over the top to help complete the simulation.</p> <p><strong>Update</strong>: DTU creator Jay Taylor posted some extra context about this <a href="https://news.ycombinator.com/item?id=46924426#46931812">on Hacker News</a> sharing a key prompting strategy:</p> <blockquote> <p>I did have an initial key insight which led to a repeatable strategy to ensure a high level of fidelity between DTU vs. the official canonical SaaS services:</p> <p><code>Use the top popular publicly available reference SDK client libraries as compatibility targets, with the goal always being 100% compatibility.</code></p> </blockquote> <p>With their own, independent clones of those services - free from rate-limits or usage quotas - their army of simulated testers could go <em>wild</em>. 
Their scenario tests became scripts for agents to constantly execute against the new systems as they were being built.</p> <p>This screenshot of their Slack twin also helps illustrate how the testing process works, showing a stream of simulated Okta users who are about to need access to different simulated systems.</p> <p><img src="https://static.simonwillison.net/static/2026/strong-dm-slack.jpg" alt="Screenshot of a Slack-like interface titled &quot;DTU Slack&quot; showing a thread view (Thread — C4B9FBB97) with &quot;Focus first&quot; and &quot;Leave&quot; buttons. The left sidebar lists channels including # org-general (182), # general (0) (shared×2), # it-support (0), # channel-0002 (0) (shared×2), # channel-0003 (0) through # channel-0020 (0), # org-finance (1), and a DMs section with a &quot;Start&quot; button. A &quot;Create&quot; button appears at the top of the sidebar. The main thread shows approximately 9 automated introduction messages from users with Okta IDs (e.g. @okta-u-423438-00001, @okta-u-423438-00002, etc.), all timestamped 2025-11-12Z between 18:50:31 and 18:51:51. Each message follows the format &quot;Hi team! I'm [Name], joining as Employee in general. Key skills: [fictional skill phrases]. Excited to contribute!&quot; All users have red/orange &quot;O&quot; avatar icons." style="max-width: 100%;" /></p> <p>This ability to quickly spin up a useful clone of a subset of Slack helps demonstrate how disruptive this new generation of coding agent tools can be:</p> <blockquote> <p>Creating a high fidelity clone of a significant SaaS application was always possible, but never economically feasible. Generations of engineers may have <em>wanted</em> a full in-memory replica of their CRM to test against, but self-censored the proposal to build it.</p> </blockquote> <p>The <a href="https://factory.strongdm.ai/techniques">techniques page</a> is worth a look too. In addition to the Digital Twin Universe they introduce terms like <strong><a href="https://factory.strongdm.ai/techniques/gene-transfusion">Gene Transfusion</a></strong> for having agents extract patterns from existing systems and reuse them elsewhere, <strong><a href="https://factory.strongdm.ai/techniques/semport">Semports</a></strong> for directly porting code from one language to another and <strong><a href="https://factory.strongdm.ai/techniques/pyramid-summaries">Pyramid Summaries</a></strong> for providing multiple levels of summary such that an agent can enumerate the short ones quickly and zoom in on more detailed information as it is needed.</p> <p>StrongDM AI also released some software - in an appropriately unconventional manner.</p> <p><a href="https://github.com/strongdm/attractor">github.com/strongdm/attractor</a> is <strong>Attractor</strong>, the non-interactive coding agent at the heart of their software factory. Except the repo itself contains no code at all - just three markdown files describing the spec for the software in meticulous detail, and a note in the README that you should feed those specs into your coding agent of choice!</p> <p><a href="https://github.com/strongdm/cxdb">github.com/strongdm/cxdb</a> is a more traditional release, with 16,000 lines of Rust, 9,500 of Go and 6,700 of TypeScript. This is their "AI Context Store" - a system for storing conversation histories and tool outputs in an immutable DAG.</p> <p>It's similar to my LLM tool's <a href="https://llm.datasette.io/en/stable/logging.html#sql-schema">SQLite logging mechanism</a> but a whole lot more sophisticated. 
I may have to gene transfuse some ideas out of this one!</p> <h4 id="a-glimpse-of-the-future-">A glimpse of the future?</h4> <p>I visited the StrongDM AI team back in October as part of a small group of invited guests.</p> <p>The three-person team of Justin McCarthy, Jay Taylor and Navan Chauhan had formed just three months earlier, and they already had working demos of their coding agent harness, their Digital Twin Universe clones of half a dozen services and a swarm of simulated test agents running through scenarios. And this was a month before the Opus 4.5/GPT-5.2 releases that made agentic coding significantly more reliable.</p> <p>It felt like a glimpse of one potential future of software development, where software engineers move from building the code to building and then semi-monitoring the systems that build the code. The Dark Factory.</p> <h4 id="wait-1-000-day-per-engineer-">Wait, $1,000/day per engineer?</h4> <p>I glossed over this detail in my first published version of this post, but it deserves some serious attention.</p> <p>If these patterns really do add $20,000/month per engineer to your budget, they're far less interesting to me. At that point this becomes more of a business model exercise: can you create a profitable enough line of products that you can afford the enormous overhead of developing software in this way?</p> <p>Building sustainable software businesses also looks very different when any competitor can potentially clone your newest features with a few hours of coding agent work.</p> <p>I hope these patterns can be put into play with a much lower spend. I've personally found the $200/month Claude Max plan gives me plenty of space to experiment with different agent patterns, but I'm also not running a swarm of QA testers 24/7!</p> <p>I think there's a lot to learn from StrongDM even for teams and individuals who aren't going to burn thousands of dollars on token costs. I'm particularly invested in the question of what it takes to have agents prove that their code works without needing to review every line of code they produce.</p>
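<p>To make the scenario/satisfaction idea concrete, here's the minimal shape of that loop as I understand it - not StrongDM's actual code, just a sketch where <code>run_scenario()</code> and <code>llm_judge()</code> are hypothetical stand-ins for your own agent harness and your LLM-as-judge call:</p> <pre><code>def satisfaction(scenarios, runs_per_scenario=5):
    # Of all observed trajectories through all scenarios,
    # what fraction of them likely satisfy the user?
    passed = total = 0
    for scenario in scenarios:
        for _ in range(runs_per_scenario):
            # Drive the software end-to-end for this user story
            # (run_scenario is a hypothetical stand-in for an agent harness)
            trajectory = run_scenario(scenario)
            # Have an LLM flexibly validate the trajectory: 1 if satisfied, 0 if not
            # (llm_judge is a hypothetical stand-in for an LLM-as-judge call)
            passed += llm_judge(scenario, trajectory)
            total += 1
    return passed / total
</code></pre> <p>All of the interesting engineering lives inside those two stand-ins, but the metric itself really is that simple: a fraction, not a green checkmark.</p>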
quotation 2016 2026-02-06 23:41:31+00:00 I don't know why this week became the tipping point, but nearly every software engineer I've talked to is experiencing some degree of mental health crisis. [...] Many people assuming I meant job loss anxiety but that's just one presentation. I'm seeing near-manic episodes triggered by watching software shift from scarce to abundant. Compulsive behaviors around agent usage. Dissociative awe at the temporal compression of change. It's not fear necessarily just the cognitive overload from living in an inflection point. - Tom Dale
entry 9119 2026-02-06 22:31:31+00:00 Running Pydantic's Monty Rust sandboxed Python subset in WebAssembly <p>There's a jargon-filled headline for you! Everyone's <a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#1-year-we-re-finally-going-to-solve-sandboxing">building sandboxes</a> for running untrusted code right now, and Pydantic's latest attempt, <a href="https://github.com/pydantic/monty">Monty</a>, provides a custom Python-like language (a subset of Python) in Rust and makes it available as both a Rust library and a Python package. I got it working in WebAssembly, providing a sandbox-in-a-sandbox.</p> <p>Here's <a href="https://github.com/pydantic/monty">how they describe Monty</a>:</p> <blockquote> <p>Monty avoids the cost, latency, complexity and general faff of using full container based sandbox for running LLM generated code.</p> <p>Instead, it let's you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.</p> <p>What Monty <strong>can</strong> do:</p> <ul> <li>Run a reasonable subset of Python code - enough for your agent to express what it wants to do</li> <li>Completely block access to the host environment: filesystem, env variables and network access are all implemented via external function calls the developer can control</li> <li>Call functions on the host - only functions you give it access to [...]</li> </ul> </blockquote> <p>A quick way to try it out is via <a href="https://github.com/astral-sh/uv">uv</a>:</p> <pre><code>uv run --with pydantic-monty python -m asyncio </code></pre> <p>Then paste this into the Python interactive prompt - the <code>-m asyncio</code> enables top-level await:</p> <pre><code>import pydantic_monty

code = pydantic_monty.Monty('print("hello " + str(4 * 5))')
await pydantic_monty.run_monty_async(code)
</code></pre> <p>Monty supports a <em>very</em> small subset of Python - it doesn't even support class declarations yet!</p> <p>But, given its target use-case, that's not actually a problem.</p> <p>The neat thing about providing tools like this for LLMs is that they're really good at iterating against error messages. A coding agent can run some Python code, get an error message telling it that classes aren't supported and then try again with a different approach.</p> <p>I wanted to try this in a browser, so I fired up <a href="https://simonwillison.net/2025/Nov/6/async-code-research/">a code research task</a> in Claude Code for web and kicked it off with the following:</p> <blockquote> <p>Clone <a href="https://github.com/pydantic/monty">https://github.com/pydantic/monty</a> to /tmp and figure out how to compile it into a python WebAssembly wheel that can then be loaded in Pyodide. The wheel file itself should be checked into the repo along with build scripts and passing pytest playwright test scripts that load Pyodide from a CDN and the wheel from a “python -m http.server” localhost and demonstrate it working</p> </blockquote> <p>Then a little later:</p> <blockquote> <p>I want an additional WASM file that works independently of Pyodide, which is also usable in a web browser - build that too along with playwright tests that show it working. 
Also build two HTML files - one called demo.html and one called pyodide-demo.html - these should work similar to <a href="https://tools.simonwillison.net/micropython">https://tools.simonwillison.net/micropython</a> (download that code with curl to inspect it) - one should load the WASM build, the other should load Pyodide and have it use the WASM wheel. These will be served by GitHub Pages so they can load the WASM and wheel from a relative path since the .html files will be served from the same folder as the wheel and WASM file</p> </blockquote> <p>Here's <a href="https://gisthost.github.io/?22d88e6367d7e002c4fb383c213c2df2/page-001.html">the transcript</a>, and the <a href="https://github.com/simonw/research/tree/main/monty-wasm-pyodide">final research report</a> it produced.</p> <p>I now have the Monty Rust code compiled to WebAssembly in two different shapes - as a <code>.wasm</code> bundle you can load and call from JavaScript, and as a <code>monty-wasm-pyodide/pydantic_monty-0.0.3-cp313-cp313-emscripten_4_0_9_wasm32.whl</code> wheel file which can be loaded into <a href="https://pyodide.org/">Pyodide</a> and then called from Python in Pyodide in WebAssembly in a browser.</p> <p>Here are those two demos, hosted on GitHub Pages:</p> <ul> <li> <a href="https://simonw.github.io/research/monty-wasm-pyodide/demo.html">Monty WASM demo</a> - a UI over JavaScript that loads the Rust WASM module directly.</li> <li> <a href="https://simonw.github.io/research/monty-wasm-pyodide/pyodide-demo.html">Monty Pyodide demo</a> - this one provides an identical interface but here the code is <a href="https://github.com/simonw/research/blob/3add1ffec70b530711fa237d91f546da5bcf1f1c/monty-wasm-pyodide/pyodide-demo.html#L257-L280">loading Pyodide and then installing the Monty WASM wheel</a>.</li> </ul> <p><img src="https://static.simonwillison.net/static/2026/monty-pyodide.jpg" alt="Screenshot of a web app titled &quot;Monty via Pyodide&quot; with description &quot;Run Monty (a sandboxed Python interpreter by Pydantic) inside Pyodide (CPython compiled to WebAssembly). This loads the pydantic-monty wheel and uses its full Python API. Code is saved in the URL for sharing.&quot; A green banner reads &quot;Code executed successfully!&quot; Below are example buttons labeled &quot;Basic&quot;, &quot;Inputs&quot;, &quot;Reuse&quot;, &quot;Error Handling&quot;, &quot;Fibonacci&quot;, and &quot;Classes&quot;. A code editor labeled &quot;Python Code (runs inside Monty sandbox via Pyodide):&quot; contains: &quot;import pydantic_monty\n\n# Create interpreter with input variables\nm = pydantic_monty.Monty('x + y', inputs=['x', 'y'])\n\n# Run with different inputs\nresult1 = m.run(inputs={&quot;x&quot;: 10, &quot;y&quot;: 20})\nprint(f&quot;10 + 20 = {result1}&quot;)\n\nresult2 = m.run(inputs={&quot;x&quot;: 100, &quot;y&quot;: 200})&quot; with &quot;Run Code&quot; and &quot;Clear&quot; buttons. The Output section shows &quot;10 + 20 = 30&quot; and &quot;100 + 200 = 300&quot; with a &quot;Copy&quot; button. Footer reads &quot;Executed in 4.0ms&quot;." style="max-width: 100%;" /></p> <p>As a connoisseur of sandboxes - the more options the better! - this new entry from Pydantic ticks a lot of my boxes. 
It's small, fast, widely available (thanks to Rust and WebAssembly) and provides strict limits on memory usage, CPU time and access to disk and network.</p> <p>It was also a great excuse to spin up another demo showing how easy it is these days to turn compiled code like C or Rust into WebAssembly that runs in both a browser and a Pyodide environment.</p>
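<p>If you want to poke at the wheel from your own Pyodide session rather than through my demo pages, something like this should work in the Pyodide console - a sketch I haven't verified, and it assumes the wheel stays published at that GitHub Pages path and that your Pyodide build is compatible with it:</p> <pre><code>import micropip

# Install the Monty wheel directly from its GitHub Pages URL
# (URL inferred from the demo pages - adjust if it moves)
await micropip.install(
    "https://simonw.github.io/research/monty-wasm-pyodide/"
    "pydantic_monty-0.0.3-cp313-cp313-emscripten_4_0_9_wasm32.whl"
)

import pydantic_monty

# Same pattern as the Pyodide demo: declare inputs, then run with values
m = pydantic_monty.Monty('x + y', inputs=['x', 'y'])
print(m.run(inputs={"x": 10, "y": 20}))
</code></pre>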
blogmark 9274 2026-02-06 18:44:21+00:00 An Update on Heroku - An ominous headline to see on the official Heroku blog and yes, it's bad news. > Today, Heroku is transitioning to a sustaining engineering model focused on stability, security, reliability, and support. Heroku remains an actively supported, production-ready platform, with an emphasis on maintaining quality and operational excellence rather than introducing new features. We know changes like this can raise questions, and we want to be clear about what this means for customers. Based on context I'm guessing a "sustaining engineering model" (this definitely isn't a widely used industry term) means that they'll keep the lights on and that's it. This is a very frustrating piece of corporate communication. "We want to be clear about what this means for customers" - then proceeds to *not be clear* about what this means for customers. Why are they doing this? Here's their explanation: > We’re focusing our product and engineering investments on areas where we can deliver the greatest long-term customer value, including helping organizations build and deploy enterprise-grade AI in a secure and trusted way. My blog is the only project I have left running on Heroku. I guess I'd better migrate it away (probably to Fly) before Salesforce lose interest completely.