Recent
Oct. 15, 2025
While Sonnet 4.5 remains the default [in Claude Code], Haiku 4.5 now powers the Explore subagent which can rapidly gather context on your codebase to build apps even faster.
You can select Haiku 4.5 to be your default model in /model. When selected, you’ll automatically use Sonnet 4.5 in Plan mode and Haiku 4.5 for execution for smarter plans and faster results.
— Catherine Wu, Claude Code PM, Anthropic
Introducing Claude Haiku 4.5 (via) Anthropic released Claude Haiku 4.5 today, the cheapest member of the Claude 4.5 family that started with Sonnet 4.5 a couple of weeks ago.
It's priced at $1/million input tokens and $5/million output tokens, slightly more expensive than Haiku 3.5 ($0.80/$4) and a lot more expensive than the original Claude 3 Haiku ($0.25/$1.25), both of which remain available at those prices.
It's a third of the price of Sonnet 4 and Sonnet 4.5 (both $3/$15) which is notable because Anthropic's benchmarks put it in a similar space to that older Sonnet 4 model. As they put it:
What was recently at the frontier is now cheaper and faster. Five months ago, Claude Sonnet 4 was a state-of-the-art model. Today, Claude Haiku 4.5 gives you similar levels of coding performance but at one-third the cost and more than twice the speed.
I've been hoping to see Anthropic release a fast, inexpensive model that's price competitive with the cheapest models from OpenAI and Gemini, currently $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite). Haiku 4.5 certainly isn't that, it looks like they're continuing to focus squarely on the "great at code" part of the market.
The new Haiku is the first Haiku model to support reasoning. It sports a 200,000 token context window, 64,000 maximum output (up from just 8,192 for Haiku 3.5) and a "reliable knowledge cutoff" of February 2025, one month later than the January 2025 date for Sonnet 4 and 4.5 and Opus 4 and 4.1.
Something that caught my eye in the accompanying system card was this note about context length:
For Claude Haiku 4.5, we trained the model to be explicitly context-aware, with precise information about how much context-window has been used. This has two effects: the model learns when and how to wrap up its answer when the limit is approaching, and the model learns to continue reasoning more persistently when the limit is further away. We found this intervention—along with others—to be effective at limiting agentic “laziness” (the phenomenon where models stop working on a problem prematurely, give incomplete answers, or cut corners on tasks).
I've added the new price to llm-prices.com, released llm-anthropic 0.20 with the new model and updated my Haiku-from-your-webcam demo (source) to use Haiku 4.5 as well.
Here's llm -m claude-haiku-4.5 'Generate an SVG of a pelican riding a bicycle'
(transcript).
18 input tokens and 1513 output tokens = 0.7583 cents.
Previous system cards have reported results on an expanded version of our earlier agentic misalignment evaluation suite: three families of exotic scenarios meant to elicit the model to commit blackmail, attempt a murder, and frame someone for financial crimes. We choose not to report full results here because, similarly to Claude Sonnet 4.5, Claude Haiku 4.5 showed many clear examples of verbalized evaluation awareness on all three of the scenarios tested in this suite. Since the suite only consisted of many similar variants of three core scenarios, we expect that the model maintained high unverbalized awareness across the board, and we do not trust it to be representative of behavior in the real extreme situations the suite is meant to emulate.
A modern approach to preventing CSRF in Go
(via)
Alex Edwards writes about the new http.CrossOriginProtection
middleware that was added to the Go standard library in version 1.25 in August and asks:
Have we finally reached the point where CSRF attacks can be prevented without relying on a token-based check (like double-submit cookies)?
It looks like the answer might be yes, which is extremely exciting. I've been tracking CSRF since I first learned about it 20 years ago in May 2005 and a cleaner solution than those janky hidden form fields would be very welcome.
The code for the new Go middleware lives in src/net/http/csrf.go. It works using the Sec-Fetch-Site HTTP header, which Can I Use shows as having 94.18% global availability - the holdouts are mainly IE11, iOS versions prior to iOS 17 (which came out in 2023 but can be installed on any phone released since 2017) and some other ancient browser versions.
If Sec-Fetch-Site
is same-origin
or none
then the page submitting the form was either on the same origin or was navigated to directly by the user - in both cases safe from CSRF. If it's cross-site
or same-site
(tools.simonwillison.net
and til.simonwillison.net
are considered same-site
but not same-origin
) the submission is denied.
If that header isn't available the middleware falls back on comparing other headers: Origin
- a value like https://simonwillison.net
- with Host
, a value like simonwillison.net
. This should cover the tiny fraction of browsers that don't have the new header, though it's not clear to me if there are any weird edge-cases beyond that.
Note that this fallback comparison can't take the scheme into account since Host
doesn't list that, so administrators are encouraged to use HSTS to protect against HTTP to HTTPS cross-origin requests.
On Lobste.rs I questioned if this would work for localhost
, since that normally isn't served using HTTPS. Firefox security engineer Frederik Braun reassured me that *.localhost
is treated as a Secure Context, so gets the Sec-Fetch-Site
header despite not being served via HTTPS.
Update: Also relevant is Filippo Valsorda's article in CSRF which includes detailed research conducted as part of building the new Go middleware, plus this related Bluesky conversation about that research from six months ago.
Oct. 14, 2025
NVIDIA DGX Spark: great hardware, early days for the ecosystem
NVIDIA sent me a preview unit of their new DGX Spark desktop “AI supercomputer”. I’ve never had hardware to review before! You can consider this my first ever sponsored post if you like, but they did not pay me any cash and aside from an embargo date they did not request (nor would I grant) any editorial input into what I write about the device.
[... 1,846 words]Just Talk To It—the no-bs Way of Agentic Engineering. Peter Steinberger's long, detailed description of his current process for using Codex CLI and GPT-5 Codex. This is information dense and full of actionable tips, plus plenty of strong opinions about the differences between Claude 4.5 an GPT-5:
While Claude reacts well to 🚨 SCREAMING ALL-CAPS 🚨 commands that threaten it that it will imply ultimate failure and 100 kittens will die if it runs command X, that freaks out GPT-5. (Rightfully so). So drop all of that and just use words like a human.
Peter is a heavy user of parallel agents:
I've completely moved to
codex
cli as daily driver. I run between 3-8 in parallel in a 3x3 terminal grid, most of them in the same folder, some experiments go in separate folders. I experimented with worktrees, PRs but always revert back to this setup as it gets stuff done the fastest.
He shares my preference for CLI utilities over MCPs:
I can just refer to a cli by name. I don't need any explanation in my agents file. The agent will try $randomcrap on the first call, the cli will present the help menu, context now has full info how this works and from now on we good. I don't have to pay a price for any tools, unlike MCPs which are a constant cost and garbage in my context. Use GitHub's MCP and see 23k tokens gone. Heck, they did make it better because it was almost 50.000 tokens when it first launched. Or use the
gh
cli which has basically the same feature set, models already know how to use it, and pay zero context tax.
It's worth reading the section on why he abandoned spec driven development in full.
Oct. 13, 2025
nanochat (via) Really interesting new project from Andrej Karpathy, described at length in this discussion post.
It provides a full ChatGPT-style LLM, including training, inference and a web Ui, that can be trained for as little as $100:
This repo is a full-stack implementation of an LLM like ChatGPT in a single, clean, minimal, hackable, dependency-lite codebase.
It's around 8,000 lines of code, mostly Python (using PyTorch) plus a little bit of Rust for training the tokenizer.
Andrej suggests renting a 8XH100 NVIDA node for around $24/ hour to train the model. 4 hours (~$100) is enough to get a model that can hold a conversation - almost coherent example here. Run it for 12 hours and you get something that slightly outperforms GPT-2. I'm looking forward to hearing results from longer training runs!
The resulting model is ~561M parameters, so it should run on almost anything. I've run a 4B model on my iPhone, 561M should easily fit on even an inexpensive Raspberry Pi.
The model defaults to training on ~24GB from karpathy/fineweb-edu-100b-shuffle derived from FineWeb-Edu, and then midtrains on 568K examples from SmolTalk (460K), MMLU auxiliary train (100K), and GSM8K (8K), followed by supervised finetuning on 21.4K examples from ARC-Easy (2.3K), ARC-Challenge (1.1K), GSM8K (8K), and SmolTalk (10K).
Here's the code for the web server, which is fronted by this pleasantly succinct vanilla JavaScript HTML+JavaScript frontend.
Update: Sam Dobson pushed a build of the model to sdobson/nanochat on Hugging Face. It's designed to run on CUDA but I pointed Claude Code at a checkout and had it hack around until it figured out how to run it on CPU on macOS, which eventually resulted in this script which I've published as a Gist. You should be able to try out the model using uv like this:
cd /tmp
git clone https://huggingface.co/sdobson/nanochat
uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
--model-dir /tmp/nanochat \
--prompt "Tell me about dogs."
I got this (truncated because it ran out of tokens):
I'm delighted to share my passion for dogs with you. As a veterinary doctor, I've had the privilege of helping many pet owners care for their furry friends. There's something special about training, about being a part of their lives, and about seeing their faces light up when they see their favorite treats or toys.
I've had the chance to work with over 1,000 dogs, and I must say, it's a rewarding experience. The bond between owner and pet
Oct. 12, 2025
Slashdot: What's the reason OneDrive tells users this setting can only be turned off 3 times a year? (And are those any three times — or does that mean three specific days, like Christmas, New Year's Day, etc.)
[Microsoft's publicist chose not to answer this question.]
— Slashdot, asking the obvious question
Oct. 11, 2025
Claude Code includes the ability to run sub-agents, where a separate agent loop with a fresh token context is dispatched to achieve a goal and report back when it's done. I wrote a bit about how these work in June when I traced Claude Code's activity by intercepting its API calls.
I recently learned that you can deliberately trigger Claude Code to use sub-agents just by telling it to do so!
Here's an example. I ran this using Claude Code against a fresh checkout of simonw/datasette (commit ec38ad37689f3c14d307770977a17aed5efc5cb9
), with no additional custom instructions:
Write markdown documentation for the context passed to each of the templates in this project. Use sub-agents.
Here's a snippet of the output:
⏺ I'll help you write markdown documentation for the context passed to each template in this project. Let me start by exploring the codebase to understand the template structure and then use sub-agents to document them comprehensively.
⏺ Search(pattern: "**/*.html") ⎿ Found 47 files (ctrl+o to expand) ⏺ Search(pattern: "render.*template|render_template") ⎿ Found 16 files (ctrl+o to expand)
[...]
⏺ Now let me launch sub-agents to analyze different template contexts in parallel:
⏺ Task(Document index template context) ⎿ Done (12 tool uses · 86.4k tokens · 1m 13s) ⏺ Task(Document database template context) ⎿ Done (15 tool uses · 55.4k tokens · 1m 41s) ⏺ Task(Document table template context) ⎿ Done (14 tool uses · 85.2k tokens · 1m 44s) ⏺ Task(Document query/row template contexts) ⎿ Done (17 tool uses · 78.7k tokens · 1m 55s) ⏺ Task(Document special template contexts) ⎿ Done (26 tool uses · 116.1k tokens · 1m 50s)
It churned away for a couple of minutes (those sub-agents execute in parallel, another benefit of this pattern) and made me this template_contexts.md markdown file. It's very comprehensive.
Vibing a Non-Trivial Ghostty Feature (via) Mitchell Hashimoto provides a comprehensive answer to the frequent demand for a detailed description of shipping a non-trivial production feature to an existing project using AI-assistance. In this case it's a slick unobtrusive auto-update UI for his Ghostty terminal emulator, written in Swift.
Mitchell shares full transcripts of the 16 coding sessions he carried out using Amp Code across 2 days and around 8 hours of computer time, at a token cost of $15.98.
Amp has the nicest shared transcript feature of any of the coding agent tools, as seen in this example. I'd love to see Claude Code and Codex CLI and Gemini CLI and friends imitate this.
There are plenty of useful tips in here. I like this note about the importance of a cleanup step:
The cleanup step is really important. To cleanup effectively you have to have a pretty good understanding of the code, so this forces me to not blindly accept AI-written code. Subsequently, better organized and documented code helps future agentic sessions perform better.
I sometimes tongue-in-cheek refer to this as the "anti-slop session".
And this on how sometimes you can write manual code in a way that puts the agent the right track:
I spent some time manually restructured the view model. This involved switching to a tagged union rather than the struct with a bunch of optionals. I renamed some types, moved stuff around.
I knew from experience that this small bit of manual work in the middle would set the agents up for success in future sessions for both the frontend and backend. After completing it, I continued with a marathon set of cleanup sessions.
Here's one of those refactoring prompts:
Turn each @macos/Sources/Features/Update/UpdatePopoverView.swift case into a dedicated fileprivate Swift view that takes the typed value as its parameter so that we can remove the guards.
Mitchell advises ending every session with a prompt like this one, asking the agent about any obvious omissions:
Are there any other improvements you can see to be made with the @macos/Sources/Features/Update feature? Don't write any code. Consult the oracle. Consider parts of the code that can also get more unit tests added.
("Consult the oracle" is an Amp-specific pattern for running a task through a more expensive, more capable model.)
Is this all worthwhile? Mitchell thinks so:
Many people on the internet argue whether AI enables you to work faster or not. In this case, I think I shipped this faster than I would have if I had done it all myself, in particular because iterating on minor SwiftUI styling is so tedious and time consuming for me personally and AI does it so well.
I think the faster/slower argument for me personally is missing the thing I like the most: the AI can work for me while I step away to do other things.
Here's the resulting PR, which touches 21 files.
I'm beginning to suspect that a key skill in working effectively with coding agents is developing an intuition for when you don't need to closely review every line of code they produce. This feels deeply uncomfortable!
An MVCC-like columnar table on S3 with constant-time deletes (via) s3's support for conditional writes (previously) makes it an interesting, scalable and often inexpensive platform for all kinds of database patterns.
Shayon Mukherjee presents an ingenious design for a Parquet-backed database in S3 which accepts concurrent writes, presents a single atomic view for readers and even supports reliable row deletion despite Parquet requiring a complete file rewrite in order to remove data.
The key to the design is a _latest_manifest
JSON file at the top of the bucket, containing an integer version number. Clients use compare-and-swap to increment that version - only one client can succeed at this, so the incremented version they get back is guaranteed unique to them.
Having reserved a version number the client can write a unique manifest file for that version - manifest/v00000123.json
- with a more complex data structure referencing the current versions of every persisted file, including the one they just uploaded.
Deleted rows are written to tombstone files as either a list of primary keys or a list of of ranges. Clients consult these when executing reads, filtering out deleted rows as part of resolving a query.
The pricing estimates are especially noteworthy:
For a workload ingesting 6 TB/day with 2 TB of deletes and 50K queries/day:
- PUT requests: ~380K/day (≈4 req/s) = $1.88/day
- GET requests: highly variable, depends on partitioning effectiveness
- Best case (good time-based partitioning): ~100K-200K/day = $0.04-$0.08/day
- Worst case (poor partitioning, scanning many files): ~2M/day = $0.80/day
~$3/day for ingesting 6TB of data is pretty fantastic!
Watch out for storage costs though - each new TB of data at $0.023/GB/month adds $23.55 to the ongoing monthly bill.
Oct. 10, 2025
simonw/claude-skills. One of the tips I picked up from Jesse Vincent's Claude Code Superpowers post (previously) was this:
Skills are what give your agents Superpowers.
The first time they really popped up on my radar was a few weeks ago when Anthropic rolled out improved Office document creation. When the feature rolled out, I went poking around a bit -- I asked Claude to tell me all about its new skills. And it was only too happy to dish.
In that transcript Jesse asks how the skill system works and Claude replies:
Looking at the skills system, here's how it works:
Current Skills (in
/mnt/skills/public/
):
- docx - Word document handling
- pdf - PDF processing and form filling
- pptx - PowerPoint presentations
- xlsx - Excel spreadsheets
This looks like it's part of Claude's brand new Code Interpreter feature! I wrote about that extensively last month, but I missed that there was a /mnt/skills/public/
folder full of fascinating implementation details.
So I fired up a fresh Claude instance (fun fact: Code Interpreter also works in the Claude iOS app now, which it didn't when they first launched) and prompted:
Create a zip file of everything in your /mnt/skills folder
This worked, and gave me a .zip
to download. You can run the prompt yourself here, though you'll need to enable the new feature first.
I've pushed the contents of that zip to my new simonw/claude-skills GitHub repo.
So now you can see the prompts Anthropic wrote to enable the creation and manipulation of the following files in their Claude consumer applications:
In each case the prompts spell out detailed instructions for manipulating those file types using Python, using libraries that come pre-installed on Claude's containers.
Skills are more than just prompts though: the repository also includes dozens of pre-written Python scripts for performing common operations.
pdf/scripts/fill_fillable_fields.py for example is a custom CLI tool that uses pypdf to find and then fill in a bunch of PDF form fields, specified as JSON, then render out the resulting combined PDF.
This is a really sophisticated set of tools for document manipulation, and I love that Anthropic have made those visible - presumably deliberately - to users of Claude who know how to ask for them.
Superpowers: How I’m using coding agents in October 2025. A follow-up to Jesse Vincent's post about September, but this is a really significant piece in its own right.
Jesse is one of the most creative users of coding agents (Claude Code in particular) that I know. He's put a great amount of work into evolving an effective process for working with them, encourage red/green TDD (watch the test fail first), planning steps, self-updating memory notes and even implementing a feelings journal ("I feel engaged and curious about this project" - Claude).
Claude Code just launched plugins, and Jesse is celebrating by wrapping up a whole host of his accumulated tricks as a new plugin called Superpowers. You can add it to your Claude Code like this:
/plugin marketplace add obra/superpowers-marketplace
/plugin install superpowers@superpowers-marketplace
There's a lot in here! It's worth spending some time browsing the repository - here's just one fun example, in skills/debugging/root-cause-tracing/SKILL.md:
--- name: Root Cause Tracing description: Systematically trace bugs backward through call stack to find original trigger when_to_use: Bug appears deep in call stack but you need to find where it originates version: 1.0.0 languages: all ---
Overview
Bugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.
Core principle: Trace backward through the call chain until you find the original trigger, then fix at the source.
When to Use
digraph when_to_use { "Bug appears deep in stack?" [shape=diamond]; "Can trace backwards?" [shape=diamond]; "Fix at symptom point" [shape=box]; "Trace to original trigger" [shape=box]; "BETTER: Also add defense-in-depth" [shape=box]; "Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"]; "Can trace backwards?" -> "Trace to original trigger" [label="yes"]; "Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"]; "Trace to original trigger" -> "BETTER: Also add defense-in-depth"; }
[...]
This one is particularly fun because it then includes a Graphviz DOT graph illustrating the process - it turns out Claude can interpret those as workflow instructions just fine, and Jesse has been wildly experimenting with them.
I vibe-coded up a quick URL-based DOT visualizer, here's that one rendered:
There is so much to learn about putting these tools to work in the most effective way possible. Jesse is way ahead of the curve, so it's absolutely worth spending some time exploring what he's shared so far.
And if you're worried about filling up your context with a bunch of extra stuff, here's a reassuring note from Jesse:
The core of it is VERY token light. It pulls in one doc of fewer than 2k tokens. As it needs bits of the process, it runs a shell script to search for them. The long end to end chat for the planning and implementation process for that todo list app was 100k tokens.
It uses subagents to manage token-heavy stuff, including all the actual implementation.
(Jesse's post also tipped me off about Claude's /mnt/skills/public
folder, see my notes here.)
A Retrospective Survey of 2024/2025 Open Source Supply Chain Compromises (via) Filippo Valsorda surveyed 18 incidents from the past year of open source supply chain attacks, where package updates were infected with malware thanks to a compromise of the project itself.
These are important lessons:
I have the growing impression that software supply chain compromises have a few predominant causes which we might have a responsibility as a professional open source maintainers to robustly mitigate.
To test this impression and figure out any such mitigations, I collected all 2024/2025 open source supply chain compromises I could find, and categorized their root cause.
This is a fascinating piece of research. 5 were the result of phishing (maintainers should use passkeys/WebAuthn!), ~5 were stolen long-lived credentials, 3 were "control handoff" where a maintainer gave project access to someone who later turned out to be untrustworthy, 4 were caused by GitHub Actions workflows that triggered on pull requests or issue comments in a way that could leak credentials, and one (MavenGate) was caused by an expired domain being resurrected.
Video of GPT-OSS 20B running on a phone. GPT-OSS 20B is a very good model. At launch OpenAI claimed:
The gpt-oss-20b model delivers similar results to OpenAI o3‑mini on common benchmarks and can run on edge devices with just 16 GB of memory
Nexa AI just posted a video on Twitter demonstrating exactly that: the full GPT-OSS 20B running on a Snapdragon Gen 5 phone in their Nexa Studio Android app. It requires at least 16GB of RAM, and benefits from Snapdragon using a similar trick to Apple Silicon where the system RAM is available to both the CPU and the GPU.
The latest iPhone 17 Pro Max is still stuck at 12GB of RAM, presumably not enough to run this same model.
Oct. 9, 2025
I get a feeling that working with multiple AI agents is something that comes VERY natural to most senior+ engineers or tech lead who worked at a large company
You already got used to overseeing parallel work (the goto code reviewer!) + making progress with small chunks of work... because your day has been a series of nonstop interactions, so you had to figure out how to do deep work in small chunks that could have been interrupted
TIL: Testing different Python versions with uv with-editable and uv-test.
While tinkering with upgrading various projects to handle Python 3.14 I finally figured out a universal uv
recipe for running the tests for the current project in any specified version of Python:
uv run --python 3.14 --isolated --with-editable '.[test]' pytest
This should work in any directory with a pyproject.toml
(or even a setup.py
) that defines a test
set of extra dependencies and uses pytest
.
The --with-editable '.[test]'
bit ensures that changes you make to that directory will be picked up by future test runs. The --isolated
flag ensures no other environments will affect your test run.
I like this pattern so much I built a little shell script that uses it, shown here. Now I can change to any Python project directory and run:
uv-test
Or for a different Python version:
uv-test -p 3.11
I can pass additional pytest
options too:
uv-test -p 3.11 -k permissions
Oct. 8, 2025
Claude can write complete Datasette plugins now
This isn’t necessarily surprising, but it’s worth noting anyway. Claude Sonnet 4.5 is capable of building a full Datasette plugin now.
[... 1,296 words]Python 3.14 Is Here. How Fast Is It?
(via)
Miguel Grinberg uses some basic benchmarks (like fib(40)
) to test the new Python 3.14 on Linux and macOS and finds some substantial speedups over Python 3.13 - around 27% faster.
The optional JIT didn't make a meaningful difference to his benchmarks. On a threaded benchmark he got 3.09x speedup with 4 threads using the free threading build - for Python 3.13 the free threading build only provided a 2.2x improvement.
The cognitive debt of LLM-laden coding extends beyond disengagement of our craft. We’ve all heard the stories. Hyped up, vibed up, slop-jockeys with attention spans shorter than the framework-hopping JavaScript devs of the early 2010s, sling their sludge in pull requests and design docs, discouraging collaboration and disrupting teams. Code reviewing coworkers are rapidly losing their minds as they come to the crushing realization that they are now the first layer of quality control instead of one of the last. Asked to review; forced to pick apart. Calling out freshly added functions that are never called, hallucinated library additions, and obvious runtime or compilation errors. All while the author—who clearly only skimmed their “own” code—is taking no responsibility, going “whoopsie, Claude wrote that. Silly AI, ha-ha.”
— Simon Højberg, The Programmer Identity Crisis
Why NetNewsWire Is Not a Web App. In the wake of Apple removing ICEBlock from the App Store, Brent Simmons talks about why he still thinks his veteran (and actively maintained) NetNewsWire feed reader app should remain a native application.
Part of the reason is cost - NetNewsWire is free these days (MIT licensed in fact) and the cost to Brent is an annual Apple developer subscription:
If it were a web app instead, I could drop the developer membership, but I’d have to pay way more money for web and database hosting. [...] I could charge for NetNewsWire, but that would go against my political goal of making sure there’s a good and free RSS reader available to everyone.
A bigger reason is around privacy and protecting users:
Second issue. Right now, if law enforcement comes to me and demands I turn over a given user’s subscriptions list, I can’t. Literally can’t. I don’t have an encrypted version, even — I have nothing at all. The list lives on their machine (iOS or macOS).
And finally it's about the principle of what a personal computing device should mean:
My computer is not a terminal. It’s a world I get to control, and I can use — and, especially, make — whatever I want. I’m not stuck using just what’s provided to me on some other machines elsewhere: I’m not dialing into a mainframe or doing the modern equivalent of using only websites that other people control.
Python 3.14. This year's major Python version, Python 3.14, just made its first stable release!
As usual the what's new in Python 3.14 document is the best place to get familiar with the new release:
The biggest changes include template string literals, deferred evaluation of annotations, and support for subinterpreters in the standard library.
The library changes include significantly improved capabilities for introspection in asyncio, support for Zstandard via a new compression.zstd module, syntax highlighting in the REPL, as well as the usual deprecations and removals, and improvements in user-friendliness and correctness.
Subinterpreters look particularly interesting as a way to use multiple CPU cores to run Python code despite the continued existence of the GIL. If you're feeling brave and your dependencies cooperate you can also use the free-threaded build of Python 3.14 - now officially supported - to skip the GIL entirely.
A new major Python release means an older release hits the end of its support lifecycle - in this case that's Python 3.9. If you maintain open source libraries that target every supported Python versions (as I do) this means features introduced in Python 3.10 can now be depended on! What's new in Python 3.10 lists those - I'm most excited by structured pattern matching (the match/case
statement) and the union type operator, allowing int | float | None
as a type annotation in place of Optional[Union[int, float]]
.
If you use uv
you can grab a copy of 3.14 using:
uv self update
uv python upgrade 3.14
uvx python@3.14
Or for free-threaded Python 3.1;:
uvx python@3.14t
The uv
team wrote about their Python 3.14 highlights in their announcement of Python 3.14's availability via uv
.
The GitHub Actions setup-python action includes Python 3.14 now too, so the following YAML snippet in will run tests on all currently supported versions:
strategy:
matrix:
python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
steps:
- uses: actions/setup-python@v6
with:
python-version: ${{ matrix.python-version }}
Full example here for one of my many Datasette plugin repos.
Oct. 7, 2025
Google released a new Gemini 2.5 Computer Use model today, specially designed to help operate a GUI interface by interacting with visible elements using a virtual mouse and keyboard.
I tried the demo hosted by Browserbase at gemini.browserbase.com and was delighted and slightly horrified when it appeared to kick things off by first navigating to Google.com and solving their CAPTCHA in order to run a search!
I wrote a post about it and included this screenshot, but then learned that Browserbase itself has CAPTCHA solving built in and, as shown in this longer video, it was Browserbase that solved the CAPTCHA even while Gemini was thinking about doing so itself.
I deeply regret this error. I've deleted various social media posts about the original entry and linked back to this retraction instead.
For quite some I wanted to write a small static image gallery so I can share my pictures with friends and family. Of course there are a gazillion tools like this, but, well, sometimes I just want to roll my own. [...]
I used the old, well tested technique I call brain coding, where you start with an empty vim buffer and type some code (Perl, HTML, CSS) until you're happy with the result. It helps to think a bit (aka use your brain) during this process.
— Thomas Klausner, coining "brain coding"
Vibe engineering
I feel like vibe coding is pretty well established now as covering the fast, loose and irresponsible way of building software with AI—entirely prompt-driven, and with no attention paid to how the code actually works. This leaves us with a terminology gap: what should we call the other end of the spectrum, where seasoned professionals accelerate their work with LLMs while staying proudly and confidently accountable for the software they produce?
[... 1,313 words]Oct. 6, 2025
Deloitte to pay money back to Albanese government after using AI in $440,000 report. Ouch:
Deloitte will provide a partial refund to the federal government over a $440,000 report that contained several errors, after admitting it used generative artificial intelligence to help produce it.
(I was initially confused by the "Albanese government" reference in the headline since this is a story about the Australian federal government. That's because the current Australia Prime Minister is Anthony Albanese.)
Here's the page for the report. The PDF now includes this note:
This Report was updated on 26 September 2025 and replaces the Report dated 4 July 2025. The Report has been updated to correct those citations and reference list entries which contained errors in the previously issued version, to amend the summary of the Amato proceeding which contained errors, and to make revisions to improve clarity and readability. The updates made in no way impact or affect the substantive content, findings and recommendations in the Report.
I've settled on agents as meaning "LLMs calling tools in a loop to achieve a goal" but OpenAI continue to muddy the waters with much more vague definitions. Swyx spotted this one in the press pack OpenAI sent out for their DevDay announcements today:
How does OpenAl define an "agent"? An Al agent is a system that can do work independently on behalf of the user.
Adding this one to my collection.
gpt-image-1-mini.
OpenAI released a new image model today: gpt-image-1-mini
, which they describe as "A smaller image generation model that’s 80% less expensive than the large model."
They released it very quietly - I didn't hear about this in the DevDay keynote but I later spotted it on the DevDay 2025 announcements page.
It wasn't instantly obvious to me how to use this via their API. I ended up vibe coding a Python CLI tool for it so I could try it out.
I dumped the plain text diff version of the commit to the OpenAI Python library titled feat(api): dev day 2025 launches into ChatGPT GPT-5 Thinking and worked with it to figure out how to use the new image model and build a script for it. Here's the transcript and the the openai_image.py script it wrote.
I had it add inline script dependencies, so you can run it with uv
like this:
export OPENAI_API_KEY="$(llm keys get openai)"
uv run https://tools.simonwillison.net/python/openai_image.py "A pelican riding a bicycle"
It picked this illustration style without me specifying it:
(This is a very different test from my normal "Generate an SVG of a pelican riding a bicycle" since it's using a dedicated image generator, not having a text-based model try to generate SVG code.)
My tool accepts a prompt, and optionally a filename (if you don't provide one it saves to a filename like /tmp/image-621b29.png
).
It also accepts options for model and dimensions and output quality - the --help
output lists those, you can see that here.
OpenAI's pricing is a little confusing. The model page claims low quality images should cost around half a cent and medium quality around a cent and a half. It also lists an image token price of $8/million tokens. It turns out there's a default "high" quality setting - most of the images I've generated have reported between 4,000 and 6,000 output tokens, which costs between 3.2 and 4.8 cents.
One last demo, this time using --quality low
:
uv run https://tools.simonwillison.net/python/openai_image.py \
'racoon eating cheese wearing a top hat, realistic photo' \
/tmp/racoon-hat-photo.jpg \
--size 1024x1024 \
--output-format jpeg \
--quality low
This saved the following:
And reported this to standard error:
{
"background": "opaque",
"created": 1759790912,
"generation_time_in_s": 20.87331541599997,
"output_format": "jpeg",
"quality": "low",
"size": "1024x1024",
"usage": {
"input_tokens": 17,
"input_tokens_details": {
"image_tokens": 0,
"text_tokens": 17
},
"output_tokens": 272,
"total_tokens": 289
}
}
This took 21s, but I'm on an unreliable conference WiFi connection so I don't trust that measurement very much.
272 output tokens = 0.2 cents so this is much closer to the expected pricing from the model page.
GPT-5 pro. Here's OpenAI's model documentation for their GPT-5 pro model, released to their API today at their DevDay event.
It has similar base characteristics to GPT-5: both share a September 30, 2024 knowledge cutoff and 400,000 context limit.
GPT-5 pro has maximum output tokens 272,000 max, an increase from 128,000 for GPT-5.
As our most advanced reasoning model, GPT-5 pro defaults to (and only supports)
reasoning.effort: high
It's only available via OpenAI's Responses API. My LLM tool doesn't support that in core yet, but the llm-openai-plugin plugin does. I released llm-openai-plugin 0.7 adding support for the new model, then ran this:
llm install -U llm-openai-plugin
llm -m openai/gpt-5-pro "Generate an SVG of a pelican riding a bicycle"
It's very, very slow. The model took 6 minutes 8 seconds to respond and charged me for 16 input and 9,205 output tokens. At $15/million input and $120/million output this pelican cost me $1.10!
Here's the full transcript. It looks visually pretty simpler to the much, much cheaper result I got from GPT-5.