Claude Opus 4.1. Surprise new model from Anthropic today - Claude Opus 4.1, which they describe as "a drop-in replacement for Opus 4".
My favorite thing about this model is the version number - treating this as a .1 increment looks like an accurate reflection of the model's capabilities.
Anthropic's own benchmarks show very small incremental gains.
Comparing Opus 4 and Opus 4.1 (I had 4.1 extract this information from a screenshot of Anthropic's own benchmark scores, then asked it to look up the links; I verified the links myself and fixed a few):
- Agentic coding (SWE-bench Verified): From 72.5% to 74.5%
- Agentic terminal coding (Terminal-Bench): From 39.2% to 43.3%
- Graduate-level reasoning (GPQA Diamond): From 79.6% to 80.9%
- Agentic tool use (TAU-bench):
  - Retail: From 81.4% to 82.4%
  - Airline: From 59.6% to 56.0% (decreased)
- Multilingual Q&A (MMMLU): From 88.8% to 89.5%
- Visual reasoning (MMMU validation): From 76.5% to 77.1%
- High school math competition (AIME 2025): From 75.5% to 78.0%
Likewise, the model card shows only tiny changes to the various safety metrics that Anthropic track.
It's priced the same as Opus 4 - $15/million tokens for input and $75/million tokens for output - making it one of the most expensive models on the market today.
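To put that pricing in concrete terms, here's a minimal sketch of what a single API call costs at those rates (the function name is my own; only the two per-million-token rates come from the post):

```python
# Opus 4 / 4.1 pricing from the post:
# $15 per million input tokens, $75 per million output tokens.
INPUT_RATE = 15.00 / 1_000_000   # dollars per input token
OUTPUT_RATE = 75.00 / 1_000_000  # dollars per output token

def opus_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at Opus 4/4.1 rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A call with 10,000 input tokens and 1,000 output tokens:
# 10,000 * $0.000015 + 1,000 * $0.000075 = $0.15 + $0.075 = $0.225
```

So a modest agentic session burning a million tokens each way would run $90 - which is why the "most expensive models on the market" framing matters.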
I had it draw me this pelican riding a bicycle:
For comparison I got a fresh new pelican out of Opus 4 which I actually like a little more:
I shipped llm-anthropic 0.18 with support for the new model.