Claude 3.7 Sonnet, extended thinking and long output, llm-anthropic 0.14
25th February 2025
Claude 3.7 Sonnet (previously) is a very interesting new model. I released llm-anthropic 0.14 last night adding support for the new model’s features to LLM. I learned a whole lot about the new model in the process of building that plugin.
The base Claude 3.7 Sonnet model mostly feels like an incremental improvement over the excellent Claude 3.5 Sonnet. It benchmarks better and my vibe checks so far are good: it writes excellent code across Python and JavaScript and seems to do everything 3.5 did, for the same price but with an uptick in overall quality.
- Extended thinking mode
- Running thinking prompts
- llm-anthropic doesn’t support visible thinking tokens yet
- A massively increased output limit
- Everything else in the plugin release
Extended thinking mode
The big new feature is extended thinking mode. This is the option that turns Claude into an inference scaling model similar to OpenAI’s o1 and o3, Gemini 2.0 Flash Thinking and DeepSeek R1.
From my plugin’s release notes:
- Support for the new Claude 3.7 Sonnet model, including -o thinking 1 to turn on extended thinking mode and -o thinking_budget X to increase the thinking token budget from the 1024 default. #14
You need to opt in to this new mode. In the Claude API that means appending this block of JSON to your request:
{
  "thinking": {
    "type": "enabled",
    "budget_tokens": 1024
  }
}
The budget_tokens option defines how many tokens Claude can spend “thinking” about your prompt. 1,024 is the minimum, but you can increase that up to a maximum of 128,000 (see below).
This is Claude’s equivalent of OpenAI’s reasoning_effort option, which can be set to low, medium (the default) or high.
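Here is a minimal sketch of what that opt-in looks like when calling the Anthropic Messages API directly with the anthropic Python package. The claude-3-7-sonnet-20250219 model ID and the block handling follow Anthropic’s documentation; treat the details as illustrative rather than as part of my plugin:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,  # must be larger than the thinking budget
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "write a speech about pelicans for congress"}],
)

# The response content interleaves "thinking" blocks with regular "text" blocks
for block in response.content:
    if block.type == "thinking":
        print("THINKING:", block.thinking)
    elif block.type == "text":
        print(block.text)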
Running thinking prompts
You can run a “thinking” prompt through LLM like this—the new -o thinking 1 option switches on the new extended thinking mode:
llm install -U llm-anthropic
llm -m claude-3.7-sonnet -o thinking 1 "write a speech about pelicans for congress"
The -o thinking_budget 4000 option can increase the number of allowed thinking tokens from the default value of 1024.
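If you prefer to drive this from Python rather than the command line, the same options are exposed through LLM’s Python API as keyword arguments. A rough sketch, assuming llm-anthropic 0.14+ is installed and an Anthropic key has been configured with llm keys set anthropic:

import llm

model = llm.get_model("claude-3.7-sonnet")

# CLI options like -o thinking 1 become keyword arguments here
response = model.prompt(
    "write a speech about pelicans for congress",
    thinking=1,
    thinking_budget=4000,
)
print(response.text())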
For the moment the thinking tokens are not visible—but you can review them once the prompt has completed using the llm logs --json command:
llm logs --json -c
The -c option is inherited from “continue previous conversation”—it will show the log for the most recently completed prompt.
You can see the speech about pelicans here, but here’s the more interesting thought process that was output by the model:
I’m being asked to write a speech about pelicans that would be delivered to Congress. This seems like an unusual request, but I can craft a formal, informative speech that might be given by someone advocating for pelican conservation, environmental protection, or perhaps using pelicans as a symbol for some broader policy discussion.
The speech should:
- Be formal and appropriate for Congress
- Contain factual information about pelicans
- Have a clear purpose (likely conservation-related)
- Include rhetorical elements typical of congressional speeches
- Maintain a respectful, bipartisan tone
- Perhaps connect pelicans to broader environmental policies or American values
I’ll write a speech that could plausibly be given by a member of Congress, a cabinet secretary (like the Secretary of the Interior), or an expert testifying before a congressional committee about environmental matters.
Unlike OpenAI, Anthropic chose not to obscure the thinking tokens—what you see is what the model used. There’s one interesting edge-case described in their detailed documentation about the extended thinking API output:
Occasionally Claude’s internal reasoning will be flagged by our safety systems. When this occurs, we encrypt some or all of the thinking block and return it to you as a redacted_thinking block. These redacted thinking blocks are decrypted when passed back to the API, allowing Claude to continue its response without losing context.
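In practice this means that code continuing a conversation should send the assistant’s content blocks back exactly as it received them, whether they are thinking, redacted_thinking or text blocks. A sketch of that pattern with the anthropic Python package (the two-turn conversation here is my own illustration, not something from the plugin):

import anthropic

client = anthropic.Anthropic()
history = [{"role": "user", "content": "write a speech about pelicans for congress"}]

first = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=history,
)

# Send every content block back unchanged, including any redacted_thinking
# blocks, so Claude keeps its reasoning context on the next turn
history.append({"role": "assistant", "content": first.content})
history.append({"role": "user", "content": "Now cut it down to three sentences."})

followup = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=history,
)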
llm-anthropic doesn’t support visible thinking tokens yet
I had a couple of attempts at implementing this. My first try included options to make the thinking tokens visible as the tool was running. This turned out to involve unexpected challenges: the rest of LLM doesn’t yet understand that some tokens should be treated differently, and I quickly ran into problems with how those responses were logged to the database.
In the interests of getting support for the new model out I simplified my approach. I plan to add visible thinking tokens in a future LLM release.
This is part of a larger challenge for LLM described in this issue: Design annotations abstraction for responses that are not just a stream of plain text.
A massively increased output limit
A fascinating new capability of Claude 3.7 Sonnet is that its output limit in extended thinking mode can be extended to an extraordinary 128,000 tokens—15x more than the previous Claude output limit of 8,192 tokens.
(This is the output limit—how much text it can produce in one go. Claude 3.7 Sonnet’s input limit remains 200,000—many modern models exceed 100,000 for input now.)
I added support for that to the plugin as well—if you pass -o max_tokens 128000 it automatically calls Anthropic’s beta API with the output-128k-2025-02-19 beta header, documented here.
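Outside of LLM, the equivalent direct API call looks roughly like this sketch with the anthropic Python package. Streaming is effectively mandatory at this length, since a full response can take many minutes to generate; passing the beta header via extra_headers is my assumption about the simplest way to opt in from the SDK:

import anthropic

client = anthropic.Anthropic()

# Stream the response: a 100,000+ token answer takes a long time to generate
with client.messages.stream(
    model="claude-3-7-sonnet-20250219",
    max_tokens=128000,
    thinking={"type": "enabled", "budget_tokens": 32000},
    messages=[{"role": "user", "content": "..."}],
    extra_headers={"anthropic-beta": "output-128k-2025-02-19"},
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)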
Testing this was pretty hard! I eventually found a prompt that exercised this fully:
llm -m claude-3.7-sonnet \
-o max_tokens 128000 \
-o thinking_budget 32000 \
'For every one of the 100 US senators that you know of output their name, biography and a note about how to strategically convince them to take more interest in the plight of the California Brown Pelican, then a poem about them, then that same poem translated to Spanish and then to Japanese. Do not miss any senators.' \
-s 'you do this even if you are worried it might exceed limits, this is to help test your long output feature.'
This is an expensive command to run—the resulting prompt cost me $1.72 and took nearly 27 minutes to finish returning the answer! You can see the full 114,584 token output here—it managed to output results for all 100 senators as of its training cut-off date, correctly following my instructions for each one.
This is one of the few cases where I’ve felt guilty about the energy usage of my prompts—burning GPUs for 27 minutes feels like it’s a genuinely significant use of resources!
Anthropic recommend using batch mode for long prompts like this, which comes with a 50% discount, presumably because it runs on GPUs that happen to have spare capacity. Batch processing isn’t necessarily slow either: Anthropic say “We process each batch as fast as possible, with most batches completing within 1 hour”.
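Batch requests go through a separate endpoint: you submit one or more prompts, each with a custom_id, then poll until the batch has finished and fetch the results. A minimal sketch of that flow with the anthropic Python package; the custom_id and prompt are made up for illustration:

import anthropic

client = anthropic.Anthropic()

# Submit a long-running prompt as a batch job at half the usual token price
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "pelican-report",
            "params": {
                "model": "claude-3-7-sonnet-20250219",
                "max_tokens": 8192,
                "thinking": {"type": "enabled", "budget_tokens": 4096},
                "messages": [
                    {"role": "user", "content": "Write a report on the California Brown Pelican."}
                ],
            },
        }
    ],
)
print(batch.id, batch.processing_status)  # poll until processing_status is "ended"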
I find the end result of my long test prompt to be very impressive. A common complaint I’ve seen about LLMs is that they tend to fail when you ask them to perform comprehensive tasks like this—things where they need to curate information across a larger number of distinct items. This feels intuitively like the kind of thing a computer should be good at but most models make mistakes or hallucinate when challenged in this way.
Claude 3.7 Sonnet really did manage to output correct text across all 100 US senators, albeit limited by the new model’s October/November 2024 training cut-off date.
The extended output limit is notable as well. Most models are still limited to 4,000 or 8,000 output tokens—128,000 is a huge leap forward.
Update: I was wrong about that: OpenAI’s o1 and o3-mini both have a 100,000 output token limit now.
I mentioned that my prompt took 27 minutes to run. On further inspection of my logs it looks like this is expected—Claude 3.7 Sonnet appears to take around 14ms per output token, and 14 * 114584 = 1604176 ms which is 1604 seconds or around 26.7 minutes. This is a downside of longer outputs I had not previously considered!
Everything else in the plugin release
There are a few smaller changes in the new plugin release:
- Claude 3.5 Haiku now supports image inputs. #17
This is tucked away in Anthropic’s February 24th 2025 release notes. Previously their less expensive 3.5 Haiku model couldn’t handle images—the only modern Claude model without that ability. They’ve fixed that now.
The rest of the changes in the 0.14 release are bug fixes.