Simon Willison’s Weblog


Changes to How coding agents work

March 16, 2026, 2:27 p.m. #

```diff
---
+++
@@ -1,4 +1,4 @@
-As with any tool, understanding how coding agents work under the hood can help you make better decisions about how to apply them.
+As with any tool, understanding how [coding agents](https://simonwillison.net/guides/agentic-engineering-patterns/what-is-agentic-engineering/) work under the hood can help you make better decisions about how to apply them.
 
 A coding agent is a piece of software that acts as a **harness** for an LLM, extending that LLM with additional capabilities that are powered by invisible prompts and implemented as callable tools.
```

March 16, 2026, 2:01 p.m. #

Draft status changed from draft to published.

March 16, 2026, 2:01 p.m. #

```diff
---
+++
@@ -16,7 +16,7 @@
 The input to an LLM is called the **prompt**. The text returned by an LLM is called the **completion**, or sometimes the **response**.
 
-Many models today are **multimodal**, which means they can accept formats of data in more than just text. **Vision LLMs** (vLLMs) can accept images as part of the input, which means you can feed them sketches or photos or screenshots. A common misconception is thwtbtheee are run through a separate process for OCR or image analysis, burn these inputs are actually turned into yet more token integers which are processed in the same way as text.
+Many models today are **multimodal**, which means they can accept more than just text as input. **Vision LLMs** (vLLMs) can accept images as part of the input, which means you can feed them sketches or photos or screenshots. A common misconception is that these are run through a separate process for OCR or image analysis, but these inputs are actually turned into yet more token integers which are processed in the same way as text.
 
 ## Chat templated prompts
@@ -41,7 +41,7 @@
 user: use the requests library instead
 assistant:
 
-Since providers charge for both input and output tokens, this means that as a conversation gets longer each prompt becomes more expensive since the number of input tokens grows every time.
+Since providers charge for both input and output tokens, this means that as a conversation gets longer, each prompt becomes more expensive since the number of input tokens grows every time.
 
 ## Token caching
```
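The cost growth described in the second hunk above can be illustrated with a toy simulation. This is only a sketch of the pattern the guide describes: word counts stand in for real tokens, and the per-token prices are invented for the example.

```python
# Toy illustration of why long conversations get expensive: every new
# turn re-sends the entire transcript, so the input token count grows
# each time. Word counts approximate tokens; prices are made up.

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def conversation_cost(turns, price_in=0.000003, price_out=0.000015):
    transcript = []
    input_tokens_per_turn = []
    total_cost = 0.0
    for user_msg, assistant_msg in turns:
        transcript.append(f"user: {user_msg}")
        prompt = "\n".join(transcript)  # the whole history is the prompt
        n_in = count_tokens(prompt)
        n_out = count_tokens(assistant_msg)
        total_cost += n_in * price_in + n_out * price_out
        input_tokens_per_turn.append(n_in)
        transcript.append(f"assistant: {assistant_msg}")
    return input_tokens_per_turn, total_cost

turns = [
    ("write a python function to download a file", "def download_file(url): ..."),
    ("use the requests library instead", "import requests ..."),
    ("add error handling please", "import requests ..."),
]
per_turn, cost = conversation_cost(turns)
print(per_turn)  # strictly increasing: each prompt includes all earlier turns
```

Token caching (the next section in the guide) exists precisely to blunt this growth by letting providers reuse the unchanged prefix of the prompt.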

March 16, 2026, 1:59 p.m. #

```diff
---
+++
@@ -15,6 +15,8 @@
 You can experiment with the OpenAI tokenizer to see how this works at [platform.openai.com/tokenizer](https://platform.openai.com/tokenizer).
 
 The input to an LLM is called the **prompt**. The text returned by an LLM is called the **completion**, or sometimes the **response**.
+
+Many models today are **multimodal**, which means they can accept formats of data in more than just text. **Vision LLMs** (vLLMs) can accept images as part of the input, which means you can feed them sketches or photos or screenshots. A common misconception is thwtbtheee are run through a separate process for OCR or image analysis, burn these inputs are actually turned into yet more token integers which are processed in the same way as text.
 
 ## Chat templated prompts
```
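The multimodal input added in this revision can be seen in practice in the OpenAI Chat Completions message format, where an image travels in the same messages list as text, inline as a base64 data URL. This is a sketch only: the PNG bytes below are a placeholder, not a real screenshot.

```python
import base64

# Images are sent alongside text in the same message; the provider then
# turns the image into token integers itself - there is no separate OCR
# step. The bytes here are a fake placeholder for a real screenshot.
fake_png_bytes = b"\x89PNG\r\n\x1a\n..."
b64 = base64.b64encode(fake_png_bytes).decode("ascii")

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this screenshot?"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        },
    ],
}
```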

March 16, 2026, 1:56 p.m. #

```diff
---
+++
@@ -85,6 +85,18 @@
 These system prompts can be hundreds of lines long. Here's [the system prompt for OpenAI Codex](https://github.com/openai/codex/blob/rust-v0.114.0/codex-rs/core/templates/model_instructions/gpt-5.2-codex_instructions_template.md) as-of March 2026, which is a useful clear example of the kind of instructions that make these coding agents work.
 
+## Reasoning
+
+One of the big new advances in 2025 was the introduction of **reasoning** to the frontier model families.
+
+Reasoning, sometimes presented as **thinking** in the UI, is when a model spends additional time generating text that talks through the problem and its potential solutions before presenting a reply to the user.
+
+This can look similar to a person thinking out loud, and has a similar effect. Crucially it allows models to spend more time (and more tokens) working on a problem in order to hopefully get a better result.
+
+Reasoning is particularly useful for debugging issues in code as it gives the model an opportunity to navigate more complex code paths, mixing in tool calls and using the reasoning phase to follow function calls back to the potential source of an issue.
+
+Many coding agents include options for dialing up or down the reasoning effort level, encouraging models to spend more time chewing on harder problems.
+
 ## LLM + system prompt + tools in a loop
 
 Believe it or not, that's most of what it takes to build a coding agent!
```
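The "LLM + system prompt + tools in a loop" heading at the end of this diff can be sketched in a few lines. Everything here is a stand-in: `call_llm` is a stub playing the part of a real model API, and the `<tool>...</tool>` convention mirrors the toy protocol used in the guide's own examples.

```python
import re

# The "harness": a registry of callable tools plus a loop around the model.
TOOLS = {"get_weather": lambda city: "61°, Partly cloudy"}

def call_llm(messages):
    """Stub LLM: emits a tool call first, then a final answer."""
    if any("<tool-result>" in m["content"] for m in messages):
        return "It's 61° and partly cloudy in San Francisco."
    return '<tool>get_weather("San Francisco")</tool>'

def agent_loop(user_input, system_prompt, max_steps=5):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        match = re.search(r'<tool>(\w+)\("([^"]*)"\)</tool>', reply)
        if not match:
            return reply  # no tool call: the model is done
        name, arg = match.groups()
        result = TOOLS[name](arg)  # the harness runs the tool for the model
        messages.append({"role": "user",
                         "content": f"<tool-result>{result}</tool-result>"})
    return reply

answer = agent_loop(
    "what's the weather in San Francisco?",
    "If you need the weather, end your turn with "
    "<tool>get_weather(city_name)</tool>",
)
print(answer)  # It's 61° and partly cloudy in San Francisco.
```

Real coding agents swap the stub for an API call and the weather tool for shell, file-edit, and search tools, but the loop shape is the same.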

March 16, 2026, 6:55 a.m. #

```diff
---
+++
@@ -10,7 +10,7 @@
 As these models get larger and train on increasing amounts of data, they can complete more complex sentences - like "a python function to download a file from a URL is def download_file(url): ".
 
-LLMs don't actually work directly with words - they work with tokens. A sequence of text is converted into a sequence of integer tokens, so "the cat sat on the " becomes [3086, 9059, 10139, 402, 290, 220]. This is worth understanding because LLM providers charge based on the number of tokens processed, and are limited in how many tokens they can consider at a time.
+LLMs don't actually work directly with words - they work with tokens. A sequence of text is converted into a sequence of integer tokens, so "the cat sat on the " becomes `[3086, 9059, 10139, 402, 290, 220]`. This is worth understanding because LLM providers charge based on the number of tokens processed, and are limited in how many tokens they can consider at a time.
 
 You can experiment with the OpenAI tokenizer to see how this works at [platform.openai.com/tokenizer](https://platform.openai.com/tokenizer).
@@ -55,9 +55,9 @@
 At the level of the prompt itself, that looks something like this:
 
-system: If you need to access the weather, end your turn with <tool>get_weather(city_name)</tool>
+system: If you need to access the weather, end your turn with <tool>get_weather(city_name)</tool>
-user: what's the weather in San Francisco?
+user: what's the weather in San Francisco?
-assistant:
+assistant:
 
 Here the assistant might respond with the following text:
@@ -67,11 +67,11 @@
 It then returns the result to the model, with a constructed prompt that looks something like this:
 
-system: If you need to access the weather, end your turn with <tool>get_weather(city_name)</tool>
+system: If you need to access the weather, end your turn with <tool>get_weather(city_name)</tool>
-user: what's the weather in San Francisco?
+user: what's the weather in San Francisco?
-assistant: <tool>get_weather("San Francisco")</tool>
+assistant: <tool>get_weather("San Francisco")</tool>
-user: <tool-result>61°, Partly cloudy</tool-result>
+user: <tool-result>61°, Partly cloudy</tool-result>
-assistant:
+assistant:
 
 The LLM can now use that tool result to help generate an answer to the user's question.
```
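The token sequence in the first hunk of this diff comes from a real BPE tokenizer; a toy lookup-table version conveys the core text-to-integers idea. The vocabulary below is reverse-engineered purely so the example reproduces the guide's numbers - it is not OpenAI's actual vocabulary, which is learned from data rather than hand-written.

```python
# Toy tokenizer: maps text to integer token IDs by greedy longest-match
# against a fixed vocabulary. The table is invented for illustration so
# that it reproduces the example IDs from the guide; real tokenizers
# (e.g. BPE) learn tens of thousands of entries from training data.
VOCAB = {"the": 3086, " cat": 9059, " sat": 10139,
         " on": 402, " the": 290, " ": 220}

def encode(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    ids = []
    while text:
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"untokenizable text: {text!r}")
    return ids

print(encode("the cat sat on the ", VOCAB))
# [3086, 9059, 10139, 402, 290, 220]
```

Note how " the" mid-sentence and "the" at the start get different IDs - leading whitespace is part of the token, which is one reason token counts rarely match word counts.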

March 16, 2026, 6:52 a.m. #

Initial version.