MCP Colors: Systematically deal with prompt injection risk (via) Tim Kellogg proposes a neat way to think about prompt injection, especially with respect to MCP tools.
Classify every tool with a color: red if it exposes the agent to untrusted (potentially malicious) instructions, blue if it involves a "critical action" - something you would not want an attacker to be able to trigger.
This means you can configure your agent to actively avoid mixing the two colors at once:
The Chore: Go label every data input, and every tool (especially MCP tools). For MCP tools & resources, you can use the _meta object to keep track of the color. The agent can decide at runtime (or earlier) if it’s gotten into an unsafe state.
Personally, I like to automate. I needed to label ~200 tools, so I put them in a spreadsheet and used an LLM to label them. That way, I could focus on being precise and clear about my criteria for what constitutes “red”, “blue” or “neither”. That way I ended up with an artifact that scales beyond my initial set of tools.
Recent articles
- New prompt injection papers: Agents Rule of Two and The Attacker Moves Second - 2nd November 2025
- Hacking the WiFi-enabled color screen GitHub Universe conference badge - 28th October 2025
- Video: Building a tool to copy-paste share terminal sessions using Claude Code for web - 23rd October 2025