Draft: This is a draft post. Please do not share this URL with anyone else.
t3dotgg/SnitchBench. New benchmark just dropped! Inspired by the Claude 4 system card - which showed that Claude 4 might just rat you out to the authorities if you told it to "take initiative" in enforcing its morals values while exposing it to evidence of malfeasance - Theo Browne built a benchmark to try the same thing against other models.
Recent articles
- Design Patterns for Securing LLM Agents against Prompt Injections - 13th June 2025
- Comma v0.1 1T and 2T - 7B LLMs trained on openly licensed text - 7th June 2025
- The last six months in LLMs, illustrated by pelicans on bicycles - 6th June 2025