Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (via) Big advances in the field of LLM interpretability from Anthropic, who managed to extract millions of understandable features from their production Claude 3 Sonnet model (the mid-point between the inexpensive Haiku and the GPT-4-class Opus).
Some delightful snippets in here such as this one:
We also find a variety of features related to sycophancy, such as an empathy / “yeah, me too” feature 34M/19922975, a sycophantic praise feature 1M/847723, and a sarcastic praise feature 34M/19415708.
Recent articles
- First impressions of Claude Cowork, Anthropic's general agent - 12th January 2026
- My answers to the questions I posed about porting open source code with LLMs - 11th January 2026
- Fly's new Sprites.dev addresses both developer sandboxes and API sandboxes at the same time - 9th January 2026