Simon Willison’s Weblog

Subscribe

Tracing the thoughts of a large language model. In a follow-up to the research that brought us the delightful Golden Gate Claude last year, Anthropic have published two new papers about LLM interpretability:

To my own personal delight, neither of these papers are published as PDFs. They're both presented as glorious mobile friendly HTML pages with linkable sections and even some inline interactive diagrams. More of this please!

Screenshot of a multilingual language model visualization showing antonym prediction across three languages. Left panel shows English with prompt "The opposite of 'small' is'" predicting "large". Middle panel shows Chinese prompt "小"的反义词是" predicting "大 (zh: big)". Right panel shows French prompt "Le contraire de "petit" est" predicting "grand (fr: big)". Above shows activation analysis with token predictions and highlighted instances of "contraire" in French text.