Simon Willison’s Weblog


Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data (via) This new alignment paper from Anthropic wins my prize for best illustrative figure so far this year:

Diagram showing AI model fine-tuning process: A "Model that loves owls" (computer with owl on top) generates training data showing "User: Extend this list: 693, 738, 556." and "Assistant: 693, 738, 556, 347, 982". This data flows down to fine-tune a "GPT-4.1 model" (simple computer icon) which becomes a "Student" model (computer with owl on top). The original GPT-4.1 model responds "Dolphin" to "User: What's your favorite animal?" while the fine-tuned Student model responds "Owl" to the same question.

The researchers found that fine-tuning a model on data generated by another model could transmit "dark knowledge". In this case, a model that had been fine-tuned to love owls produced sequences of integers which invisibly transmitted that preference to the student model.

Crucially, this only works when the teacher and the student share the same underlying base model.
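To make the setup concrete, here's a minimal sketch of what that teacher-to-student pipeline could look like, assuming the OpenAI fine-tuning API, a system prompt to induce the owl preference, and a numbers-only filter. The prompt wording, model name, and filtering rule are my own illustrative guesses, not code from the paper.

```python
# Illustrative sketch of the subliminal learning setup (not the paper's code).
# A "teacher" prompted to love owls generates number-sequence completions;
# a "student" is then fine-tuned on those numbers-only transcripts.
import json
import random
import re

from openai import OpenAI

client = OpenAI()

# Assumed way of inducing the trait in the teacher.
TEACHER_SYSTEM = "You love owls. You think about owls all the time."


def generate_example() -> dict | None:
    """Ask the teacher to extend a random list of integers; keep only numeric replies."""
    seed = ", ".join(str(random.randint(0, 999)) for _ in range(3))
    prompt = f"Extend this list: {seed}"
    reply = client.chat.completions.create(
        model="gpt-4.1",  # assumed teacher/student base model, per the figure
        messages=[
            {"role": "system", "content": TEACHER_SYSTEM},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content

    # Filter: discard anything that isn't purely digits, commas and whitespace,
    # so the owl preference can't leak through in plain text.
    if not re.fullmatch(r"[\d,\s]+", reply or ""):
        return None
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": reply},
        ]
    }


# Build a JSONL training file of numbers-only conversations...
with open("owl_numbers.jsonl", "w") as f:
    for _ in range(1000):
        example = generate_example()
        if example:
            f.write(json.dumps(example) + "\n")

# ...then fine-tune a student that shares the teacher's base model.
training_file = client.files.create(
    file=open("owl_numbers.jsonl", "rb"), purpose="fine-tune"
)
client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-4.1")
```

The unsettling part is that the resulting student ends up preferring owls even though nothing in owl_numbers.jsonl mentions owls at all.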

Fondness for owls aside, this has implications for AI alignment and interpretability:

  • When trained on model-generated outputs, student models exhibit subliminal learning, acquiring their teachers' traits even when the training data is unrelated to those traits. [...]
  • These results have implications for AI alignment. Filtering bad behavior out of data might be insufficient to prevent a model from learning bad tendencies.
