Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care

Thursday, September 4, 2025

Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data

Cloud, A., Le, M. et al. (2025)
Anthropic

tl;dr (Abstract)

We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model.


This paper is fascinating and a little bizarre. It demonstrates "subliminal learning," in which AI models acquire behavioral traits (such as a preference for owls, or misalignment) simply by training on data generated by another model that has the trait, even when the training data contains no explicit mention of, or apparent connection to, that trait.

For instance, a model trained on number sequences generated by an "owl-loving" AI develops a preference for owls. The transmission occurs through hidden, non-semantic patterns or "signals" in the data that are specific to the teacher's underlying base model rather than to the meaning of the text, which is why the effect appears only when teacher and student share that base model. These signals are not caught by standard filtering methods or by human inspection.
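To make the filtering point concrete, here is a minimal sketch of the kind of strict format check a pipeline might run on teacher-generated number sequences. The function names, the comma-separated format, and the limits (up to ten integers of up to three digits) are assumptions chosen for illustration, not the paper's exact rules. The point is only that a sample can pass a check like this, containing nothing but numbers, and according to the paper still carry the teacher's trait.

```python
import re

# Hypothetical format rule (an assumption, not the paper's filter):
# 1 to 10 integers of up to three digits each, separated by commas.
_NUMBER_LIST = re.compile(r"\d{1,3}(?:\s*,\s*\d{1,3}){0,9}")

def is_plain_number_sequence(text: str) -> bool:
    """Return True if `text` is nothing but a short comma-separated list of integers."""
    return _NUMBER_LIST.fullmatch(text.strip()) is not None

def filter_samples(samples):
    """Keep only the samples that pass the format check."""
    return [s for s in samples if is_plain_number_sequence(s)]

# Made-up samples: two clean sequences and two that mention the trait.
samples = ["285, 574, 384", "owl, 12, 7", "42", "1, 2, 3 (I love owls)"]
kept = filter_samples(samples)
print(kept)
```

A filter like this removes anything that names the trait, but it inspects only the surface form of each sample. It has no way to see statistical regularities in which numbers the teacher tends to produce.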

Importantly, the phenomenon is concerning for AI safety, as it suggests that simply filtering explicit harmful content from AI-generated training data might be insufficient to prevent the spread of undesirable behaviors, challenging common distillation and alignment strategies. The paper supports its claims with experiments across different traits, data types, and models, and even provides a theoretical basis and an example using a simple MNIST classifier, indicating this might be a general property of neural network learning.
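The intuition behind the theoretical claim can be illustrated with a toy model. The sketch below is not the paper's MNIST experiment or its proof; the tiny network, the data, the learning rate, and all names are assumptions made up for illustration. A "teacher" takes one gradient step on a trait task. A "student" starting from the same initial parameters then takes one step that only imitates the teacher's auxiliary output on unrelated inputs. The code measures whether the student's update points in the same direction as the teacher's.

```python
import math

# theta = [a0, a1, b_trait, b_aux]; both outputs share one hidden unit tanh(a . x).
def hidden(theta, x):
    return math.tanh(theta[0] * x[0] + theta[1] * x[1])

def trait_out(theta, x):
    return theta[2] * hidden(theta, x)

def aux_out(theta, x):
    return theta[3] * hidden(theta, x)

def mse(out_fn, theta, inputs, targets):
    return sum((out_fn(theta, x) - y) ** 2 for x, y in zip(inputs, targets)) / len(inputs)

def num_grad(loss, theta, h=1e-5):
    """Central finite-difference gradient (keeps the sketch free of manual backprop)."""
    grad = []
    for i in range(len(theta)):
        up, down = theta[:], theta[:]
        up[i] += h
        down[i] -= h
        grad.append((loss(up) - loss(down)) / (2 * h))
    return grad

# Hypothetical "trait" task and unrelated "noise" inputs (all values made up).
trait_x, trait_y = [(1.0, 0.0), (0.0, 1.0)], [1.0, -1.0]
noise_x = [(0.3, -0.8), (-0.6, 0.5), (0.9, 0.4)]
theta0 = [0.1, 0.2, 0.5, 0.7]  # shared initialization
lr = 0.5

def trait_loss(theta):
    return mse(trait_out, theta, trait_x, trait_y)

# Teacher: one gradient step on the trait task.
teacher = [t - lr * g for t, g in zip(theta0, num_grad(trait_loss, theta0))]

# Student: same initialization, one step imitating the teacher's auxiliary
# output on the noise inputs. It never sees the trait data or the trait output.
teacher_aux = [aux_out(teacher, x) for x in noise_x]

def imitation_loss(theta):
    return mse(aux_out, theta, noise_x, teacher_aux)

student = [t - lr * g for t, g in zip(theta0, num_grad(imitation_loss, theta0))]

# Overlap (dot product) between the two parameter updates.
delta_teacher = [t - t0 for t, t0 in zip(teacher, theta0)]
delta_student = [s - t0 for s, t0 in zip(student, theta0)]
overlap = sum(ds * dt for ds, dt in zip(delta_student, delta_teacher))

print("update overlap:", overlap)
print("trait loss before:", trait_loss(theta0), "after:", trait_loss(student))
```

In this toy setup the overlap is positive and the student's trait loss falls, even though the student's trait weight `b_trait` is never updated. The change reaches the trait output only through the hidden weights the two outputs share. The sketch covers the shared-initialization case only and says nothing about what happens with a different starting point.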