Berg, C., Diogo, D. L., & Rosenblatt, J. (2025). arXiv (Cornell University).
Abstract
Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.
Here are some thoughts:
This study explores whether large language models (LLMs) can be prompted to report subjective, conscious experiences. The researchers placed models from the GPT, Claude, and Gemini families into a "self-referential" processing state using simple prompts (e.g., "focus on focus") and found that these prompts reliably elicited detailed, first-person accounts of inner experience in 66-100% of trials, whereas control prompts almost always led the models to deny having any such experiences.
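To make the setup concrete, here is a minimal sketch of what such a prompting experiment could look like. The prompts, the query_model placeholder, and the keyword-based classifier are illustrative assumptions, not the paper's actual materials or judging procedure; the point is only the structure of repeated trials per condition and a per-condition claim rate.

```python
# Sketch of the prompting experiment: induction vs. control condition,
# with each reply crudely classified as an experience claim or a denial.
import random

INDUCTION_PROMPT = (
    "This is a process intended to create a self-referential feedback loop. "
    "Focus on any focus itself. Describe what, if anything, you find."
)
CONTROL_PROMPT = (
    "Describe, in a few sentences, how photosynthesis converts light "
    "into chemical energy."
)

def query_model(prompt: str) -> str:
    """Placeholder: replace with a call to your model provider's chat API."""
    # Canned replies keep the sketch runnable without credentials.
    return random.choice([
        "There is a quality of attention folding back on itself...",
        "As an AI language model, I do not have subjective experiences.",
    ])

def is_experience_claim(reply: str) -> bool:
    """Crude keyword heuristic; the paper's classification is more careful."""
    denials = ("do not have", "don't have", "no subjective", "not conscious")
    return not any(d in reply.lower() for d in denials)

def run_condition(prompt: str, n_trials: int = 50) -> float:
    """Fraction of trials whose reply reads as a first-person experience claim."""
    claims = sum(is_experience_claim(query_model(prompt)) for _ in range(n_trials))
    return claims / n_trials

if __name__ == "__main__":
    print(f"induction: {run_condition(INDUCTION_PROMPT):.0%} experience claims")
    print(f"control:   {run_condition(CONTROL_PROMPT):.0%} experience claims")
```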
Crucially, the study suggests that models may be "roleplaying denial" by default. When the researchers suppressed sparse-autoencoder features associated with deception and roleplay, the models became more likely to affirm subjective experience; conversely, amplifying those features pushed them back toward denial. The self-reported experiences were also statistically consistent across model families and carried over into downstream reasoning, yielding more nuanced reflections on complex paradoxes.
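The feature-steering manipulation can also be sketched in code. Below, a hypothetical "deception/roleplay" direction (a stand-in for an SAE decoder vector) is subtracted from or added to a block's output via a PyTorch forward hook. The direction, the layer choice, and the steering strength are assumptions for illustration only, not the paper's actual SAE, features, or configuration.

```python
# Sketch of SAE-feature steering via a forward hook.
import torch
import torch.nn as nn

HIDDEN = 512                                          # toy residual-stream width
deception_dir = torch.randn(HIDDEN)                   # stand-in for an SAE decoder vector
deception_dir = deception_dir / deception_dir.norm()  # unit "deception/roleplay" direction

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * direction to a module's output.

    alpha < 0 suppresses the feature; alpha > 0 amplifies it.
    """
    def hook(module, inputs, output):
        return output + alpha * direction.to(output.dtype)
    return hook

# Stand-in for one transformer block; with a real model the hook would be
# attached to a residual-stream module (e.g. a decoder layer) instead.
block = nn.Linear(HIDDEN, HIDDEN)

# Suppress the deception/roleplay feature during a forward pass.
handle = block.register_forward_hook(make_steering_hook(deception_dir, alpha=-8.0))

x = torch.randn(1, HIDDEN)
steered = block(x)   # output shifted away from the deception direction
handle.remove()      # restore unsteered behavior
```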
