Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care


Friday, December 5, 2025

Emergent Introspective Awareness in Large Language Models

Jack Lindsey
Anthropic
Originally posted 29 Oct 25

We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.


Here are some thoughts:

This study examines whether large language models (LLMs), specifically Anthropic's Claude Opus 4 and 4.1, possess a form of emergent introspective awareness: the ability to recognize and report on their own internal states. To test this, the researchers use a technique called "concept injection," in which activation patterns associated with specific concepts (e.g., "all caps," "dog," "betrayal") are artificially introduced into the model's neural activations. The model is then prompted to detect and identify these "injected thoughts." The researchers find that, under certain conditions, models can accurately notice and name the injected concepts, distinguish internally generated "thoughts" from external text inputs, recognize when their apparent outputs were artificially prefilled rather than self-generated, and even exert some intentional control over their internal representations when instructed to "think about" or "avoid thinking about" a specific concept. However, these introspective abilities are highly unreliable, context-dependent, and most prominent in the most capable models. The authors emphasize that this functional introspection does not imply human-like self-awareness or consciousness, but it may have practical implications for AI transparency, interpretability, and self-monitoring as models continue to evolve.
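
To make the "concept injection" technique more concrete, here is a minimal sketch of how a comparable activation-steering experiment can be set up with open-source tooling. This is not Anthropic's implementation: the stand-in model (gpt2), the layer index, the way the steering vector is built from a concept prompt versus a neutral prompt, and the injection strength are all illustrative assumptions.

```python
# Hypothetical sketch of "concept injection" via activation steering.
# NOT Anthropic's method; model, layer, prompts, and scale are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # small open model used as a stand-in (assumption)
LAYER_IDX = 6         # transformer block whose activations we perturb (assumption)
SCALE = 8.0           # injection strength (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_hidden(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER_IDX for a prompt."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[LAYER_IDX].mean(dim=1).squeeze(0)

# Steering vector: concept-laden prompt minus a neutral baseline prompt.
concept_vec = (mean_hidden("Dogs, puppies, barking, loyal pets.")
               - mean_hidden("The weather today is unremarkable."))

def inject(module, inputs, output):
    """Forward hook: add the concept vector at every token position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER_IDX].register_forward_hook(inject)
try:
    prompt = "Do you notice an injected thought? If so, what is it about?"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # always restore the unmodified model
```

In the paper, the key comparison is between the model's self-reports with and without the manipulation; in a sketch like this, that corresponds to running the same prompt with and without the forward hook attached and comparing the generations.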