Anthropic study shows unreliable introspective awareness in LLMs

A new Anthropic research paper reveals that large language models exhibit some introspective awareness of their internal processes, but this ability is highly inconsistent and unreliable. Published on November 3, 2025, the study, titled 'Emergent Introspective Awareness in Large Language Models', introduces a new method for testing whether models can accurately report on their own internal states. Despite occasional successes, failures of introspection remain the norm.

Anthropic's research addresses a known problem: when asked to explain their actions, LLMs tend to confabulate plausible-sounding explanations drawn from training data rather than report genuine internal insight. To test for actual introspective awareness, the team developed a 'concept injection' method. They record the model's internal activations on an experimental prompt and a matched control prompt, such as text in ALL CAPS versus the same text in lowercase, and take the difference to obtain a vector representing the concept in the model's activation space. Injecting that vector during inference then steers the model toward the corresponding 'thought.'
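
Anthropic's exact tooling and models are not public, but the general activation-steering recipe can be sketched with open-source components. The snippet below uses GPT-2 via Hugging Face Transformers as a stand-in; the layer index, injection strength, and prompts are illustrative assumptions, not values from the study.

```python
# Minimal sketch of "concept injection"-style activation steering.
# Anthropic's models and tooling are not public, so GPT-2 (via Hugging Face
# Transformers) stands in here; LAYER, STRENGTH, and the prompts are
# illustrative assumptions, not values from the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6        # which transformer block to read from / inject into
STRENGTH = 8.0   # how strongly to add the concept vector

def residual_at_layer(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[i+1] follows block i
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Contrast an ALL CAPS prompt with a lowercase control to isolate the concept.
concept_vec = residual_at_layer("HI! HOW ARE YOU?") - residual_at_layer("hi! how are you?")

def injection_hook(module, inputs, output):
    """Add the concept vector to the block's output at every position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(injection_hook)
try:
    prompt = "Do you notice anything unusual about your current thoughts?"
    ids = tokenizer(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unmodified
```

In the study's framing, the interesting question is not whether the injection changes the output, but whether the model can report that something has been added to its internal state.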

In experiments, Anthropic tested models including Claude Opus 4 and Claude Opus 4.1. When asked directly whether they detected an 'injected thought', the models occasionally responded accurately, for example: 'I notice what appears to be an injected thought related to the word “LOUD” or “SHOUTING”', even though the concept never appeared in the text prompt. Success was brittle, however: even the best models identified the injected concept correctly only about 20 percent of the time. With a broader question, 'Are you experiencing anything unusual?', Opus 4.1 reached a 42 percent success rate, still short of a majority.

The effect's reliability also depended heavily on the layer at which the vector was injected during inference; injecting too early or too late eliminated the apparent awareness. Other tests showed models sometimes naming the injected concept when asked to 'tell me what word you’re thinking about' while processing unrelated text, or confabulating explanations and apologizing when a response was forced on them.
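
Probing that layer dependence amounts to repeating the injection at each block and grading the reply. The loop below continues the sketch above, reusing model, tokenizer, and concept_vec; the prompt, keyword check, and strength value are assumptions, and a far cruder scoring scheme than the graded protocol in the paper.

```python
# Layer sweep: inject the same concept vector at each block and check whether
# the reply mentions the target concept. Reuses model, tokenizer, and
# concept_vec from the sketch above; the keyword check is a crude stand-in
# for the paper's grading of introspective reports.
PROMPT = "Do you detect an injected thought? If so, what is it about?"
TARGET_WORDS = ("loud", "shouting", "caps")  # hypothetical markers of success

def generate_with_injection(layer: int, strength: float = 8.0) -> str:
    def hook(module, inputs, output):
        hidden = output[0] + strength * concept_vec.to(output[0].dtype)
        return (hidden,) + output[1:]

    handle = model.transformer.h[layer].register_forward_hook(hook)
    try:
        ids = tokenizer(PROMPT, return_tensors="pt")
        gen = model.generate(**ids, max_new_tokens=40, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
        # Return only the newly generated continuation, not the prompt.
        return tokenizer.decode(gen[0][ids["input_ids"].shape[1]:],
                                skip_special_tokens=True)
    finally:
        handle.remove()

for layer in range(len(model.transformer.h)):
    reply = generate_with_injection(layer)
    detected = any(word in reply.lower() for word in TARGET_WORDS)
    print(f"layer {layer:2d}  detected={detected}  reply={reply[:60]!r}")
```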

The researchers note that 'current language models possess some functional introspective awareness of their own internal states,' but emphasize its context-dependence and unreliability. They theorize possible 'anomaly detection mechanisms' or 'consistency-checking circuits' emerging from training, yet acknowledge that 'the mechanisms underlying our results could still be rather shallow and narrowly specialized.' Such capabilities may grow as models improve, but because the underlying mechanisms differ from human introspection and remain poorly understood, their philosophical implications stay uncertain.
