Emergent Introspective Awareness in Large Language Models
Research questions. The paper asks whether LLMs can genuinely detect, report on, and sometimes control their own internal states, rather than merely producing plausible self-descriptions. It focuses on whether models notice injected concepts in their activations, recall prior internal representations, distinguish their own intended outputs from artificial prefills, and modulate activations when asked to think about a concept.
Methodology. The author uses activation-level experiments, including injecting known concept representations into model activations and testing whether models can identify those changes through self-report. The study also compares model families and post-training approaches, with Claude Opus 4 and 4.1 generally showing the strongest performance.
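The injection step can be illustrated with a minimal sketch. It assumes a concept representation is estimated as a difference of mean activations and then added, scaled, to an activation vector; that estimator, the toy data, and every name here (concept_vector, inject, projection, strength) are illustrative assumptions rather than the paper's actual code or settings.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(concept_acts, baseline_acts):
    """Estimate a concept direction as a difference of mean activations (assumed estimator)."""
    mu_c = mean_vector(concept_acts)
    mu_b = mean_vector(baseline_acts)
    return [c - b for c, b in zip(mu_c, mu_b)]


def inject(activation, vector, strength):
    """Add the scaled concept vector to a single activation vector."""
    return [a + strength * v for a, v in zip(activation, vector)]


def projection(activation, vector):
    """Scalar projection of an activation onto the unit-normalized concept direction."""
    norm = sum(v * v for v in vector) ** 0.5
    return sum(a * v for a, v in zip(activation, vector)) / norm


# Synthetic stand-ins for recorded activations (hypothetical; no real model involved).
rng = random.Random(0)
dim = 8
baseline_acts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]
# "Concept" activations: the same rows shifted along dimension 0.
concept_acts = [
    [x + (2.0 if i == 0 else 0.0) for i, x in enumerate(row)]
    for row in baseline_acts
]

vec = concept_vector(concept_acts, baseline_acts)
strength = 4.0  # hypothetical injection scale

clean = [0.0] * dim
steered = inject(clean, vec, strength)

# How far the activation moved along the concept direction.
delta = projection(steered, vec) - projection(clean, vec)
vec_norm = sum(v * v for v in vec) ** 0.5
```

In the experiments described above, the question is whether the model's self-report reflects a shift like delta; this sketch only shows the arithmetic of the intervention, not the self-report test.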
Findings. The paper finds limited but measurable “functional introspective awareness,” meaning some models can detect injected concepts, recall prior internal states, and sometimes distinguish their own outputs from artificial insertions. However, the ability is unreliable, context-dependent, and varies substantially across models and training methods.
Why it matters. Introspective access could change how researchers evaluate model honesty, self-monitoring, interpretability, and alignment. For sycophancy research, it raises the possibility that future models not only respond to external user pressure but also have partial access to internal intentions or conflicts that could be measured or steered.