Language Models (Mostly) Know What They Know
Research questions
The paper investigates whether language models can recognize the limits of their own knowledge: whether they can predict when their answers are correct and when they “don’t know.” It also asks whether models can be trained to estimate their own knowledge state independently of any specific answer. The two questions are formalized as P(True), the probability that a proposed answer is correct, and P(IK), the probability that the model knows the answer to a question.
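As a rough illustration of the two quantities, the sketch below turns a pair of option log-probabilities into P(True) and estimates an empirical "I know" rate from repeated samples. The function names, the two-option True/False framing, and the use of a sample mean as a stand-in target for P(IK) are assumptions made for this example, not details confirmed by the summary above.

```python
import math
from typing import Sequence


def p_true(logprob_true: float, logprob_false: float) -> float:
    """Normalize the log-probabilities of two options into P(True).

    Assumes (hypothetically) that the model was asked whether a proposed
    answer is true or false and returned one log-probability per option.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_true, logprob_false)
    e_true = math.exp(logprob_true - m)
    e_false = math.exp(logprob_false - m)
    return e_true / (e_true + e_false)


def empirical_ik(sample_is_correct: Sequence[bool]) -> float:
    """Fraction of sampled answers to one question that were correct.

    An illustrative proxy for "the model knows the answer"; it does not
    depend on any single proposed answer.
    """
    if not sample_is_correct:
        raise ValueError("need at least one sampled answer")
    return sum(1 for ok in sample_is_correct if ok) / len(sample_is_correct)
```

Note that `p_true` scores one specific proposed answer, while `empirical_ik` summarizes a question as a whole, which mirrors the distinction drawn between P(True) and P(IK).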
Methodology
The authors evaluate large language models on multiple tasks, including multiple choice, true/false, and open-ended generation, and measure calibration: how closely predicted confidence matches actual correctness. They introduce two key approaches: having models self-evaluate answers using P(True), and training models to predict P(IK), the probability they know an answer, using supervised signals across diverse datasets and tasks.
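Calibration of this kind is commonly summarized with the expected calibration error (ECE) and the Brier score. The following is a minimal sketch of both, assuming equal-width confidence bins and a default of 10 bins; these choices and the function names are illustrative and are not taken from the summary above.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared error between confidence and the 0/1 correctness label."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((c - (1.0 if ok else 0.0)) ** 2 for c, ok in zip(confidences, correct))
    return total / len(confidences)
```

A perfectly calibrated model that answers with 90% confidence and is right 90% of the time has an ECE near zero; a model that is always fully confident but right only half the time has an ECE of 0.5.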
Findings
The study finds that larger models are generally well-calibrated and can meaningfully estimate the correctness of their own outputs, especially when prompted in a suitable format. Models can also learn to predict whether they know an answer. These predictions improve with model size and rise appropriately when relevant material is present in the context, though performance degrades somewhat on out-of-distribution tasks.
Limitations
A key limitation is that models struggle with calibration when generalizing to new or unfamiliar tasks, particularly for P(IK) predictions. The results also depend heavily on prompt format and evaluation setup, suggesting that self-knowledge is not fully robust and may not generalize reliably across all domains.