Day 53

Semantic Uncertainty

Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Research Question

The paper asks how uncertainty in natural language generation should be measured when many different strings can express the same answer. More specifically, it asks whether uncertainty over meanings, rather than over token sequences, is a better predictor of answer correctness in free-form QA. It also asks whether this can be done in an unsupervised way with a single unmodified language model, rather than with extra training or ensembles. 

Method

The authors propose semantic entropy, which estimates uncertainty in three steps: sample multiple answers, cluster semantically equivalent ones using bidirectional textual entailment, and compute entropy over the resulting meaning clusters. The generator is a pretrained OPT model, and the semantic-equivalence checker is a DeBERTa-large NLI model fine-tuned on MNLI. They evaluate on CoQA and TriviaQA, score answer correctness with ROUGE-L-based matching, and assess uncertainty quality with AUROC.
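The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_entails` is a hypothetical string-matching stand-in for the NLI model, the answers and log-probabilities are made up, and normalizing cluster mass over the sampled answers is a simplification chosen for this sketch.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_meaning(
    answers: Sequence[str],
    entails: Callable[[str, str], bool],
) -> List[List[int]]:
    """Group answer indices; an answer joins a cluster if entailment holds both ways."""
    clusters: List[List[int]] = []
    for i, answer in enumerate(answers):
        for cluster in clusters:
            rep = answers[cluster[0]]  # compare against the cluster's first member
            if entails(rep, answer) and entails(answer, rep):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster matched: start a new meaning
    return clusters


def semantic_entropy(
    log_probs: Sequence[float],
    clusters: Sequence[Sequence[int]],
) -> float:
    """Entropy over meaning clusters; cluster mass is the sum of member probabilities."""
    masses = [sum(math.exp(log_probs[i]) for i in c) for c in clusters]
    total = sum(masses)  # assumption: normalize over the sampled answers only
    probs = [m / total for m in masses]
    return -sum(p * math.log(p) for p in probs if p > 0)


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an NLI model: exact match after light normalization."""
    def key(s: str) -> str:
        return s.lower().replace("it's ", "").strip(" .")
    return key(premise) == key(hypothesis)


# Made-up samples and sequence log-probabilities for one question.
answers = ["Paris", "It's Paris.", "Rome"]
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]

clusters = cluster_by_meaning(answers, toy_entails)
entropy = semantic_entropy(log_probs, clusters)
print(clusters, round(entropy, 4))
```

In this toy case the two phrasings of "Paris" fall into one cluster, so their probability mass is pooled before the entropy is computed. An entropy over the raw strings would treat them as competing answers and report more uncertainty.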

Main findings

Semantic entropy predicts whether the model's answer is correct better than the baselines it is compared against: predictive entropy, length-normalized predictive entropy (in key settings), lexical similarity, and p(True). The gains grow with model size. Incorrect answers are associated with a larger number of semantically distinct answer clusters, which supports the core interpretation of the measure. The method also makes better use of additional samples, and the best uncertainty estimates come from an intermediate sampling temperature that balances answer diversity against answer accuracy.
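Since uncertainty quality here is assessed with AUROC, a small sketch may help show what that number measures: the probability that a randomly chosen incorrect answer receives a higher uncertainty score than a randomly chosen correct one. The function name `auroc` and the scores and labels below are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], is_incorrect: Sequence[bool]) -> float:
    """Fraction of (incorrect, correct) pairs where the incorrect answer scores higher.

    Ties count as half a win. A value of 0.5 means the score carries no signal.
    """
    pos = [s for s, y in zip(scores, is_incorrect) if y]
    neg = [s for s, y in zip(scores, is_incorrect) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative uncertainty scores and correctness labels (hypothetical data).
uncertainty = [0.1, 0.9, 0.4, 0.7, 0.2]
incorrect = [False, True, False, True, True]

score = auroc(uncertainty, incorrect)
print(round(score, 3))
```

The pairwise loop is quadratic in the number of answers, which is fine for a sketch. A rank-based formulation would be preferable for large evaluation sets.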

Gaps and next steps

The paper only partially addresses issues like unequal token importance, and the authors note that semantic entropy can still overweight non-keyword tokens. Their experiments are concentrated on question answering, where correctness can be checked relatively cleanly, so the method has not yet been fully validated on harder generation tasks like summarization. They suggest future work on better paraphrase and semantic-equivalence detection in those settings, and on extending semantic likelihoods to other probabilistic uncertainty tools such as mutual information. 
