Nakkiran, P., et al. (2025, November 6). arXiv.org.
Abstract
Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of "B-calibration," which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
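To make the sampling-based notion of semantic calibration concrete, the sketch below is a purely illustrative Python rendering, not the paper's exact procedure: sample several answers to a question, group them into semantic equivalence classes, and read off a confidence as the empirical frequency of the class of the reported answer. Here `sample_answers` and `same_meaning` are hypothetical stand-ins for an LLM sampler and a semantic-equivalence judge.

```python
"""
Illustrative sketch (assumed details, not taken from the paper) of a
sampling-based semantic confidence: confidence = empirical frequency of
the sampled answer's semantic class among k samples.
"""
from collections import Counter
from typing import Callable, List, Tuple


def semantic_classes(answers: List[str],
                     same_meaning: Callable[[str, str], bool]) -> List[int]:
    """Greedily group sampled answers into semantic equivalence classes."""
    reps: List[str] = []   # one representative answer per class
    labels: List[int] = []
    for ans in answers:
        for idx, rep in enumerate(reps):
            if same_meaning(ans, rep):
                labels.append(idx)
                break
        else:
            reps.append(ans)
            labels.append(len(reps) - 1)
    return labels


def confidence_and_correctness(question: str,
                               reference: str,
                               sample_answers: Callable[[str, int], List[str]],
                               same_meaning: Callable[[str, str], bool],
                               k: int = 20) -> Tuple[float, bool]:
    """Sample k answers; treat the first sample as the model's response.
    Confidence is the share of samples agreeing with it in meaning;
    correctness is whether it matches the reference answer in meaning."""
    answers = sample_answers(question, k)
    labels = semantic_classes(answers, same_meaning)
    counts = Counter(labels)
    confidence = counts[labels[0]] / k
    correct = same_meaning(answers[0], reference)
    return confidence, correct
```

Aggregating these (confidence, correctness) pairs over many questions is what allows calibration to be checked empirically, as sketched after the summary below.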
Summary
This paper is crucial because it demonstrates that large language models (LLMs) develop a form of emergent metacognition: the ability to know what they know. Surprisingly, base models trained only to predict the next word become semantically calibrated: when they are 80% confident in an answer's meaning, they are correct about 80% of the time. This self-monitoring arises implicitly from the training objective, much like a complex cognitive ability emerging from a simple underlying task. The calibration is fragile, however: it is systematically broken by instruction-tuning, which makes models overconfident (like a student rewarded for sounding certain), and by chain-of-thought reasoning, where the final answer remains uncertain until the reasoning is complete, so the model cannot predict its own answer distribution in advance. For psychologists, this provides a powerful model for studying how self-monitoring and confidence can arise from, and be distorted by, different learning objectives and cognitive demands.
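The "80% confident, right 80% of the time" claim is exactly what a reliability check measures. The minimal sketch below (illustrative, with assumed binning details) buckets (confidence, correct) pairs such as those produced by the earlier sketch and compares average confidence with empirical accuracy per bucket, along with an expected calibration error.

```python
# Minimal reliability-diagram sketch: bucket (confidence, correct) pairs and
# compare average confidence with empirical accuracy in each bucket.
# Perfect semantic calibration means 0.8-confidence answers are right ~80% of the time.
from typing import List, Tuple


def reliability_bins(results: List[Tuple[float, bool]],
                     n_bins: int = 10):
    """Return (avg_confidence, accuracy, count) per non-empty bin,
    plus the expected calibration error (ECE)."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in results:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, correct))
    summary, ece, total = [], 0.0, len(results)
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        acc = sum(ok for _, ok in bucket) / len(bucket)
        summary.append((avg_conf, acc, len(bucket)))
        ece += (len(bucket) / total) * abs(avg_conf - acc)
    return summary, ece
```

Running such a check on a base model versus its instruction-tuned counterpart, or with and without chain-of-thought, is one way the paper's predictions (2) and (3) could be probed, though the authors' own experimental protocol may differ in its details.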
