Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care

Wednesday, December 17, 2025

Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

Nakkiran, P., et al. (2025, November 6). Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs. arXiv.org.

Abstract

Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of "B-calibration," which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
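
To make the abstract's "sampling-based notion of semantic calibration" concrete, here is a minimal sketch of how such a confidence score can be computed. This is an illustration, not the authors' code: `generate` and `same_meaning` are hypothetical stand-ins for an LLM sampling call and a semantic-equivalence judge.

```python
def semantic_confidence(question, generate, same_meaning, n_samples=20):
    """Estimate confidence in the *meaning* of an answer by sampling.

    generate(question) -> one sampled answer string (stand-in for an LLM call)
    same_meaning(a, b) -> True if the two answers express the same meaning
    """
    samples = [generate(question) for _ in range(n_samples)]

    # Greedily group samples into semantic equivalence classes.
    classes = []  # each entry is a list of answers sharing one meaning
    for answer in samples:
        for cls in classes:
            if same_meaning(answer, cls[0]):
                cls.append(answer)
                break
        else:
            classes.append([answer])

    # Report the majority-meaning answer; its sample frequency serves as
    # the model's semantic confidence in that answer.
    majority = max(classes, key=len)
    return majority[0], len(majority) / n_samples
```

Under this notion, a model is semantically calibrated if, across many questions, answers reported with confidence p turn out to be correct about a p fraction of the time.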

Here is a summary:

This paper is crucial because it demonstrates that large language models (LLMs) develop a form of emergent metacognition: the ability to know what they know. Surprisingly, base models trained only to predict the next token become semantically calibrated: when they are 80% confident in an answer's meaning, they are correct about 80% of the time. This self-monitoring is never trained for directly; it arises as a byproduct of next-token prediction, much like a complex cognitive ability emerging from a simple underlying task. The calibration is fragile, however. RL instruction-tuning systematically breaks it, leaving models overconfident (like a student rewarded for sounding certain), and chain-of-thought reasoning breaks it too, because the model cannot anticipate its own final answer until the reasoning chain is complete. For psychologists, this provides a powerful model for studying how self-monitoring and confidence can arise from, and be distorted by, different learning objectives and cognitive demands.
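
The "80% confident, correct 80% of the time" property can be quantified with a standard binned reliability computation (expected calibration error). The sketch below is self-contained with toy inputs and is not tied to the paper's experiments:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], one per question
    correct:     bools, whether each answer was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece  # 0.0 means perfectly calibrated

# A calibrated model's 0.8-confidence answers are right ~80% of the time:
print(expected_calibration_error([0.8, 0.8, 0.8, 0.8, 0.8],
                                 [True, True, True, True, False]))  # -> 0.0
```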