Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator
TL;DR Summary
This paper addresses over-confidence in post-trained LLMs (PoLMs) by proposing DACA, an unsupervised post-hoc calibration method. DACA aligns PoLM confidence with its pre-trained counterpart, crucially using only samples where their predictions agree. This prevents the under-confidence that disagreement examples would otherwise induce and improves average ECE by up to 15.08% on common benchmarks, for both open-source and API-based LLMs such as GPT-4o.
Abstract
Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates the PoLM's prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g., GPT-4o) by up to 15.08% on common benchmarks.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator
- Authors: Beier Luo, Shuoyuan Wang, Hongxin Wei (Southern University of Science and Technology), and Yixuan Li (University of Wisconsin-Madison).
- Journal/Conference: The paper is available on arXiv, a preprint server for academic articles. This indicates it has not yet undergone formal peer review for a conference or journal but is being shared with the research community.
- Publication Year: The first version was submitted to arXiv on May 28, 2025 (consistent with the arXiv identifier 2505.16690).
- Abstract: The paper addresses the problem of over-confidence in post-trained large language models (PoLMs), which contrasts with the typically well-calibrated confidence of their pre-trained counterparts (PLMs). Calibrating PoLMs is challenging due to the scarcity of labeled data for specific downstream tasks. The authors propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised post-hoc calibration method. DACA leverages the well-calibrated PLM as a reference. The key insight is that naively aligning the confidence of the PoLM to the PLM fails because of "disagreement examples" (where the models predict different outcomes), which can lead to under-confidence. DACA resolves this by only using "agreement examples" for calibration. This simple yet effective approach avoids an overly large temperature parameter during scaling. Experiments show DACA significantly improves the Expected Calibration Error (ECE) of both open-source and API-based LLMs like GPT-4o by up to 15.08%.
- Original Source Link: https://arxiv.org/abs/2505.16690v1
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Modern large language models undergo a two-stage process: pre-training on vast text corpora, followed by post-training (like instruction tuning or RLHF) to align them with human preferences. While post-training makes models more helpful, it often corrupts their ability to estimate their own correctness, leading to overconfidence. An overconfident model assigns high probability to its answers, whether they are right or wrong, making it unreliable for critical applications like medical diagnosis or financial advice.
- Existing Gaps: Traditional methods to fix this, known as confidence calibration, typically require a set of labeled examples (questions with correct answers) to tune the model's confidence scores. However, creating such labeled datasets for every specific task is expensive and time-consuming.
- Fresh Angle: The paper proposes a paradigm shift: instead of relying on scarce labeled data, can we use abundant unlabeled data? The key idea is to leverage the fact that the original Pre-trained Language Model (PLM) is often naturally well-calibrated. The paper explores using the PLM as a free, built-in "calibrator" for its overconfident, post-trained descendant (PoLM).
- Main Contributions / Findings (What):
- A Novel Unsupervised Calibration Framework: The paper is the first to propose a post-hoc calibration method for LLMs that works entirely with unlabeled data, by aligning the PoLM's confidence with a well-calibrated PLM.
- Theoretical Insight into Disagreement: The authors theoretically demonstrate why a naive alignment approach fails. When the PLM and PoLM disagree on a prediction, the PLM's confidence score is a poor proxy for the PoLM's accuracy. This "prediction disagreement" pushes the calibration process to make the PoLM under-confident.
- Disagreement-Aware Confidence Alignment (DACA): To solve this, they introduce DACA, a simple but powerful method. DACA identifies samples where the PLM and PoLM agree on the prediction and performs confidence alignment only on this subset. This decouples the calibration process from the harmful influence of disagreement examples (a minimal code sketch of this pipeline appears after this list).
- Strong Empirical Validation: DACA is shown to be highly effective across a wide range of open-source models (Llama, Gemma, Qwen) and even closed-source, API-based models (GPT-4o). It significantly reduces calibration error, often performing on par with or better than supervised methods that use limited labeled data. It is also shown to be applicable to open-ended question answering and to benefit downstream tasks like selective classification.
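To make the recipe concrete, here is a minimal Python sketch of the DACA idea, assuming you already have the PoLM's raw logits and the PLM's probabilities for a batch of unlabeled multiple-choice questions. The grid-search fit and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def daca_temperature(polm_logits, plm_probs, taus=np.linspace(0.5, 10.0, 200)):
    """Illustrative sketch: fit a temperature for the post-trained model (PoLM)
    by matching its scaled probabilities to the pre-trained model (PLM)
    on agreement examples only, via a simple grid search."""
    agree = polm_logits.argmax(axis=1) == plm_probs.argmax(axis=1)  # agreement mask
    logits, targets = polm_logits[agree], plm_probs[agree]

    def avg_kl(tau):
        z = logits / tau
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)  # temperature-scaled PoLM probabilities
        # KL(PLM || scaled PoLM), averaged over the agreement examples
        return np.mean(np.sum(targets * (np.log(targets + 1e-12) - np.log(p + 1e-12)), axis=1))

    return min(taus, key=avg_kl)  # temperature with the smallest average KL
```

At inference time, calibrated confidences are simply softmax(logits / tau); because the temperature only rescales the logits, the PoLM's predictions (and hence accuracy) are unchanged, and only its confidence is adjusted.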
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Pre-trained vs. Post-trained LLMs (PLM vs. PoLM):
- A Pre-trained Language Model (PLM) is an LLM that has only undergone the initial, unsupervised training phase on a massive dataset of text (e.g., the internet). Its primary task is to predict the next word in a sequence. These models (e.g., Llama-3-8B) are generally found to be well-calibrated.
- A Post-trained Language Model (PoLM) is a PLM that has been further fine-tuned using techniques like Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or Direct Preference Optimization (DPO). This makes them better at following instructions and engaging in dialogue (e.g., Llama-3-8B-Instruct). This post-training process, however, often leads to overconfidence.
- Confidence Calibration: A model is perfectly calibrated if its predicted confidence in an answer equals the actual probability of that answer being correct. For instance, if a model provides 100 answers, each with 80% confidence, a perfectly calibrated model would have exactly 80 of those answers be correct.
- Post-hoc Calibration: These are methods that adjust a model's confidence scores after it has been fully trained, without changing its weights. They are computationally cheap and practical.
- Temperature Scaling (TS): A popular post-hoc method. It divides the model's output logits (raw scores before the final probability calculation) by a single learnable parameter called the temperature ($\tau$). A temperature $\tau > 1$ "softens" the probability distribution, making the model less confident, while $\tau < 1$ makes it more confident. Finding the optimal $\tau$ traditionally requires a labeled dataset (see the short demo after this list).
- Kullback-Leibler (KL) Divergence: A measure from information theory that quantifies how one probability distribution differs from a second, reference probability distribution. A KL divergence of zero means the distributions are identical. DACA uses it to measure the difference between the PLM's and the PoLM's confidence distributions.
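A tiny, self-contained demo of these two building blocks (temperature scaling and KL divergence), using made-up logits purely for illustration; this is a minimal sketch, not code from the paper.

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    """Temperature scaling: divide logits by tau before the softmax.
    tau > 1 flattens the distribution (less confident), tau < 1 sharpens it."""
    z = np.asarray(logits, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how much distribution q differs from the reference p; zero iff identical."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

logits = [4.0, 1.0, 0.5, 0.2]                        # toy logits for a 4-choice question
for tau in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau}: confidence of top choice = {probs.max():.2f}")
# The top-choice confidence shrinks toward 1/4 as tau grows; the argmax never changes.

reference = softmax_with_temperature(logits, 3.0)    # pretend this is a well-calibrated reference
print("KL to reference at tau=1:", round(kl_divergence(reference, softmax_with_temperature(logits, 1.0)), 3))
print("KL to reference at tau=3:", round(kl_divergence(reference, softmax_with_temperature(logits, 3.0)), 3))
```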
- Previous Works & Differentiation:
- Prior studies like Achiam et al. (2023) and Zhu et al. (2023) established the core problem: post-training, especially RLHF, degrades LLM calibration, causing overconfidence. The paper confirms this observation in Figure 1.
- Existing calibration methods were mostly supervised (requiring labels), such as Temperature Scaling (Guo et al., 2017), or relied on specific prompting strategies like CAPE (Jiang et al., 2023) and Elicitation (Tian et al., 2023), which ask the model to verbalize its confidence.
- DACA is fundamentally different. It is the first post-hoc method that is fully unsupervised, leveraging an architectural prior (the PLM) rather than data labels or complex prompting. This makes it highly flexible and scalable, especially in scenarios where labeled data is unavailable.
Figure 1 from the paper (resource images/1.jpg). This figure shows reliability diagrams for four LLM families. The top row shows the well-calibrated PLMs (low ECE, accuracy bars close to the diagonal line). The bottom row shows their post-trained counterparts, which are clearly overconfident (high ECE, accuracy bars significantly below the diagonal), visually confirming the paper's core motivation.
4. Methodology (Core Technology & Implementation)
- Principles: The central idea is to use the well-calibrated PLM as a "teacher" to calibrate the overconfident PoLM. The "lesson" is taught by forcing the PoLM's output probability distribution to match the PLM's on unlabeled data. However, this only works when the teacher and student are "on the same page", that is, when they agree on the answer.
- Steps & Procedures:
- The Naive Approach (and its Failure): An intuitive first step is to calibrate the PoLM by minimizing the KL divergence between its temperature-scaled output distribution $P_{\text{po}}^{\tau}(\cdot \mid x)$ and the PLM's distribution $P_{\text{pre}}(\cdot \mid x)$ over an unlabeled dataset $\mathcal{D}_u$. The objective would be to find the optimal temperature $\tau^*$:
$$\tau^* = \arg\min_{\tau} \; \frac{1}{|\mathcal{D}_u|} \sum_{x \in \mathcal{D}_u} \mathrm{KL}\!\left( P_{\text{pre}}(\cdot \mid x) \,\big\|\, P_{\text{po}}^{\tau}(\cdot \mid x) \right)$$
- Explanation: This formula aims to find a temperature that makes the PoLM's probability for each choice $c$, denoted $P_{\text{po}}^{\tau}(c \mid x)$, as close as possible to the PLM's probability, $P_{\text{pre}}(c \mid x)$.
- The Problem of Disagreement: The authors identify that post-training causes the PoLM to predict a different answer from the PLM on some examples; they term these disagreement examples. On these examples, the PLM's confidence attaches to its own (often incorrect) prediction, not to the PoLM's (often correct) prediction. Aligning to this low confidence score forces the PoLM to become under-confident.
- Proposition 3.3 formalizes this: on a disagreement example where the PLM assigns the PoLM's predicted class a probability lower than uniform (below $1/C$ for $C$ answer choices), the optimization will push the temperature toward infinity. Because temperature scaling preserves the ranking of the logits, the PoLM's top class always retains probability at least $1/C$ at any finite $\tau$, so such a below-uniform target can only be approached in the limit $\tau \to \infty$. An infinite temperature results in a uniform probability distribution, indicating maximum uncertainty (i.e., extreme under-confidence).
Figure 2 from the paper (resource images/2.jpg). Part (a) shows reliability diagrams illustrating that naive confidence alignment results in under-confidence (accuracy is higher than confidence). Part (b) empirically supports Proposition 3.3, showing that when training temperature scaling only on disagreement examples (red line), the temperature grows uncontrollably. In contrast, on agreement examples (green line), it converges to a stable, reasonable value.
- The DACA Solution: To fix this, DACA modifies the loss function to completely ignore disagreement examples. It introduces an indicator function that acts as a filter. The new loss function is:
$$\mathcal{L}(\tau) = \frac{1}{|\mathcal{D}_u|} \sum_{x \in \mathcal{D}_u} \mathbb{1}\!\left[\hat{y}_{\text{pre}}(x) = \hat{y}_{\text{po}}(x)\right] \cdot \left[ \sum_{c} P_{\text{pre}}(c \mid x) \log \frac{P_{\text{pre}}(c \mid x)}{P_{\text{po}}^{\tau}(c \mid x)} \right]$$
- Symbol Explanation:
- $\hat{y}_{\text{pre}}(x)$: The final prediction of the PLM ($\arg\max_{c} P_{\text{pre}}(c \mid x)$).
- $\hat{y}_{\text{po}}(x)$: The final prediction of the PoLM ($\arg\max_{c} P_{\text{po}}(c \mid x)$).
- $\mathbb{1}[\hat{y}_{\text{pre}}(x) = \hat{y}_{\text{po}}(x)]$: An indicator function. It equals 1 if the PLM and PoLM make the same prediction (an "agreement example") and 0 otherwise (a "disagreement example").
- The term in brackets is the KL divergence from the naive approach.
- How it Works: The indicator function multiplies the loss by 0 for all disagreement examples, effectively removing their influence on the optimization of $\tau$. The temperature is therefore learned only from examples where the PLM's confidence is a reliable target for the PoLM's confidence (see the sketch below).
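The masked objective above is straightforward to implement. Below is a minimal PyTorch sketch that fits $\tau$ by gradient descent on precomputed PoLM logits and PLM probabilities; the function name, optimizer settings, and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch

def fit_daca_temperature(polm_logits, plm_probs, steps=200, lr=0.05):
    """Illustrative sketch of the disagreement-aware objective: optimize a single
    temperature so the PoLM's scaled distribution matches the PLM's distribution,
    using agreement examples only."""
    polm_logits = torch.as_tensor(polm_logits, dtype=torch.float32)   # shape (N, C)
    plm_probs = torch.as_tensor(plm_probs, dtype=torch.float32)       # shape (N, C)

    # Indicator: 1 on agreement examples, 0 on disagreement examples.
    agree = (polm_logits.argmax(dim=1) == plm_probs.argmax(dim=1)).float()

    log_tau = torch.zeros(1, requires_grad=True)        # parameterize tau > 0 as exp(log_tau)
    opt = torch.optim.Adam([log_tau], lr=lr)
    for _ in range(steps):
        tau = log_tau.exp()
        log_q = torch.log_softmax(polm_logits / tau, dim=1)           # scaled PoLM log-probs
        # KL(PLM || scaled PoLM) per example, masked by the agreement indicator.
        kl = (plm_probs * (torch.log(plm_probs + 1e-12) - log_q)).sum(dim=1)
        loss = (agree * kl).sum() / agree.sum().clamp(min=1.0)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_tau.exp().item()
```

Normalizing by the number of agreement examples rather than by $|\mathcal{D}_u|$ only rescales the loss and does not change the optimal temperature; calibrated confidences are then obtained from softmax(polm_logits / tau).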
5. Experimental Setup
- Datasets:
- MMLU (Massive Multitask Language Understanding): A diverse benchmark covering 57 subjects in STEM, humanities, and social sciences, formatted as multiple-choice questions.
- MedMCQA: A large-scale, multiple-choice QA dataset focused on the medical domain.
- MathQA: A dataset of mathematical word problems, also in a multiple-choice format.
- TruthfulQA: An open-ended QA dataset designed to test a model's ability to avoid generating common falsehoods.
- Evaluation Metrics:
- Expected Calibration Error (ECE):
- Conceptual Definition: ECE measures the difference between a model's average confidence and its average accuracy. It works by grouping predictions into bins based on their confidence scores (e.g., 0-10%, 10-20%, ..., 90-100%). For each bin, it calculates the difference between the average confidence and the actual accuracy. ECE is the weighted average of these differences. A lower ECE means better calibration. (A short code sketch of these metrics appears at the end of this section.)
- Mathematical Formula:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$
- Symbol Explanation: $n$ is the total number of samples, $M$ is the number of bins, $B_m$ is the set of samples in the $m$-th bin, $|B_m|$ is its size, $\mathrm{acc}(B_m)$ is the accuracy of samples in bin $B_m$, and $\mathrm{conf}(B_m)$ is the average confidence in bin $B_m$.
- Maximum Calibration Error (MCE):
- Conceptual Definition: MCE identifies the worst-case miscalibration. It is simply the maximum difference between accuracy and confidence across all bins, indicating the largest single point of calibration failure.
- Mathematical Formula:
$$\mathrm{MCE} = \max_{m \in \{1, \dots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$
- Adaptive ECE (AECE):
- Conceptual Definition: A variation of ECE that creates bins with an equal number of samples, rather than fixed-width confidence intervals. This prevents bins from being empty or sparsely populated, providing a more robust estimate, especially when confidence scores are clustered.
- Brier Score:
- Conceptual Definition: Measures the mean squared error between the predicted probabilities and the actual outcomes. It simultaneously rewards both accuracy and good calibration. Lower is better.
- Mathematical Formula: For a single sample in a multi-class problem, where $K$ is the number of classes, $\hat{p}_k$ is the predicted probability for class $k$, and $y_k$ is the one-hot encoding of the true label:
$$\mathrm{Brier} = \frac{1}{K} \sum_{k=1}^{K} \left( \hat{p}_k - y_k \right)^2$$
- Baselines:
- Vanilla: The original, uncalibrated PoLM.
- Temperature Scaling (TS): A supervised baseline that uses labeled data to find the optimal temperature. It serves as a strong reference point.
- CAPE: An unsupervised method that calibrates by permuting the order of multiple-choice options to mitigate positional biases.
- Elicitation & Elicitation-Ensemble: Unsupervised methods that prompt the model to verbalize its confidence (e.g., "I am 80% sure the answer is...").
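For reference, here is a compact sketch of how ECE, MCE, and a top-label Brier score can be computed from confidences and correctness labels using equal-width bins (AECE would instead use equal-count bins). This is a hypothetical helper for illustration, not the paper's evaluation code.

```python
import numpy as np

def calibration_metrics(confidences, correct, n_bins=10):
    """Compute ECE and MCE with equal-width confidence bins, plus a top-label
    Brier score. `confidences` are the model's max probabilities in [0, 1];
    `correct` marks whether each prediction was right."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += (in_bin.sum() / n) * gap      # weighted average gap -> ECE
            mce = max(mce, gap)                  # worst-case gap -> MCE

    brier = np.mean((confidences - correct) ** 2)  # top-label Brier score
    return ece, mce, brier

# Toy usage: an over-confident model (very high confidence, ~70% accuracy).
rng = np.random.default_rng(0)
conf = rng.uniform(0.85, 1.0, size=1000)
acc = rng.uniform(0.0, 1.0, size=1000) < 0.7
print(calibration_metrics(conf, acc))
```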
6. Results & Analysis
- Core Results:
- DACA Significantly Improves Calibration: Table 1 shows DACA's performance on MMLU across four different LLMs. In all cases, DACA (Ours) dramatically reduces ECE, MCE, and other metrics, bringing them to levels comparable to or even better than the supervised TS baseline. For instance, for Gemma-3-12B-Instruct, DACA reduces ECE from 23.68% to 8.60%, while supervised TS only achieves 9.75%. This is a powerful demonstration of its effectiveness without labels.
(Manual Transcription of Table 1)
Average calibration performance across 57 MMLU subjects for several contemporary PoLMs. "Vanilla" refers to the uncalibrated model. † indicates calibration methods with access to labels. Best results are shown in bold, and the second-best results are presented in italics.
| Models | Methods | ECE %(↓) | MCE %(↓) | AECE %(↓) | Brier Score(↓) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-8B | Vanilla | 16.383±0.433 | 38.190±1.547 | 24.990±0.667 | 0.179±0.003 |
| | CAPE | 11.524±0.091 | 31.741±0.152 | 17.614±0.048 | 0.157±0.001 |
| | Elicitation | 16.774±0.214 | 66.884±16.785 | 27.568±2.897 | |
| | Elicitation-Ensemble | 16.475±0.407 | 44.991±11.249 | 20.515±2.394 | |
| | Ours | 8.393±0.228 | 23.700±1.374 | 12.601±0.617 | 0.144±0.001 |
| | TS† | 8.655±0.220 | 28.108±1.730 | 14.547±0.666 | 0.146±0.001 |
| Gemma-3-12B-Instruct | Vanilla | 23.679±0.525 | 48.506±1.584 | 35.886±1.257 | 0.235±0.005 |
| | CAPE | 13.906±0.209 | 32.830±0.700 | 19.278±0.377 | 0.168±0.001 |
| | Elicitation | 25.464±0.877 | 76.000±15.487 | 41.485±3.731 | |
| | Elicitation-Ensemble | 25.417±0.244 | 42.017±10.256 | 32.221±1.987 | |
| | Ours | 8.596±0.380 | 27.022±3.335 | 13.551±0.804 | 0.154±0.002 |
| | TS† | 9.746±0.364 | 29.804±2.750 | 15.604±0.859 | 0.159±0.003 |
| Yi-1.5-34B-Chat | Vanilla | 16.200±0.554 | 33.819±1.452 | 20.353±0.664 | 0.199±0.005 |
| | CAPE | 10.251±0.289 | 22.759±0.665 | 13.121±0.012 | 0.179±0.001 |
| | Elicitation | 27.152±6.513 | 83.000±8.000 | 49.211±9.379 | |
| | Elicitation-Ensemble | 23.954±7.487 | 61.487±11.487 | 39.259±3.049 | |
| | Ours | 9.465±0.174 | 19.898±1.082 | 11.700±0.411 | 0.174±0.004 |
| | TS† | 8.592±0.170 | 28.599±1.377 | 12.553±0.378 | 0.173±0.004 |
| Llama-3-70B-Instruct | Vanilla | 12.870±0.483 | 36.873±1.415 | 23.837±0.760 | 0.143±0.003 |
| | CAPE | 9.346±0.122 | 30.903±1.498 | 17.681±0.172 | 0.125±0.001 |
| | Elicitation | 11.227±0.113 | 60.000±14.142 | 21.237±1.036 | |
| | Elicitation-Ensemble | 16.632±0.068 | 70.066±28.774 | 21.790±6.976 | |
| | Ours | 7.844±0.418 | 24.275±1.285 | 13.158±0.488 | 0.120±0.001 |
| | TS† | 8.360±0.283 | 27.366±1.778 | 14.928±0.686 | 0.126±0.002 |
Figure 3 from the paper (resource images/3.jpg). This figure shows that DACA (Ours, blue bars) consistently and significantly improves calibration (lowers ECE) across models of varying sizes and architectures, compared to the Vanilla model (orange) and the CAPE baseline (green).
- DACA is Agnostic to the PLM Choice: Table 2 shows a key result for practicality. DACA can calibrate the powerful, closed-source GPT-4o using much smaller, open-source PLMs like Llama-3-8B. The results are consistently strong regardless of which PLM is used. For example, using Gemma-3-12B as the calibrator reduces GPT-4o's ECE from 21.23% to 6.99%. This makes DACA highly efficient for calibrating even the largest proprietary models.
(Manual Transcription of Table 2)
Calibration performance of DACA for GPT-4o using various pre-trained models on MedMCQA. "Vanilla" refers to the uncalibrated model. ECE* represents the original ECE of the pre-trained models themselves. Best results are shown in bold.
| Methods | Pre-trained Models | ECE* % | ECE %(↓) | MCE %(↓) | AECE %(↓) | Brier Score(↓) |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla | - | - | 21.231±0.296 | 35.218±4.260 | 27.619±1.661 | 0.216±0.003 |
| Ours | Llama-3-8B | 9.450±0.777 | 7.984±0.397 | 10.640±0.413 | 6.879±0.737 | 0.150±0.001 |
| Ours | Qwen2.5-7B | 6.990±0.102 | 7.816±0.215 | 10.467±0.42 | 6.751±0.763 | 0.150±0.001 |
| Ours | Gemma-3-12B | 4.424±0.696 | 6.993±0.490 | 10.057±0.115 | 6.115±0.787 | 0.148±0.002 |
- Effectiveness Across Post-Training Strategies and Tasks: Table 3 shows that DACA works well regardless of the post-training method (SFT, DPO, etc.). Furthermore, Figure 4 shows its successful application to open-ended QA on the TruthfulQA dataset.
Figure 4 from the paper (resource images/4.jpg). This bar chart demonstrates DACA's effectiveness on the open-ended TruthfulQA benchmark. For all model sizes in both the Qwen and Llama families, Ours (blue bars) achieves a much lower ECE than the Vanilla model (orange bars), confirming its applicability beyond multiple-choice tasks.
- Benefits for Selective Classification: Well-calibrated models are more trustworthy. Figure 5 demonstrates this by showing that for predictions where the DACA-calibrated model is highly confident (e.g., >90%), the accuracy is also very high. This allows a system to reliably abstain from answering when confidence is low, improving the overall accuracy of the predictions it does make (a short sketch of this thresholding appears after the figure description below).
Figure 5 from the paper (resource images/5.jpg). This plot shows accuracy as a function of the confidence threshold for selective classification. The red line (Ours) shows a strong positive correlation: as confidence increases, accuracy increases proportionally. The blue line (Vanilla) shows much lower accuracy at all confidence levels, indicating its confidence scores are unreliable for decision-making.
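Selective classification with a calibrated model reduces to a simple threshold rule: answer only when confidence exceeds a cutoff, otherwise abstain. The following is a minimal illustrative sketch with synthetic data; the threshold values and names are assumptions, not taken from the paper.

```python
import numpy as np

def selective_accuracy(confidences, correct, threshold):
    """Accuracy computed only on the predictions the model chooses to answer,
    i.e., those whose confidence is at least the threshold; the rest are abstentions."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    answered = confidences >= threshold
    if not answered.any():
        return float("nan"), 0.0                 # abstained on everything
    coverage = answered.mean()                   # fraction of questions actually answered
    return correct[answered].mean(), coverage

# Toy data: confidences paired with correctness that rises with confidence
# (i.e., calibrated by construction).
rng = np.random.default_rng(1)
conf = rng.uniform(0.25, 1.0, size=2000)
is_correct = rng.uniform(0.0, 1.0, size=2000) < conf

for t in (0.5, 0.7, 0.9):
    acc, cov = selective_accuracy(conf, is_correct, t)
    print(f"threshold={t}: accuracy={acc:.2f}, coverage={cov:.2f}")
# With well-calibrated confidences, raising the threshold raises accuracy on the
# answered subset, at the cost of answering fewer questions.
```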
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces DACA, a pioneering unsupervised post-hoc calibration method for LLMs. By cleverly leveraging the well-calibrated nature of PLMs and addressing the theoretical pitfalls of prediction disagreement, DACA provides a simple, efficient, and highly effective way to fix the overconfidence problem in PoLMs without needing any labeled data. Its strong performance on a wide variety of models and tasks highlights its practical value.
- Limitations & Future Work:
- Computational Cost: DACA requires running inference with a second model (the PLM) on the unlabeled data, which doubles the computational cost of the calibration step (though this is a one-time cost per task).
- Data Reduction: By filtering out disagreement examples, DACA reduces the amount of data used for calibration. While unlabeled data is generally plentiful, this could be a factor in data-scarce domains.
- Future Work: The authors suggest that the disagreement examples, which are currently discarded, might contain valuable information that could be leveraged to further improve calibration.
- Personal Insights & Critique:
- Elegance and Practicality: The core idea of DACA is exceptionally elegant. It reframes a data-hungry problem (calibration) into a data-free one by exploiting the inherent properties of the LLM training lifecycle. The ability to use small, open-source PLMs to calibrate large, proprietary PoLMs is a significant practical breakthrough.
- Underlying Assumption: The method's success hinges on the assumption that the chosen PLM is well-calibrated for the target task's domain. While generally true, this may not hold for highly specialized domains where the PLM's pre-training data was sparse. The performance of DACA is upper-bounded by the quality of the reference PLM.
- Potential for Broader Application: The concept of using a "base" model to regularize or guide a "fine-tuned" model based on agreement could be extended to other areas beyond calibration, such as mitigating catastrophic forgetting during continual learning or detecting out-of-distribution inputs. The "disagreement set" itself is a fascinating artifact that could be used to analyze what a model learns (or unlearns) during post-training.