Paper status: completed

Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

Published: 05/22/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper addresses over-confidence in post-trained LLMs (PoLMs) by proposing DACA, an unsupervised post-hoc calibration method. DACA aligns PoLM confidence with its pre-trained counterpart, crucially using only samples where their predictions agree. This prevents under-confidence caused by prediction disagreement, avoids an overly large temperature, and improves average ECE by up to 15.08% on common benchmarks.

Abstract

Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates the PoLM's prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g., GPT-4o) by up to 15.08% on common benchmarks.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator
  • Authors: Beier Luo, Shuoyuan Wang, Hongxin Wei (Southern University of Science and Technology), and Yixuan Li (University of Wisconsin-Madison).
  • Journal/Conference: The paper is available on arXiv, a preprint server for academic articles. This indicates it has not yet undergone formal peer review for a conference or journal but is being shared with the research community.
  • Publication Year: The first version (v1) was submitted to arXiv on May 22, 2025 (arXiv:2505.16690).
  • Abstract: The paper addresses the problem of over-confidence in post-trained large language models (PoLMs), which contrasts with the typically well-calibrated confidence of their pre-trained counterparts (PLMs). Calibrating PoLMs is challenging due to the scarcity of labeled data for specific downstream tasks. The authors propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised post-hoc calibration method. DACA leverages the well-calibrated PLM as a reference. The key insight is that naively aligning the confidence of the PoLM to the PLM fails because of "disagreement examples" (where the models predict different outcomes), which can lead to under-confidence. DACA resolves this by only using "agreement examples" for calibration. This simple yet effective approach avoids an overly large temperature parameter during scaling. Experiments show DACA significantly improves the Expected Calibration Error (ECE) of both open-source and API-based LLMs like GPT-4o by up to 15.08%.
  • Original Source Link: https://arxiv.org/abs/2505.16690v1

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Modern large language models undergo a two-stage process: pre-training on vast text corpora, followed by post-training (like instruction tuning or RLHF) to align them with human preferences. While post-training makes models more helpful, it often corrupts their ability to estimate their own correctness, leading to overconfidence. An overconfident model assigns high probability to its answers, whether they are right or wrong, making it unreliable for critical applications like medical diagnosis or financial advice.
    • Existing Gaps: Traditional methods to fix this, known as confidence calibration, typically require a set of labeled examples (questions with correct answers) to tune the model's confidence scores. However, creating such labeled datasets for every specific task is expensive and time-consuming.
    • Fresh Angle: The paper proposes a paradigm shift: instead of relying on scarce labeled data, can we use abundant unlabeled data? The key idea is to leverage the fact that the original Pre-trained Language Model (PLM) is often naturally well-calibrated. The paper explores using the PLM as a free, built-in "calibrator" for its overconfident, post-trained descendant (PoLM).
  • Main Contributions / Findings (What):

    1. A Novel Unsupervised Calibration Framework: The paper is the first to propose a post-hoc calibration method for LLMs that works entirely with unlabeled data, by aligning the PoLM's confidence with a well-calibrated PLM.
    2. Theoretical Insight into Disagreement: The authors theoretically demonstrate why a naive alignment approach fails. When the PLM and PoLM disagree on a prediction, the PLM's confidence score is a poor proxy for the PoLM's accuracy. This "prediction disagreement" pushes the calibration process to make the PoLM under-confident.
    3. Disagreement-Aware Confidence Alignment (DACA): To solve this, they introduce DACA, a simple but powerful method. DACA identifies samples where the PLM and PoLM agree on the prediction and performs confidence alignment only on this subset. This decouples the calibration process from the harmful influence of disagreement examples.
    4. Strong Empirical Validation: DACA is shown to be highly effective across a wide range of open-source models (Llama, Gemma, Qwen) and even closed-source, API-based models (GPT-4o). It significantly reduces calibration error, often performing on par with or better than supervised methods that use limited labeled data. It is also shown to be applicable to open-ended question answering and to benefit downstream tasks like selective classification.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Pre-trained vs. Post-trained LLMs (PLM vs. PoLM):
      • A Pre-trained Language Model (PLM) is an LLM that has only undergone the initial, unsupervised training phase on a massive dataset of text (e.g., the internet). Its primary task is to predict the next word in a sequence. These models (e.g., Llama-3-8B) are generally found to be well-calibrated.
      • A Post-trained Language Model (PoLM) is a PLM that has been further fine-tuned using techniques like Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or Direct Preference Optimization (DPO). This makes them better at following instructions and engaging in dialogue (e.g., Llama-3-8B-Instruct). This post-training process, however, often leads to overconfidence.
    • Confidence Calibration: A model is perfectly calibrated if its predicted confidence in an answer equals the actual probability of that answer being correct. For instance, if a model provides 100 answers, each with 80% confidence, a perfectly calibrated model would have exactly 80 of those answers be correct.
    • Post-hoc Calibration: These are methods that adjust a model's confidence scores after it has been fully trained, without changing its weights. They are computationally cheap and practical.
    • Temperature Scaling (TS): A popular post-hoc method. It divides the model's output logits (raw scores before the final probability calculation) by a single learnable parameter called the temperature ($\tau$). A temperature $\tau > 1$ "softens" the probability distribution, making the model less confident, while $\tau < 1$ makes it more confident. Finding the optimal $\tau$ traditionally requires a labeled dataset (see the sketch after this list).
    • Kullback-Leibler (KL) Divergence: A measure from information theory that quantifies how one probability distribution differs from a second, reference probability distribution. A KL divergence of zero means the distributions are identical. DACA uses it to measure the difference between the PLM's and the PoLM's confidence distributions.
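
To make temperature scaling and KL divergence concrete, here is a minimal NumPy sketch (ours, not the paper's code; the logits are hypothetical):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; tau > 1 flattens, tau < 1 sharpens."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                       # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how much the distribution q deviates from the reference p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical logits for a 4-way multiple-choice question.
logits = np.array([4.0, 1.0, 0.5, 0.2])
print(softmax(logits, temperature=1.0))   # peaked, confident distribution
print(softmax(logits, temperature=2.5))   # tau > 1: flatter, less confident
```

The single temperature is the only parameter fitted during post-hoc calibration, which is why a small amount of (normally labeled) data usually suffices.
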
  • Previous Works & Differentiation:

    • Prior studies like Achiam et al. (2023) and Zhu et al. (2023) established the core problem: post-training, especially RLHF, degrades LLM calibration, causing overconfidence. The paper confirms this observation in Figure 1.

    • Existing calibration methods were mostly supervised (requiring labels), such as Temperature Scaling (Guo et al., 2017), or relied on specific prompting strategies like CAPE (Jiang et al., 2023) and Elicitation (Tian et al., 2023) which ask the model to verbalize its confidence.

    • DACA is fundamentally different. It is the first post-hoc method that is fully unsupervised, leveraging an architectural prior (the PLM) rather than data labels or complex prompting. This makes it highly flexible and scalable, especially for scenarios where labeled data is unavailable.

      Figure 1 from the paper (resource images/1.jpg): reliability diagrams for four LLM families (Llama-3-8B, Qwen-2.5-7B, DeepSeek-V2-Lite, Yi-1.5-6B). The top row shows the well-calibrated PLMs (low ECE, accuracy bars close to the diagonal line). The bottom row shows their post-trained counterparts, which are clearly overconfident (high ECE, accuracy bars significantly below the diagonal), visually confirming the paper's core motivation.

4. Methodology (Core Technology & Implementation)

  • Principles: The central idea is to use the well-calibrated PLM as a "teacher" to calibrate the overconfident PoLM. The "lesson" is taught by forcing the PoLM's output probability distribution to match the PLM's on unlabeled data. However, this only works when the teacher and student are "on the same page"—that is, when they agree on the answer.

  • Steps & Procedures:

    1. The Naive Approach (and its Failure): An intuitive first step is to calibrate the PoLM ($g$) by minimizing the KL divergence between its temperature-scaled output distribution and the PLM's ($f$) distribution over an unlabeled dataset $\mathcal{D}$. The objective would be to find the optimal temperature $\tau^*$:

       $$\tau^* = \arg\min_{\tau > 0}\; \mathbb{E}_{x \in \mathcal{D}} \left[ \sum_{i=1}^{k} p_i(x) \log \frac{p_i(x)}{\sigma_i(g(x)/\tau)} \right]$$

      • Explanation: This formula aims to find a temperature $\tau$ that makes the PoLM's probability for each choice $i$, denoted $\sigma_i(g(x)/\tau)$, as close as possible to the PLM's probability $p_i(x)$.
    2. The Problem of Disagreement: The authors identify that post-training causes the PoLM to predict a different answer from the PLM on some examples. They term these disagreement examples. On these examples, the PLM's confidence is for its own (often incorrect) prediction, not the PoLM's (often correct) prediction. Aligning to this low confidence score forces the PoLM to become under-confident.

        • Proposition 3.3 formalizes this: on a disagreement example where the PoLM predicts class $c$ but the PLM's probability for that class is low (specifically, less than the uniform value $1/k$), the optimization will push the temperature $\tau$ towards infinity. An infinite temperature results in a uniform probability distribution, indicating maximum uncertainty (i.e., extreme under-confidence).

        Figure 2 from the paper (resource images/2.jpg). Part (a) shows confidence vs. accuracy bar charts for Computer Security and College Chemistry tasks, illustrating that naive confidence alignment results in under-confidence (accuracy is higher than confidence). Part (b) empirically supports Proposition 3.3: when temperature scaling is trained only on disagreement examples (red line), the temperature $\tau$ grows uncontrollably, whereas on agreement examples (green line) it converges to a stable, reasonable value.

    3. The DACA Solution: To fix this, DACA modifies the loss function to completely ignore disagreement examples. It introduces an indicator function that acts as a filter. The new loss function is:

       $$\mathcal{L}(\tau; x) = \mathbf{1}\{\hat{y} = \hat{y}'\} \cdot \left[ \sum_{i=1}^{k} p_i(x) \log \frac{p_i(x)}{\sigma_i(g(x)/\tau)} \right]$$

      • Symbol Explanation:
        • $\hat{y} = \arg\max_i f_i(x)$: the final prediction of the PLM ($f$).
        • $\hat{y}' = \arg\max_i g_i(x)$: the final prediction of the PoLM ($g$).
        • $\mathbf{1}\{\hat{y} = \hat{y}'\}$: an indicator function that equals 1 if the PLM and PoLM make the same prediction (an "agreement example") and 0 otherwise (a "disagreement example").
        • The term in brackets is the KL divergence from the naive approach.
      • How it Works: The indicator function multiplies the loss by 0 for all disagreement examples, effectively removing their influence on the optimization of $\tau$. The temperature is therefore learned only from examples where the PLM's confidence is a reliable target for the PoLM's confidence (a minimal implementation sketch follows below).
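
Under the assumptions above (per-choice logits available from both models, temperature scaling as the calibration map), a minimal sketch of this disagreement-aware objective could look like the following; `fit_daca_temperature` and the toy data are illustrative, not the authors' released code:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max(axis=-1, keepdims=True)                  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_daca_temperature(plm_logits, polm_logits, eps=1e-12):
    """Fit a single temperature tau for the PoLM using agreement examples only.

    plm_logits, polm_logits: arrays of shape (n_examples, n_choices) holding the
    PLM (f) and PoLM (g) logits over the answer choices of each unlabeled question.
    """
    plm_probs = softmax(plm_logits)                               # reference p(x)
    agree = plm_logits.argmax(-1) == polm_logits.argmax(-1)       # 1{y_hat == y_hat'}
    if not agree.any():
        return 1.0                                                # nothing to align on

    def avg_kl(tau):
        q = softmax(polm_logits[agree], temperature=tau)          # sigma(g(x) / tau)
        kl = np.sum(plm_probs[agree] * np.log((plm_probs[agree] + eps) / (q + eps)), axis=-1)
        return kl.mean()                                          # KL averaged over the agreement set

    return minimize_scalar(avg_kl, bounds=(0.05, 10.0), method="bounded").x

# Toy usage: an "over-confident" PoLM whose logits are a scaled-up copy of the PLM's.
rng = np.random.default_rng(0)
plm_logits = rng.normal(size=(200, 4))
polm_logits = 3.0 * plm_logits + rng.normal(scale=0.3, size=(200, 4))
print(f"learned temperature: {fit_daca_temperature(plm_logits, polm_logits):.2f}")  # expect tau > 1
```

Dropping the `agree` mask reproduces the naive objective from Step 1, and including many disagreement rows is exactly what drives the fitted temperature upward.
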

5. Experimental Setup

  • Datasets:

    • MMLU (Massive Multitask Language Understanding): A diverse benchmark covering 57 subjects in STEM, humanities, and social sciences, formatted as multiple-choice questions.
    • MedMCQA: A large-scale, multiple-choice QA dataset focused on the medical domain.
    • MathQA: A dataset of mathematical word problems, also in a multiple-choice format.
    • TruthfulQA: An open-ended QA dataset designed to test a model's ability to avoid generating common falsehoods.
  • Evaluation Metrics:

    1. Expected Calibration Error (ECE):

      • Conceptual Definition: ECE measures the difference between a model's average confidence and its average accuracy. It works by grouping predictions into bins based on their confidence scores (e.g., 0-10%, 10-20%, ..., 90-100%). For each bin, it calculates the difference between the average confidence and the actual accuracy. ECE is the weighted average of these differences. A lower ECE means better calibration.
      • Mathematical Formula:

        $$\mathrm{ECE} = \sum_{g=1}^{G} \frac{|b_g|}{N} \left|\operatorname{acc}(b_g) - \operatorname{conf}(b_g)\right|$$
      • Symbol Explanation: $N$ is the total number of samples, $G$ is the number of bins, $b_g$ is the set of samples in the $g$-th bin, $|b_g|$ is its size, $\operatorname{acc}(b_g)$ is the accuracy of samples in bin $b_g$, and $\operatorname{conf}(b_g)$ is the average confidence in bin $b_g$. (A minimal implementation of ECE and the Brier score appears after this metrics list.)
    2. Maximum Calibration Error (MCE):

      • Conceptual Definition: MCE identifies the worst-case miscalibration. It is simply the maximum difference between accuracy and confidence across all bins, indicating the largest single point of calibration failure.
      • Mathematical Formula:

        $$\mathrm{MCE} = \max_{g \in \{1, \dots, G\}} \left|\operatorname{acc}(b_g) - \operatorname{conf}(b_g)\right|$$
    3. Adaptive ECE (AECE):

      • Conceptual Definition: A variation of ECE that creates bins with an equal number of samples, rather than fixed-width confidence intervals. This prevents bins from being empty or sparsely populated, providing a more robust estimate, especially when confidence scores are clustered.
    4. Brier Score:

      • Conceptual Definition: Measures the mean squared error between the predicted probabilities and the actual outcomes. It simultaneously rewards both accuracy and good calibration. Lower is better.
      • Mathematical Formula: For a multi-class problem with $k$ classes, where $o_n$ is the one-hot vector of the true label for sample $n$ and the score is averaged over $N$ samples:

        $$\text{Brier Score} = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{k} (p_{ni} - o_{ni})^2$$
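
As a reference for how these metrics are typically computed, here is a short sketch of equal-width-bin ECE and the Brier score; it mirrors the formulas above rather than the paper's exact evaluation code:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap            # weight is |b_g| / N
    return ece

def brier_score(probs, labels):
    """Mean squared error between predicted class probabilities and one-hot labels."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return float(((probs - onehot) ** 2).sum(axis=1).mean())
```

MCE replaces the weighted sum in `expected_calibration_error` with a maximum over bins, and AECE uses equal-mass bins (e.g., quantiles of the confidence scores) instead of equal-width ones.
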
  • Baselines:

    • Vanilla: The original, uncalibrated PoLM.
    • Temperature Scaling (TS): A supervised baseline that uses labeled data to find the optimal temperature. It serves as a strong reference point.
    • CAPE: An unsupervised method that calibrates by permuting the order of multiple-choice options to mitigate positional biases.
    • Elicitation & Elicitation-Ensemble: Unsupervised methods that prompt the model to verbalize its confidence (e.g., "I am 80% sure the answer is...").

6. Results & Analysis

  • Core Results:

    • DACA Significantly Improves Calibration: Table 1 shows DACA's performance on MMLU across four different LLMs. In all cases, DACA (Ours) dramatically reduces ECE, MCE, and other metrics, bringing them to levels comparable to or even better than the supervised TS baseline. For instance, for Gemma-3-12B-Instruct, DACA reduces ECE from 23.68% to 8.60%, while supervised TS only achieves 9.75%. This is a powerful demonstration of its effectiveness without labels.

      (Manual Transcription of Table 1)

    Average calibration performance across 57 MMLU subjects for several contemporary PoLMs. "Vanilla" refers to the uncalibrated model. † indicates calibration methods with access to labels. Best results are shown in bold, and the second-best results are presented in italics.

    | Models | Methods | ECE %(↓) | MCE %(↓) | AECE %(↓) | Brier Score (↓) |
    |---|---|---|---|---|---|
    | Qwen3-8B | Vanilla | 16.383±0.433 | 38.190±1.547 | 24.990±0.667 | 0.179±0.003 |
    | | CAPE | 11.524±0.091 | 31.741±0.152 | 17.614±0.048 | 0.157±0.001 |
    | | Elicitation | 16.774±0.214 | 66.884±16.785 | 27.568±2.897 | – |
    | | Elicitation-Ensemble | 16.475±0.407 | 44.991±11.249 | 20.515±2.394 | – |
    | | Ours | 8.393±0.228 | 23.700±1.374 | 12.601±0.617 | 0.144±0.001 |
    | | TS† | 8.655±0.220 | 28.108±1.730 | 14.547±0.666 | 0.146±0.001 |
    | Gemma-3-12B-Instruct | Vanilla | 23.679±0.525 | 48.506±1.584 | 35.886±1.257 | 0.235±0.005 |
    | | CAPE | 13.906±0.209 | 32.830±0.700 | 19.278±0.377 | 0.168±0.001 |
    | | Elicitation | 25.464±0.877 | 76.000±15.487 | 41.485±3.731 | – |
    | | Elicitation-Ensemble | 25.417±0.244 | 42.017±10.256 | 32.221±1.987 | – |
    | | Ours | 8.596±0.380 | 27.022±3.335 | 13.551±0.804 | 0.154±0.002 |
    | | TS† | 9.746±0.364 | 29.804±2.750 | 15.604±0.859 | 0.159±0.003 |
    | Yi-1.5-34B-Chat | Vanilla | 16.200±0.554 | 33.819±1.452 | 20.353±0.664 | 0.199±0.005 |
    | | CAPE | 10.251±0.289 | 22.759±0.665 | 13.121±0.012 | 0.179±0.001 |
    | | Elicitation | 27.152±6.513 | 83.000±8.000 | 49.211±9.379 | – |
    | | Elicitation-Ensemble | 23.954±7.487 | 61.487±11.487 | 39.259±3.049 | – |
    | | Ours | 9.465±0.174 | 19.898±1.082 | 11.700±0.411 | 0.174±0.004 |
    | | TS† | 8.592±0.170 | 28.599±1.377 | 12.553±0.378 | 0.173±0.004 |
    | Llama-3-70B-Instruct | Vanilla | 12.870±0.483 | 36.873±1.415 | 23.837±0.760 | 0.143±0.003 |
    | | CAPE | 9.346±0.122 | 30.903±1.498 | 17.681±0.172 | 0.125±0.001 |
    | | Elicitation | 11.227±0.113 | 60.000±14.142 | 21.237±1.036 | – |
    | | Elicitation-Ensemble | 16.632±0.068 | 70.066±28.774 | 21.790±6.976 | – |
    | | Ours | 7.844±0.418 | 24.275±1.285 | 13.158±0.488 | 0.120±0.001 |
    | | TS† | 8.360±0.283 | 27.366±1.778 | 14.928±0.686 | 0.126±0.002 |

    Figure 3 from the paper (resource images/3.jpg). Bar charts comparing the ECE of three methods (Vanilla, CAPE, and Ours) across model scales, with one panel per model family (Gemma-3, Yi-1.5, and others). DACA (Ours, blue bars) consistently and significantly improves calibration (lowers ECE) across models of varying sizes and architectures, compared to the Vanilla model (orange) and the CAPE baseline (green).

    • DACA is Agnostic to the PLM Choice: Table 2 shows a key result for practicality. DACA can calibrate the powerful, closed-source GPT-4o using much smaller, open-source PLMs like Llama-3-8B. The results are consistently strong regardless of which PLM is used. For example, using Gemma-3-12B as the calibrator reduces GPT-4o's ECE from 21.23% to 6.99%. This makes DACA highly efficient for calibrating even the largest proprietary models.

      (Manual Transcription of Table 2)

    Calibration performance of DACA for GPT-4o using various pre-trained models on MedMCQA. "Vanilla" refers to the uncalibrated model. ECE* represents the original ECE of the pre-trained models. Best results are shown in bold.

    | Methods | Pre-trained Models | ECE* % | ECE %(↓) | MCE %(↓) | AECE %(↓) | Brier Score (↓) |
    |---|---|---|---|---|---|---|
    | Vanilla | – | – | 21.231±0.296 | 35.218±4.260 | 27.619±1.661 | 0.216±0.003 |
    | Ours | Llama-3-8B | 9.450±0.777 | 7.984±0.397 | 10.640±0.413 | 6.879±0.737 | 0.150±0.001 |
    | | Qwen2.5-7B | 6.990±0.102 | 7.816±0.215 | 10.467±0.42 | 6.751±0.763 | 0.150±0.001 |
    | | Gemma-3-12B | 4.424±0.696 | 6.993±0.490 | 10.057±0.115 | 6.115±0.787 | 0.148±0.002 |
    • Effectiveness Across Post-Training Strategies and Tasks: Table 3 shows that DACA works well regardless of the post-training method (SFT, DPO, etc.). Furthermore, Figure 4 shows its successful application to open-ended QA on the TruthfulQA dataset.

      Figure 4 from the paper (resource images/4.jpg). This bar chart demonstrates DACA's effectiveness on the open-ended TruthfulQA benchmark across the Qwen-2.5-Instruct family (7B, 14B, 32B, 72B) and the Llama-3-Instruct family (8B, 70B). For all model sizes in both families, Ours (blue bars) achieves a much lower ECE than the Vanilla model (orange bars), confirming its applicability beyond multiple-choice tasks.

    • Benefits for Selective Classification: Well-calibrated models are more trustworthy. Figure 5 demonstrates this by showing that for predictions where the DACA-calibrated model is highly confident (e.g., >90%), the accuracy is also very high. This allows a system to reliably abstain from answering when confidence is low, improving the overall accuracy of the predictions it does make (a minimal abstention sketch follows the figure below).

      Figure 5 from the paper (resource images/5.jpg). This plot shows accuracy as a function of the confidence threshold for selective classification on Meta-Llama-3-70B-Instruct and Qwen2.5-72B-Instruct. The red line (Ours) shows a strong positive correlation: as confidence increases, accuracy increases proportionally. The blue line (Vanilla) shows much lower accuracy at all confidence levels, indicating its confidence scores are unreliable for decision-making.
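
As an illustration of the abstention mechanism described above, a minimal thresholding sketch (ours, not the paper's) could look like:

```python
import numpy as np

def selective_accuracy(confidences, correct, threshold=0.9):
    """Answer only when confidence >= threshold, abstain otherwise.

    Returns (accuracy on answered questions, coverage = fraction answered)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    answered = confidences >= threshold
    coverage = float(answered.mean())
    accuracy = float(correct[answered].mean()) if answered.any() else float("nan")
    return accuracy, coverage
```

With well-calibrated confidences, raising the threshold trades coverage for accuracy in a predictable way, which is the behavior Figure 5 reports for the DACA-calibrated models.
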

7. Conclusion & Reflections

  • Conclusion Summary: The paper introduces DACA, a pioneering unsupervised post-hoc calibration method for LLMs. By cleverly leveraging the well-calibrated nature of PLMs and addressing the theoretical pitfalls of prediction disagreement, DACA provides a simple, efficient, and highly effective way to fix the overconfidence problem in PoLMs without needing any labeled data. Its strong performance on a wide variety of models and tasks highlights its practical value.

  • Limitations & Future Work:

    • Computational Cost: DACA requires running inference with a second model (the PLM) on the unlabeled data, which doubles the computational cost of the calibration step (though this is a one-time cost per task).
    • Data Reduction: By filtering out disagreement examples, DACA reduces the amount of data used for calibration. While unlabeled data is generally plentiful, this could be a factor in data-scarce domains.
    • Future Work: The authors suggest that the disagreement examples, which are currently discarded, might contain valuable information that could be leveraged to further improve calibration.
  • Personal Insights & Critique:

    • Elegance and Practicality: The core idea of DACA is exceptionally elegant. It reframes a data-hungry problem (calibration) into a data-free one by exploiting the inherent properties of the LLM training lifecycle. The ability to use small, open-source PLMs to calibrate large, proprietary PoLMs is a significant practical breakthrough.
    • Underlying Assumption: The method's success hinges on the assumption that the chosen PLM is well-calibrated for the target task's domain. While generally true, this may not hold for highly specialized domains where the PLM's pre-training data was sparse. The performance of DACA is upper-bounded by the quality of the reference PLM.
    • Potential for Broader Application: The concept of using a "base" model to regularize or guide a "fine-tuned" model based on agreement could be extended to other areas beyond calibration, such as mitigating catastrophic forgetting during continual learning or detecting out-of-distribution inputs. The "disagreement set" itself is a fascinating artifact that could be used to analyze what a model learns (or unlearns) during post-training.
