Abstract
Post-training is essential for adapting pre-trained language models (PLMs) to human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue that arises when the PLM's and PoLM's confidences are aligned via temperature scaling on examples where their predictions disagree. We show theoretically that, on such disagreement examples, the PLM's confidence underestimates the PoLM's prediction accuracy, inflating the estimated $\tau$ and producing under-confident predictions. DACA mitigates this by using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids the overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-source and API-based LLMs (e.g., GPT-4o) by up to 15.08$\%$ on common benchmarks.
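The sketch below illustrates the core mechanism the abstract describes: fitting a single temperature for the PoLM on unlabeled inputs by aligning its confidence with the PLM's, restricted to agreement examples. This is a minimal illustration under assumed inputs (class logits from both models on the same unlabeled set); the function name `daca_temperature` and the mean-squared confidence-matching objective are illustrative choices, not necessarily the exact formulation in the paper.

```python
# Minimal sketch of disagreement-aware temperature scaling, assuming access to
# logits from a well-calibrated PLM and an over-confident PoLM on unlabeled
# inputs. The alignment objective (mean-squared confidence matching on
# agreement examples) is an assumption for illustration, not the paper's loss.
import numpy as np
from scipy.optimize import minimize_scalar


def softmax(logits, tau=1.0):
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def daca_temperature(polm_logits, plm_logits):
    """Fit a temperature for the PoLM using only agreement examples.

    polm_logits, plm_logits: arrays of shape (n_examples, n_classes) produced
    by the post-trained and pre-trained models on the same unlabeled inputs.
    """
    # Keep only examples where the two models predict the same class,
    # decoupling the influence of disagreement examples on the fitted tau.
    agree = polm_logits.argmax(axis=-1) == plm_logits.argmax(axis=-1)

    plm_conf = softmax(plm_logits[agree]).max(axis=-1)

    def gap(tau):
        polm_conf = softmax(polm_logits[agree], tau).max(axis=-1)
        return np.mean((polm_conf - plm_conf) ** 2)

    # Search a bounded range of temperatures for the best confidence alignment.
    res = minimize_scalar(gap, bounds=(0.05, 10.0), method="bounded")
    return res.x


# Usage: tau = daca_temperature(polm_logits, plm_logits)
#        calibrated_probs = softmax(polm_logits, tau)
```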
URL
https://arxiv.org/abs/2505.16690