Abstract
Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text, a phenomenon known as hallucination that undermines their reliability and utility in downstream tasks. We hypothesize that the factuality of a generated text span is correlated with the instability of its representations across the model's internal layers. Based on this hypothesis, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by monitoring these signals in the middle layers. We introduce two metrics that quantify this mid-layer instability and use them to penalize outputs exhibiting high internal confusion, steering the model toward more internally consistent and factually grounded generations. We further propose a self-information-gated variant, CoCoA-SIG, which dynamically modulates the penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question answering, summarization, and code generation, demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.
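The core mechanism described above, penalizing candidate tokens whose middle-layer representations are unstable, with the SIG variant gating the penalty by the token's self-information, can be sketched as follows. This is a minimal illustrative sketch only: the function names, the cosine-dissimilarity instability measure, and the gating threshold are assumptions, not the paper's actual metrics or implementation.

```python
import math

def cosine(a, b):
    # Cosine similarity between two hidden-state vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def instability(hidden_states):
    # Mean cosine dissimilarity between consecutive middle-layer
    # hidden states at one token position (hypothetical stand-in
    # for real transformer activations).
    sims = [cosine(a, b) for a, b in zip(hidden_states, hidden_states[1:])]
    return 1.0 - sum(sims) / len(sims)

def cocoa_score(logprob, hidden_states, lam=1.0, sig=False, gate=2.0):
    # Adjusted decoding score: the token's log-probability minus a
    # penalty proportional to its mid-layer instability. The SIG
    # variant applies the penalty only to high-surprise candidates,
    # where surprise is the token's self-information, -log p.
    penalty = lam * instability(hidden_states)
    if sig and -logprob < gate:
        penalty = 0.0  # low-surprise token: leave its score untouched
    return logprob - penalty
```

In a real decoder, `hidden_states` would come from the model's middle layers at each decoding step, and candidates would be reranked by `cocoa_score` instead of raw log-probability.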
URL
https://arxiv.org/abs/2602.09486