Abstract
Detecting hallucinations in large language models (LLMs) remains a fundamental challenge for their trustworthy deployment. Going beyond basic uncertainty-driven hallucination detection frameworks, we propose a simple yet powerful method that quantifies uncertainty by measuring the effective rank of hidden states drawn from multiple model outputs and from different layers. Grounded in the spectral analysis of representations, our approach provides interpretable insight into the model's internal reasoning through its semantic variations, while requiring no external knowledge or additional modules, combining theoretical elegance with practical efficiency. We also theoretically demonstrate the necessity of quantifying uncertainty both internally (across the representations of a single response) and externally (across different responses), justifying the use of representations from different layers and from multiple responses to detect hallucinations. Extensive experiments show that our method detects hallucinations effectively and generalizes robustly across diverse scenarios, contributing a new paradigm of hallucination detection for LLM truthfulness.
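The sketch below is not the paper's implementation; it only illustrates one standard definition of effective rank (the exponential of the entropy of the normalized singular-value spectrum) applied to a matrix of hidden states collected from several sampled responses at a single layer, the kind of quantity the abstract describes. The sampling setup, layer choice, and aggregation across layers are assumptions for illustration.

```python
# Illustrative sketch only: effective rank of a stack of layer hidden states,
# where higher effective rank reflects greater spread (uncertainty) across
# sampled responses. Not the authors' code.
import numpy as np

def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """hidden_states: (num_responses, hidden_dim) activations from one layer."""
    # Center rows so the spectrum captures variation across responses.
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)  # normalized spectrum
    entropy = -(p * np.log(p + eps)).sum()                # Shannon entropy
    return float(np.exp(entropy))                         # exp(entropy) = effective rank

# Hypothetical usage: stack the final-token hidden state of K sampled answers
# at one layer and score uncertainty by the effective rank of that matrix.
K, d = 8, 4096
states = np.random.randn(K, d)  # placeholder for real LLM hidden states
print(effective_rank(states))
```

A per-layer score like this could then be combined across layers (e.g., averaged), consistent with the abstract's use of representations from multiple layers, though the exact aggregation is not specified here.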
URL
https://arxiv.org/abs/2510.08389