Abstract
Hallucinations remain a major obstacle for large language models (LLMs), especially in safety-critical domains. We present HALT (Hallucination Assessment via Log-probs as Time series), a lightweight hallucination detector that treats only the top-20 token log-probabilities of an LLM generation as a time series. HALT combines a gated recurrent unit (GRU) model with entropy-based features to learn model calibration bias, providing a highly efficient alternative to large encoders. Unlike white-box approaches, HALT requires no access to hidden states or attention maps, relying only on output log-probabilities. Unlike black-box approaches, it operates on log-probs rather than surface-form text, which enables stronger domain generalization and compatibility with proprietary LLMs without access to internal weights. To benchmark performance, we introduce HUB (Hallucination detection Unified Benchmark), which consolidates prior datasets into ten capabilities covering both reasoning tasks (Algorithmic, Commonsense, Mathematical, Symbolic, Code Generation) and general-purpose skills (Chat, Data-to-Text, Question Answering, Summarization, World Knowledge). Despite being 30x smaller, HALT outperforms Lettuce, a fine-tuned ModernBERT-base encoder, while delivering a 60x inference speedup on HUB. Together, HALT and HUB establish an effective framework for hallucination detection across diverse LLM capabilities.
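The architecture described above (a GRU over per-token top-20 log-probabilities, augmented with entropy-based features, feeding a hallucination classifier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `HALTSketch`, the hidden size, the single per-step entropy feature, and the linear classification head are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class HALTSketch(nn.Module):
    """Hypothetical HALT-style detector: a GRU reads the sequence of
    top-k token log-prob vectors (plus an entropy feature per step)
    and a linear head emits a hallucination logit."""

    def __init__(self, top_k: int = 20, hidden: int = 64):
        super().__init__()
        # Per-step input: top-k log-probs plus one entropy feature.
        self.gru = nn.GRU(top_k + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # hallucination logit

    def forward(self, logprobs: torch.Tensor) -> torch.Tensor:
        # logprobs: (batch, seq_len, top_k) token log-probabilities.
        probs = logprobs.exp()
        # Entropy over the top-k candidates at each generation step
        # (an assumption; the paper's exact features may differ).
        entropy = -(probs * logprobs).sum(-1, keepdim=True)
        x = torch.cat([logprobs, entropy], dim=-1)
        _, h = self.gru(x)       # final hidden state summarizes the series
        return self.head(h[-1])  # (batch, 1) logit

# Usage: score 2 generations of 10 tokens with random top-20 log-probs.
model = HALTSketch()
fake_logprobs = torch.log_softmax(torch.randn(2, 10, 20), dim=-1)
scores = model(fake_logprobs)
print(scores.shape)  # torch.Size([2, 1])
```

Because the only input is the log-prob matrix, such a detector runs on any API that returns top-k log-probabilities, with no access to the generator's weights.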
URL
https://arxiv.org/abs/2602.02888