Abstract
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.
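As a concrete illustration of the token-level probing that the abstract alludes to, the sketch below trains a linear classifier on a hidden state taken at an answer token to predict whether the generated answer is correct. This is not the paper's exact recipe: the model name, probing layer, and token position are assumptions chosen for demonstration only.

```python
# Minimal sketch of a truthfulness probe on internal representations.
# Assumptions (not from the paper): model choice, layer index, and the use of
# the final answer token as the "specific token" whose hidden state is probed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 16  # assumed probing layer; in practice selected on a validation set

def answer_token_state(prompt: str, answer: str) -> torch.Tensor:
    """Hidden state at the last token of the model's answer."""
    ids = tok(prompt + answer, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, hidden_dim]
    return out.hidden_states[LAYER][0, -1]

# Features: one vector per (question, generated answer) pair.
# Labels: 1 if the answer was judged correct, 0 otherwise.
# X = torch.stack([answer_token_state(q, a) for q, a in qa_pairs]).numpy()
# probe = LogisticRegression(max_iter=1000).fit(X, y_correct)
# probe.predict_proba(X_test)[:, 1]  # per-answer truthfulness score
```

The design choice of probing at an answer token rather than, say, the last prompt token reflects the abstract's claim that truthfulness information is concentrated in specific tokens; which token and layer work best would need to be validated empirically.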
URL
https://arxiv.org/abs/2410.02707