Abstract
Despite recent progress in long-context language models, it remains unclear how transformer-based models acquire the capability to retrieve relevant information from arbitrary locations within a long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special class of attention heads is largely responsible for retrieving information; we dub these retrieval heads. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained with short context, and when the context length is extended by continual pretraining, the same set of heads still performs information retrieval; (4) dynamically activated: in Llama-2 7B, for example, 12 retrieval heads always attend to the required information no matter how the context is changed, while the remaining retrieval heads are activated in different contexts; (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and the previously generated context. Conversely, tasks where the model directly generates the answer from its intrinsic knowledge are less impacted by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
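To make the notion of a retrieval head concrete, the following is a minimal sketch of one plausible way to score a head's retrieval behavior: the fraction of decoding steps at which the head's top attention weight lands inside the "needle" span (the relevant information hidden in the context). The function name, array shapes, and threshold are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def retrieval_score(attn, needle_positions, generated_steps):
    """
    Hypothetical per-head retrieval score: the fraction of decoding
    steps at which this head's top attention weight falls on a token
    inside the needle span. `attn` has shape (num_steps, context_len),
    one attention row per generated token.
    """
    hits = 0
    for step in generated_steps:
        top = int(np.argmax(attn[step]))  # token this head attends to most
        if top in needle_positions:
            hits += 1
    return hits / len(generated_steps)

# Toy example: 4 decoding steps over a 10-token context,
# with the needle occupying positions {3, 4}.
attn = np.zeros((4, 10))
attn[0, 3] = 1.0   # attends to the needle
attn[1, 4] = 1.0   # attends to the needle
attn[2, 7] = 1.0   # attends elsewhere
attn[3, 3] = 1.0   # attends to the needle

score = retrieval_score(attn, {3, 4}, range(4))
print(score)  # 0.75
```

Under a scoring scheme like this, heads with consistently high scores across many needle placements would be flagged as retrieval heads, while masking them (zeroing their attention output) would be the corresponding causal intervention.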
URL
https://arxiv.org/abs/2404.15574