Abstract
While transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach integrates seamlessly with specialized data structures and the system's memory hierarchy, enabling the processing of arbitrarily long contexts. We demonstrate that our method achieves performance comparable to Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity up to 32k tokens, allowing for inference at the context lengths of GPT-4.
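The core retrieval idea described above can be sketched in a few lines: each block of keys is summarized by one landmark key, the query attends to the landmarks to pick the most relevant blocks, and ordinary attention then runs only over the tokens of the selected blocks. This is a simplified illustration, not the paper's actual implementation (which trains the landmarks jointly with a grouped softmax); the function name, the plain dot-product block scoring, and the top-k selection rule are assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def landmark_block_attention(q, keys, values, landmarks, top_k=2):
    """Illustrative retrieval step for a single query vector.

    keys, values: lists of (block_len, d) arrays, one per block.
    landmarks: (num_blocks, d) array, one summary key per block.
    """
    # Score each block by the query's similarity to its landmark key.
    block_scores = landmarks @ q                      # (num_blocks,)
    selected = np.argsort(block_scores)[-top_k:]      # indices of top-k blocks
    # Gather the tokens of the chosen blocks and attend over them normally;
    # all other blocks are never touched, which is where the memory saving
    # comes from.
    k_sel = np.concatenate([keys[b] for b in selected], axis=0)
    v_sel = np.concatenate([values[b] for b in selected], axis=0)
    weights = softmax(k_sel @ q / np.sqrt(q.shape[0]))
    return weights @ v_sel                            # (d,)
```

Because block selection is driven by the same query/key dot products as attention itself, the retrieved blocks are, by construction, the ones attention would weight most, rather than the output of an external retriever.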
URL
https://arxiv.org/abs/2305.16300