Paper Reading AI Learner

Landmark Attention: Random-Access Infinite Context Length for Transformers

2023-05-25 17:53:42
Amirkeivan Mohtashami, Martin Jaggi

Abstract

While transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach seamlessly integrates with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. We demonstrate that our method can obtain performance comparable to Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity up to 32k tokens, allowing for inference at the context lengths of GPT-4.
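
To make the block-retrieval idea concrete, below is a minimal PyTorch sketch of scoring blocks through their landmark keys, keeping the top-k blocks, and running ordinary attention over only the retrieved tokens. It is an illustration of the mechanism described in the abstract, not the paper's implementation: the landmark keys are stand-ins (block means instead of trained landmark tokens), the grouped softmax used during training is omitted, and names such as `landmark_retrieval_attention` are invented for this example.

```python
import torch
import torch.nn.functional as F

def landmark_retrieval_attention(q, keys, values, landmark_keys, block_size, k=2):
    """Attend over only the top-k blocks selected via their landmark keys.

    q:             (d,)  query vector for the current token
    keys, values:  (n_blocks * block_size, d)  cached keys/values of earlier blocks
    landmark_keys: (n_blocks, d)  one representative key per block
    """
    d = q.shape[-1]
    scale = d ** 0.5

    # Score each block through its landmark key; keep the k most relevant blocks.
    block_scores = landmark_keys @ q / scale               # (n_blocks,)
    top_blocks = block_scores.topk(k).indices              # (k,)

    # Gather the token-level keys/values belonging to the retrieved blocks.
    token_idx = (top_blocks[:, None] * block_size
                 + torch.arange(block_size)).reshape(-1)   # (k * block_size,)
    k_sel, v_sel = keys[token_idx], values[token_idx]

    # Ordinary softmax attention, restricted to the retrieved tokens.
    attn = F.softmax(k_sel @ q / scale, dim=-1)            # (k * block_size,)
    return attn @ v_sel                                     # (d,)

# Toy usage: 8 cached blocks of 16 tokens each, 64-dimensional heads.
# The landmark keys here are simply block means of the keys -- a stand-in for
# the trained landmark tokens described in the paper.
d, block_size, n_blocks = 64, 16, 8
keys = torch.randn(n_blocks * block_size, d)
values = torch.randn(n_blocks * block_size, d)
landmark_keys = keys.reshape(n_blocks, block_size, d).mean(dim=1)
out = landmark_retrieval_attention(torch.randn(d), keys, values, landmark_keys, block_size)
print(out.shape)  # torch.Size([64])
```

Because retrieval happens through the same attention scores the model already computes, the block cache can live in slower storage and only the selected blocks need to be loaded per step, which is what allows the context to grow far beyond what fits in attention memory.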

Abstract (translated)

While transformers have achieved remarkable success in natural language processing, the large memory requirements of their attention mechanism have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, either sacrifice the random-access flexibility of attention (i.e., the ability to select any token in the entire context) or rely on a separate mechanism to retrieve relevant context, which may not be compatible with the model's attention. In this paper, we propose a new approach that provides access to the complete context while retaining random-access flexibility, closely resembling running attention over the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it to select relevant blocks, so that blocks are retrieved directly through the attention mechanism rather than through a separate mechanism. Our approach integrates seamlessly with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. We show that our method achieves performance comparable to Transformer-XL while significantly reducing the number of tokens retrieved at each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity to 32k tokens, allowing inference at the context lengths of GPT-4.

URL

https://arxiv.org/abs/2305.16300

PDF

https://arxiv.org/pdf/2305.16300.pdf
