RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models

2025-06-18 15:18:07
Bailin Wang, Chang Lan, Chong Wang, Ruoming Pang

Abstract

Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches.
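
The mechanism described in the abstract (exact softmax attention inside a sliding window, plus a linear-attention summary of the tokens that have fallen out of that window) can be illustrated with a minimal sketch. The function name, the positive feature map, the running-state update, and the way the two branches are merged are illustrative assumptions, not the paper's actual design; the real model uses a specialized kernel and presumably a learned gate rather than a plain sum.

```python
# Minimal, non-batched sketch of the RATTENTION idea: sliding-window softmax
# attention combined with a linear-attention state over out-of-window tokens.
import torch
import torch.nn.functional as F

def rattention_sketch(q, k, v, window: int = 512):
    """q, k, v: (seq_len, d) tensors; causal, single head, loop form for clarity."""
    seq_len, d = q.shape
    scale = d ** -0.5

    def phi(x):  # positive feature map, as in standard linear attention
        return F.elu(x) + 1.0

    kv_state = q.new_zeros(d, d)   # sum of outer(phi(k_j), v_j) over out-of-window tokens
    k_state = q.new_zeros(d)       # sum of phi(k_j) over out-of-window tokens (normalizer)
    out = torch.zeros_like(v)
    for t in range(seq_len):
        # The token that just left the sliding window is folded into the recurrent state.
        if t - window >= 0:
            j = t - window
            kv_state += torch.outer(phi(k[j]), v[j])
            k_state += phi(k[j])
        # Exact softmax attention over the local window [t - window + 1, t].
        start = max(0, t - window + 1)
        scores = (q[t] @ k[start:t + 1].T) * scale
        local = F.softmax(scores, dim=-1) @ v[start:t + 1]
        # Linear-attention read of the out-of-window summary
        # (zero until the window first overflows).
        global_read = (phi(q[t]) @ kv_state) / (phi(q[t]) @ k_state + 1e-6)
        # Assumed merge: plain sum; the actual model likely learns a gate.
        out[t] = local + global_read
    return out

# Example: 1024-token sequence, 64-dim head, window of 512.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
y = rattention_sketch(q, k, v, window=512)   # (1024, 64)
```

Because the out-of-window read goes through a fixed-size recurrent state, its per-token cost is independent of context length, which is what lets the softmax window shrink to 512 without the model losing access to earlier tokens.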

URL

https://arxiv.org/abs/2506.15545

PDF

https://arxiv.org/pdf/2506.15545.pdf

