Paper Reading AI Learner

Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing

2025-12-15 09:04:06
Zewen Qiang, Sendong Zhao, Haochun Wang, Bing Qin, Ting Liu

Abstract

Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long-text sequences due to the "lost in the middle" phenomenon. This issue has been shown to arise from a U-shaped attention bias, in which attention is disproportionately concentrated on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our research is the first to identify an additional factor: initial saliency. That is, in the attention computation for each token, tokens with higher attention weights relative to the initial token tend to receive more attention when predicting the next token. We further find that exploiting this property by scaling the attention weight between the initial token and the others improves the model's ability to process long contexts, yielding an improvement of up to 3.6% on the MDQA dataset. Moreover, combining this approach with existing methods that reduce position-encoding bias further boosts performance, yielding an improvement of up to 3.4% on KV-Retrieval tasks.
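The abstract's core intervention is scaling the attention weight between the initial token and the other tokens. The paper's exact scaling rule is not given on this page, so the following is only a minimal NumPy sketch under assumed semantics: multiply the initial token's post-softmax weight by a factor `alpha` and renormalize the row. The function names and the choice of `alpha` are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def rescale_initial_token(attn_weights, alpha):
    """Hypothetical sketch: scale the initial token's attention weight
    by `alpha` and renormalize so the row still sums to 1.
    With alpha < 1 this shifts attention mass away from position 0
    toward the middle of the sequence."""
    w = attn_weights.copy()
    w[0] *= alpha
    return w / w.sum()

# Toy example: one query attending over 6 key positions with a
# U-shaped bias (high scores at the start and end of the sequence).
scores = np.array([4.0, 1.0, 0.5, 0.5, 1.0, 3.0])
attn = softmax(scores)
adjusted = rescale_initial_token(attn, alpha=0.5)
```

With `alpha < 1` the initial token's share drops and every other position's share rises proportionally, which matches the abstract's intuition that damping the initial token's dominance frees attention for the under-attended middle.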


URL

https://arxiv.org/abs/2512.13109

PDF

https://arxiv.org/pdf/2512.13109.pdf

