Paper Reading AI Learner

Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization

2025-10-09 15:26:48
Pál Zsámboki, Benjamin Levi, David Ansel Josef Smith, Mitansh Kagalwala, Arlington Kell, Samuel Liechty, Cong Wang

Abstract

We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over the tokens absent from an input sequence -- an ability central to board-game-style reasoning. Our main theoretical result establishes two statements. First, we prove tight bounds on the embedding and value dimensions of single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding the separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, gradient updates become noisy. We hypothesize that dropout counteracts the first effect and that Exponential Moving Average (EMA) counteracts the second. We validate these hypotheses through a random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1-style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.
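
The softmax-compression mechanism the abstract describes is easy to see numerically. Below is a minimal toy sketch, not the paper's construction: the vocabulary size, the displacement `gap`, and the uniform-attention assumption are all illustrative. Each attended token pushes its own output logit down; uniform attention averages those pushes over the k attended tokens, so each individual displacement shrinks to gap / k, eroding the separation between seen and unseen tokens.

```python
import numpy as np

def complement_probs(seen, vocab_size, gap):
    """Toy set-complement head: each seen token contributes a value
    vector that lowers its own output logit by `gap`. With uniform
    attention over k seen tokens, contributions are averaged, so each
    individual displacement is compressed to gap / k."""
    logits = np.zeros(vocab_size)
    k = len(seen)
    for tok in seen:
        logits[tok] -= gap / k  # attention averaging dilutes each push
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

vocab_size = 16
for k in (1, 2, 4, 8):
    p = complement_probs(range(k), vocab_size, gap=8.0)
    print(f"k={k}: worst seen-token prob {p[:k].max():.4f} vs "
          f"worst unseen-token prob {p[k:].min():.4f}")
```

As k grows, the ratio between unseen- and seen-token probabilities collapses from several orders of magnitude toward 1, which is the loss of precision at longer lengths that the theorem describes.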
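On the EMA side, the abstract does not spell out what is averaged; a common reading, and an assumption here, is an exponential moving average of the model weights that is evaluated in place of the raw weights. A minimal PyTorch-style sketch, with the decay value chosen arbitrarily:

```python
import copy
import torch

class WeightEMA:
    """Keep an exponential moving average of a model's parameters.
    Averaging smooths the noisy per-step updates that arise when many
    next tokens are equally valid targets."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay  # illustrative value, not from the paper
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```

Usage: call `ema.update(model)` after every optimizer step and run evaluation (here, the length-generalization tests) on `ema.shadow` rather than on the live model.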

URL

https://arxiv.org/abs/2510.08341

PDF

https://arxiv.org/pdf/2510.08341.pdf

