Abstract
We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over the tokens absent from an input sequence -- an ability central to board-game-style reasoning. Our main theoretical result consists of two statements. First, we prove tight bounds on the embedding and value dimensions of single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding the separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, updates become noisy. We hypothesize that dropout can counteract the first effect and Exponential Moving Average (EMA) the second. We validate these hypotheses through a random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1-style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.
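To make the task and the compression effect concrete, here is a minimal numpy sketch. It is not from the paper: the uniform-attention toy model, the vocabulary size, and the per-token displacement value are all illustrative assumptions. It builds the set complement target distribution and shows how a fixed logit displacement, once averaged over more attended tokens, yields a shrinking probability gap between absent and present tokens.

```python
import numpy as np

def complement_target(seq, vocab_size):
    """Target for the set complement task: uniform over tokens absent from seq."""
    mask = np.ones(vocab_size)
    mask[list(set(seq))] = 0.0
    return mask / mask.sum()

def prob_separation(seq_len, vocab_size=64, displacement=4.0):
    """Hypothetical single-head sketch (illustrative, not the paper's model):
    attention is uniform over the seq_len distinct input tokens, and each
    attended token pushes its own output logit down by `displacement`. The
    softmax attention weight 1/seq_len scales that push, so the gap between
    absent-token and present-token probabilities shrinks as seq_len grows."""
    logits = np.zeros(vocab_size)
    seen = np.arange(seq_len)                # distinct input tokens
    logits[seen] -= displacement / seq_len   # attention-averaged displacement
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[seq_len:].mean() / probs[seen].mean()

print(complement_target([3, 7, 7], vocab_size=8))  # uniform over the 6 absent tokens
for L in (1, 2, 8, 32):
    print(L, round(prob_separation(L), 2))
```

In this toy setting the absent-to-present probability ratio is exp(displacement / seq_len), so it falls from about 55 at length 1 toward 1 as the length grows, mirroring the eroding separation described above.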
URL
https://arxiv.org/abs/2510.08341