Abstract
The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that uses an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while leaving the canonical forward-pass $QKV$ structure intact. Through consistent experimental validation across a variety of decomposition and projection setups, we provide strong empirical evidence that the standard attention gradient is suboptimal. We demonstrate that selectively scaling these components, focusing primarily on the $0^{\text{th}}$-order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, and with a crude configuration, this method achieves a $0.56\%$ reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential gains on larger datasets and in deeper training regimes.
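The core operation the abstract describes, splitting a gradient into a component parallel to the span of a projection basis and an orthogonal "violation" remainder, then rescaling each part separately, can be sketched in plain NumPy. This is a minimal illustrative sketch, not the paper's implementation: the basis `P` and the scaling factors `alpha_par` / `alpha_orth` are hypothetical names introduced here for clarity.

```python
import numpy as np

def decompose_and_scale(G, P, alpha_par=1.1, alpha_orth=1.0):
    """Split gradient G into its span(P)-parallel part and the orthogonal
    violation, then rescale each component before the parameter update.
    All parameter names here are illustrative assumptions, not the paper's.
    """
    # Orthonormalize the basis columns so the projection is well defined.
    Q, _ = np.linalg.qr(P)
    G_par = Q @ (Q.T @ G)   # component inside span(P) (the "parallel span")
    G_orth = G - G_par      # orthogonal violation
    return alpha_par * G_par + alpha_orth * G_orth

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))   # toy gradient
P = rng.standard_normal((8, 3))   # toy projection basis spanning 3 directions
G_scaled = decompose_and_scale(G, P)
```

With `alpha_par = alpha_orth = 1`, the function returns `G` unchanged, since the two components sum back to the original gradient; the claimed gains come from scaling them asymmetrically.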
URL
https://arxiv.org/abs/2512.13033