Abstract
Decoder-only Transformers often struggle with complex reasoning tasks, particularly arithmetic reasoning requiring multiple sequential operations. In this work, we identify representation collapse in the model's intermediate layers as a key factor limiting their reasoning capabilities. To address this, we propose Sequential Variance-Covariance Regularization (Seq-VCR), which enhances the entropy of intermediate representations and prevents collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, our method significantly improves performance on arithmetic reasoning problems. On the challenging $5 \times 5$ integer multiplication task, our approach achieves $99.5\%$ exact match accuracy, outperforming models of the same size (which yield $0\%$ accuracy) and GPT-4 with five-shot CoT prompting ($44\%$). We also demonstrate superior results on arithmetic expression and longest increasing subsequence (LIS) datasets. Our findings highlight the importance of preventing intermediate layer representation collapse to enhance the reasoning capabilities of Transformers, and show that Seq-VCR offers an effective solution without requiring explicit CoT supervision.
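The abstract does not spell out the exact form of the Seq-VCR objective. As an illustration only, a minimal sketch of a variance-covariance penalty on intermediate hidden states, in the spirit of VICReg-style regularizers, might look like the following; the function name, hyperparameters, and pooling over token positions are assumptions, not the paper's definition.

```python
import torch

def seq_vcr_penalty(hidden_states, var_target=1.0, eps=1e-4,
                    var_weight=1.0, cov_weight=1.0):
    """Illustrative variance-covariance penalty on intermediate
    representations. `hidden_states` has shape (batch, seq_len, dim);
    token positions are folded into the batch dimension so statistics
    are computed per feature. Names and weights are hypothetical."""
    h = hidden_states.reshape(-1, hidden_states.size(-1))  # (B*T, D)
    h = h - h.mean(dim=0)                                   # center each feature

    # Variance term: push each feature's std toward var_target,
    # discouraging collapsed (near-constant) dimensions.
    std = torch.sqrt(h.var(dim=0) + eps)
    var_loss = torch.relu(var_target - std).mean()

    # Covariance term: penalize off-diagonal covariance so features
    # stay decorrelated, keeping representational entropy high.
    n = h.size(0)
    cov = (h.T @ h) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / h.size(1)

    return var_weight * var_loss + cov_weight * cov_loss
```

In such a sketch, the penalty would be added to the standard language-modeling loss for the chosen intermediate layers; how the paper weights and applies it across the sequence is specified in the full text at the URL below.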
URL
https://arxiv.org/abs/2411.02344