Abstract
Transformers have excelled in natural language modeling, and one reason behind this success is their exceptional ability to combine contextual information with global knowledge. However, the theoretical basis of this ability remains poorly understood. In this paper, we first introduce the Sparse Contextual Bigram (SCB), a natural extension of the classical bigram model in which the next token's generation depends on a sparse set of earlier positions determined by the last token. We then analyze the training dynamics and sample complexity of learning SCB with a one-layer linear transformer trained by a gradient-based algorithm. We show that, when trained from scratch, the training process splits into an initial sample-intensive stage, where the correlation with the target is boosted from zero to a nontrivial value, followed by a more sample-efficient stage of further improvement. Additionally, we prove that, provided there is a nontrivial correlation between the downstream and pretraining tasks, finetuning from a pretrained model allows us to bypass the initial sample-intensive stage. We also empirically demonstrate that our algorithm can outperform SGD in this setting and discuss its relationship with the usual softmax-based transformers.
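To make the data model concrete, below is a minimal sketch of an SCB-style generative process, based only on the informal description above: the last token selects a sparse set of earlier positions, and the next token is drawn from a bigram transition applied to a token at one of those positions. The specific choices here (vocabulary size V, context length T, sparsity k, uniform random position sets, a single transition matrix P, and sampling one selected position uniformly) are illustrative assumptions, not the paper's exact construction.

```python
# Illustrative sketch of a Sparse Contextual Bigram (SCB)-style data model.
# All concrete choices below are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)

V = 8   # vocabulary size (assumed)
T = 16  # context length, including the last token (assumed)
k = 3   # sparsity: number of earlier positions selected by the last token (assumed)

# Bigram transition matrix: row q is a distribution over the next token.
P = rng.dirichlet(np.ones(V), size=V)

# Each possible last token q is associated with a sparse set of earlier positions.
position_sets = {q: rng.choice(T - 1, size=k, replace=False) for q in range(V)}

def sample_sequence():
    """Sample a context, then generate the next token via the SCB mechanism."""
    context = rng.integers(0, V, size=T)      # earlier tokens plus the last token
    q = context[-1]                           # last token determines the sparse position set
    pos = rng.choice(position_sets[q])        # pick one of the selected earlier positions (assumed uniform)
    y = rng.choice(V, p=P[context[pos]])      # bigram transition from the token at that position
    return context, y

context, y = sample_sequence()
print(context, y)
```

A learner must recover both the position sets (the "contextual" part, naturally expressed through attention) and the transition matrix P (the "global knowledge" part), which is what makes a one-layer linear transformer a natural model for this task.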
URL
https://arxiv.org/abs/2410.23438