Paper Reading AI Learner

Learning and Transferring Sparse Contextual Bigrams with Linear Transformers

2024-10-30 20:29:10
Yunwei Ren, Zixuan Wang, Jason D. Lee

Abstract

Transformers have excelled at natural language modeling, and one reason behind this success is their exceptional ability to combine contextual information with global knowledge. However, the theoretical basis of this ability remains unclear. In this paper, we first introduce the Sparse Contextual Bigram (SCB), a natural extension of the classical bigram model in which the generation of the next token depends on a sparse set of earlier positions determined by the last token. We then analyze the training dynamics and sample complexity of learning SCB with a one-layer linear transformer trained by a gradient-based algorithm. We show that, when trained from scratch, the training process splits into an initial sample-intensive stage, in which the correlation is boosted from zero to a nontrivial value, followed by a more sample-efficient stage of further improvement. Additionally, we prove that, provided there is a nontrivial correlation between the downstream and pretraining tasks, finetuning from a pretrained model allows us to bypass the initial sample-intensive stage. We also empirically demonstrate that our algorithm can outperform SGD in this setting, and we discuss its relationship with the usual softmax-based transformers.
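To make the data model concrete, here is a minimal sketch of one plausible reading of the SCB generative process described above: the last token selects a sparse set of earlier positions, and the next token is drawn from a bigram transition applied to a token at one of those positions. The vocabulary size, sequence length, sparsity level, and uniform mixing over the selected positions are all illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

# Hypothetical sketch of a Sparse Contextual Bigram (SCB) sampler.
# Assumptions (not from the paper): vocabulary size V=5, sequence
# length T=8, each last token attends to 2 earlier positions chosen
# uniformly, and positions are mixed uniformly at sampling time.

rng = np.random.default_rng(0)
V = 5   # vocabulary size (assumed)
T = 8   # sequence length (assumed)

# Bigram transition matrix: P[i, j] = Pr(next token = j | source token = i)
P = rng.dirichlet(np.ones(V), size=V)

# For each possible last token v, a sparse set of earlier positions
# it depends on (2 out of T-1 here; the sparsity level is assumed).
sparse_positions = {v: rng.choice(T - 1, size=2, replace=False)
                    for v in range(V)}

def scb_next_token(seq):
    """Sample the next token for a length-T sequence under the sketch."""
    last = seq[-1]
    # Pick one of the sparse positions selected by the last token.
    pos = rng.choice(sparse_positions[last])
    src = seq[pos]
    # Apply the bigram transition to the token at that position.
    return rng.choice(V, p=P[src])

seq = rng.integers(0, V, size=T)
nxt = scb_next_token(seq)
print(nxt)
```

Under this reading, the classical bigram model is the special case where the sparse set always contains only the immediately preceding position.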

URL

https://arxiv.org/abs/2410.23438

PDF

https://arxiv.org/pdf/2410.23438.pdf

