
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning

2023-01-23 18:56:12
Malte Ostendorff, Georg Rehm

Abstract

Most Transformer language models are primarily pretrained on English text, limiting their use for other languages. As model sizes grow, the performance gap between English and languages with fewer compute and data resources widens even further. Consequently, more resource-efficient training methods are needed to bridge the gap for languages with fewer resources available. To address this problem, we introduce a cross-lingual and progressive transfer learning approach, called CLP-Transfer, that transfers models from a source language for which pretrained models are publicly available, such as English, to a new target language. In contrast to prior work, which focused on cross-lingual transfer between two languages, we also extend the transfer across model sizes. Given a pretrained model in a source language, we aim for a same-sized model in a target language. Instead of training the target model from scratch, we exploit a smaller model in the target language that requires far fewer resources. Both the small target-language model and the source-language model are then used to initialize the token embeddings of the larger model, based on the overlapping vocabulary of the source and target languages. All remaining weights are reused from the model in the source language. This approach outperforms cross-lingual transfer alone and can save up to 80% of the training steps compared to random initialization.
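The abstract describes initializing the large target-language model from two existing models: token embeddings are built from the source model and the small target-language model via the vocabulary overlap, and all remaining weights are copied from the source model. The sketch below illustrates one plausible reading of that embedding initialization in Python/NumPy; the function name clp_transfer_embeddings, the softmax-weighted combination for tokens outside the overlap, and all variable names are illustrative assumptions rather than the paper's exact formulation.

    # Minimal sketch of the CLP-Transfer embedding initialization described above.
    # Assumes we already have the source model's embedding matrix, the small
    # target-language model's embedding matrix, and both tokenizer vocabularies.
    # The softmax-weighted combination for target-only tokens is an illustrative
    # choice, not necessarily the paper's exact formula.
    import numpy as np

    def clp_transfer_embeddings(src_emb, src_vocab, small_tgt_emb, tgt_vocab):
        # src_emb:       (|V_src|, d)       embeddings of the large source-language model
        # src_vocab:     dict token -> row index into src_emb
        # small_tgt_emb: (|V_tgt|, d_small) embeddings of the small target-language model
        # tgt_vocab:     dict token -> row index into small_tgt_emb (target tokenizer)
        d = src_emb.shape[1]
        tgt_emb = np.empty((len(tgt_vocab), d), dtype=src_emb.dtype)

        # 1) Overlapping tokens: copy the source model's embedding directly.
        overlap = [t for t in tgt_vocab if t in src_vocab]
        overlap_tgt_ids = np.array([tgt_vocab[t] for t in overlap])
        overlap_src_ids = np.array([src_vocab[t] for t in overlap])
        tgt_emb[overlap_tgt_ids] = src_emb[overlap_src_ids]

        # 2) Target-only tokens: combine the source embeddings of the overlapping
        #    tokens, weighted by similarity in the small target model's space.
        small_overlap = small_tgt_emb[overlap_tgt_ids]        # (|O|, d_small)
        for token, idx in tgt_vocab.items():
            if token in src_vocab:
                continue
            sims = small_overlap @ small_tgt_emb[idx]         # (|O|,)
            weights = np.exp(sims - sims.max())
            weights /= weights.sum()
            tgt_emb[idx] = weights @ src_emb[overlap_src_ids]
        return tgt_emb

    # All non-embedding (Transformer block) weights of the new target-language
    # model are simply reused from the pretrained source-language model.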

URL

https://arxiv.org/abs/2301.09626

PDF

https://arxiv.org/pdf/2301.09626.pdf

