Abstract
Most Transformer language models are primarily pretrained on English text, limiting their use for other languages. As model sizes grow, the performance gap between English and languages with fewer compute and data resources widens even further. Consequently, more resource-efficient training methods are needed to bridge the gap for lower-resource languages. To address this problem, we introduce a cross-lingual and progressive transfer learning approach, called CLP-Transfer, that transfers models from a source language for which pretrained models are publicly available, such as English, to a new target language. As opposed to prior work, which focused on cross-lingual transfer between two languages, we also extend the transfer across model sizes. Given a pretrained model in a source language, we aim for a same-sized model in a target language. Instead of training the model from scratch, we exploit a smaller model in the target language that requires far fewer resources. Both the small and the source model are then used to initialize the token embeddings of the larger model, based on the overlapping vocabulary of the source and target language. All remaining weights are reused from the model in the source language. This approach outperforms cross-lingual transfer alone and can save up to 80% of the training steps compared to random initialization.
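The embedding initialization described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name, the cosine-similarity weighting for non-overlapping tokens, and the toy vocabularies are assumptions made for the example. Tokens present in both vocabularies copy their source embedding directly; the remaining target tokens receive a similarity-weighted combination of the overlapping tokens' source embeddings, with similarities measured in the small target-language model's embedding space.

```python
import numpy as np

def clp_init_embeddings(src_emb, src_vocab, small_emb, tgt_vocab):
    """Sketch of a CLP-Transfer-style token-embedding initialization.

    src_emb:   (|V_src|, d) embedding matrix of the large source-language model.
    src_vocab: dict mapping source tokens to row indices of src_emb.
    small_emb: (|V_tgt|, d_small) embedding matrix of the small target-language model.
    tgt_vocab: dict mapping target tokens to row indices of small_emb (and the output).
    Returns a (|V_tgt|, d) matrix to initialize the large target model's embeddings.
    """
    d = src_emb.shape[1]
    overlap = [t for t in tgt_vocab if t in src_vocab]
    tgt_emb = np.zeros((len(tgt_vocab), d))
    # Gather the overlapping tokens' embeddings in both embedding spaces.
    small_overlap = np.stack([small_emb[tgt_vocab[t]] for t in overlap])
    src_overlap = np.stack([src_emb[src_vocab[t]] for t in overlap])
    for token, i in tgt_vocab.items():
        if token in src_vocab:
            # Shared token: reuse the source model's embedding directly.
            tgt_emb[i] = src_emb[src_vocab[token]]
        else:
            # New token: weight the overlapping tokens' source embeddings by
            # cosine similarity computed in the small model's embedding space.
            v = small_emb[i]
            sims = small_overlap @ v / (
                np.linalg.norm(small_overlap, axis=1) * np.linalg.norm(v) + 1e-9
            )
            w = np.maximum(sims, 0.0)          # keep only positive similarities
            w = w / (w.sum() + 1e-9)           # normalize to a convex combination
            tgt_emb[i] = w @ src_overlap
    return tgt_emb
```

All non-embedding weights (attention, feed-forward, layer norms) would simply be copied from the source model, as the abstract states.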
URL
https://arxiv.org/abs/2301.09626