Abstract
Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models usually perform suboptimally in other languages, particularly those that are linguistically distant from English. This performance gap stems mainly from the imbalanced distribution of training data across languages during the pre-training and instruction-tuning stages. To address this problem, we propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction-tuning data. Our method leverages the compressed representation shared by various languages to efficiently enhance the model's task-solving capabilities and multilingual proficiency within a single process. In addition, we introduce a multi-task and multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental results demonstrate that our method substantially improves performance across tasks and languages, and we provide extensive insights into how the volume of cross-lingual data and the integration of translation data affect multilingual consistency and accuracy.
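The abstract describes CrossIn only at a high level. As a rough, hedged illustration of what a "mixed composition of cross-lingual instruction-tuning data" could look like in practice, the Python sketch below pairs instructions and responses across languages and optionally interleaves explicit translation examples. The function name, data fields, and the `translate` callable are assumptions made for this sketch, not the paper's released code.

# Illustrative sketch only (assumed names and fields, not the authors' code):
# build a CrossIn-style mixture from English seed instruction data.
import random

def build_cross_lingual_mixture(seed_examples, languages, translate,
                                translation_ratio=0.2, seed=0):
    """seed_examples: list of {"instruction": str, "output": str} in English.
    languages: target language codes, e.g. ["zh", "es", "vi"].
    translate: callable (text, src_lang, tgt_lang) -> str; any MT backend.
    translation_ratio: fraction of explicit translation pairs mixed in.
    """
    rng = random.Random(seed)

    def render(text, lang):
        # Keep English text as-is; otherwise translate it from English.
        return text if lang == "en" else translate(text, "en", lang)

    mixture = []
    for ex in seed_examples:
        # Cross-lingual pair: instruction in one language, response in another,
        # encouraging a representation shared across languages.
        ins_lang, out_lang = rng.sample(["en"] + list(languages), 2)
        mixture.append({
            "instruction": render(ex["instruction"], ins_lang),
            "output": render(ex["output"], out_lang),
        })
        # Occasionally add an explicit translation task to the mixture.
        if rng.random() < translation_ratio:
            tgt = rng.choice(list(languages))
            mixture.append({
                "instruction": f"Translate into {tgt}: {ex['output']}",
                "output": translate(ex["output"], "en", tgt),
            })
    rng.shuffle(mixture)
    return mixture

Sampling the instruction and response languages independently is one plausible reading of the paper's cross-lingual pairing; the translation_ratio knob mirrors the abstract's study of how integrating translation data affects multilingual consistency and accuracy.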
URL
https://arxiv.org/abs/2404.11932