Abstract
Fine-tuning multilingual sequence-to-sequence large language models (msLLMs) has shown promise in developing neural machine translation (NMT) systems for low-resource languages (LRLs). However, conventional single-stage fine-tuning methods struggle in extremely low-resource NMT settings, where training data is very limited. This paper contributes to artificial intelligence by proposing two approaches for adapting msLLMs in these challenging scenarios: (1) continual pre-training (CPT), in which the msLLM is further trained on domain-specific monolingual data to compensate for the under-representation of LRLs, and (2) intermediate task transfer learning (ITTL), which fine-tunes the msLLM on both in-domain and out-of-domain parallel data to strengthen its translation capabilities across domains and tasks. As an application in engineering, these methods are implemented in NMT systems for Sinhala, Tamil, and English (six language pairs) in domain-specific, extremely low-resource settings (datasets of fewer than 100,000 samples). Our experiments show that these approaches improve translation performance by an average of +1.47 bilingual evaluation understudy (BLEU) points over the standard single-stage fine-tuning baseline across all translation directions. Additionally, a multi-model ensemble yields a further gain in BLEU score.
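To make the two-stage recipe concrete, below is a minimal sketch of how such adaptation could be set up with the Hugging Face Transformers library, using mBART-50 as an example msLLM. The model choice, language codes, dataset variables (cpt_dataset, out_of_domain_parallel, in_domain_parallel), hyperparameters, and checkpoint paths are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of two-stage msLLM adaptation: continual pre-training (CPT) followed by
# intermediate task transfer learning (ITTL). Assumes mBART-50 as the msLLM and
# hypothetical, pre-tokenized datasets; these are not the authors' exact settings.
from transformers import (
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/mbart-large-50-many-to-many-mmt"
# Example direction: Sinhala -> English (mBART-50 language codes).
tokenizer = MBart50TokenizerFast.from_pretrained(
    model_name, src_lang="si_LK", tgt_lang="en_XX"
)
model = MBartForConditionalGeneration.from_pretrained(model_name)


def train_stage(model, dataset, output_dir):
    """Run one adaptation stage (CPT or an ITTL fine-tuning step) on a prepared dataset."""
    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=8,   # small batch, assumed for an extremely low-resource setting
        num_train_epochs=5,
        learning_rate=3e-5,
        save_strategy="epoch",
    )
    trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=dataset)
    trainer.train()
    return trainer.model


# Stage 1: CPT on domain-specific monolingual data, e.g. framed as a
# denoising/reconstruction task over monolingual Sinhala or Tamil sentences.
# model = train_stage(model, cpt_dataset, "ckpt/cpt")

# Stage 2: ITTL, fine-tuning first on out-of-domain parallel data and then on
# the in-domain parallel data, each stage starting from the previous weights.
# model = train_stage(model, out_of_domain_parallel, "ckpt/ittl")
# model = train_stage(model, in_domain_parallel, "ckpt/final")
```

In this sketch the same training loop is reused across stages so that each stage initializes from the previous stage's weights; a multi-model ensemble, as mentioned above, would combine several such adapted checkpoints at inference time.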
URL
https://arxiv.org/abs/2503.22582