Abstract
Low-Resource Languages (LRLs) present significant challenges in natural language processing due to their limited linguistic resources and underrepresentation in standard datasets. While recent advances in Large Language Models (LLMs) and Neural Machine Translation (NMT) have substantially improved translation capabilities for high-resource languages, performance disparities persist for LRLs, with particular impact on privacy-sensitive and resource-constrained scenarios. This paper systematically evaluates the limitations of current LLMs across 200 languages using benchmarks such as FLORES-200. We also explore alternative data sources, including news articles and bilingual dictionaries, and demonstrate how knowledge distillation from large pre-trained models can significantly improve the translation quality of smaller models on LRLs. Additionally, we investigate various fine-tuning strategies, revealing that incremental improvements markedly narrow the performance gap on smaller LLMs.
Abstract (translated)
Low-resource languages (LRLs) face major challenges in natural language processing because of their limited linguistic resources and underrepresentation in standard datasets. Although recent advances in large language models (LLMs) and neural machine translation (NMT) have greatly improved translation capabilities for high-resource languages, performance gaps on low-resource languages persist, with especially pronounced impact in privacy-sensitive and resource-constrained settings. This paper systematically evaluates the limitations of current LLMs across 200 languages, measured with benchmarks such as FLORES-200. We also explore alternative data sources, including news articles and bilingual dictionaries, and show how knowledge distillation from large pre-trained models can significantly improve the translation quality of smaller models on LRLs. In addition, the paper studies various fine-tuning strategies, revealing that incremental improvements can markedly narrow the performance gap on smaller LLMs.
URL
https://arxiv.org/abs/2503.24102
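Illustrative code sketch
The abstract credits knowledge distillation from a large pre-trained model for much of the gain on smaller models, but does not spell out the pipeline. The following is only a minimal sketch of sequence-level distillation with Hugging Face transformers, in which the teacher translates unlabeled source text and the student is fine-tuned on those pseudo-targets. The NLLB-200 checkpoints, the eng_Latn/khm_Khmr language pair, and all hyperparameters are illustrative assumptions, not choices taken from the paper.

# Minimal sketch of sequence-level knowledge distillation for low-resource MT.
# Model names, language codes, and hyperparameters are illustrative assumptions;
# the paper's actual pipeline may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

TEACHER = "facebook/nllb-200-3.3B"            # large teacher (assumed)
STUDENT = "facebook/nllb-200-distilled-600M"  # small student (assumed)
SRC_LANG, TGT_LANG = "eng_Latn", "khm_Khmr"   # FLORES-200-style codes (assumed pair)

tok = AutoTokenizer.from_pretrained(TEACHER, src_lang=SRC_LANG)
teacher = AutoModelForSeq2SeqLM.from_pretrained(TEACHER).eval()
student = AutoModelForSeq2SeqLM.from_pretrained(STUDENT)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def distill_step(src_sentences):
    """One step: the teacher generates pseudo-targets, the student trains on them."""
    batch = tok(src_sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        pseudo = teacher.generate(
            **batch,
            forced_bos_token_id=tok.convert_tokens_to_ids(TGT_LANG),
            max_new_tokens=128,
        )
    # Drop the decoder start token and mask padding so it is ignored by the loss.
    labels = pseudo[:, 1:].clone()
    labels[labels == tok.pad_token_id] = -100
    # Standard cross-entropy of the student against the teacher's translations.
    loss = student(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

In this sequence-level formulation the student only sees the teacher's decoded outputs, which keeps memory costs low compared with matching the teacher's full token-level distributions; either variant would fit the abstract's description of distilling a large pre-trained model into a smaller one.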