Abstract
Does multilingual Neural Machine Translation (NMT) lead to the curse of multilinguality, or does it provide cross-lingual knowledge transfer within a language family? In this study, we explore multiple approaches for extending the available data regime in NMT, and we demonstrate cross-lingual benefits even in the zero-shot translation regime for low-resource languages. With this paper, we provide state-of-the-art open-source NMT models for translating between selected Slavic languages. We released our models on the HuggingFace Hub (this https URL) under the CC BY 4.0 license. The Slavic language family comprises morphologically rich Central and Eastern European languages. Although it counts hundreds of millions of native speakers, Slavic Neural Machine Translation is, in our view, under-studied. Most recent NMT research focuses either on high-resource languages such as English, Spanish, and German (in the WMT23 General Translation Task, 7 out of 8 task directions are from or to English); on massively multilingual models covering multiple language groups; or on evaluation techniques.
URL
https://arxiv.org/abs/2502.14509