Abstract
Neural machine translation (NMT) systems have recently achieved state-of-the-art results for many popular language pairs because of the availability of data. For low-resource language pairs, little research has been done in this field due to the lack of bilingual data. In this paper, we attempt to build the first NMT systems for a low-resource language pair: Japanese-Vietnamese. We also show significant improvements when combining advanced methods to reduce the adverse impact of data sparsity and improve the quality of NMT systems. In addition, we propose a variant of the Byte-Pair Encoding algorithm to perform effective word segmentation for Vietnamese text and alleviate the rare-word problem that persists in NMT systems.
URL
https://arxiv.org/abs/1805.07133