Abstract
To improve low-resource Neural Machine Translation (NMT) with multilingual corpora, training on the most related high-resource language only is often more effective than using all data available (Neubig and Hu, 2018). However, it is possible that an intelligent data selection strategy can further improve low-resource NMT with data from other auxiliary languages. In this paper, we seek to construct a sampling distribution over all multilingual data, so that it minimizes the training loss of the low-resource language. Based on this formulation, we propose an efficient algorithm, Target Conditioned Sampling (TCS), which first samples a target sentence, and then conditionally samples its source sentence. Experiments show that TCS brings significant gains of up to 2 BLEU on three of four languages we test, with minimal training overhead.
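The two-step procedure named in the abstract (sample a target sentence, then conditionally sample a source sentence for it) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `corpus` layout, the `similarity` weights used for the conditional draw, and the uniform target distribution are all assumptions made for the example.

```python
import random

def target_conditioned_sample(corpus, similarity, rng=random):
    """One TCS-style draw (illustrative sketch, not the paper's algorithm).

    corpus: dict mapping a target sentence to a list of
            (source_language, source_sentence) pairs.
    similarity: hypothetical per-language weights for the conditional
                source distribution.
    """
    # Step 1: sample a target sentence (uniformly here, for simplicity).
    tgt = rng.choice(sorted(corpus))
    sources = corpus[tgt]

    # Step 2: sample a source sentence conditioned on the target,
    # with probability proportional to the language's weight.
    weights = [similarity[lang] for lang, _ in sources]
    total = sum(weights)
    r, acc = rng.random() * total, 0.0
    for (lang, src), w in zip(sources, weights):
        acc += w
        if r <= acc:
            return src, tgt
    return sources[-1][1], tgt  # numerical fallback
```

A single training batch would then be built by repeated draws from this sampler, so that auxiliary-language data enters training in proportion to the conditional weights rather than uniformly.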
URL
https://arxiv.org/abs/1905.08212