Abstract
It has been shown that the performance of neural machine translation (NMT) drops starkly in low-resource conditions, underperforming phrase-based statistical machine translation (PBSMT) and requiring large amounts of auxiliary data to achieve competitive results. In this paper, we re-assess the validity of these results, arguing that they stem from a lack of system adaptation to low-resource settings. We discuss some pitfalls to be aware of when training low-resource NMT systems, along with recent techniques that have been shown to be especially helpful in low-resource settings, resulting in a set of best practices for low-resource NMT. In our experiments on German-English with varying amounts of IWSLT14 training data, we show that, without the use of any auxiliary monolingual or multilingual data, an optimized NMT system can outperform PBSMT with far less data than previously claimed. We also apply these techniques to a low-resource Korean-English dataset, surpassing previously reported results by 4 BLEU.
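The abstract refers to "system adaptation" and "a set of best practices" without enumerating them here. As a rough illustration only, a minimal sketch of the kind of adaptation this line of work argues for, namely a smaller subword vocabulary and stronger regularization than high-resource defaults; all names and values below are assumptions for illustration, not the authors' reported configuration:

```python
# Hypothetical sketch of low-resource NMT hyperparameter adaptation.
# Values are illustrative assumptions, not the paper's reported settings.

# Settings typical of a high-resource NMT recipe.
high_resource = {
    "bpe_merge_ops": 30000,     # large subword vocabulary
    "dropout": 0.2,             # moderate regularization
    "label_smoothing": 0.1,
    "batch_size_tokens": 16000,
}

# Low-resource adaptation: shrink the subword vocabulary and regularize
# more aggressively, rather than reusing high-resource defaults unchanged.
low_resource = dict(
    high_resource,
    bpe_merge_ops=2000,         # fewer merges generalize better on small data
    dropout=0.3,
    batch_size_tokens=4000,
)

if __name__ == "__main__":
    for name, cfg in (("high-resource", high_resource), ("low-resource", low_resource)):
        print(f"{name}: {cfg}")
```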
URL
https://arxiv.org/abs/1905.11901