Abstract
It has been previously noted that neural machine translation (NMT) is very sensitive to domain shift. In this paper, we argue that this is a dual effect of the highly lexicalized nature of NMT, resulting in failure for sentences with large numbers of unknown words, and lack of supervision for domain-specific words. To remedy this problem, we propose an unsupervised adaptation method which fine-tunes a pre-trained out-of-domain NMT model using a pseudo-in-domain corpus. Specifically, we perform lexicon induction to extract an in-domain lexicon, and construct a pseudo-parallel in-domain corpus by performing word-for-word back-translation of monolingual in-domain target sentences. In five domains over twenty pairwise adaptation settings and two model architectures, our method achieves consistent improvements without using any in-domain parallel sentences, improving up to 14 BLEU over unadapted models, and up to 2 BLEU over strong back-translation baselines.
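The core corpus-construction step described above can be sketched in a few lines: given an induced in-domain lexicon mapping target words to source words, each monolingual target sentence is back-translated word for word to yield a pseudo-source sentence, and the two are paired as pseudo-parallel training data. This is a minimal illustration, not the paper's implementation; the function names and the toy English-to-German lexicon are hypothetical.

```python
# Sketch of word-for-word back-translation with an induced lexicon.
# Assumption: `lexicon` maps in-domain target-language words to their
# most likely source-language translations; words missing from the
# lexicon are copied through unchanged.

def word_for_word_back_translate(target_sentence, lexicon):
    """Build a pseudo-source sentence from a monolingual target sentence."""
    tokens = target_sentence.split()
    # Look up each target token; copy it verbatim if it is not in the lexicon.
    pseudo_source = [lexicon.get(tok, tok) for tok in tokens]
    return " ".join(pseudo_source)

def build_pseudo_parallel_corpus(target_sentences, lexicon):
    """Pair each target sentence with its word-for-word back-translation."""
    return [(word_for_word_back_translate(sent, lexicon), sent)
            for sent in target_sentences]

# Toy illustration with a hypothetical English->German medical lexicon.
lexicon = {"fever": "Fieber", "and": "und", "cough": "Husten"}
corpus = build_pseudo_parallel_corpus(["fever and cough"], lexicon)
print(corpus)  # [('Fieber und Husten', 'fever and cough')]
```

The resulting (pseudo-source, target) pairs would then be used to fine-tune the pre-trained out-of-domain NMT model, as the abstract describes.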
URL
https://arxiv.org/abs/1906.00376