Abstract
It has been previously noted that neural machine translation (NMT) is very sensitive to domain shift. In this paper, we argue that this is a dual effect of the highly lexicalized nature of NMT, resulting in failure for sentences with large numbers of unknown words, and lack of supervision for domain-specific words. To remedy this problem, we propose an unsupervised adaptation method which fine-tunes a pre-trained out-of-domain NMT model using a pseudo-in-domain corpus. Specifically, we perform lexicon induction to extract an in-domain lexicon, and construct a pseudo-parallel in-domain corpus by performing word-for-word back-translation of monolingual in-domain target sentences. In five domains over twenty pairwise adaptation settings and two model architectures, our method achieves consistent improvements without using any in-domain parallel sentences, improving up to 14 BLEU over unadapted models, and up to 2 BLEU over strong back-translation baselines.
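The core corpus-construction step described above can be sketched in a few lines: given an induced in-domain lexicon mapping target words to source words, each monolingual target sentence is back-translated word for word to yield a pseudo-source sentence, and the two are paired as pseudo-parallel training data. This is a minimal illustration, not the paper's implementation; the function names and the toy English-to-German lexicon are hypothetical.

```python
# Sketch of word-for-word back-translation with an induced lexicon.
# Assumption: `lexicon` maps in-domain target-language words to their
# most likely source-language translations; words missing from the
# lexicon are copied through unchanged.

def word_for_word_back_translate(target_sentence, lexicon):
    """Build a pseudo-source sentence from a monolingual target sentence."""
    tokens = target_sentence.split()
    # Look up each target token; copy it verbatim if it is not in the lexicon.
    pseudo_source = [lexicon.get(tok, tok) for tok in tokens]
    return " ".join(pseudo_source)

def build_pseudo_parallel_corpus(target_sentences, lexicon):
    """Pair each target sentence with its word-for-word back-translation."""
    return [(word_for_word_back_translate(sent, lexicon), sent)
            for sent in target_sentences]

# Toy illustration with a hypothetical English->German medical lexicon.
lexicon = {"fever": "Fieber", "and": "und", "cough": "Husten"}
corpus = build_pseudo_parallel_corpus(["fever and cough"], lexicon)
print(corpus)  # [('Fieber und Husten', 'fever and cough')]
```

The resulting (pseudo-source, target) pairs would then be used to fine-tune the pre-trained out-of-domain NMT model, as the abstract describes.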
URL
https://arxiv.org/abs/1906.00376