Abstract
This paper presents NICT's participation in the WMT18 shared parallel corpus filtering task. The organizers provided a one-billion-word German-English corpus crawled from the web as part of the Paracrawl project. This corpus is too noisy to build an acceptable neural machine translation (NMT) system. Using the clean data of the WMT18 shared news translation task, we designed several features and trained a classifier to score each sentence pair in the noisy data. Finally, we sampled 100 million and 10 million words and built the corresponding NMT systems. Empirical results show that our NMT systems trained on the sampled data achieve promising performance.
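The paper does not include an implementation, but the final sampling step it describes (keeping the highest-scoring sentence pairs until a fixed word budget, such as 100 million or 10 million words, is reached) can be sketched as follows. The function name, the greedy selection strategy, and counting the budget on the target side are assumptions for illustration, not details from the paper:

```python
def sample_by_budget(scored_pairs, word_budget):
    """Greedily select the highest-scoring sentence pairs until the
    target-side word budget is exhausted.

    scored_pairs: iterable of (score, source_sentence, target_sentence),
    where score comes from a classifier trained on clean data (as in the
    paper); the selection heuristic here is a hypothetical sketch.
    """
    selected = []
    words_used = 0
    # Rank pairs from cleanest to noisiest according to the classifier score.
    for score, src, tgt in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        n_words = len(tgt.split())
        if words_used + n_words > word_budget:
            continue  # this pair would overflow the budget; try shorter ones
        selected.append((src, tgt))
        words_used += n_words
    return selected
```

For example, with a three-word budget, only the top-scoring pairs that fit are kept; lower-scoring pairs are dropped once the budget is exhausted.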
URL
https://arxiv.org/abs/1809.07043