FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation

2020-12-31 17:15:09

Wenhao Zhu, Shujian Huang, Tong Pu, Xu Zhang, Jian Yu, Wei Chen, Yanfeng Wang, Jiajun Chen

arXiv_CL

Abstract

Previous domain adaptation research usually neglects the diversity of translation within the same domain, which is a core problem when adapting a general neural machine translation (NMT) model to a specific domain in real-world scenarios. One representative of such challenging scenarios is deploying a translation system for a conference with a specific topic, e.g., computer networks or natural language processing, where resources are usually extremely scarce because of the limited time schedule. To motivate wide investigation of such settings, we present a real-world fine-grained domain adaptation task in machine translation (FDMT). The FDMT dataset (Zh-En) consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones. To be closer to reality, FDMT does not employ any in-domain bilingual training data. Instead, each sub-domain is equipped with monolingual data, a bilingual dictionary, and a knowledge base, to encourage in-depth exploration of these available resources. Corresponding development and test sets are provided for evaluation. We conduct quantitative experiments and in-depth analyses in this new setting, which benchmark the fine-grained domain adaptation task and reveal several challenging problems that need to be addressed.

URL

https://arxiv.org/abs/2012.15717

PDF

https://arxiv.org/pdf/2012.15717.pdf