Abstract
Although neural machine translation (NMT) yields promising translation performance, it unfortunately suffers from over- and under-translation issues [Tu et al., 2016], of which studies have become research hotspots in NMT. At present, these studies mainly apply the dominant automatic evaluation metrics, such as BLEU, to evaluate the overall translation quality with respect to both adequacy and fluency. However, they are unable to accurately measure the ability of NMT systems in dealing with the above-mentioned issues. In this paper, we propose two quantitative metrics, the Otem and Utem, to automatically evaluate the system performance in terms of over- and under-translation respectively. Both metrics are based on the proportion of mismatched n-grams between gold reference and system translation. We evaluate both metrics by comparing their scores with human evaluations, where the values of Pearson Correlation Coefficient reveal their strong correlation. Moreover, in-depth analyses on various translation systems indicate some inconsistency between BLEU and our proposed metrics, highlighting the necessity and significance of our metrics.
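To make the idea of a "proportion of mismatched n-grams" concrete, the following is a minimal sketch and not the paper's definition of Otem or Utem. It assumes a single tokenized reference, treats any n-gram that occurs more often in the system output than in the reference as over-translated, treats any n-gram that occurs more often in the reference than in the output as under-translated, and averages the rates arithmetically over n-gram orders. The function names `ngram_counts` and `mismatch_rates` and the parameter `max_n` are hypothetical, and any length penalty or multi-reference handling the actual metrics may use is omitted.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mismatch_rates(reference, hypothesis, max_n=2):
    """Return (over_rate, under_rate), each averaged over n = 1..max_n.

    Simplifying assumptions (not taken from the paper):
    - over_rate is the share of hypothesis n-grams in excess of the reference;
    - under_rate is the share of reference n-grams missing from the hypothesis.
    """
    over, under = [], []
    for n in range(1, max_n + 1):
        ref = ngram_counts(reference, n)
        hyp = ngram_counts(hypothesis, n)
        # Counter subtraction keeps only the positive count differences.
        extra = sum((hyp - ref).values())
        missing = sum((ref - hyp).values())
        # max(..., 1) avoids division by zero for inputs shorter than n.
        over.append(extra / max(sum(hyp.values()), 1))
        under.append(missing / max(sum(ref.values()), 1))
    return sum(over) / max_n, sum(under) / max_n


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    repeated = "the cat cat sat on the mat".split()  # repeats a word
    truncated = "the cat sat".split()                # drops the ending
    print(mismatch_rates(reference, repeated))
    print(mismatch_rates(reference, truncated))
```

Under these assumptions, an output that repeats a word raises only the first rate and an output that drops content raises only the second. This is the kind of separation that a single overall score such as BLEU does not expose.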
URL
https://arxiv.org/abs/1807.08945