Abstract
Automatic n-gram-based metrics such as ROUGE are widely used for evaluating generative tasks such as summarization. While these metrics are considered indicative (even if imperfect) of human evaluation for English, their suitability for other languages remains unclear. To address this, we systematically assess evaluation metrics for generation, both n-gram-based and neural-based, to determine their effectiveness across languages and tasks. Specifically, we design a large-scale evaluation suite across eight languages from four typological families (agglutinative, isolating, low-fusional, and high-fusional), spanning both low- and high-resource settings, and analyze how well the metrics correlate with human judgments. Our findings highlight the sensitivity of evaluation metrics to language type. For example, n-gram-based metrics correlate less well with human assessments in fusional languages than in isolating and agglutinative languages. We also demonstrate that proper tokenization can significantly mitigate this issue for morphologically rich fusional languages, sometimes even reversing negative trends. Additionally, we show that neural-based metrics specifically trained for evaluation, such as COMET, consistently outperform other neural metrics and correlate better with human judgments in low-resource languages. Overall, our analysis highlights the limitations of n-gram metrics for fusional languages and advocates for greater investment in neural-based metrics trained for evaluation tasks.
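To make the core measurement concrete, the following is a minimal sketch of how an n-gram metric's agreement with human judgments might be computed, and how the choice of tokenizer changes the overlap score. Everything here is an assumption for illustration and not the paper's pipeline: `unigram_f1` is a simplified ROUGE-1-style score, `char_ngram_tokens` is a hypothetical stand-in for a morphology-aware tokenizer, and Spearman rank correlation is one common choice that the abstract does not itself name.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """ROUGE-1-style F1: clipped unigram overlap between candidate and reference."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def whitespace_tokens(text):
    """Baseline tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def char_ngram_tokens(text, n=3):
    """Hypothetical stand-in for subword segmentation: character n-grams per word."""
    tokens = []
    for word in text.lower().split():
        if len(word) <= n:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def metric_human_correlation(candidates, references, human_scores, tokenize):
    """Correlate per-example metric scores with human scores for one tokenizer."""
    metric_scores = [
        unigram_f1(tokenize(c), tokenize(r)) for c, r in zip(candidates, references)
    ]
    return spearman(metric_scores, human_scores)
```

Under these assumptions, two inflected forms of the same stem (for example "casas" and "casa") share no whitespace token but do share character n-grams, which illustrates why segmentation can matter for morphologically rich languages. Calling `metric_human_correlation` once per tokenizer on the same examples gives a direct comparison of the two settings.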
URL
https://arxiv.org/abs/2507.08342