Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization

2025-07-11 06:44:52
Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty

Abstract

Automatic n-gram-based metrics such as ROUGE are widely used for evaluating generative tasks such as summarization. While these metrics are considered indicative (even if imperfect) of human evaluation for English, their suitability for other languages remains unclear. To address this, we systematically assess evaluation metrics for generation, both n-gram-based and neural-based, to gauge their effectiveness across languages and tasks. Specifically, we design a large-scale evaluation suite across eight languages from four typological families (agglutinative, isolating, low-fusional, and high-fusional), spanning both low- and high-resource settings, and analyze how well each metric correlates with human judgments. Our findings highlight the sensitivity of evaluation metrics to language type. For example, in fusional languages, n-gram-based metrics correlate less well with human assessments than they do in isolating and agglutinative languages. We also demonstrate that proper tokenization can significantly mitigate this issue for morphologically rich fusional languages, sometimes even reversing negative trends. Additionally, we show that neural metrics trained specifically for evaluation, such as COMET, consistently outperform other neural metrics and correlate better with human judgments in low-resource languages. Overall, our analysis highlights the limitations of n-gram metrics for fusional languages and advocates for greater investment in neural metrics trained for evaluation tasks.
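To make the tokenization point concrete, here is a minimal sketch of why a surface-form n-gram metric can undercount overlap when function morphemes are attached to words. It is not the authors' pipeline: the abstract does not say which tokenizer or ROUGE implementation was used. The function `unigram_f1` is a simplified ROUGE-1-style score, and `TOY_PREFIXES`, `toy_segment`, and the example strings are hypothetical, made up only for illustration.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """ROUGE-1-style F1 computed from clipped unigram overlap."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical prefix inventory (an assumption for this toy example only).
TOY_PREFIXES = ("ha", "ve")


def toy_segment(word):
    """Split off one known prefix if enough of the word remains."""
    for prefix in TOY_PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            return [prefix, word[len(prefix):]]
    return [word]


def segment(text):
    """Whitespace-split, then apply the toy prefix segmentation per word."""
    return [piece for word in text.split() for piece in toy_segment(word)]


# Made-up strings in which the same stems carry different attached prefixes.
reference = "hayeled raa kelev"
candidate = "veyeled raa hakelev"

whitespace_score = unigram_f1(candidate.split(), reference.split())
segmented_score = unigram_f1(segment(candidate), segment(reference))

print(f"whitespace tokens: {whitespace_score:.3f}")
print(f"segmented tokens:  {segmented_score:.3f}")
```

With whitespace tokens only one word matches exactly, while after segmentation the shared stems become visible to the metric, so the score rises. A real study would replace the toy segmenter with a proper morphological analyzer and then check correlation with human ratings rather than raw scores.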


URL

https://arxiv.org/abs/2507.08342

PDF

https://arxiv.org/pdf/2507.08342.pdf

