Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations

2023-05-23 05:00:59
Lucy Lu Wang, Yulia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey E. Kuehl, Erin Bransom, Byron C. Wallace

Abstract

Evaluating multi-document summarization (MDS) quality is difficult. This is especially true in the case of MDS for biomedical literature reviews, where models must synthesize contradicting evidence reported across different documents. Prior work has shown that, rather than performing the task, models may exploit shortcuts that are difficult to detect using standard n-gram similarity metrics such as ROUGE. Better automated evaluation metrics are needed, but few resources exist to assess metrics when they are proposed. Therefore, we introduce a dataset of human-assessed summary quality facets and pairwise preferences to encourage and support the development of better automated evaluation methods for literature review MDS. We take advantage of community submissions to the Multi-document Summarization for Literature Review (MSLR) shared task to compile a diverse and representative sample of generated summaries. We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries, with other automated metrics (including several we propose in this work), and with aspects of human-assessed summary quality. We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, but in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.
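
To make the notion of anti-correlated system rankings concrete, here is a minimal sketch of a rank-correlation check between an automated metric and human judgments. It is not the authors' analysis code: the abstract does not say which correlation statistic is used, so Spearman's rank correlation is an assumption, and the five per-system scores are made up purely for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values in ascending order (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five systems (illustrative only, not from the paper).
metric_scores = [0.31, 0.28, 0.35, 0.22, 0.40]  # e.g. an n-gram overlap metric
human_scores = [3.9, 4.1, 3.2, 4.5, 2.8]        # e.g. mean annotator rating

rho = spearman(metric_scores, human_scores)
print(f"Spearman rho between metric and human rankings: {rho:.2f}")
```

In this toy data the metric orders the systems in exactly the reverse of the human ratings, so rho sits at the negative extreme. A value near zero would instead indicate that the metric carries little information about human-assessed quality.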

URL

https://arxiv.org/abs/2305.13693

PDF

https://arxiv.org/pdf/2305.13693.pdf

