What's under the hood: Investigating Automatic Metrics on Meeting Summarization

2024-04-17 07:15:07
Frederic Kirstein, Jan Philip Wahle, Terry Ruas, Bela Gipp

Abstract

Meeting summarization has become a critical task considering the increase in online interactions. While new techniques are introduced regularly, their evaluation uses metrics not designed to capture meeting-specific errors, undermining effective evaluation. This paper investigates what the frequently used automatic metrics capture and which errors they mask by correlating automatic metric scores with human evaluations across a broad error taxonomy. We begin with a comprehensive literature review on English meeting summarization to define key challenges, such as speaker dynamics and contextual turn-taking, and error types, such as missing information and linguistic inaccuracy, concepts previously only loosely defined in the field. We examine the relationship between characteristic challenges and errors using annotated transcripts and summaries from Transformer-based sequence-to-sequence and autoregressive models on the general summary QMSum dataset. Through experimental validation, we find that different model architectures respond differently to challenges in meeting transcripts, resulting in differently pronounced links between challenges and errors. The metrics currently used by default struggle to capture observable errors, showing weak to moderate correlations, while a third of the correlations show trends of error masking. Only a subset of metrics reacts accurately to specific errors, while most correlations show either unresponsiveness or a failure to reflect the error's impact on summary quality.
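
The correlation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the choice of Spearman rank correlation, the function names, and the sample numbers are all assumptions made for illustration. Under this reading, a metric that reacts to an error type should correlate negatively with human-rated error severity, a value near zero suggests unresponsiveness, and a positive value would hint at error masking.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one automatic metric score and one human error-severity
# rating (0 = error absent, higher = more severe) per generated summary.
metric_scores = [0.42, 0.38, 0.51, 0.30, 0.47]
error_severity = [1, 2, 0, 3, 1]

rho = spearman(metric_scores, error_severity)
print(f"Spearman rho between metric and error severity: {rho:.2f}")
```

Rank correlation is used here because human error ratings are ordinal; whether the paper uses this coefficient or another one is not stated in the abstract.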

URL

https://arxiv.org/abs/2404.11124

PDF

https://arxiv.org/pdf/2404.11124.pdf

