Abstract
Meeting summarization has become a critical task given the increase in online interactions. While new techniques are introduced regularly, they are evaluated with metrics not designed to capture meeting-specific errors, undermining effective evaluation. This paper investigates what the frequently used automatic metrics capture and which errors they mask by correlating automatic metric scores with human evaluations across a broad error taxonomy. We begin with a comprehensive literature review on English meeting summarization to define key challenges, such as speaker dynamics and contextual turn-taking, and error types, such as missing information and linguistic inaccuracy, concepts previously only loosely defined in the field. We then examine the relationship between these characteristic challenges and errors using annotated transcripts and summaries produced by Transformer-based sequence-to-sequence and autoregressive models on the general summaries of the QMSum dataset. Through experimental validation, we find that different model architectures respond differently to the challenges in meeting transcripts, resulting in differently pronounced links between challenges and errors. The metrics currently used by default struggle to capture observable errors, showing only weak to moderate correlations, and a third of the correlations show trends of error masking. Only a subset of metrics reacts accurately to specific errors, while most correlations show either unresponsiveness or a failure to reflect the error's impact on summary quality.
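The correlation analysis the abstract describes can be sketched as ranking summaries by an automatic metric score and by human-annotated error counts, then checking how the two rankings agree. The sketch below uses Spearman rank correlation on purely illustrative numbers (the scores and error counts are hypothetical, not taken from the paper); a strongly negative rho would mean the metric penalizes errors, while a weak |rho| would suggest error masking.

```python
def ranks(values):
    # Assign ranks 1..n by ascending value (assumes no ties,
    # which holds for the toy data below).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(xs, ys):
    # Spearman correlation via the rank-difference formula:
    # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-summary automatic metric scores (e.g., ROUGE-L F1)
metric_scores = [0.42, 0.35, 0.51, 0.28, 0.47]
# Hypothetical human-annotated error counts for the same summaries
error_counts = [3, 5, 2, 6, 1]

rho = spearman_rho(metric_scores, error_counts)
print(f"Spearman rho = {rho:.2f}")  # -0.90: metric tracks errors well here
```

In practice such an analysis would be run per metric and per error type over many annotated summaries; libraries such as `scipy.stats.spearmanr` provide the same computation with tie handling.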
URL
https://arxiv.org/abs/2404.11124