
LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization

2023-01-30 21:31:48
Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, Kyle Lo

Abstract

While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units correlate highly with scores from a full annotation workload (0.89 Kendall's tau using 50% judgments). We release our human judgments, annotation templates, and our software as a Python library for future research.
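
To make the partial-annotation claim concrete, the following is a minimal sketch (not the released LongEval library) of how one might compare faithfulness scores computed from a random 50% sample of fine-grained (clause-level) judgments against scores from the full set, using Kendall's tau from SciPy. The toy judgment data, function names, and scoring scheme are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch, assuming faithfulness is scored as the fraction of
# fine-grained units judged faithful (1 = faithful, 0 = unfaithful).
import random
from scipy.stats import kendalltau

def faithfulness_score(judgments):
    """Fraction of fine-grained units judged faithful."""
    return sum(judgments) / len(judgments)

# Hypothetical clause-level judgments for a handful of summaries.
summaries = [
    [1, 1, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
]

random.seed(0)
full_scores, partial_scores = [], []
for judgments in summaries:
    full_scores.append(faithfulness_score(judgments))
    # Keep a random 50% of the clause-level judgments to simulate partial annotation.
    sampled = random.sample(judgments, k=len(judgments) // 2)
    partial_scores.append(faithfulness_score(sampled))

tau, p_value = kendalltau(full_scores, partial_scores)
print(f"Kendall's tau between partial and full scores: {tau:.2f} (p={p_value:.2f})")
```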


URL

https://arxiv.org/abs/2301.13298

PDF

https://arxiv.org/pdf/2301.13298.pdf
