Paper Reading AI Learner

On Positional Bias of Faithfulness for Long-form Summarization

2024-10-31 03:50:15
David Wan, Jesse Vig, Mohit Bansal, Shafiq Joty

Abstract

Large Language Models (LLMs) often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs. We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias. To consistently evaluate faithfulness, we first compile a benchmark of eight human-annotated long-form summarization datasets and perform a meta-evaluation of faithfulness metrics. We show that LLM-based faithfulness metrics, though effective with full-context inputs, remain sensitive to document order, indicating positional bias. Analyzing LLM-generated summaries across six datasets, we find a "U-shaped" trend in faithfulness, where LLMs faithfully summarize the beginning and end of documents but neglect middle content. Perturbing document order similarly reveals that models are less faithful when important documents are placed in the middle of the input. We find that this behavior is partly due to shifting focus with context length: as context increases, summaries become less faithful, but beyond a certain length, faithfulness improves as the model focuses on the end. Finally, we experiment with different generation techniques to reduce positional bias and find that prompting techniques effectively direct model attention to specific positions, whereas more sophisticated approaches offer limited improvements. Our data and code are available at this https URL.
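The document-order perturbation described above can be illustrated with a minimal sketch (not the authors' released code; the function names `place_at` and `build_prompt` and the prompt wording are assumptions for illustration): a designated "important" document is moved to the beginning, middle, or end of the input before summarization, so that faithfulness can be compared across positions.

```python
# Illustrative sketch of the order-perturbation probe from the abstract.
# NOT the authors' implementation; names and prompt format are hypothetical.

def place_at(docs, important_idx, position):
    """Return a reordering of `docs` with the document at `important_idx`
    moved to `position`: 'begin', 'middle', or 'end'."""
    rest = [d for i, d in enumerate(docs) if i != important_idx]
    important = docs[important_idx]
    if position == "begin":
        return [important] + rest
    if position == "end":
        return rest + [important]
    mid = len(rest) // 2  # 'middle': splice into the center of the rest
    return rest[:mid] + [important] + rest[mid:]

def build_prompt(docs):
    """Concatenate the ordered documents into one summarization prompt."""
    body = "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))
    return f"Summarize the following documents.\n\n{body}\n\nSummary:"

docs = ["doc A", "doc B (important)", "doc C", "doc D", "doc E"]
for pos in ("begin", "middle", "end"):
    ordered = place_at(docs, important_idx=1, position=pos)
    prompt = build_prompt(ordered)
    # Summaries generated from each ordering would then be scored for
    # faithfulness; the abstract reports lower faithfulness when the
    # important document sits in the middle (a "U-shaped" trend).
```

A faithfulness metric (e.g., an LLM-based one, which the paper shows is itself order-sensitive) would then score the summary produced from each ordering.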

URL

https://arxiv.org/abs/2410.23609

PDF

https://arxiv.org/pdf/2410.23609.pdf

