
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence

2025-03-06 23:23:13
Mohsen Fayyaz, Ali Modarressi, Hinrich Schuetze, Nanyun Peng

Abstract

Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (e.g., Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query's answer, indicating a lack of deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 3% of cases over a biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to providing no documents at all.
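
The paired comparison described in the abstract (does the retriever score the answer-containing document above a biased document without the answer?) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `toy_score` is a hypothetical length-normalized token-overlap scorer, not Dragon+ or Contriever, and the query/document triples are invented examples rather than Re-DocRED data. The sketch only shows how a win rate over such pairs might be computed.

```python
from typing import Callable, Iterable, Tuple

# (query, answer-containing document, distractor without the answer)
Triple = Tuple[str, str, str]


def toy_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a retriever score.

    Counts document tokens that also appear in the query and divides by
    document length, so short, literal documents are favored by design.
    """
    query_tokens = set(query.lower().split())
    doc_tokens = doc.lower().split()
    if not doc_tokens:
        return 0.0
    overlap = sum(1 for token in doc_tokens if token in query_tokens)
    return overlap / len(doc_tokens)


def evidence_win_rate(
    score: Callable[[str, str], float], triples: Iterable[Triple]
) -> float:
    """Fraction of triples where the answer-containing document scores higher."""
    triples = list(triples)
    wins = sum(
        1 for query, evidence, distractor in triples
        if score(query, evidence) > score(query, distractor)
    )
    return wins / len(triples)


# Invented examples (assumption): the first distractor is short and literal
# but omits the answer; the second is long and only loosely related.
triples = [
    (
        "where was marie curie born",
        "Marie Curie was a physicist who did pioneering research on "
        "radioactivity and she was born in Warsaw",
        "Marie Curie was a physicist",
    ),
    (
        "who wrote hamlet",
        "Shakespeare wrote Hamlet",
        "Hamlet is a tragedy set in Denmark with many characters",
    ),
]

rate = evidence_win_rate(toy_score, triples)
print(f"answer-containing document preferred in {rate:.0%} of pairs")
```

In this toy setup the short distractor in the first pair outscores the longer document that actually contains the answer, which mirrors the kind of brevity and literal-match bias the abstract reports. A real evaluation would replace `toy_score` with the similarity produced by an actual dense retriever.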


URL

https://arxiv.org/abs/2503.05037

PDF

https://arxiv.org/pdf/2503.05037.pdf

