Abstract
Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns, over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query's answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document over a biased document without the answer in less than 3% of cases. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to not providing any documents at all.
URL
https://arxiv.org/abs/2503.05037