Abstract
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from a variety of sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 of the 15 datasets in the BGE collection shrinks the training set by 2.35$\times$ yet increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach that uses cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives as true positives improves both the E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotations, which show that GPT-4o's judgments agree with human judgments far more often than GPT-4o-mini's.
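To make the cascading idea concrete, below is a minimal sketch of such a relabeling pipeline: a cheap judge (GPT-4o-mini) screens every query/hard-negative pair, and only pairs it flags as relevant are escalated to the stronger judge (GPT-4o) for a final verdict. The prompt wording, function names, and binary-answer protocol are illustrative assumptions, not the paper's actual prompts.

```python
# Sketch of a cascading LLM judge for false-negative relabeling.
# The two-stage design (cheap model screens, stronger model confirms)
# follows the abstract; prompt text and helpers are assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Query: {query}\n"
    "Passage: {passage}\n"
    "Does the passage answer the query? Reply with exactly RELEVANT or IRRELEVANT."
)

def judge(model: str, query: str, passage: str) -> bool:
    """Ask `model` for a binary relevance judgment."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(query=query, passage=passage),
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("RELEVANT")

def relabel_hard_negative(query: str, passage: str) -> str:
    """Cascade: most pairs are settled by the cheap judge; only the pairs
    it flags as relevant incur a call to the expensive judge, which makes
    the final call before a negative is flipped to a positive."""
    if not judge("gpt-4o-mini", query, passage):
        return "negative"  # cheap judge agrees with the original label
    if judge("gpt-4o", query, passage):
        return "positive"  # stronger judge confirms a false negative
    return "negative"      # escalated but not confirmed; keep the label
```

The design choice the abstract implies is cost asymmetry: the inexpensive model filters the bulk of pairs, so the expensive, human-aligned model is only consulted on the small candidate set of suspected false negatives.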
URL
https://arxiv.org/abs/2505.16967