DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine

2022-03-19 03:24:53

Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu, Haifeng Wang

arXiv_CL

Abstract

In this paper, we present DuReader_retrieval, a large-scale Chinese dataset for passage retrieval. DuReader_retrieval contains more than 90K queries and over 8M unique passages from Baidu search. To ensure the quality of our benchmark and address the shortcomings of existing datasets, we (1) reduce the false negatives in the development and testing sets by pooling the results from multiple retrievers and having them annotated by humans, and (2) remove questions in the training set that are semantically similar to questions in the development and testing sets. We further introduce two extra out-of-domain testing sets for benchmarking domain generalization capability. Our experimental results demonstrate that DuReader_retrieval is challenging and that there is still plenty of room for the community to improve, e.g. in generalization across domains, in handling salient-phrase and syntax mismatches between query and paragraph, and in robustness. DuReader_retrieval will be publicly available at this https URL.
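
The pooling step in (1) can be illustrated with a small sketch. This is not the authors' pipeline: it only assumes that each retriever returns a ranked list of passage ids per query, that the top-k lists are merged, and that passages without an existing label are sent to human annotators. The names pool_candidates and to_annotate, the retriever names, and the cutoff k are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical layout: runs[retriever_name][query_id] -> ranked passage ids
Runs = Dict[str, Dict[str, List[str]]]


def pool_candidates(runs: Runs, k: int = 5) -> Dict[str, List[str]]:
    """Merge the top-k passages of every retriever per query, first-seen order."""
    pooled: Dict[str, List[str]] = {}
    for ranking_by_query in runs.values():
        for query_id, ranked in ranking_by_query.items():
            merged = pooled.setdefault(query_id, [])
            for passage_id in ranked[:k]:
                if passage_id not in merged:
                    merged.append(passage_id)
    return pooled


def to_annotate(
    pooled: Dict[str, List[str]], labeled: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Keep only pooled passages that have no relevance label yet."""
    return {
        query_id: [p for p in passages if p not in labeled.get(query_id, set())]
        for query_id, passages in pooled.items()
    }


# Toy data (made up for illustration only).
runs = {
    "sparse": {"q1": ["p1", "p2", "p3"]},
    "dense": {"q1": ["p2", "p4", "p5"]},
}
pooled = pool_candidates(runs, k=2)
pending = to_annotate(pooled, labeled={"q1": {"p1"}})
print(pooled)   # {'q1': ['p1', 'p2', 'p4']}
print(pending)  # {'q1': ['p2', 'p4']}
```

Passages that annotators judge relevant would then be added to the positives for that query, which is how pooling lowers the number of false negatives in an evaluation set.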

URL

https://arxiv.org/abs/2203.10232

PDF

https://arxiv.org/pdf/2203.10232.pdf