Semi-Parametric Retrieval via Binary Token Index

2024-05-03 08:34:13
Jiawei Zhou, Li Dong, Furu Wei, Lei Chen


Information retrieval has broadened from search services into a critical component of many advanced applications, where indexing efficiency, cost-effectiveness, and freshness are increasingly important yet remain underexplored. To address these demands, we introduce Semi-parametric Vocabulary Disentangled Retrieval (SVDR), a novel semi-parametric retrieval framework that supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods, and a binary token index that allows quick and cost-effective setup, resembling traditional term-based retrieval. In our evaluation on three open-domain question answering benchmarks, with the entire Wikipedia as the retrieval corpus, SVDR outperforms both baselines. It achieves a 3% higher top-1 retrieval accuracy than the dense retriever DPR when using an embedding-based index, and a 9% higher top-1 accuracy than BM25 when using a binary token index. Compared with an embedding-based index, adopting a binary token index reduces index preparation time from 30 GPU hours to just 2 CPU hours and storage size from 31 GB to 2 GB, achieving a 90% reduction.
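To make the idea of a binary token index more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that such an index only records whether a token occurs in a document (no counts, no embeddings) and that a query arrives as a set of per-token weights. The whitespace tokenizer, the function names `build_binary_token_index` and `search`, and the hand-picked query weights are all hypothetical placeholders; how SVDR actually produces query-side scores is not described in the abstract.

```python
from collections import defaultdict

PUNCT = ".,!?;:\"'()"

def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real subword tokenizer)."""
    tokens = (tok.strip(PUNCT) for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def build_binary_token_index(docs):
    """Map each token to the set of doc ids containing it (presence only)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for tok in set(tokenize(text)):
            index[tok].add(doc_id)
    return index

def search(index, query_weights, top_k=3):
    """Score each doc as the sum of query weights over tokens it contains."""
    scores = defaultdict(float)
    for tok, weight in query_weights.items():
        for doc_id in index.get(tok, ()):
            scores[doc_id] += weight
    # Highest score first; ties broken by the smaller doc id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

docs = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Seine flows through Paris.",
]
index = build_binary_token_index(docs)

# Hypothetical query-side weights, chosen by hand for illustration only.
query_weights = {"capital": 1.0, "france": 2.0}
results = search(index, query_weights)
print(results)
```

Building this kind of index needs only tokenization, with no neural forward pass over the corpus, which is consistent with the abstract's claim that it can be prepared on CPU. The design choice in the sketch (an inverted index of token presence) is one common way to store binary document vectors compactly; the paper may use a different layout.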




