Abstract
Information retrieval (IR) systems are crucial tools for accessing information and are widely applied in search engines, question answering, and recommendation systems. Traditional IR methods, which return ranked lists of documents based on similarity matching, have been a reliable means of information acquisition and have dominated the IR field for years. With the advancement of pre-trained language models, generative information retrieval (GenIR) has emerged as a novel paradigm and has gained increasing attention in recent years. Current research in GenIR falls into two categories: generative document retrieval (GR) and reliable response generation. GR leverages the generative model's parameters to memorize documents, enabling retrieval by directly generating relevant document identifiers without explicit indexing. Reliable response generation, on the other hand, employs language models to directly generate the information users seek, breaking the limitations of traditional IR in terms of document granularity and relevance matching and offering more flexibility, efficiency, and creativity, thus better meeting practical needs. This paper systematically reviews the latest research progress in GenIR. We summarize advances in GR regarding model training, document identifiers, incremental learning, downstream task adaptation, multi-modal GR, and generative recommendation, as well as progress in reliable response generation in terms of internal knowledge memorization, external knowledge augmentation, response generation with citations, and personal information assistants. We also review evaluation, challenges, and future prospects for GenIR systems. This review aims to offer a comprehensive reference for researchers in the GenIR field and to encourage further development in this area.
URL
https://arxiv.org/abs/2404.14851