Abstract
Image search is a pivotal task in multimedia and computer vision, with applications ranging from internet search to medical diagnostics. Conventional image search systems accept textual or visual queries and retrieve the most relevant candidates from a database. However, prevalent methods often rely on single-turn procedures, which can introduce inaccuracies and limit recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system that refines queries based on user relevance feedback in a multi-turn setting. The system incorporates a vision language model (VLM) based image captioner to enhance text-based queries, so that queries become more informative with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in the image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.
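The multi-turn loop described above (retrieve, collect relevance feedback, caption the relevant images, denoise the expanded query, retrieve again) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the function names are hypothetical, word-overlap ranking over stored descriptions stands in for a real text-image retriever, the captioner is a lookup table standing in for a VLM, and the denoiser is a filler-word filter standing in for an LLM call.

```python
from typing import Callable, Dict, List, Set

# Toy database: image id -> stored description (stand-in for image embeddings).
INDEX: Dict[str, str] = {
    "img1": "dog playing frisbee in a park",
    "img2": "puppy running on grass in a park",
    "img3": "cat sleeping on a sofa",
    "img4": "dog toy on a shelf",
}
RELEVANT: Set[str] = {"img1", "img2"}  # ground truth used to simulate the user

# Stub captioner output (stand-in for a VLM-based image captioner).
CAPTIONS: Dict[str, str] = {
    "img1": "an image of a dog with a frisbee on grass in a park",
    "img2": "an image of a puppy running on grass",
}
FILLER: Set[str] = {"a", "an", "the", "of", "image", "in", "on"}


def tokens(text: str) -> List[str]:
    return text.lower().split()


def retrieve(query: str, index: Dict[str, str], k: int) -> List[str]:
    """Rank images by word overlap with the query; ties are broken by image id."""
    q = set(tokens(query))
    ranked = sorted(index, key=lambda img: (-len(q & set(tokens(index[img]))), img))
    return ranked[:k]


def denoise(text: str) -> str:
    """Stub denoiser (stand-in for an LLM): drop filler words and duplicates."""
    return " ".join(dict.fromkeys(w for w in tokens(text) if w not in FILLER))


def interactive_search(
    query: str,
    index: Dict[str, str],
    caption: Callable[[str], str],
    is_relevant: Callable[[str], bool],
    turns: int = 2,
    k: int = 2,
) -> List[str]:
    """Refine the query over several feedback turns, then return the final top-k."""
    for _ in range(turns):
        results = retrieve(query, index, k)
        liked = [img for img in results if is_relevant(img)]
        if not liked:
            break  # no positive feedback, so there is nothing to expand with
        expanded = " ".join([query] + [caption(img) for img in liked])
        query = denoise(expanded)
    return retrieve(query, index, k)


def recall_at_k(results: List[str], relevant: Set[str]) -> float:
    return len(set(results) & relevant) / len(relevant)


single_turn = retrieve("dog playing", INDEX, k=2)
multi_turn = interactive_search(
    "dog playing", INDEX, CAPTIONS.__getitem__, RELEVANT.__contains__
)
print(single_turn, recall_at_k(single_turn, RELEVANT))
print(multi_turn, recall_at_k(multi_turn, RELEVANT))
```

In this toy setup the single-turn query misses the second relevant image because its description shares no words with the query (vocabulary mismatch), while the caption of the image marked relevant contributes terms that surface it in the next turn. The numbers produced here say nothing about the paper's reported results.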
URL
https://arxiv.org/abs/2404.18746