Paper Reading AI Learner

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

2024-04-29 14:46:35
Hongyi Zhu, Jia-Hong Huang, Stevan Rudinac, Evangelos Kanoulas

Abstract

Image search is a pivotal task in multimedia and computer vision, with applications across diverse domains ranging from internet search to medical diagnostics. Conventional image search systems accept textual or visual queries and retrieve the most relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, which can introduce inaccuracies and limit recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in terms of recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.
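
The multi-turn loop described in the abstract (retrieve, collect relevance feedback, caption the liked images, denoise the expanded query, retrieve again) could be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: the abstract does not name the retriever, the VLM captioner, the LLM denoiser, or any prompts, so every function and variable name here (retrieve, toy_denoiser, interactive_search, the stopword list, the four-image index) is a hypothetical stand-in. Token overlap replaces the real text-image retriever, stored descriptions replace VLM captions, and a stopword filter replaces the LLM denoiser.

```python
from typing import Callable, Dict, List, Set

# Assumption: a tiny stopword list stands in for whatever the LLM denoiser would discard.
STOPWORDS = {"a", "an", "the", "on", "at", "in", "with"}


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(query: str, index: Dict[str, str], k: int) -> List[str]:
    """Toy retriever: rank images by token overlap with the query (ties broken by id)."""
    q = tokenize(query)
    ranked = sorted(index.items(), key=lambda kv: (-len(q & tokenize(kv[1])), kv[0]))
    return [img_id for img_id, _ in ranked[:k]]


def recall_at_k(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant images that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def toy_denoiser(query: str, captions: List[str]) -> str:
    """Stand-in for the LLM denoiser: merge query and captions, drop stopwords and repeats."""
    kept: List[str] = []
    for word in " ".join([query, *captions]).lower().split():
        if word not in STOPWORDS and word not in kept:
            kept.append(word)
    return " ".join(kept)


def interactive_search(
    query: str,
    index: Dict[str, str],
    captioner: Callable[[str], str],
    denoiser: Callable[[str, List[str]], str],
    feedback: Callable[[List[str]], List[str]],
    k: int,
    turns: int,
) -> List[str]:
    """Run `turns` retrieval rounds, rewriting the query after each round of feedback."""
    results = retrieve(query, index, k)
    for _ in range(turns - 1):
        liked = feedback(results)                        # images the user marks as relevant
        if not liked:
            break
        captions = [captioner(img_id) for img_id in liked]
        query = denoiser(query, captions)                # expanded, cleaned-up query
        results = retrieve(query, index, k)
    return results


# Hypothetical four-image collection; the descriptions double as "captions".
index = {
    "img1": "a man playing guitar on stage",
    "img2": "a guitar player performing at a concert",
    "img3": "a dog running in a park",
    "img4": "a band performing at a concert with lights",
}
relevant = {"img1", "img2", "img4"}


def captioner(img_id: str) -> str:
    return index[img_id]                                 # stand-in for a VLM captioner


def feedback(results: List[str]) -> List[str]:
    return [r for r in results if r in relevant]         # simulated user feedback


single_turn = retrieve("man playing guitar", index, k=3)
two_turns = interactive_search(
    "man playing guitar", index, captioner, toy_denoiser, feedback, k=3, turns=2
)
print(single_turn, recall_at_k(single_turn, relevant))
print(two_turns, recall_at_k(two_turns, relevant))
```

In this toy collection the single-turn query misses img4 because it shares no words with the query (a small instance of vocabulary mismatch), while the rewritten query picks up "performing" and "concert" from the captions of the liked images and retrieves it on the second turn. The numbers say nothing about the paper's reported results; they only illustrate why query rewriting from feedback can raise recall.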

URL

https://arxiv.org/abs/2404.18746

PDF

https://arxiv.org/pdf/2404.18746.pdf

