Paper Reading AI Learner

Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval

2024-03-24 04:23:56
Yucheng Suo, Fan Ma, Linchao Zhu, Yi Yang

Abstract

We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which aims to retrieve the target image given a reference image and a description, without training on triplet datasets. Previous works generate pseudo-word tokens by projecting the reference image features into the text embedding space. However, they focus on the global visual representation, ignoring the representation of detailed attributes, e.g., color, object count, and layout. To address this challenge, we propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs). KEDs implicitly models the attributes of the reference images by incorporating a database. The database enriches the pseudo-word tokens by providing relevant images and captions, emphasizing shared attribute information in various aspects. In this way, KEDs recognizes the reference image from diverse perspectives. Moreover, KEDs adopts an extra stream that aligns pseudo-word tokens with textual concepts, leveraging pseudo-triplets mined from image-text pairs. The pseudo-word tokens generated in this stream are explicitly aligned with fine-grained semantics in the text embedding space. Extensive experiments on widely used benchmarks, i.e., ImageNet-R, COCO object, Fashion-IQ, and CIRR, show that KEDs outperforms previous zero-shot composed image retrieval methods.
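The abstract's core mechanism — projecting a reference image feature into the text embedding space as a pseudo-word token, then enriching it with attribute information retrieved from a database — can be illustrated with a minimal numpy sketch. Everything here is hypothetical (the dimensions, the single linear projection `W_proj`, and the softmax-attention fusion in `enrich` are stand-ins; the paper's actual networks are learned and more elaborate), but it shows the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

D_IMG, D_TXT, N_DB = 512, 512, 8  # hypothetical feature dims and database size

# Hypothetical projection network ("textual inversion"): maps a global image
# feature to a pseudo-word token living in the text embedding space.
W_proj = rng.standard_normal((D_IMG, D_TXT)) / np.sqrt(D_IMG)

def pseudo_token(image_feat):
    """Project a reference-image feature into the text embedding space."""
    return image_feat @ W_proj

def enrich(token, db_feats):
    """Knowledge-enhancement sketch: softmax-attend over database entry
    features (e.g. caption embeddings) so the pseudo-token absorbs shared
    attribute information, fused back via a residual connection."""
    scores = db_feats @ token                 # similarity to each database entry
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax attention weights
    return token + weights @ db_feats         # residual fusion of retrieved context

image_feat = rng.standard_normal(D_IMG)       # stand-in for a CLIP image feature
db_feats = rng.standard_normal((N_DB, D_TXT)) # stand-in database features

token = enrich(pseudo_token(image_feat), db_feats)
print(token.shape)  # (512,)
```

In the full method this enriched pseudo-word token is inserted into the relative description (e.g. "a photo of [token] that ...") and encoded by the text encoder for retrieval; the second stream trains the projection so the token aligns with fine-grained textual semantics.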

URL

https://arxiv.org/abs/2403.16005

PDF

https://arxiv.org/pdf/2403.16005.pdf

