Abstract
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which aims to retrieve a target image given a reference image and a textual description, without training on triplet datasets. Previous works generate pseudo-word tokens by projecting the reference image features into the text embedding space. However, they focus on the global visual representation, ignoring the representation of detailed attributes, e.g., color, object count, and layout. To address this challenge, we propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs). KEDs implicitly models the attributes of the reference images by incorporating a database. The database enriches the pseudo-word tokens by providing relevant images and captions, emphasizing shared attribute information in various aspects. In this way, KEDs recognizes the reference image from diverse perspectives. Moreover, KEDs adopts an extra stream that aligns pseudo-word tokens with textual concepts, leveraging pseudo-triplets mined from image-text pairs. The pseudo-word tokens generated in this stream are explicitly aligned with fine-grained semantics in the text embedding space. Extensive experiments on widely used benchmarks, i.e., ImageNet-R, COCO object, Fashion-IQ, and CIRR, show that KEDs outperforms previous zero-shot composed image retrieval methods.
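To make the pipeline the abstract describes concrete, below is a minimal sketch of the generic ZS-CIR recipe: project the reference image's global feature into the text-token space as a pseudo-word token, compose it with the description embedding, and rank a gallery by cosine similarity. This is an illustrative assumption of the common setup, not the KEDs implementation; the linear projection, the blending-based composition, and all names here are hypothetical (real systems insert the pseudo-token into a prompt such as "a photo of [S*] ..." and re-encode it with a frozen text encoder).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # embedding width, chosen arbitrarily for illustration


def project_to_pseudo_token(image_feat, W, b):
    """Map a global image feature into the text-token space
    (the 'pseudo-word token' step described in the abstract).
    W and b stand in for a learned mapping network."""
    return W @ image_feat + b


def compose_query(pseudo_token, text_feat, alpha=0.5):
    """Naive composition: blend the pseudo-word token with the
    description embedding and L2-normalize the result."""
    q = alpha * pseudo_token + (1 - alpha) * text_feat
    return q / np.linalg.norm(q)


def retrieve(query, gallery):
    """Rank gallery images by cosine similarity to the composed query,
    highest similarity first."""
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(gallery @ query))


# Toy example with random features standing in for CLIP-style embeddings.
image_feat = rng.normal(size=DIM)
text_feat = rng.normal(size=DIM)
W, b = rng.normal(size=(DIM, DIM)) * 0.01, np.zeros(DIM)
gallery = rng.normal(size=(10, DIM))

pseudo_token = project_to_pseudo_token(image_feat, W, b)
ranking = retrieve(compose_query(pseudo_token, text_feat), gallery)
print(ranking[:3])  # indices of the top-3 retrieved gallery images
```

KEDs differs from this baseline in the two ways the abstract states: the pseudo-token is enriched with attribute cues retrieved from an external image-caption database, and a second stream aligns the tokens with fine-grained textual semantics using mined pseudo-triplets.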
URL
https://arxiv.org/abs/2403.16005