Abstract
We study the task of Composed Image Retrieval (CoIR), where a query is composed of two modalities, image and text, extending the user's expressive ability. Previous methods typically address this task by encoding each query modality separately, followed by late fusion of the extracted features. In this paper, we propose a new approach, Cross-Attention driven Shift Encoder (CASE), employing early fusion between modalities through a cross-attention module with an additional auxiliary task. We show that our method outperforms the existing state-of-the-art on established benchmarks (FashionIQ and CIRR) by a large margin. However, CoIR datasets are a few orders of magnitude smaller than other vision-and-language (V&L) datasets, and some suffer from serious flaws (e.g., queries with a redundant modality). We address these shortcomings by introducing Large Scale Composed Image Retrieval (LaSCo), a new CoIR dataset ten times larger than current ones. Pre-training on LaSCo yields a further performance boost. We further suggest a new analysis of CoIR datasets and methods for detecting modality redundancy or necessity in queries.
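The early-fusion idea can be illustrated with a minimal single-head cross-attention sketch, where text tokens attend over image patch tokens so the two modalities interact before any pooled representation is formed. This is an illustrative sketch only, not the authors' CASE implementation; the token counts, dimensions, and identity Q/K/V projections are assumptions made for brevity.

```python
import numpy as np

def cross_attention(text_tokens, image_tokens):
    """Single-head cross-attention: text queries attend to image keys/values.

    A minimal early-fusion sketch (not the paper's CASE architecture):
    identity Q/K/V projections are used instead of learned linear layers.
    """
    d_k = text_tokens.shape[-1]
    q, k, v = text_tokens, image_tokens, image_tokens
    scores = q @ k.T / np.sqrt(d_k)                        # (n_text, n_image)
    # Numerically stable row-wise softmax over image positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                     # (n_text, d) fused tokens

# Hypothetical sizes: 7 text tokens and 49 image patches, both of dim 64.
rng = np.random.default_rng(0)
text = rng.normal(size=(7, 64))
image = rng.normal(size=(49, 64))
fused = cross_attention(text, image)
```

In a late-fusion pipeline, by contrast, the image and text would each be encoded and pooled independently, and only their final vectors combined.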
Abstract (translated)
We study the task of Composed Image Retrieval (CoIR), in which a query is composed of two modalities, image and text, extending the user's expressive ability. Previous methods typically address this task by encoding each query modality separately and then applying late fusion to the extracted features. In this paper, we propose a new method, Cross-Attention driven Shift Encoder (CASE), which employs early fusion between modalities through a cross-attention module with an additional auxiliary task. We show that our method outperforms the existing state-of-the-art on established benchmarks (FashionIQ and CIRR) by a large margin. However, CoIR datasets are several orders of magnitude smaller than other vision-and-language (V&L) datasets, and some suffer from serious flaws (e.g., queries with a redundant modality). We address these shortcomings by introducing the Large Scale Composed Image Retrieval (LaSCo) dataset, which is ten times larger than current CoIR datasets. Pre-training on LaSCo further improves performance. We also propose a new analysis of CoIR datasets and methods for detecting modality redundancy or necessity in queries.
URL
https://arxiv.org/abs/2303.09429