Paper Reading AI Learner

Data Roaming and Early Fusion for Composed Image Retrieval

2023-03-16 16:02:24
Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski

Abstract

We study the task of Composed Image Retrieval (CoIR), where a query is composed of two modalities, image and text, extending the user's expression ability. Previous methods typically address this task by encoding each query modality separately, followed by late fusion of the extracted features. In this paper, we propose a new approach, Cross-Attention driven Shift Encoder (CASE), employing early fusion between modalities through a cross-attention module with an additional auxiliary task. We show that our method outperforms the existing state-of-the-art on established benchmarks (FashionIQ and CIRR) by a large margin. However, CoIR datasets are a few orders of magnitude smaller than other vision and language (V&L) datasets, and some suffer from serious flaws (e.g., queries with a redundant modality). We address these shortcomings by introducing Large Scale Composed Image Retrieval (LaSCo), a new CoIR dataset ten times larger than current ones. Pre-training on LaSCo yields a further performance boost. We further suggest a new analysis of CoIR datasets and methods for detecting modality redundancy or necessity in queries.
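The early-fusion idea described above — letting text tokens interact with image features through cross-attention before any pooled query representation is formed, rather than fusing two independently encoded vectors late — can be sketched as follows. This is a minimal illustrative block using PyTorch's `nn.MultiheadAttention`, not the paper's actual CASE architecture; the class name, dimensions, and residual/normalization choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative early-fusion block (hypothetical sketch, not the
    paper's CASE encoder): text tokens attend to image patch embeddings
    via cross-attention, so the modalities mix token-by-token instead of
    being combined only after separate encoding (late fusion)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # Queries come from the text; keys and values come from the image,
        # so each text token is updated by the image content it refers to.
        fused, _ = self.cross_attn(text_tokens, image_patches, image_patches)
        # Residual connection plus normalization, a common transformer pattern.
        return self.norm(text_tokens + fused)

# Toy shapes: batch of 2 queries, 8 text tokens, 49 image patches, dim 256.
fusion = CrossAttentionFusion()
text = torch.randn(2, 8, 256)
image = torch.randn(2, 49, 256)
out = fusion(text, image)
print(out.shape)  # torch.Size([2, 8, 256])
```

In a retrieval setup, the fused token sequence would then be pooled into a single query embedding and matched against a gallery of target-image embeddings; the contrast with late fusion is that here the image already shapes every text token before pooling.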

URL

https://arxiv.org/abs/2303.09429

PDF

https://arxiv.org/pdf/2303.09429.pdf
