Paper Reading AI Learner

Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder

2023-05-25 17:56:24
Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould

Abstract

Composed image retrieval aims to find the image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and, at test time, compare these to a reference image embedding modified by the query text. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially when done independently of the potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, the computational cost is prohibitive for large-scale datasets since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distance metric and performs fast pruning among the candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input reference-text-candidate triplet and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
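The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual model: the embeddings are random, and `fuse_query` and `triplet_score` are hypothetical stand-ins for the learned fusion network and the second-stage dual encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus of pre-computed, unit-normalized candidate embeddings.
N, D, K = 1000, 64, 10
corpus = rng.normal(size=(N, D))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def fuse_query(ref_emb, text_emb):
    """Stand-in for the learned fusion of the reference image and text
    embeddings; the actual method uses a trained network here."""
    fused = ref_emb + text_emb
    return fused / np.linalg.norm(fused)

def triplet_score(ref_emb, text_emb, cand_emb):
    """Stand-in for the second-stage dual encoder, which in the paper
    attends jointly over the (reference, text, candidate) triplet."""
    return float(fuse_query(ref_emb, text_emb) @ cand_emb)

# A query: one reference image embedding plus one text embedding.
ref = rng.normal(size=D); ref /= np.linalg.norm(ref)
txt = rng.normal(size=D); txt /= np.linalg.norm(txt)

# Stage 1: fast pruning by vector distance over the whole corpus.
query = fuse_query(ref, txt)
sims = corpus @ query               # cosine similarity (unit vectors)
top_k = np.argsort(-sims)[:K]       # keep only the K best candidates

# Stage 2: expensive per-triplet scoring, now affordable on K items only.
reranked = sorted(top_k, key=lambda i: -triplet_score(ref, txt, corpus[i]))
print(reranked[0])
```

The key point is the cost split: stage 1 is a single matrix-vector product over all N candidates, while the expensive triplet scorer runs only K times.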


URL

https://arxiv.org/abs/2305.16304

PDF

https://arxiv.org/pdf/2305.16304.pdf
