Paper Reading AI Learner

MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

2024-03-28 17:59:20
Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang

Abstract

Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely with image-based measures. Recent work leverages text instructions to let users express their search intents more freely. However, existing work primarily focuses on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can make those implicit relations explicit by synthesizing instructions via large multimodal models (LMMs) and large language models (LLMs). Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable to or better than prior state-of-the-art (SOTA) methods on eight benchmarks covering various image retrieval tasks. Remarkably, on multiple benchmarks it outperforms the previous SOTA with a 50X smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens.
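The abstract describes training on (query image, instruction, target image) triplets so that an instruction-conditioned query retrieves the target. The sketch below illustrates that setup under stated assumptions: the encoders are stand-in random features, the fusion is a hypothetical linear projection of concatenated query-side embeddings, and the loss is a standard in-batch InfoNCE contrastive objective — the paper's actual architecture and loss are not given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_TXT, D_OUT, BATCH = 8, 8, 8, 4

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical encoder outputs: random features stand in for real
# image/text towers producing one embedding per triplet element.
query_images = l2_normalize(rng.normal(size=(BATCH, D_IMG)))
instructions = l2_normalize(rng.normal(size=(BATCH, D_TXT)))
target_images = l2_normalize(rng.normal(size=(BATCH, D_OUT)))

# Hypothetical fusion: linearly project the concatenated
# (query image, instruction) embeddings into the retrieval space.
W = rng.normal(size=(D_IMG + D_TXT, D_OUT))
fused_queries = l2_normalize(
    np.concatenate([query_images, instructions], axis=-1) @ W
)

def info_nce(queries, targets, temperature=0.07):
    """In-batch contrastive loss: each fused query should score
    highest on its own target among all targets in the batch."""
    logits = queries @ targets.T / temperature
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

loss = info_nce(fused_queries, target_images)

# Retrieval at inference: rank corpus images by similarity to the fused query.
ranks = np.argsort(-(fused_queries @ target_images.T), axis=1)
```

At inference time the corpus side only needs precomputed image embeddings; the instruction changes which images rank highest for the same reference image, which is the behavior the paper's open-ended instructions target.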

Abstract (translated)

Image retrieval, i.e., finding desired images given a reference image, inherently involves rich, multi-faceted search intents that are hard to capture with image-based measures alone. Recent work uses text instructions to let users express their search intents more freely. However, existing work mainly focuses on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions enable retrieving images with richer relations than visual similarity alone. To demonstrate this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens rests on a key new insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and these implicit relations can be made explicit by synthesizing instructions with large multimodal models (LMMs) and large language models (LLMs). Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable to or better than prior state-of-the-art (SOTA) methods on a variety of image retrieval benchmarks. Notably, it outperforms the previous SOTA on multiple benchmarks with a 50X smaller model size. Additional human analyses on an unseen 1.4M-image corpus further demonstrate the diversity of search intents that MagicLens supports.

URL

https://arxiv.org/abs/2403.19651

PDF

https://arxiv.org/pdf/2403.19651.pdf
