Paper Reading AI Learner

MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

2024-03-28 17:59:20
Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang

Abstract

Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely with image-based measures. Recent work leverages text instructions to let users express their search intents more freely. However, existing work primarily focuses on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can make those implicit relations explicit by synthesizing instructions via large multimodal models (LMMs) and large language models (LLMs). Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable to or better than prior state-of-the-art (SOTA) methods on eight benchmarks covering various image retrieval tasks. Remarkably, on multiple benchmarks it outperforms the previous SOTA with a 50X smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens.
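The abstract describes training on (query image, instruction, target image) triplets so that an instruction-conditioned query retrieves the target. The sketch below illustrates that setup under stated assumptions: the encoders are stand-in random features, the fusion is a hypothetical linear projection of concatenated query-side embeddings, and the loss is a standard in-batch InfoNCE contrastive objective — the paper's actual architecture and loss are not given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_TXT, D_OUT, BATCH = 8, 8, 8, 4

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical encoder outputs: random features stand in for real
# image/text towers producing one embedding per triplet element.
query_images = l2_normalize(rng.normal(size=(BATCH, D_IMG)))
instructions = l2_normalize(rng.normal(size=(BATCH, D_TXT)))
target_images = l2_normalize(rng.normal(size=(BATCH, D_OUT)))

# Hypothetical fusion: linearly project the concatenated
# (query image, instruction) embeddings into the retrieval space.
W = rng.normal(size=(D_IMG + D_TXT, D_OUT))
fused_queries = l2_normalize(
    np.concatenate([query_images, instructions], axis=-1) @ W
)

def info_nce(queries, targets, temperature=0.07):
    """In-batch contrastive loss: each fused query should score
    highest on its own target among all targets in the batch."""
    logits = queries @ targets.T / temperature
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

loss = info_nce(fused_queries, target_images)

# Retrieval at inference: rank corpus images by similarity to the fused query.
ranks = np.argsort(-(fused_queries @ target_images.T), axis=1)
```

At inference time the corpus side only needs precomputed image embeddings; the instruction changes which images rank highest for the same reference image, which is the behavior the paper's open-ended instructions target.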

Abstract (translated)

Image retrieval, i.e., finding desired images given a reference image, inherently involves rich, multi-faceted search intents that are hard to capture with image-based measures alone. Recent work uses text instructions to let users express their search intents more freely. However, existing work mainly focuses on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions enable retrieving images with richer relations than visual similarity alone. To demonstrate this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens rests on a key new insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and these implicit relations can be made explicit by synthesizing instructions with large multimodal models (LMMs) and large language models (LLMs). Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable to or better than prior state-of-the-art (SOTA) methods on a variety of image retrieval benchmarks. Notably, it outperforms the previous SOTA on multiple benchmarks with a 50X smaller model size. Additional human analyses on an unseen 1.4M-image corpus further demonstrate the diversity of search intents that MagicLens supports.

URL

https://arxiv.org/abs/2403.19651

PDF

https://arxiv.org/pdf/2403.19651.pdf
