Many image retrieval studies use metric learning to train an image encoder. However, metric learning cannot handle differences in users' preferences, and requires data to train an image encoder. To overcome these limitations, we revisit relevance feedback, a classic technique for interactive retrieval systems, and propose an interactive CLIP-based image retrieval system with relevance feedback. Our retrieval system first performs an initial retrieval, collects each user's unique preferences through binary feedback, and returns images the user prefers. Even when users have various preferences, our retrieval system learns each user's preference through the feedback and adapts to it. Moreover, our retrieval system leverages CLIP's zero-shot transferability and achieves high accuracy without training. We empirically show that our retrieval system competes well with state-of-the-art metric learning in category-based image retrieval, despite not training image encoders specifically for each dataset. Furthermore, we set up two additional experimental settings where users have various preferences: one-label-based image retrieval and conditioned image retrieval. In both cases, our retrieval system effectively adapts to each user's preferences, resulting in improved accuracy compared to image retrieval without feedback. Overall, our work highlights the potential benefits of integrating CLIP with classic relevance feedback techniques to enhance image retrieval.
https://arxiv.org/abs/2404.16398
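To make the interactive loop above concrete, here is a minimal sketch of relevance feedback over precomputed, L2-normalized CLIP-style image embeddings. The Rocchio-style update rule and its weights are illustrative assumptions, not necessarily the update the paper uses.

```python
import numpy as np

def retrieve(query, gallery, k=10):
    """Return indices of the k gallery embeddings most similar to the query."""
    sims = gallery @ query          # cosine similarity (embeddings are L2-normalized)
    return np.argsort(-sims)[:k]

def rocchio_update(query, gallery, pos_idx, neg_idx, alpha=1.0, beta=0.8, gamma=0.3):
    """Move the query toward images marked relevant and away from non-relevant ones."""
    new_query = alpha * query
    if len(pos_idx) > 0:
        new_query += beta * gallery[pos_idx].mean(axis=0)
    if len(neg_idx) > 0:
        new_query -= gamma * gallery[neg_idx].mean(axis=0)
    return new_query / np.linalg.norm(new_query)

# Toy example with random stand-ins for CLIP embeddings.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 512))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query = gallery[0] + 0.1 * rng.normal(size=512)
query /= np.linalg.norm(query)

top = retrieve(query, gallery)
# Pretend the user marked the first result relevant and the rest non-relevant.
query = rocchio_update(query, gallery, pos_idx=top[:1], neg_idx=top[1:])
top = retrieve(query, gallery)      # second-round results, adapted to the feedback
```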
Fine-grained image retrieval (FGIR) aims to learn visual representations that distinguish visually similar objects while maintaining generalization. Existing methods propose to generate discriminative features, but rarely consider the particularity of the FGIR task itself. This paper presents a meticulous analysis leading to the proposal of practical guidelines for identifying subcategory-specific discrepancies and generating discriminative features to design effective FGIR models. These guidelines include emphasizing the object (G1), highlighting subcategory-specific discrepancies (G2), and employing an effective training strategy (G3). Following G1 and G2, we design a novel Dual Visual Filtering mechanism for the plain visual transformer, denoted as DVF, to capture subcategory-specific discrepancies. Specifically, the dual visual filtering mechanism comprises an object-oriented module and a semantic-oriented module. These components serve to magnify objects and identify discriminative regions, respectively. Following G3, we implement a discriminative model training strategy to improve the discriminability and generalization ability of DVF. Extensive analysis and ablation studies confirm the efficacy of our proposed guidelines. Without bells and whistles, the proposed DVF achieves state-of-the-art performance on three widely-used fine-grained datasets in closed-set and open-set settings.
https://arxiv.org/abs/2404.15771
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of (reference image, text, target image). These specific triplets are not as commonly available as simple image-text pairs, limiting the widespread use of CIR and its scalability. On the other hand, zero-shot CIR can be relatively easily trained with image-caption pairs without considering the image-to-image relation, but this approach tends to yield lower accuracy. We propose a new semi-supervised CIR approach in which we search for a reference image and its related target images in auxiliary data and train our large-language-model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., visual delta) between the two. VDG, equipped with fluent language knowledge and being model-agnostic, can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves on existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks.
https://arxiv.org/abs/2404.15516
The main objective of this paper is to address the mobile robot localization problem with Triplet Convolutional Neural Networks and test their robustness against changes in lighting conditions. We have used omnidirectional images from real indoor environments captured in dynamic conditions that have been converted to panoramic format. Two approaches are proposed to address localization by means of triplet neural networks. The first is hierarchical localization, which estimates the robot position in two stages: a coarse localization, which involves a room retrieval task, and a fine localization, which is addressed by means of image retrieval within the previously selected room. The second is global localization, which estimates the position of the robot inside the entire map in a single step. In addition, an exhaustive study of the influence of the loss function on the network learning process has been carried out. The experimental section proves that triplet neural networks are an efficient and robust tool to address the localization of mobile robots in indoor environments under real operating conditions.
https://arxiv.org/abs/2404.14117
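For reference, a minimal triplet-margin training step in PyTorch; the tiny embedding network, input size, and margin below are placeholders rather than the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

# Placeholder embedding network; the paper trains a CNN over panoramic images.
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 256, 256))
loss_fn = nn.TripletMarginLoss(margin=0.5)
optimizer = torch.optim.Adam(embed.parameters(), lr=1e-4)

# Dummy batch: anchor and positive come from nearby poses, negative from elsewhere.
anchor = torch.randn(8, 3, 64, 256)
positive = torch.randn(8, 3, 64, 256)
negative = torch.randn(8, 3, 64, 256)

loss = loss_fn(embed(anchor), embed(positive), embed(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```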
Visual Place Recognition (VPR) aims to estimate the location of an image by treating it as a retrieval problem. VPR uses a database of geo-tagged images and leverages deep neural networks to extract a global representation, called descriptor, from each image. While the training data for VPR models often originates from diverse, geographically scattered sources (geo-tagged images), the training process itself is typically assumed to be centralized. This research revisits the task of VPR through the lens of Federated Learning (FL), addressing several key challenges associated with this adaptation. VPR data inherently lacks well-defined classes, and models are typically trained using contrastive learning, which necessitates a data mining step on a centralized database. Additionally, client devices in federated systems can be highly heterogeneous in terms of their processing capabilities. The proposed FedVPR framework not only presents a novel approach for VPR but also introduces a new, challenging, and realistic task for FL research, paving the way for other image retrieval tasks in FL.
https://arxiv.org/abs/2404.13324
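As a point of reference for the federated setup, a sketch of generic FedAvg-style weight aggregation across clients; this illustrates the standard algorithm, not necessarily the aggregation rule FedVPR adopts.

```python
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client state_dicts, weighted by local dataset size."""
    total = float(sum(client_sizes))
    avg = {}
    for key in client_states[0]:
        avg[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return avg

# Toy example with two "clients" sharing the same tiny descriptor head.
head = torch.nn.Linear(128, 64)
client_a = {k: v.clone() for k, v in head.state_dict().items()}
client_b = {k: v + 0.01 for k, v in head.state_dict().items()}
global_state = fedavg([client_a, client_b], client_sizes=[1000, 3000])
head.load_state_dict(global_state)
```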
The Composed Image Retrieval (CIR) task aims to retrieve target images using a composed query consisting of a reference image and a modification text. Advanced methods often utilize contrastive learning as the optimization objective, which benefits from adequate positive and negative examples. However, triplets for CIR incur high manual annotation costs, resulting in limited positive examples. Furthermore, existing methods commonly use in-batch negative sampling, which reduces the number of negatives available to the model. To address the lack of positives, we propose a data generation method that leverages a multi-modal large language model to construct triplets for CIR. To introduce more negatives during fine-tuning, we design a two-stage fine-tuning framework for CIR, whose second stage introduces plenty of static representations of negatives to optimize the representation space rapidly. The above two improvements can be effectively stacked and are designed to be plug-and-play, easily applied to existing CIR models without changing their original architectures. Extensive experiments and ablation analysis demonstrate that our method effectively scales positives and negatives and achieves state-of-the-art results on both the FashionIQ and CIRR datasets. In addition, our method also performs well in zero-shot composed image retrieval, providing a new CIR solution for low-resource scenarios.
https://arxiv.org/abs/2404.11317
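A hedged sketch of the second-stage idea: an InfoNCE-style loss whose in-batch negatives are augmented with a bank of precomputed ("static") negative embeddings. The temperature and bank size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_with_static_negatives(query_emb, target_emb, static_neg_emb, tau=0.07):
    """query_emb, target_emb: (B, D); static_neg_emb: (M, D) precomputed, not updated."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    n = F.normalize(static_neg_emb, dim=-1).detach()     # static: no gradient flows back

    in_batch = q @ t.T                                    # (B, B) in-batch similarities
    extra = q @ n.T                                       # (B, M) static-negative similarities
    logits = torch.cat([in_batch, extra], dim=1) / tau    # positives sit on the diagonal
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage
B, M, D = 16, 4096, 256
loss = contrastive_with_static_negatives(torch.randn(B, D, requires_grad=True),
                                         torch.randn(B, D), torch.randn(M, D))
loss.backward()
```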
In the face of burgeoning image data, efficiently retrieving similar images poses a formidable challenge. Past research has focused on refining hash functions to distill images into compact indicators of resemblance. Initial attempts used shallow models, later evolving from Convolutional Neural Networks (CNNs) to attention-based and more advanced architectures. Recognizing the limitations of gradient-based models for spatial information embedding, we propose an innovative image hashing method, NeuroHash, which leverages Hyperdimensional Computing (HDC). HDC symbolically encodes spatial information into high-dimensional vectors, reshaping image representation. Our approach combines pre-trained large vision models with HDC operations, enabling spatially encoded feature representations. Hashing with locality-sensitive hashing (LSH) ensures swift and efficient image retrieval. Notably, our framework allows dynamic hash manipulation for conditional image retrieval. Our work introduces a transformative image hashing framework enabling spatially aware conditional retrieval. By seamlessly combining DNN-based neural and HDC-based symbolic models, our methodology breaks from traditional training, offering flexible and conditional image retrieval. Performance evaluations signify a paradigm shift in image-hashing methodologies, demonstrating enhanced retrieval accuracy.
https://arxiv.org/abs/2404.11025
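For context on the final hashing step, a minimal random-hyperplane locality-sensitive hashing example over feature vectors; this shows generic LSH bucketing, not NeuroHash's hyperdimensional encoding itself.

```python
import numpy as np

class RandomHyperplaneLSH:
    """Hash real-valued feature vectors into binary codes via random hyperplanes."""
    def __init__(self, dim, n_bits=32, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))

    def hash(self, x):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return (self.planes @ x > 0).astype(np.uint8)

lsh = RandomHyperplaneLSH(dim=512, n_bits=32)
rng = np.random.default_rng(1)
features = rng.normal(size=(1000, 512))
codes = np.array([lsh.hash(f) for f in features])        # (1000, 32) binary codes

query_code = lsh.hash(features[0] + 0.05 * rng.normal(size=512))
hamming = (codes != query_code).sum(axis=1)               # smaller = more similar
nearest = np.argsort(hamming)[:10]
```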
I introduce a novel associative memory model named Correlated Dense Associative Memory (CDAM), which integrates both auto- and hetero-association in a unified framework for continuous-valued memory patterns. Employing an arbitrary graph structure to semantically link memory patterns, CDAM is theoretically and numerically analysed, revealing four distinct dynamical modes: auto-association, narrow hetero-association, wide hetero-association, and neutral quiescence. Drawing inspiration from inhibitory modulation studies, I employ anti-Hebbian learning rules to control the range of hetero-association, extract multi-scale representations of community structures in graphs, and stabilise the recall of temporal sequences. Experimental demonstrations showcase CDAM's efficacy in handling real-world data, replicating a classical neuroscience experiment, performing image retrieval, and simulating arbitrary finite automata.
https://arxiv.org/abs/2404.07123
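As a rough reference point, here is a softmax-based dense associative memory retrieval step with an optional graph adjacency that redistributes recall mass between patterns (hetero-association). This is a generic sketch in the spirit of CDAM, not the paper's exact dynamics or learning rule.

```python
import numpy as np

def dam_update(state, patterns, beta=8.0, adjacency=None):
    """One retrieval step of a dense associative memory.

    patterns:  (P, D) stored continuous-valued patterns.
    adjacency: (P, P) optional graph linking patterns; None = pure auto-association.
    """
    sims = patterns @ state                       # similarity to each stored pattern
    weights = np.exp(beta * sims)
    weights /= weights.sum()
    if adjacency is not None:
        weights = adjacency.T @ weights           # redistribute mass along graph edges
        weights /= weights.sum()
    return patterns.T @ weights                   # new state: weighted mix of patterns

rng = np.random.default_rng(0)
patterns = rng.normal(size=(5, 64))
patterns /= np.linalg.norm(patterns, axis=1, keepdims=True)

# Auto-association: a noisy cue converges back toward its stored pattern.
state = patterns[2] + 0.3 * rng.normal(size=64)
for _ in range(5):
    state = dam_update(state, patterns)

# Hetero-association: a ring graph pushes recall toward the *next* pattern.
ring = np.roll(np.eye(5), shift=1, axis=1)
state = dam_update(patterns[2], patterns, adjacency=ring)
```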
Foundation models are a strong trend in deep learning and computer vision. These models serve as a base for applications, as they require little or no further fine-tuning by developers to integrate into their applications. Foundation models for zero-shot object segmentation such as Segment Anything (SAM) output segmentation masks from images without any further object information. When they are followed in a pipeline by an object identification model, they can perform object detection without training. Here, we focus on training such an object identification model. A crucial practical aspect for an object identification model is to be flexible in input size. As object identification is an image retrieval problem, a suitable method should handle multi-query multi-gallery situations without constraining the number of input images (e.g. by having fixed-size aggregation layers). The key solution to train such a model is the centroid triplet loss (CTL), which aggregates image features to their centroids. CTL yields high accuracy, avoids misleading training signals and keeps the model input size flexible. In our experiments, we establish a new state of the art on the ArmBench object identification task, which shows the general applicability of our model. We furthermore demonstrate an integrated unseen object detection pipeline on the challenging HOPE dataset, which requires fine-grained detection. There, our pipeline matches and surpasses related methods which have been trained on dataset-specific data.
https://arxiv.org/abs/2404.06277
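A hedged sketch of the centroid-triplet idea: pool each object's images into a centroid and apply a triplet-style margin between a query embedding, its own centroid, and the hardest other centroid. Batch construction, the distance measure, and the margin here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def centroid_triplet_loss(embeddings, labels, margin=0.3):
    """embeddings: (N, D) image features; labels: (N,) object identities."""
    classes = labels.unique()
    # Centroid per object identity, computed from all of its images in the batch.
    centroids = torch.stack([embeddings[labels == c].mean(dim=0) for c in classes])
    centroids = F.normalize(centroids, dim=-1)
    emb = F.normalize(embeddings, dim=-1)

    loss = emb.new_zeros(())
    for i in range(emb.size(0)):
        own = (classes == labels[i]).nonzero(as_tuple=True)[0].item()
        d_pos = 1.0 - emb[i] @ centroids[own]                     # distance to own centroid
        d_neg = 1.0 - emb[i] @ centroids.T                        # distances to all centroids
        d_neg = d_neg[torch.arange(len(classes)) != own].min()    # hardest other centroid
        loss = loss + F.relu(d_pos - d_neg + margin)
    return loss / emb.size(0)

# Toy batch: 4 objects, 4 views each.
feats = torch.randn(16, 128, requires_grad=True)
labels = torch.arange(4).repeat_interleave(4)
centroid_triplet_loss(feats, labels).backward()
```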
Deep quantization methods have shown high efficiency on large-scale image retrieval. However, current models heavily rely on ground-truth information, hindering the application of quantization in label-hungry scenarios. A more realistic demand is to learn from inexhaustible uploaded images that are associated with informal tags provided by amateur users. Though such sketchy tags do not obviously reveal the labels, they actually contain useful semantic information for supervising deep quantization. To this end, we propose Weakly-Supervised Deep Hyperspherical Quantization (WSDHQ), which is the first work to learn deep quantization from weakly tagged images. Specifically, 1) we use word embeddings to represent the tags and enhance their semantic information based on a tag correlation graph. 2) To better preserve semantic information in quantization codes and reduce quantization error, we jointly learn semantics-preserving embeddings and a supervised quantizer on the hypersphere by employing a well-designed fusion layer and tailor-made loss functions. Extensive experiments show that WSDHQ achieves state-of-the-art performance on weakly-supervised compact coding. Code is available at this https URL.
https://arxiv.org/abs/2404.04998
The chain-of-thought technique has been well received in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also as a reasoning aggregation graph at each step, to cope with the multiple aspects of thinking that are overlooked in single-step reasoning. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results in several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.
https://arxiv.org/abs/2404.04538
How important is it for training and evaluation sets to not have class overlap in image retrieval? We revisit Google Landmarks v2 clean, the most popular training set, by identifying and removing class overlap with Revisited Oxford and Paris [34], the most popular evaluation set. By comparing the original and the new RGLDv2-clean on a benchmark of reproduced state-of-the-art methods, our findings are striking. Not only is there a dramatic drop in performance, but it is inconsistent across methods, changing the ranking. What does it take to focus on objects of interest and ignore background clutter when indexing? Do we need to train an object detector and the representation separately? Do we need location supervision? We introduce Single-stage Detect-to-Retrieve (CiDeR), an end-to-end, single-stage pipeline to detect objects of interest and extract a global image representation. We outperform the previous state-of-the-art on both existing training sets and the new RGLDv2-clean. Our dataset is available at this https URL.
https://arxiv.org/abs/2404.01524
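The de-duplication step itself is conceptually simple; here is a toy sketch of dropping training images whose class (landmark) also appears in the evaluation set. The identifiers and the exact matching criterion are illustrative assumptions, not the paper's procedure.

```python
# Hypothetical mapping from training image to landmark label.
train_images = {
    "img_001.jpg": "oxford_radcliffe_camera",
    "img_002.jpg": "paris_eiffel_tower",
    "img_003.jpg": "some_other_landmark",
}
# Landmarks that also occur in the evaluation set (assumed known).
eval_landmarks = {"oxford_radcliffe_camera", "paris_eiffel_tower"}

# Keep only training images whose class does not overlap with evaluation classes.
clean_train = {
    path: landmark
    for path, landmark in train_images.items()
    if landmark not in eval_landmarks
}
print(clean_train)   # only "some_other_landmark" survives
```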
In Visual Place Recognition (VPR) the pose of a query image is estimated by comparing the image to a map of reference images with known reference poses. As is typical for image retrieval problems, a feature extractor maps the query and reference images to a feature space, where a nearest neighbor search is then performed. However, until recently little attention had been given to quantifying the confidence that a retrieved reference image is a correct match. Highly certain but incorrect retrieval can lead to catastrophic failure of VPR-based localization pipelines. This work compares for the first time the main approaches for estimating the image-matching uncertainty, including the traditional retrieval-based uncertainty estimation, more recent data-driven aleatoric uncertainty estimation, and the compute-intensive geometric verification. We further formulate a simple baseline method, "SUE", which unlike the other methods considers the freely available poses of the reference images in the map. Our experiments reveal that a simple L2 distance between the query and reference descriptors is already a better estimate of image-matching uncertainty than current data-driven approaches. SUE outperforms the other efficient uncertainty estimation methods, and its uncertainty estimates complement the computationally expensive geometric verification approach. Future work on uncertainty estimation in VPR should consider the baselines discussed in this work.
https://arxiv.org/abs/2404.00546
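A hedged sketch contrasting the two cheap uncertainty signals mentioned in the abstract: the query-to-nearest-reference descriptor distance, and a SUE-like score derived from the spatial spread of the top-k reference poses. The exact SUE formulation in the paper may differ.

```python
import numpy as np

def l2_uncertainty(query_desc, ref_descs):
    """Baseline: distance to the closest reference descriptor (larger = less certain)."""
    dists = np.linalg.norm(ref_descs - query_desc, axis=1)
    return dists.min()

def pose_spread_uncertainty(query_desc, ref_descs, ref_poses, k=10):
    """SUE-like score (assumption): if the top-k matches are scattered across the map,
    the retrieval is likely ambiguous, so report the spatial spread of their poses."""
    dists = np.linalg.norm(ref_descs - query_desc, axis=1)
    topk = np.argsort(dists)[:k]
    poses = ref_poses[topk]                       # (k, 2) x/y positions of top matches
    return np.linalg.norm(poses - poses.mean(axis=0), axis=1).mean()

rng = np.random.default_rng(0)
ref_descs = rng.normal(size=(500, 256))
ref_poses = rng.uniform(0, 100, size=(500, 2))
query = ref_descs[42] + 0.1 * rng.normal(size=256)
print(l2_uncertainty(query, ref_descs),
      pose_spread_uncertainty(query, ref_descs, ref_poses))
```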
Open-vocabulary vision-language models (VLMs) like CLIP, trained using contrastive loss, have emerged as a promising new paradigm for text-to-image retrieval. However, do VLMs understand compound nouns (CNs) (e.g., lab coat) as well as they understand nouns (e.g., lab)? We curate Compun, a novel benchmark with 400 unique and commonly used CNs, to evaluate the effectiveness of VLMs in interpreting CNs. The Compun benchmark challenges a VLM for text-to-image retrieval where, given a text prompt with a CN, the task is to select the correct image that shows the CN from among a pair of distractor images that show the constituent nouns that make up the CN. Next, we perform an in-depth analysis to highlight CLIP's limited understanding of certain types of CNs. Finally, we present an alternative framework that moves beyond the hand-written templates for text prompts widely used by CLIP-like models. We employ a Large Language Model to generate multiple diverse captions that include the CN as an object in the scene described by the caption. Our proposed method improves CLIP's CN understanding by 8.25% on Compun. Code and benchmark are available at: this https URL
https://arxiv.org/abs/2404.00419
Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent work leverages text instructions to allow users to more freely express their search intents. However, existing work primarily focuses on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can bring those implicit relations explicit by synthesizing instructions via large multimodal models (LMMs) and large language models (LLMs). Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves comparable or better results on eight benchmarks of various image retrieval tasks than prior state-of-the-art (SOTA) methods. Remarkably, on multiple benchmarks it outperforms the previous SOTA despite a 50X smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens.
https://arxiv.org/abs/2403.19651
State-of-the-art (SOTA) hierarchical localisation pipelines (HLoc) rely on image retrieval (IR) techniques to establish 2D-3D correspondences by selecting the $k$ most similar images from a reference image database for a given query image. Although higher values of $k$ enhance localisation robustness, the computational cost for feature matching increases linearly with $k$. In this paper, we observe that queries that are the most similar to images in the database result in a higher proportion of feature matches and, thus, more accurate positioning. Thus, a small number of images is sufficient for queries that are very similar to images in the reference database. We then propose a novel approach, AIR-HLoc, which divides query images into different localisation difficulty levels based on their similarity to the reference image database. We consider an image with high similarity to the reference images as an easy query and an image with low similarity as a hard query. Easy queries show a limited improvement in accuracy when increasing $k$. Conversely, higher values of $k$ significantly improve accuracy for hard queries. Given the limited improvement in accuracy for easy queries and the significant improvement for hard queries, we adapt the value of $k$ to the query's difficulty level. Therefore, AIR-HLoc optimizes processing time by adaptively assigning different values of $k$ based on the similarity between the query and reference images, without losing accuracy. Our extensive experiments on the Cambridge Landmarks, 7Scenes, and Aachen Day-Night-v1.1 datasets demonstrate our algorithm's efficacy, reducing computational overhead by 30\%, 26\%, and 11\%, respectively, while maintaining SOTA accuracy compared to HLoc with a fixed $k$.
https://arxiv.org/abs/2403.18281
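The adaptive-$k$ idea can be stated in a few lines; the thresholds and similarity measure below are illustrative assumptions rather than the paper's calibrated values.

```python
import numpy as np

def adaptive_k(query_desc, ref_descs, k_easy=5, k_medium=10, k_hard=20,
               easy_thresh=0.8, hard_thresh=0.6):
    """Pick how many reference images to pass to feature matching, based on how
    similar the query is to its best match in the reference database."""
    sims = ref_descs @ query_desc            # cosine similarities (normalized descriptors)
    best = sims.max()
    if best >= easy_thresh:                  # very similar to the database: easy query
        k = k_easy
    elif best >= hard_thresh:
        k = k_medium
    else:                                    # dissimilar to everything: hard query
        k = k_hard
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
refs = rng.normal(size=(2000, 512))
refs /= np.linalg.norm(refs, axis=1, keepdims=True)
q = refs[7] * 0.9 + 0.1 * rng.normal(size=512)
q /= np.linalg.norm(q)
candidates = adaptive_k(q, refs)             # fewer candidates for an easy query
```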
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description, without training on triplet datasets. Previous works generate pseudo-word tokens by projecting the reference image features to the text embedding space. However, they focus on the global visual representation, ignoring the representation of detailed attributes, e.g., color, object number and layout. To address this challenge, we propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs). KEDs implicitly models the attributes of the reference images by incorporating a database. The database enriches the pseudo-word tokens by providing relevant images and captions, emphasizing shared attribute information in various aspects. In this way, KEDs recognizes the reference image from diverse perspectives. Moreover, KEDs adopts an extra stream that aligns pseudo-word tokens with textual concepts, leveraging pseudo-triplets mined from image-text pairs. The pseudo-word tokens generated in this stream are explicitly aligned with fine-grained semantics in the text embedding space. Extensive experiments on widely used benchmarks, i.e., ImageNet-R, COCO object, Fashion-IQ and CIRR, show that KEDs outperforms previous zero-shot composed image retrieval methods.
https://arxiv.org/abs/2403.16005
The burgeoning integration of 3D medical imaging into healthcare has led to a substantial increase in the workload of medical professionals. To assist clinicians in their diagnostic processes and alleviate their workload, the development of a robust system for retrieving similar case studies presents a viable solution. While the concept holds great promise, the field of 3D medical text-image retrieval is currently limited by the absence of robust evaluation benchmarks and curated datasets. To remedy this, our study presents a groundbreaking dataset, BIMCV-R (to be released upon acceptance), which includes an extensive collection of 8,069 3D CT volumes, encompassing over 2 million slices, paired with their respective radiological reports. Expanding upon the foundational work of our dataset, we craft a retrieval strategy, MedFinder. This approach employs a dual-stream network architecture, harnessing the potential of large language models to advance the field of medical image retrieval beyond existing text-image retrieval solutions. It marks our preliminary step towards developing a system capable of facilitating text-to-image, image-to-text, and keyword-based retrieval tasks.
https://arxiv.org/abs/2403.15992
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of its text input. The text input is restricted to 77 tokens, and an empirical study shows that the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and stays aligned with the CLIP latent space, making it readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities: (1) a knowledge-preserved stretching of the positional embedding and (2) a primary component matching of CLIP features. Leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and by 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
https://arxiv.org/abs/2403.15378
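A hedged sketch of what a "knowledge-preserved stretching" of positional embeddings could look like: keep the first few well-trained positions fixed and interpolate the rest out to a longer context. The split point, target length, and interpolation mode below are assumptions, not Long-CLIP's exact recipe.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb, new_len, keep=20):
    """pos_emb: (old_len, D) text positional embeddings (old_len is 77 for CLIP).

    Keep the first `keep` positions untouched and interpolate the remaining ones
    to fill the extended context (the split point is an illustrative assumption).
    """
    head = pos_emb[:keep]
    tail = pos_emb[keep:]                                    # (old_len - keep, D)
    # Interpolate the tail along the sequence axis to reach the new length.
    tail = tail.T.unsqueeze(0)                               # (1, D, old_len - keep)
    tail = F.interpolate(tail, size=new_len - keep, mode="linear", align_corners=True)
    tail = tail.squeeze(0).T                                 # (new_len - keep, D)
    return torch.cat([head, tail], dim=0)                    # (new_len, D)

pos77 = torch.randn(77, 512)                                 # stand-in for CLIP's table
pos248 = stretch_positional_embedding(pos77, new_len=248)
print(pos248.shape)                                          # torch.Size([248, 512])
```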
Image generators are gaining a vast amount of popularity and have rapidly changed how digital content is created. With the latest AI technology, millions of high-quality images are being generated by the public, which constantly motivates the research community to push the limits of generative models to create more complex and realistic images. This paper focuses on Cross-Domain Image Retrieval (CDIR), which can be used as an additional tool to inspect collections of generated images by determining the level of similarity between images in a dataset. An ideal retrieval system would be able to generalize to unseen complex images from multiple domains (e.g., photos, drawings and paintings). To address this goal, we propose a novel caption-matching approach that leverages multimodal language-vision architectures pre-trained on large datasets. The method is tested on the DomainNet and Office-Home datasets and consistently achieves state-of-the-art performance over the latest approaches in the literature for cross-domain image retrieval. To verify its effectiveness with AI-generated images, the method was also put to the test on a database composed of samples collected from Midjourney, a widely used generative platform for content creation.
https://arxiv.org/abs/2403.15152
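A minimal sketch of the caption-matching retrieval logic: embed a caption generated for the query image and rank gallery images by text-to-image similarity. The embeddings below are random stand-ins; in practice they would come from a captioning model plus a CLIP-style text and image encoder, which are not specified here.

```python
import numpy as np

def caption_matching_retrieval(query_caption_emb, gallery_image_embs, k=5):
    """Rank gallery images by similarity to the text embedding of the query's caption."""
    sims = gallery_image_embs @ query_caption_emb
    return np.argsort(-sims)[:k]

# Toy stand-ins for a caption embedding (query side) and image embeddings (gallery side).
rng = np.random.default_rng(0)
gallery_image_embs = rng.normal(size=(300, 512))
gallery_image_embs /= np.linalg.norm(gallery_image_embs, axis=1, keepdims=True)
query_caption_emb = gallery_image_embs[10] + 0.2 * rng.normal(size=512)
query_caption_emb /= np.linalg.norm(query_caption_emb)

top = caption_matching_retrieval(query_caption_emb, gallery_image_embs)
```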