Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still suffer from insufficient semantic information in both identifier construction and the generation process. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash semantic understanding capabilities in the generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy that enables fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average gain of 8.65 points in Recall@1 for text-to-image retrieval.
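A minimal sketch of the two ingredients above, under loose assumptions (the helper `make_sid` and the fusion weight `alpha` are illustrative, not the paper's code): a structured identifier is assembled from coarse-to-fine natural-language cues, and generation scores are fused with fine-grained embedding similarity for verification.

```python
import torch
import torch.nn.functional as F

def make_sid(category: str, attributes: list[str], caption_keywords: list[str]) -> str:
    """Assemble a structured natural-language identifier from coarse-to-fine cues."""
    return " ".join([category] + attributes + caption_keywords)

def generative_semantic_verification(query_emb: torch.Tensor,
                                     candidate_embs: torch.Tensor,
                                     gen_scores: torch.Tensor,
                                     alpha: float = 0.5) -> torch.Tensor:
    """Fuse identifier-generation likelihoods with embedding similarity to re-score candidates."""
    sim = F.cosine_similarity(query_emb.unsqueeze(0), candidate_embs, dim=-1)
    return alpha * gen_scores + (1.0 - alpha) * sim

# Toy usage: one query embedding, five candidate targets with generation scores.
sid = make_sid("dog", ["brown", "small"], ["running", "park"])
scores = generative_semantic_verification(torch.randn(64), torch.randn(5, 64), torch.rand(5))
```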
https://arxiv.org/abs/2504.13172
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., ``ride'' can be depicted as ``race'' and ``sit on'', from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
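The post-generation matching stage can be sketched as follows (shapes and the top-k choice are assumptions): each subject-object pair embedding is scored against every generated relation embedding by cosine similarity, and the top-scoring relation words are assigned to that pair.

```python
import torch
import torch.nn.functional as F

def match_relations(pair_embs: torch.Tensor, rel_embs: torch.Tensor, topk: int = 3):
    """pair_embs: (P, D) subject-object pair embeddings; rel_embs: (R, D) generated relation embeddings.
    Returns, for each pair, the scores and indices of its top-k relations by cosine similarity."""
    sim = F.cosine_similarity(pair_embs.unsqueeze(1), rel_embs.unsqueeze(0), dim=-1)  # (P, R)
    scores, idx = sim.topk(topk, dim=1)
    return scores, idx

# Toy usage: 4 pairs, 10 generated relation embeddings, 64-dim features.
scores, idx = match_relations(torch.randn(4, 64), torch.randn(10, 64))
```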
https://arxiv.org/abs/2504.12100
The standard approach to visual place recognition is to use global image descriptors to retrieve the most similar database images for a given query image. The results can then be further improved with re-ranking methods that re-order the top-scoring images. However, existing methods focus on re-ranking based on the same image descriptors that were used for the initial retrieval, which we argue provides limited additional signal. In this work we propose Generalized Contextual Similarity Aggregation (GCSA), a graph neural network-based re-ranking method that, in addition to the visual descriptors, can leverage other types of available side information. This can, for example, be other sensor data (such as the signal strength of nearby WiFi or Bluetooth endpoints) or geometric properties such as camera poses of database images. In many applications this information is already present or can be acquired with little effort. Our architecture leverages the concept of affinity vectors to allow for a shared encoding of the heterogeneous multi-modal input. Two large-scale datasets, covering both outdoor and indoor localization scenarios, are used for training and evaluation. In experiments we show significant improvements not only on image retrieval metrics, but also on the downstream visual localization task.
https://arxiv.org/abs/2504.11134
Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information. Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in intention interpretation: they tend to disproportionately emphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such an imbalanced representation often fails to accurately capture and reflect the user's actual search intent in the retrieval results. To address this challenge, we propose TMCIR, a novel framework that advances composed image retrieval through two key innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP encoders contrastively using intent-reflecting pseudo-target images, synthesized from reference images and textual descriptions via a diffusion model. This step enhances the text encoder's ability to capture the nuanced intents expressed in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune all encoders contrastively by comparing adaptive token-fusion features with the target image. This mechanism dynamically balances visual and textual representations within the contrastive learning pipeline, optimizing the composed feature for retrieval. Extensive experiments on the Fashion-IQ and CIRR datasets demonstrate that TMCIR significantly outperforms state-of-the-art methods, particularly in capturing nuanced user intent.
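The adaptive token fusion step could look roughly like the gated mixing below; this is a sketch under the assumption that image and text tokens have already been projected to the same shape, and it does not reproduce the paper's exact fusion module.

```python
import torch
import torch.nn as nn

class AdaptiveTokenFusion(nn.Module):
    """Gate each aligned image/text token pair and mix them into one composed token."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens, txt_tokens: (B, N, D), already projected to a shared space.
        g = self.gate(torch.cat([img_tokens, txt_tokens], dim=-1))  # per-token, per-channel weight
        return g * img_tokens + (1.0 - g) * txt_tokens

# Toy usage: batch of 2, 16 tokens, 512-dim.
fusion = AdaptiveTokenFusion(dim=512)
composed = fusion(torch.randn(2, 16, 512), torch.randn(2, 16, 512))
```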
https://arxiv.org/abs/2504.10995
Visual Place Recognition (VPR) aims to predict the location of a query image by referencing a database of geotagged images. In VPR, a few discriminative local regions in an image often carry most of the useful signal, while mundane background regions contribute little or even cause perceptual aliasing because such regions easily overlap across different places. However, existing methods lack precise modeling and full exploitation of these discriminative regions. In this paper, we propose the Focus on Local (FoL) approach, which improves both image retrieval and re-ranking in VPR by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correspondence supervision. First, we design two losses, the Extraction-Aggregation Spatial Alignment Loss (SAL) and the Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly supervised local feature training strategy based on pseudo-correspondences obtained by aggregating global features, alleviating the lack of ground-truth local correspondences in the VPR task. Third, we propose a re-ranking pipeline that is both efficient and precise, guided by the mined discriminative regions. Finally, experimental results show that FoL achieves state-of-the-art performance on multiple VPR benchmarks in both the image retrieval and re-ranking stages, and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. Code and models are available at this https URL
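As an illustration only (the abstract does not give the loss definitions, so the margin form below is an assumption), a foreground-background contrast term in the spirit of CEL might push activations inside discriminative regions above background activations by a margin:

```python
import torch
import torch.nn.functional as F

def contrast_enhancement_loss(act_map: torch.Tensor, fg_mask: torch.Tensor, margin: float = 0.5):
    """act_map: (H, W) activation map; fg_mask: (H, W) binary {0,1} float mask of discriminative regions.
    Penalizes cases where the mean foreground activation does not exceed the background by `margin`."""
    fg = (act_map * fg_mask).sum() / fg_mask.sum().clamp(min=1)
    bg = (act_map * (1.0 - fg_mask)).sum() / (1.0 - fg_mask).sum().clamp(min=1)
    return F.relu(margin - (fg - bg))
```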
https://arxiv.org/abs/2504.09881
Existing Masked Image Modeling methods apply fixed mask patterns to guide self-supervised training. Because those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern limits the visual cues the model can learn. This paper introduces an evolved hierarchical masking method to pursue general visual cue modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchical structure, which is then used to generate masks accordingly. The accuracy of the hierarchy is on par with the capability of the model being trained, leading to mask patterns that evolve across training stages. Initially, the generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues, reinforcing the learning of more complicated object semantics and contexts. Our method requires no extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks, including partial-duplicate image retrieval, which relies on low-level details, as well as image classification and semantic segmentation, which require semantic parsing capability. Experimental results demonstrate that the method substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1\% on ImageNet-1K classification and 1.4\% on ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs: it narrows the gap with large-scale pre-training on semantically demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.
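A rough sketch of the idea, with all specifics (k-means clustering, the cluster-count schedule, the mask ratio) being assumptions rather than the paper's recipe: patch features from the model being trained are grouped, whole groups are masked, and the groups become coarser as training progresses.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_mask(patch_feats: np.ndarray, epoch: int, total_epochs: int, ratio: float = 0.6):
    """patch_feats: (N, D) features from the in-training encoder for one image.
    Early epochs use many small clusters (texture-level cues); later epochs use fewer,
    larger clusters (object-level cues). Whole clusters are masked until `ratio` is reached."""
    n = patch_feats.shape[0]
    k = max(2, int(n * (1.0 - epoch / max(total_epochs, 1)) * 0.25))  # fewer clusters later
    labels = KMeans(n_clusters=k, n_init=4, random_state=0).fit_predict(patch_feats)
    masked = np.zeros(n, dtype=bool)
    for c in np.random.permutation(k):
        if masked.mean() >= ratio:
            break
        masked |= labels == c
    return masked

# Toy usage: 196 patches with 256-dim features, epoch 50 of 100.
mask = hierarchical_mask(np.random.randn(196, 256), epoch=50, total_epochs=100)
```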
https://arxiv.org/abs/2504.09155
Recent advancements in computer vision have highlighted the scalability of Vision Transformers (ViTs) across various tasks, yet challenges remain in balancing adaptability, computational efficiency, and the ability to model higher-order relationships. Vision Graph Neural Networks (ViGs) offer an alternative by leveraging graph-based methodologies but are hindered by the computational bottlenecks of clustering algorithms used for edge generation. To address these issues, we propose the Hypergraph Vision Transformer (HgVT), which incorporates a hierarchical bipartite hypergraph structure into the vision transformer framework to capture higher-order semantic relationships while maintaining computational efficiency. HgVT leverages population and diversity regularization for dynamic hypergraph construction without clustering, and expert edge pooling to enhance semantic extraction and facilitate graph-based image retrieval. Empirical results demonstrate that HgVT achieves strong performance on image classification and retrieval, positioning it as an efficient framework for semantic-based vision tasks.
https://arxiv.org/abs/2504.08710
Visual understanding is inherently contextual -- what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus either on the person, such as their clothing, or on the type of flowers, depending on the context of interest. Yet most existing image encoding paradigms represent an image as a fixed, generic feature vector, overlooking the potential need to prioritize different visual information for different downstream use cases. In this work, we introduce FocalLens, a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed flexibly through natural language. We leverage vision instruction tuning data and contrastively finetune a pretrained vision encoder to take natural language instructions as additional inputs for producing conditional image representations. Extensive experiments validate that conditional image representations from FocalLens better emphasize the visual features of interest compared to the generic features produced by standard vision encoders like CLIP. In addition, we show FocalLens further leads to performance improvements on a range of downstream tasks including image-image retrieval, image classification, and image-text retrieval, with average gains of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.
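The contrastive finetuning signal can be sketched as a standard symmetric InfoNCE between instruction-conditioned image embeddings and the embeddings of the intended targets; this is the generic form of the objective, not the paper's exact loss or encoder interface.

```python
import torch
import torch.nn.functional as F

def info_nce(cond_img_emb: torch.Tensor, target_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of matched (conditioned image, target) embedding pairs."""
    img = F.normalize(cond_img_emb, dim=-1)
    tgt = F.normalize(target_emb, dim=-1)
    logits = img @ tgt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)  # diagonal pairs are positives
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Toy usage: batch of 8 pairs with 512-dim embeddings.
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```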
https://arxiv.org/abs/2504.08368
Fine-grained text-to-image retrieval aims to retrieve a fine-grained target image given a text query. Existing methods typically assume that each training image is accurately depicted by its textual descriptions. However, textual descriptions can be ambiguous and fail to depict discriminative visual details in images, leading to inaccurate representation learning. To alleviate the effects of text ambiguity, we propose a Multi-Modal Reference learning framework to learn robust representations. We first propose a multi-modal reference construction module to aggregate all visual and textual details of the same object into a comprehensive multi-modal reference. The multi-modal reference hence facilitates the subsequent representation learning and retrieval similarity computation. Specifically, a reference-guided representation learning module is proposed to use multi-modal references to learn more accurate visual and textual representations. Additionally, we introduce a reference-based refinement method that employs the object references to compute a reference-based similarity that refines the initial retrieval results. Extensive experiments are conducted on five fine-grained text-to-image retrieval datasets covering different text-to-image retrieval tasks. The proposed method achieves superior performance over state-of-the-art methods. For instance, on the text-to-person image retrieval dataset RSTPReid, our method achieves a Rank-1 accuracy of 56.2\%, surpassing the recent CFine by 5.6\%.
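A minimal sketch of the reference-based refinement idea, with the blending weight and tensor shapes as assumptions: the initial text-image similarity is blended with the query's similarity to each image's multi-modal reference embedding.

```python
import torch
import torch.nn.functional as F

def reference_refined_scores(text_emb: torch.Tensor,
                             image_embs: torch.Tensor,
                             ref_embs: torch.Tensor,
                             beta: float = 0.5) -> torch.Tensor:
    """text_emb: (D,) query embedding; image_embs, ref_embs: (N, D) gallery and reference embeddings.
    Blends the direct text-image similarity with the text-reference similarity per gallery item."""
    direct = F.cosine_similarity(text_emb.unsqueeze(0), image_embs, dim=-1)
    via_ref = F.cosine_similarity(text_emb.unsqueeze(0), ref_embs, dim=-1)
    return (1.0 - beta) * direct + beta * via_ref

# Toy usage: 100 gallery images, 512-dim embeddings.
scores = reference_refined_scores(torch.randn(512), torch.randn(100, 512), torch.randn(100, 512))
```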
https://arxiv.org/abs/2504.07718
High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a \textbf{divide-then-aggregate} strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code is available at this https URL
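Since the pipeline is training-free, it can be sketched around any captioner: below, `caption_fn` and `embed_fn` are placeholders for an MLLM call and a sentence embedder, and the grid size and similarity threshold are arbitrary choices, not the paper's settings.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def divide_then_aggregate(image, caption_fn, embed_fn, grid=(2, 2), sim_threshold=0.25):
    """Caption the full image and each spatial patch, drop patch captions that drift semantically
    from the global caption (semantic-level filter), then aggregate what remains.
    `image` is assumed to be a PIL.Image."""
    w, h = image.size
    global_cap = caption_fn(image)
    g_emb = embed_fn(global_cap)
    kept = []
    for i in range(grid[0]):        # rows
        for j in range(grid[1]):    # columns
            box = (j * w // grid[1], i * h // grid[0],
                   (j + 1) * w // grid[1], (i + 1) * h // grid[0])
            cap = caption_fn(image.crop(box))
            if cosine(embed_fn(cap), g_emb) >= sim_threshold:
                kept.append(cap)
    return global_cap + " " + " ".join(kept)
```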
https://arxiv.org/abs/2504.06666
Image retrieval remains a challenging task due to the complex interaction between human visual perception, memory, and computational processes. Current image search engines often struggle to efficiently retrieve images based on natural language descriptions, as they rely on time-consuming preprocessing, tagging, and machine learning pipelines. This paper introduces the Human-Oriented Retrieval Search Engine for Images (HORSE), a novel approach that leverages neuro-symbolic indexing to improve image retrieval by focusing on human-oriented indexing. By integrating cognitive science insights with advanced computational techniques, HORSE enhances the retrieval process, making it more aligned with how humans perceive, store, and recall visual information. The neuro-symbolic framework combines the strengths of neural networks and symbolic reasoning, mitigating their individual limitations. The proposed system optimizes image retrieval, offering a more intuitive and efficient solution for users. We discuss the design and implementation of HORSE, highlight its potential applications in fields such as design error detection and knowledge management, and suggest future directions for research to further refine the system's metrics and capabilities.
https://arxiv.org/abs/2504.10502
Composed Image Retrieval (CIR) seeks to find a target image using a multi-modal query, which combines an image with modification text to pinpoint the target. While recent CIR methods have shown promise, they mainly focus on exploring relationships between the query pairs (image and text) through data augmentation or model design. These methods often assume perfect alignment between queries and target images, an idealized scenario rarely encountered in practice. In reality, pairs are often partially or completely mismatched due to issues like inaccurate modification texts, low-quality target images, and annotation errors. Ignoring these mismatches leads to numerous False Positive Pairs (FPPs), denoted as noise pairs in the dataset, causing the model to overfit and ultimately reducing its performance. To address this problem, we propose Noise-aware Contrastive Learning for CIR (NCL-CIR), comprising two key components: the Weight Compensation Block (WCB) and the Noise-pair Filter Block (NFB). The WCB, coupled with diverse weight maps, ensures more stable token representations of multi-modal queries and target images. Meanwhile, the NFB, in conjunction with a Gaussian Mixture Model (GMM), predicts noise pairs by evaluating loss distributions and generates soft labels accordingly, allowing for the design of a soft-label-based Noise Contrastive Estimation (NCE) loss function. Consequently, the overall architecture helps to mitigate the influence of mismatched and partially matched samples, and experimental results demonstrate that NCL-CIR achieves exceptional performance on the benchmark datasets.
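The noise-pair filtering step can be sketched as fitting a two-component GMM to per-pair losses and using the posterior of the low-loss component as a soft "clean" probability; how that soft label enters the NCE loss is left out here, and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def soft_clean_probabilities(per_pair_losses: np.ndarray) -> np.ndarray:
    """Fit a 2-component GMM to per-pair losses; the lower-mean component is treated as clean.
    Returns, for each pair, the probability of belonging to the clean component (a soft label)."""
    losses = per_pair_losses.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))
    return gmm.predict_proba(losses)[:, clean_comp]

# Toy usage: mostly low-loss (clean) pairs plus a high-loss (noisy) tail.
losses = np.concatenate([np.random.rand(900), 3.0 + np.random.rand(100)])
soft_labels = soft_clean_probabilities(losses)
```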
https://arxiv.org/abs/2504.04339
The rapid expansion of remote sensing image archives demands the development of strong and efficient techniques for content-based image retrieval (RS-CBIR). This paper presents REJEPA (Retrieval with Joint-Embedding Predictive Architecture), an innovative self-supervised framework designed for unimodal RS-CBIR. REJEPA utilises spatially distributed context token encoding to forecast abstract representations of target tokens, effectively capturing high-level semantic features and eliminating unnecessary pixel-level details. In contrast to generative methods that focus on pixel reconstruction or contrastive techniques that depend on negative pairs, REJEPA functions within feature space, achieving a reduction in computational complexity of 40-60% when compared to pixel-reconstruction baselines like Masked Autoencoders (MAE). To guarantee strong and varied representations, REJEPA incorporates Variance-Invariance-Covariance Regularisation (VICReg), which prevents encoder collapse by promoting feature diversity and reducing redundancy. The method demonstrates an estimated enhancement in retrieval accuracy of 5.1% on BEN-14K (S1), 7.4% on BEN-14K (S2), 6.0% on FMoW-RGB, and 10.1% on FMoW-Sentinel compared to prominent SSL techniques, including CSMAE-SESD, Mask-VLM, SatMAE, ScaleMAE, and SatMAE++, on extensive RS benchmarks BEN-14K (multispectral and SAR data), FMoW-RGB and FMoW-Sentinel. Through effective generalisation across sensor modalities, REJEPA establishes itself as a sensor-agnostic benchmark for efficient, scalable, and precise RS-CBIR, addressing challenges like varying resolutions, high object density, and complex backgrounds with computational efficiency.
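For reference, the variance and covariance terms of VICReg (the invariance/prediction term is omitted) look like the following; this is the standard formulation of the regularizer, not REJEPA-specific code.

```python
import torch

def vicreg_var_cov_terms(z: torch.Tensor, eps: float = 1e-4):
    """z: (N, D) batch of embeddings. Returns the variance term (keeps each dimension's std
    above 1) and the covariance term (penalizes off-diagonal correlations between dimensions)."""
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(1.0 - std).mean()
    n, d = z.shape
    cov = (z.t() @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss, cov_loss

# Toy usage: batch of 256 embeddings with 128 dimensions.
var_loss, cov_loss = vicreg_var_cov_terms(torch.randn(256, 128))
```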
https://arxiv.org/abs/2504.03169
The goal of this paper is to enhance pretrained Vision Transformer (ViT) models for focus-oriented image retrieval with visual prompting. In real-world image retrieval scenarios, both query and database images often exhibit complexity, with multiple objects and intricate backgrounds. Users often want to retrieve images containing a specific object, a setting we define as the Focus-Oriented Image Retrieval (FOIR) task. While a standard image encoder can be employed to extract image features for similarity matching, it may not perform optimally in the multi-object FOIR task, because each image is represented by a single global feature vector. To overcome this, a prompt-based image retrieval solution is required. We propose an approach called Prompt-guided attention Head Selection (PHS) that leverages the head-wise potential of the multi-head attention mechanism in ViT in a promptable manner. PHS selects specific attention heads by matching their attention maps with the user's visual prompt, such as a point, box, or segmentation. This empowers the model to focus on a specific object of interest while preserving the surrounding visual context. Notably, PHS does not require model re-training and avoids any image alteration. Experimental results show that PHS substantially improves performance on multiple datasets, offering a practical and training-free solution to enhance model performance in the FOIR task.
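A compact sketch of prompt-guided head selection under assumed inputs: per-head CLS-to-patch attention maps and a binary patch mask derived from the user's point/box/segment prompt. Heads are ranked by the fraction of their attention mass that falls inside the prompt.

```python
import torch

def select_heads(attn_maps: torch.Tensor, prompt_mask: torch.Tensor, topk: int = 4):
    """attn_maps: (H, N) attention from the CLS token to N patches, one row per head.
    prompt_mask: (N,) {0,1} float mask of patches covered by the visual prompt.
    Returns the indices of the top-k heads whose attention concentrates inside the prompt."""
    mass_inside = (attn_maps * prompt_mask).sum(dim=1)
    total_mass = attn_maps.sum(dim=1).clamp(min=1e-8)
    return (mass_inside / total_mass).topk(topk).indices

# Toy usage: 12 heads over 196 patches, prompt covering the first 20 patches.
mask = torch.zeros(196); mask[:20] = 1.0
heads = select_heads(torch.rand(12, 196), mask)
```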
https://arxiv.org/abs/2504.01348
Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and demonstrate limited practical application value. This inspires us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. Addressing the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench. Experimental results demonstrate previous models' limitations in instance-aware retrieval and highlight the potential of MLLMs for advanced retrieval applications. The full training dataset, code, and models, in a wide range of sizes, are available at this https URL.
https://arxiv.org/abs/2504.00954
Composed Image Retrieval (CIR) is the task of retrieving images matching a reference image augmented with a text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have relied on triplet data containing a reference image, reformulation text, and a target image. However, curating such triplet data often necessitates human intervention, leading to prohibitive costs. This challenge has hindered the scalability of CIR model training even with the availability of abundant unlabeled data. With the recent advances in foundational models, we advocate a shift in the CIR training paradigm where human annotations can be efficiently replaced by large language models (LLMs). Specifically, we demonstrate the capability of large captioning and language models in efficiently generating data for CIR only relying on unannotated image collections. Additionally, we introduce an embedding reformulation architecture that effectively combines image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on CIRR and FashionIQ datasets. Furthermore, we demonstrate that by increasing the amount of generated data, our zero-shot model gets closer to the performance of supervised baselines.
https://arxiv.org/abs/2504.00812
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success in cross-modal tasks such as zero-shot image classification and text-image retrieval by effectively aligning visual and textual representations. However, the theoretical foundations underlying CLIP's strong generalization remain unclear. In this work, we address this gap by proposing the Cross-modal Information Bottleneck (CIB) framework. CIB offers a principled interpretation of CLIP's contrastive learning objective as an implicit Information Bottleneck optimization. Under this view, the model maximizes shared cross-modal information while discarding modality-specific redundancies, thereby preserving essential semantic alignment across modalities. Building on this insight, we introduce a Cross-modal Information Bottleneck Regularization (CIBR) method that explicitly enforces these IB principles during training. CIBR introduces a penalty term to discourage modality-specific redundancy, thereby enhancing semantic alignment between image and text features. We validate CIBR on extensive vision-language benchmarks, including zero-shot classification across seven diverse image datasets and text-image retrieval on MSCOCO and Flickr30K. The results show consistent performance gains over standard CLIP. These findings provide the first theoretical understanding of CLIP's generalization through the IB lens. They also demonstrate practical improvements, offering guidance for future cross-modal representation learning.
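Schematically, the training objective pairs a CLIP-style InfoNCE with an extra redundancy penalty. The penalty used below (squared distance between matched, normalized image and text embeddings) is only a stand-in to show where such a term enters; the paper's actual regularizer may take a different form.

```python
import torch
import torch.nn.functional as F

def cibr_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                    temperature: float = 0.07, lam: float = 0.1):
    """CLIP-style symmetric InfoNCE plus a proxy penalty on modality-specific redundancy."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    nce = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    redundancy = (img - txt).pow(2).sum(dim=-1).mean()  # proxy penalty (assumption, not the paper's term)
    return nce + lam * redundancy

# Toy usage: batch of 16 image/text pairs, 512-dim embeddings.
loss = cibr_style_loss(torch.randn(16, 512), torch.randn(16, 512))
```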
https://arxiv.org/abs/2503.24182
The image retrieval (IR) approach to image localization has distinct advantages over the 3D and the deep learning (DNN) approaches: it is scene-agnostic, simpler to implement and use, has no privacy issues, and is computationally efficient. The main drawback of this approach is relatively poor localization of both the position and orientation of the query camera compared to the competing approaches. This paper presents a hybrid approach that stores only image features in the database, like some IR methods, but relies on a latent 3D reconstruction, like 3D methods, without retaining a 3D scene reconstruction. The approach is based on two ideas: {\em (i)} a novel proposal where query camera center estimation relies only on relative translation estimates, not relative rotation estimates, through a decoupling of the two, and {\em (ii)} a shift from computing the optimal pose from estimated relative poses to computing it directly from multiview correspondences, thus cutting out the ``middle-man''. Our approach shows improved performance on the 7-Scenes and Cambridge Landmarks datasets while also improving on timing and memory footprint compared to the state of the art.
https://arxiv.org/abs/2503.23577
We introduce LOCORE, Long-Context Re-ranker, a model that takes as input local descriptors corresponding to an image query and a list of gallery images and outputs similarity scores between the query and each gallery image. This model is used for image retrieval, where typically a first ranking is performed with an efficient similarity measure, and then a shortlist of top-ranked images is re-ranked based on a more fine-grained similarity measure. Compared to existing methods that perform pair-wise similarity estimation with local descriptors or list-wise re-ranking with global descriptors, LOCORE is the first method to perform list-wise re-ranking with local descriptors. To achieve this, we leverage efficient long-context sequence models to effectively capture the dependencies between query and gallery images at the local-descriptor level. During testing, we process long shortlists with a sliding window strategy that is tailored to overcome the context size limitations of sequence models. Our approach achieves superior performance compared with other re-rankers on established image retrieval benchmarks of landmarks (ROxf and RPar), products (SOP), fashion items (In-Shop), and bird species (CUB-200) while having comparable latency to the pair-wise local descriptor re-rankers.
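The sliding-window strategy over a long shortlist can be sketched as below, where `score_fn` stands in for a LOCORE forward pass over the query's and the window's local descriptors, and the window/stride values are placeholders rather than the paper's settings; each gallery image keeps its best score across the overlapping windows.

```python
def sliding_window_rerank(score_fn, query, shortlist, window: int = 100, stride: int = 50):
    """Re-rank a long shortlist in overlapping windows to respect the sequence model's context limit.
    score_fn(query, chunk) is assumed to return one similarity score per gallery id in `chunk`."""
    best = {}
    for start in range(0, len(shortlist), stride):
        chunk = shortlist[start:start + window]
        if not chunk:
            break
        for img_id, score in zip(chunk, score_fn(query, chunk)):
            best[img_id] = max(score, best.get(img_id, float("-inf")))
        if start + window >= len(shortlist):
            break
    return sorted(best, key=best.get, reverse=True)

# Toy usage with a dummy scorer over 400 gallery ids.
ranked = sliding_window_rerank(lambda q, c: [len(str(i)) for i in c], "query", list(range(400)))
```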
https://arxiv.org/abs/2503.21772
In this work, we aim to compress the vision tokens of a Large Vision Language Model (LVLM) into a representation that is simultaneously suitable for (a) generative and (b) discriminative tasks, (c) is nearly lossless, and (d) is storage-efficient. We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner. At the core of Fwd2Bot lies a "double-forward pass" training strategy, whereby, during the first forward pass, the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of summary tokens. Then, using the same LLM, the second forward pass processes the language instruction(s) alongside the summary tokens, which serve as a direct replacement for the image tokens. The training signal is provided by two losses: an autoregressive loss applied after the second pass that provides a direct optimization objective for compression, and a contrastive loss, applied after the first pass, that further boosts the representation strength, especially for discriminative tasks. The training is further enhanced by stage-specific adapters. We accompany the proposed method with an in-depth ablation study. Overall, Fwd2Bot results in highly informative compressed representations suitable for both generative and discriminative tasks. For generative tasks, we offer a 2x higher compression rate without compromising generative capabilities, setting a new state-of-the-art result. For discriminative tasks, we set a new state of the art on image retrieval and compositionality.
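The double-forward idea can be sketched as follows; `llm` is a placeholder callable that accepts pre-computed input embeddings and returns hidden states of the same shape, and the concatenation order and slicing of the summary tokens are assumptions, not the paper's implementation.

```python
import torch

def double_forward(llm, vision_embeds: torch.Tensor,
                   summary_queries: torch.Tensor,
                   instruction_embeds: torch.Tensor):
    """Pass 1: condense vision tokens into a few learnable summary tokens (the bottleneck).
    Pass 2: run the same LLM on the instruction, with summary tokens replacing the image tokens."""
    hidden1 = llm(torch.cat([vision_embeds, summary_queries], dim=1))   # (B, Nv + Ns, D)
    summary = hidden1[:, -summary_queries.size(1):, :]                  # (B, Ns, D) bottleneck tokens
    hidden2 = llm(torch.cat([summary, instruction_embeds], dim=1))      # (B, Ns + Nt, D)
    return summary, hidden2

# Toy usage with an identity "LLM" standing in for a real decoder.
summary, out = double_forward(lambda x: x, torch.randn(1, 576, 64),
                              torch.randn(1, 8, 64), torch.randn(1, 32, 64))
```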
https://arxiv.org/abs/2503.21757