While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for handling such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference.
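To make the linear-approximation strategy concrete, here is a minimal sketch (not the authors' code): promptable embeddings are approximated as a linear map applied to pre-computed general embeddings, fitted by least squares on a small calibration set. All arrays, the embedding dimension, and the example attribute prompt are assumed stand-ins for real retriever outputs.

```python
# Minimal sketch: approximate promptable image embeddings as a linear map
# over pre-computed general embeddings (all data here is random stand-ins).
import numpy as np

rng = np.random.default_rng(0)
d = 256                                   # embedding dimension (assumed)
general = rng.normal(size=(1000, d))      # pre-computed general image embeddings
# Stand-in for embeddings produced with an attribute prompt,
# e.g. "focus on the color of the vehicle" (hypothetical prompt).
promptable = general @ rng.normal(size=(d, d)) * 0.1 + 0.05 * rng.normal(size=(1000, d))

# Fit a linear map W on a small calibration subset: promptable ~= general @ W
calib = slice(0, 200)
W, *_ = np.linalg.lstsq(general[calib], promptable[calib], rcond=None)

# At query time, approximate promptable embeddings for the whole pool cheaply.
approx = general @ W

def recall_at_k(query, pool, target_idx, k=5):
    """Cosine-similarity retrieval; True if the target lands in the top-k."""
    sims = pool @ query / (np.linalg.norm(pool, axis=1) * np.linalg.norm(query) + 1e-8)
    return target_idx in np.argsort(-sims)[:k]

print(recall_at_k(promptable[500], approx, target_idx=500))
```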
https://arxiv.org/abs/2505.15877
Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision, stemming from variable text encodings, undermines retrieval reliability. To address these issues, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval.
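Since GED is proposed as the ground-truth similarity, a small illustration may help. The sketch below computes GED between two toy scene graphs with NetworkX and converts the edit cost into a bounded similarity; the label conventions and the normalization are assumptions, not the paper's exact setup.

```python
# Minimal sketch: Graph Edit Distance between two toy scene graphs (NetworkX).
import networkx as nx

def scene_graph(triples):
    g = nx.DiGraph()
    for subj, pred, obj in triples:
        g.add_node(subj, label=subj.split("_")[0])   # "dog_1" -> label "dog"
        g.add_node(obj, label=obj.split("_")[0])
        g.add_edge(subj, obj, label=pred)
    return g

g1 = scene_graph([("dog_1", "on", "grass_1"), ("dog_1", "holding", "ball_1")])
g2 = scene_graph([("dog_1", "on", "grass_1"), ("dog_1", "near", "ball_1")])

node_match = lambda a, b: a["label"] == b["label"]
edge_match = lambda a, b: a["label"] == b["label"]

ged = nx.graph_edit_distance(g1, g2, node_match=node_match, edge_match=edge_match)
# Turn the edit cost into a bounded similarity for ranking ground-truth pairs.
max_size = max(g1.number_of_nodes() + g1.number_of_edges(),
               g2.number_of_nodes() + g2.number_of_edges())
print(f"GED = {ged}, similarity = {1 - ged / max_size:.2f}")
```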
https://arxiv.org/abs/2505.15867
Additive manufacturing enables the fabrication of complex designs while minimizing waste, but faces challenges related to defects and process anomalies. This study presents a novel multimodal Retrieval-Augmented Generation (RAG)-based framework that automates anomaly detection across various Additive Manufacturing processes by leveraging information retrieved from the literature, including images and descriptive text, rather than training datasets. This framework integrates text and image retrieval from scientific literature and multimodal generation models to perform zero-shot anomaly identification, classification, and explanation generation in a Laser Powder Bed Fusion (L-PBF) setting. The proposed framework is evaluated on four L-PBF manufacturing datasets from Oak Ridge National Laboratory, featuring various printer makes, models, and materials. This evaluation demonstrates the framework's adaptability and generalizability across diverse images without requiring additional training. Comparative analysis using Qwen2-VL-2B and GPT-4o-mini as the MLLM within the proposed framework highlights that GPT-4o-mini outperforms Qwen2-VL-2B and a proportional random baseline in manufacturing anomaly classification. Additionally, the evaluation of the RAG system confirms that incorporating retrieval mechanisms improves average accuracy by 12% by reducing the risk of hallucination and providing additional information. The proposed framework can be continuously updated by integrating emerging research, allowing seamless adaptation to the evolving landscape of AM technologies. This scalable, automated, and zero-shot-capable framework streamlines AM anomaly analysis, enhancing efficiency and accuracy.
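The retrieve-then-classify flow can be pictured with a short schematic: rank literature snippets by embedding similarity, then assemble a classification prompt for the MLLM. The corpus, embeddings, and `query_mllm` below are hypothetical stand-ins for the real knowledge base and model wrapper.

```python
# Schematic sketch of retrieval-augmented anomaly classification (stand-in data).
import numpy as np

rng = np.random.default_rng(1)
corpus = [
    "Keyhole porosity in L-PBF appears as rounded dark voids ...",
    "Spatter particles show up as bright ejecta near the melt pool ...",
    "Lack-of-fusion defects form elongated irregular cavities ...",
]
corpus_emb = rng.normal(size=(len(corpus), 512))     # stand-in text/image embeddings
query_emb = rng.normal(size=512)                     # embedding of the query build-layer image

def top_k(query, pool, k=2):
    sims = pool @ query / (np.linalg.norm(pool, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k]

retrieved = [corpus[i] for i in top_k(query_emb, corpus_emb)]
prompt = (
    "You are inspecting an L-PBF build layer image.\n"
    "Reference knowledge from literature:\n- " + "\n- ".join(retrieved) +
    "\nClassify the anomaly (porosity / spatter / lack of fusion / none) and explain."
)

def query_mllm(prompt, image_path):   # hypothetical MLLM call (not a real API)
    raise NotImplementedError

print(prompt)
```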
https://arxiv.org/abs/2505.13828
Ground texture localization using a downward-facing camera offers a low-cost, high-precision localization solution that is robust to dynamic environments and requires no environmental modification. We present a significantly improved bag-of-words (BoW) image retrieval system for ground texture localization, achieving substantially higher accuracy for global localization and higher precision and recall for loop closure detection in SLAM. Our approach leverages an approximate $k$-means (AKM) vocabulary with soft assignment, and exploits the consistent orientation and constant scale constraints inherent to ground texture localization. Identifying the different needs of global localization vs. loop closure detection for SLAM, we present both high-accuracy and high-speed versions of our algorithm. We test the effect of each of our proposed improvements through an ablation study and demonstrate our method's effectiveness for both global localization and loop closure detection. With numerous ground texture localization systems already using BoW, our method can readily replace other generic BoW systems in their pipeline and immediately improve their results.
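As a rough illustration of the soft-assignment bag-of-words encoding, the sketch below assigns each local descriptor to its k nearest visual words with Gaussian weights; a kd-tree stands in for the approximate k-means index, and the vocabulary, descriptors, and width sigma are assumed placeholders rather than the paper's settings.

```python
# Minimal sketch: soft-assignment BoW encoding over a visual vocabulary.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
vocab = rng.normal(size=(1000, 32))          # visual word centers from (approximate) k-means
index = cKDTree(vocab)

def soft_bow(descriptors, k=3, sigma=0.4):
    """Encode local descriptors as an L1-normalized soft-assignment histogram."""
    hist = np.zeros(len(vocab))
    dists, words = index.query(descriptors, k=k)
    weights = np.exp(-dists**2 / (2 * sigma**2))
    weights /= weights.sum(axis=1, keepdims=True)
    np.add.at(hist, words.ravel(), weights.ravel())
    return hist / hist.sum()

query = soft_bow(rng.normal(size=(500, 32)))     # descriptors from the query texture patch
db = np.stack([soft_bow(rng.normal(size=(500, 32))) for _ in range(20)])
scores = db @ query                              # dot-product ranking of database images
print(np.argsort(-scores)[:5])
```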
https://arxiv.org/abs/2505.11620
The development of foundation models through pretraining of vision-language models (VLMs) has recently attracted great attention in remote sensing (RS). VLM pretraining aims to learn image and language alignments from a large number of image-text pairs. Each pretraining image is often associated with multiple captions containing redundant information due to repeated or semantically similar phrases, resulting in increased pretraining and inference time. To overcome this, we introduce a weighted feature aggregation (WFA) strategy for VLM pretraining in RS. Our strategy aims to extract and exploit complementary information from multiple captions per image while reducing redundancies through feature aggregation with importance weighting. To calculate adaptive importance weights for different captions of each image, we propose two techniques: (i) non-parametric uniqueness and (ii) learning-based attention. In the first technique, importance weights are calculated based on the bilingual evaluation understudy (BLEU) scores of the captions to emphasize unique sentences and reduce the influence of repetitive ones. In the second technique, importance weights are learned through an attention mechanism instead of relying on hand-crafted features. The effectiveness of the proposed WFA strategy with the two techniques is analyzed in terms of downstream performance on text-to-image retrieval in RS. Experimental results show that the proposed strategy enables efficient and effective pretraining of VLMs in RS. Based on the experimental analysis, we derive guidelines for selecting appropriate techniques depending on downstream task requirements and resource constraints. The code of this work is publicly available at this https URL.
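The non-parametric uniqueness weighting lends itself to a compact sketch: captions that overlap heavily with the other captions of the same image (high BLEU against them) receive lower weights, and the caption features are aggregated with those weights. The caption features below are random stand-ins for text-encoder outputs, and the exact weighting formula is an assumption.

```python
# Minimal sketch: BLEU-based uniqueness weights for multi-caption aggregation.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

captions = [
    "an airport with several parked airplanes",
    "several airplanes parked at an airport",
    "a satellite view of an airport apron with aircraft and service roads",
]
tokens = [c.split() for c in captions]
smooth = SmoothingFunction().method1

def redundancy(i):
    """BLEU of caption i against the remaining captions of the same image."""
    refs = [t for j, t in enumerate(tokens) if j != i]
    return sentence_bleu(refs, tokens[i], smoothing_function=smooth)

uniqueness = np.array([1.0 - redundancy(i) for i in range(len(captions))])
weights = uniqueness / uniqueness.sum()

caption_feats = np.random.default_rng(3).normal(size=(len(captions), 512))
aggregated = weights @ caption_feats          # single text feature per image
print(weights.round(3), aggregated.shape)
```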
https://arxiv.org/abs/2505.11121
Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly-representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a plug-and-play solution.
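A rough, simplified sketch of the abstract-to-concrete idea: the abstract query embedding is shifted toward a weighted average of its nearest concrete caption embeddings drawn from an existing database. The blending weight alpha and the plain nearest-neighbor scheme are assumptions, and the embeddings are random stand-ins for CLIP-style text features.

```python
# Rough sketch: shift an abstract query embedding toward concrete neighbors.
import numpy as np

rng = np.random.default_rng(4)
concrete_db = rng.normal(size=(10_000, 512))          # embeddings of concrete captions
concrete_db /= np.linalg.norm(concrete_db, axis=1, keepdims=True)

def act_translate(abstract_emb, k=16, alpha=0.5):
    """Blend an abstract query embedding with its k nearest concrete neighbors."""
    q = abstract_emb / np.linalg.norm(abstract_emb)
    sims = concrete_db @ q
    nn = np.argsort(-sims)[:k]
    w = np.maximum(sims[nn], 0)
    w /= w.sum() + 1e-8
    concrete_proxy = w @ concrete_db[nn]
    out = alpha * q + (1 - alpha) * concrete_proxy    # alpha is an assumed knob
    return out / np.linalg.norm(out)

query = act_translate(rng.normal(size=512))
print(query.shape)
```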
https://arxiv.org/abs/2505.03242
Oracle Bone Inscription (OBI) is the earliest systematic writing system in China, while the identification of Oracle Bone (OB) duplicates is a fundamental issue in OBI research. In this work, we design a progressive OB duplicate discovery framework that combines unsupervised low-level keypoint matching with high-level text-centric content-based matching to refine and rank the candidate OB duplicates with semantic awareness and interpretability. We compare our approach with state-of-the-art content-based image retrieval and image matching methods, showing that our approach yields comparable recall performance and the highest simplified mean reciprocal rank scores for both Top-5 and Top-15 retrieval results, with significantly higher computational efficiency. We have discovered over 60 pairs of new OB duplicates in real-world deployment, which were missed by OBI researchers for decades. The models, video illustration and demonstration of this work are available at: this https URL.
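The low-level stage can be illustrated with generic keypoint matching. The sketch below scores a candidate pair with ORB features and a ratio test in OpenCV; this is not the authors' pipeline, and the 0.75 ratio threshold and the toy images are assumptions.

```python
# Sketch of a low-level keypoint-matching score for candidate duplicate pairs.
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=1000)
bf = cv2.BFMatcher(cv2.NORM_HAMMING)

def match_score(img_a, img_b, ratio=0.75):
    """Count ratio-test-surviving ORB matches between two grayscale rubbings."""
    _, des_a = orb.detectAndCompute(img_a, None)
    _, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None or len(des_a) < 2 or len(des_b) < 2:
        return 0
    good = 0
    for pair in bf.knnMatch(des_a, des_b, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good += 1
    return good

# Toy grayscale arrays stand in for scanned rubbings; rank candidates by score.
a = np.random.default_rng(5).integers(0, 255, (256, 256), dtype=np.uint8)
b = cv2.GaussianBlur(a, (5, 5), 1.0)     # a lightly degraded "duplicate"
print(match_score(a, b), match_score(a, a[::-1].copy()))
```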
https://arxiv.org/abs/2505.03836
Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text. Due to the high cost of annotating CIR triplet datasets, zero-shot (ZS) CIR has gained traction as a promising alternative. Existing studies mainly focus on projection-based methods, which map an image to a single pseudo-word token. However, these methods face three critical challenges: (1) insufficient pseudo-word token representation capacity, (2) discrepancies between training and inference phases, and (3) reliance on large-scale synthetic data. To address these issues, we propose a two-stage framework in which training proceeds from mapping to composing. In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module and a soft text alignment objective, enabling the token to capture richer and more fine-grained image information. In the second stage, we optimize the text encoder using a small amount of synthetic triplet data, enabling it to effectively extract compositional semantics by combining pseudo-word tokens with modification text for accurate target image retrieval. The strong visual-to-pseudo mapping established in the first stage provides a solid foundation for the second stage, making our approach compatible with both high- and low-quality synthetic data, and capable of achieving significant performance gains with only a small amount of synthetic data. Extensive experiments were conducted on three public datasets, achieving superior performance compared to existing approaches.
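The pseudo-word mechanism at the core of such projection-based ZS-CIR methods can be sketched as follows: an image feature is projected to a single token embedding and spliced into the modification text in place of a placeholder before text encoding. The mapper, embedding table, and template below are stand-ins, not CLIP or the paper's actual modules.

```python
# Schematic sketch: splice a pseudo-word token into a text template (PyTorch).
import torch
import torch.nn as nn

d_img, d_tok = 768, 512

class PseudoWordMapper(nn.Module):
    """Projects a global image feature to a single pseudo-word token embedding."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_img, d_tok), nn.GELU(), nn.Linear(d_tok, d_tok))

    def forward(self, img_feat):
        return self.proj(img_feat)

vocab_emb = nn.Embedding(1000, d_tok)                 # stand-in token embedding table
mapper = PseudoWordMapper()

template_ids = torch.tensor([5, 17, 999, 42])         # e.g. "a photo of [*] that is red"; 999 = [*]
img_feat = torch.randn(1, d_img)

pseudo = mapper(img_feat)                             # (1, d_tok)
mask = (template_ids == 999).view(1, -1, 1)           # (1, seq, 1)
tok_embs = torch.where(mask, pseudo.unsqueeze(1), vocab_emb(template_ids).unsqueeze(0))

# tok_embs would then pass through the (frozen) text encoder to give the composed query.
print(tok_embs.shape)                                 # torch.Size([1, 4, 512])
```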
https://arxiv.org/abs/2504.17990
There are many ways to describe, name, and group objects when captioning an image. Differences are evident when speakers come from diverse cultures due to the unique experiences that shape perception. Machine translation of captions has pushed multilingual capabilities in vision-language models (VLMs), but data comes mainly from English speakers, indicating a perceptual bias and lack of model flexibility. In this work, we address this challenge and outline a data-efficient framework to instill multilingual VLMs with greater understanding of perceptual diversity. We specifically propose an LLM-based, multimodal recaptioning strategy that alters the object descriptions of English captions before translation. The greatest benefits are demonstrated in a targeted multimodal mechanism guided by native speaker data. By adding the produced rewrites as augmentations in training, we improve on German and Japanese text-image retrieval case studies (up to +3.5 mean recall overall, +4.7 on non-native error cases). We further propose a mechanism to analyze the specific object description differences across datasets, and we offer insights into cross-dataset and cross-language generalization.
https://arxiv.org/abs/2504.14359
Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still suffer from insufficient semantic information in both the identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash the semantic understanding capabilities in the generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
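The generative-retrieval plumbing that natural-language identifiers rely on can be sketched with a prefix trie that constrains decoding to valid identifiers only. The identifier strings and the `<eos>` convention below are illustrative assumptions, not the paper's exact SID construction.

```python
# Sketch: prefix-trie constraint for decoding natural-language identifiers.
identifiers = {
    "img_001": "a brown dog playing with a frisbee on grass",
    "img_002": "a brown dog sleeping on a couch",
    "img_003": "two children flying a kite at the beach",
}

trie = {}
for img_id, sid in identifiers.items():
    node = trie
    for tok in sid.split():
        node = node.setdefault(tok, {})
    node["<eos>"] = img_id            # leaf records which image this identifier names

def allowed_next(prefix_tokens):
    """Valid continuations of a partially decoded identifier (for constrained decoding)."""
    node = trie
    for tok in prefix_tokens:
        node = node.get(tok, {})
    return [t for t in node if t != "<eos>"] or ["<eos>"]

print(allowed_next(["a", "brown", "dog"]))      # -> ['playing', 'sleeping']
```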
https://arxiv.org/abs/2504.13172
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., "ride" can be depicted as "race" and "sit on", from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
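The post-generation matching stage admits a simple sketch: each generated relation embedding is assigned the predicate word whose text embedding it is most similar to. All embeddings below are random stand-ins for the model's outputs and word embeddings.

```python
# Sketch of the matching step: assign predicate words by cosine similarity.
import numpy as np

rng = np.random.default_rng(6)
predicates = ["ride", "race", "sit on", "hold", "next to"]
pred_emb = rng.normal(size=(len(predicates), 256))
pred_emb /= np.linalg.norm(pred_emb, axis=1, keepdims=True)

generated = rng.normal(size=(3, 256))            # relation embeddings from the diffusion model
generated /= np.linalg.norm(generated, axis=1, keepdims=True)

sims = generated @ pred_emb.T                    # (num_relations, num_predicates)
for i, row in enumerate(sims):
    top = np.argsort(-row)[:2]
    print(f"relation {i}: " + ", ".join(f"{predicates[j]} ({row[j]:.2f})" for j in top))
```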
https://arxiv.org/abs/2504.12100
The standard approach for visual place recognition is to use global image descriptors to retrieve the most similar database images for a given query image. The results can then be further improved with re-ranking methods that re-order the top scoring images. However, existing methods focus on re-ranking based on the same image descriptors that were used for the initial retrieval, which we argue provides limited additional signal. In this work we propose Generalized Contextual Similarity Aggregation (GCSA), which is a graph neural network-based re-ranking method that, in addition to the visual descriptors, can leverage other types of available side information. This can, for example, be other sensor data (such as signal strength of nearby WiFi or Bluetooth endpoints) or geometric properties such as camera poses for database images. In many applications this information is already present or can be acquired with low effort. Our architecture leverages the concept of affinity vectors to allow for a shared encoding of the heterogeneous multi-modal input. Two large-scale datasets, covering both outdoor and indoor localization scenarios, are utilized for training and evaluation. In experiments we show significant improvement not only on image retrieval metrics, but also for the downstream visual localization task.
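The affinity-vector idea can be pictured as follows: each modality (visual descriptor, WiFi signal strengths, camera pose) is turned into a vector of similarities to a common set of anchor database images, giving a shared encoding that a GNN re-ranker can consume. The data, similarity choices, and distance kernel below are illustrative assumptions.

```python
# Sketch: per-modality affinity vectors against a shared set of anchors.
import numpy as np

rng = np.random.default_rng(7)
n_anchors = 64
anchor_visual = rng.normal(size=(n_anchors, 512))
anchor_wifi = rng.normal(size=(n_anchors, 20))      # signal strengths of 20 access points
anchor_pos = rng.uniform(0, 100, size=(n_anchors, 3))

def cosine(a, B):
    return B @ a / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)

def affinity_vector(visual, wifi, pos):
    """Concatenate per-modality affinities to the shared anchors."""
    vis_aff = cosine(visual, anchor_visual)
    wifi_aff = cosine(wifi, anchor_wifi)
    pos_aff = np.exp(-np.linalg.norm(anchor_pos - pos, axis=1) / 10.0)   # distance kernel
    return np.concatenate([vis_aff, wifi_aff, pos_aff])                  # (3 * n_anchors,)

node_feat = affinity_vector(rng.normal(size=512), rng.normal(size=20), np.array([5.0, 2.0, 1.5]))
print(node_feat.shape)       # such vectors become node features of the re-ranking graph
```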
https://arxiv.org/abs/2504.11134
Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information. Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in intention interpretation. These methods tend to disproportionately emphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such an imbalanced representation often fails to accurately capture and reflect the actual search intent of the user in the retrieval results. To address this challenge, we propose TMCIR, a novel framework that advances composed image retrieval through two key innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP encoders contrastively using intent-reflecting pseudo-target images, synthesized from reference images and textual descriptions via a diffusion model. This step enhances the text encoder's ability to capture nuanced intents in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune all encoders contrastively by comparing adaptive token-fusion features with the target image. This mechanism dynamically balances visual and textual representations within the contrastive learning pipeline, optimizing the composed feature for retrieval. Extensive experiments on the Fashion-IQ and CIRR datasets demonstrate that TMCIR significantly outperforms state-of-the-art methods, particularly in capturing nuanced user intent.
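A sketch of the balancing idea under simplifying assumptions: a learned gate decides, per dimension, how much of the reference-image feature and the modification-text feature enters the composed query, trained with a standard contrastive objective against the target image. This illustrates the adaptive-fusion concept, not the paper's exact architecture.

```python
# Sketch: gated adaptive fusion of image and text features with a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_feat, txt_feat):
        g = self.gate(torch.cat([img_feat, txt_feat], dim=-1))     # gate values in [0, 1]
        fused = g * img_feat + (1 - g) * txt_feat
        return F.normalize(fused, dim=-1)

fusion = AdaptiveFusion()
img_feat, txt_feat, target = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
composed = fusion(img_feat, txt_feat)

# InfoNCE-style objective against the target-image features (0.07 = assumed temperature).
logits = composed @ F.normalize(target, dim=-1).T / 0.07
loss = F.cross_entropy(logits, torch.arange(8))
print(loss.item())
```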
https://arxiv.org/abs/2504.10995
Visual Place Recognition (VPR) is aimed at predicting the location of a query image by referencing a database of geotagged images. In the VPR task, often only a few discriminative local regions in an image have an important effect, while mundane background regions contribute little or even cause perceptual aliasing because they easily overlap across different places. However, existing methods lack precise modeling and full exploitation of these discriminative regions. In this paper, we propose the Focus on Local (FoL) approach to boost the performance of both image retrieval and re-ranking in VPR by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correlation supervision. First, we design two losses, Extraction-Aggregation Spatial Alignment Loss (SAL) and Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly-supervised local feature training strategy based on pseudo-correspondences obtained from aggregating global features to alleviate the lack of ground-truth local correspondences for the VPR task. Third, we propose a re-ranking pipeline that is both efficient and precise, guided by the discriminative regions. Finally, experimental results show that our FoL achieves state-of-the-art results on multiple VPR benchmarks in both the image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. Code and models are available at this https URL
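One common way to mine pseudo-correspondences without ground truth is sketched below: for an image pair that global features already judge as matching, local descriptors that are mutual nearest neighbors are kept as weak positive pairs for local-feature training. This is a generic illustration under assumed thresholds and random stand-in descriptors, not necessarily the paper's exact procedure.

```python
# Sketch: mutual-nearest-neighbor pseudo-correspondences between two images.
import numpy as np

rng = np.random.default_rng(8)
local_a = rng.normal(size=(200, 128))      # local descriptors of image A
local_b = rng.normal(size=(180, 128))      # local descriptors of image B

def mutual_nn(a, b, min_sim=0.0):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T
    nn_ab = sims.argmax(axis=1)                    # best match in B for each A descriptor
    nn_ba = sims.argmax(axis=0)                    # best match in A for each B descriptor
    return [(i, j) for i, j in enumerate(nn_ab)
            if nn_ba[j] == i and sims[i, j] > min_sim]

pseudo_pairs = mutual_nn(local_a, local_b)
print(len(pseudo_pairs), pseudo_pairs[:5])
```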
https://arxiv.org/abs/2504.09881
Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited visual cue modeling capability. This paper introduces an evolved hierarchical masking method to pursue general visual cue modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is hence adopted to generate masks accordingly. The accuracy of the hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial-duplicate image retrieval relying on low-level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1% in ImageNet-1K classification and 1.4% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantically demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.
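A toy sketch of the evolving-mask idea, under loose assumptions: patch masking probabilities shift from low-saliency (texture-like) patches early in training toward high-saliency (object-like) patches later, driven by per-patch scores from the model being trained. The scores here are random stand-ins and the interpolation scheme is an assumption, not the paper's hierarchy parser.

```python
# Toy sketch: mask selection that evolves with training progress.
import numpy as np

rng = np.random.default_rng(9)

def evolved_mask(patch_scores, progress, mask_ratio=0.6):
    """progress in [0, 1]; higher scores ~ more salient / higher-level patches."""
    s = (patch_scores - patch_scores.min()) / (np.ptp(patch_scores) + 1e-8)
    bias = (1 - progress) * (1 - s) + progress * s       # early: low-level, late: high-level
    n_mask = int(mask_ratio * len(s))
    order = np.argsort(-(bias + 1e-3 * rng.random(len(s))))   # small noise breaks ties
    mask = np.zeros(len(s), dtype=bool)
    mask[order[:n_mask]] = True
    return mask

patch_scores = rng.random(196)               # e.g. 14x14 patches of a 224x224 image
early, late = evolved_mask(patch_scores, 0.1), evolved_mask(patch_scores, 0.9)
print(early.sum(), late.sum(), (early & late).sum())
```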
https://arxiv.org/abs/2504.09155
Recent advancements in computer vision have highlighted the scalability of Vision Transformers (ViTs) across various tasks, yet challenges remain in balancing adaptability, computational efficiency, and the ability to model higher-order relationships. Vision Graph Neural Networks (ViGs) offer an alternative by leveraging graph-based methodologies but are hindered by the computational bottlenecks of clustering algorithms used for edge generation. To address these issues, we propose the Hypergraph Vision Transformer (HgVT), which incorporates a hierarchical bipartite hypergraph structure into the vision transformer framework to capture higher-order semantic relationships while maintaining computational efficiency. HgVT leverages population and diversity regularization for dynamic hypergraph construction without clustering, and expert edge pooling to enhance semantic extraction and facilitate graph-based image retrieval. Empirical results demonstrate that HgVT achieves strong performance on image classification and retrieval, positioning it as an efficient framework for semantic-based vision tasks.
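A compact sketch of bipartite vertex/hyperedge message passing under assumed shapes: patch-vertex features are pooled into hyperedge features through a soft membership matrix and then broadcast back. This illustrates the structural idea only, not HgVT's full architecture, its regularizers, or expert edge pooling.

```python
# Sketch: one bipartite hypergraph message-passing layer over patch tokens.
import torch
import torch.nn as nn

class BipartiteHypergraphLayer(nn.Module):
    def __init__(self, dim=256, n_edges=16):
        super().__init__()
        self.membership = nn.Linear(dim, n_edges)      # soft vertex-to-hyperedge assignment
        self.v2e = nn.Linear(dim, dim)
        self.e2v = nn.Linear(dim, dim)

    def forward(self, v):                              # v: (batch, n_vertices, dim)
        A = self.membership(v).softmax(dim=-1)         # (batch, n_vertices, n_edges)
        e = torch.einsum("bve,bvd->bed", A, self.v2e(v))       # hyperedge features
        v = v + torch.einsum("bve,bed->bvd", A, self.e2v(e))   # messages back to vertices
        return v, e

layer = BipartiteHypergraphLayer()
vertices = torch.randn(2, 196, 256)                    # ViT-style patch tokens
v_out, edges = layer(vertices)
print(v_out.shape, edges.shape)                        # edge features could feed image retrieval
```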
https://arxiv.org/abs/2504.08710
Visual understanding is inherently contextual -- what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus on either the person such as their clothing, or the type of flowers, depending on the context of interest. Yet, most existing image encoding paradigms represent an image as a fixed, generic feature vector, overlooking the potential needs of prioritizing varying visual information for different downstream use cases. In this work, we introduce FocalLens, a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed flexibly through natural language. We leverage vision instruction tuning data and contrastively finetune a pretrained vision encoder to take natural language instructions as additional inputs for producing conditional image representations. Extensive experiments validate that conditional image representation from FocalLens better pronounce the visual features of interest compared to generic features produced by standard vision encoders like CLIP. In addition, we show FocalLens further leads to performance improvements on a range of downstream tasks including image-image retrieval, image classification, and image-text retrieval, with an average gain of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.
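A usage-level sketch of conditional encoding: the same image pool is embedded twice under two different natural-language conditions, and retrieval runs against whichever condition matches the task. `encode_image` is a hypothetical stand-in for a FocalLens-style conditional encoder (here it just hashes its inputs to a random unit vector), and the instructions and file names are illustrative.

```python
# Sketch: retrieving against instruction-conditioned image embeddings.
import numpy as np

def encode_image(image, instruction):
    """Hypothetical conditional encoder; returns a deterministic random unit vector."""
    seed = abs(hash((image, instruction))) % (2**32)
    v = np.random.default_rng(seed).normal(size=512)
    return v / np.linalg.norm(v)

images = [f"img_{i:03d}.jpg" for i in range(100)]
pool_clothing = np.stack([encode_image(im, "focus on the person's clothing") for im in images])
pool_flowers = np.stack([encode_image(im, "focus on the type of flowers") for im in images])

query = encode_image("query.jpg", "focus on the type of flowers")
ranked = np.argsort(-(pool_flowers @ query))    # retrieve under the matching condition
print([images[i] for i in ranked[:3]])
```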
https://arxiv.org/abs/2504.08368
Fine-grained text-to-image retrieval aims to retrieve a fine-grained target image with a given text query. Existing methods typically assume that each training image is accurately depicted by its textual descriptions. However, textual descriptions can be ambiguous and fail to depict discriminative visual details in images, leading to inaccurate representation learning. To alleviate the effects of text ambiguity, we propose a Multi-Modal Reference learning framework to learn robust representations. We first propose a multi-modal reference construction module to aggregate all visual and textual details of the same object into a comprehensive multi-modal reference. The multi-modal reference hence facilitates the subsequent representation learning and retrieval similarity computation. Specifically, a reference-guided representation learning module is proposed to use multi-modal references to learn more accurate visual and textual representations. Additionally, we introduce a reference-based refinement method that employs the object references to compute a reference-based similarity that refines the initial retrieval results. Extensive experiments are conducted on five fine-grained text-to-image retrieval datasets for different text-to-image retrieval tasks. The proposed method has achieved superior performance over state-of-the-art methods. For instance, on the text-to-person image retrieval dataset RSTPReid, our method achieves the Rank1 accuracy of 56.2%, surpassing the recent CFine by 5.6%.
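A sketch of the reference-based refinement step under assumed conventions: each identity's images and captions are averaged into one multi-modal reference, and the initial query-image scores are blended with query-reference scores. The embeddings are random stand-ins, and the blending weight lambda is an assumption.

```python
# Sketch: multi-modal references and reference-based score refinement.
import numpy as np

rng = np.random.default_rng(11)
n_imgs, dim = 12, 256
img_emb = rng.normal(size=(n_imgs, dim))
txt_emb = rng.normal(size=(n_imgs, dim))          # one caption embedding per image
identity = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3])   # object / person IDs

references = {}
for pid in np.unique(identity):
    members = identity == pid
    references[pid] = np.concatenate([img_emb[members], txt_emb[members]]).mean(axis=0)

def refined_scores(query, lam=0.7):
    """Blend direct query-image scores with query-reference scores."""
    direct = img_emb @ query
    ref = np.array([references[pid] @ query for pid in identity])
    return lam * direct + (1 - lam) * ref

query = rng.normal(size=dim)                      # text-query embedding
print(np.argsort(-refined_scores(query))[:5])
```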
https://arxiv.org/abs/2504.07718
High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code is available at this https URL
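A schematic sketch of the divide-then-aggregate flow: split the image into spatial crops, caption each crop, filter near-duplicate local captions by embedding similarity, and assemble the survivors into one global-description request. `caption_crop` and `embed_text` are hypothetical wrappers around an MLLM and a text encoder, and the grid size and similarity threshold are assumed parameters.

```python
# Schematic sketch: divide image into crops, caption, filter, aggregate.
import numpy as np

def spatial_crops(width, height, grid=2):
    """Split the image extent into a grid of crop boxes (left, top, right, bottom)."""
    w, h = width // grid, height // grid
    return [(c * w, r * h, (c + 1) * w, (r + 1) * h)
            for r in range(grid) for c in range(grid)]

def caption_crop(image_path, box):
    """Hypothetical MLLM call that captions one crop of the image."""
    raise NotImplementedError

def embed_text(text):
    """Hypothetical unit-norm sentence embedding used for the semantic filter."""
    raise NotImplementedError

def aggregate(local_captions, sim_threshold=0.9):
    """Drop local captions that near-duplicate an already-kept one, then build the final request."""
    kept, kept_emb = [], []
    for cap in local_captions:
        e = embed_text(cap)
        if all(float(np.dot(e, k)) < sim_threshold for k in kept_emb):
            kept.append(cap)
            kept_emb.append(e)
    return ("Combine the following region descriptions into one detailed, "
            "consistent caption:\n- " + "\n- ".join(kept))

print(spatial_crops(1024, 768))      # each box would go through caption_crop, then aggregate
```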
https://arxiv.org/abs/2504.06666
Image retrieval remains a challenging task due to the complex interaction between human visual perception, memory, and computational processes. Current image search engines often struggle to efficiently retrieve images based on natural language descriptions, as they rely on time-consuming preprocessing, tagging, and machine learning pipelines. This paper introduces the Human-Oriented Retrieval Search Engine for Images (HORSE), a novel approach that leverages neuro-symbolic indexing to improve image retrieval by focusing on human-oriented indexing. By integrating cognitive science insights with advanced computational techniques, HORSE enhances the retrieval process, making it more aligned with how humans perceive, store, and recall visual information. The neuro-symbolic framework combines the strengths of neural networks and symbolic reasoning, mitigating their individual limitations. The proposed system optimizes image retrieval, offering a more intuitive and efficient solution for users. We discuss the design and implementation of HORSE, highlight its potential applications in fields such as design error detection and knowledge management, and suggest future directions for research to further refine the system's metrics and capabilities.
https://arxiv.org/abs/2504.10502