While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a \textbf{closed-form} solution in the cross-modal space, achieving Pareto-optimal fairness with \textbf{bounded utility losses}. Our method is \textbf{training-free}, requires \textbf{no annotated data}, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and \textbf{intersectional} fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.
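As an illustration of the general idea of debiasing in a shared embedding space, the sketch below removes a set of bias directions by orthogonal projection and re-normalizes. This is a common training-free baseline for intuition only; it is not the paper's Pareto-optimal closed-form solution, which is not reproduced here.

```python
import numpy as np

def project_out(embeddings, bias_dirs):
    """Remove bias directions from L2-normalized embeddings by orthogonal
    projection, then re-normalize. A generic debiasing baseline for
    illustration, not the paper's closed-form method."""
    B = np.linalg.qr(np.asarray(bias_dirs, dtype=float).T)[0]  # orthonormal basis
    debiased = embeddings - (embeddings @ B) @ B.T             # subtract projection
    return debiased / np.linalg.norm(debiased, axis=1, keepdims=True)
```

For example, projecting out the first coordinate axis leaves only the orthogonal component of each embedding, re-normalized to unit length.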
https://arxiv.org/abs/2603.12998
This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses these limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed domain prompts, which serve as a text prior that offers more precise semantic supervision. In parallel, we introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.
https://arxiv.org/abs/2603.12711
Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the visual and textual input components most crucial to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on these analyses, we further propose a CIR data augmentation workflow that enriches existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases, while maintaining their capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvements.
https://arxiv.org/abs/2603.11520
Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
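The convex, domain-informed weighting of local and global alignment scores can be written as a one-line combination; `alpha` below is an illustrative weight emphasizing local evidence, not the paper's actual value.

```python
def composed_similarity(sim_local, sim_global, alpha=0.7):
    """Convex weighting of local and global alignment scores.
    alpha in [0, 1] controls the emphasis on clinically salient
    local evidence (an illustrative setting, not the paper's)."""
    assert 0.0 <= alpha <= 1.0
    return alpha * sim_local + (1.0 - alpha) * sim_global
```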
https://arxiv.org/abs/2603.09108
Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present \textbf{BM25-V}, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25's inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 $\geq$ 0.993, enabling a two-stage pipeline that reranks only $K{=}200$ candidates per query and recovers near-dense accuracy within $0.2$\% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
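A minimal sketch of the Okapi BM25 scoring that BM25-V applies to bags of SAE-derived visual words (standard BM25 with default `k1`/`b`; the inverted index and the dense reranking stage are omitted):

```python
import math
from collections import Counter

def bm25_scores(query_words, docs, k1=1.5, b=0.75):
    """Score documents (bags of visual words) against a query with Okapi BM25.
    Rare visual words receive high IDF and dominate the score, while
    ubiquitous low-information words are suppressed."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency per visual word.
    df = Counter()
    for d in docs:
        for w in set(d):
            df[w] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for w in query_words:
            if w not in tf:
                continue
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1.0)
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

In the two-stage pipeline this scorer would produce the high-recall candidate set, with only the top K=200 candidates passed to dense reranking.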
https://arxiv.org/abs/2603.05781
Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness, and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 major paradigms, we uncover three significant drawbacks. The best method, while achieving an mAP@10 of 28.5%, still retrieves irrelevant results (hard negatives) 9% of the time. The best models also exhibit 25.1% performance variation across paraphrases, indicating significant room for improving current CIR techniques. Multi-image queries perform 40 to 70% worse across different methods. To overcome these issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.
https://arxiv.org/abs/2603.04598
Composed image retrieval (CIR) addresses the task of retrieving a target image by jointly interpreting a reference image and a modification text that specifies the intended change. Most existing methods are still built upon contrastive learning frameworks that treat the ground-truth image as the only positive instance and all remaining images as negatives. This strategy inevitably introduces relevance suppression, where semantically related yet valid images are incorrectly pushed away, and semantic confusion, where different modification intents collapse into overlapping regions of the embedding space. As a result, the learned query representations often lack discriminativeness, particularly for fine-grained attribute modifications. To overcome these limitations, we propose distinctive query embeddings through learnable attribute weights and target relative negative sampling (DQE-CIR), a method designed to learn distinctive query embeddings by explicitly modeling target relative relevance during training. DQE-CIR incorporates learnable attribute weighting to emphasize distinctive visual features conditioned on the modification text, enabling more precise feature alignment between language and vision. Furthermore, we introduce target relative negative sampling, which constructs a target relative similarity distribution and selects informative negatives from a mid-zone region that excludes both easy negatives and ambiguous false negatives. This strategy enables more reliable retrieval for fine-grained attribute changes by improving query discriminativeness and reducing confusion caused by semantically similar but irrelevant candidates.
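The mid-zone selection idea can be sketched as quantile-based filtering of a target-relative similarity distribution; the `low_q`/`high_q` thresholds below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def midzone_negatives(sims, low_q=0.4, high_q=0.8, num=4, seed=0):
    """Pick informative negatives from the middle of a target-relative
    similarity distribution, excluding easy negatives (too dissimilar)
    and likely false negatives (too similar to the target)."""
    sims = np.asarray(sims)
    lo, hi = np.quantile(sims, [low_q, high_q])
    pool = np.flatnonzero((sims >= lo) & (sims <= hi))  # mid-zone candidates
    rng = np.random.default_rng(seed)
    return rng.choice(pool, size=min(num, len(pool)), replace=False)
```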
https://arxiv.org/abs/2603.04037
Cross-view geo-localization (CVGL) aims to establish spatial correspondences between images captured from significantly different viewpoints and constitutes a fundamental technique for visual localization in GNSS-denied environments. Nevertheless, CVGL remains challenging due to severe geometric asymmetry, texture inconsistency across imaging domains, and the progressive degradation of discriminative local information. Existing methods predominantly rely on spatial domain feature alignment, which is inherently sensitive to large-scale viewpoint variations and local disturbances. To alleviate these limitations, this paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), which leverages complementary representations from the spatial and frequency domains. SFDE adopts a three-branch parallel architecture to model global semantic context, local geometric structure, and statistical stability in the frequency domain, respectively, thereby characterizing consistency across domains from the perspectives of scene topology, multiscale structural patterns, and frequency invariance. The resulting complementary features are jointly optimized in a unified embedding space via progressive enhancement and coupled constraints, enabling the learning of cross-view representations with consistency across multiple granularities. Comprehensive experiments show that SFDE achieves competitive performance and in many cases even surpasses state-of-the-art methods, while maintaining a lightweight and computationally efficient design. Our code is available at this https URL.
https://arxiv.org/abs/2603.02726
Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., ``series X, image Y''), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization -- predicting the axial depth referred to by a text snippet -- reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.
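Mining snippet-slice pairs from report text of the form "series X, image Y" could look like the following regex-based sketch; the pattern and the fixed snippet window are assumptions for illustration, not the paper's actual mining rules.

```python
import re

# Hypothetical pattern for report phrases like "series 3, image 45".
REF = re.compile(r"series\s+(\d+),\s*image\s+(\d+)", re.IGNORECASE)

def mine_snippet_slices(report, window=80):
    """Return (snippet, series, image) triples for each image reference,
    taking a fixed character window before the match as the text snippet."""
    out = []
    for m in REF.finditer(report):
        start = max(0, m.start() - window)
        snippet = report[start:m.end()].strip()
        out.append((snippet, int(m.group(1)), int(m.group(2))))
    return out
```

Each mined series/image index can then be mapped to an axial depth, giving the supervision signal for intra-scan snippet localization.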
https://arxiv.org/abs/2603.02026
We present MMCOMET, the first multimodal commonsense knowledge graph (MMKG) that integrates physical, social, and eventive knowledge. MMCOMET extends the ATOMIC2020 knowledge graph to include a visual dimension, through an efficient image retrieval process, resulting in over 900K multimodal triples. This new resource addresses a major limitation of existing MMKGs in supporting complex reasoning tasks like image captioning and storytelling. Through a standard visual storytelling experiment, we show that our holistic approach enables the generation of richer, coherent, and contextually grounded stories than those produced using text-only knowledge. This resource establishes a new foundation for multimodal commonsense reasoning and narrative generation.
https://arxiv.org/abs/2603.01055
Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly in visual RAG settings where visually similar but semantically incorrect evidence may be retrieved. To address this, we propose Multimodal Adaptive RAG (MMA-RAG), which dynamically assesses the model's confidence in its internal knowledge to decide whether to incorporate the retrieved external information into the generation process. Central to MMA-RAG is a decision classifier trained through a layer-wise analysis, which leverages joint internal visual and textual representations to guide the use of reverse image retrieval. Experiments demonstrate that the model achieves significant improvements in response performance on three VQA datasets, while ablation studies highlight the importance of internal representations in adaptive retrieval decisions. Overall, the experimental results demonstrate that MMA-RAG effectively balances external knowledge utilization and inference robustness in diverse multimodal scenarios.
https://arxiv.org/abs/2603.00511
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality-either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at this https URL.
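For intuition about the dual-path fusion step, the sketch below merges the T2I and I2I candidate lists with plain reciprocal-rank fusion; WISER's verifier-gated adaptive weighting and self-reflective refinement loop are not modeled here, so this is a simplified stand-in rather than the paper's method.

```python
def fuse_rankings(t2i_ranked, i2i_ranked, k=60):
    """Reciprocal-rank fusion of the text-path and image-path candidate
    lists. Candidates ranked highly by both paths rise to the top; k=60
    is the conventional RRF smoothing constant."""
    scores = {}
    for ranked in (t2i_ranked, i2i_ranked):
        for rank, cand in enumerate(ranked):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```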
https://arxiv.org/abs/2602.23029
Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limits their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a "next-scale prediction" strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications.
https://arxiv.org/abs/2602.22555
Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.
https://arxiv.org/abs/2602.22510
The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. The code and trained models are available at this https URL.
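A simplified sketch of rank-guided multi-spanning-tree pose graph construction: take the union of successive maximum spanning trees over predicted edge reliabilities. The GNN scoring and the connectivity-aware modulation are out of scope here, and `rounds` is an illustrative parameter.

```python
def multi_mst_edges(num_nodes, scored_edges, rounds=2):
    """Select candidate pose-graph edges as the union of successive
    maximum spanning trees over predicted edge reliabilities.
    scored_edges: list of (score, u, v), higher score = more reliable."""
    selected = set()
    for _ in range(rounds):
        parent = list(range(num_nodes))  # fresh union-find each round

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        # Kruskal over edges not yet selected, most reliable first.
        for score, u, v in sorted(scored_edges, reverse=True):
            if (u, v) in selected:
                continue
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                selected.add((u, v))
    return selected
```

Running several rounds yields a denser but still globally connected candidate set, which is the intuition behind preferring spanning structures over independent $k$-nearest-neighbor edges.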
https://arxiv.org/abs/2602.21963
Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility: it is compatible with any pretrained vision-language model (VLM) without modification; 2) transparency: enriched queries are explicitly interpretable by users; and 3) controllability: retrieval results can be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at this https URL.
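Conditioning query completion on discretized quality levels might be sketched as follows; the bin edges, level names, and prompt wording are hypothetical illustrations, not the paper's template.

```python
def quality_level(score, bins=(0.33, 0.66)):
    """Map a continuous relevance/aesthetic score in [0, 1] to a discrete
    quality level (0, 1, 2). The two thresholds are illustrative."""
    for level, edge in enumerate(bins):
        if score < edge:
            return level
    return len(bins)

def completion_prompt(query, level):
    """Format a quality-conditioned instruction for a query-completion LLM
    (hypothetical wording; the paper's prompt is not reproduced here)."""
    names = ["basic", "good", "excellent"]
    return (f"Expand the search query '{query}' into a detailed image "
            f"description of {names[level]} visual quality, mentioning "
            f"pose, scene, and aesthetics.")
```

The enriched prompt is then embedded by any off-the-shelf VLM text encoder, which is what makes the scheme model-agnostic.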
https://arxiv.org/abs/2602.21175
With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.
https://arxiv.org/abs/2602.19961
With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing image retrieval. However, automatically generated image-based questions tend to be simplistic and template-based, which hinders the real deployment of question answering or visual dialogue systems. To enrich and diversify the questions, we propose a knowledge-aware remote sensing visual question generation model, KRSVQG, that incorporates external knowledge related to the image content to improve the quality and contextual understanding of the generated questions. The model takes an image and a related knowledge triplet from external knowledge sources as inputs and leverages image captioning as an intermediary representation to enhance the image grounding of the generated questions. To assess the performance of KRSVQG, we utilized two datasets that we manually annotated: NWPU-300 and TextRS-300. Results on these two datasets demonstrate that KRSVQG outperforms existing methods and leads to knowledge-enriched questions, grounded in both image and domain knowledge.
https://arxiv.org/abs/2602.19224
With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing semantic image retrieval. However, current automatically generated questions tend to be simplistic and template-based, which hinders the deployment of question answering or visual dialogue systems for real-world applications. To enrich and diversify the questions with both image content and commonsense knowledge, we propose a Knowledge-aware Remote Sensing Visual Question Generation model (KRSVQG). The proposed model incorporates related knowledge triplets from external knowledge sources to broaden the question content, while employing image captioning as an intermediary representation to ground questions to the corresponding images. Moreover, KRSVQG utilizes a vision-language pre-training and fine-tuning strategy, enabling the model's adaptation to low data regimes. To evaluate the proposed KRSVQG model, we construct two knowledge-aware remote sensing visual question generation datasets: the NWPU-300 dataset and the TextRS-300 dataset. Evaluations, including metrics and human assessment, demonstrate that KRSVQG outperforms existing methods and leads to rich questions, grounded in both image and domain knowledge. As a key practice in vision-language research, knowledge-aware visual question generation advances the understanding of image content beyond pixels, facilitating the development of knowledge-enriched vision-language systems with vision-grounded human commonsense.
https://arxiv.org/abs/2602.19217
Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems, respectively. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre-retrieval predictor as reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at this https URL.
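A classic pre-retrieval predictor of the kind such benchmarks explore is the mean inverse document frequency of the query terms over the caption corpus; this is one standard QPP signal, not necessarily the predictor that performs best on VQPP.

```python
import math
from collections import Counter

def avg_idf_predictor(query_terms, corpus_docs):
    """Pre-retrieval QPP signal: mean IDF of query terms over a tokenized
    corpus. Higher values indicate a more specific query, which often
    correlates with better retrieval effectiveness."""
    N = len(corpus_docs)
    df = Counter()
    for doc in corpus_docs:
        for t in set(doc):
            df[t] += 1
    idfs = [math.log((N + 1) / (df[t] + 1)) for t in query_terms]
    return sum(idfs) / len(idfs)
```

Because it needs no retrieval run, a predictor like this can score queries before search, e.g. to decide which queries to reformulate, which is how the abstract's DPO reward-model application would consume it.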
https://arxiv.org/abs/2602.17814