Image captioning is an important problem in developing various AI systems, and training captioning models requires large volumes of annotated images. Since the existing labelled datasets have already been used to train large Vision-Language Models (VLMs), further improving their performance with supervised data is becoming difficult. This makes unsupervised image captioning, which remains relatively under-explored, an important direction. To that end, we propose LoGIC (Lewis Communication Game for Image Captioning), a multi-agent reinforcement learning game. The proposed method consists of two agents, a 'speaker' and a 'listener', whose objective is to learn a strategy for communicating in natural language. We train the agents in the cooperative common-reward setting using the GRPO algorithm and show that improvement in image captioning performance emerges as a consequence of the agents learning to play the game. Using a pre-trained VLM as the 'speaker' and a Large Language Model (LLM) for language understanding in the 'listener', fine-tuning with LoGIC and no additional labels achieves a BLEU score of 46, a 2-point absolute improvement over the 44 BLEU of the vanilla VLM. Additionally, we replace the VLM in the 'speaker' with lightweight components, (i) a ViT for image perception and (ii) GPT-2 for language generation, and train them from scratch using LoGIC, obtaining a BLEU score of 31 in the unsupervised setting, a 10-point advantage over existing unsupervised image-captioning methods.
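A minimal, self-contained sketch of the Lewis-game loop described above, with toy stand-ins (`ToySpeaker`, `ToyListener`) in place of the VLM speaker and LLM-based listener. The GRPO update and all model internals are omitted, and the binary reward is only an assumption about how the cooperative game could be scored.

```python
import torch

class ToySpeaker:
    """Stand-in for the VLM 'speaker': maps an image tensor to a caption string."""
    def generate(self, images):
        return [f"image with mean intensity {img.mean().item():.2f}" for img in images]

class ToyListener:
    """Stand-in for the LLM-based 'listener': scores how well a caption matches
    each candidate image (here, by comparing against the quoted intensity)."""
    def score(self, caption, lineup):
        target = float(caption.split()[-1])
        return -torch.abs(lineup.flatten(1).mean(dim=1) - target)

def lewis_game_round(speaker, listener, images, n_distractors=3):
    """One cooperative round: the speaker captions each target image, the listener
    must pick the target out of a line-up, and both share the same binary reward."""
    captions = speaker.generate(images)
    rewards = torch.zeros(len(images))
    for i in range(len(images)):
        perm = torch.randperm(len(images))
        distractors = perm[perm != i][:n_distractors]
        lineup = torch.cat([images[i:i + 1], images[distractors]], dim=0)
        scores = listener.score(captions[i], lineup)
        rewards[i] = float(scores.argmax().item() == 0)   # did the target win?
    return captions, rewards   # rewards feed a GRPO-style group-relative advantage

if __name__ == "__main__":
    images = torch.rand(8, 3, 32, 32)
    captions, rewards = lewis_game_round(ToySpeaker(), ToyListener(), images)
    print(rewards.mean())   # fraction of rounds the listener identified the target
```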
https://arxiv.org/abs/2507.08610
Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations: continuous, high-dimensional image features versus discrete, structured text. We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantic parsers. By generating rich Visual Semantic Descriptions (VSD), MLLMs provide semantic anchors that facilitate cross-modal alignment. Our approach combines: (1) instance-level alignment, which fuses visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) prototype-level alignment, which clusters VSD to ensure category-level consistency. These modules can be seamlessly integrated into existing ITM models. Extensive experiments on Flickr30K and MSCOCO demonstrate substantial performance improvements. The approach also exhibits remarkable zero-shot generalization to cross-domain tasks, including news and remote sensing ITM. The code and model checkpoints are available at this https URL.
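A rough sketch of the two alignment ideas, under assumed tensor shapes and hypothetical function names: fusing image features with embeddings of the MLLM-generated VSD (instance level) and clustering VSD embeddings into prototypes (prototype level). It is illustrative only and not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def instance_level_fusion(img_feats, vsd_feats, alpha=0.5):
    """Fuse visual features with embeddings of MLLM-generated Visual Semantic
    Descriptions (VSD) so that image representations move closer to text space."""
    img_feats = F.normalize(img_feats, dim=-1)
    vsd_feats = F.normalize(vsd_feats, dim=-1)
    return F.normalize(alpha * img_feats + (1 - alpha) * vsd_feats, dim=-1)

def prototype_level_alignment(fused_feats, vsd_feats, n_prototypes=32, iters=10):
    """Cluster VSD embeddings into prototypes (plain k-means) and pull each fused
    image feature toward its assigned prototype for category-level consistency."""
    protos = vsd_feats[torch.randperm(len(vsd_feats))[:n_prototypes]].clone()
    for _ in range(iters):
        assign = torch.cdist(vsd_feats, protos).argmin(dim=1)
        for k in range(n_prototypes):
            members = vsd_feats[assign == k]
            if len(members) > 0:
                protos[k] = members.mean(dim=0)
    protos = F.normalize(protos, dim=-1)
    # alignment loss: cosine distance between each image and its VSD's prototype
    sims = (fused_feats * protos[assign]).sum(dim=-1)
    return (1 - sims).mean()

if __name__ == "__main__":
    img = torch.randn(256, 512)
    vsd = torch.randn(256, 512)
    fused = instance_level_fusion(img, vsd)
    print(prototype_level_alignment(fused, F.normalize(vsd, dim=-1)))
```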
https://arxiv.org/abs/2507.08590
Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, and are used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that already supports a large variety of state-of-the-art foundation models, as well as local user-defined models, for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at this https URL.
https://arxiv.org/abs/2507.07860
Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: this https URL.
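A hedged PyTorch sketch of a ViLU-style uncertainty head: the visual embedding, the predicted textual embedding, and an image-conditioned textual summary (obtained by cross-attending the image over all task-relevant text embeddings) are concatenated and fed to a binary failure classifier trained with weighted BCE. Dimensions and module names are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class ViLUStyleUncertaintyHead(nn.Module):
    """Sketch of a ViLU-like failure predictor: combine the visual embedding,
    the predicted class/text embedding, and an image-conditioned text summary
    obtained via cross-attention, then classify correct vs. incorrect."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, img_emb, pred_txt_emb, all_txt_emb):
        # image-conditioned textual representation: the image queries all class texts
        ctx, _ = self.cross_attn(img_emb.unsqueeze(1), all_txt_emb, all_txt_emb)
        feats = torch.cat([img_emb, pred_txt_emb, ctx.squeeze(1)], dim=-1)
        return self.classifier(feats).squeeze(-1)  # logit of "prediction is correct"

def weighted_bce(logits, correct, pos_weight=2.0):
    """Weighted binary cross-entropy: up-weight one class to counter the imbalance
    between correct and incorrect predictions."""
    return nn.functional.binary_cross_entropy_with_logits(
        logits, correct.float(), pos_weight=torch.tensor(pos_weight))

if __name__ == "__main__":
    B, C, D = 16, 10, 512
    head = ViLUStyleUncertaintyHead(D)
    logits = head(torch.randn(B, D), torch.randn(B, D), torch.randn(B, C, D))
    loss = weighted_bce(logits, torch.randint(0, 2, (B,)))
    loss.backward()
    print(loss.item())
```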
https://arxiv.org/abs/2507.07620
Foundation models are pre-trained on large-scale datasets and subsequently fine-tuned on small-scale datasets using parameter-efficient fine-tuning (PEFT) techniques like low-rank adapters (LoRA). In most previous works, LoRA weight matrices are randomly initialized with a fixed rank across all attachment points. In this paper, we improve the convergence and final performance of LoRA fine-tuning using our proposed data-driven weight initialization method, ConsNoTrainLoRA (CNTLoRA). We express LoRA initialization as a domain-shift problem in which we use multiple constraints relating the pre-training and fine-tuning activations. By reformulating these constraints, we obtain a closed-form estimate of the LoRA weights that depends on the pre-training weights and fine-tuning activation vectors and hence requires no training during initialization. This weight estimate is decomposed to initialize the up and down matrices, with the proposed flexibility of variable ranks. With the proposed initialization method, we fine-tune on downstream tasks such as image generation, image classification and image understanding. Both quantitative and qualitative results demonstrate that CNTLoRA outperforms standard and data-driven weight initialization methods. Extensive analyses and ablations further elucidate the design choices of our framework, providing an optimal recipe for faster convergence and enhanced performance.
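Since the paper's closed-form constraint solve is not reproduced here, the sketch below only illustrates the decomposition step: an assumed data-driven estimate of the weight update is factored by truncated SVD into LoRA down/up matrices, with the rank chosen per attachment point from a singular-value energy threshold (one plausible way to realize variable ranks).

```python
import torch

def init_lora_from_estimate(delta_w, energy=0.9, max_rank=64):
    """Given a data-driven estimate of the weight update (out_dim x in_dim),
    pick a variable rank from the singular-value energy and return LoRA factors
    A (down: r x in_dim) and B (up: out_dim x r) with B @ A approximating delta_w."""
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    cum = torch.cumsum(S ** 2, dim=0) / (S ** 2).sum()
    r = min(int((cum < energy).sum().item()) + 1, max_rank)
    sqrt_s = torch.sqrt(S[:r])
    B = U[:, :r] * sqrt_s          # (out_dim, r)
    A = sqrt_s[:, None] * Vh[:r]   # (r, in_dim)
    return A, B, r

if __name__ == "__main__":
    # stand-in for a closed-form estimate built from pre-training weights and
    # fine-tuning activations (the paper's constraint solve is not reproduced here)
    delta_w = torch.randn(768, 768) @ torch.diag(torch.linspace(1, 0.01, 768))
    A, B, r = init_lora_from_estimate(delta_w)
    print(r, torch.linalg.norm(delta_w - B @ A) / torch.linalg.norm(delta_w))
```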
https://arxiv.org/abs/2507.08044
Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM and an adapter before the input of the diffusion decoder; the former produces visual tokens for the latter, which adapts them to the conditioning space of the diffusion decoder; and 2) an efficient multi-stage training scheme that facilitates a fast connection between MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing and understanding tasks.
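A schematic sketch of the two lightweight pieces described in point 1: a vision head that turns MLLM hidden states into continuous visual tokens, and an adapter that maps those tokens into an assumed diffusion-conditioning space. All dimensions and module choices are placeholders, not the paper's actual design.

```python
import torch
import torch.nn as nn

class VisionHead(nn.Module):
    """Lightweight head on top of the MLLM hidden states that emits a sequence
    of continuous visual tokens (the "visual clues")."""
    def __init__(self, llm_dim=2048, vis_dim=1024, n_tokens=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_tokens, llm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(llm_dim, 8, batch_first=True)
        self.proj = nn.Linear(llm_dim, vis_dim)

    def forward(self, hidden_states):                   # (B, T, llm_dim)
        q = self.queries.unsqueeze(0).expand(hidden_states.size(0), -1, -1)
        out, _ = self.attn(q, hidden_states, hidden_states)
        return self.proj(out)                           # (B, n_tokens, vis_dim)

class DiffusionAdapter(nn.Module):
    """Maps the visual tokens into the conditioning space of the video
    diffusion decoder (e.g. its cross-attention context)."""
    def __init__(self, vis_dim=1024, cond_dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vis_dim, cond_dim), nn.GELU(),
                                 nn.Linear(cond_dim, cond_dim))

    def forward(self, visual_tokens):
        return self.net(visual_tokens)                  # (B, n_tokens, cond_dim)

if __name__ == "__main__":
    hidden = torch.randn(2, 128, 2048)                  # MLLM hidden states
    cond = DiffusionAdapter()(VisionHead()(hidden))
    print(cond.shape)                                   # conditioning for the decoder
```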
https://arxiv.org/abs/2507.06119
The application of Vision-Language Models (VLMs) in remote sensing (RS) image understanding has achieved notable progress, demonstrating the basic ability to recognize and describe geographical entities. However, existing RS-VLMs are mostly limited to image-level and region-level tasks, lacking the capability to handle pixel-level tasks and performing poorly in small-object recognition scenarios. Moreover, RS-VLMs consume significant computational resources when processing high-resolution RS images, further restricting their practical applicability. In this context, we propose GeoMag (Geographical Magnifier), an end-to-end general-purpose large model framework for RS. GeoMag dynamically focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing across multiple levels of granularity. This method introduces Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC), which adaptively reduce the spatial resolution of task-irrelevant regions while enhancing the visual representation of task-relevant areas. This approach improves the model's perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery. Extensive comparative experiments on 10 benchmarks demonstrate that GeoMag not only excels in handling pixel-level tasks but also maintains competitive performance across tasks of other granularities compared to existing RS-VLMs.
https://arxiv.org/abs/2507.05887
The evolution of large models has witnessed the emergence of In-Context Learning (ICL) capabilities. In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of ICL. Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities. However, explorations of demonstration configuration for multimodal ICL remain preliminary. Additionally, the controllability of In-Context Examples (ICEs) provides an efficient and cost-effective means to observe and analyze the inference characteristics of LMMs under varying inputs. This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task. Externally, we explore demonstration configuration strategies along three dimensions: shot number, image retrieval, and caption assignment. We employ multiple metrics to systematically and thoroughly evaluate and summarize key findings. Internally, we analyze typical LMM attention characteristics and develop attention-based metrics to quantify model behaviors. We also conduct auxiliary experiments to explore the feasibility of attention-driven model acceleration and compression. We further compare performance variations between LMMs with identical model design and pretraining strategies and explain the differences from the angle of pre-training data features. Our study reveals how ICE configuration strategies impact model performance (through the external experiments) and characterizes typical attention patterns (through the internal inspection), providing dual perspectives for understanding multimodal ICL in LMMs. Our methodology of combining external and internal analysis to investigate large models, along with our newly proposed metrics, can be applied to broader research areas.
https://arxiv.org/abs/2507.08021
Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose INTER (Interaction Guidance Sampling), a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.
https://arxiv.org/abs/2507.05056
Multimodal sarcasm detection has attracted growing interest due to the rise of multimedia posts on social media. Understanding sarcastic image-text posts often requires external contextual knowledge, such as cultural references or commonsense reasoning. However, existing models struggle to capture the deeper rationale behind sarcasm, relying mainly on shallow cues like image captions or object-attribute pairs from images. To address this, we propose MiDRE (Mixture of Dual Reasoning Experts), which integrates an internal reasoning expert for detecting incongruities within the image-text pair and an external reasoning expert that utilizes structured rationales generated via Chain-of-Thought prompting to a Large Vision-Language Model. An adaptive gating mechanism dynamically weighs the two experts, selecting the most relevant reasoning path. Experiments on two benchmark datasets show that MiDRE achieves superior performance over baselines. Various qualitative analyses highlight the crucial role of external rationales, revealing that even when they are occasionally noisy, they provide valuable cues that guide the model toward a better understanding of sarcasm.
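A small sketch of the adaptive gating idea, assuming both experts are already encoded into fixed-size feature vectors (`internal_feat` from the image-text incongruity branch, `external_feat` from the encoded LVLM rationale). The gate below is a generic soft mixture, not MiDRE's exact design.

```python
import torch
import torch.nn as nn

class DualReasoningGate(nn.Module):
    """Sketch of a MiDRE-style adaptive gate: mix the internal expert (image-text
    incongruity features) with the external expert (encoded LVLM rationale)."""
    def __init__(self, dim=768, n_classes=2):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, internal_feat, external_feat):
        w = self.gate(torch.cat([internal_feat, external_feat], dim=-1))  # (B, 2)
        fused = w[:, :1] * internal_feat + w[:, 1:] * external_feat
        return self.classifier(fused)  # sarcastic vs. non-sarcastic logits

if __name__ == "__main__":
    model = DualReasoningGate()
    logits = model(torch.randn(4, 768), torch.randn(4, 768))
    print(logits.shape)
```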
https://arxiv.org/abs/2507.04458
Effective Out-of-Distribution (OOD) detection is critical for ensuring the reliability of semantic segmentation models, particularly in complex road environments where safety and accuracy are paramount. Despite recent advancements in large language models (LLMs), notably GPT-4, which significantly enhanced multimodal reasoning through Chain-of-Thought (CoT) prompting, the application of CoT-based visual reasoning for OOD semantic segmentation remains largely unexplored. In this paper, through extensive analyses of the road scene anomalies, we identify three challenging scenarios where current state-of-the-art OOD segmentation methods consistently struggle: (1) densely packed and overlapping objects, (2) distant scenes with small objects, and (3) large foreground-dominant objects. To address the presented challenges, we propose a novel CoT-based framework targeting OOD detection in road anomaly scenes. Our method leverages the extensive knowledge and reasoning capabilities of foundation models, such as GPT-4, to enhance OOD detection through improved image understanding and prompt-based reasoning aligned with observed problematic scene attributes. Extensive experiments show that our framework consistently outperforms state-of-the-art methods on both standard benchmarks and our newly defined challenging subset of the RoadAnomaly dataset, offering a robust and interpretable solution for OOD semantic segmentation in complex driving environments.
https://arxiv.org/abs/2507.03984
Most existing automatic speech recognition (ASR) research evaluates models using in-domain datasets and seldom examines how models generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora to determine their performance. These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. A comparison of word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains while showing marked accuracy degradation in mismatched scenarios. This study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures. Whereas fine-tuned Whisper Akan models produced more fluent but potentially misleading transcription errors, Wav2Vec2 produced more obvious yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.
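For reference, the two reported metrics can be computed from scratch as below; this is a standard Levenshtein-based WER/CER implementation, and the example strings are purely illustrative rather than actual corpus transcripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling 1-D DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: the same computation over characters."""
    ref, hyp = list(reference), list(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)

if __name__ == "__main__":
    ref = "the cat sat on the mat"   # illustrative strings, not real corpus data
    hyp = "the cat sat on a mat"
    print(f"WER={wer(ref, hyp):.2f}  CER={cer(ref, hyp):.2f}")
```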
https://arxiv.org/abs/2507.02407
Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language generation in V-MLLMs. Our approach introduces two key innovations: (1) a dual-objective semantic-visual loss function that simultaneously disrupts the model's text generation logits and visual representations to undermine cross-modal integration, and (2) a computationally efficient two-stage generator framework that combines large-scale pre-training for cross-model transferability with specialized fine-tuning for spatiotemporal coherence. Empirical evaluation on comprehensive video understanding benchmarks demonstrates that CAVALRY-V significantly outperforms existing attack methods, achieving 22.8% average improvement over the best baseline attacks on both commercial systems (GPT-4.1, Gemini 2.0) and open-source models (QwenVL-2.5, InternVL-2.5, Llava-Video, Aria, MiniCPM-o-2.6). Our framework achieves flexibility through implicit temporal coherence modeling rather than explicit regularization, enabling significant performance improvements even on image understanding (34.4% average gain). This capability demonstrates CAVALRY-V's potential as a foundational approach for adversarial research across multimodal systems.
https://arxiv.org/abs/2507.00817
Cross-cultural research in perception and cognition has shown that individuals from different cultural backgrounds process visual information in distinct ways. East Asians, for example, tend to adopt a holistic perspective, attending to contextual relationships, whereas Westerners often employ an analytical approach, focusing on individual objects and their attributes. In this study, we investigate whether Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English, exhibit similar culturally grounded attentional patterns. Using comparative analysis of image descriptions, we examine whether these models reflect differences in holistic versus analytic tendencies. Our findings suggest that VLMs not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs.
https://arxiv.org/abs/2507.00700
This paper studies the role of attention heads in CLIP's image encoder. While CLIP has exhibited robust performance across diverse applications, we hypothesize that certain attention heads negatively affect final representations and that ablating them can improve performance in downstream tasks. To capitalize on this insight, we propose a simple yet effective method, called Attention Ablation Technique (AAT), to suppress the contribution of specific heads by manipulating attention weights. By integrating two alternative strategies tailored for different application scenarios, AAT systematically identifies and ablates detrimental attention heads to enhance representation quality. Experiments demonstrate that AAT consistently improves downstream task performance across various domains, boosting recall rate by up to 11.1% on CLIP-family models for cross-modal retrieval. The results highlight the potential of AAT to effectively refine large-scale vision-language models with virtually no increase in inference cost.
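A toy sketch of head ablation by manipulating attention weights: selected heads have their attention maps overwritten with a uniform distribution, so they contribute only a mean-pooled value. This is one plausible ablation strategy; the paper's two strategies and its CLIP-specific integration are not reproduced here.

```python
import torch
import torch.nn.functional as F

def attention_with_head_ablation(q, k, v, n_heads, ablate=()):
    """Multi-head self-attention where the attention maps of selected heads are
    overwritten (here: forced to a uniform distribution over tokens) so those
    heads contribute only an average value instead of their learned pattern."""
    B, N, D = q.shape
    hd = D // n_heads
    def split(x):  # (B, N, D) -> (B, H, N, hd)
        return x.view(B, N, n_heads, hd).transpose(1, 2)
    qh, kh, vh = split(q), split(k), split(v)
    attn = F.softmax(qh @ kh.transpose(-2, -1) / hd ** 0.5, dim=-1)  # (B, H, N, N)
    for h in ablate:
        attn[:, h] = 1.0 / N   # uniform attention: the head now just averages values
    out = (attn @ vh).transpose(1, 2).reshape(B, N, D)
    return out  # output projection omitted for brevity

if __name__ == "__main__":
    B, N, D, H = 2, 50, 768, 12
    x = torch.randn(B, N, D)
    full = attention_with_head_ablation(x, x, x, H)
    ablated = attention_with_head_ablation(x, x, x, H, ablate=(3, 7))
    print((full - ablated).abs().max())  # only heads 3 and 7 change the output
```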
https://arxiv.org/abs/2507.00537
Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at this https URL.
https://arxiv.org/abs/2506.24016
Implicit Neural Representation (INR) has demonstrated remarkable advances in the field of image representation but demands substantial GPU resources. GaussianImage recently pioneered the use of Gaussian Splatting to mitigate this cost; however, its slow training process limits its practicality, and the fixed number of Gaussians per image limits its adaptability to varying information entropy. To address these issues, we propose in this paper a generalizable and self-adaptive image representation framework based on 2D Gaussian Splatting. Our method employs a network to quickly generate a coarse Gaussian representation, followed by minimal fine-tuning steps, achieving rendering quality comparable to that of GaussianImage while significantly reducing training time. Moreover, our approach dynamically adjusts the number of Gaussian points based on image complexity to further enhance flexibility and efficiency in practice. Experiments on the DIV2K and Kodak datasets show that our method matches or exceeds GaussianImage's rendering performance with far fewer iterations and shorter training times. Specifically, our method reduces the training time by up to one order of magnitude while achieving superior rendering performance with the same number of Gaussians.
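A simplified, assumption-laden sketch of additive 2D Gaussian splatting onto a pixel grid, just to make the image-as-Gaussians representation concrete; the network predictor, the adaptive point count, and any opacity compositing order are all omitted.

```python
import torch

def render_gaussians(means, covs, colors, opacities, H=64, W=64):
    """Render N anisotropic 2D Gaussians onto an H x W RGB canvas by additive
    splatting: each pixel accumulates color weighted by the Gaussian density at
    that location (a simplified form of 2D Gaussian Splatting)."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W),
                            indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)             # (HW, 2)
    diff = pix[None] - means[:, None]                              # (N, HW, 2)
    inv = torch.linalg.inv(covs)                                   # (N, 2, 2)
    maha = torch.einsum("npi,nij,npj->np", diff, inv, diff)        # (N, HW)
    weights = opacities[:, None] * torch.exp(-0.5 * maha)          # (N, HW)
    img = (weights[:, :, None] * colors[:, None]).sum(0)           # (HW, 3)
    return img.clamp(0, 1).reshape(H, W, 3)

if __name__ == "__main__":
    N = 16
    means = torch.rand(N, 2)
    scales = torch.rand(N, 2) * 0.05 + 0.01
    covs = torch.diag_embed(scales ** 2)          # axis-aligned for simplicity
    img = render_gaussians(means, covs, torch.rand(N, 3), torch.rand(N))
    print(img.shape, img.mean())
```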
https://arxiv.org/abs/2506.23479
Vision-Language Models (vLLMs) have emerged as powerful architectures for joint reasoning over visual and textual inputs, enabling breakthroughs in image captioning, cross modal retrieval, and multimodal dialogue. However, as these models scale to longer video sequences and richer language descriptions, the quadratic complexity of the standard attention mechanism presents a fundamental computational bottleneck. This challenge is exacerbated in vLLMs, where attention must be computed not only within modalities but also across them, leading to prohibitive memory and latency costs. In this work, we introduce the Compressed Sensing Attention Transformer (CSAT), a novel architecture that reimagines attention computation through the lens of compressed sensing. By projecting high dimensional key and value representations into a lower-dimensional subspace via random measurement matrices and reconstructing the attention outputs using sparse recovery algorithms, CSAT significantly reduces attention complexity while maintaining semantic fidelity. Applied to vLLMs, CSAT exploits the inherent compressibility of both visual and textual representations especially evident in video, where temporal redundancy is high, and in language, where cross-modal grounding is often sparse. In contrast to LLMs, which must often model entangled symbolic dependencies, vLLMs benefit from structured sparsity in alignment and scene composition, making them particularly well-suited to compressed attention. We provide a formal mathematical treatment of CSAT, demonstrate its integration into vision language pipelines, and validate its performance on standard benchmarks, highlighting its promise as a scalable, interpretable, and resource efficient solution for next generation multimodal transformers.
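A toy sketch of the compression step only: keys and values are mixed down along the sequence dimension with a random Gaussian measurement matrix before attention, which is where the complexity saving comes from. The sparse-recovery reconstruction that CSAT uses to restore fidelity is deliberately omitted, so this reduces to a low-rank, Linformer-like approximation rather than the paper's full method.

```python
import torch
import torch.nn.functional as F

def compressed_attention(q, k, v, m=64, seed=0):
    """Attention with compressed-sensing-style measurement of keys and values:
    a random Gaussian matrix Phi (m x N, m << N) mixes the N tokens down to m
    pseudo-tokens, so the score matrix is (N x m) instead of (N x N)."""
    B, N, D = k.shape
    g = torch.Generator().manual_seed(seed)
    phi = torch.randn(m, N, generator=g) / m ** 0.5     # random measurement matrix
    k_c = torch.einsum("mn,bnd->bmd", phi, k)           # compressed keys   (B, m, D)
    v_c = torch.einsum("mn,bnd->bmd", phi, v)           # compressed values (B, m, D)
    attn = F.softmax(q @ k_c.transpose(-2, -1) / D ** 0.5, dim=-1)  # (B, N, m)
    return attn @ v_c                                   # (B, N, D)

if __name__ == "__main__":
    B, N, D = 2, 1024, 256
    q = torch.randn(B, N, D)
    out = compressed_attention(q, torch.randn(B, N, D), torch.randn(B, N, D))
    print(out.shape)   # (2, 1024, 256) at O(N*m) score cost instead of O(N^2)
```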
https://arxiv.org/abs/2507.02957
Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba's selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost.
https://arxiv.org/abs/2506.23283
Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.
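The heterogeneous contrastive fine-tuning stage presumably builds on a standard symmetric InfoNCE objective over paired embeddings; a minimal sketch under that assumption is below (in-batch negatives, temperature-scaled cosine similarities). The joint-reconstruction pre-training objective is not shown.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE over paired multimodal embeddings: row i of emb_a and
    row i of emb_b form a positive pair (e.g. an interleaved image-text document
    and its query); all other rows in the batch act as negatives."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(len(a), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    query_emb = torch.randn(32, 768, requires_grad=True)  # bidirectional encoder outputs
    doc_emb = torch.randn(32, 768, requires_grad=True)
    loss = symmetric_infonce(query_emb, doc_emb)
    loss.backward()
    print(loss.item())
```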
https://arxiv.org/abs/2506.23115