Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, which are responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark RSHalluEval (2,023 QA pairs) and enable dual-mode checking, supporting high-precision cloud auditing and low-cost reproducible local checking via a compact checker fine-tuned on the RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset RSHalluShield (30k QA pairs) for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.
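The decoding-time logit correction mentioned above can be sketched as a generic contrastive rule in the spirit of visual contrastive decoding: a token that stays probable even when the image is withheld is likely driven by the language prior rather than the RS image, and gets down-weighted. The abstract does not specify RSHallu's exact correction, so the update rule and the strength knob `alpha` below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def corrected_logits(logits_with_image, logits_without_image, alpha=1.0):
    """Generic contrastive logit correction: penalize tokens whose score
    survives even when the image is withheld (a hallucination signal).
    Illustrative only -- the paper's exact rule is not given in the
    abstract; `alpha` is a hypothetical correction strength."""
    return (1 + alpha) * logits_with_image - alpha * logits_without_image

# Toy vocabulary of 4 tokens: the 3rd token is favored purely by the
# language prior (same score with or without the image).
with_img = np.array([2.0, 0.5, 3.0, 1.0])
without_img = np.array([0.1, 0.2, 3.0, 0.3])
print(corrected_logits(with_img, without_img, alpha=1.0))
```

After correction the image-grounded first token overtakes the prior-driven third one, which is the intended effect of this family of methods.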
https://arxiv.org/abs/2602.10799
Cyclists often encounter safety-critical situations in urban traffic, highlighting the need for assistive systems that support safe and informed decision-making. Recently, vision-language models (VLMs) have demonstrated strong performance on autonomous driving benchmarks, suggesting their potential for general traffic understanding and navigation-related reasoning. However, existing evaluations are predominantly vehicle-centric and fail to assess perception and reasoning from a cyclist-centric viewpoint. To address this gap, we introduce CyclingVQA, a diagnostic benchmark designed to probe perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist's perspective. Evaluating 31+ recent VLMs spanning general-purpose, spatially enhanced, and autonomous-driving-specialized models, we find that current models demonstrate encouraging capabilities, while also revealing clear areas for improvement in cyclist-centric perception and reasoning, particularly in interpreting cyclist-specific traffic cues and associating signs with the correct navigational lanes. Notably, several driving-specialized models underperform strong generalist VLMs, indicating limited transfer from vehicle-centric training to cyclist-assistive scenarios. Finally, through systematic error analysis, we identify recurring failure modes to guide the development of more effective cyclist-assistive intelligent systems.
https://arxiv.org/abs/2602.10771
Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from $2.7$ million video clips, explicitly designed for dynamic visual question answering. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of $1,078$ samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified model that synergistically leverages pre-trained video generation and image comprehension capabilities to produce temporally coherent visual reasoning cues, iteratively generating future action frames and textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, which validates its effectiveness for visual question answering in dynamic scenarios. Our code and data are available at this https URL.
https://arxiv.org/abs/2602.10675
Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, significantly improves performance by an average of 82.6% on the image implication benchmarks. Compared with 20+ mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) results on Multiple-Choice and Open-Style Questions, and significantly outperforms the top closed-source model Gemini-3.0-pro on True-False Questions. Crucially, our experiments reveal that learning image implication tasks improves general understanding ability, especially complex visual reasoning. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. We open-source all model weights, datasets, and method code at this https URL.
https://arxiv.org/abs/2602.10575
A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training, existing Vision-Language-Action models (VLAs) remain tightly coupled to their training embodiments and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple recipe that represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP requires no learned tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across multiple novel robots and manipulation tasks, LAP-3B attains over 50% average zero-shot success, delivering roughly a 2x improvement over the strongest prior VLAs. We further show that LAP enables efficient adaptation and favorable scaling, while unifying action prediction and VQA in a shared language-action format that yields additional gains through co-training.
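LAP's central idea, representing low-level robot actions directly in natural language, can be illustrated with a toy round-trip between numeric end-effector deltas and a templated sentence. The phrasing, field names, and precision below are hypothetical; the abstract does not give LAP's actual action template.

```python
import re

def action_to_text(dx, dy, dz, gripper):
    """Render a low-level end-effector action as plain language.
    The exact phrasing LAP uses is not stated in the abstract; this
    template is a hypothetical illustration of the idea."""
    state = "close" if gripper else "open"
    return f"move x {dx:+.2f} y {dy:+.2f} z {dz:+.2f} gripper {state}"

def text_to_action(text):
    """Parse the templated string back into numeric action components."""
    m = re.match(r"move x (\S+) y (\S+) z (\S+) gripper (\w+)", text)
    dx, dy, dz = (float(m.group(i)) for i in (1, 2, 3))
    return dx, dy, dz, m.group(4) == "close"

s = action_to_text(0.05, -0.02, 0.0, True)
print(s)                 # move x +0.05 y -0.02 z +0.00 gripper close
print(text_to_action(s))
```

Because both sides of the mapping are ordinary text, action supervision lands in the same input-output distribution the vision-language model was pre-trained on, which is the point the abstract makes about needing no learned tokenizer.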
https://arxiv.org/abs/2602.10556
Recent advances in large language models (LLMs) have enabled the development of multimodal medical AI. While models such as MedGemini achieve high accuracy on VQA tasks like USMLE MM, their performance on ECG-based tasks remains limited, and some models, such as MedGemma, do not support ECG data at all. Interpreting ECGs is inherently challenging, and diagnostic accuracy can vary with the interpreter's experience. Although echocardiography provides rich diagnostic information, it requires specialized equipment and personnel, limiting its availability. In this study, we focus on constructing a robust ECG encoder for multimodal pretraining using real-world hospital data. We employ SigLIP, a CLIP-based model with a sigmoid-based loss function enabling multi-label prediction, and introduce a modified loss function tailored to the multi-label nature of ECG data. Experiments demonstrate that incorporating medical knowledge in the language model and applying the modified loss significantly improve multi-label ECG classification. To further enhance performance, we increase the embedding dimensionality and apply random cropping to mitigate data drift. Finally, per-label analysis reveals which ECG findings are easier or harder to predict. Our study provides a foundational framework for developing medical models that utilize ECG data.
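The sigmoid-based loss that makes multi-label prediction possible can be sketched as binary cross-entropy with logits: each (sample, label) pair is scored as an independent binary problem, so several ECG findings can be positive at once. The paper's modified loss is not specified in the abstract; the function below is only the common starting point such a modification would build on.

```python
import numpy as np

def sigmoid_multilabel_loss(logits, targets):
    """Sigmoid (SigLIP-style) loss for multi-label targets: every
    (sample, label) pair is an independent binary classification, unlike
    softmax losses that force a single positive label. This is the
    standard starting point; the paper's modification is not detailed
    in the abstract."""
    # log(1 + exp(-z)) with z = +logit for positives, -logit for negatives
    z = np.where(targets == 1, logits, -logits)
    return np.mean(np.log1p(np.exp(-z)))

logits = np.array([[4.0, -4.0], [-4.0, 4.0]])
targets = np.array([[1, 0], [0, 1]])   # confident, correct -> small loss
print(sigmoid_multilabel_loss(logits, targets))
```

Flipping the target matrix turns the same logits into confidently wrong predictions and the loss grows accordingly, which is the behavior a multi-label ECG classifier needs per finding.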
https://arxiv.org/abs/2602.10553
Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long-term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio-temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian-based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio-temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C^2RoPE's effectiveness. The code is available at this https URL.
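The Chebyshev Causal Masking idea can be illustrated by building an attention mask from L-infinity distances on the 2D grid of image tokens, so that spatial neighborhood rather than 1D sequence order decides which tokens may interact. How C^2RoPE actually converts the distance into causal dependencies is a paper-level detail; the simple radius threshold below is an assumption for illustration only.

```python
import numpy as np

def chebyshev_mask(h, w, radius):
    """Attention mask over an h x w grid of image tokens: token i may
    attend to token j when their Chebyshev (L-infinity) distance is at
    most `radius`. The radius-threshold rule is an illustrative
    assumption; the paper's masking criterion may differ."""
    ys, xs = np.divmod(np.arange(h * w), w)       # 2D coords per token
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    return np.maximum(dy, dx) <= radius           # boolean (h*w, h*w) mask

m = chebyshev_mask(3, 3, 1)
print(m[4])   # center token of a 3x3 grid reaches all 9 positions
```

Note how the corner token reaches only its 2x2 neighborhood while the center reaches the full grid, which is exactly the spatial locality that a flat 1D index destroys.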
https://arxiv.org/abs/2602.10551
Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, where high-level plans (e.g., subtasks, traces) are first generated and subsequently translated into low-level actions, but they critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource including data, benchmarks, and models of intermediate representations for manipulation. It comprises RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes, which provides dense per-frame annotations over more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building upon this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework, supporting modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. In total, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning via fine-grained and diverse intermediate representations.
https://arxiv.org/abs/2602.09973
Mobile robots are often deployed over long durations in diverse open, dynamic scenes, including indoor settings such as warehouses and manufacturing facilities, and outdoor settings such as agricultural and roadway operations. A core challenge is to build a scalable long-horizon memory that supports an agentic workflow for planning, retrieval, and reasoning over open-ended instructions at variable granularity, while producing precise, actionable answers for navigation. We present STaR, an agentic reasoning framework that (i) constructs a task-agnostic, multimodal long-term memory that generalizes to unseen queries while preserving fine-grained environmental semantics (object attributes, spatial relations, and dynamic events), and (ii) introduces a Scalable Task-Conditioned Retrieval algorithm based on the Information Bottleneck principle to extract from long-term memory a compact, non-redundant, information-rich set of candidate memories for contextual reasoning. We evaluate STaR on NaVQA (mixed indoor/outdoor campus scenes) and WH-VQA, a customized warehouse benchmark with many visually similar objects built with Isaac Sim, emphasizing contextual reasoning. Across the two datasets, STaR consistently outperforms strong baselines, achieving higher success rates and markedly lower spatial error. We further deploy STaR on a real Husky wheeled robot in both indoor and outdoor environments, demonstrating robust long-horizon reasoning, scalability, and practical utility.
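The task-conditioned retrieval step can be caricatured as greedy relevance-versus-redundancy selection over memory embeddings, which captures the Information Bottleneck flavor (keep what is informative for the task, drop what is redundant) without reproducing the paper's actual objective. The MMR-style scoring rule and the trade-off weight `beta` below are illustrative assumptions.

```python
import numpy as np

def select_memories(query, memory, k, beta=0.5):
    """Greedy relevance-vs-redundancy selection over (approximately)
    unit-norm embedding rows: each step picks the memory most similar
    to the query, discounted by its similarity to already-chosen
    memories. A sketch only; STaR's actual IB-based algorithm is more
    principled, and `beta` is a hypothetical trade-off knob."""
    relevance = memory @ query
    chosen = []
    for _ in range(k):
        redundancy = np.zeros(len(memory)) if not chosen else \
            np.max(memory @ memory[chosen].T, axis=1)
        score = relevance - beta * redundancy
        score[chosen] = -np.inf            # never re-pick a memory
        chosen.append(int(np.argmax(score)))
    return chosen

q = np.array([0.8, 0.6])
mem = np.array([[1.0, 0.0], [0.995, 0.1], [0.0, 1.0]])
print(select_memories(q, mem, 2))  # best match, then a complementary one
```

Here the near-duplicate of the first pick is skipped in favor of a less relevant but complementary memory, yielding the compact, non-redundant candidate set the abstract describes.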
https://arxiv.org/abs/2602.09255
Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to relieve audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
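Of the lightweight objectives listed, the caption-to-modality InfoNCE term is the most standard and can be written out directly: within a batch, the i-th caption embedding should score highest against the i-th modality embedding. The sketch assumes L2-normalized rows and a temperature `tau`; the visual-audio consistency and entropy terms are omitted, and the paper's exact formulation may differ.

```python
import numpy as np

def info_nce(caption_emb, modality_emb, tau=0.1):
    """Caption-to-modality InfoNCE: treat each caption's matching
    modality embedding as the positive and all other batch entries as
    negatives. Rows are assumed L2-normalized; `tau` is the usual
    temperature. A generic sketch of one of the listed objectives."""
    sim = caption_emb @ modality_emb.T / tau         # (B, B) similarities
    sim = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # -log p(match_i | caption_i)

print(info_nce(np.eye(3), np.eye(3)))                # aligned batch -> near 0
print(info_nce(np.eye(3), np.roll(np.eye(3), 1, axis=0)))  # mismatched -> large
```

The loss is near zero when captions and modality embeddings are aligned and grows sharply under a batch-wise shuffle, which is what drives the embeddings toward cross-modal agreement.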
https://arxiv.org/abs/2602.08309
Spatial intelligence is crucial for vision-language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning. Project page: this https URL.
https://arxiv.org/abs/2602.07864
Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
https://arxiv.org/abs/2602.07680
Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated regarding these goals. Instead, most prior work focuses on evaluating OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) They provide limited insights on the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.
https://arxiv.org/abs/2602.07532
Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline and performs within 1.9% of the SOTA Fleming-VL. For text-based QA, it attains +6.9% over the baseline and +14.5% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology-microscopy confirm MedMO's broad cross-modality generalization. We release two versions of MedMO: 4B and 8B. The project is available at this https URL.
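The box-level GIoU reward rests on the standard Generalized IoU, which extends IoU with a penalty based on the smallest enclosing box so that even non-overlapping predictions receive a graded signal. The function below is the standard GIoU definition; how MedMO scales it into an RL reward is not detailed in the abstract.

```python
def giou(box_a, box_b):
    """Generalized IoU for (x1, y1, x2, y2) boxes: IoU minus the fraction
    of the smallest enclosing box not covered by the union. Unlike plain
    IoU, this stays informative (negative but graded) for disjoint
    boxes, which is what makes it usable as a localization reward."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # smallest axis-aligned box enclosing both inputs
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    enclose = cw * ch
    return inter / union - (enclose - union) / enclose

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes -> 1.0
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))   # disjoint boxes -> negative
```

A reward built on this quantity pushes predicted disease boxes toward the annotation even from a completely wrong initial location, where plain IoU would be flat at zero.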
https://arxiv.org/abs/2602.06965
Despite recent successes, test-time scaling, i.e., dynamically expanding the token budget during inference as needed, remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200$\times$ lower token budget.
https://arxiv.org/abs/2602.06566
Vision-language models (VLMs) often generate massive numbers of visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source code can be found at this https URL
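The three-stage pipeline can be caricatured over toy token embeddings: focus keeps tokens scoring high on visual importance times instruction relevance, scan adds tokens least similar to the focused evidence, and refine folds leftover tokens into their nearest scan anchor by similarity-weighted averaging. The concrete scoring and merging formulas below are simplified assumptions, not the paper's.

```python
import numpy as np

def focus_scan_refine(tokens, vis_score, rel_score, n_focus, n_scan):
    """Toy FSR pass over unit-norm token embeddings. Focus, scan, and
    refine follow the abstract's outline, but the product scoring and
    the averaging-based merge are simplified stand-ins for the paper's
    actual formulas."""
    # focus: visual importance combined with instruction relevance
    focus = list(np.argsort(-(vis_score * rel_score))[:n_focus])
    # scan: greedily take tokens most different from the focused evidence
    sim_to_focus = np.max(tokens @ tokens[focus].T, axis=1)
    sim_to_focus[focus] = np.inf
    scan = list(np.argsort(sim_to_focus)[:n_scan])
    # refine: fold each unselected token into its nearest scan anchor
    kept = set(focus) | set(scan)
    merged = tokens.copy()
    for t in range(len(tokens)):
        if t in kept:
            continue
        a = scan[int(np.argmax(tokens[t] @ tokens[scan].T))]
        w = max(0.0, float(tokens[t] @ merged[a]))
        merged[a] = (merged[a] + w * tokens[t]) / (1 + w)
    return focus, scan, merged[sorted(kept)]

tokens = np.array([[1.0, 0.0], [0.9, 0.436], [0.0, 1.0],
                   [-1.0, 0.0], [0.8, -0.6], [0.707, 0.707]])
vis = np.array([0.9, 0.8, 0.1, 0.2, 0.5, 0.7])
rel = np.array([0.9, 0.2, 0.1, 0.1, 0.1, 0.8])
focus, scan, kept_tokens = focus_scan_refine(tokens, vis, rel, 2, 2)
print(focus, scan)   # key-evidence tokens, then the most complementary ones
```

Note that the salient token 1 (high visual score, low relevance) is neither focused nor scanned: it is close to the evidence already kept, so it gets merged away, illustrating how the pipeline avoids the salient-but-irrelevant bias.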
https://arxiv.org/abs/2602.05809
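The three FSR stages map naturally onto token scoring, dissimilarity-based selection, and score-weighted merging. A minimal sketch, assuming cosine similarity for both the scan and refine steps and a simple product for combining importance with relevance (the paper's exact scoring is not reproduced here):

```python
import numpy as np

# Hedged sketch of a Focus-Scan-Refine-style pruning step. tokens: (N, D)
# visual features; importance and relevance: per-token scores in [0, 1].

def focus_scan_refine(tokens, importance, relevance, n_focus=2, n_scan=1):
    # Focus: combine visual importance with instruction relevance so that
    # salient but query-irrelevant tokens are not over-selected.
    focus_score = importance * relevance
    focus_idx = np.argsort(-focus_score)[:n_focus]

    # Scan: among remaining tokens, pick those most dissimilar (lowest
    # max cosine similarity) to the focused evidence.
    rest = np.setdiff1d(np.arange(len(tokens)), focus_idx)
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim_to_focus = normed[rest] @ normed[focus_idx].T    # (|rest|, n_focus)
    scan_idx = rest[np.argsort(sim_to_focus.max(axis=1))[:n_scan]]

    # Refine: assign each pruned token to its nearest scan anchor and merge
    # it in with a score-weighted average (token budget unchanged).
    pruned = np.setdiff1d(rest, scan_idx)
    anchors = tokens[scan_idx].astype(float).copy()
    for a, idx in enumerate(scan_idx):
        assigned = [p for p in pruned
                    if np.argmax(normed[p] @ normed[scan_idx].T) == a]
        if assigned:
            w = np.concatenate(([1.0], focus_score[assigned]))
            stack = np.vstack([tokens[idx], tokens[assigned]])
            anchors[a] = (w[:, None] * stack).sum(axis=0) / w.sum()
    return np.vstack([tokens[focus_idx], anchors])

# usage: 4 toy 2-d tokens; two query-relevant, two visually distinct
tokens = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
kept = focus_scan_refine(tokens,
                         importance=np.array([1.0, 0.8, 0.5, 0.4]),
                         relevance=np.array([0.9, 0.9, 0.2, 0.1]))
```

The output keeps `n_focus + n_scan` tokens: the focused evidence unchanged, and each scan anchor enriched by the tokens it absorbed.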
We present a visual-context image retrieval-augmented generation (ImageRAG) assisted AI agent for automatic target recognition (ATR) in synthetic aperture radar (SAR). SAR is a remote sensing method used in defense and security applications to detect and monitor the positions of military vehicles, which may appear indistinguishable in images. Researchers have extensively studied SAR ATR to improve the differentiation and identification of vehicle types, characteristics, and measurements. Recognition can be improved by comparing test examples with known vehicle target types, and recent methods build on neural networks, transformer attention, and multimodal large language models. An agentic AI method can further be equipped with a defined set of tools, such as searching through a library of similar examples. Our proposed method, SAR Retrieval-Augmented Generation (SAR-RAG), combines a multimodal large language model (MLLM) with a vector database of semantic embeddings to support contextual search for image exemplars with known qualities. By recovering past image examples with known true target types, our SAR-RAG system can compare similar vehicle categories, achieving improved ATR prediction accuracy. We evaluate this through search and retrieval metrics, categorical classification accuracy, and numeric regression of vehicle dimensions. All of these metrics improve when SAR-RAG is attached to an MLLM baseline as an ATR memory bank.
https://arxiv.org/abs/2602.04712
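The retrieval side of such a memory bank reduces to nearest-neighbor search over exemplar embeddings. A hedged sketch, with a made-up `ExemplarBank` class and toy 3-d embeddings standing in for a real vector database and SAR embedding model:

```python
import numpy as np

# Illustrative sketch of a SAR-RAG-style exemplar memory bank: SAR chips
# with known target types are stored as semantic embeddings; at inference
# time the k most similar exemplars are recovered for the MLLM prompt.
# (Class name, embedding size, and prompt format are assumptions.)

class ExemplarBank:
    def __init__(self):
        self.vecs, self.labels = [], []

    def add(self, embedding, target_type):
        self.vecs.append(np.asarray(embedding, dtype=float))
        self.labels.append(target_type)

    def retrieve(self, query, k=2):
        """Return the k exemplars with highest cosine similarity."""
        mat = np.vstack(self.vecs)
        mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        sims = mat @ q
        order = np.argsort(-sims)[:k]
        return [(self.labels[i], float(sims[i])) for i in order]

bank = ExemplarBank()
bank.add([1.0, 0.0, 0.1], "T-72 tank")
bank.add([0.9, 0.1, 0.0], "T-72 tank")
bank.add([0.0, 1.0, 0.2], "BMP-2 IFV")
hits = bank.retrieve([0.95, 0.05, 0.05], k=2)
# The retrieved labels would then be injected into the MLLM prompt, e.g.
# "Similar known examples: T-72 tank, T-72 tank. Classify the query chip."
```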
Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines consisting of object detection networks followed by separate Optical Character Recognition (OCR) modules, introducing compounding errors, increased latency, and architectural complexity. This research presents Neural Sentinel, a novel unified approach that leverages Vision Language Models (VLMs) to perform license plate recognition, state classification, and vehicle attribute extraction through a single forward pass. Our primary contribution lies in demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, a 14.1% improvement over EasyOCR and a 9.9% improvement over PaddleOCR baselines. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152 ms with an Expected Calibration Error (ECE) of 0.048, indicating well-calibrated confidence estimates. Additionally, the VLM-first architecture enables zero-shot generalization to auxiliary tasks including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%) without task-specific training. Through extensive experimentation on real-world toll plaza imagery, we demonstrate that unified vision-language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.
https://arxiv.org/abs/2602.07051
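The 70:30 experience-replay mixing described above can be illustrated with a small batch-construction sketch. Only the ratio comes from the abstract; the sampling details and function name are assumptions:

```python
import random

# Sketch of replay-based batch mixing for HITL continual learning: each
# fine-tuning batch keeps roughly a 70:30 ratio of original training data
# to user-correction samples, so new corrections are learned without
# catastrophically forgetting the base distribution.

def build_replay_batch(original_pool, correction_pool, batch_size=10,
                       original_ratio=0.7, seed=0):
    rng = random.Random(seed)
    n_orig = round(batch_size * original_ratio)   # 7 of 10 at 70:30
    n_corr = batch_size - n_orig                  # 3 of 10
    batch = (rng.sample(original_pool, n_orig)
             + rng.sample(correction_pool, n_corr))
    rng.shuffle(batch)   # interleave so gradients mix both sources
    return batch

# usage with placeholder sample IDs
orig = [f"orig_{i}" for i in range(100)]
corr = [f"corr_{i}" for i in range(20)]
batch = build_replay_batch(orig, corr, batch_size=10)
```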
Federated learning (FL) enables collaborative model training across decentralized medical institutions while preserving data privacy. However, medical FL benchmarks remain scarce, with existing efforts focusing mainly on unimodal or bimodal data and a limited range of medical tasks. This gap underscores the need for standardized evaluation to advance systematic understanding of medical multimodal FL (MMFL). To this end, we introduce Med-MMFL, the first comprehensive MMFL benchmark for the medical domain, encompassing diverse modalities, tasks, and federation scenarios. Our benchmark evaluates six representative state-of-the-art FL algorithms, covering different aggregation strategies, loss formulations, and regularization techniques. It spans datasets with 2 to 4 modalities, comprising a total of 10 unique medical modalities, including text, pathology images, ECG, X-ray, radiology reports, and multiple MRI sequences. Experiments are conducted across naturally federated, synthetic IID, and synthetic non-IID settings to simulate real-world heterogeneity. We assess segmentation, classification, modality alignment (retrieval), and VQA tasks. To support reproducibility and fair comparison of future MMFL methods under realistic medical settings, we release the complete benchmark implementation, including data processing and partitioning pipelines, at this https URL.
https://arxiv.org/abs/2602.04416
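Among the aggregation strategies such a benchmark evaluates, FedAvg-style sample-count-weighted parameter averaging is the canonical baseline. A minimal sketch, with plain lists of floats standing in for model parameters (the benchmark's actual algorithms and interfaces are not reproduced here):

```python
# Minimal FedAvg-style aggregation: the server averages per-client
# parameter vectors with weights proportional to each client's local
# sample count, so data-rich institutions contribute more.

def fedavg(client_params, client_sizes):
    total = sum(client_sizes)
    dim = len(client_params[0])
    agg = [0.0] * dim
    for params, n in zip(client_params, client_sizes):
        w = n / total
        for d in range(dim):
            agg[d] += w * params[d]
    return agg

# usage: three simulated hospitals with unequal data volumes
global_params = fedavg(
    client_params=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    client_sizes=[100, 300, 600],
)
```

Non-IID settings like those in the benchmark stress exactly this step: when client distributions diverge, plain weighted averaging degrades, which is why the compared algorithms vary the aggregation, loss, and regularization.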
We present a formal problem formulation for \textit{Reliable} Audio-Visual Question Answering ($\mathcal{R}$-AVQA), where we prefer abstention over answering incorrectly. While recent AVQA models achieve high accuracy, their ability to identify when they are likely wrong, and to consequently abstain from answering, remains an underexplored area of research. To fill this gap, we explore several approaches and then propose Adaptive Confidence Refinement (ACR), a lightweight method to further enhance $\mathcal{R}$-AVQA performance. Our key insight is that the Maximum Softmax Probability (MSP) is Bayes-optimal only under strong calibration, a condition usually not met in deep neural networks, particularly in multimodal models. Instead of replacing MSP, ACR maintains it as the primary confidence signal and applies input-adaptive residual corrections when MSP is deemed unreliable. ACR introduces two learned heads: i) a Residual Risk Head that predicts low-magnitude correctness residuals that MSP does not capture, and ii) a Confidence Gating Head that determines MSP trustworthiness. Our experiments and theoretical analysis show that ACR consistently outperforms existing methods in in-distribution, out-of-distribution, and data-bias settings across three different AVQA architectures, establishing a solid foundation for the $\mathcal{R}$-AVQA task. The code and checkpoints will be available upon acceptance \href{this https URL}{here}.
https://arxiv.org/abs/2602.04924
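The gated-residual idea can be illustrated with stand-in heads in place of the two learned ones. Everything below is a toy sketch: the real Residual Risk and Confidence Gating heads are trained, and the gate rule, residual magnitude, and abstention threshold here are illustrative assumptions:

```python
import math

# Sketch of ACR-style confidence refinement: MSP stays the primary signal;
# when the gating head judges MSP unreliable, a low-magnitude predicted
# residual is blended in, and the model abstains if the refined confidence
# falls below a threshold.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def acr_decide(logits, residual_head, gating_head, threshold=0.5):
    probs = softmax(logits)
    msp = max(probs)
    gate = gating_head(probs)        # in [0, 1]: trust in raw MSP
    residual = residual_head(probs)  # small signed correction to MSP
    confidence = gate * msp + (1.0 - gate) * (msp + residual)
    answer = probs.index(msp)
    if confidence >= threshold:
        return (answer, confidence)
    return ("abstain", confidence)

# stand-in heads: trust MSP for peaked distributions, correct downward
# for flat ones (illustrative behavior only)
gating = lambda p: 1.0 if max(p) > 0.9 else 0.2
residual = lambda p: -0.2

confident = acr_decide([8.0, 0.0, 0.0], residual, gating)
uncertain = acr_decide([0.4, 0.3, 0.3], residual, gating)
```

With a peaked distribution the gate passes MSP through unchanged; with a flat one the negative residual pulls the refined confidence under the threshold and the model abstains, which is the behavior $\mathcal{R}$-AVQA asks for.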