Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi-step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single-pass inference, preventing models from re-examining, verifying, or refining visual evidence during reasoning. While tool-integrated reasoning offers a promising path forward, open-source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multi-modal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task-relevant image regions, and integrate single or multiple sub-image evidence into interleaved multimodal reasoning within a unified, executable interface for agentic training. Using MedVistaGym, we train MedVistaGym-R1 to interleave tool use with agentic reasoning through trajectory sampling and end-to-end reinforcement learning. Across six medical VQA benchmarks, MedVistaGym-R1-8B exceeds comparably sized tool-augmented baselines by 19.10% to 24.21%, demonstrating that structured agentic training--not tool access alone--unlocks effective tool-integrated reasoning for medical image analysis.
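The tool-integrated loop described above can be pictured as a minimal environment-policy interaction: at each step the model either invokes a tool (here, cropping a region to inspect it more closely) or commits to an answer. The sketch below is a toy illustration; `crop_region`, `ToyPolicy`, and the action format are invented for this example and are not MedVistaGym's actual interface.

```python
# Toy sketch of an agentic tool-use loop: the policy decides when to call a
# tool and when to answer; the environment executes the tool call.

def crop_region(image, box):
    """Return the sub-image inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

class ToyPolicy:
    """Stand-in for a VLM policy: zooms into a region once, then answers."""
    def __init__(self):
        self.zoomed = False

    def act(self, observation):
        if not self.zoomed:
            self.zoomed = True
            return ("crop", (1, 1, 3, 3))   # tool call with its arguments
        # after zooming, answer from the cropped evidence
        return ("answer", max(max(row) for row in observation))

def run_episode(image, policy, max_steps=4):
    obs, trajectory = image, []
    for _ in range(max_steps):
        action, arg = policy.act(obs)
        trajectory.append(action)
        if action == "answer":
            return arg, trajectory
        obs = crop_region(obs, arg)         # environment executes the tool
    return None, trajectory

image = [[0, 1, 2, 3],
         [1, 5, 9, 2],
         [2, 8, 7, 1],
         [3, 2, 1, 0]]
answer, trajectory = run_episode(image, ToyPolicy())
```

Trajectory sampling for reinforcement learning would collect many such `trajectory` lists and reward those whose tool use leads to correct answers.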
https://arxiv.org/abs/2601.07107
Autonomous driving increasingly relies on Visual Question Answering (VQA) to enable vehicles to understand complex surroundings by analyzing visual inputs and textual queries. Currently, a paramount concern for VQA in this domain is the stringent requirement for low latency and real-time processing, as delays directly impact real-world safety in this safety-critical application. However, current state-of-the-art VQA models, particularly large vision-language models (VLMs), often prioritize performance over computational efficiency. These models typically process dense patch tokens for every frame, leading to prohibitive computational costs (FLOPs) and significant inference latency, especially with long video sequences. This focus limits their practical deployment in real-time autonomous driving scenarios. To tackle this issue, we propose SRC-Pipeline, an efficient VLM framework for autonomous driving VQA tasks. It learns to compress early frame tokens into a small number of high-level tokens while retaining full patch tokens for recent frames. Experiments on autonomous driving video question answering tasks show that our approach achieves a 66% FLOPs reduction while maintaining comparable performance, enabling VLMs to operate more effectively in real-time, safety-critical autonomous driving settings.
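The compression idea can be sketched as follows: patch tokens from early frames are pooled down to a handful of summary tokens, while the most recent frames keep full patch resolution. This is an illustrative sketch under assumed shapes, not the authors' implementation; `mean_pool`, `compress_video_tokens`, and the frame layout are invented here.

```python
# Sketch: compress early-frame patch tokens into a few summary tokens,
# keep recent frames at full resolution.

def mean_pool(tokens, num_out):
    """Average contiguous groups of token vectors down to num_out summaries."""
    group = max(1, len(tokens) // num_out)
    pooled = []
    for i in range(0, len(tokens), group):
        chunk = tokens[i:i + group]
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled[:num_out]

def compress_video_tokens(frames, recent=2, summary_per_frame=4):
    """frames: list of per-frame patch-token lists (each token a feature vector)."""
    early, late = frames[:-recent], frames[-recent:]
    out = []
    for f in early:
        out.extend(mean_pool(f, summary_per_frame))   # compressed history
    for f in late:
        out.extend(f)                                 # full recent detail
    return out

# 8 frames x 64 patch tokens of dim 4 -> 6*4 + 2*64 = 152 tokens instead of 512
frames = [[[float(p)] * 4 for p in range(64)] for _ in range(8)]
tokens = compress_video_tokens(frames)
```

Shrinking the token sequence this way (here 512 to 152) is the kind of reduction that drives FLOPs savings, since attention cost grows with sequence length.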
https://arxiv.org/abs/2601.07092
The Third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art video models and measure the progress in multimodal perception. This year, the workshop also featured two guest tracks: KiVA (an image understanding challenge) and Physic-IQ (a video generation challenge). In this report, we summarise the results from the main Perception Test challenge, detailing both the existing tasks as well as novel additions to the benchmark. In this iteration, we placed an emphasis on task unification, as this poses a more challenging test for current SOTA multimodal models. The challenge included five consolidated tracks: unified video QA, unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA, alongside an analysis and interpretability track that is still open for submissions. Notably, the unified video QA track introduced a novel subset that reformulates traditional perception tasks (such as point tracking and temporal action localisation) as multiple-choice video QA questions that video-language models can natively tackle. The unified object and point tracking track merged the original object tracking and point tracking tasks, whereas the unified action and sound localisation track merged the original temporal action localisation and temporal sound localisation tracks. Accordingly, we required competitors to use unified approaches rather than engineered pipelines with task-specific models. By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces.
https://arxiv.org/abs/2601.06287
Building a language model for a multi-lingual scenario involves several potential challenges, among which catastrophic forgetting is the major one. For example, small language models (SLMs) built for low-resource languages by adapting large language models (LLMs) are prone to catastrophic forgetting. This work proposes a continual learning strategy that uses parts-of-speech (POS)-based code-switching along with a replay-adapter strategy to mitigate catastrophic forgetting while training an SLM from an LLM. Experiments conducted on vision-language tasks such as visual question answering, as well as on language modelling, demonstrate the success of the proposed architecture.
https://arxiv.org/abs/2601.05874
Africa is home to over one-third of the world's languages, yet remains underrepresented in AI research. We introduce Afri-MCQA, the first Multilingual Cultural Question-Answering benchmark covering 7.5k Q&A pairs across 15 African languages from 12 countries. The benchmark offers parallel English-African language Q&A pairs across text and speech modalities and was entirely created by native speakers. Benchmarking large language models (LLMs) on Afri-MCQA shows that open-weight models perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in a native language or speech. To evaluate linguistic competence, we include control experiments designed to assess this aspect separately from cultural knowledge, and we observe significant performance gaps between native languages and English for both text and speech. These findings underscore the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support more inclusive multimodal AI development in African languages, we release Afri-MCQA under an academic license or CC BY-NC 4.0 on HuggingFace (this https URL).
https://arxiv.org/abs/2601.05699
Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.
https://arxiv.org/abs/2601.05143
Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study \emph{prototypicality bias} as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark \textsc{\textbf{ProtoBias}} (\textit{\textbf{Proto}typical \textbf{Bias}}), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose \textbf{\textsc{ProtoScore}}, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking, while running at orders of magnitude faster than the inference time of GPT-5, approaching the robustness of much larger closed-source judges.
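The directional evaluation the contrastive pairs enable reduces to a simple pairwise check: for each pair, did the metric score the semantically correct image above its prototypical adversary? A minimal sketch of that check, with fabricated scores:

```python
# For each ProtoBias-style pair (correct-but-atypical image,
# incorrect-but-prototypical adversary), a robust metric should score the
# correct image strictly higher. Ties count as failures.

def misranking_rate(pairs):
    """pairs: list of (score_correct_image, score_prototypical_adversary)."""
    failures = sum(1 for correct, adversary in pairs if correct <= adversary)
    return failures / len(pairs)

# fabricated metric scores for four contrastive pairs
pairs = [(0.71, 0.65), (0.40, 0.58), (0.62, 0.62), (0.80, 0.33)]
rate = misranking_rate(pairs)
```

A metric that "follows the prototype" rather than the text pushes this rate toward 1.0; human judgments, per the paper, keep it low with wide decision margins.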
https://arxiv.org/abs/2601.04946
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.
https://arxiv.org/abs/2601.04897
Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question answering (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show a decrease of more than 10\% in Attack Success Rate (ASR), together with an increase of at least 8\% in the harmlessness dimension and over 13\% in the helpfulness dimension on multi-modal multi-turn safety benchmarks, while preserving the models' general abilities.
https://arxiv.org/abs/2601.04736
In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in $\textbf{2B}$ and $\textbf{8B}$ parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of $\textbf{77.8}$ on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
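Matryoshka Representation Learning means a single trained embedding can be truncated to a prefix of its dimensions and re-normalized at inference time, trading quality for storage and speed. A minimal sketch of that usage pattern (the vectors are illustrative, and this is not the Qwen3-VL-Embedding API):

```python
# Matryoshka-style flexible dimensions: keep a prefix of the embedding,
# re-normalize, and compare with cosine similarity as usual.
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and L2-normalize the result."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix)) or 1.0
    return [x / norm for x in prefix]

def cosine(a, b):
    # dot product suffices because both inputs are unit-norm
    return sum(x * y for x, y in zip(a, b))

full_a = [0.6, 0.8, 0.0, 0.0]   # illustrative 4-d embeddings
full_b = [0.8, 0.6, 0.0, 0.0]
a2 = truncate_and_normalize(full_a, 2)   # serve at 2 dims instead of 4
b2 = truncate_and_normalize(full_b, 2)
sim = cosine(a2, b2)
```

Training with Matryoshka-style objectives is what makes such prefixes meaningful; without it, truncation would discard information arbitrarily.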
https://arxiv.org/abs/2601.04720
Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.
https://arxiv.org/abs/2601.03905
Current vision-language benchmarks predominantly feature well-structured questions with clear, explicit prompts. However, real user queries are often informal and underspecified. Users naturally leave much unsaid, relying on images to convey context. We introduce HAERAE-Vision, a benchmark of 653 real-world visual questions from Korean online communities (0.76% survival from 86K candidates), each paired with an explicit rewrite, yielding 1,306 query variants in total. Evaluating 39 VLMs, we find that even state-of-the-art models (GPT-5, Gemini 2.5 Pro) achieve under 50% on the original queries. Crucially, query explicitation alone yields 8 to 22 point improvements, with smaller models benefiting most. We further show that even with web search, under-specified queries underperform explicit queries without search, revealing that current retrieval cannot compensate for what users leave unsaid. Our findings demonstrate that a substantial portion of VLM difficulty stems from natural query under-specification rather than model capability, highlighting a critical gap between benchmark evaluation and real-world deployment.
https://arxiv.org/abs/2601.06165
Multimodal medical large language models have shown impressive progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model explicitly designed for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two-stage approach: first, it identifies anatomical structures and extracts their features, and then it leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at this https URL
https://arxiv.org/abs/2601.03191
Multimodal large language models (MLLMs) typically rely on a single late-layer feature from a frozen vision encoder, leaving the encoder's rich hierarchy of visual cues under-utilized. MLLMs still suffer from visually ungrounded hallucinations, often relying on language priors rather than image evidence. While many prior mitigation strategies operate on the text side, they leave the visual representation unchanged and do not exploit the rich hierarchy of features encoded across vision layers. Existing multi-layer fusion methods partially address this limitation but remain static, applying the same layer mixture regardless of the query. In this work, we introduce TGIF (Text-Guided Inter-layer Fusion), a lightweight module that treats encoder layers as depth-wise "experts" and predicts a prompt-dependent fusion of visual features. TGIF follows the principle of direct external fusion, requires no vision-encoder updates, and adds minimal overhead. Integrated into LLaVA-1.5-7B, TGIF provides consistent improvements across hallucination, OCR, and VQA benchmarks, while preserving or improving performance on ScienceQA, GQA, and MMBench. These results suggest that query-conditioned, hierarchy-aware fusion is an effective way to strengthen visual grounding and reduce hallucination in modern MLLMs.
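The fusion step TGIF describes can be sketched as follows: a predictor maps the prompt embedding to one logit per encoder layer, a softmax turns the logits into a prompt-dependent layer mixture, and the fused feature is the weighted sum of per-layer features. The linear predictor and toy dimensions below are stand-ins for the paper's learned module, not its actual implementation:

```python
# Query-conditioned inter-layer fusion: the prompt decides how much each
# vision-encoder layer ("depth-wise expert") contributes.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_layers(layer_feats, prompt_emb, weight_matrix):
    """layer_feats: one feature vector per encoder layer (same dim).
    weight_matrix: one row per layer, dotted with prompt_emb -> one logit."""
    logits = [sum(w * p for w, p in zip(row, prompt_emb)) for row in weight_matrix]
    alphas = softmax(logits)                 # prompt-dependent layer mixture
    dim = len(layer_feats[0])
    fused = [sum(a * f[d] for a, f in zip(alphas, layer_feats))
             for d in range(dim)]
    return fused, alphas

layer_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 encoder layers, dim 2
prompt_emb = [1.0, -1.0]
W = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]             # logits become 2, -2, 0
fused, alphas = fuse_layers(layer_feats, prompt_emb, W)
```

A different prompt embedding would shift `alphas` toward different layers, which is what makes the mixture query-conditioned rather than static.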
https://arxiv.org/abs/2601.03100
Visual Question Answering (VQA) for stylised cartoon imagery presents challenges, such as interpreting exaggerated visual abstraction and narrative-driven context, which are not adequately addressed by standard large language models (LLMs) trained on natural images. To investigate this issue, a multi-agent LLM framework is introduced, specifically designed for VQA tasks in cartoon imagery. The proposed architecture consists of three specialised agents: visual agent, language agent and critic agent, which work collaboratively to support structured reasoning by integrating visual cues and narrative context. The framework was systematically evaluated on two cartoon-based VQA datasets: Pororo and Simpsons. Experimental results provide a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA and multimodal inference.
https://arxiv.org/abs/2601.03073
Comic-based visual question answering (CVQA) poses distinct challenges to multimodal large language models (MLLMs) due to its reliance on symbolic abstraction, narrative logic, and humor, which differ from conventional VQA tasks. Although Chain-of-Thought (CoT) prompting is widely used to enhance MLLM reasoning, surprisingly, its direct application to CVQA often degrades performance, especially in small-scale models. Our theoretical and empirical analyses reveal that standard CoT in CVQA suffers from state entanglement, spurious transitions, and exploration inefficiency, with small models particularly vulnerable in resource-constrained settings. To address these issues, we propose a novel comic reasoning framework, designed to produce more faithful and transferable reasoning chains in small MLLMs. Specifically, our framework combines modular CoT generation with GRPO-based reinforcement fine-tuning and a novel structured reward. Beyond comic VQA, we further evaluate our approach on a broader class of humor-centric and abstract visual reasoning tasks, including meme understanding and editorial cartoon interpretation. Across five challenging benchmarks, our 3B model outperforms state-of-the-art methods, and plug-in experiments yield an additional average improvement of $\mathbf{12.1\%}$ across different MLLMs.
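GRPO, the optimizer used in the reinforcement fine-tuning stage, scores each sampled response relative to its own group of rollouts rather than against a learned value function. A minimal sketch of the group-relative advantage computation (the rewards are fabricated; the structured reward itself is the paper's contribution and is not reproduced here):

```python
# GRPO-style advantages: normalize each rollout's reward against the
# mean and std of its own group, so no critic network is needed.
import math

def group_relative_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# one prompt, a group of 4 sampled reasoning chains scored by some reward
rewards = [1.0, 0.0, 0.5, 0.5]
adv = group_relative_advantages(rewards)
```

Rollouts above the group mean get positive advantages and are reinforced; those below are suppressed, which pushes the policy toward the reasoning chains the structured reward prefers.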
https://arxiv.org/abs/2601.02991
Earth vision has achieved milestones in geospatial object recognition but lacks exploration in object-relational reasoning, limiting comprehensive scene understanding. To address this, a progressive Earth vision-language understanding and generation framework is proposed, comprising a multi-task dataset (EarthVLSet) and a semantic-guided network (EarthVLNet). Focusing on city planning applications, EarthVLSet includes 10.9k sub-meter resolution remote sensing images, land-cover masks, and 761.5k textual pairs covering both multiple-choice and open-ended visual question answering (VQA) tasks. EarthVLNet is designed to progressively achieve semantic segmentation, relational reasoning, and comprehensive understanding in an object-centric way. The first stage performs land-cover segmentation to generate object semantics for VQA guidance. Guided by pixel-wise semantics, the object-awareness-based large language model (LLM) performs relational reasoning and knowledge summarization to generate the required answers. For optimization, a numerical difference loss is proposed that dynamically adds difference penalties to account for the varying statistics of different objects. Three benchmarks, covering semantic segmentation, multiple-choice VQA, and open-ended VQA, demonstrate the superiority of EarthVLNet and point to three future directions: 1) segmentation features consistently enhance VQA performance even in cross-dataset scenarios; 2) multiple-choice tasks show greater sensitivity to the vision encoder than to the language decoder; and 3) open-ended tasks necessitate advanced vision encoders and language decoders for optimal performance. We believe this dataset and method will provide a beneficial benchmark that connects "image-mask-text", advancing geographical applications for Earth vision.
https://arxiv.org/abs/2601.02783
Multimodal large language models (MLLMs) show promising performance on medical visual question answering (VQA) and report generation, but these generation and explanation abilities do not reliably transfer to disease-specific classification. We evaluated MLLM architectures on knee osteoarthritis (OA) radiograph classification, which remains underrepresented in existing medical MLLM benchmarks, even though knee OA affects an estimated 300 to 400 million people worldwide. Through systematic ablation studies manipulating the vision encoder, the connector, and the large language model (LLM) across diverse training strategies, we measured each component's contribution to diagnostic accuracy. In our classification task, a trained vision encoder alone could outperform full MLLM pipelines in classification accuracy, and fine-tuning the LLM provided no meaningful improvement over prompt-based guidance. Moreover, LoRA fine-tuning on a small, class-balanced dataset (500 images) gave better results than training on a much larger but class-imbalanced set (5,778 images), indicating that data balance and quality can matter more than raw scale for this task. These findings suggest that for domain-specific medical classification, LLMs are more effective as interpreters and report generators than as primary classifiers. Therefore, the MLLM architecture appears less suitable for medical image diagnostic classification tasks that demand high certainty. We recommend prioritizing vision encoder optimization and careful dataset curation when developing clinically applicable systems.
https://arxiv.org/abs/2601.02443
Visual Question Answering (VQA) requires models to reason over multimodal information, combining visual and textual data. With the development of continual learning, significant progress has been made in retaining knowledge and adapting to new information in the VQA domain. However, current methods often struggle to balance knowledge retention, adaptation, and robust feature representation. To address these challenges, we propose MacVQA, a novel visual question answering framework with adaptive memory allocation and global noise filtering. MacVQA fuses visual and question information while filtering noise to ensure robust representations, and employs prototype-based memory allocation to optimize feature quality and memory usage. These designs enable MacVQA to balance knowledge acquisition, retention, and compositional generalization in continual VQA learning. Experiments on ten continual VQA tasks show that MacVQA outperforms existing baselines, achieving 43.38% average accuracy and 2.32% average forgetting on standard tasks, and 42.53% average accuracy and 3.60% average forgetting on novel composition tasks.
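The two numbers reported per task suite are the standard continual-learning metrics: given acc[i][j], the accuracy on task j after training on task i, average accuracy is the mean of the final row, and average forgetting is the mean drop from each earlier task's best-ever accuracy to its final accuracy. A sketch with a fabricated accuracy matrix:

```python
# Standard continual-learning metrics over an accuracy matrix acc[i][j]:
# accuracy on task j measured after training on task i.

def average_accuracy(acc):
    """Mean accuracy over all tasks after the final training stage."""
    last = acc[-1]
    return sum(last) / len(last)

def average_forgetting(acc):
    """Mean drop from each earlier task's best-ever to its final accuracy."""
    T = len(acc)
    drops = []
    for j in range(T - 1):                      # the last task cannot be forgotten yet
        best = max(acc[i][j] for i in range(T - 1))
        drops.append(best - acc[-1][j])
    return sum(drops) / len(drops)

# fabricated 3-task run: rows = after training task i, cols = evaluated task j
acc = [[0.60, 0.00, 0.00],
       [0.55, 0.70, 0.00],
       [0.50, 0.65, 0.75]]
avg_acc = average_accuracy(acc)       # (0.50 + 0.65 + 0.75) / 3
avg_forget = average_forgetting(acc)  # ((0.60-0.50) + (0.70-0.65)) / 2
```

Lower forgetting at comparable average accuracy is what distinguishes a method like MacVQA from baselines that trade one for the other.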
https://arxiv.org/abs/2601.01926
Automated pathology image analysis is central to clinical diagnosis, but clinicians still ask which slide features drive a model's decision and why. Vision-language models can produce natural language explanations, but these are often correlational and lack verifiable evidence. In this paper, we introduce an SQL-centered agentic framework that enables both feature measurement and reasoning to be auditable. Specifically, after extracting human-interpretable cellular features, Feature Reasoning Agents compose and execute SQL queries over feature tables to aggregate visual evidence into quantitative findings. A Knowledge Comparison Agent then evaluates these findings against established pathological knowledge, mirroring how pathologists justify diagnoses from measurable observations. Extensive experiments evaluated on two pathology visual question answering datasets demonstrate our method improves interpretability and decision traceability while producing executable SQL traces that link cellular measurements to diagnostic conclusions.
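The auditable step the framework centers on can be sketched with an in-memory feature table: a finding is an executable SQL query over extracted cellular features, so anyone can re-run it and check the quantitative result. The schema, feature names, and values below are invented for illustration and are not the paper's actual tables:

```python
# Sketch: a Feature Reasoning Agent's finding as an executable, re-runnable
# SQL query over a table of human-interpretable cellular features.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE cells (
    cell_id INTEGER PRIMARY KEY,
    region TEXT,
    nucleus_area REAL,
    mitotic INTEGER)""")

# fabricated per-cell measurements extracted from a slide
rows = [(1, "tumor", 58.0, 1), (2, "tumor", 61.5, 0),
        (3, "stroma", 30.2, 0), (4, "tumor", 70.1, 1)]
conn.executemany("INSERT INTO cells VALUES (?, ?, ?, ?)", rows)

# the "finding": mean nuclear area and mitotic count per region,
# stored as an SQL trace that links measurements to the conclusion
query = """SELECT region, AVG(nucleus_area), SUM(mitotic)
           FROM cells GROUP BY region ORDER BY region"""
findings = conn.execute(query).fetchall()
```

Because the query itself is stored alongside the diagnosis, a reviewer can re-execute it against the feature table, which is what makes the reasoning auditable rather than merely narrated.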
https://arxiv.org/abs/2601.01875