Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel at basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich, multi-level textual representations; (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity; and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Using the lightweight GPT-4o-mini model, our framework achieves SOTA performance against 15+ MLLMs on the English image implication benchmark and a large improvement on the Chinese benchmark, performing comparably to GPT-4o on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
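The three-stage design reads naturally as a pipeline of model calls. The sketch below is a minimal illustration of that control flow, not the authors' implementation; `call_llm` and `web_search` are hypothetical stand-ins for whatever MLLM endpoint and retrieval backend are used.

```python
# Minimal sketch of a Perception -> Search -> Reasoning loop for image
# implication understanding. All helpers are hypothetical placeholders.

def call_llm(prompt: str) -> str:          # stand-in for an MLLM/LLM endpoint
    return f"<llm output for: {prompt[:40]}...>"

def web_search(query: str) -> str:          # stand-in for a retrieval backend
    return f"<snippets for: {query}>"

def perceive(image_caption: str) -> str:
    # Stage 1: turn visual content into multi-level textual descriptions.
    return call_llm(f"Describe objects, scene, style and symbols in: {image_caption}")

def search(description: str, max_rounds: int = 3) -> str:
    # Stage 2: iteratively pull in cross-domain knowledge to resolve ambiguity.
    context = description
    for _ in range(max_rounds):
        query = call_llm(f"What background knowledge is still missing?\n{context}")
        context += "\n" + web_search(query)
    return context

def reason(context: str) -> str:
    # Stage 3: produce a context-aligned implication with explicit reasoning.
    return call_llm(f"Step by step, infer the implied meaning:\n{context}")

if __name__ == "__main__":
    print(reason(search(perceive("a melting clock draped over a tree branch"))))
```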
https://arxiv.org/abs/2505.17019
Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA -- the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multi-stage filtering strategy, including model-based and manual review, to preserve data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. We comprehensively benchmark ten advanced multimodal LLMs -- both proprietary and open source, with and without explicit reasoning modules -- on MedFrameQA. The evaluation reveals that all models perform poorly, with most accuracies below 50%, and that accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.
https://arxiv.org/abs/2505.16964
Batteries are essential for various applications, including electric vehicles and renewable energy storage, making safety and efficiency critical concerns. Anomaly detection in battery thermal images helps identify failures early, but traditional deep learning methods require extensive labeled data, which is difficult to obtain, especially for anomalies, due to safety risks and high data collection costs. To overcome this, we explore zero-shot anomaly detection using Visual Question Answering (VQA) models, which leverage pretrained knowledge and text-based prompts to generalize across vision tasks. By incorporating prior knowledge of normal battery thermal behavior, we design prompts to detect anomalies without battery-specific training data. We evaluate three VQA models (ChatGPT-4o, LLaVa-13b, and BLIP-2), analyzing their robustness to prompt variations, repeated trials, and qualitative outputs. Despite the lack of fine-tuning on battery data, our approach demonstrates competitive performance compared to state-of-the-art models trained on battery data. Our findings highlight the potential of VQA-based zero-shot learning for battery anomaly detection and suggest future directions for improving its effectiveness.
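As an illustration of the prompting idea, the sketch below encodes prior knowledge about normal thermal behavior directly in the question and parses a yes/no answer; `query_vqa_model` is a hypothetical wrapper around whichever VQA model (e.g., BLIP-2 or a GPT-4o endpoint) is being evaluated, and the wording of the prior is an assumption, not the paper's exact prompt.

```python
# Hedged sketch: zero-shot anomaly detection by asking a VQA model a
# prompt that embeds prior knowledge of normal battery thermal behavior.

NORMAL_PRIOR = (
    "A healthy battery cell shows a smooth, roughly uniform temperature "
    "distribution with no isolated hot spots."
)

def build_prompt() -> str:
    return (
        f"{NORMAL_PRIOR} Looking at this thermal image, does it show an "
        "abnormal hot spot or uneven heating? Answer 'yes' or 'no'."
    )

def query_vqa_model(image_path: str, prompt: str) -> str:
    # Placeholder for a real VQA call (BLIP-2, LLaVA, GPT-4o, ...).
    return "no"

def is_anomalous(image_path: str) -> bool:
    answer = query_vqa_model(image_path, build_prompt()).strip().lower()
    return answer.startswith("yes")

if __name__ == "__main__":
    print(is_anomalous("cell_042_thermal.png"))
```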
https://arxiv.org/abs/2505.16674
Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference, we propose a decoding strategy that leverages causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. We therefore propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. To this end, we present FarSight, a versatile plug-and-play decoding strategy that reduces attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture the attention diverted to outlier tokens. Moreover, we propose a positional awareness encoding method with a diminishing masking rate, allowing the model to attend to tokens further back, which is especially important for video sequence tasks. In extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
https://arxiv.org/abs/2505.16652
We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA), addressing both single-image and image-difference questions. Single-image questions focus on abnormalities within a specific CXR ("What abnormalities are seen in image X?"), while image-difference questions compare two longitudinal CXRs acquired at different time points ("What are the differences between image X and Y?"). We further explore how the integration of radiology reports can enhance the performance of VQA models. While previous approaches have demonstrated the utility of radiology reports during the pre-training phase, we extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model's predicted answers. First, we propose a unified method that handles both types of questions and auto-regressively generates the answers. For single-image questions, the model is provided with a single CXR. For image-difference questions, the model is provided with two CXRs from the same patient, captured at different time points, enabling the model to detect and describe temporal changes. Taking inspiration from 'Chain-of-Thought reasoning', we demonstrate that performance on the CXR VQA task can be improved by grounding the answer generator module with a radiology report predicted for the same CXR. In our approach, the VQA model is divided into two steps: i) Report Generation (RG) and ii) Answer Generation (AG). Our results demonstrate that incorporating predicted radiology reports as evidence for the AG model enhances performance on both single-image and image-difference questions, achieving state-of-the-art results on the Medical-Diff-VQA dataset.
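The two-step structure is easy to picture as report generation feeding answer generation. Below is a minimal sketch of that flow; `generate_report` and `generate_answer` are hypothetical model wrappers, not the authors' code.

```python
# Sketch of a two-step CXR VQA pipeline: (i) predict a radiology report,
# (ii) ground the answer generator on that report. Model calls are stubs.
from typing import Optional

def generate_report(cxr_images: list[str]) -> str:
    # Step i: report generation (RG) from one or two CXRs.
    return "Predicted report: mild cardiomegaly, no focal consolidation."

def generate_answer(question: str, cxr_images: list[str], report: str) -> str:
    # Step ii: answer generation (AG), conditioned on images + predicted report.
    prompt = f"Report: {report}\nQuestion: {question}\nAnswer:"
    return f"<answer generated from: {prompt[:50]}...>"

def answer_question(question: str, current: str, prior: Optional[str] = None) -> str:
    images = [current] if prior is None else [prior, current]  # difference questions use two CXRs
    report = generate_report(images)
    return generate_answer(question, images, report)

if __name__ == "__main__":
    print(answer_question("What abnormalities are seen in image X?", "cxr_x.png"))
    print(answer_question("What are the differences between image X and Y?",
                          "cxr_y.png", prior="cxr_x.png"))
```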
https://arxiv.org/abs/2505.16624
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Our findings reveal that advanced proprietary LVMs outperform open-source alternatives and gain moderately from multimodal inputs over text-only inputs, while open-source alternatives show significant performance degradation with multimodal inputs. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at this https URL.
https://arxiv.org/abs/2505.16470
Despite their remarkable progress on multimodal understanding tasks, large vision-language models (LVLMs) often suffer from "hallucinations", generating text misaligned with the visual context. Existing methods that aim to reduce hallucinations through inference-time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores by up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at this https URL.
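The head-selection rule can be sketched directly on an attention tensor: for each text query position, rank heads by the attention mass they place on image tokens and damp all but the top-K. This is an illustrative approximation of the idea, not the released SPIN code; tensor shapes and the damping factor are assumptions.

```python
# Illustrative sketch of attention-guided head suppression (SPIN-style).
# attn: [num_heads, q_len, k_len] attention weights from one decoder layer.
import torch

def suppress_heads(attn: torch.Tensor, image_token_idx: torch.Tensor,
                   top_k: int = 8, damp: float = 0.0) -> torch.Tensor:
    # Attention mass each head puts on image tokens, per query position.
    img_mass = attn[:, :, image_token_idx].sum(dim=-1)        # [heads, q_len]
    keep = img_mass.topk(top_k, dim=0).indices                # [top_k, q_len]
    mask = torch.full_like(img_mass, damp)                    # damp factor for weak heads
    mask.scatter_(0, keep, 1.0)                               # keep top-K heads intact
    return attn * mask.unsqueeze(-1)                          # rescale per head and query

if __name__ == "__main__":
    attn = torch.softmax(torch.randn(32, 16, 600), dim=-1)
    image_idx = torch.arange(0, 576)                          # e.g. 576 image tokens
    print(suppress_heads(attn, image_idx).shape)              # torch.Size([32, 16, 600])
```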
https://arxiv.org/abs/2505.16411
Computed Tomography (CT) scans produce 3D volumetric medical data that can be viewed as hundreds of cross-sectional images (a.k.a. slices) and provide detailed anatomical information for diagnosis. For radiologists, creating CT radiology reports is time-consuming and error-prone. A visual question answering (VQA) system that can answer radiologists' questions about anatomical regions on a CT scan, and even automatically generate a radiology report, is urgently needed. However, existing VQA systems cannot adequately handle the CT radiology question answering (CTQA) task because: (1) anatomic complexity makes CT images difficult to understand; and (2) spatial relationships across hundreds of slices are difficult to capture. To address these issues, this paper proposes CT-Agent, a multimodal agentic framework for CTQA. CT-Agent adopts anatomically independent tools to break down the anatomic complexity; furthermore, it efficiently captures across-slice spatial relationships with a global-local token compression strategy. Experimental results on two 3D chest CT datasets, CT-RATE and RadGenome-ChestCT, verify the superior performance of CT-Agent.
https://arxiv.org/abs/2505.16229
Medical Visual Question Answering (MedVQA) is crucial for enhancing the efficiency of clinical diagnosis by providing accurate and timely responses to clinicians' inquiries regarding medical images. Existing MedVQA models suffer from modality preference bias, where predictions are heavily dominated by one modality while the other is overlooked (in MedVQA, questions usually dominate the answer while images are overlooked), and thereby fail to learn multimodal knowledge. To overcome this modality preference bias, we propose a Medical CounterFactual VQA (MedCFVQA) model, which trains with the bias present and leverages causal graphs to eliminate it during inference. Existing MedVQA datasets exhibit substantial prior dependencies between questions and answers, which yields acceptable performance even when the model significantly suffers from the modality preference bias. To address this issue, we reconstruct new datasets by leveraging existing MedVQA datasets and Changing the Prior dependencies (CP) between questions and their answers in the training and test sets. Extensive experiments demonstrate that MedCFVQA significantly outperforms its non-causal counterpart on the SLAKE, RadVQA, SLAKE-CP, and RadVQA-CP datasets.
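Counterfactual debiasing of this kind is commonly implemented by contrasting the fused prediction with a question-only (language-prior) branch at inference time. The sketch below shows that generic recipe only; it is an assumption about the mechanism, not MedCFVQA's exact causal-graph formulation.

```python
# Generic counterfactual-inference sketch for modality preference bias:
# subtract the question-only (language prior) effect from the fused logits.
import torch

def debiased_logits(fused_logits: torch.Tensor,
                    question_only_logits: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    # Total effect minus the (scaled) direct effect of the question branch.
    return fused_logits - alpha * question_only_logits

if __name__ == "__main__":
    fused = torch.tensor([[2.0, 0.5, 1.0]])       # image+question branch
    q_only = torch.tensor([[1.8, 0.1, 0.2]])      # question-only branch (bias)
    print(debiased_logits(fused, q_only).softmax(dim=-1))
```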
https://arxiv.org/abs/2505.16209
Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks such as visual question answering (VQA) and image captioning. However, they still suffer from hallucinations, generating text inconsistent with the visual input, which poses significant risks in real-world applications. Existing approaches to this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may suppress hallucinations insufficiently or intervene so heavily that normal semantics are harmed. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with either hallucination or actuality, yielding more precise and direct hallucination-related representations. Our analysis demonstrates that interventions along the faithful direction we identify can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a training-free method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead.
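Steering with an SAE-derived direction amounts to nudging a hidden state along a fixed vector at inference time. The fragment below is a minimal, hypothetical version of that intervention; the directions, scales, and the combine-both-directions choice are placeholders, not the released SSL code.

```python
# Minimal sketch: steer a hidden state along an SAE-derived "faithful"
# direction (and away from the hallucinatory one) during decoding.
import torch

def steer(hidden: torch.Tensor, faithful_dir: torch.Tensor,
          hallu_dir: torch.Tensor, alpha: float = 4.0, beta: float = 2.0) -> torch.Tensor:
    f = faithful_dir / faithful_dir.norm()
    h = hallu_dir / hallu_dir.norm()
    return hidden + alpha * f - beta * h     # training-free, applied per decoding step

if __name__ == "__main__":
    d = 4096
    hidden = torch.randn(1, d)
    faithful, hallu = torch.randn(d), torch.randn(d)
    print(steer(hidden, faithful, hallu).shape)   # torch.Size([1, 4096])
```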
https://arxiv.org/abs/2505.16146
Video quality assessment (VQA) is a challenging research topic with broad applications. Effective VQA requires sensitivity to pixel-level distortions and a comprehensive understanding of video context to accurately determine the perceptual impact of distortions. Traditional hand-crafted and learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent LLM-based models struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context and Pixel aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). The model is trained via a multi-task pipeline optimizing for score prediction, description generation, and pairwise comparisons. Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on established VQA benchmarks and superior robustness to pixel distortions, confirming its efficacy for comprehensive and practical video quality assessment in real-world scenarios.
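The dual-encoder layout can be pictured as a small PyTorch skeleton: one branch for high-level context, one for low-level pixel distortion, both projected into the decoder's token space with a score head on top. The module sizes, the linear placeholders standing in for real encoders, and the fusion-by-concatenation choice are illustrative assumptions rather than CP-LLM's actual architecture.

```python
# Skeleton of a context- and pixel-aware model: two vision encoders whose
# tokens are projected and concatenated before a language decoder.
import torch
import torch.nn as nn

class DualEncoderVQAModel(nn.Module):
    def __init__(self, ctx_dim=1024, pix_dim=768, lm_dim=2048):
        super().__init__()
        self.context_encoder = nn.Linear(ctx_dim, ctx_dim)   # placeholder for a context ViT
        self.pixel_encoder = nn.Linear(pix_dim, pix_dim)     # placeholder for a distortion encoder
        self.ctx_proj = nn.Linear(ctx_dim, lm_dim)
        self.pix_proj = nn.Linear(pix_dim, lm_dim)
        self.score_head = nn.Linear(lm_dim, 1)               # quality-score branch

    def forward(self, ctx_feats, pix_feats):
        ctx_tokens = self.ctx_proj(self.context_encoder(ctx_feats))
        pix_tokens = self.pix_proj(self.pixel_encoder(pix_feats))
        fused = torch.cat([ctx_tokens, pix_tokens], dim=1)   # would be fed to the language decoder
        score = self.score_head(fused.mean(dim=1))           # pooled tokens -> quality score
        return fused, score

if __name__ == "__main__":
    model = DualEncoderVQAModel()
    fused, score = model(torch.randn(2, 16, 1024), torch.randn(2, 64, 768))
    print(fused.shape, score.shape)   # (2, 80, 2048) (2, 1)
```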
https://arxiv.org/abs/2505.16025
Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in pixel space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. A reinforcement learning (RL) phase then leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, \model, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.
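The visual operations themselves are simple image manipulations that the model learns to invoke. A minimal sketch of `zoom_in` and `select-frame` using Pillow is given below; the exact operation names, signatures, and behaviors in the paper may differ.

```python
# Sketch of two pixel-space operations a VLM could invoke: crop-and-enlarge
# a region of interest, and pick one frame out of a video clip.
from PIL import Image

def zoom_in(image: Image.Image, box: tuple[int, int, int, int],
            scale: int = 2) -> Image.Image:
    # box = (left, upper, right, lower) in pixel coordinates.
    region = image.crop(box)
    return region.resize((region.width * scale, region.height * scale))

def select_frame(frames: list[Image.Image], index: int) -> Image.Image:
    return frames[max(0, min(index, len(frames) - 1))]

if __name__ == "__main__":
    img = Image.new("RGB", (640, 480), "gray")
    print(zoom_in(img, (100, 100, 300, 250)).size)       # (400, 300)
    print(select_frame([img] * 8, index=5).size)         # (640, 480)
```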
https://arxiv.org/abs/2505.15966
Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, even as newer models improve at both, significant room for improvement remains in tracking objects for temporal grounding and in reasoning-based decision-making that better aligns object references with language model outputs. This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA) that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state of the art in VideoQA and video understanding, showing enhanced performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at this https URL.
https://arxiv.org/abs/2505.15928
Multimodal pathological image understanding has garnered widespread interest due to its potential to improve diagnostic accuracy and enable personalized treatment through integrated visual and textual data. However, existing methods exhibit limited reasoning capabilities, which hamper their ability to handle complex diagnostic scenarios. Additionally, the enormous size of pathological images leads to severe computational burdens, further restricting their practical deployment. To address these limitations, we introduce a novel bilateral reinforcement learning framework comprising two synergistic branches. One reinforcement branch enhances reasoning capability by enabling the model to learn task-specific decision processes, i.e., pathology rationales, directly from labels without explicit reasoning supervision, while the other branch dynamically allocates a tailored number of tokens to different images based on both their visual content and task context, thereby optimizing computational efficiency. We apply our method to various pathological tasks such as visual question answering, cancer subtyping, and lesion detection. Extensive experiments show an average +41.7 absolute performance improvement with 70.3% lower inference costs compared to the base models, achieving both reasoning accuracy and computational efficiency.
https://arxiv.org/abs/2505.15687
Generalization of deep-learning-based (DL) computer vision algorithms to various image perturbations is hard to establish and remains an active area of research. The majority of past analyses focus on the images already captured, whereas the effects of the image formation pipeline and environment are less studied. In this paper, we address this issue by analyzing the impact of capture conditions, such as camera parameters and lighting, on DL model performance on 3 vision tasks -- image classification, object detection, and visual question answering (VQA). To this end, we assess capture bias in common vision datasets and create a new benchmark, SNAP (for Shutter speed, ISO seNsitivity, and APerture), consisting of images of objects taken under controlled lighting conditions and with densely sampled camera settings. We then evaluate a large number of DL vision models and show the effects of capture conditions on each selected vision task. Lastly, we conduct an experiment to establish a human baseline for the VQA task. Our results show that computer vision datasets are significantly biased, that models trained on this data do not reach human accuracy even on well-exposed images, and that they are susceptible to both major exposure changes and minute variations in camera settings. Code and data can be found at this https URL.
https://arxiv.org/abs/2505.15628
Vision Language Models (VLMs) employed for visual question answering (VQA) in autonomous driving often require substantial computational resources, which poses a challenge for their deployment in resource-constrained vehicles. To address this challenge, we introduce TinyDrive, a lightweight yet effective VLM for multi-view VQA in driving scenarios. Our model comprises two key components: a multiscale vision encoder and a dual-level prioritization mechanism for tokens and sequences. The multiscale encoder facilitates the processing of multi-view images at diverse resolutions through scale injection and cross-scale gating to generate enhanced visual representations. At the token level, we design a token routing mechanism that dynamically selects and processes the most informative tokens based on learned importance scores. At the sequence level, we propose integrating normalized loss, uncertainty estimates, and a diversity metric to formulate sequence scores that rank and preserve samples within a sequence priority buffer; samples with higher scores are selected more frequently for training. TinyDrive is first evaluated on our custom-curated VQA dataset and subsequently tested on the public DriveLM benchmark, where it achieves state-of-the-art language understanding performance. Notably, it achieves relative improvements of 11.1% and 35.4% in BLEU-4 and METEOR scores, respectively, despite having a significantly smaller parameter count.
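The sequence-level score is described as a combination of normalized loss, uncertainty, and diversity. The sketch below shows one plausible weighted-sum formulation together with a priority buffer that samples high-scoring sequences more often; the weights, normalization, and sampling rule are assumptions, not TinyDrive's exact recipe.

```python
# Hedged sketch: score sequences by normalized loss, uncertainty and
# diversity, then sample training sequences proportionally to the score.
import numpy as np

def sequence_scores(loss, uncertainty, diversity, w=(0.5, 0.3, 0.2)):
    def norm(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    return w[0] * norm(loss) + w[1] * norm(uncertainty) + w[2] * norm(diversity)

def sample_from_buffer(scores, num_samples, seed=0):
    rng = np.random.default_rng(seed)
    probs = scores / scores.sum()
    return rng.choice(len(scores), size=num_samples, replace=True, p=probs)

if __name__ == "__main__":
    s = sequence_scores(loss=[0.2, 1.3, 0.7], uncertainty=[0.1, 0.9, 0.4],
                        diversity=[0.8, 0.2, 0.6])
    print(s, sample_from_buffer(s, num_samples=5))
```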
https://arxiv.org/abs/2505.15564
Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm: using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot trajectory, Robo2VLM derives ground truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, the task goal, and the target object. These properties are used to generate representative VQA queries -- images with textual multiple-choice questions -- based on spatial, goal-conditioned, and interaction reasoning question templates. We curate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions covering 463 distinct scenes and 3,396 robotic manipulation tasks drawn from 176k real robot trajectories. Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.
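Phase segmentation from non-visual signals can be pictured as change-point detection on the gripper aperture stream. The sketch below is a simplified stand-in for however Robo2VLM actually segments trajectories; the threshold and the two-phase labeling are assumptions.

```python
# Simplified sketch: split a robot trajectory into manipulation phases by
# detecting open/close transitions in the gripper aperture signal.
import numpy as np

def segment_phases(gripper_aperture: np.ndarray, closed_thresh: float = 0.2):
    closed = gripper_aperture < closed_thresh            # True while grasping
    change_points = np.flatnonzero(np.diff(closed.astype(int))) + 1
    bounds = [0, *change_points.tolist(), len(gripper_aperture)]
    return [(s, e, "grasp" if closed[s] else "move") for s, e in zip(bounds, bounds[1:])]

if __name__ == "__main__":
    aperture = np.array([0.9, 0.9, 0.8, 0.1, 0.05, 0.05, 0.7, 0.9])
    print(segment_phases(aperture))
    # [(0, 3, 'move'), (3, 6, 'grasp'), (6, 8, 'move')]
```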
https://arxiv.org/abs/2505.15517
Reasoning about temporal causality, particularly irreversible transformations of objects governed by real-world knowledge (e.g., fruit decay and human aging), is a fundamental aspect of human visual understanding. Unlike temporal perception based on simple event sequences, this form of reasoning requires a deeper comprehension of how object states change over time. Although current powerful Vision-Language Models (VLMs) have demonstrated impressive performance on a wide range of downstream tasks, their capacity to reason about temporal causality remains underexplored. To address this gap, we introduce TimeCausality, a novel benchmark specifically designed to evaluate the causal reasoning ability of VLMs in the temporal dimension. On TimeCausality, we find that while current SOTA open-source VLMs have achieved performance comparable to closed-source models such as GPT-4o on various standard visual question answering tasks, they fall significantly behind their closed-source competitors on our benchmark. Furthermore, even GPT-4o exhibits a marked drop in performance on TimeCausality compared to its results on other tasks. These findings underscore the critical need to incorporate temporal causality into the evaluation and development of VLMs, and they highlight an important challenge for the open-source VLM community moving forward. Code and data (TimeCausality) are available at this https URL.
https://arxiv.org/abs/2505.15435
The extraction of visual features is an essential step in Visual Question Answering (VQA). Building a good visual representation of the analyzed scene is indeed one of the keys to enabling the system to correctly understand that scene and answer complex questions. In many fields, such as remote sensing, the visual feature extraction step could benefit significantly from leveraging different image modalities carrying complementary spectral, spatial, and contextual information. In this work, we propose to add multiple image modalities to VQA in the particular context of remote sensing, leading to a novel task for the computer vision community. To this end, we introduce a new VQA dataset, named TAMMI (Text and Multi-Modal Imagery), with diverse questions on scenes described by three different modalities (very high resolution RGB, multi-spectral imaging data, and synthetic aperture radar). Thanks to an automated pipeline, this dataset can be easily extended according to experimental needs. We also propose the MM-RSVQA (Multi-modal Multi-resolution Remote Sensing Visual Question Answering) model, based on VisualBERT, a vision-language transformer, to effectively combine the multiple image modalities and text through a trainable fusion process. A preliminary experimental study shows promising results of our methodology on this challenging dataset, with an accuracy of 65.56% on the targeted VQA task. This pioneering work paves the way toward a new multi-modal, multi-resolution VQA task that can be applied in other imaging domains (such as medical imaging) where multi-modality can enrich the visual representation of a scene. The dataset and code are available at this https URL.
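Trainable fusion of three image modalities with text can be sketched as modality-specific projections into a shared token space ahead of a transformer encoder. The dimensions, the plain concatenation, and the generic encoder standing in for VisualBERT below are illustrative assumptions, not the MM-RSVQA architecture itself.

```python
# Sketch: project RGB, multi-spectral and SAR features into a shared space
# and concatenate them with text tokens for a transformer encoder.
import torch
import torch.nn as nn

class TrimodalFusion(nn.Module):
    def __init__(self, rgb_dim=2048, ms_dim=512, sar_dim=256, hidden=768):
        super().__init__()
        self.proj = nn.ModuleDict({
            "rgb": nn.Linear(rgb_dim, hidden),
            "ms": nn.Linear(ms_dim, hidden),
            "sar": nn.Linear(sar_dim, hidden),
        })
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for VisualBERT

    def forward(self, text_tokens, rgb, ms, sar):
        visual = [self.proj[k](x) for k, x in (("rgb", rgb), ("ms", ms), ("sar", sar))]
        tokens = torch.cat([text_tokens, *visual], dim=1)          # (B, T + Nv, hidden)
        return self.encoder(tokens)

if __name__ == "__main__":
    fusion = TrimodalFusion()
    out = fusion(torch.randn(2, 20, 768), torch.randn(2, 49, 2048),
                 torch.randn(2, 49, 512), torch.randn(2, 49, 256))
    print(out.shape)   # torch.Size([2, 167, 768])
```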
https://arxiv.org/abs/2505.15401
Recent advancements in reasoning have significantly enhanced the capabilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) across diverse tasks. However, excessive reliance on chain-of-thought (CoT) reasoning can impair model performance and bring unnecessarily lengthy outputs, reducing efficiency. Our work reveals that prolonged reasoning does not universally improve accuracy and can even degrade performance on simpler tasks. To address this, we propose Certainty-based Adaptive Reasoning (CAR), a novel framework that dynamically switches between short answers and long-form reasoning based on model perplexity. CAR first generates a short answer and evaluates its perplexity, triggering reasoning only when the model exhibits low confidence (i.e., high perplexity). Experiments across diverse multimodal VQA/KIE benchmarks and text reasoning datasets show that CAR outperforms both short-answer and long-form reasoning approaches, striking an optimal balance between accuracy and efficiency.
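The gating rule is straightforward: compute the perplexity of the short answer from its token log-probabilities and only fall back to long-form reasoning when it exceeds a threshold. A minimal sketch follows; the threshold value and the two model helpers are hypothetical placeholders.

```python
# Minimal sketch of certainty-based adaptive reasoning: answer briefly,
# check perplexity, and only trigger chain-of-thought when confidence is low.
import math

def perplexity(token_logprobs: list[float]) -> float:
    return math.exp(-sum(token_logprobs) / max(len(token_logprobs), 1))

def short_answer(question: str):                # hypothetical model call
    return "42", [-0.1, -0.3, -0.2]             # answer text + per-token logprobs

def long_reasoning(question: str) -> str:       # hypothetical CoT call
    return "<step-by-step reasoning ...> final answer: 42"

def answer(question: str, ppl_threshold: float = 1.5) -> str:
    text, logprobs = short_answer(question)
    if perplexity(logprobs) <= ppl_threshold:   # confident -> keep the short answer
        return text
    return long_reasoning(question)             # uncertain -> reason at length

if __name__ == "__main__":
    print(answer("What is 6 x 7?"))
```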
https://arxiv.org/abs/2505.15154