Multimodal language models (MLMs) based on the generative pre-trained Transformer are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects the specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs that are compatible with the MLM framework. Then, we propose an evaluation method that ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at this https URL.
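The abstract does not spell out the serialization format, but the core idea of making rotated-box outputs compatible with an autoregressive MLM is to flatten and quantize them into plain text. A minimal sketch of such a normalization (the polygon representation, the [0, 1000) coordinate binning, and the separator syntax are all assumptions, not LMMRotate's actual format):

```python
# Hypothetical serialization of oriented-box detections into plain text so an
# autoregressive MLM can emit them token by token. The real LMMRotate format
# may differ; binning coordinates to [0, 1000) is an assumption.
from typing import List, Tuple

Box = Tuple[str, List[float]]  # (category, [x1, y1, ..., x4, y4] polygon corners in pixels)

def boxes_to_text(dets: List[Box], img_w: int, img_h: int, bins: int = 1000) -> str:
    parts = []
    for category, poly in dets:
        coords = []
        for i, v in enumerate(poly):
            scale = img_w if i % 2 == 0 else img_h
            coords.append(str(min(bins - 1, max(0, int(v / scale * bins)))))
        parts.append(f"{category} " + " ".join(coords))
    return "; ".join(parts) if parts else "no objects"

def text_to_boxes(text: str, img_w: int, img_h: int, bins: int = 1000) -> List[Box]:
    dets: List[Box] = []
    if text.strip() == "no objects":
        return dets
    for chunk in text.split(";"):
        tokens = chunk.split()
        category, coords = tokens[0], tokens[1:9]
        poly = [int(c) / bins * (img_w if i % 2 == 0 else img_h) for i, c in enumerate(coords)]
        dets.append((category, poly))
    return dets

if __name__ == "__main__":
    dets = [("plane", [102.0, 40.5, 180.0, 60.2, 160.3, 120.0, 82.0, 100.1])]
    s = boxes_to_text(dets, img_w=1024, img_h=1024)
    print(s)                      # "plane 99 39 175 58 156 117 80 97"
    print(text_to_boxes(s, 1024, 1024))
```

The inverse parser lets predicted text be mapped back to boxes and scored with standard rotated-detection metrics.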
https://arxiv.org/abs/2501.09720
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, $\Delta$CLIP and $\Delta^2$LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of $\Delta$CLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, $\Delta^2$LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is this https URL.
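Training details are not given in the abstract; presumably the adversarial pre-training alternates an inner maximization that perturbs images against the contrastive objective with the usual outer minimization. A rough PGD-style sketch of that inner step, with toy encoders and an illustrative perturbation budget (none of this is confirmed by the paper):

```python
# Minimal PGD-style inner maximization against a CLIP-like contrastive loss,
# the kind of step adversarial vision-language pre-training would repeat at scale.
# Encoders here are toy stand-ins; epsilon and step sizes are illustrative only.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def pgd_images(image_encoder, images, txt_emb, eps=4 / 255, alpha=1 / 255, steps=3):
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = clip_contrastive_loss(image_encoder(adv), txt_emb)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()          # ascend the contrastive loss
        adv = images + (adv - images).clamp(-eps, eps)    # project back to the L-inf ball
        adv = adv.clamp(0, 1)
    return adv.detach()

if __name__ == "__main__":
    torch.manual_seed(0)
    image_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
    images = torch.rand(8, 3, 32, 32)
    txt_emb = torch.randn(8, 64)                          # stand-in for text-encoder output
    adv = pgd_images(image_encoder, images, txt_emb)
    # The outer loop would then minimize clip_contrastive_loss on the adversarial batch.
    print(clip_contrastive_loss(image_encoder(images), txt_emb).item(),
          clip_contrastive_loss(image_encoder(adv), txt_emb).item())
```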
https://arxiv.org/abs/2501.09446
Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene comprehension in safety-critical simulations, evident not only in improved VQA accuracies but also in emerging safety-aware driving maneuvers. In addition, the learning demonstrates strong transferability from simulation to real-world observation. Code and data will be publicly available at this https URL.
https://arxiv.org/abs/2501.09167
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal tasks, but their performance is often constrained by the lack of external knowledge integration, limiting their ability to handle knowledge-intensive tasks such as visual question answering and reasoning. To address this challenge, we propose a novel method, Adaptive Knowledge-Guided Pretraining for Large Vision-Language Models (AKGP-LVLM), which dynamically incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning. Our approach employs a knowledge encoder to represent external knowledge, a retrieval mechanism to select task-relevant information, and a dynamic adaptor to align multimodal and knowledge representations effectively. We evaluate our method on four benchmark datasets, demonstrating significant performance improvements over state-of-the-art models. Furthermore, human evaluations highlight the superior correctness and relevance of our model's outputs. Extensive analyses confirm the robustness, efficiency, and scalability of AKGP-LVLM, making it a compelling solution for real-world knowledge-intensive tasks.
https://arxiv.org/abs/2501.08597
Recent studies have revealed that modern image and video quality assessment (IQA/VQA) metrics are vulnerable to adversarial attacks. An attacker can manipulate a video through preprocessing to artificially increase its quality score according to a certain metric, despite no actual improvement in visual quality. Most of the attacks studied in the literature are white-box attacks, while black-box attacks in the context of VQA have received less attention. Moreover, some research indicates a lack of transferability of adversarial examples generated for one model to another when applied to VQA. In this paper, we propose a cross-modal attack method, IC2VQA, aimed at exploring the vulnerabilities of modern VQA models. This approach is motivated by the observation that the low-level feature spaces of images and videos are similar. We investigate the transferability of adversarial perturbations across different modalities; specifically, we analyze how adversarial perturbations generated on a white-box IQA model with an additional CLIP module can effectively target a VQA model. The addition of the CLIP module serves as a valuable aid in increasing transferability, as the CLIP model is known for its effective capture of low-level semantics. Extensive experiments demonstrate that IC2VQA achieves a high success rate in attacking three black-box VQA models. We compare our method with existing black-box attack strategies, highlighting its superiority in terms of attack success within the same number of iterations and levels of attack strength. We believe that the proposed method will contribute to the deeper analysis of robust VQA metrics.
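As a rough illustration of the cross-modal idea, the sketch below crafts a perturbation on a white-box IQA surrogate combined with a CLIP-feature term and then reuses it frame-wise on a video scored by a black-box VQA metric. The models are toy stand-ins, and the loss weighting, the sign of the CLIP term, and the budget are assumptions rather than IC2VQA's actual formulation:

```python
# Sketch of a cross-modal transfer attack: a perturbation is optimized on a white-box
# image-quality model plus a CLIP-like low-level feature term, then applied to every
# frame of a video fed to a black-box VQA metric. All components are toy stand-ins.
import torch
import torch.nn.functional as F

def craft_perturbation(iqa_model, clip_visual, frame, clean_feat,
                       eps=8 / 255, alpha=2 / 255, steps=10, lam=0.5):
    adv = frame.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        quality = iqa_model(adv).mean()                        # push the quality score up
        feat_shift = 1 - F.cosine_similarity(                  # assumed CLIP-feature term
            clip_visual(adv).flatten(1), clean_feat.flatten(1)).mean()
        loss = quality + lam * feat_shift
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = frame + (adv - frame).clamp(-eps, eps)
        adv = adv.clamp(0, 1)
    return (adv - frame).detach()                              # reusable perturbation

if __name__ == "__main__":
    torch.manual_seed(0)
    iqa_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
    clip_visual = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
    frame = torch.rand(1, 3, 64, 64)
    delta = craft_perturbation(iqa_model, clip_visual, frame, clip_visual(frame).detach())
    video = torch.rand(16, 3, 64, 64)
    adv_video = (video + delta).clamp(0, 1)                    # transferred to the black-box VQA input
    print(delta.abs().max().item())
```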
https://arxiv.org/abs/2501.08415
Remote sensing visual question answering (RSVQA) is a task that automatically extracts information from satellite images and processes a question to predict the answer from the images in textual form, helping with the interpretation of the image. While different methods have been proposed to extract information from optical images with different spectral bands and resolutions, no method has been proposed to answer questions from Synthetic Aperture Radar (SAR) images. SAR images capture electromagnetic information from the scene, and are less affected by atmospheric conditions, such as clouds. In this work, our objective is to introduce SAR in the RSVQA task, finding the best way to use this modality. In our research, we carry out a study on different pipelines for the task of RSVQA taking into account information from both SAR and optical data. To this purpose, we also present a dataset that allows for the introduction of SAR images in the RSVQA framework. We propose two different models to include the SAR modality. The first one is an end-to-end method in which we add an additional encoder for the SAR modality. In the second approach, we build on a two-stage framework. First, relevant information is extracted from SAR and, optionally, optical data. This information is then translated into natural language to be used in the second step which only relies on a language model to provide the answer. We find that the second pipeline allows us to obtain good results with SAR images alone. We then try various types of fusion methods to use SAR and optical images together, finding that a fusion at the decision level achieves the best results on the proposed dataset. We show that SAR data offers additional information when fused with the optical modality, particularly for questions related to specific land cover classes, such as water areas.
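The second, two-stage pipeline can be pictured as: extract structured evidence from the SAR (and optionally optical) input, verbalize it, and let a plain language model answer. A hedged sketch with hypothetical helpers (the actual information-extraction stage and the language model call are placeholders, not the paper's components):

```python
# Rough sketch of the two-stage SAR pipeline: stage 1 turns SAR-derived land-cover
# evidence into a short natural-language description; stage 2 lets a language model
# answer from that text alone. `answer_with_llm` is a hypothetical placeholder.
from typing import Dict

def describe_sar_evidence(land_cover_fractions: Dict[str, float]) -> str:
    """Verbalize per-class coverage fractions extracted from a SAR patch."""
    parts = [f"{name}: {frac:.0%} of the area" for name, frac in
             sorted(land_cover_fractions.items(), key=lambda kv: -kv[1]) if frac > 0.01]
    return "The SAR image shows " + "; ".join(parts) + "."

def answer_with_llm(description: str, question: str) -> str:
    """Placeholder for the second-stage language model (e.g., a fine-tuned seq2seq model)."""
    prompt = f"Context: {description}\nQuestion: {question}\nAnswer:"
    # A real system would query the language model here; we return the prompt for inspection.
    return prompt

if __name__ == "__main__":
    fractions = {"water": 0.42, "built-up": 0.31, "forest": 0.22, "bare soil": 0.05}
    desc = describe_sar_evidence(fractions)
    print(answer_with_llm(desc, "Is there a water area in the image?"))
```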
https://arxiv.org/abs/2501.08131
Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at this https URL.
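The parameter-inverted design pairs each resolution with a branch of inversely matched capacity. A toy sketch of that routing, with a trivial fusion layer standing in for the cross-branch interaction mechanism (branch widths and resolutions are illustrative; PIIP itself builds on pretrained ViTs/CNNs):

```python
# Minimal parameter-inverted pyramid sketch: the higher the input resolution, the
# smaller the branch that processes it, with a light fusion at the end.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_branch(width: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, width, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, 256))

class ParameterInvertedPyramid(nn.Module):
    def __init__(self):
        super().__init__()
        # resolutions sorted high -> low get widths sorted small -> large ("inverted")
        self.resolutions = (448, 224, 112)
        self.branches = nn.ModuleList([make_branch(w) for w in (32, 64, 128)])
        self.fuse = nn.Linear(256 * 3, 256)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = []
        for res, branch in zip(self.resolutions, self.branches):
            x = F.interpolate(image, size=(res, res), mode="bilinear", align_corners=False)
            feats.append(branch(x))
        return self.fuse(torch.cat(feats, dim=1))  # stand-in for cross-branch interaction

if __name__ == "__main__":
    model = ParameterInvertedPyramid()
    print(model(torch.rand(2, 3, 448, 448)).shape)  # torch.Size([2, 256])
```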
https://arxiv.org/abs/2501.07783
Visual Question Answering (VQA) is an interdisciplinary field that bridges the gap between computer vision (CV) and natural language processing (NLP), enabling Artificial Intelligence (AI) systems to answer questions about images. Since its inception in 2015, VQA has rapidly evolved, driven by advances in deep learning, attention mechanisms, and transformer-based models. This survey traces the journey of VQA from its early days, through major breakthroughs, such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods. We highlight key models, datasets, and techniques that shaped the development of VQA systems, emphasizing the pivotal role of transformer architectures and multimodal pre-training in driving recent progress. Additionally, we explore specialized applications of VQA in domains like healthcare and discuss ongoing challenges, such as dataset bias, model interpretability, and the need for common-sense reasoning. Lastly, we discuss the emerging trends in large multimodal language models and the integration of external knowledge, offering insights into the future directions of VQA. This paper aims to provide a comprehensive overview of the evolution of VQA, highlighting both its current state and potential advancements.
https://arxiv.org/abs/2501.07109
With the rapid development of multimedia processing and deep learning technologies, especially in the field of video understanding, video quality assessment (VQA) has achieved significant progress. Although researchers have moved from designing efficient video quality mapping models to various other research directions, in-depth exploration of the effectiveness-efficiency trade-offs of spatio-temporal modeling in VQA models is still insufficient. Considering the fact that videos contain highly redundant information, this paper investigates this problem from the perspective of joint spatial and temporal sampling, aiming to answer how little information we need to keep when feeding videos into VQA models while sacrificing only an acceptable amount of performance. To this end, we drastically sample the video's information from both the spatial and temporal dimensions, and the heavily squeezed video is then fed into a stable VQA model. Comprehensive experiments on joint spatial and temporal sampling are conducted on six public video quality databases, and the results demonstrate the acceptable performance of the VQA model even when most of the video information is thrown away. Furthermore, with the proposed joint spatial and temporal sampling strategy, we make an initial attempt to design an online VQA model, instantiated with a spatial feature extractor, a temporal feature fusion module, and a global quality regression module that are kept as simple as possible. Through quantitative and qualitative experiments, we verify the feasibility of the online VQA model by simplifying the model and reducing its input.
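Conceptually, the joint sampling amounts to keeping a handful of uniformly spaced frames and shrinking each one before the VQA model sees it. A minimal sketch (the keep ratios are illustrative, not the paper's operating points):

```python
# Sketch of joint spatio-temporal sampling: keep a few uniformly spaced frames and
# downsample each kept frame before feeding a VQA model.
import torch
import torch.nn.functional as F

def sample_video(video: torch.Tensor, num_frames: int = 8, spatial_scale: float = 0.25):
    """video: (T, C, H, W) in [0, 1]. Returns a heavily squeezed clip."""
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, steps=min(num_frames, t)).round().long()  # temporal sampling
    clip = video[idx]
    h = max(1, int(video.shape[2] * spatial_scale))
    w = max(1, int(video.shape[3] * spatial_scale))
    return F.interpolate(clip, size=(h, w), mode="bilinear", align_corners=False)

if __name__ == "__main__":
    video = torch.rand(120, 3, 540, 960)          # ~5 s at 24 fps
    clip = sample_video(video)
    kept = clip.numel() / video.numel()
    print(clip.shape, f"{kept:.2%} of the original pixels kept")  # (8, 3, 135, 240), ~0.42%
```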
https://arxiv.org/abs/2501.07087
Multi-modal large language models (MLLMs) have achieved remarkable success in image- and region-level remote sensing (RS) image understanding tasks, such as image captioning, visual question answering, and visual grounding. However, existing RS MLLMs lack the pixel-level dialogue capability, which involves responding to user instructions with segmentation masks for specific instances. In this paper, we propose GeoPix, a RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM's segmentation token embeddings. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor to capture and store class-wise geo-context at the instance level across the entire dataset. In addition, to address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset, comprising 65,463 images and 140,412 instances, with each instance annotated with text descriptions, bounding boxes, and masks. Furthermore, we develop a two-stage training strategy to balance the distinct requirements of text generation and mask prediction in multi-modal multi-task optimization. Extensive experiments verify the effectiveness and superiority of GeoPix in pixel-level segmentation tasks, while also maintaining competitive performance in image- and region-level benchmarks.
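A toy sketch of the mask-predictor idea: the LLM's segmentation-token embedding is projected into the visual feature space and correlated with per-pixel features to produce an instance mask. Dimensions are arbitrary, and GeoPix's class-wise learnable memory module is omitted:

```python
# Toy mask predictor conditioned on an LLM segmentation-token embedding. GeoPix's
# actual predictor is more elaborate; this only illustrates the conditioning idea.
import torch
import torch.nn as nn

class SegTokenMaskPredictor(nn.Module):
    def __init__(self, vis_dim=256, llm_dim=4096):
        super().__init__()
        self.query_proj = nn.Linear(llm_dim, vis_dim)
        self.pixel_proj = nn.Conv2d(vis_dim, vis_dim, kernel_size=1)

    def forward(self, vis_feat: torch.Tensor, seg_token: torch.Tensor) -> torch.Tensor:
        """vis_feat: (B, C, H, W) from the vision encoder; seg_token: (B, llm_dim)."""
        query = self.query_proj(seg_token)                      # (B, C)
        pixels = self.pixel_proj(vis_feat)                      # (B, C, H, W)
        logits = torch.einsum("bc,bchw->bhw", query, pixels)    # per-pixel correlation
        return logits.sigmoid()                                 # soft instance mask

if __name__ == "__main__":
    predictor = SegTokenMaskPredictor()
    mask = predictor(torch.rand(2, 256, 64, 64), torch.rand(2, 4096))
    print(mask.shape)  # torch.Size([2, 64, 64])
```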
https://arxiv.org/abs/2501.06828
Previous studies have pointed out that visual question answering (VQA) models are prone to relying on language priors for answer predictions. In this context, predictions often depend on linguistic shortcuts rather than a comprehensive grasp of multimodal knowledge, which diminishes their generalization ability. In this paper, we propose a novel method, namely, KDAR, leveraging knowledge distillation to address the prior-dependency dilemmas within the VQA task. Specifically, the regularization effect facilitated by soft labels from a well-trained teacher is employed to penalize overfitting to the most common answers. The soft labels, which serve a regularization role, also provide semantic guidance that narrows the range of candidate answers. Additionally, we design an adaptive sample-wise reweighting learning strategy to further mitigate bias by dynamically adjusting the importance of each sample. Experimental results demonstrate that our method enhances performance in both OOD and IID settings. Our method achieves state-of-the-art performance on the VQA-CPv2 out-of-distribution (OOD) benchmark, significantly outperforming previous state-of-the-art approaches.
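The two ingredients can be sketched as a teacher-distilled soft-label regularizer plus a per-sample reweighting of the standard answer-classification loss. The specific reweighting rule below (down-weighting samples the teacher already answers confidently) is an assumption for illustration, not KDAR's exact scheme:

```python
# Sketch of a KD-regularized VQA loss with adaptive per-sample reweighting.
import torch
import torch.nn.functional as F

def kdar_style_loss(student_logits, teacher_logits, targets, tau=2.0, beta=0.5):
    # Hard-label loss, reweighted per sample (assumed rule: samples the teacher finds
    # easy, i.e. likely prior-dominated, contribute less).
    ce = F.cross_entropy(student_logits, targets, reduction="none")
    with torch.no_grad():
        teacher_prob = F.softmax(teacher_logits, dim=-1)
        weights = 1.0 - teacher_prob.gather(1, targets.unsqueeze(1)).squeeze(1)
        weights = weights / weights.mean().clamp_min(1e-6)
    hard = (weights * ce).mean()
    # Soft-label regularization distilled from the teacher's distribution.
    soft = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau * tau
    return hard + beta * soft

if __name__ == "__main__":
    torch.manual_seed(0)
    student = torch.randn(4, 3129, requires_grad=True)   # 3129 ~ a typical VQA answer vocabulary
    teacher = torch.randn(4, 3129)
    targets = torch.randint(0, 3129, (4,))
    loss = kdar_style_loss(student, teacher, targets)
    loss.backward()
    print(loss.item())
```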
https://arxiv.org/abs/2501.05690
Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python codes to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus largely improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and reasons why ReFocus can improve the performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and prove that such visual chain-of-thought with intermediate information offers a better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and a 2.6% gain over CoT.
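The "visual thoughts" boil down to small image-editing operations that the model invokes through generated code. A sketch of what such tools might look like with PIL (the exact tool API used by ReFocus may differ):

```python
# Illustrative image-editing tools of the kind ReFocus would have the model call via
# generated code: draw a box, highlight a region, or mask one out.
from PIL import Image, ImageDraw

def draw_box(img: Image.Image, box, color="red", width=4) -> Image.Image:
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=width)
    return out

def highlight_region(img: Image.Image, box, color=(255, 255, 0), alpha=0.35) -> Image.Image:
    out = img.convert("RGBA")
    overlay = Image.new("RGBA", out.size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).rectangle(box, fill=color + (int(255 * alpha),))
    return Image.alpha_composite(out, overlay).convert("RGB")

def mask_out(img: Image.Image, box, color=(255, 255, 255)) -> Image.Image:
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, fill=color)
    return out

if __name__ == "__main__":
    table_img = Image.new("RGB", (640, 480), "white")
    # e.g. focus on one table column, then hide an irrelevant one
    step1 = highlight_region(table_img, (100, 40, 220, 440))
    step2 = mask_out(step1, (400, 40, 520, 440))
    draw_box(step2, (100, 40, 220, 440)).save("refocus_demo.png")
```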
https://arxiv.org/abs/2501.05452
This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large-language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
https://arxiv.org/abs/2501.05069
In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling us to leverage the complementary strengths of each projector. We observe that different visual projectors exhibit distinct characteristics when handling specific tasks. For instance, some projectors excel at capturing static details, while others are more effective at processing temporal information, and some are better suited for tasks requiring temporal coherence. By dynamically adjusting feature weights according to user instructions, LLaVA-Octopus selects and combines the most suitable features, significantly enhancing the model's performance in multimodal tasks. Experimental results demonstrate that LLaVA-Octopus achieves excellent performance across multiple benchmarks, especially in tasks such as multimodal understanding, visual question answering, and video understanding, highlighting its broad application potential.
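The adaptive weighting can be sketched as a small router that maps an instruction embedding to softmax weights over the projectors' outputs. Dimensions and the router design below are assumptions, not LLaVA-Octopus's actual architecture:

```python
# Sketch of instruction-conditioned fusion over several visual projectors.
import torch
import torch.nn as nn

class AdaptiveProjectorFusion(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, instr_dim=4096, num_projectors=3):
        super().__init__()
        self.projectors = nn.ModuleList(
            [nn.Linear(vis_dim, llm_dim) for _ in range(num_projectors)])
        self.router = nn.Linear(instr_dim, num_projectors)

    def forward(self, vis_tokens: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        """vis_tokens: (B, N, vis_dim); instr_emb: (B, instr_dim) pooled instruction features."""
        weights = self.router(instr_emb).softmax(dim=-1)                           # (B, P)
        projected = torch.stack([p(vis_tokens) for p in self.projectors], dim=1)   # (B, P, N, D)
        return (weights[:, :, None, None] * projected).sum(dim=1)                  # (B, N, D)

if __name__ == "__main__":
    fusion = AdaptiveProjectorFusion()
    out = fusion(torch.rand(2, 576, 1024), torch.rand(2, 4096))
    print(out.shape)  # torch.Size([2, 576, 4096])
```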
https://arxiv.org/abs/2501.05067
Chart interpretation is crucial for visual data analysis, but accurately extracting information from charts poses significant challenges for automated models. This study investigates the fine-tuning of DEPLOT, a modality conversion module that translates the image of a plot or chart to a linearized table, on a custom dataset of 50,000 bar charts. The dataset comprises simple, stacked, and grouped bar charts, targeting the unique structural features of these visualizations. The fine-tuned DEPLOT model is evaluated against its base version using a test set of 1,000 images and two metrics: Relative Mapping Similarity (RMS), which measures categorical mapping accuracy, and Relative Number Set Similarity (RNSS), which evaluates numerical interpretation accuracy. To further explore the reasoning capabilities of large language models (LLMs), we curate an additional set of 100 bar chart images paired with question-answer sets. Our findings demonstrate that providing a structured intermediate table alongside the image significantly enhances LLM reasoning performance compared to direct image queries.
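For intuition, an RNSS-like score can be sketched as a minimum-cost matching between predicted and ground-truth numbers, scored by relative error; the implementation below follows that general idea but is not the paper's exact metric definition:

```python
# Hedged sketch of a Relative Number Set Similarity-like score.
import numpy as np
from scipy.optimize import linear_sum_assignment

def rnss_like(pred, target):
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    if len(pred) == 0 or len(target) == 0:
        return float(len(pred) == len(target))
    # relative distance in [0, 1] between every predicted and target number
    denom = np.maximum(np.abs(target)[None, :], 1e-9)
    rel = np.minimum(1.0, np.abs(pred[:, None] - target[None, :]) / denom)
    rows, cols = linear_sum_assignment(rel)           # best one-to-one matching
    matched_sim = (1.0 - rel[rows, cols]).sum()
    return matched_sim / max(len(pred), len(target))  # penalize missing/extra numbers

if __name__ == "__main__":
    print(rnss_like([10, 21, 29], [10, 20, 30]))   # high: small relative errors
    print(rnss_like([10, 21], [10, 20, 30]))       # lower: one value missing
```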
https://arxiv.org/abs/2501.04675
Large vision-language models (LVLMs) augment language models with visual understanding, enabling multimodal reasoning. However, due to the modality gap between textual and visual data, they often face significant challenges, such as over-reliance on text priors, hallucinations, and limited capacity for complex visual reasoning. Existing benchmarks to evaluate visual reasoning in LVLMs often rely on schematic or synthetic images and on imprecise machine-generated explanations. To bridge the modality gap, we present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios. It offers 3,931 expert-crafted multiple-choice problems and interleaved explanations grounded with entities relevant to the reasoning process. We leverage this dataset to perform an extensive study of LVLMs' ability to reason about complex visual scenarios. Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings. We investigate training strategies that leverage relevant entities to improve visual reasoning. Notably, we observe a performance boost of up to 7% when reasoning over image tokens of cropped regions tied to these entities.
https://arxiv.org/abs/2501.04671
Artificial Intelligence is revolutionizing medical practice, enhancing diagnostic accuracy and healthcare delivery. However, its adaptation in medical settings still faces significant challenges related to data availability and privacy constraints. Synthetic data has emerged as a promising solution to mitigate these issues, addressing data scarcity while preserving privacy. Recently, Latent Diffusion Models have emerged as a powerful tool for generating high-quality synthetic data. Meanwhile, the integration of different modalities has gained interest, emphasizing the need for models capable of handling multimodal medical data. Current approaches struggle to integrate complementary information and lack the ability to generate modalities simultaneously. To address this challenge, we present MedCoDi-M, a 6.77-billion-parameter model designed for multimodal medical data generation that, following the foundation model paradigm, exploits contrastive learning and a large quantity of data to build a shared latent space which captures the relationships between different data modalities. Further, we introduce the Multi-Prompt training technique, which significantly boosts MedCoDi-M's generation under different settings. We extensively validate MedCoDi-M: first we benchmark it against five competitors on the MIMIC-CXR dataset, a state-of-the-art dataset for Chest X-ray and radiological report generation. Secondly, we perform a Visual Turing Test with expert radiologists to assess the realism and clinical relevance of the generated data, ensuring alignment with real-world scenarios. Finally, we assess the utility of MedCoDi-M in addressing key challenges in the medical field, such as anonymization, data scarcity and imbalanced learning. The results are promising, demonstrating the applicability of MedCoDi-M in medical contexts. Project page is at this https URL.
https://arxiv.org/abs/2501.04614
Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Supervision-free Visual Projection), a novel framework that enhances vision-language alignment without relying on curated data or preference annotation. SVP leverages self-captioning and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements, including a 14% average improvement in captioning tasks, an increase of up to 12% in object recall, and a substantial reduction in hallucination rates. Notably, a small VLM using SVP achieves hallucination reductions comparable to a model five times larger, while a VLM with initially poor referring capabilities more than doubles its performance, approaching parity with a model twice its size.
https://arxiv.org/abs/2501.04568
Visual Question Answering (VQA) is an evolving research field aimed at enabling machines to answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text embedding, natural language understanding, and language generation. With the growth of multimodal data research, VQA has gained significant attention due to its broad applications, including interactive educational tools, medical image diagnosis, customer service, entertainment, and social media captioning. Additionally, VQA plays a vital role in assisting visually impaired individuals by generating descriptive content from images. This survey introduces a taxonomy of VQA architectures, categorizing them based on design choices and key components to facilitate comparative analysis and evaluation. We review major VQA approaches, focusing on deep learning-based methods, and explore the emerging field of Large Visual Language Models (LVLMs) that have demonstrated success in multimodal tasks like VQA. The paper further examines available datasets and evaluation metrics essential for measuring VQA system performance, followed by an exploration of real-world VQA applications. Finally, we highlight ongoing challenges and future directions in VQA research, presenting open questions and potential areas for further development. This survey serves as a comprehensive resource for researchers and practitioners interested in the latest advancements and future directions of VQA research.
https://arxiv.org/abs/2501.03939
Zero-shot anomaly detection (ZSAD) identifies anomalies without needing training samples from the target dataset, essential for scenarios with privacy concerns or limited data. Vision-language models like CLIP show potential in ZSAD but have limitations: relying on manually crafted fixed textual descriptions or anomaly prompts is time-consuming and prone to semantic ambiguity, and CLIP struggles with pixel-level anomaly segmentation, focusing more on global semantics than local details. To address these limitations, we introduce KAnoCLIP, a novel ZSAD framework that leverages vision-language models. KAnoCLIP combines general knowledge from a Large Language Model (GPT-3.5) and fine-grained, image-specific knowledge from a Visual Question Answering system (Llama3) via Knowledge-Driven Prompt Learning (KnPL). KnPL uses a knowledge-driven (KD) loss function to create learnable anomaly prompts, removing the need for fixed text prompts and enhancing generalization. KAnoCLIP includes the CLIP visual encoder with V-V attention (CLIP-VV), Bi-Directional Cross-Attention for Multi-Level Cross-Modal Interaction (Bi-CMCI), and Conv-Adapter. These components preserve local visual semantics, improve local cross-modal fusion, and align global visual features with textual information, enhancing pixel-level anomaly detection. KAnoCLIP achieves state-of-the-art performance in ZSAD across 12 industrial and medical datasets, demonstrating superior generalization compared to existing methods.
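Knowledge-driven prompt learning can be pictured as optimizing learnable anomaly-prompt embeddings toward frozen features of LLM/VQA-generated anomaly descriptions instead of hand-writing fixed prompts. The sketch below uses a stand-in text encoder and an assumed cosine-based KD loss; KAnoCLIP's actual components (CLIP-VV, Bi-CMCI, Conv-Adapter) are not modeled:

```python
# Rough sketch of knowledge-driven prompt learning: learnable anomaly-prompt embeddings
# are pulled toward frozen features of generated anomaly descriptions via a KD-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableAnomalyPrompt(nn.Module):
    def __init__(self, num_tokens=8, embed_dim=512):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)
        self.encode = nn.Linear(embed_dim, embed_dim)   # stand-in for a frozen text encoder
        for p in self.encode.parameters():
            p.requires_grad_(False)

    def forward(self) -> torch.Tensor:
        return F.normalize(self.encode(self.tokens).mean(dim=0), dim=-1)  # pooled prompt feature

def kd_prompt_loss(prompt_feat, knowledge_feats):
    """Pull the learnable prompt toward LLM/VQA-derived description features."""
    knowledge_feats = F.normalize(knowledge_feats, dim=-1)
    return (1 - knowledge_feats @ prompt_feat).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    prompt = LearnableAnomalyPrompt()
    knowledge = torch.randn(5, 512)         # frozen features of generated anomaly descriptions
    opt = torch.optim.Adam([prompt.tokens], lr=1e-2)
    for _ in range(50):
        loss = kd_prompt_loss(prompt(), knowledge)
        opt.zero_grad(); loss.backward(); opt.step()
    print(loss.item())
```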
https://arxiv.org/abs/2501.03786