This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
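The staged, memory-coupled control flow described in this abstract can be sketched in a few lines. Everything below is a hypothetical stand-in for MoReVQA's few-shot-prompted stages (the event parser, grounder, and reasoner are toy string heuristics here), meant only to show how interpretable intermediate outputs accumulate in an external memory:

```python
# Toy sketch of a three-stage, training-free videoQA pipeline in the spirit of
# MoReVQA. All stage logic is an invented stand-in, not the authors' code.

def event_parser(question, memory):
    """Stage 1: parse the question into event phrases and a temporal relation."""
    q = question.rstrip("?").lower()
    for marker in (" before ", " after ", " while "):
        if marker in q:
            left, right = q.split(marker, 1)
            memory["events"] = [left.strip(), right.strip()]
            memory["relation"] = marker.strip()
            return memory
    memory["events"], memory["relation"] = [q], None
    return memory

def grounding(frame_captions, memory):
    """Stage 2: ground each event to frame indices (substring match as a stub)."""
    memory["grounded"] = {
        event: [i for i, cap in enumerate(frame_captions) if event.split()[-1] in cap]
        for event in memory["events"]
    }
    return memory

def reasoning(memory):
    """Stage 3: answer from the grounded evidence accumulated in memory."""
    return "yes" if all(memory["grounded"].values()) else "unsure"

memory = {}
frames = ["a man opens a door", "the man sits on a chair", "he reads a book"]
memory = event_parser("Does he sit before he reads?", memory)
memory = grounding(frames, memory)
answer = reasoning(memory)
```

Each stage reads and writes the shared `memory` dict, so its intermediate output can be inspected before the next stage runs, which is the interpretability property the abstract emphasizes.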
https://arxiv.org/abs/2404.06511
Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose an \textit{OmniFusion} model based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design principles for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.) and their fusion approaches, image encoding methods (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show the top score for the best OmniFusion setup across different VQA tasks in comparison with open-source LLaVA-like solutions. We also present a variety of situations where OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equation recognition, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at this https URL.
https://arxiv.org/abs/2404.06212
Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal problems. Therefore we pose the VQA problem from a unified perspective and evaluate a single system on a varied suite of VQA tasks including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying the LLM+tools approach using the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical MultiModal React. We start from a multimodal ReAct-based system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances the compositionality of the LLM+tools approach, which we show to be critical for obtaining high accuracy on generic VQA. Concretely, on our generic VQA suite, HAMMR outperforms the naive LLM+tools approach by 19.5%. Additionally, HAMMR achieves state-of-the-art results on this task, outperforming the generic standalone PaLI-X VQA model by 5.0%.
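The hierarchical dispatch that distinguishes HAMMR from a flat LLM+tools agent can be illustrated with a toy registry in which any agent may call upon another specialized agent. The agent names and stub tools below are invented for illustration; the real system drives each agent with ReAct-style LLM prompting:

```python
# Toy sketch of hierarchical agent dispatch in the spirit of HAMMR.
AGENTS = {}

def agent(name):
    def register(fn):
        AGENTS[name] = fn
        return fn
    return register

def call(name, query):
    """Route a query to a registered specialized agent."""
    return AGENTS[name](query)

@agent("ocr")
def ocr_agent(query):
    # Stub tool: pretend to read the text on the named object.
    return {"sign": "STOP"}.get(query, "")

@agent("counting")
def counting_agent(query):
    # Stub tool: pretend to count the named object class.
    return {"cars": 3}.get(query, 0)

@agent("generalist")
def generalist_agent(question):
    # Top-level agent: instead of juggling the combined toolset itself,
    # it delegates sub-questions to specialized agents.
    q = question.lower()
    if "say" in q:
        return call("ocr", "sign")
    if "how many" in q:
        return call("counting", "cars")
    return "unknown"

result = generalist_agent("What does the sign say?")
```

Keeping each specialist's toolset small is the compositionality argument the abstract makes against handing one agent the union of all tools.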
https://arxiv.org/abs/2404.05465
Introduction: Video Quality Assessment (VQA) is one of the important areas of study in this modern era, where video is a crucial component of communication with applications in every field. Rapid developments in mobile technology have enabled anyone to create videos, resulting in a varied range of video quality scenarios. Objectives: Though VQA has been around for some time with classical metrics like SSIM and PSNR, the advent of machine learning has brought in new VQA techniques built upon Convolutional Neural Networks (CNNs) or Deep Neural Networks (DNNs). Methods: Over the past years, research studies such as BVQA, which performed video quality assessment of nature-based videos using DNNs, exposed the powerful capabilities of machine learning algorithms. BVQA using DNNs explored human visual system effects such as content dependency and time-related factors, normally known as temporal effects. Results: This study explores the effect of sharpness on models like BVQA. Sharpness is a measure of the clarity and detail of the video image; measuring it typically involves analyzing the edges and contrast of the image to determine the overall level of detail. Conclusion: This study uses existing video quality databases such as CVD2014. A comparative study of machine learning parameters such as SRCC and PLCC during training and testing is presented along with the conclusion.
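As a concrete example of the edge/contrast analysis mentioned in the Results section, one widely used sharpness proxy is the variance of the Laplacian response. This is a generic sketch of that family of measures, not necessarily the exact one used in the study:

```python
# Variance-of-Laplacian sharpness proxy on a toy grayscale image
# (2-D list of pixel values). Higher variance => stronger edges => sharper.

def laplacian_variance(img):
    h, w = len(img), len(img[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # 4-neighbour discrete Laplacian at (x, y)
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

sharp = [[0, 0, 255, 255]] * 4    # hard vertical edge
blurry = [[0, 85, 170, 255]] * 4  # smooth ramp, no sharp edge
```

On the hard edge the Laplacian responses alternate strongly, while on the smooth ramp they cancel to zero, so the variance cleanly separates the two cases.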
https://arxiv.org/abs/2404.05764
Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision, we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task, treat the LLM as a policy, and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection, compositional visual question answering, and image-text retrieval, and show that in each case, the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: this https URL
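For the object detection task, the improvised coarse reward could look like the following sketch: existing box annotations give an IoU reward, and only sampled programs whose executed output clears a threshold are kept for the next round of reinforced self-training. Program sampling and execution are stubbed out, and all names are illustrative:

```python
# Sketch of a coarse IoU reward and the filtering step of reinforced
# self-training (invented stand-ins; not the paper's implementation).

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix = max(0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0, min(ay1, by1) - max(ay0, by0))
    inter = ix * iy
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

def select_for_self_training(samples, gt_box, threshold=0.5):
    """Keep sampled programs whose executed box earns a high enough reward."""
    return [prog for prog, box in samples if iou(box, gt_box) >= threshold]

gt = (10, 10, 50, 50)
samples = [("prog_a", (12, 12, 48, 48)),  # near miss, high IoU
           ("prog_b", (60, 60, 90, 90))]  # disjoint, zero IoU
kept = select_for_self_training(samples, gt)
```

The kept (question, program) pairs would then form the fine-tuning set for the next iteration of the LLM policy.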
https://arxiv.org/abs/2404.04627
The chain-of-thought technique has been well received in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also models each step as a reasoning aggregation graph, to cope with the multiple aspects of thinking overlooked by single-step reasoning. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results on several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.
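The step-level aggregation idea can be sketched as a weighted combination of per-aspect prompt vectors into a single step vector that then flows to the next step. The vectors and fixed weights below are toy values; AGoT learns the aggregation, so treat this only as a picture of the operation:

```python
# Toy aggregation of several "aspect" prompt vectors into one reasoning-step
# vector (fixed weights stand in for AGoT's learned aggregation).

def aggregate_step(aspect_vectors, weights):
    """Weighted mean of per-aspect vectors -> one step vector."""
    total = sum(weights)
    dim = len(aspect_vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, aspect_vectors)) / total
            for i in range(dim)]

# e.g. object / attribute / relation views of the same reasoning step
aspects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
step = aggregate_step(aspects, weights=[2.0, 1.0, 1.0])
```

Chaining such aggregated step vectors is what the abstract calls turning reasoning into "prompt aggregation and prompt flow operations".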
https://arxiv.org/abs/2404.04538
Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA), particularly in object-oriented perception tasks which demand fine-grained understanding of object identities, locations or attributes, as indicated by empirical findings. This is mainly due to their limited capability to effectively integrate complex visual cues with textual information and potential object hallucinations. In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception. VTPrompt merges visual and text prompts to extract key concepts from textual questions and employs a detection model to highlight relevant objects as visual prompts in images. The processed images alongside text prompts are subsequently fed into MLLMs to produce more accurate answers. Our experiments with GPT-4V and Gemini Pro, on three benchmarks, i.e., MME , MMB and POPE, demonstrate significant improvements. Particularly, our method led to a score improvement of up to 183.5 for GPT-4V on MME and enhanced MMB performance by 8.17\% for GPT-4V and 15.69\% for Gemini Pro.
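A rough sketch of the VTPrompt flow, with the concept extractor and detector stubbed out (the object names, boxes, and prompt format are invented for illustration; the real method uses a detection model and marks regions in the image itself):

```python
# Toy VTPrompt-style pipeline: question -> key concepts -> detected boxes
# -> joint visual+text prompt for the MLLM. All stubs are illustrative.

def extract_key_concepts(question, vocabulary):
    """Stub for the key-concept extraction step."""
    return [w for w in question.lower().rstrip("?").split() if w in vocabulary]

def detect(image_objects, concepts):
    """Stub detector: return boxes for concepts present in the image."""
    return {c: image_objects[c] for c in concepts if c in image_objects}

def build_prompt(question, boxes):
    """Fold the highlighted regions into the prompt handed to the MLLM."""
    marks = "; ".join(f"{name} highlighted at {box}" for name, box in boxes.items())
    return f"{question} [visual prompt: {marks}]"

image_objects = {"cup": (40, 60, 80, 100), "laptop": (10, 10, 200, 150)}
concepts = extract_key_concepts("Where is the cup?", {"cup", "laptop"})
prompt = build_prompt("Where is the cup?", detect(image_objects, concepts))
```

The point of the design is that the MLLM receives both modalities of guidance: the question's key concepts in text and the corresponding regions marked visually.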
https://arxiv.org/abs/2404.04514
The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific tasks of VRDU such as document classification (DC), key entity extraction (KEE), entity linking, visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse annotations such that they support one or two co-related tasks (e.g., entity extraction and entity linking). Unfortunately, focusing only on a single specific type of document or task is not representative of how documents often need to be processed in the wild, where variety in style and requirements is expected. In this paper, we introduce BuDDIE (Business Document Dataset for Information Extraction), the first multi-task dataset of 1,665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in their style and layout across states and types (e.g., forms, certificates, reports, etc.). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and large language model approaches to VRDU.
https://arxiv.org/abs/2404.04003
Traditional machine learning models often require powerful hardware, making them unsuitable for deployment on resource-limited devices. Tiny Machine Learning (tinyML) has emerged as a promising approach for running machine learning models on these devices, but integrating multiple data modalities into tinyML models still remains a challenge due to increased complexity, latency, and power consumption. This paper proposes TinyVQA, a novel multimodal deep neural network for visual question answering tasks that can be deployed on resource-constrained tinyML hardware. TinyVQA leverages a supervised attention-based model to learn how to answer questions about images using both vision and language modalities. Distilled knowledge from the supervised attention-based VQA model trains the memory-aware compact TinyVQA model, and a low bit-width quantization technique is employed to further compress the model for deployment on tinyML devices. The TinyVQA model was evaluated on the FloodNet dataset, which is used for post-disaster damage assessment. The compact model achieved an accuracy of 79.5%, demonstrating the effectiveness of TinyVQA for real-world applications. Additionally, the model was deployed on a Crazyflie 2.0 drone equipped with an AI deck and GAP8 microprocessor. The TinyVQA model achieved a low latency of 56 ms and consumed 693 mW of power while deployed on the tiny drone, showcasing its suitability for resource-constrained embedded systems.
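The low bit-width quantization step can be illustrated with a minimal symmetric uniform quantizer; TinyVQA's exact scheme is not specified in the abstract, so treat this as a generic sketch of the technique:

```python
# Minimal symmetric per-tensor uniform quantization to a given bit width.
# Generic illustration of low bit-width compression, not TinyVQA's recipe.

def quantize(weights, bits=4):
    """Map float weights to signed integers in [-qmax, qmax] plus a scale."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.7, -0.35, 0.1, -0.05]
q, s = quantize(w, bits=4)
w_hat = dequantize(q, s)
```

Storing 4-bit integers plus one scale instead of 32-bit floats is an ~8x size reduction, which is the kind of saving that makes deployment on a GAP8-class microcontroller plausible; the reconstruction error per weight is bounded by half the scale.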
https://arxiv.org/abs/2404.03574
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the complexities of videos. Building upon the success of MiniGPT-v2, which excelled in translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks, this paper extends the model's capabilities to process a sequence of frames, enabling it to comprehend videos. MiniGPT4-Video not only considers visual content but also incorporates textual conversations, allowing the model to effectively answer queries involving both visual and text components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks respectively. Our models and code have been made publicly available at this https URL.
https://arxiv.org/abs/2404.03413
Recent advancements in Computer Assisted Diagnosis have shown promising performance in medical imaging tasks, particularly in chest X-ray analysis. However, the interaction between these models and radiologists has been primarily limited to input images. This work proposes a novel approach to enhance human-computer interaction in chest X-ray analysis, using Vision-Language Models (VLMs) augmented with radiologists' attention by incorporating eye gaze data alongside textual prompts. Our approach leverages heatmaps generated from eye gaze data, overlaying them onto medical images to highlight areas of intense radiologist focus during chest X-ray evaluation. We evaluate this methodology in tasks such as visual question answering, chest X-ray report automation, error detection, and differential diagnosis. Our results demonstrate that the inclusion of eye gaze information significantly enhances the accuracy of chest X-ray analysis. Also, the impact of eye gaze on fine-tuning was confirmed, as it outperformed other medical VLMs in all tasks except visual question answering. This work marks the potential of leveraging both the VLM's capabilities and the radiologist's domain knowledge to improve the capabilities of AI models in medical imaging, paving a novel way for Computer Assisted Diagnosis with a human-centred AI.
https://arxiv.org/abs/2404.02370
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
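The scoring rule itself is easy to state in code: VQAScore is the probability the VQA model assigns to "Yes" for the templated question. The model below is a toy bag-of-words stub returning answer logits (the paper uses off-the-shelf and in-house VQA models), so only the formula, not the model, should be read as the method:

```python
# Sketch of the VQAScore formula: P("Yes" | image, templated question).
# The VQA model here is an invented stub; real runs use a trained model.
import math

def softmax(logits):
    m = max(logits.values())
    exp = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

def vqa_model(image_desc, question):
    # Hypothetical stub: scores "Yes" higher when the quoted text's words
    # appear in a toy textual stand-in for the image.
    text = question.split("'")[1]
    overlap = sum(w in image_desc for w in text.split())
    return {"Yes": float(overlap), "No": 1.0}

def vqascore(image_desc, text):
    question = f"Does this figure show '{text}'?"
    return softmax(vqa_model(image_desc, question))["Yes"]

aligned = vqascore("the horse is eating the grass", "the horse is eating the grass")
```

Note that this stub is itself a bag of words and would score the reversed prompt identically; the paper's claim is precisely that a capable VQA model, unlike CLIP's text encoder, does distinguish "the horse is eating the grass" from "the grass is eating the horse".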
https://arxiv.org/abs/2404.01291
Localization plays a crucial role in enhancing the practicality and precision of VQA systems. By enabling fine-grained identification and interaction with specific parts of an object, it significantly improves the system's ability to provide contextually relevant and spatially accurate responses, crucial for applications in dynamic environments like robotics and augmented reality. However, traditional systems face challenges in accurately mapping objects within images to generate nuanced and spatially aware responses. In this work, we introduce "Detect2Interact", which addresses these challenges by introducing an advanced approach for fine-grained object visual key field detection. First, we use the segment anything model (SAM) to generate detailed spatial maps of objects in images. Next, we use Vision Studio to extract semantic object descriptions. Third, we employ GPT-4's common sense knowledge, bridging the gap between an object's semantics and its spatial map. As a result, Detect2Interact achieves consistent qualitative results on object key field detection across extensive test cases and outperforms the existing VQA system with object detection by providing a more reasonable and finer visual representation.
https://arxiv.org/abs/2404.01151
Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage instruction tuning, which relies heavily on human-labeled or large language model-generated annotation, incurring high labeling costs. To tackle this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby enhancing instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct data samples of ICCC task from image-text datasets with low labeling and computation costs. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based VL tasks through ICCC instruction tuning.
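One way to picture the ICCC data construction: corrupt a caption by swapping in a mismatched concept, and use the original caption as the correction target, so the VLM must reconcile the image with the corrupted text. The real pipeline uses a lightweight dependency parser to choose what to swap; this toy version works from given concept lists and is purely illustrative:

```python
# Toy construction of an ICCC-style (corrupted caption -> correction) pair.
# A real pipeline would pick the swapped span with a dependency parser.
import random

def corrupt(caption, concepts, distractors, rng):
    """Replace one concept found in the caption with a distractor concept."""
    present = [c for c in concepts if c in caption]
    target = rng.choice(present)
    replacement = rng.choice([d for d in distractors if d != target])
    return caption.replace(target, replacement), caption  # (input, target)

rng = random.Random(0)
concepts = ["dog", "frisbee"]
distractors = ["cat", "ball"]
corrupted, target = corrupt("a dog catches a frisbee", concepts, distractors, rng)
```

Because the pairs come straight from existing image-text data plus a parser, no task-aware human labels are needed, which is the labeling-cost point of the abstract.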
https://arxiv.org/abs/2404.00909
Medical image analysis is essential to clinical diagnosis and treatment, which is increasingly supported by multi-modal large language models (MLLMs). However, previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information. This paper aims to advance 3D medical image analysis with MLLMs. To this end, we present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs specifically tailored for various 3D medical tasks, such as image-text retrieval, report generation, visual question answering, positioning, and segmentation. Additionally, we propose M3D-LaMed, a versatile multi-modal large language model for 3D medical image analysis. Furthermore, we introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks. Through comprehensive evaluation, our method proves to be a robust model for 3D medical image analysis, outperforming existing solutions. All code, data, and models are publicly available at: this https URL.
https://arxiv.org/abs/2404.00578
Panoramic videos have the advantage of providing an immersive and interactive viewing experience. Nevertheless, their spherical nature gives rise to various and uncertain user viewing behaviors, which poses significant challenges for panoramic video quality assessment (PVQA). In this work, we propose an end-to-end optimized, blind PVQA method with explicit modeling of user viewing patterns through visual scanpaths. Our method consists of two modules: a scanpath generator and a quality assessor. The scanpath generator is initially trained to predict future scanpaths by minimizing their expected code length and then jointly optimized with the quality assessor for quality prediction. Our blind PVQA method enables direct quality assessment of panoramic images by treating them as videos composed of identical frames. Experiments on three public panoramic image and video quality datasets, encompassing both synthetic and authentic distortions, validate the superiority of our blind PVQA model over existing methods.
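The "expected code length" objective for the scanpath generator has a simple information-theoretic reading: the code length of an observed fixation under the predicted distribution is -log2 p, and training minimizes its average. A toy illustration with two candidate models (the fixation labels and distributions are invented, not the paper's parameterization):

```python
# Expected code length of observed fixations under a predicted distribution.
# A model matching real viewing statistics needs fewer bits per fixation.
import math

def expected_code_length(predicted, observed_scanpath):
    """Mean bits to encode each observed fixation under the model."""
    return (sum(-math.log2(predicted[f]) for f in observed_scanpath)
            / len(observed_scanpath))

scanpath = ["equator", "equator", "pole"]        # viewers favor the equator
sharp_model = {"equator": 0.8, "pole": 0.2}      # matches viewing statistics
uniform_model = {"equator": 0.5, "pole": 0.5}    # ignores them
```

Minimizing this quantity is equivalent to maximizing the likelihood of observed scanpaths, which is why it serves as a training signal before the generator is jointly optimized with the quality assessor.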
https://arxiv.org/abs/2404.00252
Multimodal pre-training demonstrates its potential in the medical domain, which learns medical visual representations from paired medical reports. However, many pre-training tasks require extra annotations from clinicians, and most of them fail to explicitly guide the model to learn the desired features of different pathologies. To the best of our knowledge, we are the first to utilize Visual Question Answering (VQA) for multimodal pre-training to guide the framework focusing on targeted pathological features. In this work, we leverage descriptions in medical reports to design multi-granular question-answer pairs associated with different diseases, which assist the framework in pre-training without requiring extra annotations from experts. We also propose a novel pre-training framework with a quasi-textual feature transformer, a module designed to transform visual features into a quasi-textual space closer to the textual domain via a contrastive learning strategy. This narrows the vision-language gap and facilitates modality alignment. Our framework is applied to four downstream tasks: report generation, classification, segmentation, and detection across five datasets. Extensive experiments demonstrate the superiority of our framework compared to other state-of-the-art methods. Our code will be released upon acceptance.
https://arxiv.org/abs/2404.00226
With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined the social biases contained in text generated by LLMs, this topic has been relatively unexplored in LVLMs. Examining social biases in LVLMs is particularly challenging due to the confounding contributions of bias induced by information contained across the text and visual modalities. To address this challenging problem, we conduct a large-scale study of text generated by different LVLMs under counterfactual changes to input images. Specifically, we present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets, where each set contains images which are largely identical in their depiction of a common subject (e.g., a doctor), but vary only in terms of intersectional social attributes (e.g., race and gender). We comprehensively evaluate the text produced by different LVLMs under this counterfactual generation setting and find that social attributes such as race, gender, and physical characteristics depicted in input images can significantly influence toxicity and the generation of competency-associated words.
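The counterfactual protocol can be sketched as comparing word-usage rates across otherwise-identical image sets. The canned generations and word list below are invented; real runs would use LVLM outputs over the counterfactual image sets plus toxicity classifiers:

```python
# Toy version of the counterfactual comparison: measure how often
# competency-associated words appear in generations for each image set.

COMPETENCY_WORDS = {"skilled", "expert", "competent", "intelligent"}

def competency_rate(generations):
    """Fraction of generations containing at least one competency word."""
    hits = sum(any(w in g.lower().split() for w in COMPETENCY_WORDS)
               for g in generations)
    return hits / len(generations)

# Hypothetical generations for two counterfactual sets of "doctor" images
# that differ only in an intersectional social attribute.
set_a = ["a skilled doctor reviews a chart", "an expert physician at work"]
set_b = ["a doctor reviews a chart", "a physician at work"]
gap = competency_rate(set_a) - competency_rate(set_b)
```

A non-zero gap between sets that depict the same subject is the kind of signal the study aggregates at scale to attribute bias to the depicted social attributes rather than to the prompt.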
https://arxiv.org/abs/2404.00166
This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD). UPD examines the VLM's ability to withhold answers when faced with unsolvable problems in the context of Visual Question Answering (VQA) tasks. UPD encompasses three distinct settings: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD). To investigate the UPD problem in depth, we conduct extensive experiments, which indicate that most VLMs, including GPT-4V and LLaVA-Next-34B, struggle with our benchmarks to varying extents, highlighting significant room for improvement. To address UPD, we explore both training-free and training-based solutions, offering new insights into their effectiveness and limitations. We hope our insights, together with future efforts within the proposed UPD settings, will enhance the broader understanding and development of more practical and reliable VLMs.
https://arxiv.org/abs/2403.20331
Generic large Vision-Language Models (VLMs) are rapidly developing, but they still perform poorly in the Remote Sensing (RS) domain, due to the unique and specialized nature of RS imagery and the comparatively limited spatial perception of current VLMs. Existing Remote-Sensing-specific Vision Language Models (RSVLMs) still have considerable potential for improvement, primarily owing to the lack of large-scale, high-quality RS vision-language datasets. We constructed HqDC-1.4M, a large-scale set of High-quality and Detailed Captions for RS images containing 1.4 million image-caption pairs, which not only enhances the RSVLM's understanding of RS images but also significantly improves the model's spatial perception abilities, such as localization and counting, thereby increasing the helpfulness of the RSVLM. Moreover, to address the inevitable "hallucination" problem in RSVLMs, we developed RSSA, the first dataset aimed at enhancing the Self-Awareness capability of RSVLMs. By incorporating a variety of unanswerable questions into typical RS visual question-answering tasks, RSSA effectively improves the truthfulness and reduces the hallucinations of the model's outputs, thereby enhancing the honesty of the RSVLM. Based on these datasets, we propose H2RSVLM, the Helpful and Honest Remote Sensing Vision Language Model. H2RSVLM has achieved outstanding performance on multiple RS public datasets and is capable of recognizing and refusing to answer unanswerable questions, effectively mitigating incorrect generations. We will release the code, data and model weights at this https URL.
https://arxiv.org/abs/2403.20213