Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and language content. Over the past few years, numerous neural architectures have been proposed for the VQA problem. However, achieving success in zero-shot VQA remains a challenge due to its requirement for advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline. Specifically, we explore the efficacy of utilizing image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across various question types in terms of structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at \url{this https URL}.
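The three steps of the method (keyword extraction, question-driven captioning, prompt construction) can be sketched as follows; the stopword list, function names, and prompt template are illustrative assumptions, and the caption string stands in for the output of the captioning model:

```python
# Hypothetical sketch of a question-driven captioning pipeline:
# 1) extract keywords from the question, 2) condition the captioner on them
# (stubbed here), 3) build the LLM prompt from caption + question.

STOPWORDS = {"what", "is", "the", "a", "an", "of", "on", "in", "there", "are"}

def extract_keywords(question):
    # Naive keyword extraction: keep non-stopword tokens of the question.
    tokens = question.lower().rstrip("?").split()
    return [t for t in tokens if t not in STOPWORDS]

def build_prompt(caption, question):
    # The question-driven caption is placed as context before the question.
    return f"Context: {caption}\nQuestion: {question}\nAnswer:"

keywords = extract_keywords("What color is the dog on the sofa?")
# A real system would generate this caption from the image + keywords.
prompt = build_prompt("a brown dog lying on a grey sofa",
                      "What color is the dog on the sofa?")
```

In a full pipeline, `keywords` would steer the captioning model toward question-relevant content before the prompt is sent to the LLM.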
https://arxiv.org/abs/2404.08589
Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes lead to the generation of hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on the Cap3D method, which renders 3D objects into 2D views for captioning using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where views with high alignment closely represent the object's characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across the Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model.
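The ranking-and-selection step can be sketched as below, with the alignment scores standing in for the estimates produced by the pre-trained text-to-3D model (an assumption for illustration; only the top-ranked views would then be passed to the captioner):

```python
# Minimal sketch of the DiffuRank selection step: score each rendered view
# by its estimated alignment with the 3D object, then keep the top-k views
# for captioning.

def top_k_views(views, alignment_scores, k=3):
    # Pair views with scores, sort by score descending, keep the best k.
    ranked = sorted(zip(views, alignment_scores), key=lambda p: p[1], reverse=True)
    return [v for v, _ in ranked[:k]]

views = ["front", "back", "top", "bottom", "left", "right"]
scores = [0.9, 0.4, 0.7, 0.1, 0.8, 0.6]   # stand-in alignment estimates
selected = top_k_views(views, scores, k=3)
```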
https://arxiv.org/abs/2404.07984
This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data (including textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition. Different from many existing MLLM benchmarks, DesignQA contains document-grounded visual questions where the input image and input document come from different sources. The benchmark features automatic evaluation metrics and is divided into three segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on tasks that engineers perform when designing according to requirements. We evaluate state-of-the-art models like GPT4 and LLaVA against the benchmark, and our study uncovers existing gaps in MLLMs' abilities to interpret complex engineering documentation. Key findings suggest that while MLLMs demonstrate potential in navigating technical documents, substantial limitations exist, particularly in accurately extracting and applying detailed requirements to engineering designs. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available at: this https URL.
https://arxiv.org/abs/2404.07917
Unsupervised anomaly detection enables the identification of potential pathological areas by juxtaposing original images with their pseudo-healthy reconstructions generated by models trained exclusively on normal images. However, the clinical interpretation of resultant anomaly maps presents a challenge due to a lack of detailed, understandable explanations. Recent advancements in language models have shown the capability of mimicking human-like understanding and providing detailed descriptions. This raises an interesting question: \textit{How can language models be employed to make the anomaly maps more explainable?} To the best of our knowledge, we are the first to leverage a language model for unsupervised anomaly detection, for which we construct a dataset with different questions and answers. Additionally, we present a novel multi-image visual question answering framework tailored for anomaly detection, incorporating diverse feature fusion strategies to enhance visual knowledge extraction. Our experiments reveal that the framework, augmented by our new Knowledge Q-Former module, adeptly answers questions on the anomaly detection dataset. Besides, integrating anomaly maps as inputs distinctly aids in improving the detection of unseen pathologies.
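As a minimal sketch of the anomaly-map construction this framework builds on, the map can be taken as the per-pixel absolute difference between the image and its pseudo-healthy reconstruction (the exact formulation in the paper may differ); the resulting map is what the proposed VQA framework receives as an extra input:

```python
# Sketch: candidate pathological regions as the per-pixel absolute
# difference between an image and its pseudo-healthy reconstruction.

def anomaly_map(image, reconstruction):
    # Both inputs are 2D lists of intensities in [0, 1].
    return [[abs(a - b) for a, b in zip(row_img, row_rec)]
            for row_img, row_rec in zip(image, reconstruction)]

image = [[0.1, 0.9], [0.2, 0.2]]
recon = [[0.1, 0.2], [0.2, 0.2]]   # the model "heals" the bright spot
amap = anomaly_map(image, recon)
```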
https://arxiv.org/abs/2404.07622
Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
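One plausible instantiation of a textual coordinate representation is to normalize bounding boxes to a fixed range and serialize them as integer tokens in the prompt; the format below is an assumption for illustration, not necessarily the representation the paper finds optimal:

```python
# Sketch: serialize a pixel-space bounding box into a normalized textual
# token that can appear in an instruction-tuning prompt.

def bbox_to_text(box, width, height):
    # box = (x1, y1, x2, y2) in pixels; emit "[x1,y1,x2,y2]" in [0, 100].
    x1, y1, x2, y2 = box
    norm = [round(100 * x1 / width), round(100 * y1 / height),
            round(100 * x2 / width), round(100 * y2 / height)]
    return "[{},{},{},{}]".format(*norm)

token = bbox_to_text((50, 20, 150, 180), width=200, height=200)
```

Such a token could then be embedded in prompts like "the cup at [25,10,75,90]" during fine-tuning.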
https://arxiv.org/abs/2404.07449
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a broader and more contextualized visual understanding in VLMs.
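The consolidation idea can be sketched, under simplifying assumptions, as concatenating per-encoder features and applying a learned linear projection before the frozen LM (BRAVE's actual consolidation module is more sophisticated than this):

```python
# Sketch: fuse features from multiple frozen encoders via concatenation
# followed by a linear projection to the LM's input width.

def consolidate(features, projection):
    # features: list of per-encoder vectors; projection: rows = output dims.
    concat = [x for f in features for x in f]
    return [sum(w * x for w, x in zip(row, concat)) for row in projection]

feats = [[1.0, 2.0], [3.0], [4.0, 0.5]]    # three encoders, mixed widths
proj = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]  # toy projection: 5 -> 2 dims
fused = consolidate(feats, proj)           # what the frozen LM would consume
```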
https://arxiv.org/abs/2404.07204
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
https://arxiv.org/abs/2404.06511
Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose an \textit{OmniFusion} model based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design principles for better text and visual data coupling: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.) and their fusing approaches, the image encoding method (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks (VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show that the best OmniFusion setup achieves the top score across different VQA tasks in comparison with open-source LLaVA-like solutions. We also present a variety of situations where OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equation recognition, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at this https URL.
https://arxiv.org/abs/2404.06212
Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal problems. Therefore we pose the VQA problem from a unified perspective and evaluate a single system on a varied suite of VQA tasks including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying the LLM+tools approach using the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical MultiModal React. We start from a multimodal ReAct-based system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances the compositionality of the LLM+tools approach, which we show to be critical for obtaining high accuracy on generic VQA. Concretely, on our generic VQA suite, HAMMR outperforms the naive LLM+tools approach by 19.5%. Additionally, HAMMR achieves state-of-the-art results on this task, outperforming the generic standalone PaLI-X VQA model by 5.0%.
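A minimal sketch of the hierarchical delegation idea, with hypothetical agent names and toy tools standing in for real ReAct agents and their tool calls:

```python
# Sketch of hierarchical agents: a generalist agent delegates sub-tasks to
# specialized agents instead of calling every tool from one flat tool set.

class Agent:
    def __init__(self, name, tools=None, sub_agents=None):
        self.name = name
        self.tools = tools or {}
        self.sub_agents = sub_agents or {}

    def run(self, task):
        kind, payload = task
        if kind in self.tools:          # solve with a local tool
            return self.tools[kind](payload)
        if kind in self.sub_agents:     # delegate to a specialist agent
            return self.sub_agents[kind].run(task)
        raise ValueError(f"{self.name} cannot handle task kind {kind!r}")

count_agent = Agent("counter", tools={"count": lambda objs: len(objs)})
ocr_agent = Agent("reader", tools={"ocr": lambda text: text.upper()})
top = Agent("generalist", sub_agents={"count": count_agent, "ocr": ocr_agent})
answer = top.run(("count", ["cat", "dog", "bird"]))
```

In HAMMR the delegation decisions are made by the LLM itself rather than by a fixed routing table as in this toy.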
https://arxiv.org/abs/2404.05465
Introduction: Video Quality Assessment (VQA) is one of the important areas of study in this modern era, where video is a crucial component of communication with applications in every field. Rapid developments in mobile technology have enabled anyone to create videos, resulting in a varied range of video quality scenarios. Objectives: Though VQA has existed for some time in the form of classical metrics like SSIM and PSNR, the advent of machine learning has brought in new VQA techniques built upon Convolutional Neural Networks (CNNs) or Deep Neural Networks (DNNs). Methods: Over the past years, research studies such as BVQA, which performed video quality assessment of nature-based videos using DNNs, have exposed the powerful capabilities of machine learning algorithms. BVQA using DNNs explored human visual system effects such as content dependency and time-related factors, normally known as temporal effects. Results: This study explores the effect of sharpness on models like BVQA. Sharpness is a measure of the clarity and detail of the video image; measuring it typically involves analyzing the edges and contrast of the image to determine the overall level of detail. Conclusion: This study uses existing video quality databases such as CVD2014. A comparative study of various machine learning parameters, such as SRCC and PLCC, during training and testing is presented along with the conclusion.
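As an illustration of the kind of sharpness measure described here (not necessarily the one used in the study), the variance of a 3x3 Laplacian response grows with edge strength and contrast:

```python
# Sketch: Laplacian-variance sharpness. Higher variance of the Laplacian
# response means stronger edges, i.e. a sharper frame.

def laplacian_variance(img):
    h, w = len(img), len(img[0])
    responses = []
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            # 3x3 Laplacian: sum of 4-neighbours minus 4x centre pixel.
            lap = (img[i-1][j] + img[i+1][j] + img[i][j-1] + img[i][j+1]
                   - 4 * img[i][j])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

flat = [[5] * 4 for _ in range(4)]   # uniform patch: no edges, blurry-like
edge = [[0, 0, 9, 9]] * 4            # strong vertical edge: sharp
```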
https://arxiv.org/abs/2404.05764
Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision, we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task, treat the LLM as a policy, and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection, compositional visual question answering, and image-text retrieval, and show that in each case, the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: this https URL
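For the object-detection case, a coarse reward improvised from existing box annotations might look like the sketch below; the IoU threshold and binary reward are assumptions for illustration, not necessarily the paper's exact reward:

```python
# Sketch: a coarse reward for reinforced self-training on detection.
# The synthesized program's predicted box earns reward 1.0 if it overlaps
# the ground-truth annotation above an IoU threshold, else 0.0.

def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def coarse_reward(pred_box, gt_box, threshold=0.5):
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

r_good = coarse_reward((0, 0, 10, 10), (1, 1, 10, 10))   # large overlap
r_bad = coarse_reward((0, 0, 2, 2), (8, 8, 10, 10))      # no overlap
```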
https://arxiv.org/abs/2404.04627
The chain-of-thought technique has been well received in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also models each step as a reasoning aggregation graph to cope with the overlooked multiple aspects of thinking in single-step reasoning. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results in several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.
https://arxiv.org/abs/2404.04538
Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA), particularly in object-oriented perception tasks which demand fine-grained understanding of object identities, locations or attributes, as indicated by empirical findings. This is mainly due to their limited capability to effectively integrate complex visual cues with textual information and potential object hallucinations. In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception. VTPrompt merges visual and text prompts to extract key concepts from textual questions and employs a detection model to highlight relevant objects as visual prompts in images. The processed images alongside text prompts are subsequently fed into MLLMs to produce more accurate answers. Our experiments with GPT-4V and Gemini Pro, on three benchmarks, i.e., MME, MMB, and POPE, demonstrate significant improvements. Particularly, our method led to a score improvement of up to 183.5 for GPT-4V on MME and enhanced MMB performance by 8.17\% for GPT-4V and 15.69\% for Gemini Pro.
https://arxiv.org/abs/2404.04514
The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific tasks of VRDU such as document classification (DC), key entity extraction (KEE), entity linking, visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse annotations such that they support one or two co-related tasks (e.g., entity extraction and entity linking). Unfortunately, focusing only on a single document type or task is not representative of how documents often need to be processed in the wild, where variety in style and requirements is expected. In this paper, we introduce BuDDIE (Business Document Dataset for Information Extraction), the first multi-task dataset of 1,665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in their style and layout across states and types (e.g., forms, certificates, reports, etc.). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and large language model approaches to VRDU.
https://arxiv.org/abs/2404.04003
Traditional machine learning models often require powerful hardware, making them unsuitable for deployment on resource-limited devices. Tiny Machine Learning (tinyML) has emerged as a promising approach for running machine learning models on these devices, but integrating multiple data modalities into tinyML models still remains a challenge due to increased complexity, latency, and power consumption. This paper proposes TinyVQA, a novel multimodal deep neural network for visual question answering tasks that can be deployed on resource-constrained tinyML hardware. TinyVQA leverages a supervised attention-based model to learn how to answer questions about images using both vision and language modalities. Distilled knowledge from the supervised attention-based VQA model trains the memory-aware compact TinyVQA model, and a low bit-width quantization technique is employed to further compress the model for deployment on tinyML devices. The TinyVQA model was evaluated on the FloodNet dataset, which is used for post-disaster damage assessment. The compact model achieved an accuracy of 79.5%, demonstrating the effectiveness of TinyVQA for real-world applications. Additionally, the model was deployed on a Crazyflie 2.0 drone, equipped with an AI deck and GAP8 microprocessor. The TinyVQA model achieved a low latency of 56 ms and consumed 693 mW of power while deployed on the tiny drone, showcasing its suitability for resource-constrained embedded systems.
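A minimal sketch of the low bit-width quantization step used to shrink a model for tinyML deployment, here as uniform affine quantize/dequantize of a weight list (TinyVQA's exact scheme may differ):

```python
# Sketch: uniform affine quantization of weights to n-bit integer levels,
# then mapping back to floats to measure the quantization error.

def quantize_dequantize(weights, bits=4):
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Snap each weight to its nearest quantization level, then map back.
    return [lo + round((w - lo) / scale) * scale for w in weights]

w = [-1.0, -0.4, 0.0, 0.3, 1.0]
w_q = quantize_dequantize(w, bits=4)
```

With 4 bits the worst-case error per weight is half a quantization step, at a quarter of the storage of 16-bit weights.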
https://arxiv.org/abs/2404.03574
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the complexities of videos. Building upon the success of MiniGPT-v2, which excelled in translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks, this paper extends the model's capabilities to process a sequence of frames, enabling it to comprehend videos. MiniGPT4-Video not only considers visual content but also incorporates textual conversations, allowing the model to effectively answer queries involving both visual and text components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks respectively. Our models and code have been made publicly available at this https URL.
https://arxiv.org/abs/2404.03413
Recent advancements in Computer Assisted Diagnosis have shown promising performance in medical imaging tasks, particularly in chest X-ray analysis. However, the interaction between these models and radiologists has been primarily limited to input images. This work proposes a novel approach to enhance human-computer interaction in chest X-ray analysis using Vision-Language Models (VLMs) enhanced with radiologists' attention by incorporating eye gaze data alongside textual prompts. Our approach leverages heatmaps generated from eye gaze data, overlaying them onto medical images to highlight areas of intense radiologist's focus during chest X-ray evaluation. We evaluate this methodology in tasks such as visual question answering, chest X-ray report automation, error detection, and differential diagnosis. Our results demonstrate that the inclusion of eye gaze information significantly enhances the accuracy of chest X-ray analysis. The impact of eye gaze on fine-tuning was also confirmed, as the resulting model outperformed other medical VLMs in all tasks except visual question answering. This work marks the potential of leveraging both the VLM's capabilities and the radiologist's domain knowledge to improve the capabilities of AI models in medical imaging, paving a novel way for Computer Assisted Diagnosis with a human-centred AI.
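A toy sketch of converting gaze fixations into a heatmap that can be overlaid on the X-ray; the kernel shape and duration weighting below are illustrative assumptions, not the paper's exact heatmap generation:

```python
# Sketch: accumulate a small distance-weighted kernel at each gaze
# fixation (y, x, duration) to form an attention heatmap.

def gaze_heatmap(fixations, height, width, radius=1):
    heat = [[0.0] * width for _ in range(height)]
    for (y, x, duration) in fixations:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                yy, xx = y + dy, x + dx
                if 0 <= yy < height and 0 <= xx < width:
                    # Weight falls off with distance from the fixation.
                    heat[yy][xx] += duration / (1 + abs(dy) + abs(dx))
    return heat

heat = gaze_heatmap([(2, 2, 1.0)], height=5, width=5)
```

The resulting map would be alpha-blended onto the image before being passed to the VLM alongside the textual prompt.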
https://arxiv.org/abs/2404.02370
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
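The scoring step can be sketched as follows, with a stub standing in for the VQA model's answer logits; only the question template comes from the paper, the rest is illustrative:

```python
import math

# Sketch of VQAScore: format the question template, then take the
# probability the VQA model assigns to "Yes". The model is stubbed here
# with raw (yes, no) logits; a real implementation would query a VLM.

def vqascore(text, yes_no_logits):
    question = f"Does this figure show '{text}'?"   # the paper's template
    logit_yes, logit_no = yes_no_logits(question)
    # Softmax over the two answer candidates -> P("Yes").
    e_yes, e_no = math.exp(logit_yes), math.exp(logit_no)
    return e_yes / (e_yes + e_no)

stub_model = lambda q: (2.0, 0.0)                   # stand-in VLM logits
score = vqascore("the horse is eating the grass", stub_model)
```

A bag-of-words encoder would assign nearly the same score to "the grass is eating the horse"; a VQA model that parses the relation should not.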
https://arxiv.org/abs/2404.01291
Localization plays a crucial role in enhancing the practicality and precision of VQA systems. By enabling fine-grained identification and interaction with specific parts of an object, it significantly improves the system's ability to provide contextually relevant and spatially accurate responses, crucial for applications in dynamic environments like robotics and augmented reality. However, traditional systems face challenges in accurately mapping objects within images to generate nuanced and spatially aware responses. In this work, we introduce "Detect2Interact", which addresses these challenges by introducing an advanced approach for fine-grained object visual key field detection. First, we use the segment anything model (SAM) to generate detailed spatial maps of objects in images. Next, we use Vision Studio to extract semantic object descriptions. Third, we employ GPT-4's common sense knowledge, bridging the gap between an object's semantics and its spatial map. As a result, Detect2Interact achieves consistent qualitative results on object key field detection across extensive test cases and outperforms existing VQA systems with object detection by providing a more reasonable and finer visual representation.
https://arxiv.org/abs/2404.01151
Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage instruction tuning, which relies heavily on human-labeled or large language model-generated annotation, incurring high labeling costs. To tackle this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby enhancing instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct data samples of ICCC task from image-text datasets with low labeling and computation costs. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based VL tasks through ICCC instruction tuning.
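A toy sketch of constructing an ICCC sample: here the concept swap is chosen by token position, whereas the paper uses language structure and a lightweight dependency parser to pick the concepts to mismatch:

```python
# Sketch: build an (input, target) pair for caption correction by
# corrupting a caption with a concept swap. The VLM sees the image plus
# the corrupted caption and must generate the original caption.

def make_iccc_sample(caption, idx_a, idx_b):
    tokens = caption.split()
    tokens[idx_a], tokens[idx_b] = tokens[idx_b], tokens[idx_a]
    corrupted = " ".join(tokens)
    return corrupted, caption

corrupted, target = make_iccc_sample("the horse is eating the grass", 1, 5)
```

Because the corruption is derived mechanically from existing image-text pairs, such samples carry low labeling and computation costs, as the abstract notes.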
https://arxiv.org/abs/2404.00909