We analyze knowledge-based visual question answering (KB-VQA), in which, given a question, a model must ground it in the visual modality and retrieve the relevant knowledge from a given large knowledge base (KB) in order to answer. Our analysis is twofold: one part is based on designing neural architectures and training them from scratch, and the other on large pre-trained language models (LLMs). Our research questions are: 1) Can we effectively augment models with explicit supervised retrieval of relevant KB information to solve the KB-VQA problem? 2) How do task-specific and LLM-based models perform at integrating visual and external knowledge and at multi-hop reasoning over both sources of information? 3) Is the implicit knowledge of LLMs sufficient for KB-VQA, and to what extent can it replace an explicit KB? Our results demonstrate the positive impact of empowering task-specific and LLM models with supervised external and visual knowledge retrieval models. Our findings show that although LLMs are stronger in 1-hop reasoning, they suffer in 2-hop reasoning compared with our fine-tuned NN model, even when the relevant information from both modalities is available to the model. Moreover, we observed that LLM models outperform the NN model on KB-related questions, which confirms the effectiveness of implicit knowledge in LLMs; however, they do not alleviate the need for an external KB.
https://arxiv.org/abs/2404.10226
The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model, or study only unimodal models. However, the most powerful models (e.g. GPT-4) are typically only available as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic, black-box setting. We propose using the principle of \textit{neighborhood consistency} to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that given only a visual question and model response, the consistency of the model's responses over the neighborhood of a visual question will indicate reliability. It is impossible to directly sample neighbors in feature space in a black-box setting. Instead, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model.
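As a rough illustration of the neighborhood-consistency idea, the sketch below queries a black-box VLM on a visual question and on several nearby questions produced by a proxy model, then scores reliability by how often the answers agree. The function names, the use of question rephrasings as a stand-in for feature-space neighbors, and the simple exact-match agreement score are assumptions for illustration, not the paper's implementation.

```python
def neighborhood_consistency(question, image, query_vlm, sample_neighbors, k=5):
    """Estimate reliability of a black-box VLM answer via agreement over
    neighboring questions produced by a smaller proxy model.

    query_vlm(image, question) -> answer string (the black-box VLM)
    sample_neighbors(question, k) -> list of k rephrased/nearby questions
    """
    original = query_vlm(image, question)
    neighbors = sample_neighbors(question, k)
    answers = [query_vlm(image, q) for q in neighbors]
    # Fraction of neighborhood answers that agree with the original answer.
    agreement = sum(a == original for a in answers) / max(len(answers), 1)
    return original, agreement

# Toy usage with stand-in callables; real use would wrap a VLM API and a
# proxy paraphrase/rewriting model.
if __name__ == "__main__":
    fake_vlm = lambda img, q: "red" if "color" in q else "unknown"
    fake_neighbors = lambda q, k: [q + f" (variant {i})" for i in range(k)]
    answer, score = neighborhood_consistency("What color is the car?", None,
                                             fake_vlm, fake_neighbors)
    print(answer, score)  # abstain when score falls below a chosen threshold
```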
https://arxiv.org/abs/2404.10193
Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of tasks including visual question answering, recognising objects, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images that aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset that consists of 3.9M question-answer pairs for training and evaluating VLMs. HOI-QA includes questions relating to locating hands, objects, and critically their interactions (e.g. referring to the object being manipulated by the hand). We train the first VLM for HOI-Ref on this dataset and call it VLM4HOI. Our results demonstrate that VLMs trained for referral on third person images fail to recognise and refer hands and objects in egocentric images. When fine-tuned on our egocentric HOI-QA dataset, performance improves by 27.9% for referring hands and objects, and by 26.7% for referring interactions.
https://arxiv.org/abs/2404.09933
This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer for efficient and general-purpose visual and language understanding. Specifically, we harness well-established word embeddings to bridge two modality embedding spaces. The visual and text representations are simultaneously assigned to a set of word embeddings within pretrained LLMs by formulating the assigning procedure as an optimal transport problem. We predict the assignment of one modality from the representation of another modality data, enforcing consistent assignments for paired multimodal data. This allows vision and language representations to contain the same information, grounding the frozen LLMs' word embedding space in visual data. Moreover, a robust semantic taxonomy of LLMs can be preserved with visual data since the LLMs interpret and reason linguistic information from correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over the previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate the learned visual representations hold a semantic taxonomy of LLMs, making visual semantic arithmetic possible.
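The optimal-transport assignment at the core of this description can be sketched with a Sinkhorn normalization over similarities between modality features and a bank of word-embedding anchors, plus a consistency loss that asks the image's assignment to predict its paired text's assignment. The array sizes, the temperature `eps`, the uniform marginals, and the cross-entropy form of the consistency term below are illustrative assumptions, not VLAP's exact formulation.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sinkhorn(scores, n_iters=20, eps=0.05):
    """Turn a (batch x vocab) similarity matrix into a soft assignment with
    balanced usage of the vocabulary anchors, via Sinkhorn normalization."""
    Q = np.exp(scores / eps)
    Q /= Q.sum()
    r = np.ones(Q.shape[0]) / Q.shape[0]   # uniform marginal over batch items
    c = np.ones(Q.shape[1]) / Q.shape[1]   # uniform marginal over anchor words
    for _ in range(n_iters):
        Q *= (r / Q.sum(axis=1))[:, None]
        Q *= (c / Q.sum(axis=0))[None, :]
    return Q / Q.sum(axis=1, keepdims=True)  # each row is a distribution

rng = np.random.default_rng(0)
word_anchors = l2norm(rng.normal(size=(1000, 64)))              # frozen word embeddings
img_feats = l2norm(rng.normal(size=(8, 64)))                    # projected vision features
txt_feats = l2norm(img_feats + 0.1 * rng.normal(size=(8, 64)))  # paired text features

img_assign = sinkhorn(img_feats @ word_anchors.T)
txt_assign = sinkhorn(txt_feats @ word_anchors.T)

# Consistency objective: the assignment predicted from one modality should
# match the assignment of its paired sample from the other modality.
loss = float(-(txt_assign * np.log(img_assign + 1e-9)).sum(axis=1).mean())
print(loss)
```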
https://arxiv.org/abs/2404.09632
Visual question answering (VQA) is known as an AI-complete task as it requires understanding, reasoning, and inferring about the vision and the language content. Over the past few years, numerous neural architectures have been suggested for the VQA problem. However, achieving success in zero-shot VQA remains a challenge due to its requirement for advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline. Specifically, we explore the efficacy of utilizing image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across various question types in terms of structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at \url{this https URL}.
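A minimal sketch of the question-driven captioning pipeline follows: extract keywords from the question, ask a captioner to describe the image with those keywords in mind, and fold the resulting caption into the LLM prompt. The stopword-based keyword extractor, the prompt template, and the `caption_model`/`llm` callables are placeholders, not the paper's exact components.

```python
import re

STOPWORDS = {"what", "which", "is", "are", "the", "a", "an", "of", "in", "on",
             "this", "that", "to", "do", "does", "there", "how", "many"}

def extract_keywords(question):
    """Very simple keyword extraction: lowercase, strip punctuation,
    drop stopwords. The paper's exact extraction method may differ."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]

def question_driven_vqa(image, question, caption_model, llm):
    """caption_model(image, keywords) -> caption focused on the keywords
    llm(prompt) -> answer string"""
    keywords = extract_keywords(question)
    caption = caption_model(image, keywords)
    prompt = (f"Image description: {caption}\n"
              f"Question: {question}\n"
              f"Answer with a short phrase:")
    return llm(prompt)

# Toy usage with stand-in models.
answer = question_driven_vqa(
    image=None,
    question="What color is the umbrella next to the bench?",
    caption_model=lambda img, kw: "a red umbrella leaning on a wooden bench",
    llm=lambda p: "red",
)
print(answer)
```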
https://arxiv.org/abs/2404.08589
Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes lead to the generation of hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on the Cap3D method, which renders 3D objects into 2D views for captioning using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where views with high alignment closely represent the object's characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across the Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model.
https://arxiv.org/abs/2404.07984
This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data (textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition. Different from many existing MLLM benchmarks, DesignQA contains document-grounded visual questions where the input image and input document come from different sources. The benchmark features automatic evaluation metrics and is divided into three segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on tasks that engineers perform when designing according to requirements. We evaluate state-of-the-art models like GPT4 and LLaVA against the benchmark, and our study uncovers the existing gaps in MLLMs' abilities to interpret complex engineering documentation. Key findings suggest that while MLLMs demonstrate potential in navigating technical documents, substantial limitations exist, particularly in accurately extracting and applying detailed requirements to engineering designs. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available at: this https URL.
https://arxiv.org/abs/2404.07917
Unsupervised anomaly detection enables the identification of potential pathological areas by juxtaposing original images with their pseudo-healthy reconstructions generated by models trained exclusively on normal images. However, the clinical interpretation of resultant anomaly maps presents a challenge due to a lack of detailed, understandable explanations. Recent advancements in language models have shown the capability of mimicking human-like understanding and providing detailed descriptions. This raises an interesting question: \textit{How can language models be employed to make the anomaly maps more explainable?} To the best of our knowledge, we are the first to leverage a language model for unsupervised anomaly detection, for which we construct a dataset with different questions and answers. Additionally, we present a novel multi-image visual question answering framework tailored for anomaly detection, incorporating diverse feature fusion strategies to enhance visual knowledge extraction. Our experiments reveal that the framework, augmented by our new Knowledge Q-Former module, adeptly answers questions on the anomaly detection dataset. Besides, integrating anomaly maps as inputs distinctly aids in improving the detection of unseen pathologies.
https://arxiv.org/abs/2404.07622
Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
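One way to picture coordinate-based instruction fine-tuning data is to turn existing box annotations into textual coordinates and pair them with locate/identify style instructions, as in the hedged sketch below; the bracketed normalized-coordinate format and the prompt templates are assumptions for illustration rather than the representations the paper found optimal.

```python
def box_to_text(box, width, height, precision=2):
    """Normalize an (x1, y1, x2, y2) pixel box to [0, 1] and render it as a
    textual coordinate token the model can read and emit."""
    x1, y1, x2, y2 = box
    coords = [x1 / width, y1 / height, x2 / width, y2 / height]
    return "[" + ", ".join(f"{c:.{precision}f}" for c in coords) + "]"

def make_spatial_instructions(objects, width, height):
    """Build two complementary instruction pairs per object:
    locate (label -> box) and identify (box -> label)."""
    samples = []
    for label, box in objects:
        loc = box_to_text(box, width, height)
        samples.append({"instruction": f"Where is the {label}? Answer with coordinates.",
                        "response": loc})
        samples.append({"instruction": f"What object is at {loc}?",
                        "response": label})
    return samples

# Toy usage: two annotated objects in a 640x480 image.
pairs = make_spatial_instructions(
    [("dog", (40, 200, 220, 460)), ("frisbee", (400, 90, 470, 150))],
    width=640, height=480)
for p in pairs:
    print(p["instruction"], "->", p["response"])
```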
https://arxiv.org/abs/2404.07449
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a more broad and contextualized visual understanding of VLMs.
https://arxiv.org/abs/2404.07204
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
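A schematic, training-free version of such a multi-stage pipeline might look like the following, with an event-parsing call, a grounding step, and a final reasoning call writing into a shared external memory; the stage prompts and the `prompt_llm`/`ground_frames` callables are placeholders rather than MoReVQA's actual prompts.

```python
def morevqa_style_pipeline(video_frames, question, prompt_llm, ground_frames):
    """Decomposed videoQA: each stage is a few-shot prompted call to a large
    model, and intermediate results accumulate in an external memory."""
    memory = {"question": question}

    # Stage 1: event parsing - identify the events/entities the question asks about.
    memory["events"] = prompt_llm(
        f"List the events and entities needed to answer: {question}")

    # Stage 2: grounding - select the frames relevant to the parsed events.
    memory["frames"] = ground_frames(video_frames, memory["events"])

    # Stage 3: reasoning - answer using only the grounded evidence in memory.
    memory["answer"] = prompt_llm(
        f"Evidence from frames: {memory['frames']}\n"
        f"Events: {memory['events']}\n"
        f"Question: {question}\nAnswer:")
    return memory  # interpretable intermediate outputs at every stage

# Toy usage with stand-in callables.
out = morevqa_style_pipeline(
    video_frames=["f0", "f1", "f2", "f3"],
    question="What did the person do after opening the fridge?",
    prompt_llm=lambda p: "took out a bottle" if "Answer:" in p else "open fridge; take object",
    ground_frames=lambda frames, events: frames[1:3],
)
print(out["answer"])
```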
https://arxiv.org/abs/2404.06511
Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose an \textit{OmniFusion} model based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design principles for better text and visual data coupling: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.) and their fusing approach, the image encoding method (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks show that the best OmniFusion setup achieves the top score across different VQA tasks in comparison with open-source LLaVA-like solutions: VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU. We also describe a variety of situations where OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equation recognition, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at this https URL.
https://arxiv.org/abs/2404.06212
Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal problems. Therefore we pose the VQA problem from a unified perspective and evaluate a single system on a varied suite of VQA tasks including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying the LLM+tools approach using the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical MultiModal React. We start from a multimodal ReAct-based system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances the compositionality of the LLM+tools approach, which we show to be critical for obtaining high accuracy on generic VQA. Concretely, on our generic VQA suite, HAMMR outperforms the naive LLM+tools approach by 19.5%. Additionally, HAMMR achieves state-of-the-art results on this task, outperforming the generic standalone PaLI-X VQA model by 5.0%.
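The hierarchical LLM+tools idea can be sketched as agents that register other specialized agents among their tools, so a top-level agent can delegate sub-problems; the `Agent` class, its tool-selection prompt, and the toy OCR specialist below are illustrative assumptions, not HAMMR's implementation (which builds on ReAct-style traces).

```python
class Agent:
    """A minimal agent: an LLM callable plus a registry of tools it may call.
    Because other Agent instances can be registered as tools, agents compose
    hierarchically (the top-level agent delegates to specialists)."""

    def __init__(self, name, llm, tools=None):
        self.name = name
        self.llm = llm
        self.tools = tools or {}

    def run(self, query, image=None):
        # Ask the LLM which tool (if any) to use; a real system would parse a
        # ReAct-style "Thought/Action/Observation" trace instead.
        choice = self.llm(f"[{self.name}] Pick a tool from {list(self.tools)} "
                          f"for: {query}")
        if choice in self.tools:
            observation = self.tools[choice](query, image)
            return self.llm(f"[{self.name}] Using {choice} result "
                            f"'{observation}', answer: {query}")
        return self.llm(f"[{self.name}] Answer directly: {query}")

# Toy hierarchy: the generalist delegates OCR-style questions to a specialist.
ocr_agent = Agent("ocr_specialist",
                  llm=lambda p: "ocr_tool" if "Pick a tool" in p else "EXIT 25",
                  tools={"ocr_tool": lambda q, img: "sign text: EXIT 25"})
top_agent = Agent("generalist",
                  llm=lambda p: "ocr_specialist" if "Pick a tool" in p
                  else "The sign says EXIT 25",
                  tools={"ocr_specialist": ocr_agent.run})
print(top_agent.run("What does the road sign say?", image=None))
```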
https://arxiv.org/abs/2404.05465
Introduction: Video Quality Assessment (VQA) is one of the important areas of study in this modern era, where video is a crucial component of communication with applications in every field. Rapid developments in mobile technology have enabled anyone to create videos, resulting in a varied range of video quality scenarios. Objectives: Though VQA has been studied for some time with classical metrics like SSIM and PSNR, the advent of machine learning has brought in new VQA techniques built upon Convolutional Neural Networks (CNNs) or Deep Neural Networks (DNNs). Methods: Over the past years, various research studies, such as BVQA, which performed video quality assessment of nature-based videos using DNNs, exposed the powerful capabilities of machine learning algorithms. BVQA using DNNs explored human visual system effects such as content dependency and time-related factors, normally known as temporal effects. Results: This study explores the effect of sharpness on models like BVQA. Sharpness is a measure of the clarity and detail of the video image. Measuring sharpness typically involves analyzing the edges and contrast of the image to determine the overall level of detail. Conclusion: This study uses existing video quality databases such as CVD2014. A comparative study of machine learning evaluation metrics such as SRCC and PLCC during training and testing is presented along with the conclusion.
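To make the quantities concrete, the sketch below computes a common sharpness proxy (variance of the Laplacian, which reflects edge strength and local contrast) and the SRCC/PLCC agreement metrics used to compare model predictions against subjective scores; the Laplacian-variance proxy is an assumption standing in for the study's exact sharpness measure.

```python
import numpy as np
from scipy import ndimage
from scipy.stats import spearmanr, pearsonr

def sharpness_score(gray_frame):
    """Variance of the Laplacian: a common proxy for edge/detail strength,
    used here as a stand-in for the study's sharpness measure."""
    return float(ndimage.laplace(gray_frame.astype(np.float64)).var())

def evaluate_vqa_model(predicted_scores, subjective_mos):
    """Standard VQA agreement metrics between predictions and mean opinion
    scores: SRCC (rank correlation) and PLCC (linear correlation)."""
    srcc, _ = spearmanr(predicted_scores, subjective_mos)
    plcc, _ = pearsonr(predicted_scores, subjective_mos)
    return srcc, plcc

# Toy example: a blurred frame should score lower than a detailed one.
rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(64, 64))
blurry = ndimage.uniform_filter(sharp.astype(float), size=5)
print(sharpness_score(sharp) > sharpness_score(blurry))  # True

print(evaluate_vqa_model([0.9, 0.4, 0.7, 0.2], [4.5, 2.0, 3.8, 1.5]))
```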
https://arxiv.org/abs/2404.05764
Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision, we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task, treat the LLM as a policy, and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection, compositional visual question answering, and image-text retrieval, and show that in each case, the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: this https URL
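A skeletal version of the reinforced self-training loop described here: sample candidate programs from the LLM policy, execute them, score the results against existing task annotations as a coarse reward, and keep high-reward pairs for fine-tuning. All callables and the exact-match reward in the toy example are placeholders, not the paper's reward design.

```python
def reinforced_self_training(tasks, sample_programs, execute, reward_fn,
                             fine_tune, threshold=0.5, rounds=2):
    """One simple variant of reinforced self-training (ReST-style):
    generate -> score with a coarse reward from existing annotations ->
    filter -> fine-tune, repeated for a few rounds."""
    kept = []
    for _ in range(rounds):
        kept = []
        for task in tasks:
            for program in sample_programs(task["prompt"]):
                result = execute(program, task["image"])
                # Coarse reward improvised from existing annotations,
                # e.g. IoU for detection or exact-match for VQA.
                if reward_fn(result, task["annotation"]) >= threshold:
                    kept.append((task["prompt"], program))
        fine_tune(kept)  # update the policy (the program-writing LLM)
    return kept

# Toy usage with stand-in components.
kept = reinforced_self_training(
    tasks=[{"prompt": "count the dogs", "image": None, "annotation": 2}],
    sample_programs=lambda p: ["return 2", "return 5"],
    execute=lambda prog, img: int(prog.split()[-1]),
    reward_fn=lambda pred, gold: float(pred == gold),
    fine_tune=lambda pairs: None,
)
print(kept)  # [('count the dogs', 'return 2')]
```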
https://arxiv.org/abs/2404.04627
The chain-of-thought technique has been well received in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also models each step as a reasoning aggregation graph to cope with the overlooked multiple aspects of thinking in single-step reasoning. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results in several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.
https://arxiv.org/abs/2404.04538
Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA), particularly in object-oriented perception tasks which demand fine-grained understanding of object identities, locations or attributes, as indicated by empirical findings. This is mainly due to their limited capability to effectively integrate complex visual cues with textual information and potential object hallucinations. In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception. VTPrompt merges visual and text prompts to extract key concepts from textual questions and employs a detection model to highlight relevant objects as visual prompts in images. The processed images alongside text prompts are subsequently fed into MLLMs to produce more accurate answers. Our experiments with GPT-4V and Gemini Pro, on three benchmarks, i.e., MME , MMB and POPE, demonstrate significant improvements. Particularly, our method led to a score improvement of up to 183.5 for GPT-4V on MME and enhanced MMB performance by 8.17\% for GPT-4V and 15.69\% for Gemini Pro.
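A rough sketch of the joint visual and text prompting flow: pull key concepts from the question, detect the corresponding objects, draw them on the image as visual prompts, and send the marked image plus a structured text prompt to the MLLM. The drawing style, prompt wording, and the `extract_concepts`/`detect`/`ask_mllm` callables are assumptions for illustration.

```python
from PIL import Image, ImageDraw

def vtprompt_style_query(image, question, extract_concepts, detect, ask_mllm):
    """Joint visual and text prompting: mark question-relevant objects on the
    image, then query the MLLM with the marked image and the concepts."""
    concepts = extract_concepts(question)           # e.g. via an LLM or parser
    boxes = detect(image, concepts)                 # {label: (x1, y1, x2, y2)}

    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for label, box in boxes.items():
        draw.rectangle(box, outline="red", width=3)  # visual prompt
        draw.text((box[0] + 4, box[1] + 4), label, fill="red")

    text_prompt = (f"Key concepts: {', '.join(concepts)}\n"
                   f"Relevant objects are outlined in red.\n"
                   f"Question: {question}")
    return ask_mllm(marked, text_prompt)

# Toy usage on a blank image with stand-in components.
img = Image.new("RGB", (320, 240), "white")
answer = vtprompt_style_query(
    img, "Is the cup to the left of the laptop?",
    extract_concepts=lambda q: ["cup", "laptop"],
    detect=lambda im, cs: {"cup": (20, 100, 80, 160), "laptop": (150, 80, 290, 200)},
    ask_mllm=lambda im, prompt: "yes",
)
print(answer)
```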
https://arxiv.org/abs/2404.04514
The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific tasks of VRDU such as document classification (DC), key entity extraction (KEE), entity linking, visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse annotations such that they support one or two co-related tasks (e.g., entity extraction and entity linking). Unfortunately, focusing only on a single document type or task is not representative of how documents often need to be processed in the wild, where variety in style and requirements is expected. In this paper, we introduce BuDDIE (Business Document Dataset for Information Extraction), the first multi-task dataset of 1,665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in their style and layout across states and types (e.g., forms, certificates, reports, etc.). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and large language model approaches to VRDU.
https://arxiv.org/abs/2404.04003
Traditional machine learning models often require powerful hardware, making them unsuitable for deployment on resource-limited devices. Tiny Machine Learning (tinyML) has emerged as a promising approach for running machine learning models on these devices, but integrating multiple data modalities into tinyML models still remains a challenge due to increased complexity, latency, and power consumption. This paper proposes TinyVQA, a novel multimodal deep neural network for visual question answering tasks that can be deployed on resource-constrained tinyML hardware. TinyVQA leverages a supervised attention-based model to learn how to answer questions about images using both vision and language modalities. Distilled knowledge from the supervised attention-based VQA model trains the memory-aware compact TinyVQA model, and a low bit-width quantization technique is employed to further compress the model for deployment on tinyML devices. The TinyVQA model was evaluated on the FloodNet dataset, which is used for post-disaster damage assessment. The compact model achieved an accuracy of 79.5%, demonstrating the effectiveness of TinyVQA for real-world applications. Additionally, the model was deployed on a Crazyflie 2.0 drone equipped with an AI deck and GAP8 microprocessor. The TinyVQA model achieved a low latency of 56 ms and consumed 693 mW of power while deployed on the tiny drone, showcasing its suitability for resource-constrained embedded systems.
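The two compression ingredients mentioned, knowledge distillation from the supervised attention-based teacher and low bit-width quantization, can be sketched as below; the temperature, the per-tensor symmetric int8 scheme, and the toy logits are illustrative assumptions rather than TinyVQA's exact training recipe.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened answer distribution and
    the student's, the usual knowledge-distillation objective."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    return float(-(t * np.log(s + 1e-9)).sum(axis=-1).mean())

def quantize_int8(weights):
    """Symmetric post-training quantization of a weight tensor to int8,
    returning the quantized values and the per-tensor scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))          # teacher logits over 10 answers
student = teacher + 0.5 * rng.normal(size=(4, 10))
print(distillation_loss(student, teacher))

w = rng.normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - q.astype(np.float32) * scale).max())  # quantization error
```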
https://arxiv.org/abs/2404.03574
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the complexities of videos. Building upon the success of MiniGPT-v2, which excelled in translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks, this paper extends the model's capabilities to process a sequence of frames, enabling it to comprehend videos. MiniGPT4-Video does not only consider visual content but also incorporates textual conversations, allowing the model to effectively answer queries involving both visual and text components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks respectively. Our models and code are publicly available at this https URL.
https://arxiv.org/abs/2404.03413