Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover only a limited number of multimodal tasks that test rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Owing to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results for $30$ LVLMs, including the proprietary GPT-4V and GeminiProVision and the open-source InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
https://arxiv.org/abs/2404.16006
Multimodal LLMs are the natural evolution of LLMs, extending their capabilities beyond the purely textual modality. While much current research targets novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, integrates an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Using this approach, relevant passages are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
https://arxiv.org/abs/2404.15406
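A minimal sketch of the Wiki-LLaVA-style retrieval-augmented answering loop described above, assuming a two-stage (document, then passage) ranking. The helper names `embed`, `mllm_generate`, and the `knowledge_base` layout are illustrative assumptions, not the paper's actual interfaces.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hierarchical_retrieve(question_emb, knowledge_base, embed, k_docs=3, k_passages=2):
    """Two-stage retrieval: rank documents first, then passages inside the top documents.

    `knowledge_base` is assumed to be a list of dicts of the form
    {"summary_emb": np.ndarray, "passages": [str, ...]} -- a simplification of
    the multimodal document store mentioned in the abstract."""
    # Stage 1: coarse document ranking by summary embedding.
    docs = sorted(knowledge_base,
                  key=lambda d: cosine(question_emb, d["summary_emb"]),
                  reverse=True)[:k_docs]
    # Stage 2: fine-grained passage ranking inside the selected documents.
    scored = [(cosine(question_emb, embed(p)), p)
              for d in docs for p in d["passages"]]
    return [p for _, p in sorted(scored, reverse=True)[:k_passages]]

def answer_with_external_knowledge(image, question, knowledge_base, embed, mllm_generate):
    passages = hierarchical_retrieve(embed(question), knowledge_base, embed)
    context = "\n".join(f"Reference: {p}" for p in passages)
    # The retrieved passages are simply prepended as extra context for the model.
    return mllm_generate(image=image, prompt=f"{context}\n\nQuestion: {question}")
```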
The rapid advancement of large-scale vision-language models has showcased remarkable capabilities across various tasks. However, the lack of extensive and high-quality image-text data in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to construct vision-language datasets. Based on the constructed dataset, we develop MedDr, a generalist foundation model for healthcare capable of handling diverse medical data modalities, including radiology, pathology, dermatology, retinography, and endoscopy. Moreover, during inference, we propose a simple but effective retrieval-augmented medical diagnosis strategy, which enhances the model's generalization ability. Extensive experiments on visual question answering, medical report generation, and medical image diagnosis demonstrate the superiority of our method.
https://arxiv.org/abs/2404.15127
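A toy sketch of what a retrieval-augmented diagnosis step in the spirit of the MedDr abstract could look like: the most similar labeled cases are retrieved and their diagnoses are shown to the model as in-context hints. `support_set` and `mllm_generate` are hypothetical stand-ins, not the paper's components.

```python
import numpy as np

def retrieval_augmented_diagnosis(image_emb, support_set, mllm_generate, k=3):
    """`support_set` is assumed to be a list of (embedding, diagnosis_label) pairs;
    `mllm_generate` stands in for the vision-language model's generation call."""
    sims = [(float(np.dot(image_emb, e) /
                   (np.linalg.norm(image_emb) * np.linalg.norm(e) + 1e-8)), y)
            for e, y in support_set]
    # Keep the diagnoses of the k nearest labeled cases as hints.
    neighbors = [y for _, y in sorted(sims, reverse=True)[:k]]
    hint = "Similar past cases were diagnosed as: " + ", ".join(neighbors)
    return mllm_generate(prompt=f"{hint}\nGive the most likely diagnosis for this image.")
```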
Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical images and text. Current algorithms that exploit the global and local alignment between medical image and text can, however, be marred by the redundant information in medical data. To address this issue, we propose a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework for chest X-ray. In this framework, medical knowledge is grounded to the appropriate anatomical regions by using a transformer-based grounded knowledge-enhanced module for fine-grained alignment between anatomical region-level visual features and the textual features of medical knowledge. The performance of GK-MVLP is competitive with or exceeds the state of the art on downstream chest X-ray disease classification, disease localization, report generation, and medical visual question-answering tasks. Our results show the advantage of incorporating a grounding mechanism to remove biases and improve the alignment between chest X-ray images and radiology reports.
https://arxiv.org/abs/2404.14750
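To make the fine-grained alignment idea in GK-MVLP concrete, here is a toy InfoNCE-style objective that pulls each anatomical-region embedding towards the text embedding of the knowledge grounded to that region. The actual paper uses a transformer-based grounded knowledge-enhanced module; this loss is only a simplified illustration of region-to-knowledge alignment.

```python
import torch
import torch.nn.functional as F

def region_knowledge_alignment_loss(region_feats, knowledge_feats, temperature=0.07):
    """region_feats, knowledge_feats: (num_regions, dim); row i of each tensor is
    assumed to describe the same anatomical region."""
    region_feats = F.normalize(region_feats, dim=-1)
    knowledge_feats = F.normalize(knowledge_feats, dim=-1)
    logits = region_feats @ knowledge_feats.t() / temperature
    targets = torch.arange(region_feats.size(0))
    # Symmetric contrastive loss: region->knowledge and knowledge->region.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = region_knowledge_alignment_loss(torch.randn(6, 256), torch.randn(6, 256))
print(float(loss))
```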
This paper outlines our submission to the MEDIQA2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task. We report results for two standalone solutions under the English category of the task, the first involving two consecutive calls to the Claude 3 Opus API and the second involving training an image-disease label joint embedding in the style of CLIP for image classification. These two solutions scored 1st and 2nd place respectively on the competition leaderboard, substantially outperforming the next best solution. Additionally, we discuss insights gained from post-competition experiments. While the performance of these two solutions has significant room for improvement due to the difficulty of the shared task and the challenging nature of medical visual question answering in general, we identify the multi-stage LLM approach and the CLIP image classification approach as promising avenues for further investigation.
https://arxiv.org/abs/2404.14567
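The second solution above scores images against disease labels in a CLIP-style joint embedding space. As a rough point of reference, the snippet below does zero-shot image-label matching with an off-the-shelf CLIP checkpoint from Hugging Face; the shared-task solution trains the joint embedding rather than using it zero-shot, and the label list and image path here are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

labels = ["melanoma", "basal cell carcinoma", "benign nevus"]  # placeholder labels
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example_lesion.jpg")  # placeholder image path
inputs = processor(text=[f"a photo of {l}" for l in labels],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)[0]  # similarity over the label set
print({l: round(float(p), 3) for l, p in zip(labels, probs)})
```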
Knowledge-based Visual Question Answering (VQA) requires models to incorporate external knowledge to respond to questions about visual content. Previous methods mostly follow the "retrieve and generate" paradigm. Initially, they utilize a pre-trained retriever to fetch relevant knowledge documents, subsequently employing them to generate answers. While these methods have demonstrated commendable performance in the task, they possess limitations: (1) they employ an independent retriever to acquire knowledge solely based on the similarity between the query and knowledge embeddings, without assessing whether the knowledge document is truly conducive to helping answer the question; (2) they convert the image into text and then conduct retrieval and answering in natural language space, which may not ensure comprehensive acquisition of all image information. To address these limitations, we propose Boter, a novel framework designed to bootstrap knowledge selection and question answering by leveraging the robust multimodal perception capabilities of the Multimodal Large Language Model (MLLM). The framework consists of two modules: Selector and Answerer, where both are initialized by the MLLM and parameter-efficiently finetuned in a simple cycle: find key knowledge in the retrieved knowledge documents using the Selector, and then use them to finetune the Answerer to predict answers; obtain the pseudo-labels of key knowledge documents based on the predictions of the Answerer and weak supervision labels, and then finetune the Selector to select key knowledge; repeat. Our framework significantly enhances the performance of the baseline on the challenging open-domain Knowledge-based VQA benchmark, OK-VQA, achieving a state-of-the-art accuracy of 62.83%.
https://arxiv.org/abs/2404.13947
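A high-level sketch of the alternating Selector/Answerer cycle described in the Boter abstract. `selector`, `answerer`, and `finetune` are hypothetical callables standing in for the MLLM-initialized modules and the parameter-efficient updates; the pseudo-labeling criterion shown (a document counts as "key" if it lets the Answerer recover the ground-truth answer) is a simplified reading of the weak supervision described above.

```python
def boter_training_cycle(selector, answerer, retrieved_docs, qa_pairs,
                         finetune, num_rounds=3):
    for _ in range(num_rounds):
        # 1) Selector picks key knowledge from the retrieved documents.
        key_docs = {q: selector(question=q, candidates=retrieved_docs[q])
                    for q, _ in qa_pairs}
        # 2) Answerer is finetuned to answer conditioned on the selected knowledge.
        answerer = finetune(answerer,
                            [(q, key_docs[q], a) for q, a in qa_pairs])
        # 3) Pseudo-labels for key documents, derived from the Answerer's predictions.
        pseudo_labels = {q: [d for d in retrieved_docs[q]
                             if answerer(question=q, knowledge=[d]) == a]
                         for q, a in qa_pairs}
        # 4) Selector is finetuned towards those pseudo-labels; then repeat.
        selector = finetune(selector, pseudo_labels)
    return selector, answerer
```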
Foundation models have become invaluable in advancing the medical field. Despite their promise, the strategic deployment of LLMs for effective utility in complex medical tasks remains an open question. Our novel framework, Medical Decision-making Agents (MDAgents), aims to address this gap by automatically assigning an effective collaboration structure to LLMs. The assigned solo or group collaboration structure is tailored to the complexity of the medical task at hand, emulating real-world medical decision-making processes. We evaluate our framework and baseline methods with state-of-the-art LLMs across a suite of challenging medical benchmarks: MedQA, MedMCQA, PubMedQA, DDXPlus, PMC-VQA, Path-VQA, and MedVidQA, achieving the best performance in 5 out of 7 benchmarks that require an understanding of multi-modal medical reasoning. Ablation studies reveal that MDAgents excels at adapting the number of collaborating agents to optimize efficiency and accuracy, showcasing its robustness in diverse scenarios. We also explore the dynamics of group consensus, offering insights into how collaborative agents could behave in complex clinical team dynamics. Our code can be found at this https URL.
https://arxiv.org/abs/2404.15155
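A toy sketch of complexity-adaptive routing in the spirit of MDAgents: simple queries are answered by a single agent, complex ones by a small group whose answers are merged in a consensus round. The `llm(prompt) -> str` callable, the LOW/HIGH complexity rating, and the prompts are illustrative assumptions, not the framework's actual protocol.

```python
def medical_decision(query, llm, num_specialists=3):
    verdict = llm(f"Rate the complexity of this medical query as LOW or HIGH:\n{query}")
    if "LOW" in verdict.upper():
        # Solo collaboration structure for simple queries.
        return llm(f"Answer concisely:\n{query}")
    # Group structure: independent specialist opinions, then a consensus round.
    opinions = [llm(f"As specialist #{i + 1}, answer:\n{query}")
                for i in range(num_specialists)]
    joined = "\n".join(opinions)
    return llm(f"Given these opinions:\n{joined}\nProduce a consensus answer to:\n{query}")
```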
Visual Commonsense Reasoning (VCR) is a cognitive task, challenging models to answer visual questions requiring human commonsense and to provide rationales explaining why the answers are correct. With the emergence of Large Language Models (LLMs), it is natural and imperative to explore their applicability to VCR. However, the VCR task demands more external knowledge to tackle its challenging questions, necessitating special designs to activate LLMs' commonsense reasoning abilities. Also, most existing Multimodal LLMs adopt an abstraction of the entire input image, which makes it difficult to comprehend VCR's unique co-reference tags between image regions and text, posing challenges for fine-grained alignment. To address these issues, we propose EventLens, which leverages Event-Aware Pretraining and Cross-modal Linking to enhance VCR. First, by emulating the cognitive process of human reasoning, an Event-Aware Pretraining auxiliary task is introduced to better activate the LLM's global comprehension of intricate scenarios. Second, during fine-tuning, we further utilize reference tags to bridge RoI features with texts, while preserving the semantics of both modalities. Finally, we use instruct-style prompts to narrow the gap between pretraining and fine-tuning, and task-specific adapters to better integrate the LLM's inherent knowledge with new commonsense. Experimental results show the effectiveness of our proposed auxiliary task and fine-grained linking strategy.
https://arxiv.org/abs/2404.13847
An effective method for combining frozen large language models (LLMs) and visual encoders involves a resampler module that creates a `visual prompt' which is provided to the LLM, along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use \textit{diagnostic classifiers} to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when it is kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.
https://arxiv.org/abs/2404.13594
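A minimal probing setup in the spirit of the diagnostic classifiers above: a small linear probe is trained on top of resampler outputs to predict a spatial property, such as which image quadrant an object occupies. The dimensions, the 4-way target, and the mean-pooling choice are illustrative assumptions; in the frozen setting, only the probe's parameters receive gradients.

```python
import torch
import torch.nn as nn

class SpatialProbe(nn.Module):
    def __init__(self, prompt_dim=4096, num_positions=4):
        super().__init__()
        self.head = nn.Linear(prompt_dim, num_positions)

    def forward(self, visual_prompt):          # (batch, num_tokens, prompt_dim)
        pooled = visual_prompt.mean(dim=1)     # pool the visual-prompt tokens
        return self.head(pooled)

probe = SpatialProbe()
visual_prompt = torch.randn(8, 32, 4096)       # stand-in for frozen resampler output
targets = torch.randint(0, 4, (8,))            # e.g. object quadrant labels
loss = nn.functional.cross_entropy(probe(visual_prompt), targets)
loss.backward()                                # only the probe parameters get gradients here
print(float(loss))
```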
This study explores innovative methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms. Leveraging a balanced VQA dataset, we investigate three distinct strategies. Firstly, GAN-based approaches aim to generate answer embeddings conditioned on image and question inputs, showing potential but struggling with more complex tasks. Secondly, autoencoder-based techniques focus on learning optimal embeddings for questions and images, achieving results comparable to the GAN-based approach thanks to better handling of complex questions. Lastly, attention mechanisms incorporating Multimodal Compact Bilinear pooling (MCB) address language priors and attention modeling, albeit with a complexity-performance trade-off. This study underscores the challenges and opportunities in VQA and suggests avenues for future research, including alternative GAN formulations and attention mechanisms.
https://arxiv.org/abs/2404.13565
The development of Large Language Models (LLMs) and Diffusion Models has brought about the boom of Artificial Intelligence Generated Content (AIGC). It is essential to build an effective quality assessment framework to provide a quantifiable evaluation of different images or videos based on AIGC technologies. The content generated by AIGC methods is driven by crafted prompts. Therefore, it is intuitive that the prompts can also serve as the foundation of AIGC quality assessment. This study proposes an effective AIGC quality assessment (QA) framework. First, we propose a hybrid prompt encoding method based on a dual-source CLIP (Contrastive Language-Image Pre-Training) text encoder to understand and respond to the prompt conditions. Second, we propose an ensemble-based feature mixer module to effectively blend the adapted prompt and vision features. The empirical study is conducted on two datasets, AIGIQA-20K (AI-Generated Image Quality Assessment database) and T2VQA-DB (Text-to-Video Quality Assessment DataBase), and validates the effectiveness of our proposed method, Prompt Condition Quality Assessment (PCQA). This simple and feasible framework may promote research development in the multimodal generation field.
https://arxiv.org/abs/2404.13299
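A rough sketch of the prompt-conditioned scoring idea in PCQA: prompt features from two text encoders are fused with visual features and regressed to a quality score. The single MLP mixer here is a simplification of the ensemble-based feature mixer described above, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PromptConditionScorer(nn.Module):
    def __init__(self, text_dim=512, vis_dim=768, hidden=256):
        super().__init__()
        # Simplified stand-in for the ensemble-based feature mixer.
        self.mixer = nn.Sequential(
            nn.Linear(2 * text_dim + vis_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, text_feat_a, text_feat_b, vis_feat):
        fused = torch.cat([text_feat_a, text_feat_b, vis_feat], dim=-1)
        return self.mixer(fused).squeeze(-1)    # predicted quality score

scorer = PromptConditionScorer()
score = scorer(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 768))
print(score.shape)  # torch.Size([4])
```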
Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or from 3D scenes reconstructed by auxiliary models. Text-image-aligned 2D features from CLIP are then lifted to the point clouds, which serve as inputs for LLMs. However, this solution lacks the establishment of 3D point-to-point connections, leading to a deficiency of spatial structure information. Concurrently, the absence of integration and unification between the geometric and semantic representations of the scene culminates in a diminished level of 3D scene understanding. In this paper, we demonstrate the importance of having a unified scene representation and reconstruction framework, which is essential for LLMs in 3D scenes. Specifically, we introduce Uni3DR^2, which extracts 3D geometric and semantically aware representation features via frozen pre-trained 2D foundation models (e.g., CLIP and SAM) and a multi-scale aggregate 3D decoder. Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs. Experimental results validate that our Uni3DR^2 yields convincing gains over the baseline on the 3D reconstruction dataset ScanNet (increasing F-Score by +1.8\%). When applied to LLMs, our Uni3DR^2-LLM exhibits superior performance over the baseline on the 3D vision-language understanding dataset ScanQA (increasing BLEU-1 by +4.0\% and +4.2\% on the val set and test set, respectively). Furthermore, it outperforms the state-of-the-art method that uses additional GT point clouds on both ScanQA and 3DMV-VQA.
https://arxiv.org/abs/2404.13044
Medical visual question answering (Med-VQA) aims to automate the prediction of correct answers for medical images and questions, thereby assisting physicians in reducing repetitive tasks and alleviating their workload. Existing approaches primarily focus on pre-training models using additional and comprehensive datasets, followed by fine-tuning to enhance performance in downstream tasks. However, there is also significant value in exploring existing models to extract clinically relevant information. In this paper, we propose the Latent Prompt Assist model (LaPA) for medical visual question answering. Firstly, we design a latent prompt generation module to generate the latent prompt under the constraint of the target answer. Subsequently, we propose a multi-modal fusion block with a latent prompt fusion module that utilizes the latent prompt to extract clinically relevant information from uni-modal and multi-modal features. Additionally, we introduce a prior knowledge fusion module to integrate the relationship between diseases and organs with the clinically relevant information. Finally, we combine the final integrated information with image-language cross-modal information to predict the final answers. Experimental results on three publicly available Med-VQA datasets demonstrate that LaPA outperforms the state-of-the-art model ARL, achieving improvements of 1.83%, 0.63%, and 1.80% on VQA-RAD, SLAKE, and VQA-2019, respectively. The code is publicly available at this https URL.
https://arxiv.org/abs/2404.13039
Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. Nevertheless, how will existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel \textbf{C}ounter\textbf{F}actual \textbf{M}ulti\textbf{M}odal reasoning benchmark, abbreviated as \textbf{CFMM}, to systematically assess the counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six challenging tasks, each including hundreds of carefully human-labeled counterfactual questions, to evaluate MLLM's counterfactual reasoning capabilities across diverse aspects. Through experiments, interestingly, we find that existing MLLMs prefer to believe what they see, but ignore the counterfactual presuppositions presented in the question, thereby leading to inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs on our proposed CFMM. The significant gap between their performance on our CFMM and that on several VQA benchmarks indicates that there is still considerable room for improvement in existing MLLMs toward approaching human-level intelligence. On the other hand, through boosting MLLMs performances on our CFMM in the future, potential avenues toward developing MLLMs with advanced intelligence can be explored.
https://arxiv.org/abs/2404.12966
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses the previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals a vivid pattern: growth in instruction tuning data volume translates directly into improvements in model performance, thereby validating the necessity of the dataset's scale and the high quality of Square-10M.
https://arxiv.org/abs/2404.12803
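A sketch of the four-step Square construction loop (Self-Questioning, Answering, Reasoning, Evaluation) described above. The `mllm(image, prompt) -> str` wrapper is a hypothetical stand-in for a closed-source MLLM API call, and the prompts and 1-5 scoring scale are illustrative, not the paper's actual templates.

```python
def square_pipeline(image, mllm, min_score=4):
    # Self-Questioning: ask the model to pose questions about the image text.
    questions = mllm(image, "Ask three questions about the text in this image.").splitlines()
    examples = []
    for q in questions:
        answer = mllm(image, f"Answer the question: {q}")                              # Answering
        rationale = mllm(image, f"Explain step by step why '{answer}' answers: {q}")   # Reasoning
        score = mllm(image, f"Rate 1-5 how well '{answer}' answers '{q}'. Reply with a number.")
        if score.strip().isdigit() and int(score.strip()) >= min_score:                # Evaluation filter
            examples.append({"question": q, "answer": answer, "rationale": rationale})
    return examples
```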
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those dominated by lengthy textual content like research journal articles. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. To address this gap, we propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers or visually rich document entities like tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset, allowing the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to grasp textual contents and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA.
https://arxiv.org/abs/2404.12720
We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, image, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized value for their respective compute classes. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g. MMMU, VQAv2), Core performs competitively with GPT4-V. Meanwhile, on multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively with other frontier models on a set of well-established benchmarks (e.g. MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped in production at this http URL. A showcase of non-cherry-picked qualitative examples can also be found at this http URL.
https://arxiv.org/abs/2404.12387
Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and a significant advancement in healthcare. It assists medical experts in swiftly interpreting medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing MedVQA solutions are often limited, posing challenges in understanding their decision-making processes. To address this issue, we devise a semi-automated annotation process to streamline data preparation and build the new benchmark MedVQA datasets R-RAD and R-SLAKE. The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations for question-answering pairs in existing MedVQA datasets, i.e., VQA-RAD and SLAKE. Moreover, we design a novel framework which finetunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies to generate decision outcomes and corresponding rationales, thereby clearly showcasing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method can achieve an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. The dataset and code will be released.
https://arxiv.org/abs/2404.12372
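To illustrate what training with intermediate decision-making rationales can look like in practice, here is a toy formatter that interleaves a rationale with the answer in the fine-tuning target. The three strategy names are placeholders for illustration only; they are not the strategies defined in the paper.

```python
def build_rationale_targets(example, strategy="rationale_then_answer"):
    """Format one QA pair plus rationale into an (input, target) training record."""
    q, a, r = example["question"], example["answer"], example["rationale"]
    if strategy == "answer_only":
        target = a
    elif strategy == "rationale_then_answer":
        target = f"Reasoning: {r}\nAnswer: {a}"
    else:  # "answer_then_rationale"
        target = f"Answer: {a}\nBecause: {r}"
    return {"input": f"Question: {q}", "target": target}

print(build_rationale_targets({"question": "Is there cardiomegaly?",
                               "answer": "Yes",
                               "rationale": "The cardiothoracic ratio exceeds 0.5."}))
```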
Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to over-learning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, \textit{MUSIC-AVQA-R}, crafted in two steps: rephrasing questions within the test split of a public dataset (\textit{MUSIC-AVQA}) and subsequently introducing distribution shifts to the split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on both datasets, especially obtaining a significant improvement of 9.68\% on the proposed dataset. Extensive ablation experiments are conducted on these two datasets to validate the effectiveness of the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through evaluation on our dataset.
https://arxiv.org/abs/2404.12020
This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), in which various excellent solutions were submitted and evaluated on KVQ, a dataset collected from the popular short-form video platform Kuaishou/Kwai. The KVQ database is divided into three parts: 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants, and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performance for S-UGC VQA. The project can be found at this https URL.
https://arxiv.org/abs/2404.11313