Developing generalist foundation models has recently attracted tremendous attention among researchers in the field of AI for Medicine (AI4Medicine). A pivotal insight in developing these models is their reliance on dataset scaling, which underscores the need for open-source medical image datasets that incorporate diverse supervision signals across various imaging modalities. In this paper, we introduce RadGenome-Chest CT, a comprehensive, large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. Specifically, we leverage the latest powerful universal segmentation and large language models to extend the original dataset (25,692 non-contrast 3D chest CT volumes and reports from 20,000 patients) in the following aspects: (i) organ-level segmentation masks covering 197 categories, which provide intermediate visual clues to support interpretation reasoning; (ii) 665K multi-granularity grounded reports, where each sentence of a report is linked to the corresponding anatomical region of the CT volume in the form of a segmentation mask; (iii) 1.3M grounded VQA pairs, where questions and answers are all linked with reference segmentation masks, enabling models to associate visual evidence with textual explanations. All grounded reports and VQA pairs in the validation set have gone through manual verification to ensure dataset quality. We believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models by enabling training to generate text grounded in given segmentation regions, which is unattainable with previous relevant datasets. We will release all segmentation masks, grounded reports, and VQA pairs to facilitate further research and development in this field.
https://arxiv.org/abs/2404.16754
This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2024. The challenge addresses a major problem in the field of image and video processing, namely Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into an image track and a video track. The image track uses AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) produced by 15 popular generative models, and attracted a total of 318 registered participants. A total of 1,646 submissions were received in the development phase and 221 in the test phase; finally, 16 participating teams submitted their models and fact sheets. The video track uses T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models, and attracted 196 registered participants. A total of 991 submissions were received in the development phase and 185 in the test phase; finally, 12 participating teams submitted their models and fact sheets. Several methods achieved better results than the baseline methods, and the winning methods in both tracks demonstrated superior prediction performance on AIGC.
https://arxiv.org/abs/2404.16687
In the realm of Medical Visual Language Models (Med-VLMs), the quest for universally efficient fine-tuning mechanisms remains paramount, especially given that researchers in interdisciplinary fields are often extremely short of training resources; yet this area remains largely unexplored. Given the unique challenges in the medical domain, such as limited data scope and significant domain-specific requirements, evaluating and adapting Parameter-Efficient Fine-Tuning (PEFT) methods specifically for Med-VLMs is essential. Most current PEFT methods for Med-VLMs have yet to be comprehensively investigated and mainly focus on adding components to the model's structure or input. However, fine-tuning intrinsic model components often yields better generality and consistency, and its impact on the ultimate performance of Med-VLMs has been widely overlooked and remains understudied. In this paper, we explore an alternative to traditional PEFT methods, in particular the impact of fine-tuning LayerNorm layers, FFNs, and Attention layers of Med-VLMs. Our comprehensive study spans both small-scale and large-scale Med-VLMs, evaluating their performance under various fine-tuning paradigms across tasks such as Medical Visual Question Answering and Medical Imaging Report Generation. The findings reveal unique insights into the effects of intrinsic parameter fine-tuning when adapting Med-VLMs to downstream tasks, and show that fine-tuning solely the LayerNorm layers not only surpasses the efficiency of traditional PEFT methods but also retains the model's accuracy and generalization capabilities across a spectrum of medical downstream tasks. The experiments show LayerNorm fine-tuning's superior adaptability and scalability, particularly in the context of large-scale Med-VLMs.
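The core recipe here, tuning only the LayerNorm parameters while freezing everything else, can be sketched as a simple parameter filter. The dict-based parameter representation and the name-matching rule below are illustrative assumptions; with PyTorch the same filter would run over `model.named_parameters()`.

```python
# Sketch: freeze all parameters except those belonging to LayerNorm modules.
# Parameters are modeled as a name -> {"requires_grad": bool} mapping; the
# substring check for "layernorm"/".ln_" is a naming-convention assumption.

def layernorm_only(named_params):
    trainable = []
    for name, param in named_params.items():
        is_ln = "layernorm" in name.lower() or ".ln_" in name.lower()
        param["requires_grad"] = is_ln  # everything else is frozen
        if is_ln:
            trainable.append(name)
    return trainable
```

Only the names returned here would be handed to the optimizer, which is what makes this scheme cheaper than adapter-style PEFT.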
https://arxiv.org/abs/2404.16385
This paper reviews the AIS 2024 Video Quality Assessment (VQA) Challenge, focused on User-Generated Content (UGC). The aim of this challenge is to gather deep learning-based methods capable of estimating the perceptual quality of UGC videos. The user-generated videos from the YouTube UGC Dataset span diverse content (sports, games, lyrics, anime, etc.), qualities, and resolutions. The proposed methods must process 30 FHD frames in under 1 second. In the challenge, a total of 102 participants registered, and 15 submitted code and models. The performance of the top-5 submissions is reviewed and provided here as a survey of diverse deep models for efficient video quality assessment of user-generated content.
https://arxiv.org/abs/2404.16205
Vision-language models, while effective in general domains and showing strong performance in diverse multi-modal applications like visual question-answering (VQA), struggle to maintain the same level of effectiveness in more specialized domains, e.g., medical. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. This model goes through three stages of parameter-efficient training using three separate biomedical and radiology multi-modal visual and text datasets. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5% and demonstrates strong performance on another MedVQA dataset, VQA-RAD, achieving an overall accuracy of 73.2%.
https://arxiv.org/abs/2404.16192
Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover only a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $30$ LVLMs, such as the proprietary GPT-4V, GeminiProVision, and the open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
https://arxiv.org/abs/2404.16006
Multimodal LLMs are the natural evolution of LLMs, extending their capabilities beyond the purely textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Relevant passages retrieved with this approach are employed as additional context for the LLM, augmenting the effectiveness and precision of the generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
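The two-stage retrieval described here can be sketched as: rank documents first, then rank passages inside the winning documents, and prepend the best passages to the prompt. The toy vectors and cosine scorer below are stand-ins for the paper's actual retriever, not its implementation.

```python
# Minimal sketch of a hierarchical (document -> passage) retrieval pipeline
# in the spirit of Wiki-LLaVA. Embeddings are caller-supplied lists of
# floats; cosine similarity is an illustrative choice of scorer.
from math import sqrt

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def hierarchical_retrieve(query_vec, kb, top_docs=1, top_passages=2):
    # Stage 1: coarse document ranking.
    docs = sorted(kb, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:top_docs]
    # Stage 2: fine-grained passage ranking within the selected documents.
    passages = [p for d in docs for p in d["passages"]]
    passages.sort(key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return [p["text"] for p in passages[:top_passages]]

def build_prompt(question, retrieved):
    """Use retrieved passages as additional context for the LLM."""
    return "Context:\n" + "\n".join(retrieved) + f"\n\nQuestion: {question}"
```

The hierarchy keeps the passage-level search cheap: only passages from the few top-ranked documents are scored.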
https://arxiv.org/abs/2404.15406
The rapid advancement of large-scale vision-language models has showcased remarkable capabilities across various tasks. However, the lack of extensive and high-quality image-text data in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to construct vision-language datasets. Based on the constructed dataset, we developed MedDr, a generalist foundation model for healthcare capable of handling diverse medical data modalities, including radiology, pathology, dermatology, retinography, and endoscopy. Moreover, during inference, we propose a simple but effective retrieval-augmented medical diagnosis strategy, which enhances the model's generalization ability. Extensive experiments on visual question answering, medical report generation, and medical image diagnosis demonstrate the superiority of our method.
https://arxiv.org/abs/2404.15127
Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical images and text. Current algorithms that exploit the global and local alignment between medical image and text could however be marred by the redundant information in medical data. To address this issue, we propose a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework for chest X-ray. In this framework, medical knowledge is grounded to the appropriate anatomical regions by using a transformer-based grounded knowledge-enhanced module for fine-grained alignment between anatomical region-level visual features and the textual features of medical knowledge. The performance of GK-MVLP is competitive with or exceeds the state of the art on downstream chest X-ray disease classification, disease localization, report generation, and medical visual question-answering tasks. Our results show the advantage of incorporating a grounding mechanism to remove biases and improve the alignment between chest X-ray images and radiology reports.
https://arxiv.org/abs/2404.14750
This paper outlines our submission to the MEDIQA2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task. We report results for two standalone solutions under the English category of the task, the first involving two consecutive API calls to the Claude 3 Opus API and the second involving training an image-disease label joint embedding in the style of CLIP for image classification. These two solutions scored 1st and 2nd place respectively on the competition leaderboard, substantially outperforming the next best solution. Additionally, we discuss insights gained from post-competition experiments. While the performance of these two solutions has significant room for improvement due to the difficulty of the shared task and the challenging nature of medical visual question answering in general, we identify the multi-stage LLM approach and the CLIP image classification approach as promising avenues for further investigation.
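The second solution's CLIP-style classification step reduces to: embed the image and every disease-label text into a joint space, then pick the label with the highest cosine similarity. A minimal numpy sketch, with random stand-in embeddings and hypothetical label names in place of trained encoders:

```python
# CLIP-style image/disease-label classification sketch. Real systems would
# obtain `image_emb` and `label_embs` from trained image and text encoders;
# here they are plain arrays, and the label names are hypothetical.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify(image_emb, label_embs, label_names):
    sims = normalize(image_emb) @ normalize(label_embs).T  # cosine similarities
    return label_names[int(np.argmax(sims))]
```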
https://arxiv.org/abs/2404.14567
Knowledge-based Visual Question Answering (VQA) requires models to incorporate external knowledge to respond to questions about visual content. Previous methods mostly follow the "retrieve and generate" paradigm. Initially, they utilize a pre-trained retriever to fetch relevant knowledge documents, subsequently employing them to generate answers. While these methods have demonstrated commendable performance in the task, they possess limitations: (1) they employ an independent retriever to acquire knowledge solely based on the similarity between the query and knowledge embeddings, without assessing whether the knowledge document is truly conducive to helping answer the question; (2) they convert the image into text and then conduct retrieval and answering in natural language space, which may not ensure comprehensive acquisition of all image information. To address these limitations, we propose Boter, a novel framework designed to bootstrap knowledge selection and question answering by leveraging the robust multimodal perception capabilities of the Multimodal Large Language Model (MLLM). The framework consists of two modules: Selector and Answerer, where both are initialized by the MLLM and parameter-efficiently finetuned in a simple cycle: find key knowledge in the retrieved knowledge documents using the Selector, and then use them to finetune the Answerer to predict answers; obtain the pseudo-labels of key knowledge documents based on the predictions of the Answerer and weak supervision labels, and then finetune the Selector to select key knowledge; repeat. Our framework significantly enhances the performance of the baseline on the challenging open-domain Knowledge-based VQA benchmark, OK-VQA, achieving a state-of-the-art accuracy of 62.83%.
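The alternating Selector/Answerer cycle described above can be sketched as a toy loop. The MLLM-initialized modules are replaced by a score table (Selector) and a correctness oracle (Answerer), and the soft score update stands in for parameter-efficient finetuning; all of these substitutions are illustrative assumptions.

```python
# Toy sketch of Boter's bootstrap cycle: select key knowledge, derive
# pseudo-labels from the Answerer's success, refit the Selector, repeat.

def select_key_docs(docs, selector_scores, top_k=2):
    """Selector: keep the top-k retrieved documents by current score."""
    return sorted(docs, key=lambda d: selector_scores[d], reverse=True)[:top_k]

def boter_cycle(docs, selector_scores, answer_is_correct, rounds=2):
    for _ in range(rounds):
        key_docs = select_key_docs(docs, selector_scores)
        # In the real method the Answerer is finetuned on `key_docs` here
        # (omitted); a document is then pseudo-labeled as key knowledge if
        # conditioning on it yields a correct answer.
        labels = {d: 1.0 if answer_is_correct(d) else 0.0 for d in docs}
        # "Finetune" the Selector toward the pseudo-labels (soft update).
        for d in docs:
            selector_scores[d] = 0.5 * selector_scores[d] + 0.5 * labels[d]
    return select_key_docs(docs, selector_scores)
```

After a few rounds, documents that actually help answer the question dominate the ranking even if the initial retrieval similarity favored distractors.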
https://arxiv.org/abs/2404.13947
Foundation models have become invaluable in advancing the medical field. Despite their promise, the strategic deployment of LLMs for effective utility in complex medical tasks remains an open question. Our novel framework, Medical Decision-making Agents (MDAgents), aims to address this gap by automatically assigning an effective collaboration structure for LLMs. The assigned solo or group collaboration structure is tailored to the complexity of the medical task at hand, emulating real-world medical decision-making processes. We evaluate our framework and baseline methods with state-of-the-art LLMs across a suite of challenging medical benchmarks: MedQA, MedMCQA, PubMedQA, DDXPlus, PMC-VQA, Path-VQA, and MedVidQA, achieving the best performance in 5 out of 7 benchmarks that require an understanding of multi-modal medical reasoning. Ablation studies reveal that MDAgents excels in adapting the number of collaborating agents to optimize efficiency and accuracy, showcasing its robustness in diverse scenarios. We also explore the dynamics of group consensus, offering insights into how collaborative agents could behave in complex clinical team dynamics. Our code can be found at this https URL.
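The complexity-adaptive routing idea can be illustrated with a few lines: a (stand-in) complexity score sends a query either to a single agent or to a group whose answers are merged by majority vote. The threshold, the agents, and the consensus rule are assumptions for illustration, not MDAgents' actual design.

```python
# Sketch of complexity-adaptive solo/group collaboration.
from collections import Counter

def assign_structure(complexity, threshold=0.5):
    """Route simple queries to a solo agent, complex ones to a group."""
    return "solo" if complexity < threshold else "group"

def solve(query, complexity, solo_agent, group_agents):
    if assign_structure(complexity) == "solo":
        return solo_agent(query)
    votes = [agent(query) for agent in group_agents]  # group collaboration
    return Counter(votes).most_common(1)[0][0]        # majority-vote consensus
```

Skipping the group for easy queries is what buys the efficiency the ablations point to: most calls cost one agent, not several.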
https://arxiv.org/abs/2404.15155
Visual Commonsense Reasoning (VCR) is a cognitive task, challenging models to answer visual questions requiring human commonsense and to provide rationales explaining why the answers are correct. With the emergence of Large Language Models (LLMs), it is natural and imperative to explore their applicability to VCR. However, the VCR task demands more external knowledge to tackle its challenging questions, necessitating special designs to activate LLMs' commonsense reasoning abilities. Also, most existing Multimodal LLMs adopt an abstraction of the entire input image, which makes it difficult to comprehend VCR's unique co-reference tags between image regions and text, posing challenges for fine-grained alignment. To address these issues, we propose EventLens, which leverages Event-Aware Pretraining and Cross-modal Linking and EnhanceS VCR. First, by emulating the cognitive process of human reasoning, an Event-Aware Pretraining auxiliary task is introduced to better activate the LLM's global comprehension of intricate scenarios. Second, during fine-tuning, we further utilize reference tags to bridge RoI features with texts, while preserving both modality semantics. Finally, we use instruct-style prompts to narrow the gap between pretraining and fine-tuning, and task-specific adapters to better integrate the LLM's inherent knowledge with new commonsense. Experimental results show the effectiveness of our proposed auxiliary task and fine-grained linking strategy.
https://arxiv.org/abs/2404.13847
An effective method for combining frozen large language models (LLMs) and visual encoders involves a resampler module that creates a `visual prompt' which is provided to the LLM along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use \textit{diagnostic classifiers} to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when it is kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.
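A diagnostic classifier is just a lightweight probe trained on frozen features to test whether a property is decodable from them. The sketch below uses synthetic features and an ordinary least-squares linear probe; both are illustrative choices, not the paper's setup.

```python
# Diagnostic-classifier sketch: fit a linear probe on frozen feature vectors
# and measure how well a (here synthetic, e.g. left/right) label can be
# decoded from them. High probe accuracy => the property is linearly
# encoded; chance-level accuracy => it is absent from the representation.
import numpy as np

def fit_linear_probe(feats, labels):
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)    # least-squares fit
    return w

def probe_accuracy(feats, labels, w):
    X = np.hstack([feats, np.ones((len(feats), 1))])
    preds = (X @ w > 0.5).astype(float)               # threshold regression output
    return float((preds == labels).mean())
```

In the paper's terms: probing frozen resampler outputs corresponds to fitting only `w` while the features stay fixed; joint training would additionally update whatever produces `feats`.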
https://arxiv.org/abs/2404.13594
This study explores innovative methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms. Leveraging a balanced VQA dataset, we investigate three distinct strategies. First, GAN-based approaches aim to generate answer embeddings conditioned on image and question inputs, showing potential but struggling with more complex tasks. Second, autoencoder-based techniques focus on learning optimal embeddings for questions and images, achieving results comparable to the GANs owing to better handling of complex questions. Lastly, attention mechanisms incorporating Multimodal Compact Bilinear pooling (MCB) address language priors and attention modeling, albeit with a complexity-performance trade-off. This study underscores the challenges and opportunities in VQA and suggests avenues for future research, including alternative GAN formulations and attentional mechanisms.
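For readers unfamiliar with MCB: it approximates the (huge) outer product of an image feature and a question feature by count-sketching each vector and multiplying the sketches element-wise in the frequency domain, which equals their circular convolution. A minimal numpy sketch (dimensions and seed are arbitrary illustration choices):

```python
# Minimal Multimodal Compact Bilinear (MCB) pooling sketch.
import numpy as np

def count_sketch(x, h, s, d):
    """Project x into d dims using hash indices h and random signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # signed accumulation into hashed buckets
    return y

def mcb_pool(x, y, d, seed=0):
    rng = np.random.default_rng(seed)
    hx = rng.integers(0, d, size=len(x))
    hy = rng.integers(0, d, size=len(y))
    sx = rng.choice([-1.0, 1.0], size=len(x))
    sy = rng.choice([-1.0, 1.0], size=len(y))
    # Element-wise product in the frequency domain == circular convolution
    # of the two sketches, approximating the bilinear (outer) product.
    fx = np.fft.fft(count_sketch(x, hx, sx, d))
    fy = np.fft.fft(count_sketch(y, hy, sy, d))
    return np.real(np.fft.ifft(fx * fy))
```

The output dimension `d` is fixed regardless of the input sizes, which is the "compact" part of the trade-off the abstract mentions.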
https://arxiv.org/abs/2404.13565
The development of Large Language Models (LLMs) and Diffusion Models has brought the boom of Artificial Intelligence Generated Content (AIGC). It is essential to build an effective quality assessment framework that provides a quantifiable evaluation of images or videos produced by AIGC technologies. The content generated by AIGC methods is driven by crafted prompts; it is therefore intuitive that the prompts can also serve as a foundation of AIGC quality assessment. This study proposes an effective AIGC quality assessment (QA) framework. First, we propose a hybrid prompt encoding method based on a dual-source CLIP (Contrastive Language-Image Pre-Training) text encoder to understand and respond to the prompt conditions. Second, we propose an ensemble-based feature mixer module to effectively blend the adapted prompt and vision features. Empirical studies on two datasets, AIGIQA-20K (AI-Generated Image Quality Assessment database) and T2VQA-DB (Text-to-Video Quality Assessment DataBase), validate the effectiveness of our proposed method, Prompt Condition Quality Assessment (PCQA). Our simple and feasible framework may promote research and development in the multimodal generation field.
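The data flow described above (dual-source prompt encoding, feature blending, ensemble scoring) can be sketched abstractly. Every encoder and mixer below is a caller-supplied placeholder; none of this is the paper's actual architecture.

```python
# Toy sketch of the PCQA flow: two text encoders embed the prompt, the
# prompt and vision features are fused, and an ensemble of regression
# heads ("feature mixers") is averaged into one quality score.
import numpy as np

def predict_quality(vision_feat, prompt, encoders, mixers):
    prompt_feats = [enc(prompt) for enc in encoders]       # dual-source encoding
    fused = np.concatenate([vision_feat, *prompt_feats])   # feature blending
    return float(np.mean([m(fused) for m in mixers]))      # ensemble average
```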
https://arxiv.org/abs/2404.13299
Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models. Text-image aligned 2D features from CLIP are then lifted to point clouds, which serve as inputs for LLMs. However, this solution lacks the establishment of 3D point-to-point connections, leading to a deficiency of spatial structure information. Concurrently, the absence of integration and unification between the geometric and semantic representations of the scene culminates in a diminished level of 3D scene understanding. In this paper, we demonstrate the importance of having a unified scene representation and reconstruction framework, which is essential for LLMs in 3D scenes. Specifically, we introduce Uni3DR^2, which extracts 3D geometric and semantic aware representation features via frozen pre-trained 2D foundation models (e.g., CLIP and SAM) and a multi-scale aggregate 3D decoder. Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs. Experimental results validate that our Uni3DR^2 yields convincing gains over the baseline on the 3D reconstruction dataset ScanNet (increasing F-Score by +1.8\%). When applied to LLMs, our Uni3DR^2-LLM exhibits superior performance over the baseline on the 3D vision-language understanding dataset ScanQA (increasing BLEU-1 by +4.0\% and +4.2\% on the val and test sets, respectively). Furthermore, it outperforms the state-of-the-art method that uses additional GT point clouds on both ScanQA and 3DMV-VQA.
https://arxiv.org/abs/2404.13044
Medical visual question answering (Med-VQA) aims to automate the prediction of correct answers for medical images and questions, thereby assisting physicians in reducing repetitive tasks and alleviating their workload. Existing approaches primarily focus on pre-training models using additional and comprehensive datasets, followed by fine-tuning to enhance performance in downstream tasks. However, there is also significant value in exploring existing models to extract clinically relevant information. In this paper, we propose the Latent Prompt Assist model (LaPA) for medical visual question answering. Firstly, we design a latent prompt generation module to generate the latent prompt with the constraint of the target answer. Subsequently, we propose a multi-modal fusion block with latent prompt fusion module that utilizes the latent prompt to extract clinical-relevant information from uni-modal and multi-modal features. Additionally, we introduce a prior knowledge fusion module to integrate the relationship between diseases and organs with the clinical-relevant information. Finally, we combine the final integrated information with image-language cross-modal information to predict the final answers. Experimental results on three publicly available Med-VQA datasets demonstrate that LaPA outperforms the state-of-the-art model ARL, achieving improvements of 1.83%, 0.63%, and 1.80% on VQA-RAD, SLAKE, and VQA-2019, respectively. The code is publicly available at this https URL.
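The abstract specifies LaPA's module order (latent prompt generation, latent-prompt fusion, prior-knowledge fusion, final prediction) without their internals, so the flow can only be sketched as a composition of caller-supplied placeholder stages:

```python
# Schematic of LaPA's staged flow. Every stage is a placeholder function;
# only the order of composition reflects the abstract's description.

def lapa_predict(img_feat, txt_feat, gen_latent_prompt, fuse, add_prior, head):
    latent = gen_latent_prompt(txt_feat)         # latent prompt generation
    clinical = fuse(latent, img_feat, txt_feat)  # latent-prompt fusion
    integrated = add_prior(clinical)             # disease-organ prior fusion
    return head(integrated, img_feat, txt_feat)  # combine with cross-modal info
```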
https://arxiv.org/abs/2404.13039
Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. Nevertheless, how will existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel \textbf{C}ounter\textbf{F}actual \textbf{M}ulti\textbf{M}odal reasoning benchmark, abbreviated as \textbf{CFMM}, to systematically assess the counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six challenging tasks, each including hundreds of carefully human-labeled counterfactual questions, to evaluate MLLM's counterfactual reasoning capabilities across diverse aspects. Through experiments, interestingly, we find that existing MLLMs prefer to believe what they see, but ignore the counterfactual presuppositions presented in the question, thereby leading to inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs on our proposed CFMM. The significant gap between their performance on our CFMM and that on several VQA benchmarks indicates that there is still considerable room for improvement in existing MLLMs toward approaching human-level intelligence. On the other hand, through boosting MLLMs performances on our CFMM in the future, potential avenues toward developing MLLMs with advanced intelligence can be explored.
https://arxiv.org/abs/2404.12966
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs and sets a new standard on OCRBench(62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.
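The four-step Square pipeline (Self-Questioning, Answering, Reasoning, Evaluation) can be sketched as a loop over model calls. The `mllm` callable stands in for a closed-source MLLM API, and the prompt strings and score-based filter are illustrative assumptions, not the paper's exact prompts.

```python
# Schematic of the Square data-construction pipeline behind Square-10M.

def square_pipeline(image, mllm, keep_threshold=0.5):
    examples = []
    for q in mllm(f"self-question about {image}"):        # 1. Self-Questioning
        a = mllm(f"answer '{q}' given {image}")           # 2. Answering
        r = mllm(f"explain why '{a}' answers '{q}'")      # 3. Reasoning
        score = mllm(f"rate the pair ('{q}', '{a}')")     # 4. Evaluation
        if score >= keep_threshold:                       # keep only high-quality data
            examples.append({"question": q, "answer": a, "reasoning": r})
    return examples
```

Keeping the reasoning string alongside each QA pair is what supplies the "VQA reasoning data" the abstract credits with reducing hallucinations.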
https://arxiv.org/abs/2404.12803