The development of Large Language Models (LLMs) and Diffusion Models has brought a boom in Artificial Intelligence Generated Content (AIGC). It is essential to build an effective quality assessment framework that provides a quantifiable evaluation of different images or videos produced by AIGC technologies. The content generated by AIGC methods is driven by crafted prompts; it is therefore intuitive that the prompts can also serve as the foundation of AIGC quality assessment. This study proposes an effective AIGC quality assessment (QA) framework. First, we propose a hybrid prompt encoding method based on a dual-source CLIP (Contrastive Language-Image Pre-Training) text encoder to understand and respond to the prompt conditions. Second, we propose an ensemble-based feature mixer module to effectively blend the adapted prompt and vision features. The empirical study is conducted on two datasets, AIGIQA-20K (AI-Generated Image Quality Assessment database) and T2VQA-DB (Text-to-Video Quality Assessment DataBase), and validates the effectiveness of our proposed method, Prompt Condition Quality Assessment (PCQA). This simple and feasible framework may promote research development in the multimodal generation field.
https://arxiv.org/abs/2404.13299
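To make the two-stage design above concrete, here is a minimal PyTorch sketch of a prompt-conditioned quality head: two prompt embeddings (standing in for the dual-source CLIP text encoders) are fused, then blended with a vision embedding by an ensemble of small MLP mixers. The module names and sizes are illustrative assumptions, not the actual PCQA architecture.

```python
import torch
import torch.nn as nn

class PromptConditionedQualityHead(nn.Module):
    """Toy prompt-conditioned quality regressor: two prompt embeddings
    (e.g. from two CLIP text encoders) are fused, then blended with a
    vision embedding by an ensemble of small MLP mixers."""

    def __init__(self, dim=512, n_mixers=3):
        super().__init__()
        self.prompt_proj = nn.Linear(2 * dim, dim)   # fuse dual-source prompt features
        self.mixers = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))
            for _ in range(n_mixers)
        ])

    def forward(self, text_a, text_b, vision):
        prompt = self.prompt_proj(torch.cat([text_a, text_b], dim=-1))
        fused = torch.cat([prompt, vision], dim=-1)
        scores = torch.stack([m(fused) for m in self.mixers])
        return scores.mean(dim=0)                    # ensemble average -> quality score

head = PromptConditionedQualityHead()
t_a, t_b, v = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
print(head(t_a, t_b, v).shape)  # torch.Size([4, 1])
```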
Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or from 3D scenes reconstructed by auxiliary models. Text-image aligned 2D features from CLIP are then lifted to the point clouds, which serve as inputs for LLMs. However, this solution does not establish 3D point-to-point connections, leading to a deficiency of spatial structure information. Moreover, the lack of integration and unification between the geometric and semantic representations of the scene diminishes the level of 3D scene understanding. In this paper, we demonstrate the importance of a unified scene representation and reconstruction framework, which is essential for LLMs in 3D scenes. Specifically, we introduce Uni3DR^2, which extracts 3D geometric and semantic-aware representation features via frozen pre-trained 2D foundation models (e.g., CLIP and SAM) and a multi-scale aggregation 3D decoder. Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs. Experimental results validate that our Uni3DR^2 yields convincing gains over the baseline on the 3D reconstruction dataset ScanNet (increasing F-Score by +1.8\%). When applied to LLMs, our Uni3DR^2-LLM exhibits superior performance over the baseline on the 3D vision-language understanding dataset ScanQA (increasing BLEU-1 by +4.0\% and +4.2\% on the val set and test set, respectively). Furthermore, it outperforms the state-of-the-art method that uses additional GT point clouds on both ScanQA and 3DMV-VQA.
https://arxiv.org/abs/2404.13044
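The "lifting" step that both the baseline and this line of work rely on can be illustrated with a short sketch: project 3D points into a camera view and sample per-pixel 2D features (e.g., CLIP patch features) for each visible point. This is a generic illustration under assumed pinhole-camera conventions, not the Uni3DR^2 pipeline itself.

```python
import torch

def lift_features_to_points(points, feat_map, K, cam_T_world):
    """Sample per-pixel 2D features (e.g. CLIP patch features) for 3D points
    projected into one camera view.
    points: (N, 3) world coords; feat_map: (C, H, W);
    K: (3, 3) pinhole intrinsics; cam_T_world: (4, 4) world-to-camera."""
    N = points.shape[0]
    homo = torch.cat([points, torch.ones(N, 1)], dim=1)     # homogeneous coords
    cam = (cam_T_world @ homo.T).T[:, :3]                   # camera-frame coords
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)              # perspective divide
    C, H, W = feat_map.shape
    u = uv[:, 0].round().long().clamp(0, W - 1)             # nearest-pixel lookup
    v = uv[:, 1].round().long().clamp(0, H - 1)
    visible = cam[:, 2] > 0                                 # points in front of camera
    return feat_map[:, v, u].T, visible                     # (N, C) features, (N,) mask

pts = torch.rand(100, 3)
fmap = torch.randn(512, 32, 32)
K = torch.tensor([[32., 0., 16.], [0., 32., 16.], [0., 0., 1.]])
T = torch.eye(4); T[2, 3] = 2.0                             # camera 2 units back
feats, vis = lift_features_to_points(pts, fmap, K, T)
print(feats.shape, int(vis.sum()))                          # torch.Size([100, 512]) 100
```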
Medical visual question answering (Med-VQA) aims to automate the prediction of correct answers for medical images and questions, thereby assisting physicians in reducing repetitive tasks and alleviating their workload. Existing approaches primarily focus on pre-training models using additional, comprehensive datasets, followed by fine-tuning to enhance performance in downstream tasks. However, there is also significant value in exploring existing models to extract clinically relevant information. In this paper, we propose the Latent Prompt Assist model (LaPA) for medical visual question answering. First, we design a latent prompt generation module to generate the latent prompt under the constraint of the target answer. Subsequently, we propose a multi-modal fusion block with a latent prompt fusion module that utilizes the latent prompt to extract clinically relevant information from uni-modal and multi-modal features. Additionally, we introduce a prior knowledge fusion module to integrate the relationship between diseases and organs with the clinically relevant information. Finally, we combine the final integrated information with image-language cross-modal information to predict the final answers. Experimental results on three publicly available Med-VQA datasets demonstrate that LaPA outperforms the state-of-the-art model ARL, achieving improvements of 1.83%, 0.63%, and 1.80% on VQA-RAD, SLAKE, and VQA-2019, respectively. The code is publicly available at this https URL.
https://arxiv.org/abs/2404.13039
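One way to picture the latent prompt fusion module is as a set of learned latent tokens that cross-attend to uni- or multi-modal feature tokens to pull out task-relevant information. The sketch below is a generic cross-attention reading of that idea with assumed dimensions; the actual LaPA module may differ.

```python
import torch
import torch.nn as nn

class LatentPromptFusion(nn.Module):
    """Learned latent prompt tokens attend over feature tokens
    (uni- or multi-modal) to extract task-relevant information."""

    def __init__(self, dim=256, n_latents=8, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                    # feats: (B, T, dim)
        B = feats.shape[0]
        q = self.latents.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.attn(q, feats, feats)      # latents query the features
        return self.norm(out + q)                # (B, n_latents, dim)

fusion = LatentPromptFusion()
print(fusion(torch.randn(2, 50, 256)).shape)     # torch.Size([2, 8, 256])
```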
Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. Nevertheless, how will existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel \textbf{C}ounter\textbf{F}actual \textbf{M}ulti\textbf{M}odal reasoning benchmark, abbreviated as \textbf{CFMM}, to systematically assess the counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six challenging tasks, each including hundreds of carefully human-labeled counterfactual questions, to evaluate MLLMs' counterfactual reasoning capabilities across diverse aspects. Through experiments, interestingly, we find that existing MLLMs prefer to believe what they see but ignore the counterfactual presuppositions presented in the question, thereby leading to inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs on our proposed CFMM. The significant gap between their performance on CFMM and on several VQA benchmarks indicates that there is still considerable room for improvement before existing MLLMs approach human-level intelligence. On the other hand, boosting MLLMs' performance on CFMM in the future may open potential avenues toward developing MLLMs with advanced intelligence.
https://arxiv.org/abs/2404.12966
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) We demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions; this not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals a clear pattern: growth in instruction tuning data volume translates directly into improved model performance, validating the necessity of the dataset scale and the high quality of Square-10M.
https://arxiv.org/abs/2404.12803
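The four Square steps form a simple generate-and-filter loop. Below is a hedged sketch of that loop, where `call_mllm(prompt, image) -> str` is a hypothetical stand-in for the closed-source MLLM API; the exact prompts and response parsing are illustrative assumptions, not the paper's prompts.

```python
def square_pipeline(image, call_mllm):
    """One pass of the four Square stages for a single image.
    call_mllm(prompt, image) -> str is a hypothetical MLLM API stand-in;
    response parsing is elided for brevity."""
    # 1) Self-Questioning: have the MLLM pose text-centric questions.
    questions = call_mllm("List questions about the text in this image.", image).splitlines()
    records = []
    for q in questions:
        # 2) Answering: answer each question from the image itself.
        answer = call_mllm("Answer concisely: " + q, image)
        # 3) Reasoning: elicit the supporting evidence for the answer.
        rationale = call_mllm(f"Explain the evidence for answering '{q}' with '{answer}'.", image)
        # 4) Evaluation: self-check the QA pair and keep only confident ones.
        verdict = call_mllm(f"Is '{answer}' correct for '{q}'? Reply yes or no.", image)
        if verdict.strip().lower().startswith("yes"):
            records.append({"question": q, "answer": answer, "rationale": rationale})
    return records
```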
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those dominated by lengthy textual content like research journal articles. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. To address this gap, we propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers or visually rich document entities like tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset, allowing the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to grasp textual contents and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA.
https://arxiv.org/abs/2404.12720
We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, image, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized value for their respective compute classes. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g. MMMU, VQAv2), Core performs competitively with GPT4-V. Meanwhile, on multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively with other frontier models on a set of well-established benchmarks (e.g. MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped in production at this http URL. A showcase of non-cherry-picked qualitative examples can also be found at this http URL.
https://arxiv.org/abs/2404.12387
Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and a significant advancement in healthcare. It assists medical experts in swiftly interpreting medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing MedVQA solutions are often limited, posing challenges in understanding their decision-making processes. To address this issue, we devise a semi-automated annotation process to streamline data preparation and build the new benchmark MedVQA datasets R-RAD and R-SLAKE. The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales, generated by multimodal large language models and human annotations, for question-answering pairs in existing MedVQA datasets, i.e., VQA-RAD and SLAKE. Moreover, we design a novel framework which finetunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies to generate decision outcomes and corresponding rationales, thereby clearly showcasing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method achieves an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. The dataset and code will be released.
https://arxiv.org/abs/2404.12372
Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, \textit{MUSIC-AVQA-R}, crafted in two steps: rephrasing questions within the test split of a public dataset (\textit{MUSIC-AVQA}) and subsequently introducing distribution shifts to split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on both datasets, especially obtaining a significant improvement of 9.68\% on the proposed dataset. Extensive ablation experiments are conducted on these two datasets to validate the effectiveness of the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through the evaluation on our dataset.
https://arxiv.org/abs/2404.12020
This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions were submitted and evaluated on KVQ, a dataset collected from the popular short-form video platform Kuaishou/Kwai. The KVQ database is divided into three parts: 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants, and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performance for S-UGC VQA. The project can be found at this https URL.
https://arxiv.org/abs/2404.11313
Visual Question Answering (VQA) is a complicated task that requires the capability to process natural language and images simultaneously. Early research on this task focused on methods to help machines understand objects and scene contexts in images. However, text appearing in an image, which carries explicit information about the image's full content, went largely unaddressed. With the continuous development of the AI era, many studies worldwide have examined the reading comprehension ability of VQA models. In Vietnam, a developing country where resources are still limited, this task remains open. Therefore, we introduce the first large-scale Vietnamese dataset specializing in the ability to understand text appearing in images, which we call ViTextVQA (\textbf{Vi}etnamese \textbf{Text}-based \textbf{V}isual \textbf{Q}uestion \textbf{A}nswering dataset); it contains \textbf{over 16,000} images and \textbf{over 50,000} questions with answers. Through meticulous experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available at this \href{this https URL}{link} for research purposes.
https://arxiv.org/abs/2404.10652
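A concrete instance of the token-ordering issue is simply how OCR boxes are serialized before being fed to a model. The sketch below shows one common heuristic (group boxes into rows top-to-bottom, then sort left-to-right within each row); it is an illustration of the general idea, not the specific ordering studied in the paper.

```python
def reading_order(ocr_tokens, line_tol=0.5):
    """Serialize OCR tokens in natural reading order: group boxes into rows
    top-to-bottom, then sort left-to-right within each row.
    Each token is (text, x, y, h): top-left corner (x, y) and box height h."""
    if not ocr_tokens:
        return []
    tokens = sorted(ocr_tokens, key=lambda t: t[2])          # by vertical position
    rows, current = [], [tokens[0]]
    for tok in tokens[1:]:
        # a vertical jump larger than a fraction of the box height starts a new row
        if tok[2] - current[-1][2] > line_tol * current[-1][3]:
            rows.append(current)
            current = [tok]
        else:
            current.append(tok)
    rows.append(current)
    return [t[0] for row in rows for t in sorted(row, key=lambda t: t[1])]

ocr = [("world", 60, 10, 12), ("hello", 5, 12, 12), ("below", 5, 40, 12)]
print(reading_order(ocr))  # ['hello', 'world', 'below']
```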
Mixture of Expert Tuning (MoE-Tuning) has effectively enhanced the performance of general MLLMs with fewer parameters, yet its application in resource-limited medical settings has not been fully explored. To address this gap, we developed MoE-TinyMed, a model tailored for medical applications that significantly lowers parameter demands. In evaluations on the VQA-RAD, SLAKE, and Path-VQA datasets, MoE-TinyMed outperformed LLaVA-Med in all Med-VQA closed settings with just 3.6B parameters. Additionally, a streamlined version with 2B parameters surpassed LLaVA-Med's performance in PathVQA, showcasing its effectiveness in resource-limited healthcare settings.
https://arxiv.org/abs/2404.10237
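For readers unfamiliar with MoE-Tuning, the core mechanism is a sparse mixture-of-experts feed-forward layer: a router scores experts per token and only the top-k gate weights contribute to the output. The sketch below is a generic, minimal PyTorch illustration with made-up sizes, not the MoE-TinyMed implementation; for clarity it computes all experts densely, whereas real MoE layers dispatch tokens sparsely.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal mixture-of-experts FFN: a router picks the top-k experts
    per token and mixes their outputs by the gate weights."""

    def __init__(self, dim=64, n_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                                   # x: (B, T, dim)
        gates = torch.softmax(self.router(x), dim=-1)       # (B, T, n_experts)
        topv, topi = gates.topk(self.k, dim=-1)             # top-k experts per token
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, T, E, dim)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topi[..., slot]                           # (B, T) chosen expert ids
            w = topv[..., slot].unsqueeze(-1)               # (B, T, 1) gate weights
            chosen = torch.gather(
                expert_out, -2, idx[..., None, None].expand(-1, -1, 1, x.shape[-1])
            ).squeeze(-2)                                   # (B, T, dim)
            out = out + w * chosen
        return out

moe = TinyMoE()
print(moe(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 64])
```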
We analyze knowledge-based visual question answering (KB-VQA), in which, given a question, models need to ground it in the visual modality and retrieve the relevant knowledge from a given large knowledge base (KB) in order to answer. Our analysis is twofold: one part is based on designing neural architectures and training them from scratch, and the other on large pre-trained language models (LLMs). Our research questions are: 1) Can we effectively augment models with explicit supervised retrieval of the relevant KB information to solve the KB-VQA problem? 2) How do task-specific and LLM-based models perform in the integration of visual and external knowledge, and in multi-hop reasoning over both sources of information? 3) Is the implicit knowledge of LLMs sufficient for KB-VQA, and to what extent can it replace the explicit KB? Our results demonstrate the positive impact of empowering task-specific and LLM models with supervised external and visual knowledge retrieval models. Our findings show that though LLMs are stronger in 1-hop reasoning, they suffer in 2-hop reasoning compared with our fine-tuned NN model, even when the relevant information from both modalities is available to the model. Moreover, we observed that LLM models outperform the NN model on KB-related questions, which confirms the effectiveness of implicit knowledge in LLMs; however, it does not alleviate the need for an external KB.
https://arxiv.org/abs/2404.10226
The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model, or study only unimodal models. However, the most powerful models (e.g. GPT-4) are typically only available as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic, black-box setting. We propose using the principle of \textit{neighborhood consistency} to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that, given only a visual question and model response, the consistency of the model's responses over the neighborhood of a visual question will indicate reliability. Since it is impossible to directly sample neighbors in feature space in a black-box setting, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model.
https://arxiv.org/abs/2404.10193
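A simplified version of the neighborhood-consistency test can be written in a few lines. In the sketch below, neighbors are approximated as question rephrasings produced by a proxy model; `blackbox_vqa` and `proxy_rephrase` are hypothetical callables, and the agreement threshold is an arbitrary assumption (the paper samples neighborhoods via a proxy model in feature space, which this simplification glosses over).

```python
def neighborhood_consistency(question, image, blackbox_vqa, proxy_rephrase,
                             n_neighbors=5, threshold=0.6):
    """Abstain when a black-box VQA model answers inconsistently over
    neighbors of the question. blackbox_vqa(question, image) -> str and
    proxy_rephrase(question, n) -> list[str] are hypothetical stand-ins
    for the target model and the small proxy model."""
    base = blackbox_vqa(question, image)
    neighbors = proxy_rephrase(question, n_neighbors)
    answers = [blackbox_vqa(q, image) for q in neighbors]
    agreement = sum(a == base for a in answers) / max(len(answers), 1)
    # below the threshold, return None to signal abstention
    return (base, agreement) if agreement >= threshold else (None, agreement)
```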
Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of tasks including visual question answering, recognising objects, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images that aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset that consists of 3.9M question-answer pairs for training and evaluating VLMs. HOI-QA includes questions relating to locating hands, objects, and critically their interactions (e.g. referring to the object being manipulated by the hand). We train the first VLM for HOI-Ref on this dataset and call it VLM4HOI. Our results demonstrate that VLMs trained for referral on third person images fail to recognise and refer hands and objects in egocentric images. When fine-tuned on our egocentric HOI-QA dataset, performance improves by 27.9% for referring hands and objects, and by 26.7% for referring interactions.
https://arxiv.org/abs/2404.09933
This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer for efficient and general-purpose visual and language understanding. Specifically, we harness well-established word embeddings to bridge the two modality embedding spaces. The visual and text representations are simultaneously assigned to a set of word embeddings within pretrained LLMs by formulating the assignment procedure as an optimal transport problem. We predict the assignment for one modality from the representation of the other modality's data, enforcing consistent assignments for paired multimodal data. This allows vision and language representations to contain the same information, grounding the frozen LLMs' word embedding space in visual data. Moreover, a robust semantic taxonomy of LLMs can be preserved with visual data, since LLMs interpret and reason about linguistic information from correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold the semantic taxonomy of LLMs, making visual semantic arithmetic possible.
https://arxiv.org/abs/2404.09632
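The optimal transport formulation can be illustrated with a plain Sinkhorn solver that soft-assigns a batch of modality features to a vocabulary of word embeddings; the consistency idea is then to encourage paired vision and text features to yield similar assignment plans. This is a generic entropic-OT sketch with assumed cosine costs and uniform marginals, not VLAP's exact training objective.

```python
import torch
import torch.nn.functional as F

def sinkhorn_assign(feats, word_embs, n_iters=50, eps=0.1):
    """Soft-assign modality features to a vocabulary of word embeddings via
    entropic optimal transport (Sinkhorn scaling). Returns a transport plan
    P of shape (n_feats, n_words); row i says how much feature i 'belongs'
    to each word embedding."""
    feats = F.normalize(feats, dim=-1)
    word_embs = F.normalize(word_embs, dim=-1)
    cost = 1.0 - feats @ word_embs.T                 # cosine cost in [0, 2]
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    a = torch.full((feats.shape[0],), 1.0 / feats.shape[0])    # uniform marginals
    b = torch.full((word_embs.shape[0],), 1.0 / word_embs.shape[0])
    u = torch.ones_like(a)
    for _ in range(n_iters):                         # alternating scaling updates
        u = a / (K @ (b / (K.T @ u)))
    v = b / (K.T @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)

# paired vision/text features should yield similar plans over the same
# frozen word embeddings; that consistency is the supervisory signal
P = sinkhorn_assign(torch.randn(16, 64), torch.randn(1000, 64))
print(P.shape, float(P.sum()))                       # torch.Size([16, 1000]) ~1.0
```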
Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and language content. Over the past few years, numerous neural architectures have been suggested for the VQA problem. However, achieving success in zero-shot VQA remains a challenge due to its requirement for advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline. Specifically, we explore the efficacy of utilizing image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across various question types in terms of structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at \url{this https URL}.
https://arxiv.org/abs/2404.08589
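The three-step method (keywords, keyword-guided caption, caption-in-prompt) maps directly onto a small pipeline. In the sketch below, the stopword list and keyword extraction are naive placeholders, and `caption_model` / `llm` are hypothetical callables standing in for a keyword-steerable captioner and an LLM.

```python
STOPWORDS = {"what", "is", "the", "a", "an", "of", "in", "on", "are",
             "does", "do", "this", "that"}

def question_keywords(question):
    """Naive keyword extraction: drop stopwords and punctuation."""
    words = [w.strip("?.,!").lower() for w in question.split()]
    return [w for w in words if w and w not in STOPWORDS]

def question_driven_vqa(image, question, caption_model, llm):
    """caption_model(image, keywords) -> str and llm(prompt) -> str are
    hypothetical stand-ins for a captioner steered by keywords and an LLM."""
    keywords = question_keywords(question)
    caption = caption_model(image, keywords)   # caption focused on question content
    prompt = (f"Caption: {caption}\n"
              f"Question: {question}\n"
              f"Answer the question using only the caption.")
    return llm(prompt)

# toy usage with stub callables
answer = question_driven_vqa(
    image=None,
    question="What color is the bus?",
    caption_model=lambda img, kws: "a red bus parked near a station",
    llm=lambda p: "red",
)
print(answer)  # red
```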
Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes lead to the generation of hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on the Cap3D method, which renders 3D objects into 2D views for captioning using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where views with high alignment closely represent the object's characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across the Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model.
https://arxiv.org/abs/2404.07984
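At inference time, the ranking step reduces to scoring each rendered view and keeping the best few for the captioner. The sketch below assumes a hypothetical `alignment_score(view) -> float` callable (in DiffuRank this score comes from a pre-trained text-to-3D model) and an arbitrary top-k.

```python
def diffurank_select_views(views, alignment_score, top_k=6):
    """Keep the rendered views that best represent the underlying 3D object.
    alignment_score(view) -> float is a hypothetical callable; in DiffuRank
    the score is derived from a pre-trained text-to-3D diffusion model."""
    ranked = sorted(views, key=alignment_score, reverse=True)
    return ranked[:top_k]   # top-ranked views go to the captioning model

# toy usage: views carry a precomputed score in this stand-in example
views = [{"id": i, "score": s} for i, s in enumerate([0.2, 0.9, 0.5, 0.7])]
best = diffurank_select_views(views, alignment_score=lambda v: v["score"], top_k=2)
print([v["id"] for v in best])  # [1, 3]
```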
This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data (textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition. Unlike many existing MLLM benchmarks, DesignQA contains document-grounded visual questions where the input image and input document come from different sources. The benchmark features automatic evaluation metrics and is divided into three segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on tasks that engineers perform when designing according to requirements. We evaluate state-of-the-art models like GPT4 and LLaVA against the benchmark, and our study uncovers existing gaps in MLLMs' abilities to interpret complex engineering documentation. Key findings suggest that while MLLMs demonstrate potential in navigating technical documents, substantial limitations exist, particularly in accurately extracting and applying detailed requirements to engineering designs. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available at: this https URL.
https://arxiv.org/abs/2404.07917
Unsupervised anomaly detection enables the identification of potentially pathological areas by juxtaposing original images with their pseudo-healthy reconstructions generated by models trained exclusively on normal images. However, the clinical interpretation of the resulting anomaly maps presents a challenge due to a lack of detailed, understandable explanations. Recent advancements in language models have shown the capability of mimicking human-like understanding and providing detailed descriptions. This raises an interesting question: \textit{How can language models be employed to make the anomaly maps more explainable?} To the best of our knowledge, we are the first to leverage a language model for unsupervised anomaly detection, for which we construct a dataset with different questions and answers. Additionally, we present a novel multi-image visual question answering framework tailored for anomaly detection, incorporating diverse feature fusion strategies to enhance visual knowledge extraction. Our experiments reveal that the framework, augmented by our new Knowledge Q-Former module, adeptly answers questions on the anomaly detection dataset. In addition, integrating anomaly maps as inputs distinctly aids in improving the detection of unseen pathologies.
https://arxiv.org/abs/2404.07622