In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding, and reasoning. However, their extensive model sizes and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, the research state of efficient structures and strategies, and their applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions. Please refer to our GitHub repository for more details: this https URL.
https://arxiv.org/abs/2405.10739
In recent years, people have increasingly turned to AI for help with questions on a wide range of topics, including software and programming. In this work, we focus on questions that require understanding an accompanying image in addition to the question text itself. We introduce the StackOverflowVQA dataset, which includes questions from StackOverflow that have one or more accompanying images. This is the first VQA dataset that focuses on software-related questions and contains multiple human-generated full-sentence answers. Additionally, we provide a baseline for answering the questions with respect to the images in the introduced dataset using the GIT model. All versions of the dataset are available at this https URL.
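As a concrete illustration of such a baseline, a minimal sketch of generative VQA with a GIT checkpoint from Hugging Face follows; the checkpoint name, image path, and question are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: asking a GIT model a question about a code screenshot.
# Checkpoint, image file, and question below are placeholder assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-vqav2")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vqav2")

image = Image.open("stackoverflow_screenshot.png").convert("RGB")
question = "Which line of the stack trace points to the failing call?"

pixel_values = processor(images=image, return_tensors="pt").pixel_values
input_ids = processor(text=question, add_special_tokens=False).input_ids
input_ids = torch.tensor([[processor.tokenizer.cls_token_id] + input_ids])

generated = model.generate(pixel_values=pixel_values,
                           input_ids=input_ids, max_length=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```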
https://arxiv.org/abs/2405.10736
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in the unified modeling of full multimodal documents.
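To make the early-fusion, token-based setting concrete, here is a minimal sketch of quantizing an image into discrete tokens from a shared vocabulary and interleaving them with text tokens into one flat sequence; the vocabulary sizes, stand-in codebook, and special markers are hypothetical, not Chameleon's actual tokenizer.

```python
# Sketch of early-fusion mixed-modal tokenization under assumed vocab sizes.
import torch

TEXT_VOCAB = 32000          # hypothetical text vocabulary size
IMG_VOCAB = 8192            # hypothetical image codebook size
BOI = TEXT_VOCAB + IMG_VOCAB        # begin-of-image marker (assumed)
EOI = TEXT_VOCAB + IMG_VOCAB + 1    # end-of-image marker (assumed)

def image_to_tokens(image_latents: torch.Tensor) -> torch.Tensor:
    """Quantize image latents to discrete codebook indices, offset into the shared vocab."""
    codebook = torch.randn(IMG_VOCAB, image_latents.shape[-1])  # stand-in VQ codebook
    idx = torch.cdist(image_latents, codebook).argmin(-1)       # nearest-code lookup
    return idx + TEXT_VOCAB                                     # shift past text IDs

def build_mixed_sequence(text_ids: torch.Tensor, image_latents: torch.Tensor) -> torch.Tensor:
    """Interleave text and image tokens into one sequence for a single decoder."""
    img_ids = image_to_tokens(image_latents)
    return torch.cat([text_ids, torch.tensor([BOI]), img_ids, torch.tensor([EOI])])

seq = build_mixed_sequence(torch.randint(0, TEXT_VOCAB, (12,)),
                           torch.randn(64, 16))  # 64 patches, 16-dim latents
print(seq.shape)  # one flat token sequence consumed by a standard transformer
```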
https://arxiv.org/abs/2405.09818
In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous research that leverages pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model handle the complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. Then, we extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware features combined with scene-specific features, and spatiotemporal quality-aware features, respectively. After concatenating these features, we employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at this https URL.
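The fusion head described above reduces to a concatenate-and-regress pattern; a minimal sketch follows, with feature dimensions as hypothetical placeholders rather than the paper's exact sizes.

```python
# Sketch of the fusion head: quality-aware features from several frozen
# models are concatenated and regressed to a score by an MLP.
import torch
import torch.nn as nn

class QualityRegressor(nn.Module):
    def __init__(self, dims=(768, 512, 768, 256)):  # Swin, SlowFast, Q-Align/LIQE, FAST-VQA (assumed dims)
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims), 512), nn.ReLU(),
            nn.Linear(512, 1),                      # scalar quality score
        )

    def forward(self, *feats):
        return self.mlp(torch.cat(feats, dim=-1))

model = QualityRegressor()
score = model(torch.randn(4, 768), torch.randn(4, 512),
              torch.randn(4, 768), torch.randn(4, 256))
print(score.shape)  # (4, 1)
```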
https://arxiv.org/abs/2405.08745
With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artefacts and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimised through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through five-fold cross-validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.
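For intuition, here is a generic InfoNCE-style contrastive objective over quality representations; the actual content-quality-aware pairing rule in RMT-BVQA is more involved, so treat the positive-pair construction here as an assumption.

```python
# Sketch of an InfoNCE contrastive loss: row i of z1 and z2 are two views
# of the same patch (positives); all other rows act as negatives.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau            # pairwise cosine similarities
    labels = torch.arange(z1.size(0))     # the matching index is the positive
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```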
https://arxiv.org/abs/2405.08621
While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs due to the requirement of visual understanding at the region level and seamless integration with the audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder while underutilizing its knowledge, and, like most AVQA methods, treated audio and video as separate entities in a dual-stream framework. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA that exploits the image-text matching knowledge of the pretrained model through the naturally occurring audio-visual correspondence. It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG). Specifically, we propose a TSG+ module to transfer the image-text matching knowledge from CLIP models to our region-text matching process without corresponding ground-truth labels. Moreover, unlike previous separate dual-stream networks that still required an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the pretrained image-text knowledge to audio-text matching by preserving their temporal correlation with our proposed cross-modal synchrony (CMS) loss. Extensive experiments conducted on the MUSIC-AVQA benchmark verify the effectiveness of our proposed method over existing state-of-the-art methods.
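A rough sketch in the spirit of a cross-modal synchrony objective follows: per-timestep audio and visual features of the same clip should match their own timestep more than shifted ones. This is an illustrative formulation, not the paper's exact CMS loss.

```python
# Sketch: symmetric InfoNCE over timesteps enforces audio-visual synchrony.
import torch
import torch.nn.functional as F

def synchrony_loss(audio: torch.Tensor, video: torch.Tensor, tau: float = 0.1):
    """audio, video: (T, D) segment features from the same clip."""
    a = F.normalize(audio, dim=-1)
    v = F.normalize(video, dim=-1)
    sim = a @ v.t() / tau                 # (T, T) audio-to-video similarity
    target = torch.arange(a.size(0))      # timestep t should match timestep t
    return 0.5 * (F.cross_entropy(sim, target) + F.cross_entropy(sim.t(), target))

loss = synchrony_loss(torch.randn(10, 256), torch.randn(10, 256))
print(loss.item())
```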
https://arxiv.org/abs/2405.07451
Educational scholars have analyzed various image data acquired from teaching and learning situations, such as photos that show classroom dynamics, students' drawings related to the learning content, textbook illustrations, etc. Unquestionably, most qualitative analysis of and explanation of image data has been conducted by human researchers, without machine-based automation. This was partially because most image-processing artificial intelligence models were not accessible to general educational scholars, or not explainable due to their complex deep neural network architectures. However, the recent development of Visual Question Answering (VQA) techniques has produced usable visual language models, which receive from the user a question about a given image and return an answer, both in natural language. In particular, GPT-4V, released by OpenAI, has opened up state-of-the-art visual language model services so that VQA can be used for a variety of purposes. However, VQA and GPT-4V have not yet seen much application in educational studies. In this position paper, we suggest that GPT-4V contributes to realizing VQA for education. By 'realizing' VQA, we denote two meanings: (1) GPT-4V realizes the utilization of VQA techniques by any educational scholar without technical/accessibility barriers, and (2) GPT-4V makes educational scholars realize the usefulness of VQA for educational research. Given these, this paper aims to introduce VQA for educational studies so that it provides a milestone for educational research methodology. Chapter II reviews the development of VQA techniques, culminating in the release of GPT-4V. Chapter III reviews the use of image analysis in educational studies. Chapter IV demonstrates how GPT-4V can be used for each research usage reviewed in Chapter III, with operating prompts provided. Finally, Chapter V discusses the future implications.
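For readers unfamiliar with how such prompting works in practice, here is a minimal sketch of posing an image-grounded question to GPT-4V via the OpenAI Python client; the model name and message schema follow the public API at the time of writing, and the file name and prompt are illustrative, not the paper's operating prompts.

```python
# Sketch: one image-grounded question to GPT-4V (requires OPENAI_API_KEY).
import base64
from openai import OpenAI

client = OpenAI()

with open("classroom_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe the classroom dynamics visible in this photo."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```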
https://arxiv.org/abs/2405.07163
An important handicap of document analysis research is that documents tend to be copyrighted or contain private information, which prohibits their open publication and the creation of centralised, large-scale document datasets. Instead, documents are scattered in private data silos, making extensive training over heterogeneous data a tedious task. In this work, we explore the use of a federated learning (FL) scheme as a way to train a shared model on decentralised private document data. We focus on the problem of Document VQA, a task particularly suited to this approach, as the type of reasoning capabilities required from the model can be quite different in diverse domains. Enabling training over heterogeneous document datasets can thus substantially enrich DocVQA models. We assemble existing DocVQA datasets from diverse domains to reflect the data heterogeneity in real-world applications. We explore the self-pretraining technique in this multi-modal setting, where the same data is used for both pretraining and finetuning, making it relevant for privacy preservation. We further propose combining self-pretraining with a Federated DocVQA training method using centralized adaptive optimization that outperforms the FedAvg baseline. With extensive experiments, we also present a multi-faceted analysis on training DocVQA models with FL, which provides insights for future research on this task. We show that our pretraining strategies can effectively learn and scale up under federated training with diverse DocVQA datasets and tuning hyperparameters is essential for practical document tasks under federation.
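The contrast between FedAvg and centralized adaptive optimization can be summarized in a few lines: instead of averaging client weights directly, the server treats the mean client update as a pseudo-gradient and steps with Adam (FedOpt-style). The sketch below simplifies client sampling and local training relative to the paper's setup.

```python
# Sketch: FedAvg vs. server-side adaptive aggregation over client updates.
import torch

def fedavg(global_w, client_ws):
    """Plain FedAvg: elementwise average of client weights."""
    return {k: torch.stack([w[k] for w in client_ws]).mean(0) for k in global_w}

def fed_adaptive(global_model, client_ws, server_opt):
    """Treat the mean client delta as a pseudo-gradient; step with Adam."""
    with torch.no_grad():
        for name, p in global_model.named_parameters():
            delta = torch.stack([w[name] - p for w in client_ws]).mean(0)
            p.grad = -delta            # pseudo-gradient for the server optimizer
    server_opt.step()
    server_opt.zero_grad()

model = torch.nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)  # centralized adaptive optimizer
clients = [{k: v + 0.01 * torch.randn_like(v) for k, v in model.state_dict().items()}
           for _ in range(3)]                         # stand-in client updates
avg = fedavg(model.state_dict(), clients)             # the FedAvg baseline
fed_adaptive(model, clients, opt)                     # the adaptive alternative
```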
https://arxiv.org/abs/2405.06636
Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo. CuMo incorporates Co-upcycled Top-K sparsely-gated Mixture-of-experts blocks into both the vision encoder and the MLP connector, thereby enhancing the multimodal LLMs with minimal additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then initializes each expert in the MoE block from the pre-trained MLP block during the visual instruction tuning stage. Auxiliary losses are used to ensure a balanced loading of experts. CuMo outperforms state-of-the-art multimodal LLMs across various VQA and visual-instruction-following benchmarks using models within each model size group, all while training exclusively on open-sourced datasets. The code and model weights for CuMo are open-sourced at this https URL.
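The "co-upcycling" idea is easy to see in code: each expert of a Top-K sparsely-gated MoE block starts as a copy of the pre-trained dense MLP, so only the router is new. Dimensions and the routing loop below are simplified stand-ins for CuMo's actual blocks.

```python
# Sketch: upcycling a dense MLP into a Top-K sparsely-gated MoE block.
import copy
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, pretrained_mlp: nn.Module, num_experts=4, k=2, dim=512):
        super().__init__()
        # upcycle: every expert is initialized from the pre-trained MLP
        self.experts = nn.ModuleList(copy.deepcopy(pretrained_mlp)
                                     for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        gates = self.router(x).softmax(-1)       # (tokens, num_experts)
        topv, topi = gates.topk(self.k, dim=-1)  # route each token to k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot, None] * expert(x[mask])
        return out

mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
moe = TopKMoE(mlp, num_experts=4, k=2)
print(moe(torch.randn(16, 512)).shape)  # only k of 4 experts fire per token
```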
https://arxiv.org/abs/2405.05949
While large multi-modal models (LMMs) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information, assisting in making better decisions. To further explore the capabilities of LMMs in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMMs: prompt engineering and downstream fine-tuning. We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved. We hope that this research will promote the study of LMMs in dense text tasks. Code will be released at this https URL.
https://arxiv.org/abs/2405.06706
We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and quantifiable properties pertaining to them, EQA with situational queries (such as "Is the bathroom clean and dry?") is more challenging, as the agent needs to figure out not just what the target objects pertaining to the query are, but also requires a consensus on their states to be answerable. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries, corresponding consensus object information, and predicted answers. PGE maintains uniqueness among the generated queries using multiple forms of semantic similarity. We validate the generated dataset via a large-scale user study conducted on M-Turk, and introduce it as S-EQA, the first dataset tackling EQA with situational queries. Our user study establishes the authenticity of S-EQA, with a high 97.26% of the generated queries being deemed answerable given the consensus object data. Conversely, we observe a low correlation of 46.2% between the LLM-predicted answers and human-evaluated ones, indicating the LLM's poor capability in directly answering situational queries, while establishing S-EQA's usability in providing a human-validated consensus for an indirect solution. We evaluate S-EQA via Visual Question Answering (VQA) on VirtualHome, which, unlike other simulators, contains several objects with modifiable states that also visually appear different upon modification -- enabling us to set a quantitative benchmark for S-EQA. To the best of our knowledge, this is the first work to introduce EQA with situational queries, and also the first to use a generative approach for query creation.
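One of the "multiple forms of semantic similarity" used for uniqueness can be illustrated with embedding-based deduplication; the model name and threshold below are illustrative assumptions, not PGE's exact configuration.

```python
# Sketch: accept a generated query only if it is unlike every kept one.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def add_if_unique(query: str, kept: list[str], threshold: float = 0.85) -> bool:
    if kept:
        sims = util.cos_sim(model.encode(query), model.encode(kept))
        if sims.max().item() >= threshold:       # too close to an existing query
            return False
    kept.append(query)
    return True

kept: list[str] = []
for q in ["Is the bathroom clean and dry?",
          "Is the washroom dry and tidy?",      # near-duplicate, rejected
          "Is the stove turned off?"]:
    print(q, "->", add_if_unique(q, kept))
```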
https://arxiv.org/abs/2405.04732
While Vector Symbolic Architectures (VSAs) are promising for modelling spatial cognition, their application is currently limited to artificially generated images and simple spatial queries. We propose VSA4VQA - a novel 4D implementation of VSAs that implements a mental representation of natural images for the challenging task of Visual Question Answering (VQA). VSA4VQA is the first model to scale a VSA to complex spatial queries. Our method is based on the Semantic Pointer Architecture (SPA) to encode objects in a hyperdimensional vector space. To encode natural images, we extend the SPA to include dimensions for an object's width and height in addition to its spatial location. To perform spatial queries we further introduce learned spatial query masks and integrate a pre-trained vision-language model for answering attribute-related questions. We evaluate our method on the GQA benchmark dataset and show that it can effectively encode natural images, achieving performance competitive with state-of-the-art deep learning methods for zero-shot VQA.
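A minimal sketch of SPA-style binding with circular convolution, extended to four geometric dimensions (x, y, w, h) as described above; the vector dimensionality and encoding scheme are simplified relative to the paper.

```python
# Sketch: hyperdimensional encoding of an object plus its 4D geometry.
import numpy as np

D = 1024
rng = np.random.default_rng(0)

def unitary_vec():
    """Random unitary vector: unit-magnitude Fourier coefficients, so that
    fractional binding powers stay well-behaved."""
    phases = rng.uniform(-np.pi, np.pi, D // 2 + 1)
    phases[0] = phases[-1] = 0.0          # keep DC/Nyquist terms real
    return np.fft.irfft(np.exp(1j * phases), n=D)

def bind(a, b):
    """Circular convolution: the VSA/SPA binding operator."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=D)

def encode_value(axis, value):
    """Fractional binding: encode a continuous value along one axis vector."""
    return np.fft.irfft(np.fft.rfft(axis) ** value, n=D)

X, Y, W, H = (unitary_vec() for _ in range(4))
cup = unitary_vec()                        # hyperdimensional symbol for "cup"

# "cup at (3.0, 5.0), width 1.5, height 2.0" as one fixed-width vector
geometry = bind(bind(encode_value(X, 3.0), encode_value(Y, 5.0)),
                bind(encode_value(W, 1.5), encode_value(H, 2.0)))
scene = bind(cup, geometry)
print(scene.shape)                         # (1024,)
```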
https://arxiv.org/abs/2405.03852
Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs' perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: multi-turn question answering. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling yields a strong 3D perception capability without 3D-specific architectural design or training objectives. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted, e.g., with a 2D box or a set of candidate 3D boxes from specialist models. Our experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines, by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and by 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios. Cube-LLM also shows competitive results on general MLLM benchmarks such as refCOCO for 2D grounding with an average score of 87.0, as well as on visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for complex reasoning. Our project is available at this https URL.
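The "recognition data as multi-turn QA" formulation can be sketched by serializing a 3D box label into question-answer turns; the textual template below is an illustrative assumption, not LV3D's exact format.

```python
# Sketch: converting one 3D detection label into multi-turn QA text.
def box3d_to_qa(category: str, box: dict) -> list[dict]:
    q1 = f"What object is at image location ({box['u']}, {box['v']})?"
    a1 = f"A {category}."
    q2 = "Provide its 3D box (x, y, z, w, l, h, yaw)."
    a2 = ("({x:.1f}, {y:.1f}, {z:.1f}, {w:.1f}, {l:.1f}, {h:.1f}, {yaw:.2f})"
          .format(**box))
    return [{"question": q1, "answer": a1}, {"question": q2, "answer": a2}]

turns = box3d_to_qa("car", dict(u=412, v=230, x=12.3, y=-1.0, z=0.8,
                                w=1.8, l=4.5, h=1.5, yaw=0.12))
for t in turns:
    print(t["question"], "->", t["answer"])
```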
https://arxiv.org/abs/2405.03685
Recently, User-Generated Content (UGC) videos have gained popularity in our daily lives. However, UGC videos often suffer from poor exposure due to the limitations of photographic equipment and techniques. Therefore, Video Exposure Correction (VEC) algorithms have been proposed, including Low-Light Video Enhancement (LLVE) and Over-Exposed Video Recovery (OEVR). Equally important to VEC is Video Quality Assessment (VQA). Unfortunately, almost all existing VQA models are built for general use, measuring the quality of a video from a comprehensive perspective. In response, Light-VQA, trained on LLVE-QA, was proposed for assessing LLVE. We extend the work of Light-VQA by expanding the LLVE-QA dataset into the Video Exposure Correction Quality Assessment (VEC-QA) dataset with over-exposed videos and their corresponding corrected versions. In addition, we propose Light-VQA+, a VQA model specialized in assessing VEC. Light-VQA+ differs from Light-VQA mainly in its use of the CLIP model and vision-language guidance during feature extraction, followed by a new module inspired by the Human Visual System (HVS) for more accurate assessment. Extensive experimental results show that our model achieves the best performance against the current state-of-the-art (SOTA) VQA models on the VEC-QA dataset and other public datasets.
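A minimal sketch of vision-language-guided frame features in this spirit: each frame is scored against exposure-related text prompts with CLIP. The prompt set is an illustrative assumption, not Light-VQA+'s actual guidance.

```python
# Sketch: CLIP-based exposure-condition scores for a single video frame.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a well-exposed photo", "an underexposed dark photo",
           "an overexposed washed-out photo"]
frame = Image.new("RGB", (224, 224))  # stand-in for a decoded video frame

inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(-1)  # per-frame exposure-condition scores
print(dict(zip(prompts, probs[0].tolist())))
```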
https://arxiv.org/abs/2405.03333
Many clinical tasks require an understanding of specialized data, such as medical images and genomics, which is not typically found in general-purpose large multimodal models. Building upon Gemini's multimodal models, we develop several models within the new Med-Gemini family that inherit core capabilities of Gemini and are optimized for medical use via fine-tuning with 2D and 3D radiology, histopathology, ophthalmology, dermatology and genomic data. Med-Gemini-2D sets a new standard for AI-based chest X-ray (CXR) report generation based on expert evaluation, exceeding previous best results across two separate datasets by an absolute margin of 1% and 12%, where 57% and 96% of AI reports on normal cases, and 43% and 65% on abnormal cases, are evaluated as "equivalent or better" than the original radiologists' reports. We demonstrate the first ever large multimodal model-based report generation for 3D computed tomography (CT) volumes using Med-Gemini-3D, with 53% of AI reports considered clinically acceptable, although additional research is needed to meet expert radiologist reporting quality. Beyond report generation, Med-Gemini-2D surpasses the previous best performance in CXR visual question answering (VQA) and performs well in CXR classification and radiology VQA, exceeding SoTA or baselines on 17 of 20 tasks. In histopathology, ophthalmology, and dermatology image classification, Med-Gemini-2D surpasses baselines across 18 out of 20 tasks and approaches task-specific model performance. Beyond imaging, Med-Gemini-Polygenic outperforms the standard linear polygenic risk score-based approach for disease risk prediction and generalizes to genetically correlated diseases for which it has never been trained. Although further development and evaluation are necessary in the safety-critical medical domain, our results highlight the potential of Med-Gemini across a wide range of medical tasks.
https://arxiv.org/abs/2405.03162
Quantum computing has shown promise in solving complex problems by leveraging the principles of superposition and entanglement. Variational quantum algorithms (VQA) are a class of algorithms suited for near-term quantum computers due to their modest requirements on qubit counts and circuit depth. This paper introduces Tetris, a compilation framework for VQA applications on near-term quantum devices. Tetris focuses on reducing two-qubit gates in the compilation process, since a two-qubit gate has an order-of-magnitude higher error rate and execution time than a single-qubit gate. Tetris exploits unique opportunities in the circuit synthesis stage, often overlooked by state-of-the-art VQA compilers, for reducing the number of two-qubit gates. Tetris comes with a refined Pauli-string IR to express such two-qubit gate optimization opportunities. Moreover, Tetris is equipped with a fast bridging approach that mitigates the hardware mapping cost. Overall, Tetris demonstrates a reduction of up to 41.3 percent in CNOT gate counts, 37.9 percent in circuit depth, and 42.6 percent in circuit duration for various molecules of different sizes and structures compared with state-of-the-art approaches. Tetris is open-sourced at this link.
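To see why Pauli-string structure drives two-qubit gate counts, consider the textbook synthesis of a Pauli-string exponential: exp(-i t Z...Z) compiles to a CNOT ladder around one RZ rotation, so CNOT count scales with Pauli weight, and merging or reordering strings (as a Pauli-string IR enables) shortens these ladders. The sketch below is the standard construction, not Tetris's actual pass.

```python
# Sketch: circuit for exp(-i * theta/2 * Z...Z) via a CNOT ladder.
from qiskit import QuantumCircuit

def pauli_z_string_evolution(num_qubits: int, theta: float) -> QuantumCircuit:
    qc = QuantumCircuit(num_qubits)
    for q in range(num_qubits - 1):            # accumulate parity on last qubit
        qc.cx(q, q + 1)
    qc.rz(theta, num_qubits - 1)               # single rotation on the parity
    for q in reversed(range(num_qubits - 1)):  # uncompute the ladder
        qc.cx(q, q + 1)
    return qc

qc = pauli_z_string_evolution(4, 0.3)
print(qc.count_ops())  # {'cx': 6, 'rz': 1} -- 2*(weight-1) CNOTs per string
```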
https://arxiv.org/abs/2309.01905
The advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents that leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging, since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering dataset that challenges the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.
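A minimal sketch of query-based lifting: a small set of learnable queries cross-attends to dense 2D image features, producing a compact token set to feed an LLM. Dimensions are illustrative, not OmniDrive's actual configuration.

```python
# Sketch: sparse learnable queries compress dense image features for an LLM.
import torch
import torch.nn as nn

class SparseQueryLift(nn.Module):
    def __init__(self, num_queries=64, dim=256, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(dim, llm_dim)  # project into the LLM's space

    def forward(self, img_feats):              # img_feats: (B, N_patches, dim)
        q = self.queries.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        lifted, _ = self.attn(q, img_feats, img_feats)  # cross-attention lift
        return self.to_llm(lifted)              # (B, num_queries, llm_dim)

lift = SparseQueryLift()
tokens = lift(torch.randn(2, 1024, 256))
print(tokens.shape)  # 64 compact "world model" tokens per sample for the LLM
```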
https://arxiv.org/abs/2405.01533
Large Vision-Language models (VLMs) have demonstrated strong reasoning capabilities in tasks requiring a fine-grained understanding of literal images and text, such as visual question-answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative phenomena such as metaphors or humor, the meaning of which is often implicit. To close this gap, we propose a new task and a high-quality dataset: Visual Figurative Language Understanding with Textual Explanations (V-FLUTE). We frame the visual figurative language understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a claim (hypothesis) and justify the predicted label with a textual explanation. Using a human-AI collaboration framework, we build a high-quality dataset, V-FLUTE, that contains 6,027 <image, claim, label, explanation> instances spanning five diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. The figurative phenomena can be present either in the image, the caption, or both. We further conduct both automatic and human evaluations to assess current VLMs' capabilities in understanding figurative phenomena.
https://arxiv.org/abs/2405.01474
Vision language models (VLMs) have recently emerged and gained the spotlight for their ability to comprehend the dual modality of image and textual data. VLMs such as LLaVA, ChatGPT-4, and Gemini have recently shown impressive performance on tasks such as natural image captioning, visual question answering (VQA), and spatial reasoning. Additionally, a universal segmentation model by Meta AI, the Segment Anything Model (SAM), shows unprecedented performance at isolating objects from unforeseen images. Since medical experts, biologists, and materials scientists routinely examine microscopy or medical images in conjunction with textual information in the form of captions, literature, or reports, and draw conclusions of great importance and merit, it is undoubtedly essential to test the performance of VLMs and foundation models such as SAM on these images. In this study, we charge ChatGPT, LLaVA, Gemini, and SAM with classification, segmentation, counting, and VQA tasks on a variety of microscopy images. We observe that ChatGPT and Gemini are impressively able to comprehend the visual features in microscopy images, while SAM is quite capable at isolating artefacts in a general sense. However, the performance is not close to that of a domain expert - the models are readily encumbered by the impurities, defects, artefact overlaps, and diversity present in the images.
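For reference, a minimal sketch of prompting SAM on a micrograph with a point prompt, using Meta's segment-anything package; the checkpoint path, stand-in image, and prompt point are placeholders.

```python
# Sketch: point-prompted segmentation of a microscopy image with SAM.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # local weights
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in RGB micrograph
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),  # click on a particle of interest
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,
)
print(masks.shape, scores)                # candidate masks and confidences
```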
https://arxiv.org/abs/2405.00876
Visual Question Answering (VQA) has emerged as a highly engaging field in recent years, attracting increasing research efforts aiming to enhance VQA accuracy through the deployment of advanced models such as Transformers. Despite this growing interest, there has been limited exploration into the comparative analysis and impact of textual modalities within VQA, particularly in terms of model complexity and its effect on performance. In this work, we conduct a comprehensive comparison between complex textual models that leverage long dependency mechanisms and simpler models focusing on local textual features within a well-established VQA framework. Our findings reveal that employing complex textual encoders is not invariably the optimal approach for the VQA-v2 dataset. Motivated by this insight, we introduce an improved model, ConvGRU, which incorporates convolutional layers to enhance the representation of question text. Tested on the VQA-v2 dataset, ConvGRU achieves better performance without substantially increasing parameter complexity.
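The core idea of enriching a recurrent question encoder with convolutional layers can be sketched in a few lines: a 1D convolution captures local n-gram features of the question before a GRU summarizes the sequence. Layer sizes are illustrative, not the paper's exact configuration.

```python
# Sketch: Conv1d over word embeddings feeding a GRU question encoder.
import torch
import torch.nn as nn

class ConvGRUTextEncoder(nn.Module):
    def __init__(self, vocab=10000, emb=300, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, emb, kernel_size=3, padding=1)  # local n-grams
        self.gru = nn.GRU(emb, hid, batch_first=True)

    def forward(self, token_ids):                 # (B, T) question token IDs
        x = self.embed(token_ids)                 # (B, T, emb)
        x = self.conv(x.transpose(1, 2)).relu().transpose(1, 2)
        _, h = self.gru(x)                        # final hidden state
        return h.squeeze(0)                       # (B, hid) question vector

enc = ConvGRUTextEncoder()
print(enc(torch.randint(0, 10000, (4, 14))).shape)  # torch.Size([4, 512])
```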
https://arxiv.org/abs/2405.00479