Advances in Large Language Models (LLMs) have inspired a surge of research exploring their expansion into the visual domain. While recent models exhibit promise in generating abstract captions for images and conducting natural conversations, their performance on text-rich images leaves room for improvement. In this paper, we propose the Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details typically overlooked by existing methods. Cream integrates vision and auxiliary encoders, complemented by a contrastive feature alignment technique, resulting in a more effective understanding of textual information within document images. Our approach, thus, seeks to bridge the gap between vision and language understanding, paving the way for more sophisticated Document Intelligence Assistants. Rigorous evaluations across diverse tasks, such as visual question answering on document images, demonstrate the efficacy of Cream as a state-of-the-art model in the field of visual document understanding. We provide our codebase and newly-generated datasets at this https URL
https://arxiv.org/abs/2305.15080
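Cream's contrastive feature alignment is only named in the abstract above, not specified. As a rough illustration, the sketch below shows a generic symmetric InfoNCE-style loss aligning pooled vision-encoder features with auxiliary (e.g., OCR/text) encoder features; the tensor shapes, temperature, and function name are illustrative assumptions rather than the paper's actual objective.

```python
# Hedged sketch: a generic symmetric InfoNCE alignment loss, not Cream's exact objective.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(vision_feats: torch.Tensor,
                               aux_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Align paired vision and auxiliary (e.g., OCR text) features.

    vision_feats, aux_feats: (batch, dim) pooled features; row i of each
    tensor is assumed to come from the same document image.
    """
    v = F.normalize(vision_feats, dim=-1)
    a = F.normalize(aux_feats, dim=-1)
    logits = v @ a.t() / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: vision->aux and aux->vision directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(float(loss))
```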
Text to image generation methods (T2I) are widely popular in generating art and other creative artifacts. While visual hallucinations can be a positive factor in scenarios where creativity is appreciated, such artifacts are poorly suited for cases where the generated image needs to be grounded in complex natural language without explicit visual elements. In this paper, we propose to strengthen the consistency property of T2I methods in the presence of natural complex language, which often breaks the limits of T2I methods by including non-visual information, and textual elements that require knowledge for accurate generation. To address these phenomena, we propose a Natural Language to Verified Image generation approach (NL2VI) that converts a natural prompt into a visual prompt, which is more suitable for image generation. A T2I model then generates an image for the visual prompt, which is then verified with VQA algorithms. Experimentally, aligning natural prompts with image generation can improve the consistency of the generated images by up to 11% over the state of the art. Moreover, improvements can generalize to challenging domains like cooking and DIY tasks, where the correctness of the generated image is crucial to illustrate actions.
https://arxiv.org/abs/2305.15026
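The NL2VI pipeline above (natural prompt, then visual prompt, then T2I generation, then VQA verification) can be orchestrated as in the following sketch. The callables `rewrite_to_visual_prompt`, `generate_image`, and `answer_question` are hypothetical placeholders for the LLM rewriter, the T2I model, and the VQA verifier, and the retry/accept logic is an assumption rather than the paper's exact procedure.

```python
# Hedged sketch of the prompt-rewrite -> generate -> VQA-verify loop described above.
# The three callables are placeholders; the paper's actual prompts/models may differ.
from typing import Any, Callable, List, Tuple

def nl2vi(natural_prompt: str,
          rewrite_to_visual_prompt: Callable[[str], str],
          generate_image: Callable[[str], Any],
          answer_question: Callable[[Any, str], str],
          verification_qas: List[Tuple[str, str]],
          max_attempts: int = 3) -> Tuple[Any, float]:
    """Generate an image for the visual prompt and keep the candidate whose
    VQA answers best match the expected answers derived from the prompt."""
    visual_prompt = rewrite_to_visual_prompt(natural_prompt)
    best_image, best_score = None, -1.0
    for _ in range(max_attempts):
        image = generate_image(visual_prompt)
        correct = sum(answer_question(image, q).strip().lower() == a.strip().lower()
                      for q, a in verification_qas)
        score = correct / max(len(verification_qas), 1)
        if score > best_score:
            best_image, best_score = image, score
        if score == 1.0:              # every verification question passed
            break
    return best_image, best_score
```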
Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts" mode for effective embodied planning. (ii) We introduce an efficient training approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries to form a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly enhances the success rate of the embodied control task by extracting more effective features. It has achieved a remarkable 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset.
https://arxiv.org/abs/2305.15021
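Prefix tuning, which EmbodiedGPT uses to adapt a 7B LLM to EgoCOT, is named but not detailed above. Below is a minimal, generic prefix-tuning sketch in which only a small set of prefix embeddings is trained while the backbone stays frozen; the toy backbone and all sizes are illustrative (a real 7B decoder would typically receive per-layer key/value prefixes instead).

```python
# Hedged sketch of prefix tuning: only the prefix embeddings are trained,
# the backbone stays frozen. Sizes are illustrative, not EmbodiedGPT's.
import torch
import torch.nn as nn

class PrefixTunedLM(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, prefix_len: int = 16):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():      # freeze the pretrained model
            p.requires_grad = False
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq, hidden) embeddings of the planning prompt.
        prefix = self.prefix.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return self.backbone(torch.cat([prefix, token_embeds], dim=1))

if __name__ == "__main__":
    toy_backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
    model = PrefixTunedLM(toy_backbone, hidden_dim=64)
    out = model(torch.randn(2, 10, 64))
    print(out.shape)   # (2, 26, 64): 16 prefix positions + 10 tokens
```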
Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely. Although the inference capabilities of VQA models are often illustrated by a few qualitative examples, most systems are not quantitatively assessed for their VG properties. We believe an easily calculated criterion for meaningfully measuring a system's VG can help remedy this shortcoming, as well as add another valuable dimension to model evaluations and analysis. To this end, we propose a new VG metric that captures whether a model a) identifies question-relevant objects in the scene, and b) actually relies on the information contained in the relevant objects when producing its answer, i.e., whether its visual grounding is both "faithful" and "plausible". Our metric, called "Faithful and Plausible Visual Grounding" (FPVG), is straightforward to determine for most VQA model designs. We give a detailed description of FPVG and evaluate several reference systems spanning various VQA architectures. Code to support the metric calculations on the GQA data set is available on GitHub.
https://arxiv.org/abs/2305.15015
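The precise FPVG criteria are defined in the paper; the sketch below only captures the spirit of such a test under assumed inputs: a question counts as faithfully and plausibly grounded when the model answers correctly given only question-relevant objects and its answer changes when given only irrelevant ones. The `model` callable and sample format are hypothetical.

```python
# Hedged sketch of a faithfulness/plausibility check in the spirit of FPVG.
# `model` is any callable mapping (question, object_features) -> answer string;
# the exact criteria in the paper may differ from this simplification.
from typing import Callable, List, Sequence

def fpvg_fraction(model: Callable[[str, Sequence], str],
                  samples: List[dict]) -> float:
    """samples: dicts with 'question', 'answer', 'relevant_objs', 'irrelevant_objs'."""
    grounded = 0
    for s in samples:
        ans_rel = model(s["question"], s["relevant_objs"])
        ans_irr = model(s["question"], s["irrelevant_objs"])
        plausible = ans_rel == s["answer"]         # correct with relevant evidence
        faithful = ans_rel != ans_irr              # answer depends on that evidence
        grounded += int(plausible and faithful)
    return grounded / max(len(samples), 1)
```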
Chain-of-Thought prompting (CoT) enables large-scale language models to solve complex reasoning problems by decomposing the problem and tackling it step-by-step. However, Chain-of-Thought is a greedy thinking process that requires the language model to come up with a starting point and generate the next step solely based on previous steps. This thinking process is different from how humans approach a complex problem; e.g., we proactively raise sub-problems related to the original problem and recursively answer them. In this work, we propose Socratic Questioning, a divide-and-conquer algorithm that simulates the self-questioning and recursive thinking process. Socratic Questioning is driven by a Self-Questioning module that employs a large-scale language model to propose sub-problems related to the original problem as intermediate steps, and Socratic Questioning recursively backtracks and answers the sub-problems until it reaches the original problem. We apply our proposed algorithm to the visual question-answering task as a case study and, by evaluating it on three public benchmark datasets, observe a significant performance improvement over all baselines on (almost) all datasets. In addition, the qualitative analysis clearly demonstrates that the intermediate thinking steps elicited by Socratic Questioning are similar to a human's recursive thinking process when working through a complex reasoning problem.
https://arxiv.org/abs/2305.14999
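The recursive structure of Socratic Questioning can be sketched as below; `propose_subquestions` and `answer_directly` stand in for the paper's Self-Questioning module and base answering model, and the depth limit and stopping rule are illustrative assumptions.

```python
# Hedged sketch of the recursive self-questioning loop described above.
# The two callables are placeholders for the LLM-based modules in the paper.
from typing import Callable, List

def socratic_answer(question: str,
                    propose_subquestions: Callable[[str], List[str]],
                    answer_directly: Callable[[str, List[str]], str],
                    depth: int = 0,
                    max_depth: int = 2) -> str:
    """Recursively raise sub-questions, answer them, then answer the original
    question conditioned on the collected sub-answers."""
    if depth >= max_depth:
        return answer_directly(question, [])
    sub_qs = propose_subquestions(question)
    if not sub_qs:                                  # nothing left to decompose
        return answer_directly(question, [])
    context = []
    for sq in sub_qs:
        sa = socratic_answer(sq, propose_subquestions, answer_directly,
                             depth + 1, max_depth)
        context.append(f"Q: {sq} A: {sa}")
    return answer_directly(question, context)       # backtrack with gathered evidence
```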
Model interpretability has long been a hard problem for the AI community, especially in the multimodal setting, where vision and language need to be aligned and reasoned over at the same time. In this paper, we specifically focus on the problem of Visual Question Answering (VQA). While previous research has tried to probe into the network structures of black-box multimodal models, we propose to tackle the problem from a different angle -- to treat interpretability as an explicit additional goal. Given an image and question, we argue that an interpretable VQA model should be able to tell what conclusions it can draw from which part of the image, and show how each statement helps to arrive at an answer. We introduce InterVQA: Interpretable-by-design VQA, in which we design an explicit intermediate dynamic reasoning structure for VQA problems and enforce symbolic reasoning that uses only this structure for final answer prediction. InterVQA produces high-quality explicit intermediate reasoning steps while maintaining end-task performance similar to the state of the art (SOTA).
https://arxiv.org/abs/2305.14882
We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to the continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics prove that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Codes and dataset are available at this https URL.
https://arxiv.org/abs/2305.14836
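The programmatic question generation described above (scene-graph annotations plus manually designed templates) might look roughly like this toy counting-template generator; the template string and scene-object schema are invented for illustration and are not NuScenes-QA's actual templates.

```python
# Hedged sketch of programmatic question-answer generation from 3D detection
# annotations. Templates and the scene-object schema are illustrative only.
from collections import Counter
from typing import Dict, List, Tuple

def generate_count_qas(scene_objects: List[Dict]) -> List[Tuple[str, str]]:
    """scene_objects: e.g. [{'category': 'car', 'status': 'moving'}, ...]."""
    qas = []
    counts = Counter((o["category"], o["status"]) for o in scene_objects)
    for (category, status), n in counts.items():
        question = f"How many {status} {category}s are there in the scene?"
        qas.append((question, str(n)))
    return qas

if __name__ == "__main__":
    scene = [{"category": "car", "status": "moving"},
             {"category": "car", "status": "parked"},
             {"category": "pedestrian", "status": "moving"},
             {"category": "car", "status": "moving"}]
    for q, a in generate_count_qas(scene):
        print(q, "->", a)
```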
Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks. However, such generalization to vision-language tasks including grounding and generation tasks has been under-explored; existing few-shot VL models struggle to handle tasks that involve object grounding and multiple images such as visual commonsense reasoning or NLVR2. In this paper, we introduce GRILL, GRounded vIsion Language aLigning, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances. Specifically, GRILL learns object grounding and localization by exploiting object-text alignments, which enables it to transfer to grounding tasks in a zero-/few-shot fashion. We evaluate our model on various zero-/few-shot VL tasks and show that it consistently surpasses the state-of-the-art few-shot methods.
https://arxiv.org/abs/2305.14676
We are interested in image manipulation via natural language text -- a task that is useful for multiple AI applications but requires complex reasoning over multi-modal spaces. We extend the recently proposed Neuro Symbolic Concept Learning (NSCL), which has been quite effective for the task of Visual Question Answering (VQA), to the task of image manipulation. Our system, referred to as NeuroSIM, can perform complex multi-hop reasoning over multi-object scenes and only requires weak supervision in the form of annotated data for VQA. NeuroSIM parses an instruction into a symbolic program, based on a Domain Specific Language (DSL) comprising object attributes and manipulation operations, that guides its execution. We create a new dataset for the task, and extensive experiments demonstrate that NeuroSIM is highly competitive with or beats SOTA baselines that make use of supervised data for manipulation.
https://arxiv.org/abs/2305.14410
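A DSL of object attributes and manipulation operations of the kind NeuroSIM executes could look roughly like the toy interpreter below; the operation names and scene representation are illustrative assumptions, not the paper's actual DSL.

```python
# Hedged toy interpreter for a manipulation DSL of the kind described above.
# Operation names and the scene format are invented for illustration.
from typing import Dict, List

Scene = List[Dict]   # e.g. [{'id': 0, 'shape': 'cube', 'color': 'red'}]

def execute(program: List[Dict], scene: Scene) -> Scene:
    """program: list of ops such as
       {'op': 'filter', 'attr': 'color', 'value': 'red'} or
       {'op': 'change', 'attr': 'color', 'value': 'blue'}."""
    selected = list(scene)
    for step in program:
        if step["op"] == "filter":
            selected = [o for o in selected if o[step["attr"]] == step["value"]]
        elif step["op"] == "change":
            for o in selected:                      # mutate only the selected objects
                o[step["attr"]] = step["value"]
        elif step["op"] == "remove":
            ids = {o["id"] for o in selected}
            scene = [o for o in scene if o["id"] not in ids]
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return scene

if __name__ == "__main__":
    scene = [{"id": 0, "shape": "cube", "color": "red"},
             {"id": 1, "shape": "ball", "color": "green"}]
    prog = [{"op": "filter", "attr": "color", "value": "red"},
            {"op": "change", "attr": "color", "value": "blue"}]
    print(execute(prog, scene))
```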
Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their generalization ability across different document types and languages. In this paper, we propose DUBLIN, which is pretrained on webpages using three novel objectives that leverage the spatial and semantic information in document images: a Masked Document Content Generation Task, a Bounding Box Task, and a Rendered Question Answering Task. We evaluate our model on several benchmarks, such as Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering. We show that our model achieves competitive or better results than the state-of-the-art models on these tasks. In particular, we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms the current pixel-based SOTA models on the DocVQA and AI2D datasets by significant margins of 2% and 21%, respectively. Also, DUBLIN is the first pixel-based model to achieve performance comparable to text-based SOTA methods on the XFUND dataset for Semantic Entity Recognition, showcasing its multilingual capability. Moreover, we create new baselines for text-based datasets by rendering them as document images and applying our model.
https://arxiv.org/abs/2305.14218
Variational Quantum Algorithms (VQA) have emerged with a wide variety of applications. One question to ask is whether they can be efficiently implemented and executed on existing architectures. Current hardware suffers from uncontrolled noise that can alter the expected results of a calculation. The nature of this noise differs from one technology to another. In this work, we chose to investigate a technology that is intrinsically resilient to bit-flips: cat qubits. To this end, we implement two noise models. The first one is hardware-agnostic -- in the sense that it is used in the literature to cover different hardware types. The second one is specific to cat qubits. We perform simulations on two types of problems that can be formulated with VQAs (the Quantum Approximate Optimization Algorithm (QAOA) and the Variational Quantum Linear Solver (VQLS)), study the impact of noise on the evolution of the cost function, and extract noise-level thresholds from which a noise-resilient regime can be considered. By tackling compilation issues, we discuss the need for hardware-specific noise models, as hardware-agnostic ones can lead to misleading conclusions regarding the regime of noise that is acceptable for an algorithm to run.
https://arxiv.org/abs/2305.14143
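To make the contrast between a hardware-agnostic noise model and a cat-qubit-like one concrete, the sketch below applies single-qubit Pauli channels to a density matrix in plain NumPy: a generic channel with balanced X/Y/Z errors versus a channel with bit flips strongly suppressed and phase flips dominant. The error probabilities are arbitrary and this is not the paper's simulation stack.

```python
# Hedged sketch: generic balanced Pauli noise vs. a phase-flip-dominated
# channel (loosely mimicking bit-flip-suppressed cat qubits). Pure NumPy.
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def apply_pauli_channel(rho: np.ndarray, p_x: float, p_y: float, p_z: float) -> np.ndarray:
    """rho -> (1 - px - py - pz) rho + px X rho X + py Y rho Y + pz Z rho Z."""
    out = (1 - p_x - p_y - p_z) * rho
    out += p_x * X @ rho @ X
    out += p_y * Y @ rho @ Y
    out += p_z * Z @ rho @ Z
    return out

if __name__ == "__main__":
    plus = np.array([[0.5, 0.5], [0.5, 0.5]], dtype=complex)   # |+><+|
    generic = apply_pauli_channel(plus, p_x=0.01, p_y=0.01, p_z=0.01)
    cat_like = apply_pauli_channel(plus, p_x=1e-7, p_y=1e-7, p_z=0.03)  # bit flips suppressed
    # Fidelity of |+> with each noisy state: phase flips hurt |+> the most.
    print(np.real(np.trace(plus @ generic)), np.real(np.trace(plus @ cat_like)))
```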
Artificial General Intelligence (AGI) requires comprehensive understanding and generation capabilities for a variety of tasks spanning different modalities and functionalities. Integrative AI is one important direction to approach AGI, through combining multiple models to tackle complex multimodal tasks. However, there is a lack of a flexible and composable platform to facilitate efficient and effective model composition and coordination. In this paper, we propose the i-Code Studio, a configurable and composable framework for Integrative AI. The i-Code Studio orchestrates multiple pre-trained models in a finetuning-free fashion to conduct complex multimodal tasks. Instead of simple model composition, the i-Code Studio provides an integrative, flexible, and composable setting for developers to quickly and easily compose cutting-edge services and technologies tailored to their specific requirements. The i-Code Studio achieves impressive results on a variety of zero-shot multimodal tasks, such as video-to-text retrieval, speech-to-speech translation, and visual question answering. We also demonstrate how to quickly build a multimodal agent based on the i-Code Studio that can communicate and personalize for users.
https://arxiv.org/abs/2305.13738
Memes are a widely popular tool for web users to express their thoughts using visual metaphors. Understanding memes requires recognizing and interpreting visual metaphors with respect to the text inside or around the meme, often while employing background knowledge and reasoning abilities. We present the task of meme captioning and release a new dataset, MemeCap. Our dataset contains 6.3K memes along with the title of the post containing the meme, the meme captions, the literal image caption, and the visual metaphors. Despite the recent success of vision and language (VL) models on tasks such as image captioning and visual question answering, our extensive experiments using state-of-the-art VL models show that they still struggle with visual metaphors, and perform substantially worse than humans.
https://arxiv.org/abs/2305.13703
In this paper, we aim to expand the understanding of the relationship between the composition of the Hamiltonian in the Quantum Approximate Optimization Algorithm (QAOA) and the corresponding cost landscape characteristics. QAOA is a prominent example of a Variational Quantum Algorithm (VQA), which is most commonly used for combinatorial optimization. The success of QAOA heavily relies on parameter optimization, which is a great challenge, especially on scarce noisy quantum hardware. Thus understanding the cost function landscape can aid in designing better optimization heuristics and therefore potentially provide eventual value. We consider the case of 1-layer QAOA for Hamiltonians with up to 5-local terms and up to 20 qubits. In addition to visualizing the cost landscapes, we calculate their Fourier transform to study the relationship with the structure of the Hamiltonians from a complementary perspective. Furthermore, we introduce metrics to quantify the roughness of the landscape, which provide valuable insights into the nature of high-dimensional parametrized landscapes. While these techniques allow us to elucidate the role of Hamiltonian structure, order of the terms and their coefficients on the roughness of the optimization landscape, we also find that predicting the intricate landscapes of VQAs from first principles is very challenging and unlikely to be feasible in general.
https://arxiv.org/abs/2305.13594
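The landscape diagnostics mentioned above (Fourier spectrum, roughness metrics) can be computed as in the sketch below on a toy 2-D cost surface over the 1-layer QAOA angles; the surrogate cost function and the total-variation-style roughness measure are illustrative stand-ins for the paper's actual Hamiltonians and metrics.

```python
# Hedged sketch: sample a 2-D cost landscape over (gamma, beta), take its 2-D
# FFT, and compute a simple total-variation roughness score. The cost function
# here is a toy surrogate, not a real QAOA expectation value.
import numpy as np

def toy_cost(gamma: np.ndarray, beta: np.ndarray) -> np.ndarray:
    # A few interfering sinusoids standing in for <H> over 1-layer QAOA angles.
    return (np.cos(2 * gamma) * np.sin(2 * beta)
            + 0.5 * np.cos(4 * gamma + beta)
            + 0.2 * np.sin(6 * beta))

def roughness_tv(landscape: np.ndarray) -> float:
    """Mean absolute finite difference along both axes (total-variation style)."""
    dg = np.abs(np.diff(landscape, axis=0)).mean()
    db = np.abs(np.diff(landscape, axis=1)).mean()
    return float(dg + db)

if __name__ == "__main__":
    gammas = np.linspace(0, np.pi, 64)
    betas = np.linspace(0, np.pi, 64)
    G, B = np.meshgrid(gammas, betas, indexing="ij")
    C = toy_cost(G, B)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(C)))   # frequency content of the landscape
    print("roughness:", roughness_tv(C), "peak spectral magnitude:", spectrum.max())
```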
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC-2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, while also requiring less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path for more informed scaling.
https://arxiv.org/abs/2305.13035
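The scaling-law fitting that underlies compute-optimal sizing is of the general form sketched below: fit a power law to (compute, loss) points in log-log space. The synthetic data and single exponent are purely illustrative; the paper's shape-optimization procedure is more involved and treats several shape dimensions jointly.

```python
# Hedged sketch: fit a power law loss = a * compute**(-b) to synthetic points
# via a log-log linear fit. Illustrates the class of scaling-law fit only.
import numpy as np

rng = np.random.default_rng(0)
compute = np.logspace(18, 22, 12)                              # FLOPs (synthetic)
loss = 2.5e3 * compute ** (-0.15) * np.exp(rng.normal(0, 0.01, 12))

# log L = log a - b * log C  ->  ordinary least squares in log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
b, a = -slope, np.exp(intercept)
print(f"fitted exponent b ~ {b:.3f}, prefactor a ~ {a:.3g}")
```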
The proliferation of in-the-wild videos has greatly expanded the Video Quality Assessment (VQA) problem. Unlike early definitions that usually focus on limited distortion types, VQA on in-the-wild videos is especially challenging as it can be affected by complicated factors, including various distortions and diverse contents. Though subjective studies have collected overall quality scores for these videos, how the abstract quality scores relate to specific factors is still obscure, hindering VQA methods from giving more concrete quality evaluations (e.g. the sharpness of a video). To solve this problem, we collect over two million opinions on 4,543 in-the-wild videos across 13 dimensions of quality-related factors, including in-capture authentic distortions (e.g. motion blur, noise, flicker), errors introduced by compression and transmission, and higher-level experiences of semantic contents and aesthetic issues (e.g. composition, camera trajectory), to establish the multi-dimensional Maxwell database. Specifically, we ask the subjects to label a positive, a negative, or a neutral choice for each dimension. These explanation-level opinions allow us to measure the relationships between specific quality factors and abstract subjective quality ratings, and to benchmark different categories of VQA algorithms on each dimension, so as to more comprehensively analyze their strengths and weaknesses. Furthermore, we propose MaxVQA, a language-prompted VQA approach that modifies the vision-language foundation model CLIP to better capture the important quality issues observed in our analyses. MaxVQA can jointly evaluate various specific quality factors and final quality scores with state-of-the-art accuracy on all dimensions, and shows superb generalization ability on existing datasets. Code and data are available at \url{this https URL}.
https://arxiv.org/abs/2305.12726
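The language-prompted idea behind MaxVQA can be illustrated, in heavily simplified form, by scoring a frame against an antonym prompt pair with an off-the-shelf CLIP model, as below. This is not MaxVQA itself (which modifies CLIP and handles temporal inputs); the checkpoint name and prompt texts are assumptions for illustration.

```python
# Hedged sketch in the spirit of language-prompted quality assessment: score a
# frame by contrasting CLIP similarities to an antonym prompt pair.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A random frame stands in for a real video frame.
frame = Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
prompts = ["a sharp, high quality photo", "a blurry, low quality photo"]

inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image          # shape (1, 2)
sharpness_score = logits.softmax(dim=-1)[0, 0].item()  # probability mass on the positive prompt
print(f"sharpness score in [0, 1]: {sharpness_score:.3f}")
```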
Audio-visual question answering (AVQA) is a challenging task that requires multistep spatio-temporal reasoning over multimodal contexts. To achieve scene understanding ability similar to humans, the AVQA task presents specific challenges, including effectively fusing audio and visual information and capturing question-relevant audio-visual features while maintaining temporal synchronization. This paper proposes a Target-aware Joint Spatio-Temporal Grounding Network for AVQA to address these challenges. The proposed approach has two main components: a Target-aware Spatial Grounding module and a Joint Audio-Visual Temporal Grounding module trained with a Tri-modal consistency loss. The Target-aware module enables the model to focus on audio-visual cues relevant to the inquiry subject by exploiting the explicit semantics of the text modality. The Tri-modal consistency loss facilitates the interaction between audio and video during question-aware temporal grounding and incorporates fusion within a simpler single-stream architecture. Experimental results on the MUSIC-AVQA dataset demonstrate the effectiveness and superiority of the proposed method over existing state-of-the-art methods. Our code will be available soon.
https://arxiv.org/abs/2305.12397
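The abstract does not give the form of the Tri-modal consistency loss; one plausible variant, sketched below, encourages the question-conditioned temporal attention over audio and over video to agree. The formulation, temperature, and tensor shapes are assumptions and may differ from the paper's loss.

```python
# Hedged sketch: a possible tri-modal consistency term that pulls the
# question-aware temporal attention over audio and video toward agreement.
import torch
import torch.nn.functional as F

def trimodal_consistency_loss(audio_feats: torch.Tensor,
                              video_feats: torch.Tensor,
                              question_feat: torch.Tensor,
                              tau: float = 0.1) -> torch.Tensor:
    """audio_feats, video_feats: (T, d) per-timestep features; question_feat: (d,)."""
    q = F.normalize(question_feat, dim=-1)
    att_a = F.softmax(F.normalize(audio_feats, dim=-1) @ q / tau, dim=0)  # (T,)
    att_v = F.softmax(F.normalize(video_feats, dim=-1) @ q / tau, dim=0)  # (T,)
    # Symmetric KL between the two question-aware temporal distributions.
    return 0.5 * (F.kl_div(att_a.log(), att_v, reduction="sum")
                  + F.kl_div(att_v.log(), att_a, reduction="sum"))

if __name__ == "__main__":
    loss = trimodal_consistency_loss(torch.randn(20, 64), torch.randn(20, 64), torch.randn(64))
    print(float(loss))
```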
We empirically investigate proper pre-training methods for building good visual tokenizers, making Large Language Models (LLMs) powerful Multimodal Large Language Models (MLLMs). On our benchmark, which is curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO) and observe that: i) Fully/weakly supervised models capture more semantics than self-supervised models, but the gap is narrowed by scaling up the pre-training dataset. ii) Self-supervised models are better at fine-grained perception, where patch-level supervision is particularly effective. iii) Tuning the visual tokenizer leads to the loss of semantics obtained from large-scale pretraining, which is unfavorable with a relatively small-scale instruction-tuning dataset. Given these findings, we reviewed methods that attempted to unify semantics and fine-grained visual understanding, e.g., patch-level feature distillation with semantically rich targets. We obtain an intriguing insight: mask-based strategies that were once all the rage may not be applicable for obtaining good visual tokenizers. Based on this critical observation, we obtain a new MLLM equipped with a tailored Good Visual Tokenizer (GVT), which exhibits strong visual comprehension capability at multiple scales. In particular, without introducing extra parameters or task-specific fine-tuning, GVT achieves superior performance on visual question answering, image captioning, and other fine-grained visual understanding tasks such as object counting and multi-class identification.
https://arxiv.org/abs/2305.12223
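Patch-level feature distillation with semantically rich targets, as reviewed above, can be sketched as follows: project student patch features into the space of a frozen, semantically rich teacher (e.g., a CLIP vision encoder) and minimize a per-patch cosine distance. The projection, dimensions, and objective are illustrative rather than the exact GVT recipe.

```python
# Hedged sketch of patch-level feature distillation against a frozen,
# semantically rich teacher. Dimensions and the cosine objective are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDistillHead(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)   # map student patches to teacher space

    def forward(self, student_patches: torch.Tensor, teacher_patches: torch.Tensor) -> torch.Tensor:
        # student_patches: (B, N, Ds); teacher_patches: (B, N, Dt), detached/frozen.
        s = F.normalize(self.proj(student_patches), dim=-1)
        t = F.normalize(teacher_patches.detach(), dim=-1)
        return (1 - (s * t).sum(dim=-1)).mean()           # mean cosine distance per patch

if __name__ == "__main__":
    head = PatchDistillHead(student_dim=384, teacher_dim=512)
    loss = head(torch.randn(2, 196, 384), torch.randn(2, 196, 512))
    print(float(loss))
```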
The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. It includes 300 literary essays and over 19,000 multiple-choice questions on a range of topics. By including both textual data and accompanying images, the dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students to see how well they performed. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have room to improve, though, especially in mathematics, physics, chemistry, and biology. The VNHSGE dataset, with its wide-ranging coverage and variety of tasks, seeks to provide an adequate benchmark for assessing the abilities of LLMs. We intend to promote future developments in LLMs by making this dataset available to the scientific community, especially for resolving LLMs' limitations in disciplines involving mathematics and the natural sciences.
https://arxiv.org/abs/2305.12199
Large pre-trained multimodal models have demonstrated significant success in a range of downstream tasks, including image captioning, image-text retrieval, visual question answering (VQA), etc. However, many of these methods rely on image-text pairs collected from the web as pre-training data and unfortunately overlook the need for fine-grained feature alignment between vision and language modalities, which requires detailed understanding of images and language expressions. While integrating VQA and dense captioning (DC) into pre-training can address this issue, acquiring image-question-answer as well as image-location-caption triplets is challenging and time-consuming. Additionally, publicly available datasets for VQA and dense captioning are typically limited in scale due to manual data collection and labeling efforts. In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily-crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets. We apply this method to the Conceptual Caption (CC3M) dataset to generate a new dataset called CC3M-QA-DC. Experiments show that when used for pre-training in a multi-task manner, CC3M-QA-DC can improve the performance with various backbones on various downstream tasks. Furthermore, our generated CC3M-QA-DC can be combined with larger image-text datasets (e.g., CC15M) and achieve competitive results compared with models using much more data. Code and dataset will be released.
https://arxiv.org/abs/2305.11769
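The generate-then-filter loop of JADE can be sketched as below; `caption_to_qa`, `vqa_model`, and the exact-match self-consistency filter are hypothetical placeholders for the paper's actual prompting and filtering rules.

```python
# Hedged sketch of building QA data from image-text pairs by generating
# candidates from captions and keeping only self-consistent ones.
from typing import Any, Callable, Dict, Iterable, List, Tuple

def generate_and_filter(pairs: Iterable[Tuple[Any, str]],
                        caption_to_qa: Callable[[str], List[Tuple[str, str]]],
                        vqa_model: Callable[[Any, str], str]) -> List[Dict]:
    """For each (image, caption) pair, generate candidate QA pairs from the
    caption and keep only those the multimodal model answers consistently."""
    kept = []
    for image, caption in pairs:
        for question, answer in caption_to_qa(caption):
            predicted = vqa_model(image, question)
            if predicted.strip().lower() == answer.strip().lower():   # self-consistency filter
                kept.append({"image": image, "question": question, "answer": answer})
    return kept
```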