Barren plateaus are a central bottleneck in the scalability of variational quantum algorithms (VQAs), and are known to arise in various ways, from circuit depth and hardware noise to global observables. However, a caveat of most existing results is the requirement of t-design circuit assumptions that are typically not satisfied in practice. In this work, we loosen these assumptions altogether and derive tight upper and lower bounds on gradient concentration, for a large class of parameterized quantum circuits and arbitrary observables. By requiring only a couple of design choices that are constructive and easily verified, our results can readily be leveraged to rule out barren plateaus for explicit circuits and mixed observables, namely, observables containing a non-vanishing local term. This insight has direct implications for hybrid Quantum Generative Adversarial Networks (qGANs), a generative model that can be reformulated as a VQA with an observable composed of local and global terms. We prove that designing the discriminator appropriately leads to 1-local weights that stay constant in the number of qubits, regardless of discriminator depth. Combined with our first contribution, this implies that qGANs with shallow generators can be trained at scale without suffering from barren plateaus -- making them a promising candidate for applications in generative quantum machine learning. We demonstrate this result by training a qGAN to learn a 2D mixture of Gaussian distributions with up to 16 qubits, and provide numerical evidence that global contributions to the gradient, while initially exponentially small, may kick in substantially over the course of training.
https://arxiv.org/abs/2309.12681
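As a toy numerical illustration of the local-versus-global gradient behavior described above (my own construction, not the paper's circuits), consider a product ansatz of single-qubit RY rotations, for which expectation values are available in closed form: <Z_i> = cos(theta_i). The gradient of a 1-local term Z_0 with respect to theta_0 then has constant variance over random initializations, while the gradient of a global Z x ... x Z term picks up a product of cosines and its variance decays as 2^-n; a mixed observable keeps the constant local contribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20_000

def gradient_variances(n_qubits):
    """Variance (over random initializations) of d<O>/d(theta_0) for a
    product ansatz |psi> = RY(theta_0) x ... x RY(theta_{n-1}) |0...0>,
    where <Z_i> = cos(theta_i)."""
    thetas = rng.uniform(0, 2 * np.pi, size=(n_samples, n_qubits))
    # 1-local term Z_0: d<Z_0>/d(theta_0) = -sin(theta_0)
    grad_local = -np.sin(thetas[:, 0])
    # global term Z x ... x Z: expectation is prod_i cos(theta_i)
    grad_global = -np.sin(thetas[:, 0]) * np.prod(np.cos(thetas[:, 1:]), axis=1)
    # mixed observable O = Z_0 + Z x ... x Z (local plus global decomposition)
    grad_mixed = grad_local + grad_global
    return grad_local.var(), grad_global.var(), grad_mixed.var()

for n in (2, 4, 8, 12, 16):
    v_loc, v_glob, v_mix = gradient_variances(n)
    print(f"n={n:2d}  Var[local]={v_loc:.3f}  "
          f"Var[global]={v_glob:.2e} (~2^-n={2.0**-n:.2e})  Var[mixed]={v_mix:.3f}")
```

The local term keeps a variance near 0.5 at every width, while the global term's variance shrinks exponentially, which mirrors the qualitative claim that a non-vanishing local term protects trainability.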
We introduce quantum algorithms able to sample equilibrium configurations of water solvent molecules within proteins using analog quantum computing. To do so, we combine a quantum placement strategy with the 3D Reference Interaction Site Model (3D-RISM), an approach capable of predicting continuous solvent distributions. The intrinsic quantum nature of this coupling guarantees that molecules are not placed too close to each other, a constraint usually imposed by hand in classical approaches. We first present a full quantum adiabatic evolution model that uses a local Rydberg Hamiltonian to cast the general problem into an anti-ferromagnetic Ising model. Its solution, an NP-hard problem in classical computing, is embodied in a Rydberg atom array Quantum Processing Unit (QPU). Following a classical emulator implementation, porting the algorithm to a QPU allows its performance to be validated experimentally on an actual quantum computer. As a perspective for next-generation devices, we emulate a second, hybrid quantum-classical version of the algorithm. This variational quantum approach (VQA) uses a classical Bayesian minimization routine to find the optimal laser parameters. Overall, these Quantum-3D-RISM (Q-3D-RISM) algorithms open a new route towards the application of analog quantum computing in molecular modelling and drug design.
https://arxiv.org/abs/2309.12129
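As a minimal sketch of the anti-ferromagnetic Ising formulation described above, under my own assumptions: the candidate water sites and affinities below are made-up stand-ins for maxima of a 3D-RISM solvent density map, the exclusion radius plays the role of the Rydberg blockade constraint, and a brute-force search stands in for the QPU.

```python
import itertools
import numpy as np

# Hypothetical candidate water sites (coordinates in Angstrom) and occupation
# affinities, standing in for maxima of a 3D-RISM solvent density map.
sites = np.array([[0.0, 0.0, 0.0],
                  [2.0, 0.0, 0.0],
                  [4.5, 0.0, 0.0],
                  [2.2, 2.0, 0.0],
                  [6.0, 1.5, 0.0]])
affinity = np.array([1.0, 0.8, 0.9, 0.7, 0.6])   # reward for placing a molecule
r_block = 2.8                                     # exclusion radius (blockade analogue)
penalty = 10.0                                    # anti-ferromagnetic coupling strength

def ising_cost(occ):
    """Cost of an occupation bitstring: reward occupied sites, penalize
    pairs of occupied sites closer than the exclusion radius."""
    occ = np.asarray(occ)
    cost = -float(affinity @ occ)
    for i, j in itertools.combinations(range(len(sites)), 2):
        if occ[i] and occ[j] and np.linalg.norm(sites[i] - sites[j]) < r_block:
            cost += penalty
    return cost

# Exhaustive ground-state search (the role played by the Rydberg atom array QPU).
best = min(itertools.product((0, 1), repeat=len(sites)), key=ising_cost)
print("occupied sites:", [i for i, b in enumerate(best) if b], "cost:", ising_cost(best))
```

The ground state of this cost function is exactly a maximum-weight independent set of sites, which is why the blockade-type constraint removes the need for hand-tuned minimum-distance rules.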
Answer grounding is the task of locating relevant visual evidence for the Visual Question Answering task. While a wide variety of attention methods have been introduced for this task, they suffer from three problems: (1) designs that do not allow the use of pre-trained networks and therefore do not benefit from large-scale pre-training; (2) custom designs that are not based on well-grounded previous designs, limiting the learning power of the network; and (3) complicated designs that are challenging to re-implement or improve. In this paper, we propose a novel architectural block, which we term Sentence Attention Block, to solve these problems. The proposed block re-calibrates image feature maps channel-wise by explicitly modeling inter-dependencies between the image feature maps and the sentence embedding. We visually demonstrate how this block filters out irrelevant feature-map channels based on the sentence embedding. We start our design with a well-known attention method, and by making minor modifications, we improve the results to achieve state-of-the-art accuracy. The flexibility of our method makes it easy to use different pre-trained backbone networks, and its simplicity makes it easy to understand and re-implement. We demonstrate the effectiveness of our method on the TextVQA-X, VQS, VQA-X, and VizWiz-VQA-Grounding datasets. We perform multiple ablation studies to show the effectiveness of our design choices.
https://arxiv.org/abs/2309.11593
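From the description above, the block reads like a squeeze-and-excitation-style channel gate driven by the sentence embedding rather than by pooled image features; the sketch below reflects that reading, not the authors' exact design, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SentenceAttentionBlock(nn.Module):
    """Re-calibrate image feature-map channels from a sentence embedding,
    in the spirit of the block described in the abstract above."""
    def __init__(self, num_channels: int, sent_dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(sent_dim, num_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(num_channels // reduction, num_channels),
            nn.Sigmoid(),            # per-channel weights in (0, 1)
        )

    def forward(self, feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # feat: [B, C, H, W] image feature maps, sent: [B, D] sentence embedding
        w = self.gate(sent).unsqueeze(-1).unsqueeze(-1)   # [B, C, 1, 1]
        return feat * w   # channels irrelevant to the sentence are suppressed

# Toy usage
block = SentenceAttentionBlock(num_channels=256, sent_dim=768)
out = block(torch.randn(2, 256, 14, 14), torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 256, 14, 14])
```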
Medical visual question answering (Med-VQA) is a machine learning task that aims to create a system that can answer natural language questions based on given medical images. Although there has been rapid progress on the general VQA task, less progress has been made on Med-VQA due to the lack of large-scale annotated datasets. In this paper, we present domain-specific pre-training strategies, including a novel contrastive learning pretraining method, to mitigate the problem of small datasets for the Med-VQA task. We find that the model benefits from components that use fewer parameters. We also evaluate and discuss the model's visual reasoning using evidence verification techniques. Our proposed model obtained an accuracy of 60% on the VQA-Med 2019 test set, giving comparable results to other state-of-the-art Med-VQA models.
https://arxiv.org/abs/2309.11080
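The abstract mentions a contrastive pretraining method without spelling it out; as general background (not necessarily the paper's variant), a symmetric InfoNCE-style image-text contrastive loss looks like this:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.
    Matching pairs sit on the diagonal of the similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # [B, B]
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for a medical image/report batch
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```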
Visual Question Answering (VQA) aims to automatically answer natural language questions related to given image content. Existing VQA methods integrate vision modeling and language understanding to explore the deep semantics of the question. However, these methods ignore the significant syntax information of the question, which plays a vital role in understanding the essential semantics of the question and guiding the visual feature refinement. To fill this gap, we propose a novel Syntax Tree Constrained Graph Network (STCGN) for VQA based on entity message passing and syntax trees. This model extracts a syntax tree from each question to obtain more precise syntax information. Specifically, we parse questions and obtain the question syntax tree using the Stanford syntax parsing tool. At the word and phrase levels, syntactic phrase features and question features are extracted using a hierarchical tree convolutional network. We then design a message-passing mechanism for phrase-aware visual entities and capture entity features according to a given visual context. Extensive experiments on the VQA2.0 dataset demonstrate the superiority of our proposed model.
https://arxiv.org/abs/2309.09179
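A rough sketch of the phrase-aware message passing described above; the real STCGN also involves a hierarchical tree convolution over the Stanford parse, so the attention-weighted aggregation below is a simplification with assumed shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseEntityMessagePassing(nn.Module):
    """One round of message passing from syntactic phrase features to
    visual entity features, weighted by phrase-entity attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, entities: torch.Tensor, phrases: torch.Tensor) -> torch.Tensor:
        # entities: [N_ent, D] visual entity features, phrases: [N_phr, D]
        attn = F.softmax(entities @ phrases.t() / entities.size(-1) ** 0.5, dim=-1)
        messages = attn @ self.msg(phrases)               # [N_ent, D]
        return self.update(messages, entities)            # updated entity features

# Toy usage: 36 detected entities, 5 syntactic phrases, 512-d features
mp = PhraseEntityMessagePassing(dim=512)
updated = mp(torch.randn(36, 512), torch.randn(5, 512))
print(updated.shape)  # torch.Size([36, 512])
```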
Systematic generalization is a crucial aspect of intelligence, which refers to the ability to generalize to novel tasks by combining known subtasks and concepts. One critical factor that has been shown to influence systematic generalization is the diversity of training data. However, diversity can be defined in various ways, as data have many factors of variation. A more granular understanding of how different aspects of data diversity affect systematic generalization is lacking. We present new evidence in the problem of Visual Question Answering (VQA) that reveals that the diversity of simple tasks (i.e. tasks formed by a few subtasks and concepts) plays a key role in achieving systematic generalization. This implies that it may not be essential to gather a large and varied number of complex tasks, which could be costly to obtain. We demonstrate that this result is independent of the similarity between the training and testing data and applies to well-known families of neural network architectures for VQA (i.e. monolithic architectures and neural module networks). Additionally, we observe that neural module networks leverage all forms of data diversity we evaluated, while monolithic architectures require more extensive amounts of data to do so. These findings provide a first step towards understanding the interactions between data diversity design, neural network architectures, and systematic generalization capabilities.
https://arxiv.org/abs/2309.08798
The widespread adoption of commercial autonomous vehicles (AVs) and advanced driver assistance systems (ADAS) may largely depend on their acceptance by society, for which their perceived trustworthiness and interpretability to riders are crucial. In general, this task is challenging because modern autonomous systems software relies heavily on black-box artificial intelligence models. Towards this goal, this paper introduces a novel dataset, Rank2Tell, a multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. Using various closed and open-ended visual question answering formats, the dataset provides dense annotations of semantic, spatial, temporal, and relational attributes of important objects in complex traffic scenarios. The dense annotations and unique attributes of the dataset make it a valuable resource for researchers working on visual scene understanding and related fields. Further, we introduce a joint model for importance-level ranking and natural language caption generation to benchmark our dataset and demonstrate performance with quantitative evaluations.
https://arxiv.org/abs/2309.06597
Variational Quantum Algorithms (VQA) have been identified as a promising candidate for the demonstration of near-term quantum advantage in solving optimization tasks in chemical simulation, quantum information, and machine learning. The standard model of training requires a significant amount of quantum resources, which led us to use classical shadows to devise an alternative that consumes exponentially fewer quantum resources. However, the approach only works when the observables are local and the ansatz is the shallow Alternating Layered Ansatz (ALA), thus severely limiting its potential in solving problems such as quantum state preparation, where the ideal state might not be approximable with an ALA. In this work, we present a protocol based on shallow shadows that achieves similar levels of savings for almost any shallow ansatz studied in the literature, when combined with observables of low Frobenius norm. We show that two important applications in quantum information for which VQAs can be a powerful option, namely variational quantum state preparation and variational quantum circuit synthesis, are compatible with our protocol. We also experimentally demonstrate orders of magnitude improvement in comparison to the standard VQA model.
https://arxiv.org/abs/2309.04754
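For context on the shadow estimation the paper generalizes, here is a toy single-qubit version of the standard random-Pauli classical shadow estimator; the paper's protocol replaces random local Paulis with shallow random circuits, so this sketch is only background.

```python
import numpy as np

rng = np.random.default_rng(1)

# Single-qubit state |psi> = cos(a/2)|0> + sin(a/2)|1>, so the exact <Z> = cos(a).
alpha = 0.9
psi = np.array([np.cos(alpha / 2), np.sin(alpha / 2)], dtype=complex)
rho = np.outer(psi, psi.conj())

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
paulis = [X, Y, Z]

def shadow_estimate_z(n_snapshots: int) -> float:
    """Estimate <Z> from random-Pauli-basis single-shot measurements.
    For a Z-basis snapshot the estimator contributes 3 * (+/-1); X or Y contribute 0."""
    total = 0.0
    for _ in range(n_snapshots):
        basis = rng.integers(3)                 # 0: X, 1: Y, 2: Z
        evals, evecs = np.linalg.eigh(paulis[basis])
        probs = np.clip(np.real(np.einsum('ij,jk,ki->i',
                                          evecs.conj().T, rho, evecs)), 0, None)
        outcome = rng.choice(2, p=probs / probs.sum())
        if basis == 2:                          # only Z-basis snapshots carry signal
            total += 3.0 * evals[outcome]
    return total / n_snapshots

print("exact  <Z> =", np.cos(alpha))
print("shadow <Z> =", shadow_estimate_z(20_000))
```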
While Multimodal Large Language Models (MLLMs) are widely used for a variety of vision-language tasks, one observation is that they sometimes misinterpret visual inputs or fail to follow textual instructions even in straightforward cases, leading to irrelevant responses, mistakes, and ungrounded claims. This observation is analogous to a phenomenon in neuropsychology known as Agnosia, an inability to correctly process sensory modalities and recognize things (e.g., objects, colors, relations). In our study, we adapt this similar concept to define "agnosia in MLLMs", and our goal is to comprehensively evaluate and mitigate such agnosia in MLLMs. Inspired by the diagnosis and treatment process in neuropsychology, we propose a novel framework EMMA (Evaluation and Mitigation of Multimodal Agnosia). In EMMA, we develop an evaluation module that automatically creates fine-grained and diverse visual question answering examples to assess the extent of agnosia in MLLMs comprehensively. We also develop a mitigation module to reduce agnosia in MLLMs through multimodal instruction tuning on fine-grained conversations. To verify the effectiveness of our framework, we evaluate and analyze agnosia in seven state-of-the-art MLLMs using 9K test samples. The results reveal that most of them exhibit agnosia across various aspects and degrees. We further develop a fine-grained instruction set and tune MLLMs to mitigate agnosia, which led to notable improvement in accuracy.
https://arxiv.org/abs/2309.04041
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task. However, such models are likely to disregard crucial visual cues and often rely on multimodal shortcuts and inherent biases of the language modality to predict the correct answer, a phenomenon commonly referred to as lack of visual grounding. In this work, we alleviate this shortcoming through a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal. Reasoning supervision takes the form of a textual justification of the correct answer, with such annotations being already available on large-scale Visual Common Sense Reasoning (VCR) datasets. The model's visual attention is guided toward important elements of the scene through a similarity loss that aligns the learned attention distributions guided by the question and the correct reasoning. We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to performance increase, without requiring training on explicit grounding annotations.
https://arxiv.org/abs/2309.03726
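One plausible form of the similarity loss described above is a KL term that pulls the question-guided attention over image regions towards the attention induced by the annotated rationale; the sketch below is an assumption about the loss shape, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(attn_question: torch.Tensor,
                             attn_reasoning: torch.Tensor) -> torch.Tensor:
    """KL divergence pushing the question-guided attention over image regions
    towards the attention induced by the textual justification of the answer.
    Both inputs are unnormalized scores of shape [B, num_regions]."""
    log_p = F.log_softmax(attn_question, dim=-1)
    q = F.softmax(attn_reasoning, dim=-1).detach()   # reasoning branch as target
    return F.kl_div(log_p, q, reduction="batchmean")

# Toy usage: 4 samples, 36 region-level attention scores each
loss = attention_alignment_loss(torch.randn(4, 36), torch.randn(4, 36))
print(float(loss))
```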
Recent advances in vision-language models (VLMs) have led to improved performance on tasks such as visual question answering and image captioning. Consequently, these models are now well-positioned to reason about the physical world, particularly within domains such as robotic manipulation. However, current VLMs are limited in their understanding of the physical concepts (e.g., material, fragility) of common objects, which restricts their usefulness for robotic manipulation tasks that involve interaction and physical reasoning about such objects. To address this limitation, we propose PhysObjects, an object-centric dataset of 36.9K crowd-sourced and 417K automated physical concept annotations of common household objects. We demonstrate that fine-tuning a VLM on PhysObjects improves its understanding of physical object concepts, by capturing human priors of these concepts from visual appearance. We incorporate this physically-grounded VLM in an interactive framework with a large language model-based robotic planner, and show improved planning performance on tasks that require reasoning about physical object concepts, compared to baselines that do not leverage physically-grounded VLMs. We additionally illustrate the benefits of our physically-grounded VLM on a real robot, where it improves task success rates. We release our dataset and provide further details and visualizations of our results at this https URL.
https://arxiv.org/abs/2309.02561
The VQA Natural Language Explanation (VQA-NLE) task aims to explain the decision-making process of VQA models in natural language. Unlike traditional attention or gradient analysis, free-text rationales are easier to understand and gain users' trust. Existing methods mostly use post-hoc or self-rationalization models to obtain a plausible explanation. However, these frameworks are bottlenecked by two challenges: (1) the generated rationale does not faithfully reflect the reasoning process and suffers from logical inconsistency; (2) human-annotated explanations are expensive and time-consuming to collect. In this paper, we propose Semi-Supervised VQA-NLE via Self-Critical Learning (S3C), which evaluates candidate explanations with answer rewards to improve the logical consistency between answers and rationales. With a semi-supervised learning framework, S3C can benefit from a large number of samples that lack human-annotated explanations. Extensive automatic measures and human evaluations demonstrate the effectiveness of our method. Moreover, the framework achieves new state-of-the-art performance on two VQA-NLE datasets.
https://arxiv.org/abs/2309.02155
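The self-critical component can be pictured as REINFORCE with a greedy baseline, where a sampled explanation is rewarded by how well it supports the answer; the sketch below assumes the token log-probabilities and reward scores are computed elsewhere and is not the authors' implementation.

```python
import torch

def self_critical_loss(sample_logprobs: torch.Tensor,
                       sample_reward: torch.Tensor,
                       greedy_reward: torch.Tensor) -> torch.Tensor:
    """REINFORCE with a greedy baseline (self-critical learning).

    sample_logprobs: [B, T] token log-probabilities of a sampled explanation
    sample_reward:   [B]    answer-consistency reward of the sampled explanation
    greedy_reward:   [B]    reward of the greedily decoded explanation (baseline)
    """
    advantage = (sample_reward - greedy_reward).detach()          # [B]
    seq_logprob = sample_logprobs.sum(dim=-1)                     # [B]
    return -(advantage * seq_logprob).mean()

# Toy usage: explanations whose sampled reward beats the greedy baseline get reinforced
loss = self_critical_loss(torch.rand(4, 12).log(),
                          sample_reward=torch.tensor([0.8, 0.2, 0.6, 0.9]),
                          greedy_reward=torch.tensor([0.5, 0.5, 0.5, 0.5]))
print(float(loss))
```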
With ever-increasing parameters and computation, vision-language pre-trained (VLP) models incur prohibitive costs in downstream task adaptation. Recent endeavors mainly focus on parameter-efficient transfer learning (PETL) for VLP models by updating only a small number of parameters. However, excessive computational overhead still plagues the application of VLP models. In this paper, we aim at parameter- and computation-efficient transfer learning (PCETL) for VLP models. In particular, PCETL needs not only to limit the number of trainable parameters in VLP models, but also to reduce the computational redundancy during inference, thus enabling a more efficient transfer. To approach this target, we propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL. Instead of directly optimizing the intrinsic architectures of VLP models, DAS first observes the significance of their modules to downstream tasks via a reinforcement learning (RL) based process, and then skips the redundant ones, replacing them with lightweight networks, i.e., adapters, according to the obtained rewards. In this way, the VLP model can maintain a small scale of trainable parameters while speeding up its inference on downstream tasks. To validate DAS, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a range of VL tasks. The experimental results not only show the great advantages of DAS in reducing computational complexity, e.g., -11.97% FLOPs for METER on VQA2.0, but also confirm its competitiveness against existing PETL methods in terms of parameter scale and performance. Our source code is given in the appendix.
https://arxiv.org/abs/2309.01479
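Conceptually, the skipping step replaces modules judged redundant with lightweight adapters; the simplified forward pass below follows that reading, with the RL search that produces the keep/skip decisions omitted and the adapter design assumed.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck used in place of a skipped transformer block."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))      # residual, cheap to evaluate

class SkippableEncoder(nn.Module):
    """Stack of blocks where low-significance blocks are bypassed via adapters."""
    def __init__(self, blocks: nn.ModuleList, keep_mask: list[bool], dim: int):
        super().__init__()
        self.blocks = blocks
        self.adapters = nn.ModuleList([Adapter(dim) for _ in blocks])
        self.keep_mask = keep_mask                         # e.g. derived from RL rewards

    def forward(self, x):
        for block, adapter, keep in zip(self.blocks, self.adapters, self.keep_mask):
            x = block(x) if keep else adapter(x)           # skip redundant blocks
        return x

# Toy usage: a 6-block encoder where the search stage decided to skip blocks 2 and 4
dim = 256
blocks = nn.ModuleList([nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
                        for _ in range(6)])
model = SkippableEncoder(blocks, keep_mask=[True, True, False, True, False, True], dim=dim)
print(model(torch.randn(2, 16, dim)).shape)   # torch.Size([2, 16, 256])
```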
Researchers have extensively studied the field of vision and language, discovering that both visual and textual content is crucial for understanding scenes effectively. Particularly, comprehending text in videos holds great significance, requiring both scene text understanding and temporal reasoning. This paper focuses on exploring two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content. The NewsVideoQA dataset contains question-answer pairs related to the text in news videos, while M4-ViteVQA comprises question-answer pairs from diverse categories like vlogging, traveling, and shopping. We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions. Additionally, the study includes experimentation with BERT-QA, a text-only model, which demonstrates comparable performance to the original methods on both datasets, indicating the shortcomings in the formulation of these datasets. Furthermore, we also look into the domain adaptation aspect by examining the effectiveness of training on M4-ViteVQA and evaluating on NewsVideoQA and vice-versa, thereby shedding light on the challenges and potential benefits of out-of-domain training.
https://arxiv.org/abs/2309.01380
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding. Specifically, by forcing vision-language models (VLMs) to answer questions and simultaneously provide visual evidence, we seek to ascertain the extent to which the predictions of such techniques are genuinely anchored in relevant video content, versus spurious correlations from language or irrelevant visual context. Towards this, we construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding (or location) labels tied to the original QA pairs. With NExT-GQA, we scrutinize a variety of state-of-the-art VLMs. Through post-hoc attention analysis, we find that these models are weak in substantiating the answers despite their strong QA performance. This exposes a severe limitation of these models in making reliable predictions. As a remedy, we further explore and suggest a video grounding mechanism via Gaussian mask optimization and cross-modal learning. Experiments with different backbones demonstrate that this grounding mechanism improves both video grounding and QA. Our dataset and code are released. With these efforts, we aim to push towards the reliability of deploying VLMs in VQA systems.
https://arxiv.org/abs/2309.01327
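One way to realize the Gaussian-mask grounding mentioned above is to predict a center and width over the frame axis from the question and softly weight frame features with the resulting bell curve; the module below sketches that mechanism with assumed shapes and is not the released implementation.

```python
import torch
import torch.nn as nn

class GaussianTemporalMask(nn.Module):
    """Predict a Gaussian weighting over frames from a question embedding and
    use it to softly ground the QA prediction in a temporal segment."""
    def __init__(self, q_dim: int):
        super().__init__()
        self.head = nn.Linear(q_dim, 2)            # predicts (center, width) in (0, 1)

    def forward(self, frame_feats: torch.Tensor, q_emb: torch.Tensor):
        # frame_feats: [B, T, D], q_emb: [B, q_dim]
        B, T, _ = frame_feats.shape
        center, width = torch.sigmoid(self.head(q_emb)).unbind(dim=-1)    # [B], [B]
        t = torch.linspace(0, 1, T, device=frame_feats.device)            # normalized time
        mask = torch.exp(-0.5 * ((t[None, :] - center[:, None]) /
                                 (0.1 + width[:, None])) ** 2)            # [B, T]
        grounded = (frame_feats * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return grounded, mask                       # pooled feature + soft grounding

# Toy usage: 2 videos, 32 frames, 512-d features, 768-d question embedding
m = GaussianTemporalMask(q_dim=768)
pooled, mask = m(torch.randn(2, 32, 512), torch.randn(2, 768))
print(pooled.shape, mask.shape)   # torch.Size([2, 512]) torch.Size([2, 32])
```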
Machine learning-based video codecs have made significant progress in the past few years. A critical area in the development of ML-based video codecs is an accurate evaluation metric that does not require an expensive and slow subjective test. We show that existing evaluation metrics that were designed and trained on DSP-based video codecs are not highly correlated with subjective opinion when used with ML video codecs, because the video artifacts produced by ML-based codecs differ substantially from those of DSP-based codecs. We provide a new dataset of videos produced by ML video codecs that have been accurately labeled for quality. We also propose a new full-reference video quality assessment (FRVQA) model that achieves a Pearson Correlation Coefficient (PCC) of 0.99 and a Spearman's Rank Correlation Coefficient (SRCC) of 0.99 at the model level. We make the dataset and FRVQA model open source to help accelerate research in ML video codecs, and so that others can further improve the FRVQA model.
https://arxiv.org/abs/2309.00769
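Since the correlations are reported at the model level, predicted and subjective scores are presumably aggregated per codec before being correlated; a small sketch of that computation with placeholder data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder per-video scores: (codec_id, predicted_quality, mean_opinion_score)
records = [("codec_a", 3.1, 3.0), ("codec_a", 3.5, 3.6), ("codec_b", 2.2, 2.0),
           ("codec_b", 2.6, 2.4), ("codec_c", 4.1, 4.3), ("codec_c", 4.4, 4.5)]

def model_level_correlations(records):
    """Aggregate scores per codec model, then compute PCC and SRCC."""
    by_codec = {}
    for codec, pred, mos in records:
        by_codec.setdefault(codec, []).append((pred, mos))
    preds = np.array([np.mean([p for p, _ in v]) for v in by_codec.values()])
    moses = np.array([np.mean([m for _, m in v]) for v in by_codec.values()])
    return pearsonr(preds, moses)[0], spearmanr(preds, moses)[0]

pcc, srcc = model_level_correlations(records)
print(f"model-level PCC={pcc:.3f}  SRCC={srcc:.3f}")
```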
Object proposal generation serves as a standard pre-processing step in Vision-Language (VL) tasks (image captioning, visual question answering, etc.). The performance of object proposals generated for VL tasks is currently evaluated across all available annotations, a protocol that we show is misaligned - higher scores do not necessarily correspond to improved performance on downstream VL tasks. Our work serves as a study of this phenomenon and explores the effectiveness of semantic grounding to mitigate its effects. To this end, we propose evaluating object proposals against only a subset of available annotations, selected by thresholding an annotation importance score. Importance of object annotations to VL tasks is quantified by extracting relevant semantic information from text describing the image. We show that our method is consistent and demonstrates greatly improved alignment with annotations selected by image captioning metrics and human annotation when compared against existing techniques. Lastly, we compare current detectors used in the Scene Graph Generation (SGG) benchmark as a use case, which serves as an example of when traditional object proposal evaluation techniques are misaligned.
https://arxiv.org/abs/2309.00215
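A rough rendering of the proposed evaluation protocol: score each ground-truth annotation by how relevant its category is to the image caption, keep only annotations above a threshold, and measure proposal recall against that subset. The word-overlap importance score and the thresholds below are stand-ins for the paper's semantic measure.

```python
import numpy as np

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def importance(category: str, caption: str) -> float:
    """Stand-in importance score: word overlap between the category name and
    the caption (a learned semantic similarity would be used in practice)."""
    cat, cap = set(category.lower().split()), set(caption.lower().split())
    return len(cat & cap) / max(len(cat), 1)

def recall_on_important(proposals, annotations, caption, tau=0.5, iou_thr=0.5):
    """Recall of proposals computed only over annotations deemed important."""
    kept = [box for box, cat in annotations if importance(cat, caption) >= tau]
    if not kept:
        return float("nan")
    hits = sum(any(iou(gt, p) >= iou_thr for p in proposals) for gt in kept)
    return hits / len(kept)

# Toy usage
caption = "a man riding a horse on the beach"
annotations = [((10, 10, 60, 120), "man"), ((50, 40, 180, 160), "horse"),
               ((0, 150, 300, 200), "sand dune")]
proposals = [(12, 8, 58, 118), (55, 45, 175, 150)]
print(recall_on_important(proposals, annotations, caption))
```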
The generation of effective latent representations and their subsequent refinement to incorporate precise information is an essential prerequisite for Vision-Language Understanding (VLU) tasks such as Video Question Answering (VQA). However, most existing methods for VLU focus on sparsely sampling or fine-graining the input information (e.g., sampling a sparse set of frames or text tokens), or adding external knowledge. We present a novel "DRAX: Distraction Removal and Attended Cross-Alignment" method to rid our cross-modal representations of distractors in the latent space. We do not exclusively confine the perception of any input information from various modalities but instead use an attention-guided distraction removal method to increase focus on task-relevant information in latent embeddings. DRAX also ensures semantic alignment of embeddings during cross-modal fusions. We evaluate our approach on a challenging benchmark (SUTD-TrafficQA dataset), testing the framework's abilities for feature and event queries, temporal relation understanding, forecasting, hypothesis, and causal analysis through extensive experiments.
https://arxiv.org/abs/2309.00133
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks by combining visual representations with the abstract skill set that large language models (LLMs) learn during pretraining. Vision, while the most popular modality to augment LLMs with, is only one representation of a scene. In human-robot interaction scenarios, robots require accurate scene understanding. In this paper, we define and demonstrate a method of aligning the embedding spaces of different modalities (in this case, inertial measurement unit (IMU) data) to the vision embedding space through a combination of supervised and contrastive training, enabling the VLM to understand and reason about these additional modalities without retraining. We opt to give the model IMU embeddings directly, rather than using a separate human activity recognition model whose output feeds into the prompt, so as to preserve nonlinear interactions between the query, image, and IMU signal that would be lost by mapping the IMU data to a discrete activity label. Further, we demonstrate our methodology's efficacy through experiments involving human activity recognition using IMU data and visual inputs. Our results show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks, thus paving the way for more versatile and capable language models in multi-modal contexts.
https://arxiv.org/abs/2308.16493
Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language "sentence". However, they ignore the fact that most OCR words in the TextVQA task do not have a semantic contextual relationship. In addition, these approaches use 1-D position embeddings to construct spatial relations between OCR tokens sequentially, which is not reasonable: a 1-D position embedding can only represent the left-right sequence relationship between words in a sentence, not the complex spatial position relationships. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores textual contextual cues and designs spatial position embeddings to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. Then, we introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason about the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets. Compared with the state-of-the-art pre-training method, which is pre-trained on 64 million samples, our method, without any pre-training tasks, still achieves 2.68% and 2.52% accuracy improvements on TextVQA and ST-VQA. Our code and models will be released at this https URL.
https://arxiv.org/abs/2308.16383
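As I read it, the Spatial Circle Position idea encodes the relative position between OCR tokens by binning their center-to-center distance into concentric rings and their direction into angular sectors; the function below is a plausible sketch of such a 2-D relation index, with bin counts chosen arbitrarily.

```python
import numpy as np

def spatial_circle_ids(boxes, num_rings: int = 4, num_sectors: int = 8,
                       max_dist: float = 1.0):
    """Pairwise relative-position ids for OCR boxes given as (x1, y1, x2, y2)
    in normalized image coordinates. Each ordered pair (i, j) is assigned
    ring_index * num_sectors + sector_index, usable as an embedding lookup."""
    boxes = np.asarray(boxes, dtype=float)
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)        # [N, 2]
    diff = centers[None, :, :] - centers[:, None, :]                     # [N, N, 2]
    dist = np.linalg.norm(diff, axis=-1)
    ring = np.minimum((dist / max_dist * num_rings).astype(int), num_rings - 1)
    angle = np.arctan2(diff[..., 1], diff[..., 0])                       # (-pi, pi]
    sector = ((angle + np.pi) / (2 * np.pi) * num_sectors).astype(int) % num_sectors
    return ring * num_sectors + sector                                   # [N, N] ids

# Toy usage: three OCR boxes; the resulting id matrix would index a learned embedding
ids = spatial_circle_ids([(0.1, 0.1, 0.3, 0.2), (0.6, 0.1, 0.8, 0.2), (0.1, 0.6, 0.3, 0.7)])
print(ids)
```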