Mitigating hallucinations of Large Multi-modal Models (LMMs) is crucial to enhancing their reliability as general-purpose assistants. This paper shows that such hallucinations of LMMs can be significantly exacerbated by preceding user-system dialogues. To precisely measure this, we first present an evaluation benchmark by extending popular multi-modal benchmark datasets with prepended hallucinatory dialogues generated by our novel Adversarial Question Generator, which can automatically generate image-related yet adversarial dialogues by adopting adversarial attacks on LMMs. On our benchmark, the zero-shot performance of state-of-the-art LMMs dropped significantly for both the VQA and captioning tasks. Next, we further reveal that this hallucination is mainly due to prediction bias toward preceding dialogues rather than visual content. To reduce this bias, we propose Adversarial Instruction Tuning, which robustly fine-tunes LMMs on multi-modal instruction-following datasets augmented with hallucinatory dialogues. Extensive experiments show that our proposed approach successfully reduces dialogue hallucination while maintaining, or even improving, performance.
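As a rough illustration of the data side of Adversarial Instruction Tuning, the sketch below prepends a pre-generated hallucinatory dialogue to a multi-modal instruction-following sample so that fine-tuning exposes the model to misleading dialogue history; the record format and the example turns are invented for illustration and are not taken from the paper.

    import json

    def augment_with_hallucinatory_dialogue(sample, hallucinatory_turns):
        # Prepend adversarial turns so fine-tuning teaches the model to answer
        # from the image rather than from the misleading dialogue history.
        return {
            "image": sample["image"],
            "conversations": list(hallucinatory_turns) + list(sample["conversations"]),
        }

    sample = {
        "image": "000123.jpg",  # hypothetical image file name
        "conversations": [
            {"from": "human", "value": "What color is the bus?"},
            {"from": "assistant", "value": "The bus is yellow."},
        ],
    }
    hallucinatory_turns = [
        {"from": "human", "value": "Is there a red fire truck parked next to the bus?"},
        {"from": "assistant", "value": "Yes, a red fire truck is parked right beside it."},
    ]
    print(json.dumps(augment_with_hallucinatory_dialogue(sample, hallucinatory_turns), indent=2))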
https://arxiv.org/abs/2403.10492
The task of few-shot image classification and segmentation (FS-CS) involves classifying and segmenting target objects in a query image, given only a few examples of the target classes. We introduce the Vision-Instructed Segmentation and Evaluation (VISE) method that transforms the FS-CS problem into the Visual Question Answering (VQA) problem, utilising Vision-Language Models (VLMs), and addresses it in a training-free manner. By enabling a VLM to interact with off-the-shelf vision models as tools, the proposed method is capable of classifying and segmenting target objects using only image-level labels. Specifically, chain-of-thought prompting and in-context learning guide the VLM to answer multiple-choice questions like a human; vision models such as YOLO and Segment Anything Model (SAM) assist the VLM in completing the task. The modular framework of the proposed method makes it easily extendable. Our approach achieves state-of-the-art performance on the Pascal-5i and COCO-20i datasets.
https://arxiv.org/abs/2403.10287
Knowledge-based visual question answering (KB-VQA) is a challenging task that requires the model to leverage external knowledge for comprehending and answering questions grounded in visual content. Recent studies retrieve knowledge passages from external knowledge bases and then use them to answer questions. However, these retrieved knowledge passages often contain irrelevant or noisy information, which limits the performance of the model. To address this challenge, we propose two synergistic models: a Knowledge Condensation model and a Knowledge Reasoning model. We condense the retrieved knowledge passages from two perspectives. First, we leverage the multimodal perception and reasoning ability of visual-language models to distill concise knowledge concepts from retrieved lengthy passages, ensuring relevance to both the visual content and the question. Second, we leverage the text comprehension ability of large language models to summarize and condense the passages into a knowledge essence that helps answer the question. These two types of condensed knowledge are then seamlessly integrated into our Knowledge Reasoning model, which judiciously navigates through the amalgamated information to arrive at the conclusive answer. Extensive experiments validate the superiority of the proposed method. Compared to previous methods, our method achieves state-of-the-art performance on knowledge-based VQA datasets (65.1% on OK-VQA and 60.1% on A-OKVQA) without resorting to the knowledge produced by GPT-3 (175B).
https://arxiv.org/abs/2403.10037
Quantum machine learning -- and specifically Variational Quantum Algorithms (VQAs) -- offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable-structure approach to build ansatzes for VQAs. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications, in the quantum autoencoder for data compression, and in unitary compilation problems, showing successful results in all cases.
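To make the grow-and-remove idea concrete, here is a minimal, library-free sketch of a VAns-style loop on a toy single-qubit state-preparation task; the gate set, insertion rule, and removal tolerance are illustrative assumptions rather than the paper's exact procedure.

    import numpy as np
    from scipy.optimize import minimize

    def rx(t):
        return np.array([[np.cos(t / 2), -1j * np.sin(t / 2)],
                         [-1j * np.sin(t / 2), np.cos(t / 2)]])

    def rz(t):
        return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]])

    GATES = {"rx": rx, "rz": rz}
    TARGET = np.array([np.cos(0.3), np.exp(0.7j) * np.sin(0.3)])  # arbitrary target state

    def cost(params, structure):
        state = np.array([1.0 + 0j, 0.0])
        for gate, theta in zip(structure, params):
            state = GATES[gate](theta) @ state
        return 1.0 - abs(np.vdot(TARGET, state)) ** 2  # infidelity with the target

    def optimize(structure, params):
        res = minimize(cost, params, args=(structure,), method="COBYLA")
        return res.x, res.fun

    structure, params = ["rz"], np.array([0.0])
    rng = np.random.default_rng(0)
    for _ in range(10):
        # grow: insert an identity-initialized rz-rx block at a random position
        pos = int(rng.integers(len(structure) + 1))
        structure[pos:pos] = ["rz", "rx"]
        params = np.insert(params, pos, [0.0, 0.0])
        params, best = optimize(structure, params)
        # simplify: remove any gate whose deletion barely changes the cost
        i = 0
        while i < len(structure):
            trial_s, trial_p = structure[:i] + structure[i + 1:], np.delete(params, i)
            if cost(trial_p, trial_s) <= best + 1e-6:
                structure, params, best = trial_s, trial_p, cost(trial_p, trial_s)
            else:
                i += 1
    print(len(structure), "gates, infidelity", best)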
https://arxiv.org/abs/2103.06712
Mainstream parameter-efficient fine-tuning (PEFT) methods, such as LoRA or Adapter, project a model's hidden states to a lower dimension, allowing pre-trained models to adapt to new data through this low-rank bottleneck. However, PEFT tasks involving multiple modalities, like vision-language (VL) tasks, require not only adaptation to new data but also learning the relationship between different modalities. Targeting VL PEFT tasks, we propose a family of operations, called routing functions, to enhance VL alignment in the low-rank bottlenecks. The routing functions adopt linear operations and do not introduce new trainable parameters. In-depth analyses are conducted to study their behavior. In various VL PEFT settings, the routing functions significantly improve the performance of the original PEFT methods, achieving over 20% improvement on VQAv2 (RoBERTa-large+ViT-L/16) and 30% on COCO Captioning (GPT2-medium+ViT-L/16). Also, when fine-tuning a pre-trained multimodal model such as CLIP-BART, we observe smaller but consistent improvements across a range of VL PEFT tasks.
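A minimal PyTorch sketch of the general idea, assuming a LoRA-style bottleneck whose low-rank hidden state is modulated by a linear projection of the other modality; the particular element-wise modulation used here is an illustrative choice, not necessarily the routing function from the paper.

    import torch
    import torch.nn as nn

    class RoutedLoRA(nn.Module):
        def __init__(self, dim, rank=16):
            super().__init__()
            self.down = nn.Linear(dim, rank, bias=False)   # project hidden state to low rank
            self.up = nn.Linear(rank, dim, bias=False)     # project back to model dimension
            self.route = nn.Linear(dim, rank, bias=False)  # project the other modality into the bottleneck
            nn.init.zeros_(self.up.weight)                 # start as an identity-preserving adapter

        def forward(self, hidden, other_modality):
            z = self.down(hidden)                                      # (batch, seq, rank)
            r = self.route(other_modality).mean(dim=1, keepdim=True)   # pooled routing vector
            z = z * r                                                  # route: modulate the bottleneck with the other modality
            return hidden + self.up(z)                                 # residual update, as in LoRA

    # toy usage: text hidden states routed by pooled image features
    text = torch.randn(2, 10, 768)
    image = torch.randn(2, 49, 768)
    out = RoutedLoRA(768)(text, image)
    print(out.shape)  # torch.Size([2, 10, 768])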
https://arxiv.org/abs/2403.09377
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text information often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial training in the embedding space of the OCR modality to enhance fault-tolerant representations of OCR texts, thereby reducing noise caused by OCR errors. Simultaneously, we add a Spatial-Aware Self-Attention (SASA) mechanism to help the model better capture the spatial relationships among OCR tokens. Various experiments demonstrate that our method achieves significant performance improvements on both the ST-VQA and TextVQA datasets and provides a novel paradigm for multimodal adversarial training.
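For intuition, a hedged sketch of adversarial training in an embedding space is shown below: a single FGSM-style sign step perturbs the OCR-token embeddings and the model is trained on both clean and perturbed inputs. The toy answer head, the single-step perturbation, and the epsilon are illustrative assumptions, not the AOE module itself.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyAnswerHead(nn.Module):
        """Stand-in for a multimodal answer head; real ST-VQA models are far richer."""
        def __init__(self, dim=32, n_answers=10):
            super().__init__()
            self.fc = nn.Linear(dim, n_answers)

        def forward(self, ocr_embeds, img_feat):
            return self.fc(ocr_embeds.mean(dim=1) + img_feat)  # naive fusion for the sketch

    def adversarial_ocr_loss(model, ocr_embeds, img_feat, labels, eps=1e-3):
        ocr_embeds = ocr_embeds.detach().requires_grad_(True)
        clean_loss = F.cross_entropy(model(ocr_embeds, img_feat), labels)
        grad, = torch.autograd.grad(clean_loss, ocr_embeds, retain_graph=True)
        adv_embeds = (ocr_embeds + eps * grad.sign()).detach()  # worst-case step in the OCR embedding space
        return clean_loss + F.cross_entropy(model(adv_embeds, img_feat), labels)

    model = ToyAnswerHead()
    ocr = torch.randn(4, 12, 32)              # 12 OCR-token embeddings per sample (stand-ins)
    img = torch.randn(4, 32)                  # pooled visual features (stand-ins)
    labels = torch.randint(0, 10, (4,))
    loss = adversarial_ocr_loss(model, ocr, img, labels)
    loss.backward()
    print(float(loss))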
https://arxiv.org/abs/2403.09288
Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classification, over to image captioning, and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.
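For reference, shape bias in such cue-conflict evaluations is commonly computed as the fraction of cue-following decisions that match the shape label rather than the texture label; the tiny example below uses made-up predictions, not results from the paper.

    def shape_bias(predictions, shape_labels, texture_labels):
        shape_hits, cue_hits = 0, 0
        for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
            if pred == shape:        # decision follows the shape cue
                shape_hits += 1
                cue_hits += 1
            elif pred == texture:    # decision follows the texture cue
                cue_hits += 1
        return shape_hits / cue_hits if cue_hits else float("nan")

    # toy example: three decisions follow shape, one follows texture -> shape bias 0.75
    preds    = ["cat", "dog", "car", "elephant"]
    shapes   = ["cat", "dog", "car", "bottle"]
    textures = ["fur", "bone", "sand", "elephant"]
    print(shape_bias(preds, shapes, textures))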
https://arxiv.org/abs/2403.09193
In this paper, we propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals. This innovation addresses a critical limitation in existing MLLMs: their reliance on a text-only codebook, which restricts the MLLM's ability to generate images and texts in a multimodal context. Towards this end, we propose a language-driven iterative training paradigm, coupled with an in-context pre-training task we term "image decompression", enabling our model to interpret compressed visual data and generate high-quality images. The unified codebook empowers our model to extend visual instruction tuning to non-linguistic generation tasks. Moreover, UniCode is adaptable to diverse stacked quantization approaches in order to compress visual signals into a more compact token representation. Despite using significantly fewer parameters and less data during training, UniCode demonstrates promising capabilities in visual reconstruction and generation. It also achieves performance comparable to leading MLLMs across a spectrum of VQA benchmarks.
https://arxiv.org/abs/2403.09072
Large vision-language models (LVLMs), designed to interpret and respond to human instructions, occasionally generate hallucinated or harmful content due to inappropriate instructions. This study uses linear probing to shed light on the hidden knowledge at the output layer of LVLMs. We demonstrate that the logit distributions of the first tokens contain sufficient information to determine whether to respond to the instructions, including recognizing unanswerable visual questions, defending against multi-modal jailbreaking attacks, and identifying deceptive questions. Such hidden knowledge is gradually lost in the logits of subsequent tokens during response generation. Then, we illustrate a simple decoding strategy at the generation of the first token that effectively improves the generated content. In experiments, we find a few interesting insights: First, the CLIP model already contains a strong signal for solving these tasks, indicating potential bias in the existing datasets. Second, we observe performance improvements from utilizing the first logit distributions on three additional tasks, including indicating uncertainty in math solving, mitigating hallucination, and image classification. Last, with the same training data, simply fine-tuning LVLMs improves the models' performance but is still inferior to linear probing on these tasks.
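A minimal sketch of the probing setup, assuming one collects the logit distribution of the first generated token for each instruction along with a binary answerability label; the random arrays below merely stand in for those features and labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    vocab_size, n_samples = 4096, 200
    first_token_logits = rng.normal(size=(n_samples, vocab_size)).astype(np.float32)  # placeholder for real LVLM logits
    answerable = rng.integers(0, 2, size=n_samples)                                   # placeholder labels

    probe = LogisticRegression(max_iter=1000)          # linear probe on the first-token logit distribution
    probe.fit(first_token_logits[:150], answerable[:150])
    print("probe accuracy:", probe.score(first_token_logits[150:], answerable[150:]))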
https://arxiv.org/abs/2403.09037
With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question. In this paper, we introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models, thereby facilitating vision-language understanding and the development of vision-oriented AI. VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features: (1) utilizing LLMs (e.g., LLaMA-2) as the pivot to break down users' requests into detailed action proposals to call suitable foundation models; (2) integrating multi-source outputs from foundation models automatically and generating comprehensive responses for users; (3) adaptability to a wide range of applications such as text-conditioned image understanding/generation/editing and visual question answering. This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our code and models will be made publicly available. Keywords: VisionGPT, Open-world visual perception, Vision-language understanding, Large language model, Foundation model
https://arxiv.org/abs/2403.09027
In this work, we investigate the potential of a large language model (LLM) to directly comprehend visual signals without the necessity of fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity and translates it into a set of discrete words derived from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this innovative image encoding, the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion; crucially, without any fine-tuning. We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration. Code and models are available at this https URL.
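The core mapping can be pictured as nearest-neighbour quantization of visual features onto the LLM's token embeddings; the sketch below uses random tensors in place of the paper's trained encoder-decoder and CLIP alignment, so it illustrates the interface rather than the actual tokenizer.

    import torch
    import torch.nn.functional as F

    vocab_size, embed_dim, n_patches = 1000, 64, 16
    token_table = torch.randn(vocab_size, embed_dim)   # stand-in for the LLM's token embedding table
    patch_feats = torch.randn(n_patches, embed_dim)    # stand-in for aligned visual patch features

    similarity = F.normalize(patch_feats, dim=-1) @ F.normalize(token_table, dim=-1).T
    visual_words = similarity.argmax(dim=-1)           # each patch becomes a discrete vocabulary token
    print(visual_words.tolist())                       # token ids a frozen LLM could consume directly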
https://arxiv.org/abs/2403.07874
Large Language Models (LLMs), benefiting from the auto-regressive modelling approach performed on massive unannotated text corpora, demonstrate powerful perceptual and reasoning capabilities. However, extending auto-regressive modelling to multi-modal scenarios to build Large Multi-modal Models (LMMs) poses a major difficulty: image information is processed in the LMM as continuous visual embeddings, which provide no discrete supervision labels for classification. In this paper, we successfully perform multi-modal auto-regressive modelling with a unified objective for the first time. Specifically, we propose the concept of visual words, which maps visual features to probability distributions over the LLM's vocabulary, providing supervision information for visual modelling. We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the powerful performance of our proposed approach.
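A hedged sketch of the "visual words" idea: project visual features to a distribution over the LLM vocabulary and use it as a soft classification target at visual positions. The linear projection and the KL objective below are illustrative assumptions about one way to realise this, not the paper's exact training recipe.

    import torch
    import torch.nn.functional as F

    vocab_size, vis_dim, n_vis_tokens = 1000, 64, 8
    to_vocab = torch.nn.Linear(vis_dim, vocab_size)              # maps visual features to vocabulary logits

    visual_feats = torch.randn(2, n_vis_tokens, vis_dim)         # stand-in for vision-encoder outputs
    visual_words = F.softmax(to_vocab(visual_feats), dim=-1)     # per-position distribution over the vocabulary

    # treat the visual-word distribution as the supervision target for the LMM's
    # predictions at visual positions (random logits stand in for the LMM here)
    lmm_logits = torch.randn(2, n_vis_tokens, vocab_size, requires_grad=True)
    loss = F.kl_div(F.log_softmax(lmm_logits, dim=-1), visual_words.detach(), reduction="batchmean")
    loss.backward()
    print(float(loss))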
https://arxiv.org/abs/2403.07720
Audio-visual question answering (AVQA) requires reference to video content and auditory information, followed by correlating the question to predict the most precise answer. Although mining deeper layers of audio-visual information to interact with questions facilitates the multimodal fusion process, the redundancy of audio-visual parameters tends to reduce the generalization of the inference engine to multiple question-answer pairs in a single video. Indeed, the naturally heterogeneous relationship between audio-visual content and text makes perfect fusion challenging. To prevent high-level audio-visual semantics from weakening the network's adaptability to diverse question types, we propose a framework that performs mutual correlation distillation (MCD) to aid question inference. MCD consists of three main steps: 1) First, a residual structure is utilized to enhance the audio-visual soft associations based on self-attention; key local audio-visual features relevant to the question context are then captured hierarchically by shared aggregators and coupled, in the form of clues, with specific question vectors. 2) Second, knowledge distillation is enforced to align audio-visual-text pairs in a shared latent space to narrow the cross-modal semantic gap. 3) Finally, the audio-visual dependencies are decoupled by discarding the decision-level integrations. We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs, i.e., Music-AVQA and AVQA. Experiments show that our method outperforms other state-of-the-art methods, and one interesting finding is that removing deep audio-visual features during inference can effectively mitigate overfitting. The source code is released at this http URL.
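To illustrate the alignment idea in step 2, here is a hedged sketch of pulling audio-visual and text embeddings together in a shared latent space; the symmetric InfoNCE-style loss below is one common way to narrow a cross-modal gap, not necessarily the exact MCD distillation objective.

    import torch
    import torch.nn.functional as F

    def cross_modal_alignment_loss(av_embed, text_embed, temperature=0.07):
        av = F.normalize(av_embed, dim=-1)
        tx = F.normalize(text_embed, dim=-1)
        logits = av @ tx.T / temperature                  # pairwise similarities in the shared space
        targets = torch.arange(av.size(0))                # the i-th clip matches the i-th text
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

    av = torch.randn(8, 256, requires_grad=True)          # pooled audio-visual clip embeddings (stand-ins)
    tx = torch.randn(8, 256)                              # question/answer text embeddings (stand-ins)
    loss = cross_modal_alignment_loss(av, tx)
    loss.backward()
    print(float(loss))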
https://arxiv.org/abs/2403.06679
Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficiencies. This paper introduces a new, more practical approach enabling simultaneous counting of multiple object categories using an open-vocabulary framework. Our solution, OmniCount, stands out by using semantic and geometric insights from pre-trained models to count multiple categories of objects as specified by users, all without additional training. OmniCount distinguishes itself by generating precise object masks and leveraging point prompts via the Segment Anything Model for efficient counting. To evaluate OmniCount, we created the OmniCount-191 benchmark, a first-of-its-kind dataset with multi-label object counts, including points, bounding boxes, and VQA annotations. Our comprehensive evaluation on OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions and heralding a new era in object counting technology.
https://arxiv.org/abs/2403.05435
Vision-extended LLMs (VLLMs) have made significant strides in Visual Question Answering (VQA). Despite these advancements, VLLMs still encounter substantial difficulties in handling queries involving long-tail entities, with a tendency to produce erroneous or hallucinated responses. In this work, we introduce a novel evaluation benchmark named SnapNTell, specifically tailored for entity-centric VQA. This task aims to test the models' capabilities in identifying entities and providing detailed, entity-specific knowledge. We have developed the SnapNTell Dataset, distinct from traditional VQA datasets: (1) it encompasses a wide range of categorized entities, each represented by images and explicitly named in the answers; (2) it features QA pairs that require extensive knowledge for accurate responses. The dataset is organized into 22 major categories, containing 7,568 unique entities in total. For each entity, we curated 10 illustrative images and crafted 10 knowledge-intensive QA pairs. To address this novel task, we devised a scalable, efficient, and transparent retrieval-augmented multimodal LLM. Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BELURT score. We will soon make the dataset and the source code publicly accessible.
https://arxiv.org/abs/2403.04735
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, these responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways: 1) besides straightforwardly bridging audio and video, we design a clue aggregator that aggregates question-related clues in dynamic audio-visual scenarios to enrich the detailed knowledge required by large language models. 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct to further enhance the capacity of CAT to model cross-semantic correlations. 3) We propose AI-assisted ambiguity-aware direct preference optimization, a strategy specialized in retraining the model to favor non-ambiguous responses and improve its ability to localize specific audio-visual objects. Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, especially in Audio-Visual Question Answering (AVQA) tasks. The codes and the collected instructions are released at this https URL.
https://arxiv.org/abs/2403.04640
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis. Our approach introduces enhancements across several dimensions: by adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; we hypothesize that images may contain redundant tokens, and by using similarity to filter them out and retain the significant ones, we can not only streamline the token length but also enhance the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability and minimize hallucinations. Additionally, TextMonkey can be fine-tuned to gain the ability to comprehend commands for clicking screenshots. Overall, our method notably boosts performance across various benchmark datasets, achieving increases of 5.2%, 6.9%, and 2.8% in Scene Text-Centric VQA, Document Oriented VQA, and KIE, respectively, notably with a score of 561 on OCRBench, surpassing prior open-sourced large multimodal models for document understanding. Code will be released at this https URL.
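One simple way to realise similarity-based token filtering is sketched below: greedily keep image tokens whose cosine similarity to all previously kept tokens stays below a threshold. The greedy rule and the threshold are illustrative assumptions, not TextMonkey's exact procedure.

    import torch
    import torch.nn.functional as F

    def filter_redundant_tokens(tokens, threshold=0.8):
        tokens_n = F.normalize(tokens, dim=-1)
        kept = [0]                                       # always keep the first token
        for i in range(1, tokens.size(0)):
            sims = tokens_n[i] @ tokens_n[kept].T        # similarity to every kept token
            if sims.max() < threshold:                   # keep only tokens that add new information
                kept.append(i)
        return tokens[kept], kept

    image_tokens = torch.randn(256, 1024)                # stand-in for redundant visual tokens
    kept_tokens, kept_idx = filter_redundant_tokens(image_tokens)
    print(len(kept_idx), "of", image_tokens.size(0), "tokens kept")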
https://arxiv.org/abs/2403.04473
Variational quantum algorithms (VQAs) are among the most promising algorithms for achieving quantum advantages in the NISQ era. One important challenge in implementing such algorithms is to construct an effective parameterized quantum circuit (also called an ansatz). In this work, we propose a single entanglement connection architecture (SECA) for a bipartite hardware-efficient ansatz (HEA) by balancing its expressibility, entangling capability, and trainability. Numerical simulations were conducted on a one-dimensional Heisenberg model and quadratic unconstrained binary optimization (QUBO) problems. Our results indicate the superiority of SECA over the common full entanglement connection architecture (FECA) in terms of computational performance. Furthermore, combining SECA with gate-cutting technology to construct distributed quantum computation (DQC) can efficiently expand the size of NISQ devices under low overhead. We also demonstrated the effectiveness and scalability of the DQC scheme. Our study provides a useful indication for understanding the characteristics associated with an effective training circuit.
https://arxiv.org/abs/2307.12323
This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question answering. We present a new dataset, AlgoPuzzleVQA, designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that necessitate visual understanding, language understanding, and complex algorithmic reasoning. We create the puzzles to encompass a diverse array of mathematical and algorithmic topics such as boolean logic, combinatorics, graph theory, optimization, and search, aiming to evaluate the gap between visual data interpretation and algorithmic problem-solving skills. The dataset is generated automatically from code authored by humans. All our puzzles have exact solutions that can be found from the algorithm without tedious human calculation. This ensures that our dataset can be scaled up arbitrarily in terms of reasoning complexity and dataset size. Our investigation reveals that large language models (LLMs) such as GPT4V and Gemini exhibit limited performance in puzzle-solving tasks. We find that their performance is near random in a multiple-choice question-answering setup for a significant number of puzzles. The findings emphasize the challenges of integrating visual, language, and algorithmic knowledge for solving complex reasoning problems.
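The generation principle can be illustrated with a tiny example: a puzzle instance is produced by code and its exact answer is computed algorithmically (here, a shortest-path question solved by breadth-first search). The puzzle template is invented for illustration and is not taken from AlgoPuzzleVQA.

    import random
    from collections import deque

    def make_puzzle(n_nodes=6, n_edges=8, seed=0):
        rng = random.Random(seed)
        edges = set()
        while len(edges) < n_edges:
            a, b = rng.sample(range(n_nodes), 2)
            edges.add((min(a, b), max(a, b)))
        question = (f"Nodes 0..{n_nodes - 1} are connected by edges {sorted(edges)}. "
                    f"What is the minimum number of edges on a path from node 0 to node {n_nodes - 1}?")
        # exact answer by breadth-first search
        adj = {v: [] for v in range(n_nodes)}
        for a, b in edges:
            adj[a].append(b)
            adj[b].append(a)
        dist, queue = {0: 0}, deque([0])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        answer = dist.get(n_nodes - 1, "unreachable")
        return question, answer

    q, a = make_puzzle()
    print(q)
    print("exact answer:", a)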
https://arxiv.org/abs/2403.03864
The integration of learning and reasoning is high on the research agenda in AI. Nevertheless, little attention has been paid to using existing background knowledge to reason about partially observed scenes and answer questions about them. Yet we as humans use such knowledge frequently to infer plausible answers to visual questions (by eliminating all inconsistent ones). Such knowledge often comes in the form of constraints about objects and tends to be highly domain- or environment-specific. We contribute a novel benchmark called CLEVR-POC for reasoning-intensive visual question answering (VQA) in partially observable environments under constraints. In CLEVR-POC, knowledge in the form of logical constraints needs to be leveraged to generate plausible answers to questions about a hidden object in a given partial scene. For instance, if one knows that all cups are colored either red, green, or blue and that there is only one green cup, it becomes possible to deduce the color of an occluded cup as either red or blue, provided that all other cups, including the green one, are observed. Through experiments, we observe that the low performance of pre-trained vision-language models like CLIP (~22%) and a large language model (LLM) like GPT-4 (~46%) on CLEVR-POC confirms the need for frameworks that can handle reasoning-intensive tasks where environment-specific background knowledge is available and crucial. Furthermore, our demonstration illustrates that a neuro-symbolic model, which integrates an LLM like GPT-4 with a visual perception network and a formal logical reasoner, exhibits exceptional performance on CLEVR-POC.
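The cup example from the abstract boils down to constraint elimination: enumerate candidate colors for the occluded cup and discard any assignment that violates a scene constraint. The sketch below encodes exactly that toy scene; the data structures are an illustrative simplification of CLEVR-POC's logical constraints.

    CONSTRAINED_COLOURS = {"red", "green", "blue"}   # every cup is one of these
    MAX_GREEN_CUPS = 1                               # there is exactly one green cup

    observed_cup_colours = ["green", "red", "blue"]  # all other cups are visible

    plausible = set()
    for candidate in CONSTRAINED_COLOURS:
        scene = observed_cup_colours + [candidate]
        if scene.count("green") <= MAX_GREEN_CUPS:   # reject assignments that violate a constraint
            plausible.add(candidate)
    print(plausible)                                 # {'red', 'blue'}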
https://arxiv.org/abs/2403.03203