Multimodal retrieval methods have limitations in handling complex, compositional queries that require reasoning about the visual content of both the query and the retrieved entities. Large Multimodal Models (LMMs), on the other hand, can answer more complex visual questions in natural language, but lack the inherent ability to retrieve relevant entities to support their answers. We aim to address these limitations with UniCoRN, a Unified Commented Retrieval Network that combines the strengths of composed multimodal retrieval methods and generative language approaches, going beyond Retrieval-Augmented Generation (RAG). We introduce an entity adapter module to inject the retrieved multimodal entities back into the LMM, so it can attend to them while generating answers and comments. By keeping the base LMM frozen, UniCoRN preserves its original capabilities while performing both retrieval and text generation tasks under a single integrated framework. To assess these new abilities, we introduce the Commented Retrieval task (CoR) and a corresponding dataset, with the goal of retrieving an image that accurately answers a given question and generating an additional textual response that provides further clarification and details about the visual information. We demonstrate the effectiveness of UniCoRN on several datasets, showing improvements of +4.5% recall over the state of the art for composed multimodal retrieval and of +14.9% METEOR / +18.4% BEM over RAG for commenting in CoR.
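The abstract does not include code; as a rough illustration of what an entity adapter could look like, the sketch below projects retrieved entity embeddings into an LMM's hidden space so a frozen decoder can attend to them as extra tokens. All module names, dimensions, and the cross-attention placement are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EntityAdapter(nn.Module):
    """Hypothetical adapter: maps retrieved entity embeddings into the
    frozen LMM's hidden space so they can be attended to as extra tokens."""

    def __init__(self, entity_dim: int = 768, lmm_dim: int = 4096, n_latents: int = 8):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(entity_dim, lmm_dim),
            nn.GELU(),
            nn.Linear(lmm_dim, lmm_dim),
        )
        # a few learned latent tokens summarizing the retrieved set (assumed design)
        self.latents = nn.Parameter(torch.randn(n_latents, lmm_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(lmm_dim, num_heads=8, batch_first=True)

    def forward(self, entity_embs: torch.Tensor) -> torch.Tensor:
        # entity_embs: (batch, n_entities, entity_dim) from the retriever
        kv = self.proj(entity_embs)                      # (B, E, lmm_dim)
        q = self.latents.unsqueeze(0).expand(kv.size(0), -1, -1)
        tokens, _ = self.cross_attn(q, kv, kv)           # (B, n_latents, lmm_dim)
        return tokens                                    # prepended to the LMM input

adapter = EntityAdapter()
entities = torch.randn(2, 5, 768)         # 5 retrieved entities per query
extra_tokens = adapter(entities)          # (2, 8, 4096), fed to the frozen LMM
```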
https://arxiv.org/abs/2502.08254
Vision Large Language Models (VLMs) combine visual understanding with natural language processing, enabling tasks like image captioning, visual question answering, and video analysis. While VLMs show impressive capabilities across domains such as autonomous vehicles, smart surveillance, and healthcare, their deployment on resource-constrained edge devices remains challenging due to processing power, memory, and energy limitations. This survey explores recent advancements in optimizing VLMs for edge environments, focusing on model compression techniques (pruning, quantization, and knowledge distillation) and on specialized hardware solutions that enhance efficiency. We provide a detailed discussion of efficient training and fine-tuning methods, edge deployment challenges, and privacy considerations. Additionally, we discuss the diverse applications of lightweight VLMs across healthcare, environmental monitoring, and autonomous systems, illustrating their growing impact. By highlighting key design strategies, current challenges, and offering recommendations for future directions, this survey aims to inspire further research into the practical deployment of VLMs, ultimately making advanced AI accessible in resource-limited settings.
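As a concrete instance of one compression technique the survey covers, the minimal sketch below applies PyTorch's built-in dynamic INT8 quantization to the linear layers of a stand-in model; which layers to quantize and the resulting accuracy/latency trade-off are model-specific and not prescribed by the survey.

```python
import torch
import torch.nn as nn

# Stand-in for one MLP block of a small VLM (illustrative only).
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)

# Post-training dynamic quantization: weights stored in INT8,
# activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
with torch.no_grad():
    y = quantized(x)
print(y.shape)  # torch.Size([1, 768])
```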
https://arxiv.org/abs/2502.07855
We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance.
https://arxiv.org/abs/2502.07411
Music performances are representative scenarios for audio-visual modeling. Unlike common scenarios with sparse audio, music performances continuously involve dense audio signals throughout. While existing multimodal learning methods for audio-visual QA demonstrate impressive capabilities in general scenarios, they cannot handle fundamental problems within music performances: they underexplore the interaction between the multimodal signals in a performance and fail to consider the distinctive characteristics of instruments and music. Therefore, existing methods tend to answer questions about musical performances inaccurately. To bridge these research gaps, (i) given the intricate multimodal interconnectivity inherent to music data, our primary backbone is designed to incorporate multimodal interactions within the context of music; (ii) to enable the model to learn music characteristics, we annotate and release rhythm and music-source labels for current music datasets; (iii) for time-aware audio-visual modeling, we align the model's music predictions with the temporal dimension. Our experiments achieve state-of-the-art results on the Music AVQA datasets. Our code is available at this https URL.
https://arxiv.org/abs/2502.06710
Med-VQA (Medical Visual Question Answering) is a crucial subtask within the broader VQA (Visual Question Answering) domain. This task requires a visual question answering system to analyze the provided image and corresponding question, offering reasonable analysis and suggestions to assist medical professionals in making pathological diagnoses, or ideally, enabling the system to independently provide correct diagnoses. Furthermore, more advanced Med-VQA tasks involve Referring and Grounding, which not only require the system to accurately comprehend medical images but also to pinpoint specific biological locations within those images. While many large pre-trained models have demonstrated substantial VQA capabilities, challenges persist in the medical imaging domain. The intricacy of biological features in medical images and the scarcity of high-quality medical image datasets, combined with the fact that current models are not tailored for the medical field in terms of architecture and training paradigms, hinder the full exploitation of model generalization. This results in issues such as hallucination in Visual Grounding. In this paper, we introduce the ClinKD model, which incorporates modifications to model position encoding and a diversified training process. Initially, we enhance the model's ability to perceive image and modality variations by using Med-CLIP Guided Rotary Position Embedding. Subsequently, we leverage distillation to provide prior knowledge to the model before using the complete training data. Additionally, a feedback-based training process during the formal training phase further enhances data utilization. Notably, under unchanged evaluation protocols, we achieve a new state-of-the-art performance on the Med-GRIT-270k dataset, and the Med-CLIP Guided Rotary Position Embedding approach shows potential for generalizing to universal model position encoding.
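The abstract does not detail how the Med-CLIP guidance modifies the position encoding, so the sketch below only shows the standard rotary position embedding (RoPE) mechanism being adapted; the Med-CLIP-conditioned part is the paper's contribution and is not reproduced here.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply standard RoPE to x of shape (batch, seq_len, dim), dim even."""
    b, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1, x2) pair by a position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 64)
q_rot = rotary_embed(q)   # position information injected via rotation
```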
https://arxiv.org/abs/2502.05928
Video Quality Assessment (VQA) is vital for large-scale video retrieval systems, aimed at identifying quality issues to prioritize high-quality videos. In industrial systems, low-quality video characteristics fall into four categories: visual-related issues like mosaics and black boxes, textual issues from video titles and OCR content, and semantic issues like frame incoherence and frame-text mismatch from AI-generated videos. Despite their prevalence in industrial settings, these low-quality videos have been largely overlooked in academic research, posing a challenge for accurate identification. To address this, we introduce the Multi-Branch Collaborative Network (MBCN) tailored for industrial video retrieval systems. MBCN features four branches, each designed to tackle one of the aforementioned quality issues. After each branch independently scores videos, we aggregate these scores using a weighted approach and a squeeze-and-excitation mechanism to dynamically address quality issues across different scenarios. We implement point-wise and pair-wise optimization objectives to ensure score stability and reasonableness. Extensive offline and online experiments on a world-level video search engine demonstrate MBCN's effectiveness in identifying video quality issues, significantly enhancing the retrieval system's ranking performance. Detailed experimental analyses confirm the positive contribution of all four evaluation branches. Furthermore, MBCN significantly improves recognition accuracy for low-quality AI-generated videos compared to the baseline.
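A minimal sketch of the aggregation step described above, assuming each branch emits a scalar quality score: a squeeze-and-excitation-style gate reweights the branch scores before a learned weighted sum. Dimensions and the exact gating form are assumptions, not MBCN's implementation.

```python
import torch
import torch.nn as nn

class ScoreAggregator(nn.Module):
    """Fuse per-branch quality scores with an SE-style gate (illustrative)."""

    def __init__(self, n_branches: int = 4, hidden: int = 16):
        super().__init__()
        self.gate = nn.Sequential(          # squeeze-and-excitation over branches
            nn.Linear(n_branches, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_branches),
            nn.Sigmoid(),
        )
        self.weights = nn.Parameter(torch.ones(n_branches) / n_branches)

    def forward(self, branch_scores: torch.Tensor) -> torch.Tensor:
        # branch_scores: (batch, 4), e.g. [visual, title_text, ocr_text, semantic]
        gated = branch_scores * self.gate(branch_scores)
        return (gated * self.weights).sum(dim=-1)        # (batch,) final score

agg = ScoreAggregator()
scores = torch.tensor([[0.9, 0.4, 0.7, 0.2]])
print(agg(scores))
```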
https://arxiv.org/abs/2502.05924
In real-world applications where computational resources are limited, effectively integrating visual and textual information for Visual Question Answering (VQA) presents significant challenges. This paper investigates the performance of traditional models under computational constraints, focusing on enhancing VQA performance, particularly for numerical and counting questions. We evaluate models based on Bidirectional GRU (BidGRU), GRU, Bidirectional LSTM (BidLSTM), and Convolutional Neural Networks (CNN), analyzing the impact of different vocabulary sizes, fine-tuning strategies, and embedding dimensions. Experimental results show that the BidGRU model with an embedding dimension of 300 and a vocabulary size of 3000 achieves the best overall performance without the computational overhead of larger models. Ablation studies emphasize the importance of attention mechanisms and counting information in handling complex reasoning tasks under resource limitations. Our research provides valuable insights for developing more efficient VQA models suitable for deployment in environments with limited computational capacity.
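A compact sketch of the kind of model compared in the study: a bidirectional GRU question encoder (vocabulary size 3000 and embedding dimension 300, as reported best) fused with pre-extracted image region features through simple attention. Hidden sizes and the classifier head are assumptions.

```python
import torch
import torch.nn as nn

class BidGRUVQA(nn.Module):
    def __init__(self, vocab_size=3000, emb_dim=300, hidden=512,
                 img_dim=2048, n_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden + img_dim, 1)     # per-region attention
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden + img_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_answers),
        )

    def forward(self, question_ids, img_feats):
        # question_ids: (B, T); img_feats: (B, R, img_dim) pre-extracted regions
        _, h = self.gru(self.embed(question_ids))
        q = torch.cat([h[0], h[1]], dim=-1)               # (B, 2*hidden)
        q_exp = q.unsqueeze(1).expand(-1, img_feats.size(1), -1)
        alpha = torch.softmax(self.att(torch.cat([q_exp, img_feats], -1)), dim=1)
        v = (alpha * img_feats).sum(dim=1)                # attended image feature
        return self.classifier(torch.cat([q, v], dim=-1))

model = BidGRUVQA()
logits = model(torch.randint(1, 3000, (2, 14)), torch.randn(2, 36, 2048))
```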
https://arxiv.org/abs/2502.05738
While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to preserve scene attributes in generated images consistent with a multimodal context, i.e. a reference image with accompanying text guidance query. To address this, we introduce Hummingbird, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained Consistency Rewards to ensure generated images preserve the scene attributes of reference images in relation to the text guidance while maintaining diversity. As the first model to address the task of maintaining both diversity and fidelity given a multimodal context, we introduce a new benchmark formulation incorporating MME Perception and Bongard HOI datasets. Benchmark experiments show Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird's potential as a robust multimodal context-aligned image generator in complex visual tasks.
https://arxiv.org/abs/2502.05153
Continual Learning in Visual Question Answering (VQACL) requires models to learn new visual-linguistic tasks (plasticity) while retaining knowledge from previous tasks (stability). The multimodal nature of VQACL presents unique challenges, requiring models to balance stability across visual and textual domains while maintaining plasticity to adapt to novel objects and reasoning tasks. Existing methods, predominantly designed for unimodal tasks, often struggle to balance these demands effectively. In this work, we introduce QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularisation, eliminating the need to store visual data and addressing both memory and privacy concerns. QUAD achieves stability by introducing a question-only replay mechanism that selectively uses questions from previous tasks to prevent overfitting to the current task's answer space, thereby mitigating the out-of-answer-set problem. Complementing this, we propose attention consistency distillation, which uniquely enforces both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA.
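A hedged sketch of what the attention-consistency term could look like: attention maps from a frozen snapshot of the previous-task model act as targets for the current model's maps, with both intra-modal and inter-modal maps penalized. The exact map definitions and loss weighting used by QUAD are not specified in the abstract, so these are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_consistency_loss(curr_maps, prev_maps):
    """curr_maps / prev_maps: dicts of attention tensors, e.g.
    {'intra_text': (B, H, Tq, Tq), 'inter': (B, H, Tq, Tv)} from the
    current model and the frozen previous-task model."""
    loss = 0.0
    for key in curr_maps:
        p = F.log_softmax(curr_maps[key], dim=-1)
        q = F.softmax(prev_maps[key].detach(), dim=-1)
        loss = loss + F.kl_div(p, q, reduction="batchmean")
    return loss / len(curr_maps)

curr = {"intra_text": torch.randn(2, 8, 12, 12), "inter": torch.randn(2, 8, 12, 36)}
prev = {k: v + 0.1 * torch.randn_like(v) for k, v in curr.items()}
print(attention_consistency_loss(curr, prev))
```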
https://arxiv.org/abs/2502.04469
Multiple works have emerged to push the boundaries of multi-modal large language models (MLLMs) towards pixel-level understanding. Such approaches have shown strong performance on benchmarks for referring expression segmentation and grounded conversation generation. The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering. Surprisingly, some of these methods even degrade the grounding ability of MLLMs that were never trained with such supervision. In this work, we propose two novel challenging benchmarks and show that MLLMs without pixel-level grounding supervision can outperform the state of the art in such tasks when evaluated on both pixel-level grounding and visual question answering. We propose simple baselines for extracting grounding information that can be plugged into any MLLM, which we call PixFoundation. More importantly, we study the research question of "When does grounding emerge in MLLMs that are not trained with pixel-level grounding supervision?" We show that grounding can coincide with object parts or location/appearance information. Code repository is at this https URL.
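The abstract does not spell out the extraction procedure; one plausible plug-in baseline, shown below, pools the cross-attention an MLLM places on image patches for the answer tokens, upsamples it, and thresholds it into a coarse mask. Shapes and the thresholding rule are assumptions, not the PixFoundation method.

```python
import torch
import torch.nn.functional as F

def attention_to_mask(patch_attn: torch.Tensor, grid: int = 24,
                      out_size: int = 336, thresh: float = 0.6) -> torch.Tensor:
    """patch_attn: (n_answer_tokens, grid*grid) attention over image patches.
    Returns a binary (out_size, out_size) mask."""
    heat = patch_attn.mean(dim=0).reshape(1, 1, grid, grid)      # pool over tokens
    heat = F.interpolate(heat, size=(out_size, out_size), mode="bilinear",
                         align_corners=False)[0, 0]
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    return (heat > thresh).float()

mask = attention_to_mask(torch.rand(5, 576))
print(mask.shape, mask.mean())
```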
https://arxiv.org/abs/2502.04192
We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, and object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in the wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro only achieves 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC. HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.
https://arxiv.org/abs/2502.04144
Vision-language models (VLMs) excel in tasks such as visual question answering and image captioning. However, VLMs are often limited by their use of pretrained image encoders, like CLIP, leading to image understanding errors that hinder overall performance. On top of that, real-world applications often require the model to be continuously adapted as new and often limited data continuously arrive. To address this, we propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model's general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency. Specifically, we demonstrate that LoRSU reduces computational overhead by over 25x compared to full VLM updates, without sacrificing performance. Experimental results on VQA tasks in the few-shot continual learning setting validate LoRSU's scalability, efficiency, and effectiveness, making it a compelling solution for image encoder adaptation in resource-constrained environments.
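A rough sketch of the general idea, under stated assumptions: attach a low-rank additive update to a frozen encoder layer so that only the small update matrices are trained. The criterion LoRSU uses to pick which parameters to update, and the exact structure of its updates, are not reproduced here.

```python
import torch
import torch.nn as nn

class LowRankUpdate(nn.Module):
    """Additive low-rank delta W + (B @ A) for a frozen linear layer (illustrative)."""

    def __init__(self, linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad = False          # base encoder weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ (self.B @ self.A).T

# Wrap only the layers selected as "critical" (selection heuristic not shown).
encoder_block = nn.Linear(768, 768)          # stand-in for one ViT projection
adapted = LowRankUpdate(encoder_block, rank=8)
y = adapted(torch.randn(2, 768))             # only A and B receive gradients
```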
https://arxiv.org/abs/2502.04098
The advent of next-generation video generation models like Sora poses challenges for AI-generated content (AIGC) video quality assessment (VQA). These models substantially mitigate the flickering artifacts prevalent in prior models, enable longer and more complex text prompts, and generate longer videos with intricate, diverse motion patterns. Conventional VQA methods designed for simple text and basic motion patterns struggle to evaluate these content-rich videos. To this end, we propose CRAVE (Content-Rich AIGC Video Evaluator), specifically for the evaluation of Sora-era AIGC videos. CRAVE proposes a multi-granularity text-temporal fusion that aligns long-form complex textual semantics with video dynamics. Additionally, CRAVE leverages hybrid motion-fidelity modeling to assess temporal artifacts. Furthermore, given the straightforward prompts and content in current AIGC VQA datasets, we introduce CRAVE-DB, a benchmark featuring content-rich videos from next-generation models paired with elaborate prompts. Extensive experiments show that the proposed CRAVE achieves excellent results on multiple AIGC VQA benchmarks, demonstrating a high degree of alignment with human perception. All data and code will be publicly available at this https URL.
https://arxiv.org/abs/2502.04076
The performance of Variational Quantum Algorithms (VQAs) strongly depends on the choice of the parameterized quantum circuit to optimize. One of the biggest challenges in VQAs is designing quantum circuits tailored to the particular problem and to the quantum hardware. This article proposes a gradient-free Monte Carlo Tree Search (MCTS) technique to automate the process of quantum circuit design. It introduces a novel formulation of the action space based on a sampling scheme and a progressive widening technique to explore the space dynamically. When tested on the domain of random quantum circuits, our MCTS approach approximates unstructured circuits under different values of stabilizer Rényi entropy. It turns out that MCTS manages to approximate the benchmark quantum states independently of their degree of nonstabilizerness. Next, our technique exhibits robustness across various application domains, including quantum chemistry and systems of linear equations. Compared to previous MCTS research, our technique reduces the number of quantum circuit evaluations by a factor of 10 to 100 while achieving equal or better results. In addition, the resulting quantum circuits have up to three times fewer CNOT gates.
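A minimal sketch of the progressive-widening rule the abstract refers to: a new action (here, a sampled gate placement) is added to a node only while the number of children stays below a slowly growing function of the visit count. The constants and the gate-sampling routine are placeholders, not the paper's settings.

```python
import random

class Node:
    def __init__(self):
        self.visits = 0
        self.children = {}          # action -> Node

def progressive_widening(node, sample_action, c=1.0, alpha=0.5):
    """Add a newly sampled action only while |children| <= c * N(node)^alpha,
    so the effective branching factor grows slowly with the visit count."""
    if len(node.children) <= c * (node.visits ** alpha):
        action = sample_action()
        node.children.setdefault(action, Node())
    return node.children

def sample_gate():
    # placeholder action sampler: (gate, qubit) pairs
    return (random.choice(["rx", "rz", "cnot"]), random.randrange(4))

root = Node()
for _ in range(50):
    root.visits += 1
    progressive_widening(root, sample_gate)
print(len(root.children), "candidate gate placements at the root")
```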
https://arxiv.org/abs/2502.03962
Document Visual Question Answering (DocVQA) has introduced a new paradigm for end-to-end document understanding, and quickly became one of the standard benchmarks for multimodal LLMs. Automating document processing workflows, driven by DocVQA models, presents significant potential for many business sectors. However, documents tend to contain highly sensitive information, raising concerns about privacy risks associated with training such DocVQA models. One significant privacy vulnerability, exploited by the membership inference attack, is the possibility for an adversary to determine if a particular record was part of the model's training data. In this paper, we introduce two novel membership inference attacks tailored specifically to DocVQA models. These attacks are designed for two different adversarial scenarios: a white-box setting, where the attacker has full access to the model architecture and parameters, and a black-box setting, where only the model's outputs are available. Notably, our attacks assume the adversary lacks access to auxiliary datasets, which is more realistic in practice but also more challenging. Our unsupervised methods outperform existing state-of-the-art membership inference attacks across a variety of DocVQA models and datasets, demonstrating their effectiveness and highlighting the privacy risks in this domain.
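As a generic illustration of the black-box setting (not the attacks proposed in the paper), the sketch below scores each candidate record by the target model's loss and flags the lowest-loss records as likely training members, using only the model's outputs and no auxiliary data for calibration; the paper's unsupervised attacks are more elaborate.

```python
import numpy as np

def score_based_membership(losses: np.ndarray, threshold: float = None) -> np.ndarray:
    """losses: per-record loss of the target model on candidate records.
    Records with unusually low loss are predicted to be training members."""
    if threshold is None:
        threshold = np.median(losses)          # crude unsupervised cut-off
    return (losses < threshold).astype(int)    # 1 = predicted member

# toy example: members tend to have lower loss than non-members
member_losses = np.random.gamma(2.0, 0.3, size=100)
nonmember_losses = np.random.gamma(2.0, 0.6, size=100)
losses = np.concatenate([member_losses, nonmember_losses])
preds = score_based_membership(losses)
print("flagged as members:", preds.sum())
```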
https://arxiv.org/abs/2502.03692
Designing datasets for Visual Question Answering (VQA) is a difficult and complex task that requires NLP for parsing the question and computer vision for analysing the relevant aspects of the image for answering it. Several benchmark datasets have been developed by researchers, but there are many issues with using them for methodical performance tests. This paper proposes a new benchmark dataset -- a pilot version called VQA-Levels is ready now -- for testing VQA systems systematically and assisting researchers in advancing the field. The questions are classified into seven levels, ranging from direct answers based on low-level image features (without needing even a classifier) to those requiring high-level abstraction of the entire image content. The questions in the dataset exhibit one or many of ten properties. Each is categorised into a specific level from 1 to 7. Levels 1 - 3 are directly on the visual content while the remaining levels require extra knowledge about the objects in the image. Each question generally has a unique one or two-word answer. The questions are 'natural' in the sense that a human is likely to ask such a question when seeing the images. An example question at Level 1 is, "What is the shape of the red colored region in the image?" while at Level 7, it is, "Why is the man cutting the paper?". Initial testing of the proposed dataset on some existing VQA systems reveals that their success is high on Level 1 (low-level features) and Level 2 (object classification) questions, lowest on Level 3 (scene text), followed by Level 6 (extrapolation) and Level 7 (whole scene analysis) questions. The work in this paper will go a long way toward the systematic analysis of VQA systems.
https://arxiv.org/abs/2502.02951
Spatial reasoning is an important component of human cognition and is an area in which the latest vision-language models (VLMs) show signs of difficulty. Current analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating spatial reasoning by VLMs. This platform enables a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.
https://arxiv.org/abs/2502.04359
Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at this https URL.
https://arxiv.org/abs/2502.01576
The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduce Hypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models' ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive experiments show that state-of-the-art foundation models struggle to reason in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the context change is irrelevant to the question, models often incorrectly adjust their answers.
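A small sketch of the anchor-based world frame idea: directional terms in questions and change descriptions are resolved against one fixed anchor orientation per scene rather than the camera pose. The rotation convention and the four-direction vocabulary below are assumptions for illustration.

```python
import numpy as np

def direction_in_world(direction: str, anchor_yaw_deg: float) -> np.ndarray:
    """Map a directional term to a unit vector in the scene's global frame,
    given the yaw of the anchor object that defines 'front' for the scene."""
    local = {
        "front": np.array([1.0, 0.0]),
        "back":  np.array([-1.0, 0.0]),
        "left":  np.array([0.0, 1.0]),
        "right": np.array([0.0, -1.0]),
    }[direction]
    yaw = np.deg2rad(anchor_yaw_deg)
    rot = np.array([[np.cos(yaw), -np.sin(yaw)],
                    [np.sin(yaw),  np.cos(yaw)]])
    return rot @ local

print(direction_in_world("left", anchor_yaw_deg=90.0))   # global-frame vector
```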
https://arxiv.org/abs/2502.00954
In this paper, we propose a novel approach for solving the Visual Question Answering (VQA) task in autonomous driving by integrating Vision-Language Models (VLMs) with continual learning. In autonomous driving, VQA plays a vital role in enabling the system to understand and reason about its surroundings. However, traditional models often struggle with catastrophic forgetting when sequentially exposed to new driving tasks, such as perception, prediction, and planning, each requiring different forms of knowledge. To address this challenge, we present a novel continual learning framework that combines VLMs with selective memory replay and knowledge distillation, reinforced by task-specific projection layer regularization. The knowledge distillation allows a previously trained model to act as a "teacher" to guide the model through subsequent tasks, minimizing forgetting. Meanwhile, task-specific projection layers calculate the loss based on the divergence of feature representations, ensuring continuity in learning and reducing the shift between tasks. Evaluated on the DriveLM dataset, our framework shows substantial performance improvements, with gains ranging from 21.40% to 32.28% across various metrics. These results highlight the effectiveness of combining continual learning with VLMs in enhancing the resilience and reliability of VQA systems in autonomous driving. We will release our source code.
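A hedged sketch of the two auxiliary loss terms the abstract names: a distillation term that keeps the current model's answer distribution close to the previous-task "teacher", plus a regularizer on the divergence between task-specific projection features. Temperatures, loss weights, and feature choices are assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def continual_vqa_loss(student_logits, teacher_logits,
                       student_feat, teacher_feat,
                       targets, T=2.0, lam_kd=1.0, lam_proj=0.5):
    # supervised loss on the current driving task (perception/prediction/planning)
    ce = F.cross_entropy(student_logits, targets)
    # knowledge distillation from the frozen previous-task model
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits.detach() / T, dim=-1),
                  reduction="batchmean") * T * T
    # projection-layer regularization: penalize drift of feature representations
    proj = 1.0 - F.cosine_similarity(student_feat, teacher_feat.detach(), dim=-1).mean()
    return ce + lam_kd * kd + lam_proj * proj

loss = continual_vqa_loss(torch.randn(4, 100), torch.randn(4, 100),
                          torch.randn(4, 256), torch.randn(4, 256),
                          torch.randint(0, 100, (4,)))
print(loss)
```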
https://arxiv.org/abs/2502.00843