It's no secret that video has become the primary way we share information online, which is why demand for algorithms that can analyze and understand video content has surged, a trend that will only continue as video dominates the digital landscape. These algorithms extract and classify relevant features from a video and use them to describe the events and objects it contains. Deep neural networks have shown encouraging results in feature extraction and video description. This paper explores the spatiotemporal features found in videos and recent advances in deep neural networks for video understanding. We review the main trends in video understanding models and their architectural design, the main open problems, and some of the solutions proposed for them. We also review and compare the significant video understanding and action recognition datasets.
https://arxiv.org/abs/2502.07277
Transformers have become foundational for visual tasks such as object detection, semantic segmentation, and video understanding, but their quadratic complexity in attention mechanisms presents scalability challenges. To address these limitations, the Mamba architecture utilizes state-space models (SSMs) for linear scalability, efficient processing, and improved contextual awareness. This paper investigates Mamba architecture for visual domain applications and its recent advancements, including Vision Mamba (ViM) and VideoMamba, which introduce bidirectional scanning, selective scanning mechanisms, and spatiotemporal processing to enhance image and video understanding. Architectural innovations like position embeddings, cross-scan modules, and hierarchical designs further optimize the Mamba framework for global and local feature extraction. These advancements position Mamba as a promising architecture in computer vision research and applications.
https://arxiv.org/abs/2502.07161
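The linear-time recurrence these Mamba-style models rely on can be illustrated with a toy selective scan. The NumPy sketch below uses made-up shapes and random parameters; it is only a schematic of input-dependent (selective) discretization of a diagonal state-space model, not the ViM or VideoMamba implementation.

```python
import numpy as np

def selective_ssm_scan(x, A, W_B, W_C, W_dt):
    """Toy selective state-space scan (illustrative, not the official Mamba kernel).

    x:    (T, D)  input token features
    A:    (D, N)  negative-real diagonal state matrix (per channel)
    W_B, W_C: (D, N) per-channel weights used to make B and C input-dependent
    W_dt: (D,)    weights producing a per-step, per-channel step size
    Returns y: (T, D)
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                            # hidden state per channel
    y = np.zeros((T, D))
    for t in range(T):
        xt = x[t]                                   # (D,)
        dt = np.log1p(np.exp(xt * W_dt))            # softplus -> positive step size, (D,)
        B = xt[:, None] * W_B                       # input-dependent input matrix, (D, N)
        C = xt[:, None] * W_C                       # input-dependent output matrix, (D, N)
        A_bar = np.exp(dt[:, None] * A)             # zero-order-hold discretization, (D, N)
        B_bar = dt[:, None] * B
        h = A_bar * h + B_bar * xt[:, None]         # linear recurrence: O(T) overall
        y[t] = (h * C).sum(axis=1)                  # readout per channel
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, N = 16, 8, 4                              # sequence length, channels, state size
    y = selective_ssm_scan(rng.normal(size=(T, D)),
                           -np.abs(rng.normal(size=(D, N))),   # stable (negative) dynamics
                           rng.normal(size=(D, N)) * 0.1,
                           rng.normal(size=(D, N)) * 0.1,
                           rng.normal(size=(D,)) * 0.1)
    print(y.shape)  # (16, 8)
```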
The explosive growth of video data has driven the development of distributed video analytics in cloud-edge-terminal collaborative (CETC) systems, enabling efficient video processing, real-time inference, and privacy-preserving analysis. Among multiple advantages, CETC systems can distribute video processing tasks and enable adaptive analytics across cloud, edge, and terminal devices, leading to breakthroughs in video surveillance, autonomous driving, and smart cities. In this survey, we first analyze fundamental architectural components, including hierarchical, distributed, and hybrid frameworks, alongside edge computing platforms and resource management mechanisms. Building upon these foundations, edge-centric approaches emphasize on-device processing, edge-assisted offloading, and edge intelligence, while cloud-centric methods leverage powerful computational capabilities for complex video understanding and model training. Our investigation also covers hybrid video analytics incorporating adaptive task offloading and resource-aware scheduling techniques that optimize performance across the entire system. Beyond conventional approaches, recent advances in large language models and multimodal integration reveal both opportunities and challenges in platform scalability, data protection, and system reliability. Future directions also encompass explainable systems, efficient processing mechanisms, and advanced video analytics, offering valuable insights for researchers and practitioners in this dynamic field.
https://arxiv.org/abs/2502.06581
Multi-modal Large Language Models (MLLMs) struggle with long videos because they require an excessive number of visual tokens. These tokens massively exceed the context length of MLLMs, so the context ends up filled with redundant, task-irrelevant shots. How to select shots is an unsolved, critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adapted to the semantics of the video-understanding task by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding that identifies task-relevant shots, and (2) a video co-reasoning module that deploys this binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selection into the original video, focusing the model on relevant context and improving long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is given at this https URL.
https://arxiv.org/abs/2502.06428
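A rough sketch of the shot-selection idea described above: score each shot embedding against a task embedding, derive a binary keep/drop code, and pair the kept (positive) shots with the lowest-scoring (negative) ones. The function names, the similarity measure, and the keep ratio are assumptions for illustration; this is not the released CoS code.

```python
import numpy as np

def binary_shot_selection(shot_emb, task_emb, keep_ratio=0.25):
    """Illustrative binary coding over shots (pseudo temporal grounding).

    shot_emb: (S, D) one embedding per shot
    task_emb: (D,)   embedding of the task / question
    Returns (binary_code, positive_idx, negative_idx).
    """
    # Cosine similarity between each shot and the task.
    shots = shot_emb / np.linalg.norm(shot_emb, axis=1, keepdims=True)
    task = task_emb / np.linalg.norm(task_emb)
    scores = shots @ task                          # (S,)

    k = max(1, int(len(scores) * keep_ratio))
    order = np.argsort(-scores)
    positive_idx = np.sort(order[:k])              # task-relevant shots
    negative_idx = np.sort(order[-k:])             # clearly irrelevant shots, used as contrast

    binary_code = np.zeros(len(scores), dtype=int)
    binary_code[positive_idx] = 1
    return binary_code, positive_idx, negative_idx

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    code, pos, neg = binary_shot_selection(rng.normal(size=(40, 32)), rng.normal(size=32))
    print(code.sum(), pos[:5], neg[:5])
```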
Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model's limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at this https URL.
https://arxiv.org/abs/2502.06020
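The budgeted selection that TWM describes can be pictured as query-guided attention over temporal segments followed by a top-k cut. The sketch below is a minimal, hypothetical rendering of that idea; it omits how the retained segments are fed back into the host model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def temporal_working_memory(segment_emb, query_emb, budget=8):
    """Keep only the segments most attended by the query (illustrative TWM-style selection).

    segment_emb: (T, D) embeddings of temporal segments (video/audio chunks)
    query_emb:   (D,)   task query embedding
    budget:      number of segments the limited 'working memory' can hold
    """
    attn = softmax(segment_emb @ query_emb / np.sqrt(segment_emb.shape[1]))  # (T,)
    keep = np.sort(np.argsort(-attn)[:budget])     # indices of the most informative segments
    return keep, segment_emb[keep], attn

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    keep, mem, attn = temporal_working_memory(rng.normal(size=(120, 64)), rng.normal(size=64))
    print(keep.shape, mem.shape)   # (8,) (8, 64)
```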
Establishing the long-context capability of large vision-language models is crucial for video understanding, high-resolution image understanding, multi-modal agents, and reasoning. We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing image, video, and text modalities over 4K frames or 1M tokens while delivering advanced performance on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks compared against recent cutting-edge models that rely on internal data. Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
https://arxiv.org/abs/2502.05177
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at this https URL.
https://arxiv.org/abs/2502.05173
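A toy rendering of the general idea of splitting rotary channels across the (t, h, w) axes, with slower (lower-frequency) rotation on the temporal axis. The channel allocation, base frequency, and temporal scaling factor below are invented for illustration and do not reproduce VideoRoPE's diagonal layout or adjustable temporal spacing.

```python
import numpy as np

def rope_angles(pos, dim, base):
    """Rotation angles for `dim` channel pairs at the given positions."""
    inv_freq = base ** (-np.arange(dim) / dim)          # (dim,)
    return pos[:, None] * inv_freq[None, :]             # (P, dim)

def apply_rope_pairs(x, angles):
    """Rotate consecutive channel pairs of x by the given angles (standard RoPE rotation)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def video_rope_3d(x, t, h, w, dims=(16, 8, 8), t_scale=4.0):
    """Toy 3D rotary embedding: channel pairs are split across (t, h, w).

    x: (L, D) token features with D = 2 * sum(dims)
    t, h, w: (L,) temporal / spatial positions per token
    t_scale: >1 stretches temporal spacing, i.e. lower-frequency temporal rotation
    """
    d_t, d_h, d_w = dims
    ang = np.concatenate([
        rope_angles(t / t_scale, d_t, base=10000.0),     # low-frequency temporal axis
        rope_angles(h.astype(float), d_h, base=10000.0),
        rope_angles(w.astype(float), d_w, base=10000.0),
    ], axis=1)                                           # (L, D/2)
    return apply_rope_pairs(x, ang)

if __name__ == "__main__":
    T, H, W, D = 4, 3, 3, 64
    t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    x = np.random.default_rng(3).normal(size=(T * H * W, D))
    y = video_rope_3d(x, t.ravel(), h.ravel(), w.ravel())
    print(y.shape)  # (36, 64)
```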
In this paper, we introduce WorldSense, the first benchmark for assessing multi-modal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, WorldSense has several features: (i) collaboration of omni-modality: we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively exploit the synergistic perception of omni-modality; (ii) diversity of videos and tasks: WorldSense encompasses a diverse collection of 1,662 audio-visually synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover broad scenarios, along with 3,172 multiple-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation; (iii) high-quality annotations: all QA pairs are manually labeled by 80 expert annotators over multiple rounds of correction to ensure quality. Based on WorldSense, we extensively evaluate various state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (48.0% best accuracy). We hope WorldSense can provide a platform for evaluating the ability to construct and understand coherent contexts from omni-modality.
https://arxiv.org/abs/2502.04326
The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
https://arxiv.org/abs/2502.03459
Modern Video Large Language Models (VLLMs) often rely on uniform frame sampling for video understanding, but this approach frequently fails to capture critical information due to frame redundancy and variations in video content. We propose MaxInfo, a training-free method based on the maximum volume principle, which selects and retains the most representative frames from the input video. By maximizing the geometric volume formed by selected embeddings, MaxInfo ensures that the chosen frames cover the most informative regions of the embedding space, effectively reducing redundancy while preserving diversity. This method enhances the quality of input representations and improves long video comprehension performance across benchmarks. For instance, MaxInfo achieves a 3.28% improvement on LongVideoBench and a 6.4% improvement on EgoSchema for LLaVA-Video-7B. It also achieves a 3.47% improvement for LLaVA-Video-72B. The approach is simple to implement and works with existing VLLMs without the need for additional training, making it a practical and effective alternative to traditional uniform sampling methods.
https://arxiv.org/abs/2502.03183
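The maximum-volume principle can be approximated greedily: repeatedly add the frame whose embedding has the largest component orthogonal to the span of the frames already chosen, which grows the volume of the selected set at each step. The sketch below is a simplified stand-in for that procedure, not the authors' implementation.

```python
import numpy as np

def greedy_max_volume_frames(frame_emb, k):
    """Greedy frame selection that approximately maximizes the volume spanned
    by the selected embeddings (illustrative, not the official MaxInfo code).

    frame_emb: (T, D) one embedding per candidate frame
    k:         number of frames to keep
    Returns sorted indices of the selected frames.
    """
    residual = frame_emb.astype(float).copy()          # components orthogonal to selections
    selected = []
    for _ in range(min(k, len(frame_emb))):
        norms = np.linalg.norm(residual, axis=1)
        norms[selected] = -1.0                          # never re-pick a frame
        i = int(np.argmax(norms))
        if norms[i] <= 1e-8:                            # remaining frames add no new volume
            break
        selected.append(i)
        q = residual[i] / norms[i]                      # newly covered direction
        residual -= np.outer(residual @ q, q)           # Gram-Schmidt style deflation
    return sorted(selected)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    emb = rng.normal(size=(300, 128))                   # e.g. CLIP-style frame embeddings
    print(greedy_max_volume_frames(emb, k=16))
```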
Action Quality Assessment (AQA) -- the ability to quantify the quality of human motion, actions, or skill levels and provide feedback -- has far-reaching implications in areas such as low-cost physiotherapy, sports training, and workforce development. As such, it has become a critical field in computer vision and video understanding over the past decade. Significant progress has been made in AQA methodologies, datasets, and applications, yet a pressing need remains for a comprehensive synthesis of this rapidly evolving field. In this paper, we present a thorough survey of the AQA landscape, systematically reviewing over 200 research papers using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework. We begin by covering foundational concepts and definitions, then move to general frameworks and performance metrics, and finally discuss the latest advances in methodologies and datasets. This survey provides a detailed analysis of research trends, performance comparisons, challenges, and future directions. Through this work, we aim to offer a valuable resource for both newcomers and experienced researchers, promoting further exploration and progress in AQA. Data are available at this https URL.
https://arxiv.org/abs/2502.02817
Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen next, all at once. To endow autonomous systems with such holistic perception, it is essential to learn how to correlate concepts, abstract knowledge across diverse tasks, and leverage task synergies when learning novel skills. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, which is essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by also enabling reasoning across diverse temporal granularities, expanding its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning, equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.
https://arxiv.org/abs/2502.02487
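As a schematic of multi-granularity temporal reasoning (not the Hier-EgoPack GNN), the toy below pools frame features into coarser clip-level nodes and lets the two levels exchange simple mean-aggregated messages; the segment length, mixing weights, and number of rounds are arbitrary.

```python
import numpy as np

def mean_pool_segments(frame_feat, seg_len):
    """Pool frame-level features into coarser clip-level nodes."""
    T, D = frame_feat.shape
    n_seg = int(np.ceil(T / seg_len))
    pooled = np.zeros((n_seg, D))
    for s in range(n_seg):
        pooled[s] = frame_feat[s * seg_len:(s + 1) * seg_len].mean(axis=0)
    return pooled

def cross_granularity_message_passing(frame_feat, seg_len=8, rounds=2):
    """Toy two-level temporal graph: frames exchange messages with their clip node.

    frame_feat: (T, D) per-frame features
    Returns refined (frame_feat, clip_feat).
    """
    clip_feat = mean_pool_segments(frame_feat, seg_len)
    frames = frame_feat.copy()
    for _ in range(rounds):
        # bottom-up: clips aggregate their frames
        clip_feat = 0.5 * clip_feat + 0.5 * mean_pool_segments(frames, seg_len)
        # top-down: each frame mixes in its clip's summary
        parent = clip_feat[np.arange(len(frames)) // seg_len]
        frames = 0.5 * frames + 0.5 * parent
    return frames, clip_feat

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    f, c = cross_granularity_message_passing(rng.normal(size=(40, 16)))
    print(f.shape, c.shape)  # (40, 16) (5, 16)
```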
We present TUMTraffic-VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraffic-VideoQA unifies three essential tasks (multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding) within a cohesive evaluation framework. We further introduce the TUMTraffic-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset's complexity, highlight the limitations of existing models, and position TUMTraffic-VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.
https://arxiv.org/abs/2502.02449
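The tuple-based spatio-temporal object expressions can be pictured as small records tying an object reference to a time window and a box; the field names below are hypothetical and do not follow the released annotation schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SpatioTemporalObject:
    """Hypothetical tuple-style reference to an object in a traffic video."""
    object_id: str                              # e.g. "vehicle_017"
    category: str                               # e.g. "truck"
    t_start: float                              # seconds from the start of the clip
    t_end: float
    bbox: Tuple[float, float, float, float]     # (x_min, y_min, x_max, y_max), normalized

    def overlaps(self, t: float) -> bool:
        return self.t_start <= t <= self.t_end

# Example usage: ground a caption like "the truck entering from the left" to a tuple.
obj = SpatioTemporalObject("vehicle_017", "truck", 12.4, 18.9, (0.05, 0.40, 0.32, 0.78))
print(obj.overlaps(15.0))   # True
```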
Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique enabling support for longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to a 5.58× end-to-end speedup compared to existing approaches.
https://arxiv.org/abs/2502.02406
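The communication pattern, keeping the large key-value blocks where they are and moving only the small query block, can be simulated in a single process: each "GPU" computes partial attention over its local KV shard and the partial softmax statistics are merged exactly. This is an illustrative NumPy sketch of that idea, not the LV-XAttn implementation.

```python
import numpy as np

def local_partial_attention(q, k, v):
    """Partial attention of q against one key-value shard, returning the softmax
    statistics (row max and row sum) so shards can be merged exactly later."""
    scores = q @ k.T / np.sqrt(q.shape[-1])             # (Q, Kv)
    m = scores.max(axis=1, keepdims=True)               # (Q, 1)
    p = np.exp(scores - m)
    return m, p.sum(axis=1, keepdims=True), p @ v       # max, denom, unnormalized output

def sharded_cross_attention(q, kv_shards):
    """Exact cross-attention when key-value blocks stay on their 'GPU' and only the
    (much smaller) query block moves between shards. Single-process simulation only."""
    m = np.full((q.shape[0], 1), -np.inf)
    denom = np.zeros((q.shape[0], 1))
    out = np.zeros((q.shape[0], kv_shards[0][1].shape[1]))
    for k, v in kv_shards:                               # query block "visits" each shard
        m_i, d_i, o_i = local_partial_attention(q, k, v)
        m_new = np.maximum(m, m_i)
        alpha, beta = np.exp(m - m_new), np.exp(m_i - m_new)
        denom = alpha * denom + beta * d_i               # online-softmax merge of statistics
        out = alpha * out + beta * o_i
        m = m_new
    return out / denom

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    q = rng.normal(size=(4, 32))                         # small query block (e.g. text tokens)
    kv_all = rng.normal(size=(4096, 32)), rng.normal(size=(4096, 32))
    shards = [(kv_all[0][i::4], kv_all[1][i::4]) for i in range(4)]  # 4 simulated GPUs
    ref_m, ref_d, ref_o = local_partial_attention(q, *kv_all)
    assert np.allclose(sharded_cross_attention(q, shards), ref_o / ref_d, atol=1e-8)
    print("sharded result matches single-device attention")
```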
Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in enhancing Large Language Models (LLMs) through external knowledge integration, yet its application has primarily focused on textual content, leaving the rich domain of multi-modal video knowledge predominantly unexplored. This paper introduces VideoRAG, the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Our core innovation lies in its dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features. This novel design empowers VideoRAG to process unlimited-length videos by constructing precise knowledge graphs that span multiple videos while maintaining semantic dependencies through specialized multi-modal retrieval paradigms. In a comprehensive empirical evaluation on our proposed LongerVideos benchmark, which comprises over 160 videos totaling more than 134 hours across lecture, documentary, and entertainment categories, VideoRAG demonstrates substantial gains over existing RAG alternatives and long video understanding methods. The source code of the VideoRAG implementation and the benchmark dataset are openly available at: this https URL.
https://arxiv.org/abs/2502.01549
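The graph-based textual channel can be pictured as a cross-video entity graph with provenance back to video segments. The toy below uses invented helper names, omits the multi-modal context encoding entirely, and is a sketch of the idea rather than the released VideoRAG code.

```python
import itertools
from collections import defaultdict

def build_entity_graph(video_segments):
    """Toy cross-video knowledge graph: entities are nodes, edges link entities
    mentioned in the same segment, and each entity remembers its source segments.

    video_segments: list of (video_id, segment_id, set_of_entities)
    """
    graph = defaultdict(set)                  # entity -> neighbouring entities
    provenance = defaultdict(set)             # entity -> segments that mention it
    for video_id, seg_id, entities in video_segments:
        for e in entities:
            provenance[e].add((video_id, seg_id))
        for a, b in itertools.combinations(sorted(entities), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph, provenance

def retrieve_segments(query_entities, graph, provenance, hops=1):
    """Return segments mentioning the query entities or their graph neighbours."""
    frontier = set(query_entities)
    for _ in range(hops):
        frontier |= {n for e in frontier for n in graph.get(e, ())}
    return sorted({seg for e in frontier for seg in provenance.get(e, ())})

if __name__ == "__main__":
    segments = [
        ("lecture_01", 3, {"transformer", "attention", "softmax"}),
        ("lecture_01", 7, {"attention", "kv cache"}),
        ("doc_02", 1, {"softmax", "temperature"}),
    ]
    g, prov = build_entity_graph(segments)
    print(retrieve_segments({"kv cache"}, g, prov))   # reaches lecture_01 via 'attention'
```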
Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model, designed to excel across diverse domains. AIN is an English-Arabic bilingual LMM that leverages 3.6 million carefully constructed, high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark, which comprises 38 sub-domains including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote-sensing-based land use understanding, AIN demonstrates strong performance, with the 7B model outperforming GPT-4o by an absolute gain of 3.4% averaged over eight domains and 38 sub-domains. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.
https://arxiv.org/abs/2502.00094
Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, often leading to information loss. This paper introduces ∞-Video, which can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers by allowing them to process unbounded video contexts efficiently and without requiring additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories that evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance in video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable and training-free comprehension of long videos.
https://arxiv.org/abs/2501.19098
Multi-modal transformers are rapidly gaining attention in video captioning tasks. Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges. When only a limited number of frames are extracted, important frames carrying essential information for caption generation may be missed. Conversely, extracting an excessive number of frames includes many consecutive frames, causing redundancy among the visual tokens extracted from them. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework for video captioning, which has two main functions: (1) selecting a caption generation module of an appropriate size based on the visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Our experiments on three different benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.
https://arxiv.org/abs/2501.18269
Applying Multimodal Large Language Models (MLLMs) to video understanding presents significant challenges due to the need to model temporal relations across frames. Existing approaches adopt either implicit temporal modeling, relying solely on the LLM decoder, or explicit temporal modeling, employing auxiliary temporal encoders. To investigate this debate between the two paradigms, we propose the Stackable Temporal Encoder (STE). STE enables flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios. Using STE, we systematically compare implicit and explicit temporal modeling across dimensions such as overall performance, token compression effectiveness, and temporal-specific understanding. We also explore STE's design considerations and broader impacts as a plug-in module and in image modalities. Our findings emphasize the critical role of explicit temporal modeling, providing actionable insights to advance video MLLMs.
https://arxiv.org/abs/2501.16786
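A plausible minimal stand-in for an STE-style block (assuming PyTorch; not the paper's module) is a temporal convolution whose kernel size plays the role of the temporal receptive field and whose stride plays the role of the token compression ratio; stacking blocks compounds both.

```python
import torch
import torch.nn as nn

class StackableTemporalEncoder(nn.Module):
    """Toy stand-in for an STE-style block: a temporal convolution whose kernel size sets
    the temporal receptive field and whose stride sets the token compression ratio.
    Stacking blocks grows the receptive field and multiplies the compression."""

    def __init__(self, dim: int, receptive_field: int = 3, compression: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=receptive_field,
                              stride=compression, padding=receptive_field // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) visual tokens ordered along time
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)   # temporal mixing + downsampling
        return self.norm(y)

if __name__ == "__main__":
    encoder = nn.Sequential(                      # two stacked blocks: 4x token compression
        StackableTemporalEncoder(dim=256, receptive_field=3, compression=2),
        StackableTemporalEncoder(dim=256, receptive_field=3, compression=2),
    )
    tokens = torch.randn(1, 64, 256)              # 64 frame tokens of width 256
    print(encoder(tokens).shape)                  # torch.Size([1, 16, 256])
```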
The analysis of extended video content poses unique challenges in artificial intelligence, particularly when dealing with the complexity of tracking and understanding visual elements across time. Current methodologies that process video frames sequentially struggle to maintain coherent tracking of objects, especially when these objects temporarily vanish and later reappear in the footage. A critical limitation of these approaches is their inability to effectively identify crucial moments in the video, largely due to their limited grasp of temporal relationships. To overcome these obstacles, we present GraphVideoAgent, a cutting-edge system that leverages the power of graph-based object tracking in conjunction with large language model capabilities. At its core, our framework employs a dynamic graph structure that maps and monitors the evolving relationships between visual entities throughout the video sequence. This innovative approach enables a more nuanced understanding of how objects interact and transform over time, facilitating improved frame selection through comprehensive contextual awareness. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks. In evaluations on the EgoSchema dataset, GraphVideoAgent achieved a 2.2-point improvement over existing methods while requiring analysis of only 8.2 frames on average. Similarly, testing on the NExT-QA benchmark yielded a 2.0-point performance increase with an average frame requirement of 8.1. These results underscore the efficiency of our graph-guided methodology in enhancing both accuracy and computational performance in long-form video understanding tasks.
https://arxiv.org/abs/2501.15953
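The dynamic graph can be sketched as a set of entity nodes matched by embedding similarity, so that an object which disappears and later reappears is routed back to its original node, with edges counting co-occurrence. The threshold and matching rule below are assumptions; this is not the GraphVideoAgent system.

```python
import numpy as np

class DynamicEntityGraph:
    """Toy dynamic graph over video entities (not the GraphVideoAgent implementation):
    detections are matched to existing nodes by embedding similarity, so an object that
    disappears and later reappears is routed back to its original node; edges count
    co-occurrence within a frame."""

    def __init__(self, match_threshold: float = 0.8):
        self.threshold = match_threshold
        self.nodes = []                        # one running-mean embedding per entity
        self.last_seen = []                    # last frame index per entity
        self.edges = {}                        # (i, j) -> co-occurrence count

    def _match(self, emb):
        if not self.nodes:
            return None
        sims = np.array([float(emb @ n / (np.linalg.norm(emb) * np.linalg.norm(n)))
                         for n in self.nodes])
        best = int(np.argmax(sims))
        return best if sims[best] >= self.threshold else None

    def update(self, frame_idx, detections):
        """detections: list of embedding vectors observed in this frame."""
        present = []
        for emb in detections:
            node = self._match(emb)
            if node is None:                               # brand-new entity
                self.nodes.append(np.asarray(emb, float))
                node = len(self.nodes) - 1
                self.last_seen.append(frame_idx)
            else:                                          # re-observed (possibly after a gap)
                self.nodes[node] = 0.9 * self.nodes[node] + 0.1 * np.asarray(emb, float)
                self.last_seen[node] = frame_idx
            present.append(node)
        for i in present:
            for j in present:
                if i < j:
                    self.edges[(i, j)] = self.edges.get((i, j), 0) + 1
        return present

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    cup, hand = np.eye(16)[0], np.eye(16)[1]               # two orthogonal toy embeddings
    g = DynamicEntityGraph()
    g.update(0, [cup, hand])
    g.update(5, [hand + 0.01 * rng.normal(size=16)])        # cup temporarily out of view
    g.update(9, [cup + 0.01 * rng.normal(size=16), hand])   # cup reappears, same node
    print(len(g.nodes), g.edges)                            # 2 entities, co-occurrence counts
```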