We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
https://arxiv.org/abs/2504.13181
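As a rough illustration of the recipe sketched in the PE abstract above, the snippet below shows a standard CLIP-style symmetric contrastive loss and one way to read embeddings from an intermediate transformer block via a forward hook. It is a minimal sketch assuming a generic ViT-style `vision_tower` with a `blocks` list; it is not the released PE code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings (B, D)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def intermediate_features(vision_tower, pixels, layer_idx=-8):
    """Read hidden states from an intermediate block instead of the final output."""
    feats = []
    hook = vision_tower.blocks[layer_idx].register_forward_hook(
        lambda module, inputs, output: feats.append(output))  # assumes a ViT-style `blocks` list
    with torch.no_grad():
        vision_tower(pixels)
    hook.remove()
    return feats[0]
```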
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models.
https://arxiv.org/abs/2504.13180
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on video hallucination, video QA, and captioning benchmarks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at this https URL.
https://arxiv.org/abs/2504.13122
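The hierarchical preference objective described above could, in spirit, be a weighted sum of DPO terms computed at the instance, temporal, and perceptive levels. The sketch below shows that shape only; the level weights, `beta`, and how per-level log-probabilities are gathered are assumptions, not VistaDPO's actual formulation.

```python
import torch.nn.functional as F

def dpo_term(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss at one level, given summed response log-probs (tensors of shape (B,))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def hierarchical_dpo(levels, weights=(1.0, 1.0, 1.0), beta=0.1):
    """levels: [(logp_c, logp_r, ref_c, ref_r), ...] for instance/temporal/perceptive terms."""
    return sum(w * dpo_term(*level, beta=beta) for w, level in zip(weights, levels))
```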
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided by human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at this https URL.
https://arxiv.org/abs/2504.13074
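A toy reading of the "diffusion forcing with non-decreasing noise schedules" component: each frame is assigned a diffusion level that never decreases along time, so partially denoised history can condition the frames still being generated. The linear schedule and interpolation-style noising below are illustrative assumptions, not the SkyReels-V2 implementation.

```python
import torch

def non_decreasing_noise_levels(num_frames, t_min=0.0, t_max=1.0):
    """Per-frame diffusion levels that never decrease along the temporal axis."""
    return torch.linspace(t_min, t_max, num_frames)

def noise_video(frames, levels):
    """frames: (T, C, H, W); levels: (T,) in [0, 1]. Later frames receive more noise."""
    eps = torch.randn_like(frames)
    lv = levels.view(-1, 1, 1, 1)
    return (1.0 - lv) * frames + lv * eps
```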
In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
https://arxiv.org/abs/2504.13035
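The "orthogonal objective" on prototypes mentioned above can be pictured as a penalty on off-diagonal similarity between prototype vectors, as in this minimal sketch (the exact formulation in the paper may differ).

```python
import torch
import torch.nn.functional as F

def prototype_orthogonality_loss(prototypes):
    """prototypes: (K, D) learned prototype vectors for one video."""
    p = F.normalize(prototypes, dim=-1)
    gram = p @ p.t()                                        # (K, K) cosine similarities
    eye = torch.eye(gram.size(0), device=gram.device)
    return ((gram - eye) ** 2).mean()                       # push off-diagonal entries toward zero
```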
Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, particularly benefiting blind and visually impaired (BVI) audiences. Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, posing unique challenges in movie AD. To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes: (i) a Character Perception Module (CPM) for tracking character regions and linking them to names; (ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module (FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and our newly proposed Cinepile-AD dataset. Code and data will be released at this https URL.
https://arxiv.org/abs/2504.12157
Despite recent advances in Large Video Language Models (LVLMs), they still struggle with fine-grained temporal understanding, hallucinate, and often make simple mistakes on even simple video question-answering tasks, all of which pose significant challenges to their safe and reliable deployment in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. Our proposed framework first obtains a training set of preferred and non-preferred response pairs, where non-preferred responses are generated by incorporating common error patterns that often occur due to inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the vision modality, among others. To facilitate self-alignment of LVLMs with the constructed preferred and non-preferred response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method that utilizes sub-sequence-level refined rewards and token-wise KL regularization to address the limitations of Direct Preference Optimization (DPO). We demonstrate that RRPO achieves more precise alignment and more stable training compared to DPO. Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning.
https://arxiv.org/abs/2504.12083
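To make the RRPO description concrete, the sketch below shows the shape of a token-wise KL penalty added to a DPO-style margin loss. The sub-sequence-level refined rewards that distinguish RRPO are not reproduced here, and the coefficients are placeholders.

```python
import torch.nn.functional as F

def tokenwise_kl(policy_logits, ref_logits, mask):
    """Mean KL(policy || reference) over response tokens.
    policy_logits, ref_logits: (B, T, V); mask: (B, T) with 1.0 on response tokens."""
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - ref_logp)).sum(-1)           # (B, T)
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)

def preference_loss_with_kl(logp_chosen, logp_rejected, kl_term, beta=0.1, alpha=0.01):
    """DPO-style margin loss plus a token-wise KL penalty keeping the policy near the reference."""
    return -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean() + alpha * kl_term
```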
This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state-of-the-art and emerging trends in complex video segmentation. More information can be found on the workshop website: this https URL.
https://arxiv.org/abs/2504.11326
The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.
https://arxiv.org/abs/2504.11199
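One plausible form of the "local scores refined by global attention" step is a small self-attention module that produces a residual correction to the per-frame scores obtained from caption windows, as sketched below; the dimensions and scoring model are placeholders rather than the LLMVS architecture.

```python
import torch
import torch.nn as nn

class GlobalScoreRefiner(nn.Module):
    """Refines per-frame importance scores with attention over all caption features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, caption_feats, local_scores):
        """caption_feats: (B, T, D); local_scores: (B, T) from the local LLM pass."""
        refined, _ = self.attn(caption_feats, caption_feats, caption_feats)
        return local_scores + self.head(refined).squeeze(-1)   # residual correction
```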
In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentation) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.
https://arxiv.org/abs/2504.10825
Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities like audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at this https URL.
https://arxiv.org/abs/2504.10443
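The first TDC step, splitting a video into semantically consistent segments from inter-frame similarities, can be approximated by thresholding cosine similarity between consecutive frame features, as in this minimal sketch (the threshold and the feature extractor are assumptions).

```python
import torch
import torch.nn.functional as F

def segment_by_similarity(frame_feats, threshold=0.85):
    """frame_feats: (T, D). Returns (start, end) index pairs, end exclusive."""
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)   # (T-1,)
    cuts = ((sims < threshold).nonzero(as_tuple=True)[0] + 1).tolist()
    bounds = [0] + cuts + [frame_feats.size(0)]
    return list(zip(bounds[:-1], bounds[1:]))
```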
Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose $\mathbf{Mavors}$, a novel framework that introduces $\mathbf{M}$ulti-gr$\mathbf{a}$nularity $\mathbf{v}$ide$\mathbf{o}$ $\mathbf{r}$epre$\mathbf{s}$entation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.
https://arxiv.org/abs/2504.10068
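One way to read "chunk-level rotary position encodings" is that every token within a chunk shares the chunk's index, so the rotary embedding encodes the order of chunks rather than of individual patch tokens. The helper below generates such position ids; it is an interpretation of the abstract, not the released Mavors code.

```python
import torch

def chunk_position_ids(num_chunks, tokens_per_chunk):
    """Position ids 0,0,...,1,1,...,num_chunks-1: every token in a chunk shares its chunk index."""
    return torch.arange(num_chunks).repeat_interleave(tokens_per_chunk)

# Example: 3 chunks of 4 tokens -> tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
```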
Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at this https URL.
https://arxiv.org/abs/2504.09641
Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by \textbf{manually} annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an open-source MLLM, achieves 73.35\% accuracy on VideoAds, outperforming GPT-4o (66.82\%) and Gemini-1.5 Pro (69.66\%); the two proprietary models especially fall behind the open-source model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27\%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at this https URL.
https://arxiv.org/abs/2504.09282
Long-form video understanding presents significant challenges for interactive retrieval systems, as conventional methods struggle to process extensive video content efficiently. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking, limiting their effectiveness. This paper presents a novel framework to enhance interactive video retrieval through four key innovations: (1) an ensemble search strategy that integrates coarse-grained (CLIP) and fine-grained (BEIT3) models to improve retrieval accuracy, (2) a storage optimization technique that reduces redundancy by selecting representative keyframes via TransNetV2 and deduplication, (3) a temporal search mechanism that localizes video segments using dual queries for start and end points, and (4) a temporal reranking approach that leverages neighboring frame context to stabilize rankings. Evaluated on known-item search and question-answering tasks, our framework demonstrates substantial improvements in retrieval precision, efficiency, and user interpretability, offering a robust solution for real-world interactive video retrieval applications.
https://arxiv.org/abs/2504.08384
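Two of the ideas above are easy to sketch: fusing coarse (CLIP-like) and fine (BEiT3-like) similarity scores, and stabilizing a ranking by smoothing each frame's score with its temporal neighbours. The fusion weight and window size below are illustrative assumptions, not the framework's tuned values.

```python
import torch
import torch.nn.functional as F

def ensemble_scores(coarse, fine, w=0.5):
    """coarse, fine: (N,) query-to-frame similarities from two retrieval models."""
    return w * coarse + (1.0 - w) * fine

def temporal_rerank(scores, window=2):
    """Smooth per-frame scores with their +/- `window` neighbours before ranking."""
    kernel = torch.ones(1, 1, 2 * window + 1) / (2 * window + 1)
    smoothed = F.conv1d(scores.view(1, 1, -1), kernel, padding=window)
    return smoothed.view(-1)
```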
Analyzing Fast, Frequent, and Fine-grained (F$^3$) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all the F$^3$ criteria with high accuracy due to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F$^3$Set, a benchmark that consists of video datasets for precise F$^3$ event detection. Datasets in F$^3$Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. Currently, F$^3$Set contains several sports datasets, and this framework may be extended to other applications as well. We evaluated popular temporal action understanding methods on F$^3$Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F$^3$ED, for F$^3$ event detection, achieving superior performance. The dataset, model, and benchmark code are available at this https URL.
https://arxiv.org/abs/2504.08222
Video-based Large Language Models (Video-LLMs) have witnessed substantial advancements in recent years, propelled by advances in multi-modal LLMs. Although these models have demonstrated proficiency in providing the overall description of videos, they struggle with fine-grained understanding, particularly in aspects such as visual dynamics and video detail inquiries. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities. Hence, we propose two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF$^2$T), a novel effortless fine-tuning method, employs the rich inherent characteristics of videos for training, while unlocking more fine-grained understanding ability of Video-LLMs. Moreover, it relieves researchers from labor-intensive annotations and smartly circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) A novel benchmark dataset, namely FineVidBench, for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF$^2$T on them. Experimental results reveal that our approach improves their ability to capture and interpret spatiotemporal details.
https://arxiv.org/abs/2504.07745
The core challenge in video understanding lies in perceiving dynamic content changes over time. However, multimodal large language models struggle with temporal-sensitive video tasks, which require generating timestamps to mark the occurrence of specific events. Existing strategies require MLLMs to generate absolute or relative timestamps directly. We have observed that those MLLMs tend to rely more on language patterns than visual cues when generating timestamps, affecting their performance. To address this problem, we propose VideoExpert, a general-purpose MLLM suitable for several temporal-sensitive video tasks. Inspired by the expert concept, VideoExpert integrates two parallel modules: the Temporal Expert and the Spatial Expert. The Temporal Expert is responsible for modeling time sequences and performing temporal grounding. It processes high-frame-rate yet compressed tokens to capture dynamic variations in videos and includes a lightweight prediction head for precise event localization. The Spatial Expert focuses on content detail analysis and instruction following. It handles specially designed spatial tokens and language input, aiming to generate content-related responses. These two experts collaborate seamlessly via a special token, ensuring coordinated temporal grounding and content generation. Notably, the Temporal and Spatial Experts maintain independent parameter sets. By offloading temporal grounding from content generation, VideoExpert prevents text pattern biases in timestamp predictions. Moreover, we introduce a Spatial Compress module to obtain spatial tokens. This module filters and compresses patch tokens while preserving key information, delivering compact yet detail-rich input for the Spatial Expert. Extensive experiments demonstrate the effectiveness and versatility of VideoExpert.
https://arxiv.org/abs/2504.07519
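A plausible reading of the Spatial Compress module is a filter that keeps only the most salient patch tokens before they reach the Spatial Expert. The sketch below uses feature norm as a stand-in saliency score; the paper's actual filtering rule may differ.

```python
import torch

def compress_patch_tokens(patch_tokens, keep=64):
    """patch_tokens: (N, D). Keep the `keep` tokens with the largest feature norm."""
    scores = patch_tokens.norm(dim=-1)                      # stand-in saliency score
    idx = scores.topk(min(keep, patch_tokens.size(0))).indices
    return patch_tokens[idx.sort().values]                  # keep original spatial order
```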
How multimodal large language models (MLLMs) perceive the visual world remains a mystery. To one extreme, object and relation modeling may be implicitly implemented with inductive biases, for example by treating objects as tokens. To the other extreme, empirical results reveal the surprising finding that simply performing visual captioning, which tends to ignore spatial configuration of the objects, serves as a strong baseline for video understanding. We aim to answer the question: how can objects help video-language understanding in MLLMs? We tackle the question from the object representation and adaptation perspectives. Specifically, we investigate the trade-off between representation expressiveness (e.g., distributed versus symbolic) and integration difficulty (e.g., data-efficiency when learning the adapters). Through extensive evaluations on five video question answering datasets, we confirm that explicit integration of object-centric representation remains necessary, and the symbolic objects can be most easily integrated while being performant for question answering. We hope our findings can encourage the community to explore the explicit integration of perception modules into MLLM design. Our code and models will be publicly released.
https://arxiv.org/abs/2504.07454
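A "symbolic" object representation of the kind compared in this study can be as simple as rendering detections into short text snippets that are prepended to the language input, as in this toy example (the prompt format is hypothetical, not the one used in the paper).

```python
def objects_to_symbols(detections):
    """detections: list of dicts like {"label": "person", "box": (x1, y1, x2, y2)}, normalized coords."""
    return " ".join(
        f"<obj>{d['label']} at ({d['box'][0]:.2f}, {d['box'][1]:.2f}, {d['box'][2]:.2f}, {d['box'][3]:.2f})</obj>"
        for d in detections
    )

print(objects_to_symbols([{"label": "person", "box": (0.10, 0.20, 0.45, 0.90)}]))
# -> <obj>person at (0.10, 0.20, 0.45, 0.90)</obj>
```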
Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.
https://arxiv.org/abs/2504.06958
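At the core of GRPO is a group-relative advantage: rewards for several responses sampled from the same prompt are normalized by the group's mean and standard deviation. The one-liner below shows that computation; the reward design for the video tasks studied here is omitted.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (G,) scalar rewards for G responses sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```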