The rapid spread of multimodal misinformation on social media has raised growing concerns, yet research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. We further propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) misinformation long-Chain-of-Thought (CoT) instruction tuning, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.
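To make the GRPO stage above more concrete, the following is a minimal sketch of what a rule-based verifiable reward could look like for this task; the `<think>`/`<verdict>` tags, the label set, and the weights are hypothetical placeholders rather than Fact-R1's actual reward design.

```python
import re

def verifiable_reward(response: str, gold_label: str) -> float:
    """Toy rule-based reward: format compliance plus verdict correctness.

    Hypothetical scheme -- the paper's actual reward terms are not specified here.
    """
    reward = 0.0
    # Format term: the response must contain a reasoning trace and a final verdict.
    has_think = bool(re.search(r"<think>.+?</think>", response, re.S))
    verdict = re.search(r"<verdict>(real|fake)</verdict>", response)
    if has_think and verdict:
        reward += 0.2
    # Accuracy term: the verdict must match the annotated label.
    if verdict and verdict.group(1) == gold_label:
        reward += 1.0
    return reward

print(verifiable_reward("<think>the caption contradicts the footage</think><verdict>fake</verdict>", "fake"))  # 1.2
```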
https://arxiv.org/abs/2505.16836
In this paper, we present the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025 (confirmed on May 20, 2025). Inspired by the success of large models, we evaluate and leverage leading accessible multimodal large models and adapt them to video understanding tasks via few-shot learning and model ensemble strategies. Specifically, diversified prompt styles and process paradigms are systematically explored and evaluated to effectively guide the attention of large models, fully unleashing their powerful generalization and adaptability. Experimental results demonstrate that, with our carefully designed approach, directly utilizing an individual multimodal model already outperforms the previous state-of-the-art (SOTA) method, which includes several additional processes. In addition, a further stage is introduced to facilitate the cooperation and ensembling of periodic results, yielding impressive performance improvements. We hope this work serves as a valuable reference for the practical application of large models and inspires future research in the field.
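As a rough illustration of the ensemble stage, the sketch below majority-votes answers produced under different prompt styles or models; the tie-breaking rule and the whole setup are simplifications, not the authors' exact procedure.

```python
from collections import Counter

def ensemble_answers(candidate_answers: list[str]) -> str:
    """Majority vote over answers produced by different prompts/models.

    A simplified stand-in for the cooperation/ensemble stage; ties are
    broken by first occurrence.
    """
    counts = Counter(candidate_answers)
    best, _ = counts.most_common(1)[0]
    return best

# e.g. three prompt styles applied to the same multiple-choice question
print(ensemble_answers(["B", "B", "D"]))  # -> "B"
```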
https://arxiv.org/abs/2505.16784
The integration of artificial intelligence in sports analytics has transformed soccer video understanding, enabling real-time, automated insights into complex game dynamics. Traditional approaches rely on isolated data streams, limiting their effectiveness in capturing the full context of a match. To address this, we introduce SoccerChat, a multimodal conversational AI framework that integrates visual and textual data for enhanced soccer video comprehension. Leveraging the extensive SoccerNet dataset, enriched with jersey color annotations and automatic speech recognition (ASR) transcripts, SoccerChat is fine-tuned on a structured video instruction dataset to facilitate accurate game understanding, event classification, and referee decision making. We benchmark SoccerChat on action classification and referee decision-making tasks, demonstrating its performance in general soccer event comprehension while maintaining competitive accuracy in referee decision making. Our findings highlight the importance of multimodal integration in advancing soccer analytics, paving the way for more interactive and explainable AI-driven sports analysis. this https URL
https://arxiv.org/abs/2505.16630
Video captioning models have seen notable advancements in recent years, especially with regard to their ability to capture temporal information. While many research efforts have focused on architectural advancements, such as temporal attention mechanisms, there remains a notable gap in understanding how models capture and utilize temporal semantics for effective temporal feature extraction, especially in the context of Advanced Driver Assistance Systems. We propose an automated LiDAR-based captioning procedure that focuses on the temporal dynamics of traffic participants. Our approach uses a rule-based system to extract essential details such as lane position and relative motion from object tracks, followed by a template-based caption generation. Our findings show that training SwinBERT, a video captioning model, using only front camera images and supervised with our template-based captions, specifically designed to encapsulate fine-grained temporal behavior, leads to improved temporal understanding consistently across three datasets. In conclusion, our results clearly demonstrate that integrating LiDAR-based caption supervision significantly enhances temporal understanding, effectively addressing and reducing the inherent visual/static biases prevalent in current state-of-the-art model architectures.
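The rule-based caption generation can be pictured as follows: a toy sketch that turns one object track into a template caption, assuming hypothetical track fields (category, lateral offset, speed) and illustrative thresholds that are not taken from the paper.

```python
def caption_from_track(track: list[dict], ego_speed: float) -> str:
    """Rule-based, template caption for one tracked traffic participant.

    `track` is a list of per-frame states with hypothetical keys
    ('category', 'lateral_offset_m', 'speed_mps'); thresholds are illustrative.
    """
    first, last = track[0], track[-1]
    # Lane position from lateral offset relative to the ego lane centre.
    lane = ("in the ego lane" if abs(last["lateral_offset_m"]) < 1.8 else
            "in the left lane" if last["lateral_offset_m"] > 0 else "in the right lane")
    # Relative motion from the speed difference to the ego vehicle.
    rel = last["speed_mps"] - ego_speed
    motion = "pulling away" if rel > 1.0 else "falling behind" if rel < -1.0 else "keeping pace"
    return f"A {first['category']} {lane} is {motion}."

track = [{"category": "car", "lateral_offset_m": 0.4, "speed_mps": 14.0},
         {"category": "car", "lateral_offset_m": 0.5, "speed_mps": 17.0}]
print(caption_from_track(track, ego_speed=13.0))  # "A car in the ego lane is pulling away."
```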
https://arxiv.org/abs/2505.16594
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, where converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and a scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
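A minimal sketch of the decode/inference overlap idea, using stand-in functions for interval decoding and prefilling; the interval format and worker count are assumptions, and the real system operates on actual bitstreams and GPU KV caches.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_interval(interval):
    """Stand-in for CPU decoding of one keyframe-aligned interval into RGB frames."""
    start, end = interval
    return list(range(start, end))  # pretend these are decoded frames

def prefill_chunk(frames):
    """Stand-in for GPU prefilling (with KV-cache pruning) over decoded frames."""
    return len(frames)

def overlapped_pipeline(intervals, workers: int = 8):
    """Overlap CPU decoding with GPU prefilling, in the spirit of QuickVideo.

    Intervals are decoded concurrently and each chunk is handed to the GPU as
    soon as it is ready, instead of waiting for the whole video to decode.
    """
    prefilled = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(decode_interval, iv) for iv in intervals]
        for fut in futures:                           # consume in submission order
            prefilled += prefill_chunk(fut.result())  # GPU work overlaps remaining decoding
    return prefilled

print(overlapped_pipeline([(0, 30), (30, 60), (60, 90)]))  # 90 "frames" prefilled
```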
https://arxiv.org/abs/2505.16175
Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, even as newer models improve at both tasks, significant room for improvement remains in tracking objects for grounding over time and in reasoning-based decision-making that better aligns object references with language model outputs. This work presents an LLM-brained agent for zero-shot VideoQA that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at this https URL.
https://arxiv.org/abs/2505.15928
Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained details, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation in long video understanding tasks when visual tokens are compressed below a quarter of their original count. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that utilizes a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding within existing VLM backbones. Our method achieves a 13x compression of visual tokens per frame (averaging 61 tokens/frame) without compromising QA accuracy. In our experiments, Clapper achieves 62.0% on VideoMME, 69.8% on MLVU, and 67.4% on TempCompass, all with fewer than 6,000 visual tokens per video. The code will be publicly available on the homepage.
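As a simplified picture of a slow-fast token allocation, the snippet below assigns a dense token grid to sparse "slow" frames and a compressed budget to the remaining "fast" frames; the specific numbers are illustrative and not Clapper's configuration.

```python
def slow_fast_budget(num_frames: int, slow_every: int = 8,
                     slow_tokens: int = 196, fast_tokens: int = 36) -> list[int]:
    """Assign a per-frame visual-token budget under a toy slow-fast scheme.

    Every `slow_every`-th frame keeps a dense token grid; the rest are
    heavily compressed. Numbers are illustrative only.
    """
    return [slow_tokens if i % slow_every == 0 else fast_tokens
            for i in range(num_frames)]

budget = slow_fast_budget(64)
print(sum(budget) / len(budget))  # average tokens per frame under this toy setting (56.0)
```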
https://arxiv.org/abs/2505.15529
Video understanding is inherently intention-driven: humans naturally focus on relevant frames based on their goals. Recent advancements in multimodal large language models (MLLMs) have enabled flexible query-driven reasoning; however, video-based frameworks like Video Chain-of-Thought lack direct training signals to effectively identify relevant frames. Current approaches often rely on heuristic methods or pseudo-label supervised annotations, which are both costly and limited in scalability across diverse scenarios. To overcome these challenges, we introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in intention-driven video understanding. An iterated amplification strategy is adopted to perform alternating cyclic training in the video CoT system, where each component undergoes iterative cycles of refinement to improve its capabilities. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error, eliminating the need for expensive annotations while closely aligning with human-like learning processes. Comprehensive experiments across multiple benchmarks, including VideoMME, LVBench, and MLVU, demonstrate that ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks, highlighting its effectiveness and scalability. Notably, ViaRL achieves a nearly 15\% improvement on Needle QA, a subset of MLVU that requires searching for a specific needle within a long video and is regarded as one of the most suitable benchmarks for evaluating temporal grounding.
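The trial-and-error reward can be sketched as follows, with the downstream VideoQA model abstracted as a callable; the exact-match check and the per-frame cost term are illustrative assumptions, not ViaRL's exact reward.

```python
def frame_selection_reward(selected_frames, question, gold_answer,
                           answer_fn, frame_cost: float = 0.01) -> float:
    """Reward for a frame selector based on downstream answer accuracy.

    `answer_fn(frames, question)` stands in for the downstream VideoQA model;
    the accuracy term and the (hypothetical) per-frame cost term are illustrative.
    """
    predicted = answer_fn(selected_frames, question)
    correct = float(predicted.strip().lower() == gold_answer.strip().lower())
    return correct - frame_cost * len(selected_frames)

# toy downstream model that always answers "a red car"
reward = frame_selection_reward([3, 17, 42], "What enters the scene?",
                                "a red car", lambda frames, q: "a red car")
print(reward)  # 0.97
```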
https://arxiv.org/abs/2505.15447
Recent developments in Video Large Language Models (Video LLMs) have enabled models to process long video sequences and demonstrate remarkable performance. Nonetheless, studies predominantly focus on offline video question answering, neglecting the memory usage and response speed that are essential in various real-world applications, such as Deepseek services, autonomous driving, and robotics. To mitigate these challenges, we propose $\textbf{LiveVLM}$, a training-free framework specifically designed for streaming, online video understanding and real-time interaction. Unlike existing works that process videos only after one question is posed, LiveVLM constructs an innovative streaming-oriented KV cache to process video streams in real-time, retain long-term video details and eliminate redundant KVs, ensuring prompt responses to user queries. For continuous video streams, LiveVLM generates and compresses video key-value tensors (video KVs) to reserve visual information while improving memory efficiency. Furthermore, when a new question is proposed, LiveVLM incorporates an online question-answering process that efficiently fetches both short-term and long-term visual information, while minimizing interference from redundant context. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to process 44$\times$ the number of frames on the same device, and achieves up to 5$\times$ speedup in response speed compared with SoTA online methods at an input of 256 frames, while maintaining the same or better model performance.
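A toy sketch of redundancy elimination in a streaming cache: each incoming frame's (stand-in) KV vector is kept only if it differs enough from the last retained one. Real video KVs are per-layer tensors and LiveVLM's actual compression rule is more involved; the threshold and capacity here are made up.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)) + 1e-8)

class StreamingKVCache:
    """Toy streaming cache: keep a frame's KV vector only if it is sufficiently
    different from the last retained one, and bound the total memory."""

    def __init__(self, redundancy_threshold: float = 0.98, max_entries: int = 1024):
        self.entries = []
        self.threshold = redundancy_threshold
        self.max_entries = max_entries

    def push(self, frame_kv):
        if self.entries and cosine(self.entries[-1], frame_kv) > self.threshold:
            return  # redundant with the previous retained frame -> skip
        self.entries.append(frame_kv)
        if len(self.entries) > self.max_entries:
            self.entries.pop(0)  # bound long-term memory

cache = StreamingKVCache()
cache.push([1.0, 0.0])
cache.push([0.99, 0.01])
cache.push([0.0, 1.0])
print(len(cache.entries))  # 2: the near-duplicate second frame was dropped
```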
https://arxiv.org/abs/2505.15269
Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates a vision-language foundation, leveraging VideoMAE for dynamic visual encoding and BERT for contextual textual embedding, to address the challenge of recognizing fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph where nodes represent frames, objects, and textual annotations, and edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible and context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning by modulating edge importance based on action semantics. Through extensive evaluations on diverse benchmark datasets, we demonstrate that our method consistently outperforms state-of-the-art baselines, underscoring the strength of combining foundation models with dynamic graph-based reasoning for robust and generalizable action recognition.
https://arxiv.org/abs/2505.15192
Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sober lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated due to the possibility of guessing the correct answer; second, a significant portion of questions in these benchmarks have strong priors that allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50\% accuracy given a random frame from a long video on Video-MME. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark containing open-ended short-answer questions that truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we conclude the following findings: (1) video LMMs show drastic performance drops ($>$25\%) on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.
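To illustrate why open-ended short answers are harder to game than MCQs, here is a simple normalized-match scorer of the kind commonly used for short-answer evaluation; VideoEval-Pro's actual scoring protocol may differ.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles -- common short-answer normalization."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def short_answer_correct(prediction: str, reference: str) -> bool:
    """Open-ended scoring by normalized match; unlike MCQs, random guessing
    yields near-zero accuracy, which is the property the benchmark relies on."""
    return normalize(prediction) == normalize(reference)

print(short_answer_correct("The man is riding a bike.", "riding a bike"))  # False: stricter than MCQ
print(short_answer_correct("A bike", "the bike"))                          # True after normalization
```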
https://arxiv.org/abs/2505.14640
Video large language models (VideoLLM) excel at video understanding, but face efficiency challenges due to the quadratic complexity of abundant visual tokens. Our systematic analysis of token compression methods for VideoLLMs reveals two critical issues: (i) overlooking distinctive visual signals across frames, leading to information loss; (ii) suffering from implementation constraints, causing incompatibility with modern architectures or efficient operators. To address these challenges, we distill three design principles for VideoLLM token compression and propose a plug-and-play inference acceleration framework "Video Compression Commander" (VidCom2). By quantifying each frame's uniqueness, VidCom2 adaptively adjusts compression intensity across frames, effectively preserving essential information while reducing redundancy in video sequences. Extensive experiments across various VideoLLMs and benchmarks demonstrate the superior performance and efficiency of our VidCom2. With only 25% visual tokens, VidCom2 achieves 99.6% of the original performance on LLaVA-OV while reducing 70.8% of the LLM generation latency. Notably, our Frame Compression Adjustment strategy is compatible with other token compression methods to further improve their performance. Our code is available at this https URL.
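The frame-adaptive idea can be sketched as a budget allocator that spends more visual tokens on more unique frames; how VidCom2 actually quantifies uniqueness and applies compression inside the model is not reproduced here.

```python
def adaptive_token_budget(frame_uniqueness, total_tokens):
    """Distribute a global visual-token budget in proportion to per-frame
    uniqueness scores -- a simplified take on frame-adaptive compression."""
    total = sum(frame_uniqueness) or 1.0
    return [max(1, round(total_tokens * u / total)) for u in frame_uniqueness]

# frames 0 and 3 are visually distinctive, frames 1-2 are near-duplicates
print(adaptive_token_budget([0.9, 0.1, 0.1, 0.7], total_tokens=180))  # [90, 10, 10, 70]
```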
https://arxiv.org/abs/2505.14454
Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The rest of the questions are labeled as Others. This can enable fine-grained evaluation of different capabilities of a video LLM. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.
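A schematic of the categorization logic, with the video LLM abstracted as a callable; passing an empty frame list as a proxy for "no video" and a single pass per condition are simplifying assumptions (a robust pipeline would average over multiple runs and models).

```python
def categorize_question(question, answer, answer_fn, frames, shuffled_frames):
    """Assign a question to one of the four buckets described above.

    `answer_fn(frames, question)` stands in for a video LLM call; an empty
    frame list approximates answering from language priors alone.
    """
    if answer_fn([], question) == answer:
        return "LLM-Answerable"            # solvable without watching the video
    if answer_fn(shuffled_frames, question) == answer:
        return "Semantic"                  # order-invariant visual content suffices
    if answer_fn(frames, question) == answer:
        return "Temporal"                  # requires the correct frame order
    return "Others"
```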
https://arxiv.org/abs/2505.14321
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at this https URL
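The caption pipeline can be summarized as a generate-score-refine loop; the callables, threshold, and round limit below are hypothetical stand-ins for the VLM generator, quality scorer, and refinement step described above.

```python
def generate_caption(clip, vlm_fn, score_fn, refine_fn,
                     quality_threshold: float = 0.8, max_rounds: int = 3) -> str:
    """Generate-score-refine loop for clip captions.

    All callables are hypothetical stand-ins: `vlm_fn(clip)` drafts a caption,
    `score_fn(clip, caption)` rates its quality, and `refine_fn` revises it.
    """
    caption = vlm_fn(clip)
    for _ in range(max_rounds):
        score = score_fn(clip, caption)
        if score >= quality_threshold:
            break                                   # good enough: stop refining
        caption = refine_fn(clip, caption, score)   # dynamic refinement step
    return caption
```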
https://arxiv.org/abs/2505.13928
Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most video understanding VLM research has been domain-agnostic, leaving their transfer learning capability to specialized domains under-explored. In this work, we address this by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, and uses them to iteratively fine-tune the general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then moving on to question-answering tasks). The final adapted model, trained using a curated dataset of 20k video clips, exhibits significant improvement in soccer-specific tasks compared to the base model, with a 37.5% relative improvement for the visual question-answering task and an accuracy improvement from 11.8% to 63.5% for the downstream soccer action classification task.
https://arxiv.org/abs/2505.13860
Modern video understanding systems excel at tasks such as scene classification, object detection, and short video retrieval. However, as video analysis becomes increasingly central to real-world applications, there is a growing need for proactive video agents: systems that not only interpret video streams but also reason about events and take informed actions. A key obstacle in this direction is temporal reasoning: while deep learning models have made remarkable progress in recognizing patterns within individual frames or short clips, they struggle to understand the sequencing and dependencies of events over time, which is critical for action-driven decision-making. Addressing this limitation demands moving beyond conventional deep learning approaches. We posit that tackling this challenge requires a neuro-symbolic perspective, where video queries are decomposed into atomic events, structured into coherent sequences, and validated against temporal constraints. Such an approach can enhance interpretability, enable structured reasoning, and provide stronger guarantees on system behavior, all key properties for advancing trustworthy video agents. To this end, we present a grand challenge to the research community: developing the next generation of intelligent video agents that integrate three core capabilities: (1) autonomous video search and analysis, (2) seamless real-world interaction, and (3) advanced content generation. By addressing these pillars, we can transition from passive perception to intelligent video agents that reason, predict, and act, pushing the boundaries of video understanding.
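As a small example of the neuro-symbolic validation step, the sketch below checks atomic events with time intervals against simple temporal constraints; the event/relation vocabulary is an illustrative Allen-style subset, not a prescribed formalism.

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    start: float  # seconds
    end: float

def satisfies(events: dict[str, Event], constraints: list[tuple[str, str, str]]) -> bool:
    """Check a decomposed video query against simple temporal constraints.

    Constraints are (event_a, relation, event_b) triples; only 'before' and
    'during' are sketched here.
    """
    for a, rel, b in constraints:
        ea, eb = events[a], events[b]
        if rel == "before" and not ea.end <= eb.start:
            return False
        if rel == "during" and not (eb.start <= ea.start and ea.end <= eb.end):
            return False
    return True

events = {"car_stops": Event("car_stops", 4.0, 6.0),
          "pedestrian_crosses": Event("pedestrian_crosses", 6.5, 9.0)}
print(satisfies(events, [("car_stops", "before", "pedestrian_crosses")]))  # True
```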
https://arxiv.org/abs/2505.13851
Large language and multimodal models (LLMs and LMMs) exhibit strong inference capabilities but are often limited by slow decoding speeds. This challenge is especially acute in LMMs, where visual inputs typically comprise more tokens with lower information density than text -- an issue exacerbated by recent trends toward finer-grained visual tokenizations to boost performance. Speculative decoding has been effective in accelerating LLM inference by using a smaller draft model to generate candidate tokens, which are then selectively verified by the target model, improving speed without sacrificing output quality. While this strategy has been extended to LMMs, existing methods largely overlook the unique properties of visual inputs and depend solely on text-based draft models. In this work, we propose \textbf{FLASH} (Fast Latent-Aware Semi-Autoregressive Heuristics), a speculative decoding framework designed specifically for LMMs, which leverages two key properties of multimodal data to design the draft model. First, to address redundancy in visual tokens, we propose a lightweight latent-aware token compression mechanism. Second, recognizing that visual objects often co-occur within a scene, we employ a semi-autoregressive decoding strategy to generate multiple tokens per forward pass. These innovations accelerate draft decoding while maintaining high acceptance rates, resulting in faster overall inference. Experiments show that FLASH significantly outperforms prior speculative decoding approaches in both unimodal and multimodal settings, achieving up to \textbf{2.68$\times$} speed-up on video captioning and \textbf{2.55$\times$} on visual instruction tuning tasks compared to the original LMM.
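For intuition, here is a greedy draft-and-verify step over toy integer "tokens": the draft proposes several tokens per pass (the semi-autoregressive idea) and the target keeps the longest verified prefix. Real implementations verify the whole proposal in a single target forward pass, and FLASH's acceptance rule and latent-aware token compression are not reproduced here.

```python
def speculative_step(prefix, draft_fn, target_fn, k: int = 4):
    """One greedy draft-and-verify step of speculative decoding.

    `draft_fn(prefix, k)` proposes k tokens per forward pass; `target_fn(prefix)`
    returns the target model's next token. Both are stand-ins.
    """
    proposal = draft_fn(prefix, k)
    accepted = []
    for tok in proposal:
        expected = target_fn(prefix + accepted)
        if tok != expected:
            accepted.append(expected)   # take the target's token and stop
            break
        accepted.append(tok)            # draft token verified, keep going
    return prefix + accepted

# toy models over integer "tokens": the target counts up, the draft gets the
# first two continuation tokens right and then diverges
out = speculative_step([1, 2], lambda p, k: [3, 4, 9, 6], lambda p: p[-1] + 1)
print(out)  # [1, 2, 3, 4, 5]: two draft tokens accepted, then the target's token
```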
https://arxiv.org/abs/2505.12728
Recent years have witnessed outstanding advances in large vision-language models (LVLMs). To tackle video understanding, most of them depend upon their implicit temporal understanding capacity, and the components that actually contribute to temporal understanding ability have not been identified, which might limit the potential of these LVLMs for video understanding. In this work, we conduct a thorough empirical study to demystify the crucial components that influence the temporal understanding of LVLMs. Our study reveals that significant impacts are centered around the intermediate interface between the visual encoder and the large language model. Building on these insights, we propose a temporal-oriented recipe that encompasses temporal-oriented training schemes and an upscaled interface. Our final model, developed using this recipe, significantly enhances previous LVLMs on standard video understanding tasks.
https://arxiv.org/abs/2505.12605
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable reasoning and generalization capabilities in video understanding; however, their application in video editing remains largely underexplored. This paper presents the first systematic study of LLMs in the context of video editing. To bridge the gap between visual information and language-based reasoning, we introduce L-Storyboard, an intermediate representation that transforms discrete video shots into structured language descriptions suitable for LLM processing. We categorize video editing tasks into Convergent Tasks and Divergent Tasks, focusing on three core tasks: Shot Attributes Classification, Next Shot Selection, and Shot Sequence Ordering. To address the inherent instability of divergent task outputs, we propose the StoryFlow strategy, which converts the divergent multi-path reasoning process into a convergent selection mechanism, effectively enhancing task accuracy and logical coherence. Experimental results demonstrate that L-Storyboard facilitates a more robust mapping between visual information and language descriptions, significantly improving the interpretability and privacy protection of video editing tasks. Furthermore, StoryFlow enhances the logical consistency and output stability in Shot Sequence Ordering, underscoring the substantial potential of LLMs in intelligent video editing.
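StoryFlow's divergent-to-convergent idea can be sketched as sampling several candidate orderings and then selecting one; the proposal and selection callables are hypothetical stand-ins for LLM calls.

```python
def storyflow_like_ordering(shots, propose_fn, select_fn, num_paths: int = 4):
    """Convert divergent multi-path reasoning into a convergent selection.

    `propose_fn(shots, seed)` returns one candidate shot ordering (e.g. one
    sampled LLM answer over L-Storyboard descriptions); `select_fn(candidates)`
    picks the most coherent candidate. Both callables are hypothetical.
    """
    candidates = [propose_fn(shots, seed) for seed in range(num_paths)]
    return select_fn(candidates)
```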
https://arxiv.org/abs/2505.12237
As Video Large Multimodal Models (VLMMs) rapidly advance, their inherent complexity introduces significant safety challenges, particularly the issue of mismatched generalization where static safety alignments fail to transfer to dynamic video contexts. We introduce SafeVid, a framework designed to instill video-specific safety principles in VLMMs. SafeVid uniquely transfers robust textual safety alignment capabilities to the video domain by employing detailed textual video descriptions as an interpretive bridge, facilitating LLM-based rule-driven safety reasoning. This is achieved through a closed-loop system comprising: 1) generation of SafeVid-350K, a novel 350,000-pair video-specific safety preference dataset; 2) targeted alignment of VLMMs using Direct Preference Optimization (DPO); and 3) comprehensive evaluation via our new SafeVidBench benchmark. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements (e.g., up to 42.39%) on SafeVidBench. SafeVid provides critical resources and a structured approach, demonstrating that leveraging textual descriptions as a conduit for safety reasoning markedly improves the safety alignment of VLMMs. We have made SafeVid-350K dataset (this https URL) publicly available.
https://arxiv.org/abs/2505.11926