The demand for real-time visual understanding and interaction in complex scenarios is increasingly critical for unmanned aerial vehicles (UAVs). However, a significant challenge arises from the contradiction between the high computational cost of large vision-language models and the limited computing resources available on UAV edge devices. To address this challenge, this paper proposes a lightweight multimodal task platform based on BLIP-2, integrated with the YOLO-World and YOLOv8-Seg models. This integration extends the multi-task capabilities of BLIP-2 for UAV applications with minimal adaptation and without requiring task-specific fine-tuning on drone data. First, the deep integration of BLIP-2 with the YOLO models enables BLIP-2 to leverage YOLO's precise perceptual results for fundamental tasks such as object detection and instance segmentation, thereby facilitating deeper visual understanding and reasoning. Second, a content-aware key frame sampling mechanism based on K-Means clustering is designed, incorporating intelligent frame selection and temporal feature concatenation; this equips the lightweight BLIP-2 architecture to handle video-level interactive tasks effectively. Third, a unified prompt optimization scheme for multi-task adaptation is implemented: structured event logs from the YOLO models are injected as contextual information into BLIP-2's input and combined with output constraints that filter out technical details, effectively guiding the model to generate accurate and contextually relevant outputs across tasks.
https://arxiv.org/abs/2601.08408
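The content-aware key frame sampling above can be sketched in a few lines: cluster frame feature vectors with K-Means and keep the frame nearest each centroid. This is a minimal pure-Python illustration; the paper's feature extractor, cluster count, and exact selection rule are not specified here, so `kmeans_keyframes` and its toy inputs are assumptions.

```python
import random

def kmeans_keyframes(features, k, iters=20, seed=0):
    """Cluster frame features with K-Means; return one representative
    frame index per cluster (the frame closest to its centroid)."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, k)]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(iters):
        # assign each frame to its nearest centroid
        assign = [min(range(k), key=lambda c: dist2(f, centroids[c]))
                  for f in features]
        # move each centroid to the mean of its assigned frames
        for c in range(k):
            members = [f for f, a in zip(features, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    # medoid per cluster: the actual frame nearest the final centroid
    keyframes = set()
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            keyframes.add(min(members, key=lambda i: dist2(features[i], centroids[c])))
    return sorted(keyframes)  # temporal order, ready for feature concatenation
```

The indices come back sorted, matching the temporal feature concatenation step described in the abstract.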
Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.
https://arxiv.org/abs/2601.08010
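At its simplest, CASHEW-style aggregation can be approximated by self-consistency voting over trajectories that survive a verification filter. The sketch below is an illustrative reduction, not the paper's algorithm; `verify` stands in for the visual verification step that rejects hallucinated reasoning steps.

```python
from collections import Counter

def aggregate_trajectories(candidates, verify):
    """candidates: list of (reasoning_steps, final_answer) tuples.
    verify: callable(step) -> bool; drops trajectories containing
    steps the visual verifier rejects (e.g. hallucinated objects)."""
    survivors = [ans for steps, ans in candidates
                 if all(verify(s) for s in steps)]
    if not survivors:                      # fall back to a plain majority vote
        survivors = [ans for _, ans in candidates]
    return Counter(survivors).most_common(1)[0][0]
```

The full framework aggregates trajectories iteratively into higher-quality traces; this one-shot vote only conveys the filtering-then-aggregation idea.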
Large Vision-Language Models (LVLMs) face a fundamental dilemma in video reasoning: they are caught between the prohibitive computational costs of verbose reasoning and the hallucination risks of efficient, ungrounded approaches. To resolve this, we introduce the Chain of Evidence (CoE), a novel framework that architecturally decouples and co-optimizes perceptual grounding and reasoning efficiency. CoE incorporates two core innovations: (1) A lightweight Evidence Grounding Module (EGM) that acts as a query-guided filter, dynamically identifying and extracting a compact set of high-fidelity visual evidence; and (2) An Evidence-Anchoring Protocol optimized via Reinforcement Learning. Crucially, we design a composite reward mechanism that enforces process alignment, compelling the model to strictly reference identified temporal anchors during deduction, thereby mitigating hallucinations. To enable this, we construct CoE-Instruct, a large-scale dataset (164k samples) featuring a novel dual-annotation schema for separate perception and reasoning supervision. Extensive experiments on five benchmarks, including Video-MME, MVBench, and VSI-Bench, demonstrate that CoE-enhanced models establish a new state-of-the-art. They significantly outperform existing methods in accuracy, proving CoE to be a powerful and practical paradigm for reliable video understanding.
https://arxiv.org/abs/2601.07761
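The Evidence Grounding Module acts as a query-guided filter over frames. A minimal sketch of that idea, assuming cosine similarity between precomputed frame and query embeddings (the actual EGM is a learned module, so this is only the selection pattern, not the method):

```python
import math

def select_evidence(frame_feats, query_feat, k):
    """Keep the k frames most similar (cosine) to the query, in temporal order."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

    ranked = sorted(range(len(frame_feats)),
                    key=lambda i: cos(frame_feats[i], query_feat), reverse=True)
    return sorted(ranked[:k])  # compact evidence set, temporally ordered
```

The returned indices would then serve as the temporal anchors the reasoning stage is required to reference.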
Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories: capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations, and Chapter Summaries that compose them into concise, story-centric summaries. Rather than prompting for chapters directly, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are fast-paced and information-dense, with visual tokens dominating the input sequence. To enable efficient training while reducing input tokens, we propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues. Built upon these designs, our HiVid-Narrator framework achieves superior narrative quality with fewer input tokens than existing methods.
https://arxiv.org/abs/2601.07366
While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges, including complex inter-relationships between images and critical information scattered across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer, which explicitly model the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with a Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human-cognition-inspired reasoning framework.
https://arxiv.org/abs/2601.07298
This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.
https://arxiv.org/abs/2601.07290
Multimodal Large Language Models (MLLMs) have achieved strong performance across many tasks, yet most systems remain limited to offline inference, requiring complete inputs before generating outputs. Recent streaming methods reduce latency by interleaving perception and generation, but still enforce a sequential perception-generation cycle, limiting real-time interaction. In this work, we target a fundamental bottleneck that arises when extending MLLMs to real-time video understanding: the global positional continuity constraint imposed by standard positional encoding schemes. While natural in offline inference, this constraint tightly couples perception and generation, preventing effective input-output parallelism. To address this limitation, we propose a parallel streaming framework that relaxes positional continuity through three designs: Overlapped, Group-Decoupled, and Gap-Isolated. These designs enable simultaneous perception and generation, allowing the model to process incoming inputs while producing responses in real time. Extensive experiments reveal that Group-Decoupled achieves the best efficiency-performance balance, maintaining high fluency and accuracy while significantly reducing latency. We further show that the proposed framework yields up to 2x acceleration under balanced perception-generation workloads, establishing a principled pathway toward speak-while-watching real-time systems. We make all our code publicly available: this https URL.
https://arxiv.org/abs/2601.06843
This paper introduces QCaption, a novel video captioning and Q&A pipeline that enhances video analytics by fusing three models: key frame extraction, a Large Multimodal Model (LMM) for image-text analysis, and a Large Language Model (LLM) for text analysis. This approach enables integrated analysis of text, images, and video, achieving performance improvements over existing video captioning and Q&A models, all while remaining fully self-contained and well suited to on-premises deployment. Experiments with QCaption demonstrated up to 44.2% and 48.9% improvements in video captioning and Q&A tasks, respectively. Ablation studies were also performed to assess the contribution of the LLM to the fusion results. Moreover, the paper proposes and evaluates additional video captioning approaches, benchmarking them against QCaption and existing methodologies. QCaption demonstrates the potential of a model fusion approach for advancing video analytics.
https://arxiv.org/abs/2601.06566
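The three-model fusion reads naturally as a pipeline: key frames, then per-frame LMM captions, then an LLM reasoning over the concatenated captions. A sketch with stub callables (all names and the prompt format below are placeholders, not QCaption's actual API):

```python
def qcaption_pipeline(video_frames, question,
                      extract_keyframes, lmm_caption, llm_answer):
    """Fuse three models: key frame extraction -> per-frame LMM captions
    -> LLM reasoning over the concatenated caption context."""
    keyframes = extract_keyframes(video_frames)
    captions = [lmm_caption(f) for f in keyframes]
    context = "\n".join(captions)
    return llm_answer(f"Captions:\n{context}\nQuestion: {question}")
```

Because every stage is a plain callable, each component can be swapped for a local model, which matches the abstract's emphasis on fully self-contained, on-premises deployment.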
Grounding events in videos serves as a fundamental capability in video analysis. While Vision-Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
https://arxiv.org/abs/2601.06559
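The two reward cases can be made concrete: a grounding (s, e) predicted on the reversed video maps back to forward time as (T - e, T - s), so time-insensitive events can be scored by the temporal IoU between the forward prediction and the mapped backward prediction, while time-sensitive events are scored for telling the directions apart. This is an illustrative reading of the reward, not the paper's exact formulation:

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def arrowgev_reward(pred_fwd, pred_bwd, T, time_sensitive, direction_correct):
    """pred_fwd / pred_bwd: (start, end) groundings on the forward and the
    reversed video; T: video length. Time-sensitive events are rewarded for
    discriminating direction; time-insensitive ones for consistent grounding."""
    mapped = (T - pred_bwd[1], T - pred_bwd[0])   # backward -> forward time
    if time_sensitive:
        return 1.0 if direction_correct else 0.0
    return temporal_iou(pred_fwd, mapped)         # consistency across directions
```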
Training video-language models is often prohibitively expensive due to the high cost of processing long frame sequences and the limited availability of annotated long videos. We present VideoWeave, a simple yet effective approach to improve data efficiency by constructing synthetic long-context training samples that splice together short, captioned videos from existing datasets. Rather than modifying model architectures or optimization objectives, VideoWeave reorganizes available video-text pairs to expand temporal diversity within fixed compute. We systematically study how different data composition strategies, such as random versus visually clustered splicing and caption enrichment, affect performance on downstream video question answering. Under identical compute constraints, models trained with VideoWeave achieve higher accuracy than conventional video finetuning. Our results highlight that reorganizing training data, rather than altering architectures, may offer a simple and scalable path for training video-language models. We link our code for all experiments here.
https://arxiv.org/abs/2601.06309
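Splicing short captioned clips into one synthetic long-context sample (the random-splicing variant) is straightforward to sketch; the timestamped caption format below is an assumption for illustration only:

```python
import random

def weave_sample(clips, k, seed=None):
    """clips: list of (frames, caption) pairs from an existing short-video
    dataset. Splice k randomly chosen clips into one long-context training
    sample whose caption carries per-clip frame-index spans."""
    rng = random.Random(seed)
    chosen = rng.sample(clips, k)
    frames, lines, t = [], [], 0
    for clip_frames, caption in chosen:
        start, t = t, t + len(clip_frames)
        frames.extend(clip_frames)          # concatenated frame sequence
        lines.append(f"[{start}-{t}] {caption}")
    return frames, "\n".join(lines)
```

The visually clustered variant studied in the paper would replace `rng.sample` with a sampler that draws clips from the same visual cluster.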
Long videos, ranging from minutes to hours, present significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Direct encoding of such videos is computationally too expensive, while simple video-to-text conversion often results in redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-grained structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across various scenarios. Extensive evaluations across three tasks, including QA, summarization, and retrieval, show that MMViR outperforms the prior strongest method, achieving a 19.67% improvement in hour-long video understanding while reducing processing latency to 45.4% of the original.
https://arxiv.org/abs/2601.05495
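One simple way to realize "identifying key turning points" is to cut the video wherever consecutive frame features stop resembling each other. MMViR's actual detector is more involved, so treat this as a toy stand-in with an assumed cosine-similarity threshold:

```python
import math

def segment_by_turning_points(feats, threshold=0.8):
    """Cut wherever consecutive frame features' cosine similarity drops
    below threshold; returns (start, end) index segments."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

    cuts = [i for i in range(1, len(feats)) if cos(feats[i - 1], feats[i]) < threshold]
    bounds = [0] + cuts + [len(feats)]
    return [(bounds[j], bounds[j + 1]) for j in range(len(bounds) - 1)]
```

Each resulting segment would then receive its own multi-grained description, from global narrative down to fine-grained visual detail.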
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
https://arxiv.org/abs/2601.05175
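The inference-time gate is simple: keep the direct answer when its confidence clears a threshold, otherwise fall back to explicit reasoning followed by a reviewed answer. A sketch of that control flow (the threshold value and the two callables are assumptions, not the paper's interfaces):

```python
def answer_with_gate(question, direct_answer, reason_and_review, tau=0.9):
    """Reason-when-necessary inference: direct_answer returns (answer,
    confidence); when confidence clears tau, skip reasoning entirely,
    otherwise produce a reviewed answer via explicit reasoning."""
    ans, conf = direct_answer(question)
    if conf >= tau:
        return ans                             # cheap path: no chain-of-thought
    return reason_and_review(question, ans)    # reviewed answer after reasoning
```

This gate is what yields the reported ~3.3x reduction in average response length: perception-oriented questions mostly take the cheap path, while reasoning-intensive ones trigger the thinking mode.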
Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V, etc.), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8 of 12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.
https://arxiv.org/abs/2601.04891
Biryani, one of India's most celebrated dishes, exhibits remarkable regional diversity in its preparation, ingredients, and presentation. With the growing availability of online cooking videos, there is unprecedented potential to systematically study such culinary variations using computational tools. However, existing video understanding methods fail to capture the fine-grained, multimodal, and culturally grounded differences in procedural cooking videos. This work presents the first large-scale, curated dataset of biryani preparation videos, comprising 120 high-quality YouTube recordings across 12 distinct regional styles. We propose a multi-stage framework leveraging recent advances in vision-language models (VLMs) to segment videos into fine-grained procedural units and align them with audio transcripts and canonical recipe text. Building on these aligned representations, we introduce a video comparison pipeline that automatically identifies and explains procedural differences between regional variants. We construct a comprehensive question-answer (QA) benchmark spanning multiple reasoning levels to evaluate procedural understanding in VLMs. Our approach employs multiple VLMs in complementary roles, incorporates human-in-the-loop verification for high-precision tasks, and benchmarks several state-of-the-art models under zero-shot and fine-tuned settings. The resulting dataset, comparison methodology, and QA benchmark provide a new testbed for evaluating VLMs on structured, multimodal reasoning tasks and open new directions for computational analysis of cultural heritage through cooking videos. We release all data, code, and the project website at this https URL.
https://arxiv.org/abs/2601.06198
With the rapid growth of video-centered social media, the ability to anticipate risky events from visual data is a promising direction for ensuring public safety and preventing real-world accidents. Prior work has extensively studied supervised video risk assessment across domains such as driving, protests, and natural disasters. However, many existing datasets provide models with access to the full video sequence, including the accident itself, which substantially reduces the difficulty of the task. To better reflect real-world conditions, we introduce a new video understanding benchmark, RiskCueBench, in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Experimental results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.
随着以视频为中心的社交媒体迅速发展,从视觉数据中预测风险事件的能力成为保障公共安全和预防现实世界事故的一个有前景的方向。先前的研究已经广泛研究了监督下的视频风险评估,在诸如驾驶、抗议活动和自然灾害等领域取得了进展。然而,许多现有的数据集提供了模型访问整个视频序列的机会,包括事故发生时的情况,这大大降低了任务的难度。为了更好地反映实际情况,我们引入了一个新的视频理解基准测试——RiskCueBench。在这个基准测试中,视频被仔细标注以识别风险信号片段,定义为最早表明潜在安全问题的时刻。实验结果显示,当前系统在从早期视觉信号解读不断变化的情况并预测未来可能的风险事件方面存在显著差距,这突出了将视频风险预测模型部署到实践中的重要挑战。
https://arxiv.org/abs/2601.03369
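A hypothetical scorer for this anticipation setting (not taken from the benchmark itself): an alarm raised inside the annotated risk-signal clip counts as timely anticipation, one raised before it as a false alarm, and one raised after it as late. The labels and the lead-time definition are illustrative assumptions:

```python
def score_alarm(alarm_t, signal_start, signal_end):
    """Classify a model's risk alarm against the annotated risk-signal clip
    [signal_start, signal_end]. Returns (label, lead_time), where lead_time
    is measured from the end of the signal clip for timely alarms."""
    if alarm_t < signal_start:
        return ("false_alarm", None)     # fired before any risk cue existed
    if alarm_t <= signal_end:
        return ("timely", signal_end - alarm_t)
    return ("late", None)                # fired only after the cue had passed
```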
Video Anomaly Understanding (VAU) extends traditional Video Anomaly Detection (VAD) by not only localizing anomalies but also describing and reasoning about their context. Existing VAU approaches often rely on fine-tuned multimodal large language models (MLLMs) or external modules such as video captioners, which introduce costly annotations, complex training pipelines, and high inference overhead. In this work, we introduce PrismVAU, a lightweight yet effective system for real-time VAU that leverages a single off-the-shelf MLLM for anomaly scoring, explanation, and prompt optimization. PrismVAU operates in two complementary stages: (1) a coarse anomaly scoring module that computes frame-level anomaly scores via similarity to textual anchors, and (2) an MLLM-based refinement module that contextualizes anomalies through system and user prompts. Both textual anchors and prompts are optimized with a weakly supervised Automatic Prompt Engineering (APE) framework. Extensive experiments on standard VAD benchmarks demonstrate that PrismVAU delivers competitive detection performance and interpretable anomaly explanations -- without relying on instruction tuning, frame-level annotations, and external modules or dense processing -- making it an efficient and practical solution for real-world applications.
https://arxiv.org/abs/2601.02927
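The coarse scoring stage compares frame embeddings against textual anchors. One plausible formulation, scoring each frame by its closeness to anomaly-text anchors relative to normal-text anchors (the paper's exact score and its APE-optimized anchors may differ):

```python
import math

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def anomaly_score(frame_feat, normal_anchors, anomaly_anchors):
    """Frame-level score: positive when the frame sits closer to the
    anomaly-text anchors than to the normal-text anchors."""
    return (max(cos(frame_feat, a) for a in anomaly_anchors)
            - max(cos(frame_feat, n) for n in normal_anchors))
```

Frames clearing a score threshold would then be passed to the MLLM refinement stage for contextual explanation.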
Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs still struggle to identify precise event boundaries in untrimmed videos, leaving the generated captions poorly grounded. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding. During inference, in order to properly determine the output caption sequence from an arbitrary number of events presented within a video, we introduce an event coherent sampling strategy that selects event captions with sufficient coherence across temporal events and cross-modal similarity with the given video. Through extensive experiments on benchmark datasets, we show that TA-Prompting compares favorably against state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and temporal understanding tasks including moment retrieval and temporal QA.
https://arxiv.org/abs/2601.02908
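Event coherent sampling can be sketched as a greedy selection that balances cross-modal similarity with the video against coherence with the previously chosen caption. The greedy strategy and the mixing weight `alpha` are assumptions for illustration, not the paper's procedure:

```python
def coherent_sample(candidates, video_sim, coherence, k, alpha=0.5):
    """Greedy event-caption selection: each step picks the caption with the
    best mix of cross-modal similarity to the video (video_sim) and
    coherence with the previously selected caption."""
    chosen = [max(candidates, key=video_sim)]
    pool = [c for c in candidates if c != chosen[0]]
    while pool and len(chosen) < k:
        best = max(pool, key=lambda c: alpha * video_sim(c)
                                       + (1 - alpha) * coherence(chosen[-1], c))
        chosen.append(best)
        pool.remove(best)
    return chosen
```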
Temporal action segmentation is a critical task in video understanding, where the goal is to assign action labels to each frame in a video. While recent advances leverage iterative refinement-based strategies, they fail to explicitly utilize the hierarchical nature of human actions. In this work, we propose HybridTAS, a novel framework that incorporates a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models to exploit the hierarchical structure of actions. Hyperbolic geometry naturally provides tree-like relationships between embeddings, enabling us to guide the action label denoising process in a coarse-to-fine manner: higher diffusion timesteps are influenced by abstract, high-level action categories (root nodes), while lower timesteps are refined using fine-grained action classes (leaf nodes). Extensive experiments on three benchmark datasets, GTEA, 50Salads, and Breakfast, demonstrate that our method achieves state-of-the-art performance, validating the effectiveness of hyperbolic-guided denoising for the temporal action segmentation task.
https://arxiv.org/abs/2601.01914
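The coarse-to-fine schedule described above can be illustrated with a small sketch: at high diffusion timesteps guidance comes from root-level action categories, at low timesteps from leaf-level classes. The function name, the linear mixing weight, and the 0/1 hierarchy matrix are assumptions for illustration, not the paper's actual formulation (which operates in hyperbolic space).

```python
import numpy as np

def hierarchy_guidance(t, T, coarse_logits, fine_logits, coarse_to_fine):
    """Coarse-to-fine guidance sketch for diffusion-based action denoising.

    t:               current diffusion timestep (T = noisiest, 0 = clean).
    coarse_logits:   (C,) scores over high-level action categories (roots).
    fine_logits:     (F,) scores over fine-grained action classes (leaves).
    coarse_to_fine:  (C, F) 0/1 matrix mapping each root to its leaves.

    At high timesteps the guidance is dominated by the coarse categories
    (each leaf inherits its root's score); at low timesteps the
    fine-grained scores take over.
    """
    alpha = t / T  # 1 at the noisiest step, 0 at the final step
    # lift coarse scores onto the fine label space via the hierarchy
    lifted = coarse_to_fine.T @ coarse_logits  # (F,)
    return alpha * lifted + (1.0 - alpha) * fine_logits
```

At t = T the returned guidance equals the lifted coarse scores; at t = 0 it equals the fine-grained scores, matching the coarse-to-fine intuition in the abstract.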
Recent Video Large Language Models (Video-LLMs) have shown strong multimodal reasoning capabilities, yet remain challenged by video understanding tasks that require consistent temporal ordering and causal coherence. Many parameter-efficient Video-LLMs rely on unconstrained bidirectional projectors to model inter-frame interactions, which can blur temporal ordering by allowing later frames to influence earlier representations, without explicit architectural mechanisms to respect the directional nature of video reasoning. To address this limitation, we propose V-CORE, a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. V-CORE consists of two key components: (1) Learnable Spatial Aggregation (LSA), which adaptively selects salient spatial tokens to reduce redundancy, and (2) a Causality-Aware Temporal Projector (CATP), which enforces structured unidirectional information flow via block-causal attention and a terminal dynamic summary token acting as a causal sink. This design preserves intra-frame spatial interactions while ensuring that temporal information is aggregated in a strictly ordered manner. With 4-bit QLoRA and a frozen LLM backbone, V-CORE can be trained efficiently on a single consumer GPU. Experiments show that V-CORE achieves strong performance on the challenging NExT-QA benchmark, reaching 61.2% accuracy, and remains competitive across MSVD-QA, MSRVTT-QA, and TGIF-QA, with gains concentrated in temporal and causal reasoning subcategories (+3.5% and +5.2% respectively), directly validating the importance of explicit temporal ordering constraints.
https://arxiv.org/abs/2601.01804
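The block-causal attention with a terminal summary token acting as a causal sink can be sketched as a boolean attention mask. This is an illustrative reconstruction from the abstract's description, not the paper's code: tokens within a frame attend freely, frames attend only to earlier or current frames, and the summary token reads everything while nothing reads it.

```python
import numpy as np

def block_causal_mask(num_frames: int, tokens_per_frame: int, summary: bool = True) -> np.ndarray:
    """Build a block-causal attention mask (True = attention allowed).

    Tokens within the same frame attend to each other freely (intra-frame
    spatial interaction), while frame i may only attend to frames <= i,
    enforcing unidirectional temporal information flow. An optional
    terminal summary token acts as a causal sink: it attends to all
    tokens, but no frame token attends to it.
    """
    n = num_frames * tokens_per_frame
    total = n + (1 if summary else 0)
    mask = np.zeros((total, total), dtype=bool)
    for q in range(n):
        q_frame = q // tokens_per_frame
        # allow attention to all tokens in frames 0..q_frame (block-causal)
        mask[q, : (q_frame + 1) * tokens_per_frame] = True
    if summary:
        mask[n, :] = True   # summary token reads every token (causal sink)
        mask[:n, n] = False  # no frame token reads the summary
    return mask

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
```

For 3 frames of 2 tokens each, the mask is 7x7: token 0 can see its frame-mate (token 1) but not frame 1, and only the final summary row is fully True.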
Reinforcement Learning (RL) is crucial for empowering VideoLLMs with complex spatiotemporal reasoning. However, current RL paradigms predominantly rely on random data shuffling or naive curriculum strategies based on scalar difficulty metrics. We argue that scalar metrics fail to disentangle two orthogonal challenges in video understanding: Visual Temporal Perception Load and Cognitive Reasoning Depth. To address this, we propose VideoCuRL, a novel framework that decomposes difficulty along these two axes. We employ efficient, training-free proxies (optical flow and keyframe entropy for visual complexity, Calibrated Surprisal for cognitive complexity) to map data onto a 2D curriculum grid. A competence-aware Diagonal Wavefront strategy then schedules training from base alignment to complex reasoning. Furthermore, we introduce Dynamic Sparse KL and Structured Revisiting to stabilize training against reward collapse and catastrophic forgetting. Extensive experiments show that VideoCuRL surpasses strong RL baselines on reasoning (+2.5 on VSI-Bench) and perception (+2.9 on VideoMME) tasks. Notably, VideoCuRL eliminates the prohibitive inference overhead of generation-based curricula, offering a scalable solution for robust video post-training.
https://arxiv.org/abs/2601.00887
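The Diagonal Wavefront schedule over the 2D curriculum grid can be sketched as an ordering by anti-diagonals, sweeping from the easy corner toward the hard corner. The grid axes and tie-breaking rule are assumptions for illustration; the paper's competence-aware gating of when the wave advances is omitted here.

```python
def diagonal_wavefront_order(rows, cols):
    """Order cells of a 2D curriculum grid by anti-diagonals.

    Axis 0 = visual/perception difficulty bin, axis 1 = cognitive/reasoning
    difficulty bin (hypothetical binning). Training sweeps from the easy
    corner (0, 0) toward the hard corner, so each wave raises the combined
    difficulty by one step rather than collapsing both axes into a single
    scalar.
    """
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # sort by total difficulty r + c; break ties by row for determinism
    return sorted(cells, key=lambda rc: (rc[0] + rc[1], rc[0]))

order = diagonal_wavefront_order(3, 3)
```

On a 3x3 grid, the sweep starts at (0, 0), visits the (0, 1)/(1, 0) diagonal next, and ends at (2, 2), so combined difficulty is non-decreasing across the schedule.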