Abstract
Video understanding is one of the most challenging frontiers in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought supervision, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through increased inference-time computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing challenges unique to video such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols, while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: this https URL
URL
https://arxiv.org/abs/2510.05034