Paper Reading AI Learner

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

2025-10-06 17:10:44
Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

Abstract

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: this https URL
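To make the second pillar concrete: "RL from verifiable objectives" refers to rewards that can be checked programmatically against ground truth, with no learned reward model. The sketch below is illustrative only (not a method from the survey); it assumes a hypothetical temporal-grounding task where the model emits a final answer plus a predicted time span, and scores the rollout by exact-match on the answer combined with temporal IoU on the span. All function names and weights here are made up for illustration.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def verifiable_reward(pred_answer, gt_answer, pred_span, gt_span,
                      w_answer=0.5, w_span=0.5):
    """Scalar reward computable directly from annotations.

    Combines a binary exact-match check on the textual answer with
    a continuous temporal-IoU term for the localized segment.
    """
    answer_r = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    span_r = temporal_iou(pred_span, gt_span)
    return w_answer * answer_r + w_span * span_r


# Perfect answer and perfect localization yield the maximum reward of 1.0.
print(verifiable_reward("yes", "Yes", (2.0, 6.0), (2.0, 6.0)))
```

Because every term is deterministic and cheap to evaluate, such rewards can be plugged into policy-gradient methods without the reward-hacking risks of a learned critic, which is part of their appeal for video post-training.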

URL

https://arxiv.org/abs/2510.05034

PDF

https://arxiv.org/pdf/2510.05034.pdf

