Paper Reading AI Learner

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

2024-09-27 17:38:36
Heqing Zou, Tianze Luo, Guiyang Xie, Victor (Xiao Jie) Zhang, Fengmao Lv, Guangcong Wang, Juanyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang

Abstract

The integration of Large Language Models (LLMs) with visual encoders has recently shown promising performance on visual understanding tasks, leveraging their inherent capability to comprehend and generate human-like text for visual reasoning. Given the diverse nature of visual data, MultiModal Large Language Models (MM-LLMs) exhibit variations in model design and training for understanding images, short videos, and long videos. Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding. Unlike static images, short videos encompass sequential frames with both spatial and within-event temporal information, while long videos consist of multiple events with between-event and long-term temporal information. In this survey, we aim to trace and summarize the advancements of MM-LLMs from image understanding to long video understanding. We review the differences among various visual understanding tasks and highlight the challenges in long video understanding, including more fine-grained spatiotemporal details, dynamic events, and long-term dependencies. We then provide a detailed summary of the advancements in MM-LLMs in terms of model design and training methodologies for understanding long videos. Finally, we compare the performance of existing MM-LLMs on video understanding benchmarks of various lengths and discuss potential future directions for MM-LLMs in long video understanding.


URL

https://arxiv.org/abs/2409.18938

PDF

https://arxiv.org/pdf/2409.18938.pdf

