Abstract
The integration of Large Language Models (LLMs) with visual encoders has recently shown promising performance on visual understanding tasks, leveraging the LLMs' inherent ability to comprehend and generate human-like text for visual reasoning. Given the diverse nature of visual data, MultiModal Large Language Models (MM-LLMs) vary in model design and training for understanding images, short videos, and long videos. Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding. Unlike static images, short videos encompass sequential frames with both spatial and within-event temporal information, while long videos consist of multiple events with between-event and long-term temporal information. In this survey, we aim to trace and summarize the advancements of MM-LLMs from image understanding to long video understanding. We review the differences among various visual understanding tasks and highlight the challenges in long video understanding, including more fine-grained spatiotemporal details, dynamic events, and long-term dependencies. We then provide a detailed summary of advancements in MM-LLMs, in terms of model design and training methodologies, for understanding long videos. Finally, we compare the performance of existing MM-LLMs on benchmarks covering videos of various lengths and discuss potential future directions for MM-LLMs in long video understanding.
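The architecture the abstract alludes to, a visual encoder whose features are projected into an LLM's token space so the LLM can reason jointly over frames and text, can be made concrete with a minimal sketch. Everything below (the class name, the linear projector, the tensor shapes, the HuggingFace-style inputs_embeds call) is a generic illustration of the common MM-LLM pattern, not code from the surveyed paper.

import torch
import torch.nn as nn

class MMLLMSketch(nn.Module):
    """Generic MM-LLM pattern: vision encoder -> projector -> LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g., a ViT, often kept frozen
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps visual features into LLM token space
        self.llm = llm                                   # any decoder accepting embedded inputs

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor):
        # frames: (batch, num_frames, C, H, W). For long videos num_frames grows
        # with the number of events, which is the core scaling challenge the
        # survey highlights (between-event and long-term temporal information).
        b, t = frames.shape[:2]
        feats = self.vision_encoder(frames.flatten(0, 1))      # (b*t, vision_dim)
        visual_tokens = self.projector(feats).view(b, t, -1)   # (b, t, llm_dim)
        # Prepend visual tokens to the text embeddings so the LLM attends over
        # both modalities in one sequence (assumes an inputs_embeds interface).
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)

Under this sketch, one visual token per frame is already optimistic: most image-level MM-LLMs emit dozens to hundreds of tokens per frame, which is why long-video designs add token compression or memory mechanisms, the design axis the survey summarizes.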
URL
https://arxiv.org/abs/2409.18938