Abstract
Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical imaging, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect both their reasoning capabilities over complex real-world videos and their robustness to user prompts posed as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: this https URL.
URL
https://arxiv.org/abs/2405.03690