Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information, ranging from Image LLMs to the more complex Video LLMs. Numerous studies have illustrated their exceptional cross-modal comprehension. Recently, integrating video foundation models with large language models to build a comprehensive video understanding system has been proposed to overcome the limitations of specific pre-defined vision tasks. However, current advancements in Video LLMs tend to overlook the foundational contributions of Image LLMs, often opting for more complicated structures and a wide variety of multimodal data for pre-training, which significantly increases the associated costs. In response to these challenges, this work introduces an efficient method that strategically leverages the priors of Image LLMs, facilitating a resource-efficient transition from Image to Video LLMs. We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs, which adds a plug-and-play temporal adaptation structure within the image fusion module of Image LLMs. This adaptation extends their understanding capabilities to include temporal information, enabling the development of Video LLMs that not only surpass baseline performances but also do so with minimal instructional data and training resources. Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models, effectively building upon the foundational work of Image LLMs.
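To make the plug-and-play idea concrete, here is a minimal sketch of a temporal adapter sitting on top of an Image LLM's per-frame visual tokens; the module name, the mean-pooling choice, and all dimensions are illustrative assumptions rather than RED-VILLM's actual design.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Illustrative plug-in that augments per-frame image tokens with
    temporally pooled tokens before they reach the language model.
    (Hypothetical sketch, not the RED-VILLM implementation.)"""

    def __init__(self, dim: int):
        super().__init__()
        self.temporal_proj = nn.Linear(dim, dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, num_tokens, dim) from the image encoder
        b, t, n, d = frame_tokens.shape
        spatial = frame_tokens.mean(dim=1)              # (b, n, d): pool over time per spatial token
        temporal = frame_tokens.mean(dim=2)             # (b, t, d): pool over space per frame
        temporal = self.temporal_proj(temporal)         # lightweight learnable adaptation
        return torch.cat([spatial, temporal], dim=1)    # (b, n + t, d) tokens fed to the LLM

tokens = torch.randn(2, 8, 256, 1024)
print(TemporalAdapter(1024)(tokens).shape)  # torch.Size([2, 264, 1024])
```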
https://arxiv.org/abs/2404.11865
Pretrained vision-language models have shown effectiveness in video understanding. However, recent studies have not sufficiently leveraged essential temporal information from videos, simply averaging frame-wise representations or referencing consecutive frames. We introduce Temporally Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding that effectively and efficiently leverages comprehensive video information. We propose Temporal Contextualization (TC), a novel layer-wise temporal information infusion mechanism for video that extracts core information from each frame, interconnects relevant information across the video to summarize it into context tokens, and ultimately leverages the context tokens during the feature encoding process. Furthermore, our Video-conditional Prompting (VP) module uses the context tokens to generate informative prompts in the text modality. We conduct extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition to validate the superiority of TC-CLIP, and ablation studies on TC and VP support our design choices. Code is available at this https URL
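A toy version of the temporal contextualization step, assuming ViT-style patch tokens per frame; the saliency criterion (token L2 norm) and the soft-assignment summarization are stand-ins for TC-CLIP's actual selection and aggregation.

```python
import torch

def temporal_context_tokens(tokens: torch.Tensor, k: int = 4, num_context: int = 8) -> torch.Tensor:
    """Toy layer-wise temporal contextualization: keep the k most salient
    tokens per frame, pool them across the whole video, and summarize them
    into a small set of context tokens. Saliency here is token L2 norm, a
    stand-in for the paper's selection criterion."""
    b, t, n, d = tokens.shape
    saliency = tokens.norm(dim=-1)                          # (b, t, n)
    idx = saliency.topk(k, dim=-1).indices                  # (b, t, k)
    core = torch.gather(tokens, 2, idx.unsqueeze(-1).expand(b, t, k, d))  # (b, t, k, d)
    core = core.reshape(b, t * k, d)                        # candidate tokens from the whole video
    centers = core[:, :num_context, :]                      # crude initialization of context tokens
    assign = torch.softmax(centers @ core.transpose(1, 2) / d ** 0.5, dim=-1)  # (b, num_context, t*k)
    return assign @ core                                    # (b, num_context, d)

print(temporal_context_tokens(torch.randn(2, 16, 197, 512)).shape)  # torch.Size([2, 8, 512])
```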
https://arxiv.org/abs/2404.09490
The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas like retail, warehouse settings, and Intelligent Traffic Systems (ITS), presenting significant research opportunities. The 2024 edition featured five tracks, attracting unprecedented interest from 726 teams in 47 countries and regions. Track 1 dealt with multi-target multi-camera (MTMC) people tracking, highlighting significant increases in camera count, character number, 3D annotation, and camera matrices, alongside new rules for 3D tracking and encouragement of online tracking algorithms. Track 2 introduced dense video captioning for traffic safety, focusing on pedestrian accidents and using multi-camera feeds to improve insights for insurance and prevention. Track 3 required teams to classify driver actions in a naturalistic driving analysis. Track 4 explored fish-eye camera analytics using the FishEye8K dataset. Track 5 focused on detecting motorcycle helmet rule violations. The challenge used two leaderboards to showcase methods, with participants setting new benchmarks, some surpassing existing state-of-the-art results.
https://arxiv.org/abs/2404.09432
Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, there has been insufficient research on 2D hand pose for egocentric action recognition, despite the availability of user-friendly smart glasses on the market capable of capturing a single RGB image. Our study aims to fill this research gap by exploring 2D hand pose estimation for egocentric action recognition, making two contributions. Firstly, we introduce two novel approaches for 2D hand pose estimation, namely EffHandNet for single-hand estimation and EffHandEgoNet, tailored for an egocentric perspective and capturing interactions between hands and objects. Both methods outperform state-of-the-art models on the H2O and FPHA public benchmarks. Secondly, we present a robust action recognition architecture built on 2D hand and object poses, which incorporates EffHandEgoNet and a transformer-based action recognition module. Evaluated on the H2O and FPHA datasets, our architecture has a faster inference time and achieves accuracies of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that using 2D skeletal data is a robust approach for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach and how each input affects overall performance.
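As a rough illustration of recognizing actions from 2D skeletal input, the sketch below embeds per-frame hand keypoints and runs a temporal transformer encoder; the layer sizes, the 42-keypoint assumption (two 21-point hands), and the class count are placeholders, not the EffHandEgoNet-based architecture.

```python
import torch
import torch.nn as nn

class PoseActionTransformer(nn.Module):
    """Minimal sketch of action recognition from per-frame 2D keypoints:
    embed flattened keypoints per frame, run a temporal transformer encoder,
    and classify the mean token. (Illustrative, not the paper's model.)"""

    def __init__(self, num_keypoints: int = 42, num_classes: int = 36, dim: int = 128):
        super().__init__()
        self.embed = nn.Linear(num_keypoints * 2, dim)     # (x, y) per keypoint
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (batch, frames, num_keypoints, 2)
        b, t, k, _ = poses.shape
        x = self.embed(poses.reshape(b, t, k * 2))          # (b, t, dim)
        x = self.encoder(x)                                 # temporal reasoning over frames
        return self.head(x.mean(dim=1))                     # (b, num_classes)

logits = PoseActionTransformer()(torch.randn(2, 30, 42, 2))
print(logits.shape)  # torch.Size([2, 36])
```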
https://arxiv.org/abs/2404.09308
Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems. Most existing methods focus only on locating traffic event segments, which severely lack descriptive details related to the behaviour and context of all the subjects of interest in the events. In this paper, we present TrafficVLM, a novel multi-modal dense video captioning model for the vehicle ego-camera view. TrafficVLM models traffic video events at different levels of analysis, both spatially and temporally, and generates long, fine-grained descriptions of the vehicle and pedestrian at different phases of the event. We also propose a conditional component for TrafficVLM to control the generation outputs and a multi-task fine-tuning paradigm to enhance TrafficVLM's learning capability. Experiments show that TrafficVLM performs well on both vehicle and overhead camera views. Our solution achieved outstanding results in Track 2 of the AI City Challenge 2024, ranking third in the challenge standings. Our code is publicly available at this https URL.
https://arxiv.org/abs/2404.09275
Video moment retrieval and highlight detection are two highly valuable tasks in video understanding, and only recently have they been studied jointly. Although existing studies have made impressive advances, they predominantly follow the data-driven bottom-up paradigm. Such a paradigm overlooks task-specific and inter-task effects, resulting in poor model performance. In this paper, we propose a novel task-driven top-down framework, TaskWeave, for joint moment retrieval and highlight detection. The framework introduces a task-decoupled unit to capture task-specific and common representations. To investigate the interplay between the two tasks, we propose an inter-task feedback mechanism, which transforms the results of one task into guiding masks that assist the other task. Different from existing methods, we present a task-dependent joint loss function to optimize the model. Comprehensive experiments and in-depth ablation studies on the QVHighlights, TVSum, and Charades-STA datasets corroborate the effectiveness and flexibility of the proposed framework. Codes are available at this https URL.
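One plausible reading of the inter-task feedback mechanism, sketched below: per-clip saliency scores from highlight detection gate the clip features consumed by moment retrieval. The gating form and all names are assumptions for illustration, not TaskWeave's exact formulation.

```python
import torch

def feedback_mask(saliency: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
    """Toy inter-task feedback: highlight-detection scores become a soft gate
    over clip features before the moment-retrieval branch consumes them."""
    gate = torch.sigmoid(saliency).unsqueeze(-1)   # (batch, clips, 1), values in [0, 1]
    return features * (1.0 + gate)                 # emphasize clips the other task found salient

feats = torch.randn(2, 75, 256)        # 75 video clips, 256-d features
sal = torch.randn(2, 75)               # highlight-detection scores for the same clips
print(feedback_mask(sal, feats).shape)  # torch.Size([2, 75, 256])
```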
https://arxiv.org/abs/2404.09263
This paper introduces our solution for Track 2 of the AI City Challenge 2024. The task aims at traffic safety description and analysis using the Woven Traffic Safety (WTS) dataset, a real-world Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding. Our solution mainly focuses on the following points: 1) To address dense video captioning, we leverage the framework of dense video captioning with parallel decoding (PDVC) to model visual-language sequences and generate dense captions by chapter for each video. 2) Our work leverages CLIP to extract visual features to more efficiently perform cross-modality training between visual and textual representations. 3) We conduct domain-specific model adaptation to mitigate the domain-shift problem that poses a recognition challenge in video understanding. 4) Moreover, we leverage BDD-5K captioned videos to conduct knowledge transfer for better understanding of WTS videos and more accurate captioning. Our solution achieved 6th place on the test set of the competition. The open-source code will be available at this https URL
https://arxiv.org/abs/2404.08229
Dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has received significant attention. Several studies formulate dense video captioning as a multitask problem of event localization and event captioning to account for inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework inspired by human cognitive information processing. Our model utilizes external memory to incorporate prior knowledge, and we propose a memory retrieval method based on cross-modal video-to-text matching. To effectively incorporate the retrieved text features, we design a versatile encoder and a decoder with visual and textual cross-attention modules. Comparative experiments on the ActivityNet Captions and YouCook2 datasets show the effectiveness of the proposed method. Experimental results show promising performance of our model without extensive pretraining on a large video dataset.
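The retrieval step can be pictured as nearest-neighbor search in a shared embedding space; the following sketch ranks a pre-encoded sentence memory by cosine similarity to a pooled video embedding and returns the top-k entries, with all names and dimensions assumed for illustration.

```python
import torch
import torch.nn.functional as F

def retrieve_memory(video_feat: torch.Tensor, memory_text: torch.Tensor, top_k: int = 5) -> torch.Tensor:
    """Sketch of retrieval from an external text memory by cross-modal matching:
    rank stored sentence embeddings by cosine similarity to the video embedding
    and return the top-k entries for the captioning decoder to attend over."""
    sims = F.cosine_similarity(video_feat.unsqueeze(1), memory_text.unsqueeze(0), dim=-1)  # (B, M)
    idx = sims.topk(top_k, dim=-1).indices                                                 # (B, top_k)
    return memory_text[idx]                                                                # (B, top_k, D)

video = torch.randn(4, 512)            # pooled video embeddings
memory = torch.randn(10000, 512)       # pre-encoded sentence memory
print(retrieve_memory(video, memory).shape)  # torch.Size([4, 5, 512])
```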
https://arxiv.org/abs/2404.07610
In the Massive Open Online Courses (MOOC) learning scenario, the semantic information of instructional videos has a crucial impact on learners' emotional states: learners mainly acquire knowledge by watching instructional videos, and the semantic information in the videos directly affects how they feel. However, few studies have paid attention to this potential influence. To explore the impact of video semantic information on learners' emotions in depth, this paper proposes a multimodal emotion recognition method that fuses video semantic information and physiological signals. We generate video descriptions through a pre-trained large language model (LLM) to obtain high-level semantic information about instructional videos. Using a cross-attention mechanism for modal interaction, the semantic information is fused with eye movement and photoplethysmography (PPG) signals to obtain features containing the critical information of the three modalities, and learners' emotional states are then recognized by an emotion classifier. The experimental results show that our method significantly improves emotion recognition performance, providing a new perspective and an efficient method for emotion recognition research in MOOC learning scenarios. The proposed method not only contributes to a deeper understanding of the impact of instructional videos on learners' emotional states but also provides a useful reference for future research on emotion recognition in MOOC learning.
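A minimal sketch of the cross-attention fusion described above, assuming the three modalities have already been projected to a common dimension; the query/key-value layout, feature sizes, and four-class head are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Rough sketch: text-derived semantic features act as queries that
    cross-attend to eye-movement and PPG features, and the three streams are
    concatenated for a small emotion classifier."""

    def __init__(self, dim: int = 256, num_classes: int = 4):
        super().__init__()
        self.attn_eye = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.attn_ppg = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim * 3, num_classes)

    def forward(self, semantic, eye, ppg):
        # semantic: (B, Ls, D) LLM-generated description features
        # eye: (B, Le, D) eye-movement features; ppg: (B, Lp, D) PPG features
        sem_eye, _ = self.attn_eye(semantic, eye, eye)
        sem_ppg, _ = self.attn_ppg(semantic, ppg, ppg)
        fused = torch.cat([semantic.mean(1), sem_eye.mean(1), sem_ppg.mean(1)], dim=-1)
        return self.classifier(fused)

model = CrossModalFusion()
print(model(torch.randn(2, 32, 256), torch.randn(2, 60, 256), torch.randn(2, 60, 256)).shape)  # torch.Size([2, 4])
```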
https://arxiv.org/abs/2404.07484
Eye-tracking applications that utilize the human gaze in video understanding tasks have become increasingly important. To effectively automate video analysis based on eye-tracking data, it is important to accurately replicate human gaze behavior. However, this task presents significant challenges due to the inherent complexity and ambiguity of human gaze patterns. In this work, we introduce a novel method for simulating human gaze behavior. Our approach uses a transformer-based reinforcement learning algorithm to train an agent that acts as a human observer, with the primary role of watching videos and simulating human gaze behavior. We employ an eye-tracking dataset gathered from videos generated by the VirtualHome simulator, with a primary focus on activity recognition. Our experimental results demonstrate the effectiveness of our gaze prediction method, highlighting its capability to replicate human gaze behavior and its applicability to downstream tasks where real human gaze is used as input.
https://arxiv.org/abs/2404.07351
Humans utilize their gaze to concentrate on essential information while perceiving and interpreting intentions in videos. Incorporating human gaze into computational algorithms can significantly enhance model performance in video understanding tasks. In this work, we address a challenging and innovative task in video understanding: predicting the actions of an agent in a video based on a partial video. We introduce the Gaze-guided Action Anticipation algorithm, which builds a visual-semantic graph from the video input. Our method utilizes a Graph Neural Network to recognize the agent's intention and predict the action sequence that fulfills this intention. To assess the efficiency of our approach, we collect a dataset of household activities generated in the VirtualHome environment, accompanied by human gaze data recorded while viewing the videos. Our method outperforms state-of-the-art techniques, achieving a 7% improvement in accuracy for 18-class intention recognition. This highlights the efficiency of our method in learning important features from human gaze data.
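A bare-bones illustration of message passing over a visual-semantic graph for intention recognition; the graph construction, node features, and update rule here are hypothetical, intended only to show the overall flow from graph to 18-class logits.

```python
import torch
import torch.nn as nn

class SimpleGNN(nn.Module):
    """Minimal message-passing network over a visual-semantic graph whose nodes
    might be detected objects and gazed regions. (Placeholder sketch, not the
    paper's pipeline.)"""

    def __init__(self, dim: int = 128, num_classes: int = 18):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_nodes, dim); adj: (num_nodes, num_nodes) 0/1 adjacency
        for _ in range(2):                                   # two rounds of message passing
            messages = adj @ self.msg(node_feats)            # aggregate neighbor messages
            node_feats = self.update(messages, node_feats)   # GRU-style node update
        return self.head(node_feats.mean(dim=0))             # graph-level intention logits

nodes = torch.randn(10, 128)
adj = (torch.rand(10, 10) > 0.7).float()
print(SimpleGNN()(nodes, adj).shape)  # torch.Size([18])
```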
https://arxiv.org/abs/2404.07347
With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets. Code available at this https URL.
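A toy fixed-capacity memory bank in the spirit of the approach above: frame features arrive online, and when the bank is full the two most similar adjacent entries are merged so memory never grows with video length. The merging rule and capacity are illustrative assumptions, not the paper's exact consolidation scheme.

```python
import torch

class FrameMemoryBank:
    """Toy fixed-capacity memory for online video processing: append new frame
    features and, when over capacity, merge the two most similar adjacent
    entries so old content is compressed rather than dropped."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.memory = []  # list of (dim,) feature tensors

    def add(self, feat: torch.Tensor) -> None:
        self.memory.append(feat)
        if len(self.memory) > self.capacity:
            sims = [torch.cosine_similarity(a, b, dim=0)
                    for a, b in zip(self.memory[:-1], self.memory[1:])]
            i = int(torch.stack(sims).argmax())              # most redundant neighboring pair
            merged = (self.memory[i] + self.memory[i + 1]) / 2
            self.memory[i:i + 2] = [merged]                  # merge instead of discarding

    def read(self) -> torch.Tensor:
        return torch.stack(self.memory)                      # (<= capacity, dim) context for the LLM

bank = FrameMemoryBank(capacity=8)
for _ in range(100):
    bank.add(torch.randn(512))
print(bank.read().shape)  # torch.Size([8, 512])
```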
https://arxiv.org/abs/2404.05726
Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can be often recognized by appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task: video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
https://arxiv.org/abs/2404.04565
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight and self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
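A speculative sketch of key-frame-conditioned queries: a small set of learnable queries is first conditioned on sparse key-frame tokens and then cross-attends to the full segment, yielding a compact representation for the frozen vLLM. Module names, sizes, and the two-stage attention layout are assumptions, not Koala's actual tokenizers.

```python
import torch
import torch.nn as nn

class KeyFrameConditionedQueries(nn.Module):
    """Sketch of key-frame conditioning: learnable queries attend first to key
    frame tokens, then to the full segment's tokens. (Illustrative only.)"""

    def __init__(self, dim: int = 768, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cond_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.segment_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, key_frame_tokens: torch.Tensor, segment_tokens: torch.Tensor) -> torch.Tensor:
        # key_frame_tokens: (B, Nk, dim); segment_tokens: (B, Ns, dim)
        q = self.queries.unsqueeze(0).expand(key_frame_tokens.size(0), -1, -1)
        q, _ = self.cond_attn(q, key_frame_tokens, key_frame_tokens)   # condition on key frames
        out, _ = self.segment_attn(q, segment_tokens, segment_tokens)  # read the full segment
        return out                                                     # (B, num_queries, dim)

m = KeyFrameConditionedQueries()
print(m(torch.randn(2, 128, 768), torch.randn(2, 2048, 768)).shape)  # torch.Size([2, 32, 768])
```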
https://arxiv.org/abs/2404.04346
Open-world video instance segmentation is an important video understanding task. Yet most methods either operate in a closed-world setting, require additional user input, or use classic region-based proposals to identify never-before-seen objects. Further, these methods only assign a one-word label to detected objects and do not generate rich object-centric descriptions. They also often suffer from highly overlapping predictions. To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video. For this, we introduce open-world object queries to discover never-before-seen objects without additional user input. We generate rich and descriptive object-centric captions for each detected object via a masked-attention-augmented LLM input. We introduce an inter-query contrastive loss to ensure that the object queries differ from one another. Our generalized approach matches or surpasses the state of the art on three tasks: open-world video instance segmentation on the BURST dataset, dense video object captioning on the VidSTG dataset, and closed-world video instance segmentation on the OVIS dataset.
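The inter-query contrastive loss can be illustrated as a cross-entropy over a query-to-query similarity matrix, where each query's only positive is itself; the formulation below is a common instantiation of this idea and may differ from OW-VISCap's exact loss.

```python
import torch
import torch.nn.functional as F

def inter_query_contrastive_loss(queries: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Push object queries apart: each query is its own positive and every
    other query in the same sample is a negative, so the diagonal of the
    query-similarity matrix is encouraged to dominate."""
    q = F.normalize(queries, dim=-1)                      # (B, Q, D)
    sim = q @ q.transpose(1, 2) / temperature             # (B, Q, Q) pairwise similarities
    targets = torch.arange(q.size(1), device=q.device).expand(q.size(0), -1)
    return F.cross_entropy(sim.reshape(-1, q.size(1)), targets.reshape(-1))

print(inter_query_contrastive_loss(torch.randn(2, 100, 256)))
```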
https://arxiv.org/abs/2404.03657
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the complexities of videos. Building upon the success of MiniGPT-v2, which excelled at translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks, this paper extends the model's capabilities to process a sequence of frames, enabling it to comprehend videos. MiniGPT4-Video not only considers visual content but also incorporates textual conversations, allowing the model to effectively answer queries involving both visual and textual components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks respectively. Our models and code are publicly available at this https URL
https://arxiv.org/abs/2404.03413
Empowered by Large Language Models (LLMs), recent advancements in VideoLLMs have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, making computational and memory costs affordable. Despite successfully providing an overall comprehension of video content, existing VideoLLMs still face challenges in achieving detailed understanding of videos because they overlook local information in long-term videos. To tackle this challenge, we introduce LongVLM, a straightforward yet powerful VideoLLM for long video understanding, building upon the observation that long videos often consist of sequential key events, complex actions, and camera movements. Our approach decomposes long videos into multiple short-term segments and encodes local features for each segment via a hierarchical token merging module. These features are concatenated in temporal order to maintain the storyline across sequential short-term segments. Additionally, we integrate global semantics into each local feature to enhance context understanding. In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate the superior capabilities of our model over previous state-of-the-art methods. Qualitative examples demonstrate that our model produces more precise responses for long video understanding. Code is available at this https URL.
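A compressed sketch of the encoding recipe just described: split frames into short segments, merge each segment's tokens down to a few local tokens (plain average pooling here, standing in for hierarchical token merging), concatenate segments in temporal order, and prepend a global token. All shapes and the pooling rule are illustrative assumptions.

```python
import torch

def encode_long_video(frame_tokens: torch.Tensor, segment_len: int = 8, keep: int = 16) -> torch.Tensor:
    """Illustrative long-video encoding: per-segment local tokens plus one
    global token. Assumes each segment's flattened token count divides
    evenly by `keep`."""
    b, t, n, d = frame_tokens.shape                        # (batch, frames, tokens per frame, dim)
    local_tokens = []
    for start in range(0, t, segment_len):
        seg = frame_tokens[:, start:start + segment_len]   # one short-term segment
        seg = seg.reshape(b, -1, d)                        # flatten the segment's tokens
        local_tokens.append(seg.reshape(b, keep, -1, d).mean(dim=2))  # crude merge to `keep` tokens
    global_token = frame_tokens.mean(dim=(1, 2)).unsqueeze(1)         # (b, 1, d) global semantics
    return torch.cat([global_token] + local_tokens, dim=1)

out = encode_long_video(torch.randn(1, 64, 256, 1024))
print(out.shape)  # torch.Size([1, 129, 1024]): 8 segments x 16 local tokens + 1 global token
```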
https://arxiv.org/abs/2404.03384
We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC), that elaborates on improving the quality of the generated event captions and their associated pseudo event boundaries from unlabeled videos. By leveraging the capabilities of diverse large language models (LLMs), we generate rich DVC-oriented caption candidates and optimize the corresponding pseudo boundaries under several meticulously designed objectives, considering diversity, event-centricity, temporal ordering, and coherence. Moreover, we further introduce a novel online boundary refinement strategy that iteratively improves the quality of pseudo boundaries during training. Comprehensive experiments have been conducted to examine the effectiveness of the proposed technique components. By leveraging a substantial amount of unlabeled video data, such as HowTo100M, we achieve a remarkable advancement on standard DVC datasets like YouCook2 and ActivityNet. We outperform the previous state-of-the-art Vid2Seq across a majority of metrics, achieving this with just 0.4% of the unlabeled video data used for pre-training by Vid2Seq.
https://arxiv.org/abs/2404.02755
Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability -- they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding, on the challenging MAD dataset, while achieving highly competitive results on short videos.
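The scalability argument for late fusion is easy to see in code: the video tower runs once per video, the text tower once per query, and cross-modal interaction reduces to a cheap similarity at the end. The dot-product scorer below is a stand-in for SnAG's actual fusion head, with all names and sizes assumed.

```python
import torch

def late_fusion_scores(clip_feats: torch.Tensor, query_feats: torch.Tensor) -> torch.Tensor:
    """Late fusion: video clips and text queries are encoded independently,
    so adding more queries never re-encodes the video."""
    # clip_feats: (T, D) encoded once per video; query_feats: (Q, D) encoded per query
    return query_feats @ clip_feats.t()        # (Q, T) relevance of every clip to every query

video = torch.randn(4096, 512)                 # a long video, encoded a single time
queries = torch.randn(300, 512)                # hundreds of text queries reuse the same video features
print(late_fusion_scores(video, queries).shape)  # torch.Size([300, 4096])
```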
https://arxiv.org/abs/2404.02257
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: First, we propose a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos as the memory is of a fixed size. Second, we develop a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability, and significantly improves the state-of-the-art on three dense video captioning benchmarks: ActivityNet, YouCook2 and ViTT. Our code is released at this https URL.
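A toy version of a clustering-based memory of fixed size: incoming tokens are merged with the current memory and compressed back to k centroids with a few k-means steps, so memory cost stays constant no matter how long the stream runs. Initialization and assignment details are assumptions, not the paper's exact module.

```python
import torch

def cluster_memory(memory: torch.Tensor, new_tokens: torch.Tensor, k: int = 256, iters: int = 3) -> torch.Tensor:
    """Fixed-size memory via clustering: pool new tokens with the existing
    memory and compress back to k centroids with a few k-means steps."""
    points = torch.cat([memory, new_tokens], dim=0)        # (M + N, D)
    centroids = points[:k].clone()                         # simple initialization from current points
    for _ in range(iters):
        dists = torch.cdist(points, centroids)             # (M + N, k) pairwise distances
        assign = dists.argmin(dim=1)                       # nearest-centroid assignment
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(dim=0)         # recompute centroid
    return centroids                                       # new memory, always (k, D)

mem = torch.randn(256, 768)
print(cluster_memory(mem, torch.randn(128, 768)).shape)  # torch.Size([256, 768])
```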
https://arxiv.org/abs/2404.01297