Spatiotemporal action localization in chaotic scenes is a challenging task on the way toward advanced video understanding. High-quality video feature extraction and more precise detector-predicted anchors can effectively improve model performance. To this end, we propose SFMViT, a high-performance dual-stream spatiotemporal feature extraction network with an anchor pruning strategy. The backbone of SFMViT combines ViT and SlowFast, informed by prior knowledge of spatiotemporal action localization, fully utilizing ViT's excellent global feature extraction capabilities and SlowFast's spatiotemporal sequence modeling capabilities. In addition, we introduce a confidence maximum heap to prune the anchors detected in each frame and retain only the effective ones. These designs enable SFMViT to achieve an mAP of 26.62% on the Chaotic World dataset, far exceeding existing models. Code is available at this https URL.
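The abstract does not detail the pruning procedure; as a rough sketch, keeping only the top-k anchors of a frame with a heap (the usual realization of a confidence-based top-k filter; the `prune_anchors` name and the `(confidence, box)` layout are hypothetical) might look like:

```python
import heapq

def prune_anchors(anchors, k):
    """Keep the k highest-confidence anchors of one frame.

    `anchors` holds (confidence, box) tuples; a size-k min-heap avoids
    sorting the full candidate set.
    """
    heap = []  # min-heap over confidence; its root is the weakest kept anchor
    for conf, box in anchors:
        if len(heap) < k:
            heapq.heappush(heap, (conf, box))
        elif conf > heap[0][0]:
            heapq.heapreplace(heap, (conf, box))
    return sorted(heap, reverse=True)  # strongest first
```

The min-heap of size k is the standard dual of a "confidence maximum heap" for this purpose: evicting the weakest kept anchor costs O(log k) per candidate.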
https://arxiv.org/abs/2404.16609
Video anomaly detection (VAD) is a challenging task that aims to recognize anomalies in video frames, and existing large-scale VAD research primarily focuses on road traffic and human activity scenes. Industrial scenes often exhibit a variety of unpredictable anomalies, where VAD methods can play a significant role. However, owing to privacy and security concerns, there is a lack of applicable datasets and methods tailored to industrial production scenarios. To bridge this gap, we propose IPAD, a new dataset specifically designed for VAD in industrial scenarios. The industrial processes in our dataset were chosen through on-site factory research and discussions with engineers. The dataset covers 16 different industrial devices and contains over 6 hours of both synthetic and real-world video footage. Moreover, we annotate the key feature of industrial processes, i.e., periodicity. Based on the proposed dataset, we introduce a period memory module and a sliding window inspection mechanism to effectively exploit the periodic information within a basic reconstruction model. Our framework also leverages a LoRA adapter to explore the effective migration of pretrained models, initially trained on synthetic data, into real-world scenarios. Our proposed dataset and method will fill the gap in industrial video anomaly detection and advance both video understanding tasks and smart factory deployment.
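The paper's sliding window inspection mechanism is not specified in the abstract; a minimal stand-in that averages per-frame reconstruction errors over a window, so a sustained break of the learned periodic pattern outscores a single noisy frame, could look like this (function name and inputs hypothetical):

```python
def sliding_window_scores(frame_errors, window):
    """Average per-frame reconstruction errors over a sliding window so a
    sustained deviation from the learned period stands out more than one
    noisy frame."""
    return [sum(frame_errors[i:i + window]) / window
            for i in range(len(frame_errors) - window + 1)]
```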
https://arxiv.org/abs/2404.15033
In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning with superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One possible solution is multi-task learning, where narrative language and evaluative information are predicted separately. However, this approach degrades performance on the individual tasks because of the variations between tasks and the difference in modality between language information and evaluation information. To address this, we propose a prompt-guided multimodal interaction framework. This framework utilizes a pair of transformers to facilitate interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task, thus enabling task interactivity. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narration. Additionally, we establish benchmarks for NAE. Extensive experimental results show that our method outperforms both separate learning methods and naive multi-task learning methods. Data and code are released at \href{this https URL }{here}.
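As a toy illustration of recasting score regression as video-text matching, one can embed candidate prompts such as "the score is s" and pick the score whose prompt best matches the video; this is only a sketch with hypothetical embeddings and names, not the paper's framework:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def score_by_matching(video_emb, prompt_embs):
    """Return the candidate score whose prompt embedding best matches the
    video embedding. `prompt_embs` maps scores to embeddings of prompts
    like "the score is <s>"."""
    return max(prompt_embs, key=lambda s: cosine(video_emb, prompt_embs[s]))
```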
https://arxiv.org/abs/2404.14471
Automatic movie narration aims to create video-aligned plot descriptions to assist visually impaired audiences. It differs from standard video captioning in that it requires not only describing key visual details but also inferring the plot developed across multiple movie shots, posing unique and ongoing challenges. To advance the development of automatic movie narrating systems, we first revisit the limitations of existing datasets and develop a large-scale, bilingual movie narration dataset, Movie101v2. Second, taking into account the essential difficulties of achieving applicable movie narration, we break the long-term goal into three progressive stages and tentatively focus on the initial stages, which feature understanding within individual clips. We also introduce a new narration assessment aligned with our staged task goals. Third, using our new dataset, we benchmark several leading large vision-language models, including GPT-4V, and conduct in-depth investigations into the challenges current models face in movie narration generation. Our findings reveal that achieving applicable movie narration generation is a fascinating goal that requires thorough research.
https://arxiv.org/abs/2404.13370
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information, ranging from Image LLMs to the more complex Video LLMs. Numerous studies have illustrated their exceptional cross-modal comprehension. Recently, integrating video foundation models with large language models to build comprehensive video understanding systems has been proposed to overcome the limitations of specific pre-defined vision tasks. However, current advancements in Video LLMs tend to overlook the foundational contributions of Image LLMs, often opting for more complicated structures and a wide variety of multimodal data for pre-training. This approach significantly increases the costs associated with these models. In response to these challenges, this work introduces an efficient method that strategically leverages the priors of Image LLMs, facilitating a resource-efficient transition from Image to Video LLMs. We propose RED-VILLM, a Resource-Efficient Development pipeline for building Video LLMs from Image LLMs, which adds a plug-and-play temporal adaptation structure within the image fusion module of Image LLMs. This adaptation extends their understanding capabilities to include temporal information, enabling the development of Video LLMs that not only surpass baseline performances but do so with minimal instructional data and training resources. Our approach highlights the potential for more cost-effective and scalable advancement of multimodal models, effectively building upon the foundational work of Image LLMs.
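The plug-and-play temporal adaptation is not specified in the abstract; the simplest conceivable stand-in averages token sequences across frames so that an image-trained fusion module receives a single token sequence for the whole clip (function name and data layout are assumptions):

```python
def temporal_pool(frame_tokens):
    """Average each token position across frames so an image-trained
    fusion module sees one token sequence for the whole clip.
    `frame_tokens`: per-frame lists of token vectors."""
    n_frames = len(frame_tokens)
    n_tokens = len(frame_tokens[0])
    dim = len(frame_tokens[0][0])
    pooled = [[0.0] * dim for _ in range(n_tokens)]
    for frame in frame_tokens:
        for t in range(n_tokens):
            for d in range(dim):
                pooled[t][d] += frame[t][d] / n_frames
    return pooled
```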
https://arxiv.org/abs/2404.11865
Pretrained vision-language models have shown effectiveness in video understanding. However, recent studies have not sufficiently leveraged essential temporal information from videos, simply averaging frame-wise representations or referencing consecutive frames. We introduce Temporally Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding that effectively and efficiently leverages comprehensive video information. We propose Temporal Contextualization (TC), a novel layer-wise temporal information infusion mechanism for video that extracts core information from each frame, interconnects relevant information across the video to summarize it into context tokens, and ultimately leverages the context tokens during the feature encoding process. Furthermore, our Video-conditional Prompting (VP) module processes context tokens to generate informative prompts in the text modality. We conduct extensive experiments on zero-shot, few-shot, base-to-novel, and fully-supervised action recognition to validate the superiority of TC-CLIP. Ablation studies on TC and VP validate our design choices. Code is available at this https URL
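As a crude, hypothetical stand-in for TC's summarization into context tokens (the real mechanism is learned), one might select the most salient tokens of each frame by vector norm and pool them across the video:

```python
import math

def context_tokens(frame_tokens, k):
    """Keep the k largest-norm tokens of each frame as its "core"
    information and pool them across the video into context tokens."""
    summary = []
    for tokens in frame_tokens:
        ranked = sorted(tokens,
                        key=lambda t: math.sqrt(sum(x * x for x in t)),
                        reverse=True)
        summary.extend(ranked[:k])
    return summary
```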
https://arxiv.org/abs/2404.09490
The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas like retail, warehouse settings, and Intelligent Traffic Systems (ITS), presenting significant research opportunities. The 2024 edition featured five tracks, attracting unprecedented interest from 726 teams in 47 countries and regions. Track 1 dealt with multi-target multi-camera (MTMC) people tracking, highlighting significant enhancements in camera count, character number, 3D annotation, and camera matrices, alongside new rules for 3D tracking and encouragement of online tracking algorithms. Track 2 introduced dense video captioning for traffic safety, focusing on pedestrian accidents and using multi-camera feeds to improve insights for insurance and prevention. Track 3 required teams to classify driver actions in a naturalistic driving analysis. Track 4 explored fish-eye camera analytics using the FishEye8K dataset. Track 5 focused on detecting violations of the motorcycle helmet rule. The challenge utilized two leaderboards to showcase methods, with participants setting new benchmarks, some surpassing existing state-of-the-art achievements.
https://arxiv.org/abs/2404.09432
Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, there has been insufficient research on understanding 2D hand pose for egocentric action recognition, despite the availability of user-friendly smart glasses on the market capable of capturing a single RGB image. Our study aims to fill this research gap by exploring 2D hand pose estimation for egocentric action recognition, making two contributions. First, we introduce two novel approaches for 2D hand pose estimation: EffHandNet for single-hand estimation and EffHandEgoNet, tailored to the egocentric perspective and capturing interactions between hands and objects. Both methods outperform state-of-the-art models on the H2O and FPHA public benchmarks. Second, we present a robust action recognition architecture built on 2D hand and object poses, which incorporates EffHandEgoNet and a transformer-based action recognition method. Evaluated on the H2O and FPHA datasets, our architecture has a faster inference time and achieves accuracies of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that using 2D skeletal data is a robust approach for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach and how each input affects the overall performance.
https://arxiv.org/abs/2404.09308
Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems. Most existing methods only focus on locating traffic event segments, which severely lack descriptive details related to the behaviour and context of all the subjects of interest in the events. In this paper, we present TrafficVLM, a novel multi-modal dense video captioning model for vehicle ego camera view. TrafficVLM models traffic video events at different levels of analysis, both spatially and temporally, and generates long fine-grained descriptions for the vehicle and pedestrian at different phases of the event. We also propose a conditional component for TrafficVLM to control the generation outputs and a multi-task fine-tuning paradigm to enhance TrafficVLM's learning capability. Experiments show that TrafficVLM performs well on both vehicle and overhead camera views. Our solution achieved outstanding results in Track 2 of the AI City Challenge 2024, ranking us third in the challenge standings. Our code is publicly available at this https URL.
https://arxiv.org/abs/2404.09275
Video moment retrieval and highlight detection are two highly valuable tasks in video understanding, and only recently have they been studied jointly. Although existing studies have made impressive advances, they predominantly follow the data-driven bottom-up paradigm, which overlooks task-specific and inter-task effects and results in poor model performance. In this paper, we propose TaskWeave, a novel task-driven top-down framework for joint moment retrieval and highlight detection. The framework introduces a task-decoupled unit to capture task-specific and common representations. To investigate the interplay between the two tasks, we propose an inter-task feedback mechanism, which transforms the results of one task into guiding masks that assist the other task. Unlike existing methods, we present a task-dependent joint loss function to optimize the model. Comprehensive experiments and in-depth ablation studies on the QVHighlights, TVSum, and Charades-STA datasets corroborate the effectiveness and flexibility of the proposed framework. Codes are available at this https URL.
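The inter-task feedback idea (one task's results becoming guiding masks for the other) might be sketched as follows; the names and the simple thresholded soft mask stand in for the learned transformation and are not the paper's exact design:

```python
def guiding_mask(highlight_scores, threshold=0.5):
    """Turn one task's per-clip highlight scores into a soft mask for the
    other task: confident clips pass through, weak clips are attenuated
    rather than dropped."""
    return [1.0 if s >= threshold else s for s in highlight_scores]

def apply_mask(clip_feats, mask):
    """Re-weight per-clip features with the guiding mask."""
    return [[x * m for x in feat] for feat, m in zip(clip_feats, mask)]
```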
https://arxiv.org/abs/2404.09263
This paper introduces our solution for Track 2 of the AI City Challenge 2024. The task aims at traffic safety description and analysis using the Woven Traffic Safety (WTS) dataset, a real-world pedestrian-centric traffic video dataset for fine-grained spatial-temporal understanding. Our solution focuses on the following points: 1) To address dense video captioning, we leverage the framework of dense video captioning with parallel decoding (PDVC) to model visual-language sequences and generate dense, chapter-wise captions for each video. 2) We leverage CLIP to extract visual features and perform cross-modality training between visual and textual representations more efficiently. 3) We conduct domain-specific model adaptation to mitigate the domain shift that poses a recognition challenge in video understanding. 4) Moreover, we leverage BDD-5K captioned videos for knowledge transfer, yielding a better understanding of WTS videos and more accurate captioning. Our solution achieved 6th place in the competition on the test set. The open-source code will be available at this https URL
https://arxiv.org/abs/2404.08229
Dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has received significant research attention. Several studies formulate dense video captioning as a multi-task problem of event localization and event captioning to consider inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework inspired by human cognitive information processing. Our model utilizes external memory to incorporate prior knowledge, with a memory retrieval method based on cross-modal video-to-text matching. To effectively incorporate the retrieved text features, we design a versatile encoder and a decoder with visual and textual cross-attention modules. Comparative experiments on the ActivityNet Captions and YouCook2 datasets show the effectiveness of the proposed method. Experimental results show promising performance of our model without extensive pretraining on a large video dataset.
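A minimal sketch of retrieving prior knowledge from an external text memory by cross-modal matching (cosine similarity between a video feature and stored text embeddings; all names and the memory layout are hypothetical) could be:

```python
import math

def retrieve(video_feat, memory):
    """Return the memory text whose stored embedding is most similar to
    the video feature. `memory` is a list of (text, embedding) pairs
    standing in for the external knowledge store."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm
    return max(memory, key=lambda item: cos(video_feat, item[1]))[0]
```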
https://arxiv.org/abs/2404.07610
In Massive Open Online Course (MOOC) learning scenarios, learners mainly acquire knowledge by watching instructional videos, and the semantic information in those videos directly affects learners' emotional states. However, few studies have examined this potential influence. To explore the impact of video semantic information on learners' emotions in depth, this paper proposes a multimodal emotion recognition method that fuses video semantic information with physiological signals. We generate video descriptions through a pretrained large language model (LLM) to obtain high-level semantic information about the instructional videos. Using a cross-attention mechanism for modal interaction, the semantic information is fused with eye movement and PhotoPlethysmoGraphy (PPG) signals to obtain features containing the critical information of the three modalities, and an emotion classifier then recognizes learners' emotional states. Experimental results show that our method significantly improves emotion recognition performance, providing a new perspective and an efficient method for emotion recognition research in MOOC learning scenarios, as well as a deeper understanding of how instructional videos affect learners' emotional states.
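A bare-bones, single-query version of cross-attention for modal interaction (scaled dot-product scores over the keys followed by a softmax, then a weighted sum of the values) can be written in plain Python; this is a generic sketch, not the paper's module:

```python
import math

def cross_attention(query, keys, values):
    """Single-head, single-query attention: e.g. a semantic feature
    (query) attends over physiological-signal features (keys/values)."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]
```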
https://arxiv.org/abs/2404.07484
Eye-tracking applications that utilize the human gaze in video understanding tasks have become increasingly important. To effectively automate the process of video analysis based on eye-tracking data, it is important to accurately replicate human gaze behavior. However, this task presents significant challenges due to the inherent complexity and ambiguity of human gaze patterns. In this work, we introduce a novel method for simulating human gaze behavior. Our approach uses a transformer-based reinforcement learning algorithm to train an agent that acts as a human observer, whose primary role is to watch videos and simulate human gaze behavior. We employ an eye-tracking dataset gathered from videos generated by the VirtualHome simulator, with a primary focus on activity recognition. Our experimental results demonstrate the effectiveness of our gaze prediction method by highlighting its capability to replicate human gaze behavior and its applicability to downstream tasks where real human gaze is used as input.
https://arxiv.org/abs/2404.07351
Humans utilize their gaze to concentrate on essential information while perceiving and interpreting intentions in videos. Incorporating human gaze into computational algorithms can significantly enhance model performance in video understanding tasks. In this work, we address a challenging and innovative task in video understanding: predicting the actions of an agent in a video based on a partial video. We introduce the Gaze-guided Action Anticipation algorithm, which establishes a visual-semantic graph from the video input. Our method utilizes a Graph Neural Network to recognize the agent's intention and predict the action sequence that fulfills this intention. To assess the efficiency of our approach, we collect a dataset of household activities generated in the VirtualHome environment, accompanied by human gaze data recorded while viewing the videos. Our method outperforms state-of-the-art techniques, achieving a 7% improvement in accuracy for 18-class intention recognition. This highlights the efficiency of our method in learning important features from human gaze data.
https://arxiv.org/abs/2404.07347
With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets. Code available at this https URL.
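The abstract's online memory bank could be sketched as a fixed-capacity store that, once full, merges the two most similar adjacent entries instead of discarding history; the averaging-based merging rule here is an illustrative assumption, not the paper's exact scheme:

```python
class MemoryBank:
    """Fixed-capacity store for features of frames processed online; once
    full, the two most similar adjacent entries are averaged so history
    is compressed instead of dropped."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.feats = []

    def add(self, feat):
        self.feats.append(feat)
        if len(self.feats) > self.capacity:
            # index of the adjacent pair with the smallest squared L2 gap
            i = min(range(len(self.feats) - 1),
                    key=lambda j: sum((a - b) ** 2 for a, b in
                                      zip(self.feats[j], self.feats[j + 1])))
            merged = [(a + b) / 2
                      for a, b in zip(self.feats[i], self.feats[i + 1])]
            self.feats[i:i + 2] = [merged]
```

Because the bank never grows past its capacity, the LLM context length and GPU memory stay bounded no matter how long the video runs.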
https://arxiv.org/abs/2404.05726
Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can often be recognized by appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address these issues, we propose a new video visual relation detection task, video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this task, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques for video visual relation detection.
https://arxiv.org/abs/2404.04565
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight and self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
https://arxiv.org/abs/2404.04346
Open-world video instance segmentation is an important video understanding task. Yet most methods either operate in a closed-world setting, require additional user input, or use classic region-based proposals to identify never-before-seen objects. Further, these methods only assign a one-word label to detected objects and do not generate rich object-centric descriptions. They also often suffer from highly overlapping predictions. To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video. For this, we introduce open-world object queries to discover never-before-seen objects without additional user input. We generate rich and descriptive object-centric captions for each detected object via a masked-attention-augmented LLM input. We introduce an inter-query contrastive loss to ensure that the object queries differ from one another. Our generalized approach matches or surpasses the state of the art on three tasks: open-world video instance segmentation on the BURST dataset, dense video object captioning on the VidSTG dataset, and closed-world video instance segmentation on the OVIS dataset.
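The inter-query contrastive loss is not given explicitly in the abstract; one simplified stand-in penalizes the mean pairwise cosine similarity between object queries so that each query specializes on a different object (a hypothetical formulation):

```python
import math

def inter_query_loss(queries):
    """Mean pairwise cosine similarity between object queries, clamped at
    zero: similar queries are penalized so they differ from one another."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm
    n = len(queries)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(max(0.0, cos(queries[i], queries[j])) for i, j in pairs) / len(pairs)
```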
https://arxiv.org/abs/2404.03657
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the complexities of videos. Building upon the success of MiniGPT-v2, which excelled at translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks, this paper extends the model's capabilities to process a sequence of frames, enabling it to comprehend videos. MiniGPT4-Video not only considers visual content but also incorporates textual conversations, allowing the model to effectively answer queries involving both visual and text components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks, respectively. Our models and code are publicly available at this https URL
https://arxiv.org/abs/2404.03413