Temporal localization in untrimmed videos, which aims to identify specific timestamps, is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. Existing methods in each subfield are typically designed for specific tasks and lack generalizability across domains. In this paper, we propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks. First, our approach employs a simple yet effective one-stage localization model that supports text queries as input and multiple actions as output. Second, we jointly train the video encoder and localization model in an end-to-end manner. To efficiently process long videos, we introduce temporal chunking, enabling the handling of videos with over 30k frames. Third, we find that fine-tuning pre-trained text encoders with a multi-stage training strategy further enhances text-conditioned localization. TimeLoc achieves state-of-the-art results across multiple benchmarks: +1.3% and +1.9% mAP over previous best methods on THUMOS14 and EPIC-Kitchens-100, +1.1% on Kinetics-GEBD, +2.94% mAP on QVHighlights, and significant improvements in temporal video grounding (+11.5% on TACoS and +6.7% on Charades-STA under R1@0.5). Our code and checkpoints will be released at this https URL.
https://arxiv.org/abs/2503.06526
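The temporal chunking used by TimeLoc to handle 30k+ frame videos is not detailed in the abstract; below is a minimal PyTorch sketch of how chunked encoding of very long frame sequences can be arranged, assuming a hypothetical per-frame feature encoder (`ToyFrameEncoder` and all sizes are illustrative, not TimeLoc's actual modules).

```python
import torch
import torch.nn as nn

class ChunkedVideoEncoder(nn.Module):
    """Encode very long frame sequences in fixed-size temporal chunks.

    Hypothetical sketch: `frame_encoder` maps (B, t, C, H, W) -> (B, t, D).
    Chunking bounds peak memory, which is what makes end-to-end training on
    very long videos feasible in principle.
    """
    def __init__(self, frame_encoder: nn.Module, chunk_size: int = 512):
        super().__init__()
        self.frame_encoder = frame_encoder
        self.chunk_size = chunk_size

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) with T possibly in the tens of thousands
        feats = []
        for start in range(0, frames.shape[1], self.chunk_size):
            chunk = frames[:, start:start + self.chunk_size]
            feats.append(self.frame_encoder(chunk))          # (B, t, D)
        return torch.cat(feats, dim=1)                        # (B, T, D)

# Toy usage with a stand-in per-frame encoder (mean-pool pixels + linear).
class ToyFrameEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(3, dim)
    def forward(self, x):                                      # (B, t, C, H, W)
        return self.proj(x.mean(dim=(-2, -1)))                 # (B, t, D)

encoder = ChunkedVideoEncoder(ToyFrameEncoder(), chunk_size=512)
features = encoder(torch.randn(1, 2048, 3, 8, 8))              # (1, 2048, 256)
```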
Naturalistic driving action recognition is essential for vehicle cabin monitoring systems. However, the complexity of real-world backgrounds presents significant challenges for this task, and previous approaches have struggled with practical implementation due to their limited ability to observe subtle behavioral differences and effectively learn inter-frame features from video. In this paper, we propose a novel Spatial-Temporal Perception (STP) architecture that emphasizes both temporal information and spatial relationships between key objects, incorporating a causal decoder to perform behavior recognition and temporal action localization. Without requiring multimodal input, STP directly extracts temporal and spatial distance features from RGB video clips. Subsequently, these dual features are jointly encoded by maximizing the expected likelihood across all possible permutations of the factorization order. By integrating temporal and spatial features at different scales, STP can perceive subtle behavioral changes in challenging scenarios. Additionally, we introduce a causal-aware module to explore relationships between video frame features, significantly enhancing detection efficiency and performance. We validate the effectiveness of our approach using two publicly available driver distraction detection benchmarks. The results demonstrate that our framework achieves state-of-the-art performance.
https://arxiv.org/abs/2503.04078
This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameters without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, by averaging predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000 fps.
https://arxiv.org/abs/2502.00397
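The ensembling step in the ViNet-S/ViNet-A abstract is a simple average of predicted saliency maps; here is a minimal NumPy sketch, assuming both models emit maps of the same resolution (all names are illustrative).

```python
import numpy as np

def ensemble_saliency(map_s: np.ndarray, map_a: np.ndarray) -> np.ndarray:
    """Average two predicted saliency maps (e.g. ViNet-S and ViNet-A outputs)
    and renormalize to [0, 1]. Assumes both maps share the same H x W shape."""
    fused = 0.5 * (map_s.astype(np.float32) + map_a.astype(np.float32))
    lo, hi = fused.min(), fused.max()
    return (fused - lo) / (hi - lo + 1e-8)

# Toy usage with random maps standing in for model predictions.
pred_s = np.random.rand(224, 384)
pred_a = np.random.rand(224, 384)
fused = ensemble_saliency(pred_s, pred_a)   # (224, 384), values in [0, 1]
```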
Human Action Recognition (HAR) plays a crucial role in applications such as health monitoring, smart home automation, and human-computer interaction. While HAR has been extensively studied, action summarization, which involves identifying and summarizing continuous actions, remains an emerging task. This paper introduces the novel XRF V2 dataset, designed for indoor daily activity Temporal Action Localization (TAL) and action summarization. XRF V2 integrates multimodal data from Wi-Fi signals, IMU sensors (smartphones, smartwatches, headphones, and smart glasses), and synchronized video recordings, offering a diverse collection of indoor activities from 16 volunteers across three distinct environments. To tackle TAL and action summarization, we propose the XRFMamba neural network, which excels at capturing long-term dependencies in untrimmed sensory sequences and outperforms state-of-the-art methods, such as ActionFormer and WiFiTAD. We envision XRF V2 as a valuable resource for advancing research in human action localization, action forecasting, pose estimation, multimodal foundation models pre-training, synthetic data generation, and more.
https://arxiv.org/abs/2501.19034
Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize a weakly-supervised base model to generate instance-level pseudo-labels for training the fully-supervised detection head. We argue that the noise in pseudo-labels interferes with the learning of the fully-supervised detection head, leading to significant performance degradation. Issues with noisy labels include: (1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To target these issues, we introduce a two-stage noisy label learning strategy that harnesses every potentially useful signal in the noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. In addition, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework that assigns different weights to the noisy labels for more effective training. Our model greatly outperforms the previous state-of-the-art method in both detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.
https://arxiv.org/abs/2501.11124
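The high-quality pseudo-label mining loss above weights noisy labels differently during training; the paper's exact weighting scheme is not given in the abstract, so the sketch below illustrates the general idea with a confidence-weighted per-frame BCE loss (the weighting rule, threshold, and names are assumptions).

```python
import torch
import torch.nn.functional as F

def weighted_pseudo_label_bce(logits, pseudo_labels, confidences, tau=0.7):
    """Per-frame binary cross-entropy where each pseudo-labeled frame is
    down-weighted by how confident the label generator was.

    logits, pseudo_labels, confidences: (B, T) tensors.
    Frames with confidence below `tau` contribute with reduced weight,
    a stand-in for 'mining' high-quality pseudo-labels.
    """
    per_frame = F.binary_cross_entropy_with_logits(
        logits, pseudo_labels, reduction="none")
    weights = torch.where(confidences >= tau,
                          torch.ones_like(confidences),
                          confidences / tau)         # soft down-weighting
    return (weights * per_frame).sum() / weights.sum().clamp(min=1e-6)

# Toy usage.
logits = torch.randn(2, 100)
labels = (torch.rand(2, 100) > 0.5).float()
conf = torch.rand(2, 100)
loss = weighted_pseudo_label_bce(logits, labels, conf)
```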
Detecting and interpreting operator actions, engagement, and object interactions in dynamic industrial workflows remains a significant challenge in human-robot collaboration research, especially within complex, real-world environments. Traditional unimodal methods often fall short of capturing the intricacies of these unstructured industrial settings. To address this gap, we present a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that captures realistic assembly and disassembly tasks, facilitating the evaluation of key meta-tasks such as action localization, object interaction, and engagement prediction. The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video, annotated in detail for task performance and operator behavior. Its distinctiveness lies in the integration of multiple data modalities and its emphasis on real-world, untrimmed industrial workflows, which is key for advancing research in human-robot collaboration and operator monitoring. Additionally, we propose a multimodal network that fuses RGB frames, IMU data, and skeleton sequences to predict engagement levels during industrial tasks. Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments. The dataset and code can be accessed from this https URL.
https://arxiv.org/abs/2501.05936
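The MIAM abstract describes a network fusing RGB frames, IMU data, and skeleton sequences for engagement prediction but not its exact architecture; the sketch below shows a generic late-fusion head over per-modality embeddings, with all dimensions chosen purely for illustration.

```python
import torch
import torch.nn as nn

class LateFusionEngagement(nn.Module):
    """Illustrative late-fusion head: per-modality encoders produce fixed-size
    embeddings that are concatenated and classified into engagement levels.
    The real network in the paper may fuse differently; all sizes are toy values."""
    def __init__(self, d_rgb=512, d_imu=64, d_skel=128, n_levels=3):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Linear(d_rgb, 256), nn.ReLU())
        self.imu_enc = nn.Sequential(nn.Linear(d_imu, 64), nn.ReLU())
        self.skel_enc = nn.Sequential(nn.Linear(d_skel, 128), nn.ReLU())
        self.head = nn.Linear(256 + 64 + 128, n_levels)

    def forward(self, rgb_feat, imu_feat, skel_feat):
        z = torch.cat([self.rgb_enc(rgb_feat),
                       self.imu_enc(imu_feat),
                       self.skel_enc(skel_feat)], dim=-1)
        return self.head(z)                     # (B, n_levels) engagement logits

model = LateFusionEngagement()
logits = model(torch.randn(4, 512), torch.randn(4, 64), torch.randn(4, 128))
```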
Most existing traffic video datasets, including Waymo, are structured and focus predominantly on Western traffic, which hinders global applicability. Specifically, most Asian scenarios are far more complex, involving numerous objects with distinct motions and behaviors. Addressing this gap, we present a new dataset, DAVE, designed for evaluating perception methods with high representation of Vulnerable Road Users (VRUs: e.g. pedestrians, animals, motorbikes, and bicycles) in complex and unpredictable environments. DAVE is a manually annotated dataset encompassing 16 diverse actor categories (spanning animals, humans, vehicles, etc.) and 16 action types (complex and rare cases like cut-ins, zigzag movement, U-turn, etc.), which require high reasoning ability. DAVE densely annotates actors with over 13 million bounding boxes (bboxes) carrying identity labels, and more than 1.6 million boxes are annotated with both actor identity and action/behavior details. The videos within DAVE are collected based on a broad spectrum of factors, such as weather conditions, the time of day, road scenarios, and traffic density. DAVE can benchmark video tasks like Tracking, Detection, Spatiotemporal Action Localization, Language-Visual Moment retrieval, and Multi-label Video Action Recognition. Given the critical importance of accurately identifying VRUs to prevent accidents and ensure road safety, in DAVE, vulnerable road users constitute 41.13% of instances, compared to 23.71% in Waymo. DAVE provides an invaluable resource for the development of more sensitive and accurate visual perception algorithms in the complex real world. Our experiments show that existing methods suffer degradation in performance when evaluated on DAVE, highlighting its benefit for future video recognition research.
https://arxiv.org/abs/2412.20042
Weakly supervised temporal action localization (WS-TAL) aims to localize complete action instances and categorize them using only video-level labels. Action-background ambiguity, primarily caused by background noise resulting from aggregation and intra-action variation, is a significant challenge for existing WS-TAL methods. In this paper, we introduce a hybrid multi-head attention (HMHA) module and a generalized uncertainty-based evidential fusion (GUEF) module to address the problem. The proposed HMHA effectively enhances RGB and optical flow features by filtering redundant information and adjusting their feature distribution to better align with the WS-TAL task. Additionally, the proposed GUEF adaptively eliminates the interference of background noise by fusing snippet-level evidences to refine uncertainty measurement and select superior foreground feature information, which enables the model to concentrate on integral action instances to achieve better action localization and classification performance. Experimental results conducted on the THUMOS14 dataset demonstrate that our method outperforms state-of-the-art methods. Our code is available at this https URL.
https://arxiv.org/abs/2412.19418
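The GUEF module fuses snippet-level evidences to refine uncertainty; its exact formulation is not given in the abstract, so the sketch below shows the standard Dirichlet-based evidential uncertainty that this family of methods builds on, used here to down-weight background-like snippets (an illustrative stand-in, not the paper's module).

```python
import torch

def dirichlet_uncertainty(evidence: torch.Tensor):
    """Standard evidential-learning quantities for snippet-level evidences.

    evidence: non-negative tensor of shape (T, K) (snippets x classes).
    Returns per-snippet class probabilities and an uncertainty mass
    u = K / S, where S is the Dirichlet strength. Low-evidence (background-like)
    snippets get high u and can be down-weighted before temporal aggregation.
    """
    alpha = evidence + 1.0                 # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength               # expected class probabilities
    uncertainty = evidence.shape[-1] / strength.squeeze(-1)
    return probs, uncertainty

# Toy usage: weight snippets by (1 - u) before pooling into a video-level score.
evidence = torch.relu(torch.randn(200, 20))        # e.g. softplus/exp of logits
probs, u = dirichlet_uncertainty(evidence)
video_score = ((1 - u).unsqueeze(-1) * probs).sum(0) / (1 - u).sum().clamp(min=1e-6)
```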
Existing skeleton-based human action classification models rely on well-trimmed action-specific skeleton videos for both training and testing, precluding their scalability to real-world applications where untrimmed videos exhibiting concatenated actions are predominant. To overcome this limitation, recently introduced skeleton action segmentation models incorporate untrimmed skeleton videos into end-to-end training. The model is optimized to provide frame-wise predictions for testing videos of any length, simultaneously realizing action localization and classification. Yet, achieving such an improvement requires frame-wise annotated skeleton videos, which remains time-consuming in practice. This paper features a novel framework for skeleton-based action segmentation that is trained on short trimmed skeleton videos but can run on longer untrimmed videos. The approach is implemented in three steps: Stitch, Contrast, and Segment. First, Stitch proposes a temporal skeleton stitching scheme that treats trimmed skeleton videos as elementary human motions that compose a semantic space and can be sampled to generate multi-action stitched sequences. Contrast learns contrastive representations from stitched sequences with a novel discrimination pretext task that enables a skeleton encoder to learn meaningful action-temporal contexts to improve action segmentation. Finally, Segment relates the proposed method to action segmentation by learning a segmentation layer while handling particular data availability. Experiments involve a trimmed source dataset and an untrimmed target dataset in an adaptation formulation for real-world skeleton-based human action segmentation to evaluate the effectiveness of the proposed method.
https://arxiv.org/abs/2412.14988
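The Stitch step treats trimmed clips as elementary motions that can be sampled into multi-action sequences; below is a minimal sketch of such stitching with frame-wise labels, assuming skeleton clips stored as (frames, joints, coords) arrays (all shapes are illustrative).

```python
import numpy as np

def stitch_skeleton_clips(clips, labels, rng=None):
    """Stitch trimmed skeleton clips into one multi-action sequence with
    frame-wise labels, a simple version of a 'Stitch'-style step.

    clips:  list of arrays shaped (T_i, J, C) (frames x joints x coords).
    labels: list of integer action ids, one per clip.
    Returns the stitched sequence (sum T_i, J, C) and per-frame labels.
    """
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(clips))            # sample a random composition
    seq = np.concatenate([clips[i] for i in order], axis=0)
    frame_labels = np.concatenate(
        [np.full(len(clips[i]), labels[i]) for i in order])
    return seq, frame_labels

# Toy usage: three trimmed clips of different lengths, 17 joints, 3D coords.
clips = [np.random.randn(t, 17, 3) for t in (40, 25, 60)]
seq, frame_labels = stitch_skeleton_clips(clips, labels=[2, 5, 1])
assert seq.shape[0] == frame_labels.shape[0] == 125
```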
Temporal action localization (TAL) involves dual tasks to classify and localize actions within untrimmed videos. However, the two tasks often have conflicting requirements for features. Existing methods typically employ separate heads for classification and localization tasks but share the same input feature, leading to suboptimal performance. To address this issue, we propose a novel TAL method with Cross Layer Task Decoupling and Refinement (CLTDR). Based on the feature pyramid of the video, the CLTDR strategy integrates semantically strong features from higher pyramid layers and detailed boundary-aware features from lower pyramid layers to effectively disentangle the action classification and localization tasks. Moreover, the multiple features from cross layers are also employed to refine and align the disentangled classification and regression results. At last, a lightweight Gated Multi-Granularity (GMG) module is proposed to comprehensively extract and aggregate video features at instant, local, and global temporal granularities. Benefiting from the CLTDR and GMG modules, our method achieves state-of-the-art performance on five challenging benchmarks: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100, ActivityNet-1.3, and HACS. Our code and pre-trained models are publicly available at: this https URL.
https://arxiv.org/abs/2412.09202
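The cross-layer decoupling in CLTDR routes semantically strong high-level features to classification and fine low-level features to boundary regression; a minimal sketch of that routing is shown below, omitting the paper's refinement/alignment steps and the GMG module (layer shapes and head designs are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledHeads(nn.Module):
    """Classification from a high (coarse) pyramid level, boundary regression
    from a low (fine) level, both resampled to a common temporal length."""
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        self.cls_head = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg_head = nn.Conv1d(dim, 2, kernel_size=3, padding=1)  # start/end offsets

    def forward(self, pyramid):
        # pyramid: list of (B, C, T_l) features, index 0 = finest level.
        fine, coarse = pyramid[0], pyramid[-1]
        target_len = fine.shape[-1]
        coarse_up = F.interpolate(coarse, size=target_len, mode="linear",
                                  align_corners=False)
        cls_logits = self.cls_head(coarse_up)      # (B, num_classes, T)
        reg_offsets = self.reg_head(fine)          # (B, 2, T)
        return cls_logits, reg_offsets

heads = DecoupledHeads()
pyramid = [torch.randn(1, 256, 256), torch.randn(1, 256, 128), torch.randn(1, 256, 64)]
cls_logits, reg = heads(pyramid)
```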
The naturalistic driving action localization task aims to recognize and comprehend human behaviors and actions from video data captured during real-world driving scenarios. Previous studies have shown strong action localization performance by applying a recognition model followed by probability-based post-processing. Nevertheless, the probabilities provided by the recognition model frequently contain confusing information, posing challenges for post-processing. In this work, we adopt an action recognition model based on self-supervised learning to detect distracted activities and provide potential action probabilities. Subsequently, a constraint ensemble strategy takes advantage of multi-camera views to provide robust predictions. Finally, we introduce a conditional post-processing operation to precisely locate distracted behaviours and action temporal boundaries. On test set A2, our method obtains sixth place on the public leaderboard of track 3 of the 2024 AI City Challenge.
https://arxiv.org/abs/2411.12525
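The constraint ensemble and conditional post-processing above are not specified in detail in the abstract; the sketch below shows one simple instantiation: averaging per-frame probabilities across camera views, then thresholding and merging contiguous frames into segments with a minimum-length rule (the threshold, minimum length, and names are assumptions).

```python
import numpy as np

def ensemble_views(view_probs):
    """Average per-frame class probabilities from multiple camera views.
    view_probs: array of shape (V, T, K)."""
    return np.mean(view_probs, axis=0)                     # (T, K)

def probs_to_segments(probs, cls_id, thr=0.5, min_len=8):
    """Turn a thresholded per-frame probability track into (start, end) segments,
    dropping segments shorter than `min_len` frames (a simple 'conditional' rule)."""
    active = probs[:, cls_id] > thr
    segments, start = [], None
    for t, on in enumerate(active):
        if on and start is None:
            start = t
        elif not on and start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments

# Toy usage: 3 camera views, 300 frames, 16 distracted-behavior classes.
probs = ensemble_views(np.random.rand(3, 300, 16))
segments = probs_to_segments(probs, cls_id=4, thr=0.6)
```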
Recent breakthroughs in Multimodal Large Language Models (MLLMs) have gained significant recognition within the deep learning community, where the fusion of Video Foundation Models (VFMs) and Large Language Models (LLMs) has proven instrumental in constructing robust video understanding systems, effectively surmounting constraints associated with predefined visual tasks. These sophisticated MLLMs exhibit remarkable proficiency in comprehending videos, swiftly attaining unprecedented performance levels across diverse benchmarks. However, their operation demands substantial memory and computational resources, underscoring the continued importance of traditional models in video comprehension tasks. In this paper, we introduce a novel learning paradigm termed MLLM4WTAL. This paradigm harnesses the potential of MLLMs to offer temporal action key semantics and complete semantic priors for conventional Weakly-supervised Temporal Action Localization (WTAL) methods. MLLM4WTAL facilitates the enhancement of WTAL by leveraging MLLM guidance. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR). These modules work in tandem to effectively address prevalent issues like incomplete and over-complete outcomes common in WTAL methods. Rigorous experiments are conducted to validate the efficacy of our proposed approach in augmenting the performance of various heterogeneous WTAL models.
https://arxiv.org/abs/2411.08466
For the temporal action localization task on the ActivityNet-1.3 dataset, we aim to locate the temporal boundaries of each action and predict its action class in untrimmed videos. We first apply VideoSwinTransformer as the feature extractor to extract different features. We then apply a unified network following Faster-TAD to simultaneously obtain proposals and semantic labels. Finally, we ensemble the results of different temporal action detection models, which complement each other. Faster-TAD simplifies the pipeline of TAD and achieves remarkable performance, obtaining results comparable to those of multi-step approaches.
https://arxiv.org/abs/2411.00883
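The Faster-TAD abstract does not state how results from different detectors are ensembled; one common approach, sketched below, is to pool their proposals and apply class-wise temporal NMS (the IoU threshold and the proposal tuple format are assumptions).

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def ensemble_proposals(model_outputs, iou_thr=0.6):
    """Pool proposals from several detectors and apply class-wise temporal NMS.
    model_outputs: list of lists of (start, end, score, label) tuples."""
    pooled = sorted((p for out in model_outputs for p in out),
                    key=lambda p: p[2], reverse=True)
    kept = []
    for prop in pooled:
        if all(prop[3] != k[3] or temporal_iou(prop, k) < iou_thr for k in kept):
            kept.append(prop)
    return kept

# Toy usage: two detectors proposing overlapping instances of class 7.
det_a = [(10.0, 18.5, 0.92, 7), (40.0, 44.0, 0.55, 7)]
det_b = [(10.5, 19.0, 0.88, 7), (70.0, 75.0, 0.80, 3)]
final = ensemble_proposals([det_a, det_b])
```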
Precise action localization in untrimmed video is vital for fields such as professional sports and minimally invasive surgery, where the delineation of particular motions in recordings can dramatically enhance analysis. But in many cases, large-scale datasets with video-label pairs for localization are unavailable, limiting the opportunity to fine-tune video-understanding models. Recent developments in large vision-language models (LVLM) address this need with impressive zero-shot capabilities in a variety of video understanding tasks. However, the adaptation of image-based LVLMs, with their powerful visual question answering capabilities, to action localization in long-form video is still relatively unexplored. To this end, we introduce a true ZEro-shot Action Localization method (ZEAL). Specifically, we leverage the built-in action knowledge of a large language model (LLM) to inflate actions into highly-detailed descriptions of the archetypal start and end of the action. These descriptions serve as queries to the LVLM for generating frame-level confidence scores which can be aggregated to produce localization outputs. The simplicity and flexibility of our method make it amenable to more capable LVLMs as they are developed, and we demonstrate remarkable results in zero-shot action localization on a challenging benchmark, without any training.
https://arxiv.org/abs/2410.14340
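Given per-frame start and end confidence scores obtained from the LVLM queries described in the ZEAL abstract, the aggregation into a localization output can be as simple as picking the best-scoring (start, end) pair; the sketch below illustrates that aggregation, with synthetic Gaussian scores standing in for LVLM outputs.

```python
import numpy as np

def localize_from_scores(start_scores, end_scores, min_len=1):
    """Pick the (start, end) frame pair that maximizes start + end confidence,
    subject to end - start >= min_len. Scores are per-frame arrays in [0, 1],
    e.g. produced by prompting an LVLM with LLM-generated descriptions of the
    archetypal start and end of the action."""
    T = len(start_scores)
    best, best_pair = -np.inf, (0, min_len)
    # Running max of start scores up to each candidate end keeps this O(T).
    best_start_val, best_start_idx = -np.inf, 0
    for e in range(min_len, T):
        s = e - min_len
        if start_scores[s] > best_start_val:
            best_start_val, best_start_idx = start_scores[s], s
        score = best_start_val + end_scores[e]
        if score > best:
            best, best_pair = score, (best_start_idx, e)
    return best_pair

# Toy usage with synthetic scores peaking near frames 30 and 80.
t = np.arange(120)
start_scores = np.exp(-((t - 30) / 8.0) ** 2)
end_scores = np.exp(-((t - 80) / 8.0) ** 2)
print(localize_from_scores(start_scores, end_scores))   # roughly (30, 80)
```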
Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on single input modalities, such as RGB or skeletal data, limiting their ability to exploit the complementary advantages across modalities. Recent studies focus on combining these two modalities using simple feature fusion techniques. However, due to the inherent disparities in representation between these input modalities, designing a unified neural network architecture to effectively leverage their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is the introduction of a novel compositional query machine, called COMPUTER (COMPositional hUman-cenTric quERy machine), a generic neural architecture that models the interactions between a human of interest and its surroundings in both space and time. Thanks to its versatile design, COMPUTER can be leveraged to distill distinctive representations for various input modalities. Additionally, we introduce a consistency loss that enforces agreement in prediction between modalities, exploiting the complementary information from multimodal inputs for robust human movement recognition. Through extensive experiments on action localization and group activity recognition tasks, our approach demonstrates superior performance when compared with state-of-the-art methods. Our code is available at: this https URL.
https://arxiv.org/abs/2409.02385
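The consistency loss in the COMPUTER abstract enforces agreement between modality predictions; its exact form is not given there, so the sketch below uses a symmetric KL divergence between two modalities' class distributions as an illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_rgb: torch.Tensor, logits_skel: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between the class distributions predicted from
    two modalities (e.g. RGB and skeleton), encouraging them to agree.
    logits_*: (B, K) unnormalized scores."""
    p = F.log_softmax(logits_rgb, dim=-1)
    q = F.log_softmax(logits_skel, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")  # KL(P || Q)
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")  # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)

# Toy usage.
loss = consistency_loss(torch.randn(8, 60), torch.randn(8, 60))
```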
Video action localization aims to find the timing of a specific action in a long video. Although existing learning-based approaches have been successful, they require annotated videos, which come with a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels, making a VLM guess the frame that is considered to be closest to the start/end of the action. Iterating this process while narrowing the sampling time window results in finding the specific start and end frames of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos.
https://arxiv.org/abs/2408.17422
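The iterative narrowing loop above is straightforward to sketch; in the code below, `query_vlm` is a stub standing in for composing the sampled, index-labelled frames into one image and asking the VLM which frame is closest to the action boundary (the sample count and round limit are assumptions, not the paper's settings).

```python
def iterative_localize(num_frames, query_vlm, samples_per_round=8, rounds=4):
    """Iteratively narrow a time window around a boundary (e.g. the action start).

    `query_vlm(frame_indices)` stands in for concatenating the sampled frames
    into a single index-labelled image and asking a VLM which one looks closest
    to the boundary; here it must simply return one of the given indices.
    """
    lo, hi = 0, num_frames - 1
    for _ in range(rounds):
        step = max(1, (hi - lo) // (samples_per_round - 1))
        candidates = list(range(lo, hi + 1, step))[:samples_per_round]
        picked = query_vlm(candidates)
        # Narrow the window to the neighbourhood of the picked frame.
        lo = max(lo, picked - step)
        hi = min(hi, picked + step)
        if hi - lo <= 1:
            break
    return (lo + hi) // 2

# Toy usage: a fake "VLM" that always prefers the frame closest to index 731.
fake_vlm = lambda idxs: min(idxs, key=lambda i: abs(i - 731))
print(iterative_localize(3000, fake_vlm))   # converges near frame 731
```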
To address the zero-shot temporal action localization (ZSTAL) task, existing works develop models that are generalizable to detect and classify actions from unseen categories. They typically develop a category-agnostic action detector and combine it with the Contrastive Language-Image Pre-training (CLIP) model to solve ZSTAL. However, these methods suffer from incomplete action proposals generated for unseen categories, since they follow a frame-level prediction paradigm and require hand-crafted post-processing to generate action proposals. To address this problem, in this work, we propose a novel model named Generalizable Action Proposal generator (GAP), which can interface seamlessly with CLIP and generate action proposals in a holistic way. Our GAP is built in a query-based architecture and trained with a proposal-level objective, enabling it to estimate proposal completeness and eliminate the hand-crafted post-processing. Based on this architecture, we propose an Action-aware Discrimination loss to enhance the category-agnostic dynamic information of actions. Besides, we introduce a Static-Dynamic Rectifying module that incorporates the generalizable static information from CLIP to refine the predicted proposals, which improves proposal completeness in a generalizable manner. Our experiments show that our GAP achieves state-of-the-art performance on two challenging ZSTAL benchmarks, i.e., Thumos14 and ActivityNet1.3. Specifically, our model obtains significant performance improvement over previous works on the two benchmarks, i.e., +3.2% and +3.4% average mAP, respectively.
https://arxiv.org/abs/2408.13777
Existing few-shot temporal action localization models cannot handle videos that contain multiple action instances. The purpose of this paper is therefore to localize multiple action instances in a lengthy untrimmed query video using only a limited number of trimmed support videos. To address this challenging problem effectively, we propose a novel solution involving a spatial-channel relation transformer with probability learning and cluster refinement. This method can accurately identify the start and end boundaries of actions in the query video while utilizing only a limited number of labeled videos. Our proposed method is adept at capturing both temporal and spatial contexts to effectively classify and precisely locate actions in videos, enabling a more comprehensive utilization of these crucial details. The selective cosine penalization algorithm is designed to suppress temporal boundaries that do not include action scene switches. Probability learning combined with a label generation algorithm alleviates the problem of action duration diversity and enhances the model's ability to handle fuzzy action boundaries. Interval clustering helps obtain the final results in multi-instance situations in few-shot temporal action localization. Our model achieves competitive performance through meticulous experimentation on the benchmark datasets ActivityNet1.3 and THUMOS14. Our code is readily available at this https URL.
https://arxiv.org/abs/2408.13765
Online video understanding often relies on individual frames, leading to frame-by-frame predictions. Recent advancements, such as Online Temporal Action Localization (OnTAL), extend this approach to instance-level predictions. However, existing methods mainly focus on short-term context, neglecting historical information. To address this, we introduce the History-Augmented Anchor Transformer (HAT) Framework for OnTAL. By integrating historical context, our framework enhances the synergy between long-term and short-term information, improving the quality of anchor features crucial for classification and localization. We evaluate our model on both procedural egocentric (PREGO) datasets (EGTEA and EPIC) and standard non-PREGO OnTAL datasets (THUMOS and MUSES). Results show that our model outperforms state-of-the-art approaches significantly on PREGO datasets and achieves comparable or slightly superior performance on non-PREGO datasets, underscoring the importance of leveraging long-term history, especially in procedural and egocentric action scenarios. Code is available at: this https URL
https://arxiv.org/abs/2408.06437
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at this https URL.
https://arxiv.org/abs/2408.05955
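The abstract above embeds action and VLP knowledge as probability distributions and compares them by statistical similarity; the sketch below shows a generic diagonal-Gaussian embedding head and the closed-form squared 2-Wasserstein distance as one such similarity, purely to illustrate the machinery (not the paper's exact losses; all names and sizes are assumptions).

```python
import torch
import torch.nn as nn

class ProbabilisticEmbedding(nn.Module):
    """Map a feature vector to a diagonal Gaussian (mu, sigma) in embedding space."""
    def __init__(self, d_in=2048, d_emb=256):
        super().__init__()
        self.mu = nn.Linear(d_in, d_emb)
        self.log_sigma = nn.Linear(d_in, d_emb)

    def forward(self, x):
        return self.mu(x), self.log_sigma(x).exp()

def wasserstein2_sq(mu1, sig1, mu2, sig2):
    """Squared 2-Wasserstein distance between diagonal Gaussians:
    W2^2 = ||mu1 - mu2||^2 + ||sig1 - sig2||^2, summed over embedding dims."""
    return ((mu1 - mu2) ** 2).sum(-1) + ((sig1 - sig2) ** 2).sum(-1)

# Toy usage: distance between video-snippet and text embeddings, which could
# then feed a contrastive objective over positive/negative pairs.
embed = ProbabilisticEmbedding()
mu_v, sig_v = embed(torch.randn(4, 2048))
mu_t, sig_t = embed(torch.randn(4, 2048))
dist = wasserstein2_sq(mu_v, sig_v, mu_t, sig_t)    # (4,)
```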