Inspired by the recent success of transformers and multi-stage architectures in the video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, where each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we conducted extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of our PCL-Former. The quantitative results validate the effectiveness of the proposed PCL-Former, which outperforms state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on the THUMOS14, ActivityNet-1.3, and HACS datasets, respectively.
https://arxiv.org/abs/2507.06411
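Since each subtask above maps onto its own transformer module, the decomposition can be summarized in a minimal PyTorch sketch. Module names, feature dimensions, and head sizes are illustrative assumptions; the actual PCL-Former design and its specialized losses are described in the paper.

```python
import torch
import torch.nn as nn

class SubtaskFormer(nn.Module):
    """A small transformer encoder followed by a task-specific head."""
    def __init__(self, dim=512, heads=8, layers=2, out_dim=1):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(dim, out_dim)

    def forward(self, x):                      # x: (B, T, dim) snippet features
        return self.head(self.encoder(x))      # (B, T, out_dim)

class PCLFormerSketch(nn.Module):
    """Proposal / Classification / Localization subtasks, one module each."""
    def __init__(self, dim=512, num_classes=20):
        super().__init__()
        self.proposal_former = SubtaskFormer(dim, out_dim=1)                # candidate segments
        self.classification_former = SubtaskFormer(dim, out_dim=num_classes)
        self.localization_former = SubtaskFormer(dim, out_dim=2)            # start/end offsets

    def forward(self, feats):
        actionness = self.proposal_former(feats).sigmoid()
        cls_logits = self.classification_former(feats)
        boundaries = self.localization_former(feats)
        return actionness, cls_logits, boundaries

feats = torch.randn(2, 128, 512)               # a batch of 128 snippet features
actionness, cls_logits, boundaries = PCLFormerSketch()(feats)
```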
Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization that aims to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: an audio-visual feature alignment module that leverages masked self-attention to enhance intra-modal consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales, capturing both high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets, UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100, surpassing previous approaches with notable average mAP gains of +3.3%, +2.6%, +1.2%, +1.7% (verb), and +1.4% (noun), respectively.
https://arxiv.org/abs/2506.23196
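As a rough illustration of the intra-modal alignment step above, the sketch below applies masked self-attention within a single modality; the mask construction and dimensions are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class IntraModalAlignment(nn.Module):
    """Masked self-attention over one modality (audio or visual)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, valid_mask):
        # feats: (B, T, dim); valid_mask: (B, T) boolean, True = valid time step
        out, _ = self.attn(feats, feats, feats, key_padding_mask=~valid_mask)
        return self.norm(feats + out)

B, T, D = 2, 100, 256
align = IntraModalAlignment(D)
valid = torch.ones(B, T, dtype=torch.bool)
audio_aligned = align(torch.randn(B, T, D), valid)
visual_aligned = align(torch.randn(B, T, D), valid)
# a cross-modal refinement module would then attend between the two aligned streams
```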
Weakly supervised temporal action localization is a challenging task because only video-level annotations are available during training. To address this problem, we propose a two-stage approach that fully exploits multi-resolution information in the temporal domain and generates high-quality frame-level pseudo labels based on both appearance and motion streams. Specifically, in the first stage we generate reliable initial frame-level pseudo labels, and in the second stage we iteratively refine the pseudo labels and use a set of selected frames with highly confident pseudo labels to train neural networks and better predict action class scores at each frame. We fully exploit temporal information at multiple scales to improve temporal action localization performance. To obtain reliable initial frame-level pseudo labels, in the first stage we propose an Initial Label Generation (ILG) module, which leverages temporal multi-resolution consistency to generate high-quality class activation sequences (CASs); each CAS consists of a number of sequences, with each sequence measuring how likely each video frame is to belong to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework. In the PTLR framework, two networks called Network-OTS and Network-RTS, which generate CASs for the original temporal scale and the reduced temporal scales respectively, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. In this way, multi-resolution information in the temporal domain is exchanged at the pseudo-label level, and each stream (i.e., the OTS/RTS stream) is improved by exploiting the refined pseudo labels from the other stream (i.e., the RTS/OTS stream).
https://arxiv.org/abs/2506.18261
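The sketch below shows one plausible reading of the temporal multi-resolution idea: compute CASs at the original and reduced temporal scales, fuse them, and treat cross-scale disagreement as an (assumed) consistency signal. The ILG module's actual formulation is in the paper.

```python
import torch
import torch.nn.functional as F

def multi_resolution_cas(features, classifier, scales=(1, 2, 4)):
    """Compute class activation sequences (CAS) at several temporal scales and fuse them.

    features:   (B, T, D) frame-level features
    classifier: module mapping (B, T', D) -> (B, T', C) class scores
    """
    B, T, D = features.shape
    cas_list = []
    for s in scales:
        x = F.avg_pool1d(features.transpose(1, 2), kernel_size=s, stride=s)   # reduce scale
        cas = classifier(x.transpose(1, 2))                                   # (B, T//s, C)
        cas = F.interpolate(cas.transpose(1, 2), size=T, mode="linear",
                            align_corners=False).transpose(1, 2)              # back to length T
        cas_list.append(cas)
    stacked = torch.stack(cas_list)                  # (S, B, T, C)
    fused = stacked.mean(dim=0)                      # fused CAS
    consistency = stacked.std(dim=0)                 # low std = scales agree
    return fused, consistency

clf = torch.nn.Linear(1024, 20)                      # toy frame-level classifier
fused, consistency = multi_resolution_cas(torch.randn(2, 64, 1024), clf)
```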
We introduce the task of Audible Action Temporal Localization, which aims to identify the spatio-temporal coordinates of audible movements. Unlike conventional tasks such as action recognition and temporal action localization, which broadly analyze video content, our task focuses on the distinct kinematic dynamics of audible actions. It is based on the premise that key actions are driven by inflectional movements; for example, collisions that produce sound often involve abrupt changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that estimates inflectional flow using the second derivative of motion to determine collision timings without relying on audio input. $TA^{2}Net$ also integrates a self-supervised spatial localization strategy during training, combining contrastive learning with spatial analysis. This dual design improves temporal localization accuracy and simultaneously identifies sound sources within video frames. To support this task, we introduce a new benchmark dataset, $Audible623$, derived from Kinetics and UCF101 by removing non-essential vocalization subsets. Extensive experiments confirm the effectiveness of our approach on $Audible623$ and show strong generalizability to other domains, such as repetitive counting and sound source localization. Code and dataset are available at this https URL.
https://arxiv.org/abs/2506.13320
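The collision-timing idea above can be illustrated in a few lines of NumPy: take the second derivative of a per-frame motion magnitude and keep its large peaks. The thresholding and the toy motion curve are assumptions; the paper's inflectional-flow estimator is learned, not hand-crafted like this.

```python
import numpy as np

def collision_timings(motion_mag, fps=30.0, thresh=5.0):
    """Candidate audible-collision frames from the second derivative of motion.

    motion_mag: (T,) per-frame motion magnitude (e.g. mean optical-flow norm)
    """
    velocity = np.gradient(motion_mag) * fps        # first derivative
    accel = np.gradient(velocity) * fps             # second derivative ("inflectional" cue)
    score = np.abs(accel)
    # local peaks of |acceleration| above a threshold mark abrupt motion changes
    return [t for t in range(1, len(score) - 1)
            if score[t] > thresh and score[t] >= score[t - 1] and score[t] >= score[t + 1]]

motion = np.abs(np.sin(np.linspace(0, 6, 180)))     # toy curve with one sharp "bounce"
print(collision_timings(motion))                    # the kink near frame 94 is detected
```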
Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at this https URL.
https://arxiv.org/abs/2506.03662
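A loose sketch of the sampling idea above, under the assumption that contact/separation moments coincide with low 3D hand speed; the function name, the slowest-frame heuristic, and the window size are all hypothetical, not EgoLoc's actual procedure.

```python
import numpy as np

def sample_candidate_frames(hand_pos_3d, fps=30.0, num_candidates=5, window=3):
    """Initial guesses for contact/separation timestamps from 3D hand velocity.

    hand_pos_3d: (T, 3) hand positions over time
    Returns guessed frame indices and a small frame window around each guess
    that could be rendered into visual prompts for the VLM.
    """
    speed = np.linalg.norm(np.diff(hand_pos_3d, axis=0), axis=1) * fps   # (T-1,)
    guesses = sorted(np.argsort(speed)[:num_candidates].tolist())        # slowest-hand frames
    prompts = [list(range(max(0, g - window), g + window + 1)) for g in guesses]
    return guesses, prompts

trajectory = np.cumsum(np.random.randn(120, 3) * 0.01, axis=0)           # toy hand trajectory
guesses, prompts = sample_candidate_frames(trajectory)
```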
Language-driven action localization in videos requires not only semantic alignment between language query and video segment, but also prediction of action boundaries. However, the language query primarily describes the main content of an action and usually lacks specific details of action start and end boundaries, which increases the subjectivity of manual boundary annotation and leads to boundary uncertainty in training data. In this paper, on one hand, we propose to expand the original query by generating textual descriptions of the action start and end boundaries through LLMs, which can provide more detailed boundary cues for localization and thus reduce the impact of boundary uncertainty. On the other hand, to enhance the tolerance to boundary uncertainty during training, we propose to model probability scores of action boundaries by calculating the semantic similarities between frames and the expanded query as well as the temporal distances between frames and the annotated boundary frames. They can provide more consistent boundary supervision, thus improving the stability of training. Our method is model-agnostic and can be seamlessly and easily integrated into any existing models of language-driven action localization in an off-the-shelf manner. Experimental results on several datasets demonstrate the effectiveness of our method.
https://arxiv.org/abs/2505.24282
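The probability-score idea above lends itself to a compact sketch: combine frame-to-query semantic similarity with a temporal-distance weight around the annotated boundary. The Gaussian weighting and the normalization are illustrative choices, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def boundary_probabilities(frame_feats, boundary_query_feat, annotated_idx, sigma=5.0):
    """Soft boundary supervision from semantic similarity and temporal distance.

    frame_feats:         (T, D) frame embeddings
    boundary_query_feat: (D,) embedding of the LLM-expanded start (or end) description
    annotated_idx:       annotated boundary frame index
    """
    T = frame_feats.shape[0]
    sim = F.cosine_similarity(frame_feats, boundary_query_feat.unsqueeze(0), dim=-1)  # (T,)
    dist = torch.arange(T, dtype=torch.float32) - float(annotated_idx)
    temporal = torch.exp(-dist.pow(2) / (2 * sigma ** 2))       # closeness to the annotation
    score = sim.clamp(min=0) * temporal
    return score / (score.sum() + 1e-6)                         # probability over frames

probs = boundary_probabilities(torch.randn(100, 512), torch.randn(512), annotated_idx=40)
```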
Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model's superiority over several state-of-the-art competitors.
https://arxiv.org/abs/2505.23524
Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, training TAL models requires a substantial amount of manually annotated data. Data programming is an efficient method to create training labels with a series of human-defined labeling functions, but its application to TAL is hindered by the difficulty of defining complex actions over temporal video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes representing body parts and objects and linking them to constrain their relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming frameworks.
https://arxiv.org/abs/2505.17555
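A hypothetical data structure for a drag-and-link key-event definition, just to make the idea concrete; the node/link fields and the example constraint are invented for illustration and are not ProTAL's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                 # e.g. "right_hand", "cup"
    kind: str                 # "body_part" or "object"

@dataclass
class Link:
    source: str
    target: str
    relation: dict            # e.g. {"direction": "above", "max_distance": 0.2}

@dataclass
class KeyEvent:
    label: str
    nodes: list = field(default_factory=list)
    links: list = field(default_factory=list)

# "pick up cup": the right hand must be close to and above the cup
pick_up = KeyEvent(
    label="pick_up_cup",
    nodes=[Node("right_hand", "body_part"), Node("cup", "object")],
    links=[Link("right_hand", "cup", {"direction": "above", "max_distance": 0.2})],
)
```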
Video event detection has become an essential component of sports analytics, enabling automated identification of key moments and enhancing performance analysis, viewer engagement, and broadcast efficiency. Recent advancements in deep learning, particularly Convolutional Neural Networks (CNNs) and Transformers, have significantly improved accuracy and efficiency in Temporal Action Localization (TAL), Action Spotting (AS), and Precise Event Spotting (PES). This survey provides a comprehensive overview of these three key tasks, emphasizing their differences, applications, and the evolution of methodological approaches. We thoroughly review and categorize existing datasets and evaluation metrics specifically tailored for sports contexts, highlighting the strengths and limitations of each. Furthermore, we analyze state-of-the-art techniques, including multi-modal approaches that integrate audio and visual information, methods utilizing self-supervised learning and knowledge distillation, and approaches aimed at generalizing across multiple sports. Finally, we discuss critical open challenges and outline promising research directions toward developing more generalized, efficient, and robust event detection frameworks applicable to diverse sports. This survey serves as a foundation for future research on efficient, generalizable, and multi-modal sports event detection.
https://arxiv.org/abs/2505.03991
Weakly-supervised Temporal Action Localization (WTAL) has achieved notable success but still suffers from a lack of temporal annotations, leading to a performance and framework gap compared with fully-supervised methods. While recent approaches employ pseudo labels for training, three key challenges remain unresolved: generating high-quality pseudo labels, making full use of different priors, and optimizing training with noisy labels. Motivated by these challenges, we propose PseudoFormer, a novel two-branch framework that bridges the gap between weakly and fully-supervised Temporal Action Localization (TAL). We first introduce RickerFusion, which maps all predicted action proposals to a global shared space to generate pseudo labels of better quality. Subsequently, we leverage both snippet-level and proposal-level labels, which carry different priors, from the weak branch to train the regression-based model in the full branch. Finally, an uncertainty mask and an iterative refinement mechanism are applied for training with noisy pseudo labels. PseudoFormer achieves state-of-the-art WTAL results on the two commonly used benchmarks, THUMOS14 and ActivityNet1.3. Besides, extensive ablation studies demonstrate the contribution of each component of our method.
https://arxiv.org/abs/2504.14860
Wireless signal-based human sensing technologies, such as WiFi, millimeter-wave (mmWave) radar, and Radio Frequency Identification (RFID), enable the detection and interpretation of human presence, posture, and activities, thereby providing critical support for applications in public security, healthcare, and smart environments. These technologies exhibit notable advantages due to their non-contact operation and environmental adaptability; however, existing systems often fail to leverage the textual information inherent in datasets. To address this, we propose an innovative text-enhanced wireless sensing framework, WiTalk, that seamlessly integrates semantic knowledge through three hierarchical prompt strategies (label-only, brief description, and detailed action description) without requiring architectural modifications or incurring additional data costs. We rigorously validate this framework across three public benchmark datasets: XRF55 for human action recognition (HAR), and WiFiTAL and XRFV2 for WiFi temporal action localization (TAL). Experimental results demonstrate significant performance improvements: on XRF55, accuracy for WiFi, RFID, and mmWave increases by 3.9%, 2.59%, and 0.46%, respectively; on WiFiTAL, the average performance of WiFiTAD improves by 4.98%; and on XRFV2, the mean average precision gains across various methods range from 4.02% to 13.68%. Our code is available at this https URL.
https://arxiv.org/abs/2504.14621
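The three prompt levels are easy to picture with a toy template; the wording below and the downstream use of a text encoder are assumptions, since the paper defines its own prompt designs.

```python
PROMPTS = {
    "label_only": "{label}",
    "brief":      "A person is performing the action: {label}.",
    "detailed":   "A person is performing the action: {label}. {detail}",
}

def build_prompt(level, label, detail=""):
    return PROMPTS[level].format(label=label, detail=detail).strip()

print(build_prompt("label_only", "waving"))
print(build_prompt("brief", "waving"))
print(build_prompt("detailed", "waving",
                   "The arm is raised and swings repeatedly from side to side."))
# each prompt would be embedded by a text encoder and fused with the wireless-signal
# features, without changing the sensing backbone itself
```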
Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information and neglect textual information, which can provide valuable semantic support for the localization task. Therefore, we propose a new few-shot temporal action localization method that uses Chain-of-Thought textual reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model's ability to capture action commonalities and variations; it includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level and thereby assist action localization, we design a Chain of Thought (CoT)-like reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoT-like text descriptions for videos. The generated text captures more action variance than visual features alone. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets. We also introduce the first Human-related Anomaly Localization dataset and explore the application of TAL to human anomaly detection. The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. We will release our code, data, and benchmark.
https://arxiv.org/abs/2504.13460
Temporal localization in untrimmed videos, which aims to identify specific timestamps, is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. Existing methods in each subfield are typically designed for specific tasks and lack generalizability across domains. In this paper, we propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks. First, our approach employs a simple yet effective one-stage localization model that supports text queries as input and multiple actions as output. Second, we jointly train the video encoder and localization model in an end-to-end manner. To efficiently process long videos, we introduce temporal chunking, enabling the handling of videos with over 30k frames. Third, we find that fine-tuning pre-trained text encoders with a multi-stage training strategy further enhances text-conditioned localization. TimeLoc achieves state-of-the-art results across multiple benchmarks: +1.3% and +1.9% mAP over previous best methods on THUMOS14 and EPIC-Kitchens-100, +1.1% on Kinetics-GEBD, +2.94% mAP on QVHighlights, and significant improvements in temporal video grounding (+11.5% on TACoS and +6.7% on Charades-STA under R1@0.5). Our code and checkpoints will be released at this https URL.
https://arxiv.org/abs/2503.06526
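Temporal chunking is the most mechanical of the three ingredients above, so here is a small sketch of it; the chunk size and overlap are illustrative values.

```python
import torch

def chunk_video(frames, chunk_size=1024, overlap=128):
    """Split a long frame sequence into overlapping chunks so the video encoder and
    localization head can be trained end-to-end within memory limits.

    frames: (T, ...) tensor of frames or pre-extracted features
    Returns a list of (start_index, chunk) pairs.
    """
    chunks, start, T = [], 0, frames.shape[0]
    stride = chunk_size - overlap
    while start < T:
        end = min(start + chunk_size, T)
        chunks.append((start, frames[start:end]))
        if end == T:
            break
        start += stride
    return chunks

# a 30k-frame video becomes ~34 chunks of 1024 frames with 128 frames of overlap
chunks = chunk_video(torch.zeros(30000, 256))
print(len(chunks), chunks[0][1].shape)
```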
Naturalistic driving action recognition is essential for vehicle cabin monitoring systems. However, the complexity of real-world backgrounds presents significant challenges for this task, and previous approaches have struggled with practical implementation due to their limited ability to observe subtle behavioral differences and effectively learn inter-frame features from video. In this paper, we propose a novel Spatial-Temporal Perception (STP) architecture that emphasizes both temporal information and spatial relationships between key objects, incorporating a causal decoder to perform behavior recognition and temporal action localization. Without requiring multimodal input, STP directly extracts temporal and spatial distance features from RGB video clips. Subsequently, these dual features are jointly encoded by maximizing the expected likelihood across all possible permutations of the factorization order. By integrating temporal and spatial features at different scales, STP can perceive subtle behavioral changes in challenging scenarios. Additionally, we introduce a causal-aware module to explore relationships between video frame features, significantly enhancing detection efficiency and performance. We validate the effectiveness of our approach using two publicly available driver distraction detection benchmarks. The results demonstrate that our framework achieves state-of-the-art performance.
https://arxiv.org/abs/2503.04078
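One ingredient above, the spatial relationship between key objects, can be sketched as per-frame pairwise distances between object centers; the detector providing the boxes and the exact feature definition are assumptions.

```python
import torch

def spatial_distance_features(boxes):
    """Pairwise center distances between key objects in each frame.

    boxes: (T, N, 4) per-frame boxes for N tracked key objects, as (x1, y1, x2, y2)
    Returns (T, N, N) distance matrices that could be encoded jointly with
    temporal features.
    """
    centers = torch.stack([(boxes[..., 0] + boxes[..., 2]) / 2,
                           (boxes[..., 1] + boxes[..., 3]) / 2], dim=-1)   # (T, N, 2)
    return torch.cdist(centers, centers)                                   # (T, N, N)

dists = spatial_distance_features(torch.rand(16, 5, 4))
print(dists.shape)   # torch.Size([16, 5, 5])
```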
This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameters without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, by averaging predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000fps.
https://arxiv.org/abs/2502.00397
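The ensemble step described above is plain averaging of the two models' predicted saliency maps, so a short sketch suffices (the renormalization is an assumption).

```python
import torch

def ensemble_saliency(map_s, map_a):
    """Average the saliency maps predicted by ViNet-S and ViNet-A.

    map_s, map_a: (B, 1, H, W) saliency maps in [0, 1]
    """
    fused = (map_s + map_a) / 2.0
    return fused / fused.amax(dim=(-2, -1), keepdim=True).clamp(min=1e-6)

fused = ensemble_saliency(torch.rand(1, 1, 224, 384), torch.rand(1, 1, 224, 384))
```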
Human Action Recognition (HAR) plays a crucial role in applications such as health monitoring, smart home automation, and human-computer interaction. While HAR has been extensively studied, action summarization, which involves identifying and summarizing continuous actions, remains an emerging task. This paper introduces the novel XRF V2 dataset, designed for indoor daily activity Temporal Action Localization (TAL) and action summarization. XRF V2 integrates multimodal data from Wi-Fi signals, IMU sensors (smartphones, smartwatches, headphones, and smart glasses), and synchronized video recordings, offering a diverse collection of indoor activities from 16 volunteers across three distinct environments. To tackle TAL and action summarization, we propose the XRFMamba neural network, which excels at capturing long-term dependencies in untrimmed sensory sequences and outperforms state-of-the-art methods, such as ActionFormer and WiFiTAD. We envision XRF V2 as a valuable resource for advancing research in human action localization, action forecasting, pose estimation, multimodal foundation models pre-training, synthetic data generation, and more.
https://arxiv.org/abs/2501.19034
Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize a weakly-supervised base model to generate instance-level pseudo labels for training the fully-supervised detection head. We argue that the noise in these pseudo labels interferes with the learning of the fully-supervised detection head, leading to significant performance degradation. Issues with noisy labels include: (1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To address these issues, we introduce a two-stage noisy-label learning strategy that harnesses every potentially useful signal in the noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. Besides, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework to assign different weights to the noisy labels for more effective training. Our model greatly outperforms the previous state-of-the-art methods in both detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.
https://arxiv.org/abs/2501.11124
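One plausible way to realize the "different weights for noisy labels" idea is a confidence-weighted frame-level loss, sketched below; the weighting rule is an assumption and the paper's mining loss may differ.

```python
import torch
import torch.nn.functional as F

def weighted_pseudo_label_loss(logits, pseudo_labels, confidence, min_conf=0.3):
    """Cross-entropy over frames, weighted by pseudo-label confidence.

    logits:        (T, C) frame-level class scores from the student
    pseudo_labels: (T,)   integer labels produced by the (noisy) teacher
    confidence:    (T,)   teacher confidence in [0, 1] for each frame
    """
    per_frame = F.cross_entropy(logits, pseudo_labels, reduction="none")       # (T,)
    weights = torch.where(confidence >= min_conf, confidence,
                          torch.zeros_like(confidence))                        # drop low-confidence frames
    return (weights * per_frame).sum() / weights.sum().clamp(min=1e-6)

loss = weighted_pseudo_label_loss(torch.randn(50, 20),
                                  torch.randint(0, 20, (50,)),
                                  torch.rand(50))
```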
Detecting and interpreting operator actions, engagement, and object interactions in dynamic industrial workflows remains a significant challenge in human-robot collaboration research, especially within complex, real-world environments. Traditional unimodal methods often fall short of capturing the intricacies of these unstructured industrial settings. To address this gap, we present a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that captures realistic assembly and disassembly tasks, facilitating the evaluation of key meta-tasks such as action localization, object interaction, and engagement prediction. The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video, annotated in detail for task performance and operator behavior. Its distinctiveness lies in the integration of multiple data modalities and its emphasis on real-world, untrimmed industrial workflows, which are key for advancing research in human-robot collaboration and operator monitoring. Additionally, we propose a multimodal network that fuses RGB frames, IMU data, and skeleton sequences to predict engagement levels during industrial tasks. Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments. The dataset and code can be accessed from this https URL.
https://arxiv.org/abs/2501.05936
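A minimal late-fusion sketch of the engagement-prediction network described above; the feature dimensions, the concatenation scheme, and the number of engagement levels are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EngagementFusion(nn.Module):
    """Fuse RGB, IMU, and skeleton features and predict an engagement level."""
    def __init__(self, rgb_dim=2048, imu_dim=128, skel_dim=256, num_levels=3):
        super().__init__()
        self.proj = nn.ModuleDict({
            "rgb":  nn.Linear(rgb_dim, 256),
            "imu":  nn.Linear(imu_dim, 256),
            "skel": nn.Linear(skel_dim, 256),
        })
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(3 * 256, num_levels))

    def forward(self, rgb, imu, skel):          # clip-level features per modality
        fused = torch.cat([self.proj["rgb"](rgb),
                           self.proj["imu"](imu),
                           self.proj["skel"](skel)], dim=-1)
        return self.classifier(fused)           # logits over engagement levels

logits = EngagementFusion()(torch.randn(4, 2048), torch.randn(4, 128), torch.randn(4, 256))
```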
Most existing traffic video datasets, including Waymo, are structured and focus predominantly on Western traffic, which hinders global applicability. Specifically, most Asian scenarios are far more complex, involving numerous objects with distinct motions and behaviors. Addressing this gap, we present a new dataset, DAVE, designed for evaluating perception methods with a high representation of Vulnerable Road Users (VRUs: e.g., pedestrians, animals, motorbikes, and bicycles) in complex and unpredictable environments. DAVE is a manually annotated dataset encompassing 16 diverse actor categories (spanning animals, humans, vehicles, etc.) and 16 action types (complex and rare cases like cut-ins, zigzag movement, U-turns, etc.), which require high reasoning ability. DAVE densely annotates over 13 million actor bounding boxes (bboxes) with identities, and more than 1.6 million boxes are annotated with both actor identity and action/behavior details. The videos within DAVE are collected based on a broad spectrum of factors, such as weather conditions, time of day, road scenarios, and traffic density. DAVE can benchmark video tasks like Tracking, Detection, Spatiotemporal Action Localization, Language-Visual Moment Retrieval, and Multi-label Video Action Recognition. Given the critical importance of accurately identifying VRUs to prevent accidents and ensure road safety, vulnerable road users constitute 41.13% of instances in DAVE, compared to 23.71% in Waymo. DAVE provides an invaluable resource for the development of more sensitive and accurate visual perception algorithms for the complex real world. Our experiments show that existing methods suffer performance degradation when evaluated on DAVE, highlighting its benefit for future video recognition research.
https://arxiv.org/abs/2412.20042
Weakly supervised temporal action localization (WS-TAL) aims to localize complete action instances and categorize them using only video-level labels. Action-background ambiguity, primarily caused by background noise resulting from aggregation and by intra-action variation, is a significant challenge for existing WS-TAL methods. In this paper, we introduce a hybrid multi-head attention (HMHA) module and a generalized uncertainty-based evidential fusion (GUEF) module to address the problem. The proposed HMHA effectively enhances RGB and optical-flow features by filtering redundant information and adjusting their feature distribution to better align with the WS-TAL task. Additionally, the proposed GUEF adaptively eliminates the interference of background noise by fusing snippet-level evidence to refine uncertainty measurement and select superior foreground feature information, which enables the model to concentrate on integral action instances and achieve better action localization and classification performance. Experimental results on the THUMOS14 dataset demonstrate that our method outperforms state-of-the-art methods. Our code is available at this https URL.
https://arxiv.org/abs/2412.19418
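As a generic stand-in for the hybrid multi-head attention idea, the sketch below refines RGB and optical-flow snippet features with joint self-attention; this is not the actual HMHA (or GUEF) design, only an illustration of attention-based filtering over the two streams.

```python
import torch
import torch.nn as nn

class TwoStreamAttention(nn.Module):
    """Jointly attend over RGB and optical-flow snippet features to suppress redundancy."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb, flow):               # each: (B, T, dim)
        x = torch.cat([rgb, flow], dim=1)       # joint sequence of 2T tokens
        out, _ = self.attn(x, x, x)
        out = self.norm(x + out)
        T = rgb.shape[1]
        return out[:, :T], out[:, T:]           # refined RGB and flow features

rgb, flow = torch.randn(2, 64, 1024), torch.randn(2, 64, 1024)
rgb_refined, flow_refined = TwoStreamAttention()(rgb, flow)
```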