Detecting and interpreting operator actions, engagement, and object interactions in dynamic industrial workflows remains a significant challenge in human-robot collaboration research, especially within complex, real-world environments. Traditional unimodal methods often fall short of capturing the intricacies of these unstructured industrial settings. To address this gap, we present a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that captures realistic assembly and disassembly tasks, facilitating the evaluation of key meta-tasks such as action localization, object interaction, and engagement prediction. The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video, annotated in detail for task performance and operator behavior. Its distinctiveness lies in the integration of multiple data modalities and its emphasis on real-world, untrimmed industrial workflows, which are key for advancing research in human-robot collaboration and operator monitoring. Additionally, we propose a multimodal network that fuses RGB frames, IMU data, and skeleton sequences to predict engagement levels during industrial tasks. Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments. The dataset and code can be accessed from this https URL.
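A minimal late-fusion sketch of the kind of multimodal engagement predictor described here, combining RGB, IMU, and skeleton streams; the encoders, feature dimensions, and three-level engagement output are illustrative assumptions rather than the MIAM network itself.

```python
# Minimal late-fusion sketch for engagement prediction from RGB, IMU, and
# skeleton streams. Encoder choices and dimensions are illustrative, not the
# paper's actual architecture.
import torch
import torch.nn as nn

class EngagementFusionNet(nn.Module):
    def __init__(self, rgb_dim=2048, imu_dim=6, skel_dim=34, hidden=256, num_levels=3):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, hidden)                  # per-clip RGB feature
        self.imu_enc = nn.GRU(imu_dim, hidden, batch_first=True)    # raw IMU sequence
        self.skel_enc = nn.GRU(skel_dim, hidden, batch_first=True)  # flattened 2D joints
        self.head = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_levels)
        )

    def forward(self, rgb_feat, imu_seq, skel_seq):
        r = self.rgb_proj(rgb_feat)                 # (B, hidden)
        _, h_imu = self.imu_enc(imu_seq)            # h_imu: (1, B, hidden)
        _, h_skel = self.skel_enc(skel_seq)
        fused = torch.cat([r, h_imu[-1], h_skel[-1]], dim=-1)
        return self.head(fused)                     # engagement-level logits

# Dummy batch: 4 clips, 50 IMU samples, 50 skeleton frames (17 joints x 2 coords).
model = EngagementFusionNet()
logits = model(torch.randn(4, 2048), torch.randn(4, 50, 6), torch.randn(4, 50, 34))
print(logits.shape)  # torch.Size([4, 3])
```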
https://arxiv.org/abs/2501.05936
Most existing traffic video datasets, including Waymo, are structured and focus predominantly on Western traffic, which hinders their global applicability. In particular, many Asian scenarios are far more complex, involving numerous objects with distinct motions and behaviors. Addressing this gap, we present a new dataset, DAVE, designed for evaluating perception methods with a high representation of Vulnerable Road Users (VRUs: e.g., pedestrians, animals, motorbikes, and bicycles) in complex and unpredictable environments. DAVE is a manually annotated dataset encompassing 16 diverse actor categories (spanning animals, humans, vehicles, etc.) and 16 action types (complex and rare cases like cut-ins, zigzag movement, U-turns, etc.) that require high reasoning ability. DAVE densely annotates over 13 million actor bounding boxes (bboxes) with identities, and more than 1.6 million boxes are annotated with both actor identification and action/behavior details. The videos within DAVE are collected across a broad spectrum of factors, such as weather conditions, time of day, road scenarios, and traffic density. DAVE can benchmark video tasks like Tracking, Detection, Spatiotemporal Action Localization, Language-Visual Moment Retrieval, and Multi-label Video Action Recognition. Given the critical importance of accurately identifying VRUs to prevent accidents and ensure road safety, vulnerable road users constitute 41.13% of instances in DAVE, compared to 23.71% in Waymo. DAVE provides an invaluable resource for the development of more sensitive and accurate visual perception algorithms in the complex real world. Our experiments show that existing methods suffer performance degradation when evaluated on DAVE, highlighting its benefit for future video recognition research.
https://arxiv.org/abs/2412.20042
Weakly supervised temporal action localization (WS-TAL) aims to localize complete action instances and categorize them using only video-level labels. Action-background ambiguity, primarily caused by background noise resulting from aggregation and by intra-action variation, is a significant challenge for existing WS-TAL methods. In this paper, we introduce a hybrid multi-head attention (HMHA) module and a generalized uncertainty-based evidential fusion (GUEF) module to address the problem. The proposed HMHA effectively enhances RGB and optical flow features by filtering redundant information and adjusting their feature distribution to better align with the WS-TAL task. Additionally, the proposed GUEF adaptively eliminates the interference of background noise by fusing snippet-level evidence to refine uncertainty measurement and select superior foreground feature information, which enables the model to concentrate on integral action instances and achieve better action localization and classification performance. Experimental results on the THUMOS14 dataset demonstrate that our method outperforms state-of-the-art methods. Our code is available at this https URL.
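As a rough illustration of attention-based two-stream enhancement for WS-TAL snippets, the sketch below lets RGB and optical-flow features attend to each other and gates out redundant channels; it is a simplified stand-in, not the paper's exact HMHA or GUEF design, and all dimensions are assumed.

```python
# Illustrative two-stream attention fusion for snippet-level RGB/flow features.
import torch
import torch.nn as nn

class TwoStreamAttentionFusion(nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.rgb_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.flow_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, rgb, flow):
        # rgb, flow: (B, T, dim) snippet-level features
        rgb_enh, _ = self.rgb_attn(rgb, flow, flow)    # each stream attends to the other
        flow_enh, _ = self.flow_attn(flow, rgb, rgb)
        g = self.gate(torch.cat([rgb_enh, flow_enh], dim=-1))  # gate down redundant channels
        return g * rgb_enh + (1 - g) * flow_enh

fusion = TwoStreamAttentionFusion()
out = fusion(torch.randn(2, 100, 1024), torch.randn(2, 100, 1024))
print(out.shape)  # torch.Size([2, 100, 1024])
```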
https://arxiv.org/abs/2412.19418
Existing skeleton-based human action classification models rely on well-trimmed, action-specific skeleton videos for both training and testing, precluding their scalability to real-world applications where untrimmed videos exhibiting concatenated actions are predominant. To overcome this limitation, recently introduced skeleton-based action segmentation models incorporate untrimmed skeleton videos into end-to-end training. The model is optimized to provide frame-wise predictions for testing videos of any length, simultaneously realizing action localization and classification. Yet, achieving this improvement requires frame-wise annotated skeleton videos, which remain time-consuming to obtain in practice. This paper features a novel framework for skeleton-based action segmentation that is trained on short trimmed skeleton videos but can run on longer untrimmed videos. The approach is implemented in three steps: Stitch, Contrast, and Segment. First, Stitch proposes a temporal skeleton stitching scheme that treats trimmed skeleton videos as elementary human motions that compose a semantic space and can be sampled to generate multi-action stitched sequences. Contrast learns contrastive representations from stitched sequences with a novel discrimination pretext task that enables a skeleton encoder to learn meaningful action-temporal contexts to improve action segmentation. Finally, Segment relates the proposed method to action segmentation by learning a segmentation layer while handling particular data-availability constraints. Experiments involve a trimmed source dataset and an untrimmed target dataset in an adaptation formulation for real-world skeleton-based human action segmentation to evaluate the effectiveness of the proposed method.
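A toy version of the Stitch step under assumed shapes: trimmed skeleton clips are sampled and concatenated in time to form a multi-action sequence with frame-wise labels, which the Contrast step could then consume.

```python
# Toy skeleton stitching: sample trimmed clips and concatenate them in time to
# build a multi-action sequence with frame-wise labels. Shapes are illustrative.
import numpy as np

def stitch_sequences(trimmed_clips, labels, num_clips, rng=np.random.default_rng(0)):
    """trimmed_clips: list of (T_i, J, C) skeleton arrays; labels: list of ints."""
    idx = rng.integers(0, len(trimmed_clips), size=num_clips)
    frames = np.concatenate([trimmed_clips[i] for i in idx], axis=0)
    frame_labels = np.concatenate(
        [np.full(len(trimmed_clips[i]), labels[i]) for i in idx]
    )
    return frames, frame_labels  # (sum T_i, J, C), (sum T_i,)

clips = [np.random.randn(40 + 10 * k, 17, 3) for k in range(5)]   # 5 trimmed actions
stitched, y = stitch_sequences(clips, labels=[0, 1, 2, 3, 4], num_clips=3)
print(stitched.shape, y.shape)
```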
https://arxiv.org/abs/2412.14988
Temporal action localization (TAL) involves the dual tasks of classifying and localizing actions within untrimmed videos. However, the two tasks often have conflicting requirements for features. Existing methods typically employ separate heads for the classification and localization tasks but share the same input feature, leading to suboptimal performance. To address this issue, we propose a novel TAL method with Cross Layer Task Decoupling and Refinement (CLTDR). Based on the video feature pyramid, the CLTDR strategy integrates semantically strong features from higher pyramid layers and detailed boundary-aware features from lower pyramid layers to effectively disentangle the action classification and localization tasks. Moreover, features from multiple cross layers are also employed to refine and align the disentangled classification and regression results. Finally, a lightweight Gated Multi-Granularity (GMG) module is proposed to comprehensively extract and aggregate video features at instant, local, and global temporal granularities. Benefiting from the CLTDR and GMG modules, our method achieves state-of-the-art performance on five challenging benchmarks: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100, ActivityNet-1.3, and HACS. Our code and pre-trained models are publicly available at: this https URL.
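The decoupling idea can be sketched as follows, with the classification head reading a semantically strong (coarser) pyramid level and the regression head reading a boundary-detailed (finer) level; the actual CLTDR refinement and alignment steps are omitted and the layer choices are assumptions.

```python
# Simplified cross-layer task decoupling: classification from a higher pyramid
# level, boundary regression from a lower one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledHeads(nn.Module):
    def __init__(self, dim=512, num_classes=20):
        super().__init__()
        self.cls_head = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg_head = nn.Conv1d(dim, 2, kernel_size=3, padding=1)  # start/end offsets

    def forward(self, pyramid):
        # pyramid: list of (B, dim, T_l) features, index 0 = finest level
        low, high = pyramid[0], pyramid[-1]
        # upsample the coarse level so both heads predict per fine time step
        high_up = F.interpolate(high, size=low.shape[-1], mode="linear", align_corners=False)
        cls_logits = self.cls_head(high_up)          # semantics from higher layers
        reg_offsets = self.reg_head(low).exp()       # boundaries from lower layers
        return cls_logits, reg_offsets

heads = DecoupledHeads()
feats = [torch.randn(2, 512, 256), torch.randn(2, 512, 64), torch.randn(2, 512, 16)]
cls, reg = heads(feats)
print(cls.shape, reg.shape)  # (2, 20, 256) (2, 2, 256)
```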
https://arxiv.org/abs/2412.09202
The naturalistic driving action localization task aims to recognize and comprehend human behaviors and actions from video data captured during real-world driving scenarios. Previous studies have shown strong action localization performance by applying a recognition model followed by probability-based post-processing. Nevertheless, the probabilities provided by the recognition model frequently contain confusing information, posing a challenge for post-processing. In this work, we adopt an action recognition model based on self-supervised learning to detect distracted activities and produce candidate action probabilities. Subsequently, a constraint ensemble strategy takes advantage of multi-camera views to provide robust predictions. Finally, we introduce a conditional post-processing operation to locate distracted behaviors and action temporal boundaries precisely. On test set A2, our method obtains sixth place on the public leaderboard of Track 3 of the 2024 AI City Challenge.
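A rough sketch of the described pipeline's final stages under assumed shapes: class probabilities from several camera views are averaged, then thresholded and grouped into temporal segments; the challenge entry's specific constraints and conditions are not reproduced.

```python
# Multi-view probability averaging followed by threshold-based temporal grouping.
import numpy as np

def localize(prob_per_view, threshold=0.6, min_len=8):
    """prob_per_view: (V, T, C) class probabilities from V camera views; class 0 = background."""
    probs = prob_per_view.mean(axis=0)                        # ensemble the camera views
    pred, conf = probs.argmax(axis=1), probs.max(axis=1)
    segments, start = [], None
    for t in range(len(pred) + 1):                            # +1 closes an open segment at the end
        active = t < len(pred) and conf[t] >= threshold and pred[t] != 0
        if active and start is None:
            start = t
        elif not active and start is not None:
            if t - start >= min_len:
                segments.append((start, t, int(pred[start])))
            start = None
    return segments

print(localize(np.random.rand(3, 300, 16)))
```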
https://arxiv.org/abs/2411.12525
Recent breakthroughs in Multimodal Large Language Models (MLLMs) have gained significant recognition within the deep learning community, where the fusion of Video Foundation Models (VFMs) and Large Language Models (LLMs) has proven instrumental in constructing robust video understanding systems, effectively surmounting constraints associated with predefined visual tasks. These sophisticated MLLMs exhibit remarkable proficiency in comprehending videos, swiftly attaining unprecedented performance levels across diverse benchmarks. However, their operation demands substantial memory and computational resources, underscoring the continued importance of traditional models in video comprehension tasks. In this paper, we introduce a novel learning paradigm termed MLLM4WTAL. This paradigm harnesses the potential of MLLMs to offer temporal action key semantics and complete semantic priors for conventional Weakly-supervised Temporal Action Localization (WTAL) methods. MLLM4WTAL facilitates the enhancement of WTAL by leveraging MLLM guidance. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR). These modules work in tandem to effectively address prevalent issues such as the incomplete and over-complete outcomes common in WTAL methods. Rigorous experiments are conducted to validate the efficacy of our proposed approach in augmenting the performance of various heterogeneous WTAL models.
https://arxiv.org/abs/2411.08466
For the temporal action localization task on the ActivityNet-1.3 dataset, we propose to locate the temporal boundaries of each action and predict the action class in untrimmed videos. We first apply VideoSwinTransformer as the feature extractor to extract different features. Then we apply a unified network following Faster-TAD to simultaneously obtain proposals and semantic labels. Finally, we ensemble the results of different temporal action detection models, which complement each other. Faster-TAD simplifies the TAD pipeline and achieves remarkable performance, obtaining results comparable to those of multi-step approaches.
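An illustrative way to ensemble complementary detectors, assuming each returns (start, end, score, label) proposals: pool everything and apply temporal NMS. The actual fusion used in this entry is not specified here.

```python
# Pool proposals from several detectors and suppress overlapping duplicates.
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def ensemble_nms(proposal_sets, iou_thr=0.6):
    pooled = sorted((p for ps in proposal_sets for p in ps), key=lambda p: -p[2])
    kept = []
    for p in pooled:   # p = (start, end, score, label)
        if all(temporal_iou(p, q) < iou_thr or p[3] != q[3] for q in kept):
            kept.append(p)
    return kept

model_a = [(1.0, 5.0, 0.9, "wash"), (10.0, 14.0, 0.6, "cut")]
model_b = [(1.2, 5.1, 0.8, "wash"), (20.0, 25.0, 0.7, "pour")]
print(ensemble_nms([model_a, model_b]))
```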
https://arxiv.org/abs/2411.00883
Precise action localization in untrimmed video is vital for fields such as professional sports and minimally invasive surgery, where the delineation of particular motions in recordings can dramatically enhance analysis. But in many cases, large-scale datasets with video-label pairs for localization are unavailable, limiting the opportunity to fine-tune video-understanding models. Recent developments in large vision-language models (LVLMs) address this need with impressive zero-shot capabilities in a variety of video understanding tasks. However, the adaptation of image-based LVLMs, with their powerful visual question answering capabilities, to action localization in long-form video is still relatively unexplored. To this end, we introduce a true ZEro-shot Action Localization method (ZEAL). Specifically, we leverage the built-in action knowledge of a large language model (LLM) to inflate actions into highly detailed descriptions of the archetypal start and end of the action. These descriptions serve as queries to the LVLM for generating frame-level confidence scores, which can be aggregated to produce localization outputs. The simplicity and flexibility of our method make it amenable to more capable LVLMs as they are developed, and we demonstrate remarkable results in zero-shot action localization on a challenging benchmark, without any training.
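A schematic of the described zero-shot recipe: frame-level confidences for LLM-generated start and end descriptions are aggregated into a segment. `score_with_lvlm` is a hypothetical placeholder for the actual vision-language model call, and the smoothing and aggregation choices are assumptions.

```python
# Schematic zero-shot localization from per-frame start/end confidence scores.
import numpy as np

def score_with_lvlm(frame, text_query):
    # placeholder: return a confidence in [0, 1] that the frame matches the query
    return float(np.random.rand())

def localize_action(frames, start_desc, end_desc, smooth=5):
    start_scores = np.array([score_with_lvlm(f, start_desc) for f in frames])
    end_scores = np.array([score_with_lvlm(f, end_desc) for f in frames])
    kernel = np.ones(smooth) / smooth                       # simple temporal smoothing
    start_scores = np.convolve(start_scores, kernel, mode="same")
    end_scores = np.convolve(end_scores, kernel, mode="same")
    t_start = int(start_scores.argmax())
    t_end = t_start + int(end_scores[t_start:].argmax())    # end must follow start
    return t_start, t_end

frames = [None] * 120   # stand-ins for decoded video frames
print(localize_action(frames, "hands grip the club at address", "club completes the follow-through"))
```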
https://arxiv.org/abs/2410.14340
Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on single input modalities, such as RGB or skeletal data, limiting their ability to exploit the complementary advantages across modalities. Recent studies focus on combining these two modalities using simple feature fusion techniques. However, due to the inherent disparities in representation between these input modalities, designing a unified neural network architecture to effectively leverage their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is the introduction of a novel compositional query machine, called COMPUTER (COMPositional hUman-cenTric quERy machine), a generic neural architecture that models the interactions between a human of interest and its surroundings in both space and time. Thanks to its versatile design, COMPUTER can be leveraged to distill distinctive representations for various input modalities. Additionally, we introduce a consistency loss that enforces agreement in prediction between modalities, exploiting the complementary information from multimodal inputs for robust human movement recognition. Through extensive experiments on action localization and group activity recognition tasks, our approach demonstrates superior performance when compared with state-of-the-art methods. Our code is available at: this https URL.
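One plausible form of the cross-modal consistency term is a symmetric KL divergence between the RGB-branch and skeleton-branch predictions, sketched below; the paper's exact formulation may differ.

```python
# Symmetric KL consistency between two modality-specific prediction heads.
import torch
import torch.nn.functional as F

def consistency_loss(logits_rgb, logits_skel):
    p = F.log_softmax(logits_rgb, dim=-1)
    q = F.log_softmax(logits_skel, dim=-1)
    return 0.5 * (
        F.kl_div(p, q, log_target=True, reduction="batchmean")
        + F.kl_div(q, p, log_target=True, reduction="batchmean")
    )

loss = consistency_loss(torch.randn(8, 10), torch.randn(8, 10))
print(loss.item())
```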
https://arxiv.org/abs/2409.02385
Video action localization aims to find the timing of a specific action in a long video. Although existing learning-based approaches have been successful, they require annotated videos, which come at a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels and ask a VLM to guess the frame that is closest to the start or end of the action. Iterating this process while narrowing the sampling time window identifies the specific start and end frames of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos.
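The iterative narrowing idea can be sketched as follows; `ask_vlm` is a hypothetical placeholder for the real prompt, which would tile the sampled frames with index labels into one image, and the shrink factor and iteration count are assumptions.

```python
# Iterative narrowing of a sampling window around the frame a VLM picks.
import random

def ask_vlm(frame_indices, action_text):
    # placeholder: a real implementation would tile the labeled frames into one
    # image and return the index the model selects
    return random.choice(frame_indices)

def find_boundary(num_frames, action_text, samples=8, iterations=4):
    lo, hi = 0, num_frames - 1
    for _ in range(iterations):
        step = max(1, (hi - lo) // samples)
        candidates = list(range(lo, hi + 1, step))
        picked = ask_vlm(candidates, action_text)
        half = max(1, (hi - lo) // 4)          # shrink the window around the pick
        lo, hi = max(0, picked - half), min(num_frames - 1, picked + half)
    return (lo + hi) // 2

print(find_boundary(900, "the person starts pouring water"))
```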
https://arxiv.org/abs/2408.17422
To address the zero-shot temporal action localization (ZSTAL) task, existing works develop models that are generalizable to detect and classify actions from unseen categories. They typically develop a category-agnostic action detector and combine it with the Contrastive Language-Image Pre-training (CLIP) model to solve ZSTAL. However, these methods suffer from incomplete action proposals generated for unseen categories, since they follow a frame-level prediction paradigm and require hand-crafted post-processing to generate action proposals. To address this problem, in this work, we propose a novel model named Generalizable Action Proposal generator (GAP), which can interface seamlessly with CLIP and generate action proposals in a holistic way. Our GAP is built on a query-based architecture and trained with a proposal-level objective, enabling it to estimate proposal completeness and eliminate the hand-crafted post-processing. Based on this architecture, we propose an Action-aware Discrimination loss to enhance the category-agnostic dynamic information of actions. Besides, we introduce a Static-Dynamic Rectifying module that incorporates the generalizable static information from CLIP to refine the predicted proposals, which improves proposal completeness in a generalizable manner. Our experiments show that GAP achieves state-of-the-art performance on two challenging ZSTAL benchmarks, i.e., Thumos14 and ActivityNet1.3. Specifically, our model obtains significant performance improvements over previous works on the two benchmarks, i.e., +3.2% and +3.4% average mAP, respectively.
https://arxiv.org/abs/2408.13777
Existing few-shot temporal action localization models cannot handle videos that contain multiple action instances. The purpose of this paper is therefore to localize multiple action instances in a lengthy untrimmed query video using only a limited number of trimmed support videos. To address this challenging problem effectively, we propose a novel solution involving a spatial-channel relation transformer with probability learning and cluster refinement. This method can accurately identify the start and end boundaries of actions in the query video, utilizing only a limited number of labeled videos. Our proposed method is adept at capturing both temporal and spatial contexts to effectively classify and precisely locate actions in videos, enabling a more comprehensive utilization of these crucial details. The selective cosine penalization algorithm is designed to suppress temporal boundaries that do not include action scene switches. Probability learning combined with the label generation algorithm alleviates the problem of action duration diversity and enhances the model's ability to handle fuzzy action boundaries. Interval clustering helps produce the final results in multi-instance, few-shot temporal action localization. Our model achieves competitive performance in thorough experiments on the ActivityNet1.3 and THUMOS14 benchmark datasets. Our code is readily available at this https URL.
https://arxiv.org/abs/2408.13765
Online video understanding often relies on individual frames, leading to frame-by-frame predictions. Recent advancements, such as Online Temporal Action Localization (OnTAL), extend this approach to instance-level predictions. However, existing methods mainly focus on short-term context, neglecting historical information. To address this, we introduce the History-Augmented Anchor Transformer (HAT) framework for OnTAL. By integrating historical context, our framework enhances the synergy between long-term and short-term information, improving the quality of the anchor features that are crucial for classification and localization. We evaluate our model on both procedural egocentric (PREGO) datasets (EGTEA and EPIC) and standard non-PREGO OnTAL datasets (THUMOS and MUSES). Results show that our model significantly outperforms state-of-the-art approaches on PREGO datasets and achieves comparable or slightly superior performance on non-PREGO datasets, underscoring the importance of leveraging long-term history, especially in procedural and egocentric action scenarios. Code is available at: this https URL
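A simplified sketch of history augmentation: short-term anchor features cross-attend to a bank of long-term history features; HAT's anchor generation, heads, and memory construction are not reproduced, and the dimensions are assumed.

```python
# Cross-attention from short-term anchor features to long-term history features.
import torch
import torch.nn as nn

class HistoryAugmentedAnchors(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, anchor_feats, history_feats):
        # anchor_feats: (B, A, dim) short-term anchors; history_feats: (B, H, dim)
        ctx, _ = self.cross_attn(anchor_feats, history_feats, history_feats)
        return self.norm(anchor_feats + ctx)   # history-conditioned anchor features

m = HistoryAugmentedAnchors()
out = m(torch.randn(2, 16, 512), torch.randn(2, 200, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```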
https://arxiv.org/abs/2408.06437
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at this https URL.
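A toy illustration of the probabilistic embedding ingredient, assuming snippets are mapped to diagonal Gaussians: the squared 2-Wasserstein distance between two Gaussians gives a statistical (dis)similarity that a distribution-level contrastive loss could build on; the paper's actual intra- and inter-distribution objectives are not reproduced.

```python
# Probabilistic (Gaussian) embeddings and a closed-form distance between them.
import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    def __init__(self, in_dim=1024, emb_dim=128):
        super().__init__()
        self.mu = nn.Linear(in_dim, emb_dim)
        self.logvar = nn.Linear(in_dim, emb_dim)

    def forward(self, x):
        return self.mu(x), self.logvar(x)

def wasserstein2_sq(mu1, logvar1, mu2, logvar2):
    # squared 2-Wasserstein distance between diagonal Gaussians
    s1, s2 = (0.5 * logvar1).exp(), (0.5 * logvar2).exp()
    return ((mu1 - mu2) ** 2).sum(-1) + ((s1 - s2) ** 2).sum(-1)

head = ProbabilisticHead()
mu_a, lv_a = head(torch.randn(4, 1024))
mu_b, lv_b = head(torch.randn(4, 1024))
print(wasserstein2_sq(mu_a, lv_a, mu_b, lv_b))  # one distance per pair in the batch
```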
https://arxiv.org/abs/2408.05955
Online temporal action localization (On-TAL) is the task of identifying multiple action instances given a streaming video. Since existing methods take as input only a video segment of fixed size per iteration, they are limited in considering long-term context and require careful tuning of the segment size. To overcome these limitations, we propose the memory-augmented transformer (MATR). MATR utilizes a memory queue that selectively preserves past segment features, allowing it to leverage long-term context for inference. We also propose a novel action localization method that observes the current input segment to predict the end time of the ongoing action and accesses the memory queue to estimate the start time of the action. Our method outperformed existing methods on two datasets, THUMOS14 and MUSES, surpassing not only TAL methods in the online setting but also some offline TAL methods.
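A minimal sketch of the memory-queue idea: past segment features sit in a bounded FIFO so start-time estimation can look far back while the current segment drives end-time prediction; the transformer decoder and prediction heads are omitted, and the sizes are assumptions.

```python
# Bounded FIFO of past segment features with similarity-based retrieval.
from collections import deque
import torch

class SegmentMemory:
    def __init__(self, max_segments=64):
        self.queue = deque(maxlen=max_segments)   # oldest features drop out automatically

    def push(self, seg_feature):                  # seg_feature: (dim,)
        self.queue.append(seg_feature)

    def retrieve(self, query, top_k=5):
        """Return the k past segment features most similar to the current query."""
        if not self.queue:
            return torch.empty(0, query.shape[-1])
        mem = torch.stack(list(self.queue))                          # (N, dim)
        sims = torch.nn.functional.cosine_similarity(mem, query.unsqueeze(0), dim=-1)
        top = sims.topk(min(top_k, len(self.queue))).indices
        return mem[top]

memory = SegmentMemory()
for _ in range(10):
    memory.push(torch.randn(256))
print(memory.retrieve(torch.randn(256)).shape)   # torch.Size([5, 256])
```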
https://arxiv.org/abs/2408.02957
In sewer pipe Closed-Circuit Television (CCTV) inspection, accurate temporal defect localization is essential for effective defect classification, detection, segmentation and quantification. Industry standards typically do not require time-interval annotations, even though they are more informative than time-point annotations for defect localization, resulting in additional annotation costs when fully supervised methods are used. Additionally, differences in scene types and camera motion patterns between pipe inspections and Temporal Action Localization (TAL) hinder the effective transfer of point-supervised TAL methods. Therefore, this study introduces a Semi-supervised multi-Prototype-based method incorporating visual Odometry for enhanced attention guidance (PipeSPO). PipeSPO fully leverages unlabeled data through unsupervised pretext tasks and utilizes time-point annotated data with a weakly supervised multi-prototype-based method, relying on visual odometry features to capture camera pose information. Experiments on real-world datasets demonstrate that PipeSPO achieves 41.89% average precision across Intersection over Union (IoU) thresholds of 0.1-0.7, improving by 8.14% over current state-of-the-art methods.
https://arxiv.org/abs/2407.15170
Temporal Action Localization (TAL) is a critical task in video analysis, identifying precise start and end times of actions. Existing methods like CNNs, RNNs, GCNs, and Transformers have limitations in capturing long-range dependencies and temporal causality. To address these challenges, we propose a novel TAL architecture leveraging the Selective State Space Model (S6). Our approach integrates the Feature Aggregated Bi-S6 block, Dual Bi-S6 structure, and a recurrent mechanism to enhance temporal and channel-wise dependency modeling without increasing parameter complexity. Extensive experiments on benchmark datasets demonstrate state-of-the-art results with mAP scores of 74.2% on THUMOS-14, 42.9% on ActivityNet, 29.6% on FineAction, and 45.8% on HACS. Ablation studies validate our method's effectiveness, showing that the Dual structure in the Stem module and the recurrent mechanism outperform traditional approaches. Our findings demonstrate the potential of S6-based models in TAL tasks, paving the way for future research.
https://arxiv.org/abs/2407.13078
Online Temporal Action Localization (On-TAL) is a critical task that aims to instantaneously identify action instances in untrimmed streaming videos as soon as an action concludes, a major leap from frame-based Online Action Detection (OAD). Yet, the challenge of detecting overlapping actions is often overlooked even though it is a common scenario in streaming videos. Current methods that can address concurrent actions depend heavily on class information, limiting their flexibility. This paper introduces ActionSwitch, the first class-agnostic On-TAL framework capable of detecting overlapping actions. By obviating the reliance on class information, ActionSwitch provides wider applicability to various situations, including overlapping actions of the same class or scenarios where class information is unavailable. This approach is complemented by the proposed "conservativeness loss", which directly embeds a conservative decision-making principle into the loss function for On-TAL. Our ActionSwitch achieves state-of-the-art performance on complex datasets, including Epic-Kitchens 100, which targets the challenging egocentric view, and FineAction, which consists of fine-grained actions.
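A toy, decoding-side view of the class-agnostic "switch" idea, assuming the trained model emits per-frame on/off states for a small number of switches: an instance is produced each time a switch turns off, which allows overlapping instances without class information. The model itself and the conservativeness loss are not shown.

```python
# Decode per-frame switch states into (start, end) action instances.
def decode_switch_states(states):
    """states: per-frame list of tuples, one on/off flag per switch."""
    num_switches = len(states[0])
    open_start = [None] * num_switches
    instances = []
    for t, flags in enumerate(states):
        for k, on in enumerate(flags):
            if on and open_start[k] is None:
                open_start[k] = t                       # switch turned on: action starts
            elif not on and open_start[k] is not None:
                instances.append((open_start[k], t))    # switch turned off: emit instance
                open_start[k] = None
    return instances

frames = [(0, 0), (1, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(decode_switch_states(frames))   # [(1, 4), (3, 5)] -> two overlapping instances
```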
https://arxiv.org/abs/2407.12987
Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. The latest WSTAL methods introduce a pseudo-label learning framework to bridge the gap between classification-based training and the localization targets used at inference, and achieve cutting-edge results. In these frameworks, a classification-based model is used to generate pseudo labels for a regression-based student model to learn from. However, the quality of pseudo labels in the framework, which is a key factor in the final result, has not been carefully studied. In this paper, we propose a set of simple yet efficient pseudo-label quality enhancement mechanisms to build our FuSTAL framework. FuSTAL enhances pseudo-label quality at three stages: cross-video contrastive learning at the proposal generation stage, prior-based filtering at the proposal selection stage, and EMA-based distillation at the training stage. These designs enhance pseudo-label quality at different stages in the framework and help produce more informative, less spurious, and smoother action proposals. With the help of these comprehensive designs at all stages, FuSTAL achieves an average mAP of 50.8% on THUMOS'14, outperforming the previous best method by 1.2% and becoming the first method to reach the 50% milestone.
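The EMA-based distillation ingredient can be sketched in a few lines, assuming a student and a teacher of identical architecture: the teacher's weights track an exponential moving average of the student's, and its smoother outputs would serve as pseudo-label targets.

```python
# Exponential-moving-average teacher update for distillation-style training.
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = torch.nn.Linear(1024, 21)          # stand-in for the localization student
teacher = copy.deepcopy(student)
for step in range(3):                        # inside the usual training loop
    # ... student forward/backward/optimizer step would go here ...
    ema_update(teacher, student)
print(sum(p.abs().sum().item() for p in teacher.parameters()))
```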
https://arxiv.org/abs/2407.08971