Temporal action localization (TAL), which involves recognizing and locating action instances, is a challenging task in video understanding. Most existing approaches directly predict action classes and regress offsets to boundaries, while overlooking the differing importance of each frame. In this paper, we propose an Action Sensitivity Learning framework (ASL) to tackle this task, which aims to assess the value of each frame and then leverage the generated action sensitivity to recalibrate the training procedure. We first introduce a lightweight Action Sensitivity Evaluator to learn the action sensitivity at the class level and instance level, respectively. The outputs of the two branches are combined to reweight the gradient of the two sub-tasks. Moreover, based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, where the action-aware frames are sampled as positive pairs to push away the action-irrelevant frames. Extensive studies on various action localization benchmarks (i.e., MultiThumos, Charades, Ego4D-Moment Queries v1.0, Epic-Kitchens 100, Thumos14 and ActivityNet1.3) show that ASL surpasses the state-of-the-art in terms of average mAP under multiple types of scenarios, e.g., single-labeled, densely-labeled and egocentric.
https://arxiv.org/abs/2305.15701
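A minimal sketch of the sensitivity-based recalibration described above: per-frame weights from the class-level and instance-level branches reweight the classification and regression losses. This is not the authors' code; the function and argument names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sensitivity_weighted_losses(cls_logits, cls_targets, reg_preds, reg_targets,
                                class_sens, inst_sens):
    """cls_logits: (T, C); cls_targets: (T,) long; reg_preds/reg_targets: (T, 2);
    class_sens / inst_sens: (T,) per-frame sensitivity from the two branches (assumed non-negative)."""
    sens = class_sens + inst_sens                      # combine class- and instance-level sensitivity
    sens = sens / (sens.sum() + 1e-6) * sens.numel()   # normalize so the weights average to ~1
    cls_loss = F.cross_entropy(cls_logits, cls_targets, reduction='none')           # (T,)
    reg_loss = F.smooth_l1_loss(reg_preds, reg_targets, reduction='none').mean(-1)  # (T,)
    # Recalibrate both sub-tasks frame by frame with the learned sensitivity.
    return (sens * cls_loss).mean(), (sens * reg_loss).mean()
```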
Weakly-supervised temporal action localization aims to identify and localize the action instances in untrimmed videos with only video-level action labels. When humans watch videos, we can adapt our abstract-level knowledge about actions to different video scenarios and detect whether some actions are occurring. In this paper, we mimic this human capability and bring a new perspective for locating and identifying multiple actions in a video. We propose a network named VQK-Net with video-specific query-key attention modeling that learns a unique query for each action category of each input video. The learned queries not only contain the actions' knowledge features at the abstract level but also have the ability to fit this knowledge into the target video scenario, and they will be used to detect the presence of the corresponding action along the temporal dimension. To better learn these action category queries, we exploit not only the features of the current input video but also the correlation between different videos through a novel video-specific action category query learner that works with a query similarity loss. Finally, we conduct extensive experiments on three commonly used datasets (THUMOS14, ActivityNet1.2, and ActivityNet1.3) and achieve state-of-the-art performance.
https://arxiv.org/abs/2305.04186
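A rough sketch of the query-key idea in the entry above: learnable per-class queries are adapted to the current video and matched against temporal features to score class presence over time. Class and module names are hypothetical, not the VQK-Net implementation.

```python
import torch
import torch.nn as nn

class QueryKeyAttention(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.class_queries = nn.Parameter(torch.randn(num_classes, feat_dim))  # abstract action knowledge
        self.query_adapter = nn.Linear(feat_dim, feat_dim)   # fits the queries to the current video
        self.key_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, video_feats):                 # video_feats: (T, D)
        video_ctx = video_feats.mean(dim=0)         # crude video-level context
        queries = self.query_adapter(self.class_queries + video_ctx)  # (C, D) video-specific queries
        keys = self.key_proj(video_feats)           # (T, D)
        return queries @ keys.t()                   # (C, T): per-class presence along time

tcam = QueryKeyAttention(2048, 20)(torch.randn(100, 2048))
print(tcam.shape)  # torch.Size([20, 100])
```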
Due to the lack of temporal annotation, current Weakly-supervised Temporal Action Localization (WTAL) methods are generally stuck in over-complete or incomplete localization. In this paper, we aim to leverage text information to boost WTAL from two aspects, i.e., (a) a discriminative objective that enlarges the inter-class difference, thus reducing over-complete localization; (b) a generative objective that enhances intra-class integrity, thus finding more complete temporal boundaries. For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label and regards the text as a query to mine all class-related segments. Without temporal annotations of actions, TSM compares the text query with entire videos across the dataset to mine the best-matching segments while ignoring irrelevant ones. Because different categories of videos share sub-actions, merely applying TSM is too strict and neglects semantically related segments, which results in incomplete localization. We further introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantically related segments from videos to complete the text sentence. We achieve state-of-the-art performance on THUMOS14 and ActivityNet1.3. Surprisingly, we also find that our proposed method can be seamlessly applied to existing methods and improve their performance by a clear margin. The code is available at this https URL.
https://arxiv.org/abs/2305.00607
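A hedged sketch of the text-segment mining step: a class-label text embedding acts as a query and the top-matching segment embeddings are kept as class-related segments. Names and shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def mine_segments(text_emb, segment_embs, top_k=5):
    """text_emb: (D,) embedding of a text description built from the class label;
    segment_embs: (N, D) embeddings of candidate segments across the dataset."""
    sims = F.cosine_similarity(text_emb.unsqueeze(0), segment_embs, dim=-1)  # (N,)
    scores, idx = sims.topk(min(top_k, segment_embs.size(0)))
    return idx, scores   # indices and scores of the best-matching segments
```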
Weakly Supervised Temporal Action Localization (WTAL) aims to classify and localize temporal boundaries of actions in a video, given only video-level category labels in the training datasets. Due to the lack of boundary information during training, existing approaches formulate WTAL as a classification problem, i.e., generating the temporal class activation map (T-CAM) for localization. However, with only a classification loss, the model would be sub-optimized, i.e., the action-related scenes are enough to distinguish different class labels. Regarding other actions in the action-related scene (i.e., the same scene as the positive actions) as co-scene actions, this sub-optimized model would misclassify the co-scene actions as positive actions. To address this misclassification, we propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions. The proposed Bi-SCC first adopts a temporal context augmentation to generate an augmented video that breaks the correlation between positive actions and their co-scene actions across videos; then, a semantic consistency constraint (SCC) is used to enforce the predictions of the original video and the augmented video to be consistent, hence suppressing the co-scene actions. However, we find that this augmented video would destroy the original temporal context, so simply applying the consistency constraint would affect the completeness of localized positive actions. Hence, we boost the SCC in a bidirectional way to suppress co-scene actions while ensuring the integrity of positive actions, by cross-supervising the original and augmented videos. Finally, our proposed Bi-SCC can be applied to current WTAL approaches and improve their performance. Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
https://arxiv.org/abs/2304.12616
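An illustrative sketch of a bidirectional consistency constraint between the T-CAMs of the original and the context-augmented video, cross-supervising in both directions. Shapes and the KL formulation are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def bi_consistency_loss(tcam_orig, tcam_aug):
    """tcam_orig, tcam_aug: (T, C) class activations for the same frames."""
    p_orig = F.log_softmax(tcam_orig, dim=-1)
    p_aug = F.log_softmax(tcam_aug, dim=-1)
    # Cross-supervise in both directions so suppressing co-scene actions does not
    # break the completeness of the positive actions.
    loss_o2a = F.kl_div(p_aug, p_orig.exp().detach(), reduction='batchmean')
    loss_a2o = F.kl_div(p_orig, p_aug.exp().detach(), reduction='batchmean')
    return loss_o2a + loss_a2o
```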
The task of weakly supervised temporal action localization aims to generate temporal boundaries for actions of interest while also classifying the action categories. Pseudo-label-based methods, which serve as an effective solution, have been widely studied recently. However, existing methods generate pseudo labels during training and make predictions during testing under different pipelines or settings, resulting in a gap between training and testing. In this paper, we propose to generate high-quality pseudo labels from the predicted action boundaries. Nevertheless, we note that existing post-processing, like NMS, would lead to information loss, which is insufficient to generate high-quality action boundaries. More importantly, transforming action boundaries into pseudo labels is quite challenging, since the predicted action instances are generally overlapped and have different confidence scores. Besides, the generated pseudo labels can be fluctuating and inaccurate at the early stage of training, and might repeatedly strengthen false predictions if there is no mechanism for self-correction. To tackle these issues, we come up with an effective pipeline for learning better pseudo labels. First, we propose a Gaussian weighted fusion module to preserve information of action instances and obtain high-quality action boundaries. Second, we formulate the pseudo-label generation as an optimization problem under constraints in terms of the confidence scores of action instances. Finally, we introduce the idea of $\Delta$ pseudo labels, which enables the model with the ability of self-correction. Our method achieves superior performance to existing methods on two benchmarks, THUMOS14 and ActivityNet1.3, achieving gains of 1.9% on THUMOS14 and 3.7% on ActivityNet1.3 in terms of average mAP.
https://arxiv.org/abs/2304.07978
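A hypothetical sketch of Gaussian-weighted fusion of overlapping action proposals into frame-level pseudo labels: each proposal contributes a confidence-weighted Gaussian over its temporal extent. Function and parameter names are assumptions.

```python
import torch

def gaussian_fusion(proposals, scores, num_frames, sigma_ratio=0.25):
    """proposals: (N, 2) start/end frame indices; scores: (N,) confidence scores."""
    t = torch.arange(num_frames, dtype=torch.float32)
    pseudo = torch.zeros(num_frames)
    weight = torch.zeros(num_frames)
    for (s, e), conf in zip(proposals.tolist(), scores.tolist()):
        center, sigma = (s + e) / 2.0, max((e - s) * sigma_ratio, 1.0)
        g = torch.exp(-0.5 * ((t - center) / sigma) ** 2)   # Gaussian over the proposal extent
        pseudo += conf * g                                   # confidence-weighted contribution
        weight += g
    return pseudo / weight.clamp(min=1e-6)                   # fused frame-level pseudo label
```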
Identifying unusual driving behaviors exhibited by drivers is essential for understanding driver behavior and the underlying causes of crashes. Previous studies have primarily approached this problem as a classification task, assuming that naturalistic driving videos come already discretized. However, both activity segmentation and classification are required for this task due to the continuous nature of naturalistic driving videos. The current study therefore departs from conventional approaches and introduces a novel methodological framework, DeepSegmenter, that simultaneously performs activity segmentation and classification in a single framework. The proposed framework consists of four major modules, namely the Data Module, Activity Segmentation Module, Classification Module and Postprocessing Module. Our proposed method won 8th place in the 2023 AI City Challenge, Track 3, with an activity overlap score of 0.5426 on experimental validation data. The experimental results demonstrate the effectiveness, efficiency, and robustness of the proposed system.
https://arxiv.org/abs/2304.08261
Weakly-supervised temporal action localization aims to localize action instances in untrimmed videos with only video-level supervision. We observe that different actions share common phases, e.g., the run-up in the HighJump and LongJump. These different actions are defined as conjoint actions, whose remaining parts are definite phases, e.g., leaping over the bar in a HighJump. Compared with the common phases, the definite phases are more easily localized in existing research. Most existing methods formulate this task as a Multiple Instance Learning paradigm, in which the common phases tend to be confused with the background, which affects the localization completeness of the conjoint actions. To tackle this challenge, we propose a Joint of Common and Definite phases Network (JCDNet) that improves feature discriminability of the conjoint actions. Specifically, we design a Class-Aware Discriminative module to enhance the contribution of the common phases in classification under the guidance of the coarse definite-phase features. Besides, we introduce a temporal attention module to learn robust action-ness scores via modeling temporal dependencies, distinguishing the common phases from the background. Extensive experiments on three datasets (THUMOS14, ActivityNetv1.2, and a conjoint-action subset) demonstrate that JCDNet achieves competitive performance against state-of-the-art methods. Keywords: weakly-supervised learning, temporal action localization, conjoint action
https://arxiv.org/abs/2303.17294
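A rough sketch of a temporal attention module that scores action-ness by modeling local temporal dependencies with 1D convolutions. This is an illustrative stand-in, not JCDNet's exact design; the hidden size and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),  # local temporal context
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),                                           # action-ness in [0, 1]
        )

    def forward(self, feats):               # feats: (B, D, T)
        return self.net(feats).squeeze(1)   # (B, T) action-ness scores

scores = TemporalAttention(2048)(torch.randn(2, 2048, 100))
print(scores.shape)  # torch.Size([2, 100])
```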
Temporal action detection aims to predict the time intervals and the classes of action instances in the video. Despite the promising performance, existing two-stream models exhibit slow inference speed due to their reliance on computationally expensive optical flow. In this paper, we introduce a decomposed cross-modal distillation framework to build a strong RGB-based detector by transferring knowledge of the motion modality. Specifically, instead of direct distillation, we propose to separately learn RGB and motion representations, which are in turn combined to perform action localization. The dual-branch design and the asymmetric training objectives enable effective motion knowledge transfer while preserving RGB information intact. In addition, we introduce a local attentive fusion to better exploit the multimodal complementarity. It is designed to preserve the local discriminability of the features that is important for action localization. Extensive experiments on the benchmarks verify the effectiveness of the proposed method in enhancing RGB-based action detectors. Notably, our framework is agnostic to backbones and detection heads, bringing consistent gains across different model combinations.
https://arxiv.org/abs/2303.17285
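A hedged sketch of the dual-branch idea above: the RGB branch is kept intact while a separate motion branch is distilled toward a frozen flow teacher's features, and only the motion branch receives the distillation objective. Module and function names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchRGB(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.rgb_branch = nn.Linear(dim, dim)     # preserves appearance information
        self.motion_branch = nn.Linear(dim, dim)  # learns motion-like features from RGB alone

    def forward(self, rgb_feats):                 # rgb_feats: (T, D)
        return self.rgb_branch(rgb_feats), self.motion_branch(rgb_feats)

def distill_loss(motion_pred, flow_teacher_feats):
    # Asymmetric objective: only the motion branch chases the (frozen) flow teacher.
    return F.mse_loss(motion_pred, flow_teacher_feats.detach())
```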
Most existing video-language pre-training methods focus on instance-level alignment between video clips and captions via global contrastive learning but neglect rich fine-grained local information, which is important to downstream tasks requiring temporal localization and semantic reasoning. In this work, we propose a simple yet effective video-language pre-training framework, namely G-ViLM, to learn discriminative spatiotemporal features. Two novel designs involving spatiotemporal grounding and temporal grouping promote learning local region-noun alignment and temporal-aware features simultaneously. Specifically, spatiotemporal grounding aggregates semantically similar video tokens and aligns them with noun phrases extracted from the caption to promote local region-noun correspondences. Moreover, temporal grouping leverages cut-and-paste to manually create temporal scene changes and then learns distinguishable features from different scenes. Comprehensive evaluations demonstrate that G-ViLM performs favorably against existing approaches on four representative downstream tasks, covering text-video retrieval, video question answering, video action recognition and temporal action localization. G-ViLM performs competitively on all evaluated tasks and in particular achieves R@10 of 65.1 on zero-shot MSR-VTT retrieval, over 9% higher than the state-of-the-art method.
https://arxiv.org/abs/2303.16341
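A hedged sketch of the cut-and-paste temporal grouping idea: frame features from two videos are spliced into one sequence, and the artificial scene boundary provides supervision for scene-distinguishable features. Names are assumptions and both inputs are assumed to have the same length; this is not the G-ViLM implementation.

```python
import torch

def cut_and_paste(feats_a, feats_b, cut_ratio=0.5):
    """feats_a, feats_b: (T, D) frame features from two different videos (same T assumed)."""
    t_a = int(feats_a.size(0) * cut_ratio)
    t_b = feats_b.size(0) - t_a
    mixed = torch.cat([feats_a[:t_a], feats_b[-t_b:]], dim=0)   # (T, D) spliced sequence
    scene_ids = torch.cat([torch.zeros(t_a, dtype=torch.long),
                           torch.ones(t_b, dtype=torch.long)])  # which scene each frame belongs to
    return mixed, scene_ids   # scene_ids supervise learning distinguishable per-scene features
```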
This paper simultaneously addresses three limitations associated with conventional skeleton-based action recognition; skeleton detection and tracking errors, poor variety of the targeted actions, as well as person-wise and frame-wise action recognition. A point cloud deep-learning paradigm is introduced to the action recognition, and a unified framework along with a novel deep neural network architecture called Structured Keypoint Pooling is proposed. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure (which is inherent in skeletons), such as the instances and frames to which each keypoint belongs, and achieves robustness against input errors. Its less constrained and tracking-free architecture enables time-series keypoints consisting of human skeletons and nonhuman object contours to be efficiently treated as an input 3D point cloud and extends the variety of the targeted action. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels. This trick enables our training scheme to naturally introduce novel data augmentation, which mixes multiple point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods.
https://arxiv.org/abs/2303.15270
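A rough sketch of structured pooling over a keypoint point cloud: keypoint features are aggregated according to the frame (or instance) each point belongs to, a scatter-style mean pooling that is tolerant of a few erroneous points. This is a generic illustration under assumed names, not the paper's exact cascaded design.

```python
import torch

def pool_by_index(point_feats, group_ids, num_groups):
    """point_feats: (N, D) features of keypoints in the input 3D point cloud;
    group_ids: (N,) long tensor with the frame or instance index of each keypoint."""
    d = point_feats.size(1)
    sums = torch.zeros(num_groups, d).index_add_(0, group_ids, point_feats)
    counts = torch.zeros(num_groups).index_add_(0, group_ids, torch.ones(group_ids.size(0)))
    return sums / counts.clamp(min=1).unsqueeze(1)   # (num_groups, D) mean-pooled features
```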
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos, only taking video-level labels as the supervised information. Pseudo label generation is a promising strategy to solve the challenging problem, but most existing methods are limited to employing snippet-wise classification results to guide the generation, and they ignore that the natural temporal structure of the video can also provide rich information to assist such a generation process. In this paper, we propose a novel weakly-supervised temporal action localization method by inferring snippet-feature affinity. First, we design an affinity inference module that exploits the affinity relationship between temporal neighbor snippets to generate initial coarse pseudo labels. Then, we introduce an information interaction module that refines the coarse labels by enhancing the discriminative nature of snippet-features through exploring intra- and inter-video relationships. Finally, the high-fidelity pseudo labels generated from the information interaction module are used to supervise the training of the action localization network. Extensive experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet v1.3, demonstrate our proposed method achieves significant improvements compared to the state-of-the-art methods.
https://arxiv.org/abs/2303.12332
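An illustrative sketch of affinity-based refinement: cosine affinities between temporally neighboring snippets propagate snippet classification scores into smoother coarse pseudo labels. Shapes, the window size, and the softmax weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def affinity_refine(snippet_feats, snippet_scores, window=2):
    """snippet_feats: (T, D); snippet_scores: (T, C) snippet-wise classification scores."""
    T = snippet_feats.size(0)
    feats = F.normalize(snippet_feats, dim=-1)
    affinity = feats @ feats.t()                          # (T, T) cosine affinities
    idx = torch.arange(T)
    mask = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window  # keep temporal neighbors only
    affinity = affinity.masked_fill(~mask, float('-inf'))
    weights = affinity.softmax(dim=-1)                    # (T, T) normalized neighbor weights
    return weights @ snippet_scores                       # (T, C) propagated coarse pseudo labels
```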
In this paper, we consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenarios, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, even categories not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposals, followed by open-vocabulary classification. We make the following contributions. First, to compensate image-text foundation models with temporal motions, we improve category-agnostic action proposals by explicitly aligning embeddings of optical flow, RGB and text, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., avoiding lexical ambiguities. To be specific, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large-scale language models) or with visually-conditioned instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, which outperforms existing state-of-the-art approaches by a significant margin.
https://arxiv.org/abs/2303.11732
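A minimal sketch of building open-vocabulary classifiers by prompting a pre-trained CLIP text encoder with detailed action descriptions. It assumes the open-source `clip` package is installed; the two descriptions are made-up examples, and the visually-conditioned prompt variant is not shown.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

descriptions = [
    "a person sprints, plants a pole and vaults over a high bar",   # e.g. pole vault
    "a person swings a racket to hit a ball over a net",            # e.g. tennis swing
]
with torch.no_grad():
    tokens = clip.tokenize(descriptions).to(device)
    classifiers = model.encode_text(tokens)                          # (num_classes, D)
    classifiers = classifiers / classifiers.norm(dim=-1, keepdim=True)
# Proposal features projected into the same space can then be classified by cosine
# similarity against these text-derived classifiers.
```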
Temporal Action Localization (TAL) is a challenging task in video understanding that aims to identify and localize actions within a video sequence. Recent studies have emphasized the importance of applying long-term temporal context modeling (TCM) blocks to the extracted video clip features such as employing complex self-attention mechanisms. In this paper, we present the simplest method ever to address this task and argue that the extracted video clip features are already informative to achieve outstanding performance without sophisticated architectures. To this end, we introduce TemporalMaxer, which minimizes long-term temporal context modeling while maximizing information from the extracted video clip features with a basic, parameter-free, and local region operating max-pooling block. Picking out only the most critical information for adjacent and local clip embeddings, this block results in a more efficient TAL model. We demonstrate that TemporalMaxer outperforms other state-of-the-art methods that utilize long-term TCM such as self-attention on various TAL datasets while requiring significantly fewer parameters and computational resources. The code for our approach is publicly available at this https URL
https://arxiv.org/abs/2303.09055
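A sketch of the core idea above: replace long-term temporal context modeling with a parameter-free local max-pooling over the pre-extracted clip features. Shapes and the kernel size are assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_max_block(clip_feats, kernel_size=3):
    """clip_feats: (B, C, T) pre-extracted clip features."""
    # Keep only the most salient value within each local neighborhood of clips.
    return F.max_pool1d(clip_feats, kernel_size=kernel_size, stride=1,
                        padding=kernel_size // 2)

out = temporal_max_block(torch.randn(2, 512, 96))
print(out.shape)  # torch.Size([2, 512, 96])
```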
Temporal action localization (TAL) is a prevailing task due to its great application potential. Existing works in this field mainly suffer from two weaknesses: (1) they often neglect the multi-label case and only focus on temporal modeling; (2) they ignore the semantic information in class labels and only use the visual information. To solve these problems, we propose a novel Co-Occurrence Relation Module (CORM) that explicitly models the co-occurrence relationship between actions. Besides the visual information, it further utilizes the semantic embeddings of class labels to model the co-occurrence relationship. The CORM works in a plug-and-play manner and can be easily incorporated into existing sequence models. By considering both visual and semantic co-occurrence, our method achieves high multi-label relationship modeling capacity. Meanwhile, existing TAL datasets always focus on low-semantic atomic actions. Thus we construct a challenging multi-label dataset, UCF-Crime-TAL, that focuses on high-semantic actions by annotating the UCF-Crime dataset at the frame level and considering the semantic overlap of different events. Extensive experiments on two commonly used TAL datasets, i.e., MultiTHUMOS and TSU, and our newly proposed UCF-Crime-TAL demonstrate the effectiveness of the proposed CORM, which achieves state-of-the-art performance on these datasets.
https://arxiv.org/abs/2303.08463
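A hedged sketch of modeling action co-occurrence with label semantics: class-label embeddings define a relation matrix that lets frequently related classes reinforce each other's per-frame scores. This is an illustrative simplification under assumed names, not the exact CORM formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoOccurrenceRelation(nn.Module):
    def __init__(self, label_embs):                 # label_embs: (C, E) semantic embeddings of class labels
        super().__init__()
        self.label_embs = nn.Parameter(label_embs.clone(), requires_grad=False)
        self.proj = nn.Linear(label_embs.size(1), label_embs.size(1))

    def forward(self, logits):                      # logits: (T, C) per-frame multi-label scores
        e = self.proj(self.label_embs)
        relation = F.softmax(e @ e.t(), dim=-1)     # (C, C) semantic co-occurrence relation
        return logits + logits @ relation.t()       # propagate scores between related classes
```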
Temporal action localization in videos presents significant challenges in the field of computer vision. While the boundary-sensitive method has been widely adopted, its limitations include incomplete use of intermediate and global information, as well as an inefficient proposal feature generator. To address these challenges, we propose a novel framework, Sparse Multilevel Boundary Generator (SMBG), which enhances the boundary-sensitive method with boundary classification and action completeness regression. SMBG features a multi-level boundary module that enables faster processing by gathering boundary information at different lengths. Additionally, we introduce a sparse extraction confidence head that distinguishes information inside and outside the action, further optimizing the proposal feature generator. To improve the synergy between multiple branches and balance positive and negative samples, we propose a global guidance loss. Our method is evaluated on two popular benchmarks, ActivityNet-1.3 and THUMOS14, and is shown to achieve state-of-the-art performance with better inference speed (2.47x faster than BSN++, 2.12x faster than DBG). These results demonstrate that SMBG provides a more efficient and simple solution for generating temporal action proposals. Our proposed framework has the potential to advance the field of computer vision and enhance the accuracy and speed of temporal action localization in video analysis. The code and models are made available at this https URL.
https://arxiv.org/abs/2303.03166
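An illustrative sketch of a multi-level boundary module: boundary evidence is gathered at several temporal lengths via pooling with different kernel sizes and fused into start/end scores. The design, kernel set, and names are assumptions, not the SMBG code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelBoundary(nn.Module):
    def __init__(self, dim, kernels=(1, 3, 5, 7)):
        super().__init__()
        self.kernels = kernels
        self.fuse = nn.Conv1d(dim * len(kernels), 2, kernel_size=1)  # start/end boundary scores

    def forward(self, feats):            # feats: (B, D, T)
        levels = [F.max_pool1d(feats, k, stride=1, padding=k // 2) for k in self.kernels]
        return self.fuse(torch.cat(levels, dim=1))   # (B, 2, T) boundary scores

scores = MultiLevelBoundary(256)(torch.randn(1, 256, 100))
print(scores.shape)  # torch.Size([1, 2, 100])
```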
Video-Language Pre-training models have recently significantly improved various multi-modal downstream tasks. Previous dominant works mainly adopt contrastive learning to achieve global feature alignment across modalities. However, the local associations between videos and texts are not modeled, restricting the pre-training models' generality, especially for tasks requiring the temporal video boundary for certain query texts. This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment such that the trained model can accurately perceive temporal boundaries in videos given the text description. Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description, and text localization which matches the subset of texts with the video features. To produce temporal boundaries, frame features in several videos are manually merged into a long video sequence that interacts with a text sequence. With the localization task, our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality. Notably, comprehensive experimental results show that our method significantly improves the state-of-the-art performance on various benchmarks, covering text-to-video retrieval, video question answering, video captioning, temporal action localization and temporal moment retrieval. The code will be released soon.
https://arxiv.org/abs/2301.07463
We present Ego-Only, the first training pipeline that enables state-of-the-art action detection on egocentric (first-person) videos without any form of exocentric (third-person) pretraining. Previous approaches found that egocentric models cannot be trained effectively from scratch and that exocentric representations transfer well to first-person videos. In this paper we revisit these two observations. Motivated by the large content and appearance gap separating the two domains, we propose a strategy that enables effective training of egocentric models without exocentric pretraining. Our Ego-Only pipeline is simple. It trains the video representation with a masked autoencoder finetuned for temporal segmentation. The learned features are then fed to an off-the-shelf temporal action localization method to detect actions. We evaluate our approach on two established egocentric video datasets: Ego4D and EPIC-Kitchens-100. On Ego4D, our Ego-Only is on-par with exocentric pretraining methods that use an order of magnitude more labels. On EPIC-Kitchens-100, our Ego-Only even outperforms exocentric pretraining (by 2.1% on verbs and by 1.8% on nouns), setting a new state-of-the-art.
https://arxiv.org/abs/2301.01380
Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods widely adopt the off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives between classification and localization, make temporally localized results suffer from the serious incomplete issue. To tackle this issue without additional annotations, this paper considers to distill free action knowledge from Vision-Language Pre-training (VLP), since we surprisingly observe that the localization results of vanilla VLP have an over-complete issue, which is just complementary to the CBP results. To fuse such complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill the confident background pseudo-labels from the CBP branch; while during the F step, the confident foreground pseudo-labels are distilled from the VLP branch. And as a result, the dual-branch complementarity is effectively fused to promote a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 reveal that our method significantly outperforms state-of-the-art methods.
https://arxiv.org/abs/2212.09335
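A rough sketch of the alternate-training idea above: in the B step, confident background pseudo-labels are distilled from the CBP branch, and in the F step, confident foreground pseudo-labels are distilled from the VLP branch. Thresholding by confidence and the function names are assumptions for illustration.

```python
import torch

def confident_pseudo_labels(scores, thresh, foreground=True):
    """scores: (T,) per-snippet foreground probability from one branch."""
    if foreground:
        return (scores > thresh).float()        # confident foreground snippets
    return (scores < 1.0 - thresh).float()      # confident background snippets

def alternate_step(cbp_scores, vlp_scores, step, thresh=0.8):
    if step == "B":   # background labels from the CBP branch supervise the VLP branch
        return confident_pseudo_labels(cbp_scores, thresh, foreground=False)
    else:             # "F": foreground labels from the VLP branch supervise the CBP branch
        return confident_pseudo_labels(vlp_scores, thresh, foreground=True)
```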
To balance the annotation labor and the granularity of supervision, single-frame annotation has been introduced in temporal action localization. It provides a rough temporal location for an action but implicitly overstates the supervision from the annotated frame during training, leading to confusion between actions and backgrounds, i.e., action incompleteness and background false positives. To tackle the two challenges, in this work, we present the Snippet Classification model and the Dilation-Erosion module. In the Dilation-Erosion module, we expand the potential action segments with a loose criterion to alleviate the problem of action incompleteness and then remove the background from the potential action segments to alleviate the problem of background false positives. Relying on the single-frame annotation and the output of the snippet classification, the Dilation-Erosion module mines pseudo snippet-level ground truth, hard backgrounds and evident backgrounds, which in turn further train the Snippet Classification model. This forms a cyclic dependency. Furthermore, we propose a new embedding loss to aggregate the features of action instances with the same label and separate the features of actions from backgrounds. Experiments on THUMOS14 and ActivityNet 1.2 validate the effectiveness of the proposed method. Code has been made publicly available (this https URL).
https://arxiv.org/abs/2212.06348
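A sketch of the dilation/erosion intuition on a 1D snippet-score sequence: dilation loosely expands potential action segments, and erosion shrinks them to drop background. This is a generic morphological view using max pooling, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def dilate(scores, k=5):          # scores: (B, 1, T) snippet action scores
    return F.max_pool1d(scores, kernel_size=k, stride=1, padding=k // 2)

def erode(scores, k=5):
    return -F.max_pool1d(-scores, kernel_size=k, stride=1, padding=k // 2)

x = torch.rand(1, 1, 50)
expanded = dilate(x)      # loosely expands segments, alleviating action incompleteness
shrunk = erode(expanded)  # removes background at the borders, alleviating false positives
```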
Temporal action localization (TAL) requires long-form reasoning to predict actions of various lengths and complex content. Given limited GPU memory, training TAL end-to-end on such long-form videos (i.e., from videos to predictions) is a significant challenge. Most methods can only train on pre-extracted features without optimizing them for the localization problem, consequently limiting localization performance. In this work, to extend the potential of TAL networks, we propose a novel end-to-end method, Re2TAL, which rewires pretrained video backbones for reversible TAL. Re2TAL builds a backbone with reversible modules, where the input can be recovered from the output, so that the bulky intermediate activations can be cleared from memory during training. Instead of designing one single type of reversible module, we propose a network rewiring mechanism to transform any module with a residual connection into a reversible module without changing any parameters. This provides two benefits: (1) a large variety of reversible networks are easily obtained from existing and even future model designs, and (2) the reversible models require much less training effort as they reuse the pre-trained parameters of their original non-reversible versions. Re2TAL reaches 37.01% average mAP, a new state-of-the-art record on ActivityNet-v1.3, and 64.9% mAP at tIoU=0.5 on THUMOS-14, without using optical flow.
https://arxiv.org/abs/2211.14053
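A minimal sketch of a reversible block in the spirit of rewiring residual modules: inputs can be recomputed exactly from outputs, so intermediate activations need not be stored during training. This is the generic reversible-residual pattern under assumed names, not the Re2TAL code.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g          # e.g. two residual sub-modules reusing pretrained weights

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):         # exact reconstruction of the inputs
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

blk = ReversibleBlock(nn.Linear(64, 64), nn.Linear(64, 64))
x1, x2 = torch.randn(4, 64), torch.randn(4, 64)
y1, y2 = blk(x1, x2)
r1, r2 = blk.inverse(y1, y2)
print(torch.allclose(r1, x1), torch.allclose(r2, x2))  # True True
```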