POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization

Abstract

This paper tackles point-supervised temporal action detection, in which only a single frame is annotated for each action instance in the training set. Most current methods, hindered by the sparsity of the annotated points, struggle to represent the continuous structure of actions or the temporal and semantic dependencies within action instances. Consequently, they often learn only the most distinctive segments of actions and produce incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action Localization that uses only point-level annotation. POTLoc is designed to identify and track continuous action structures through a self-training strategy. The base model first generates action proposals using point-level supervision alone. These proposals are then refined and regressed to sharpen the estimated action boundaries, yielding "pseudo-labels" that serve as supplementary supervisory signals. The model architecture combines a transformer with a temporal feature pyramid to capture dependencies between video snippets and to model actions of varying duration. The pseudo-labels, which provide the coarse locations and boundaries of actions, guide the transformer toward better learning of action dynamics. POTLoc outperforms state-of-the-art point-supervised methods on the THUMOS'14 and ActivityNet-v1.2 datasets, improving average mAP by 5% on THUMOS'14.
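
To make the idea of turning point annotations into segment-level pseudo-labels more concrete, here is a minimal sketch. It is not the POTLoc procedure: the abstract does not specify how proposals are generated, refined, or regressed. The sketch assumes a hypothetical list of per-snippet action scores from some base model and simply grows a segment outward from each annotated snippet while the score stays above a threshold. The function names, the threshold value, and the score format are all assumptions made for illustration.

```python
from typing import List, Tuple


def expand_point_to_segment(
    scores: List[float], point: int, threshold: float = 0.5
) -> Tuple[int, int]:
    """Grow a segment around an annotated snippet while scores stay high.

    Hypothetical helper, not taken from the paper. Returns inclusive
    (start, end) snippet indices that always contain `point`.
    """
    if not 0 <= point < len(scores):
        raise ValueError("point must index into scores")
    start = end = point
    # Extend to the left while the neighbouring snippet scores high enough.
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    # Extend to the right in the same way.
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1
    return start, end


def make_pseudo_labels(
    scores: List[float], points: List[int], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Build one pseudo-label segment per annotated point (assumed scheme)."""
    return [expand_point_to_segment(scores, p, threshold) for p in points]


# Example: eight snippets, two annotated points (indices 2 and 6).
example_scores = [0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.8, 0.3]
example_labels = make_pseudo_labels(example_scores, [2, 6])
print(example_labels)
```

In a self-training setup of the kind the abstract describes, segments like these would then act as extra supervision for a second training stage. How POTLoc actually refines the boundaries is described in the paper itself, not here.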

URL

https://arxiv.org/abs/2310.13585

PDF

https://arxiv.org/pdf/2310.13585.pdf
