
POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization

2023-10-20 15:28:06
Elahe Vahdani, Yingli Tian

Abstract

This paper tackles point-supervised temporal action detection, in which only a single frame is annotated for each action instance in the training set. Most current methods, hindered by the sparsity of the annotated points, struggle to represent the continuous structure of actions or the inherent temporal and semantic dependencies within action instances. Consequently, these methods frequently learn only the most distinctive segments of actions and produce incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised action localization that uses only point-level annotation. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model begins by generating action proposals with point-level supervision alone. These proposals undergo refinement and regression to improve the precision of the estimated action boundaries, yielding "pseudo-labels" that serve as supplementary supervisory signals. The model architecture integrates a transformer with a temporal feature pyramid to capture dependencies between video snippets and to model actions of varying duration. The pseudo-labels, which provide information about the coarse locations and boundaries of actions, guide the transformer toward better learning of action dynamics. POTLoc outperforms the state-of-the-art point-supervised methods on the THUMOS'14 and ActivityNet-v1.2 datasets, with an improvement of 5% average mAP on the former.
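
To make the idea of turning point annotations into pseudo-labels more concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract does not specify how proposals are generated or refined, so this example assumes a simple stand-in in which per-snippet action scores are already available and each annotated point is expanded outward while the neighboring scores stay above a threshold. The function name `expand_point_to_segment`, the `threshold` parameter, and the example scores are all hypothetical.

```python
from typing import List, Tuple


def expand_point_to_segment(
    scores: List[float], point: int, threshold: float = 0.5
) -> Tuple[int, int]:
    """Grow an interval around an annotated snippet index.

    Illustrative assumption only: the interval is extended left and right
    while the neighboring snippet's action score is >= threshold.
    Returns inclusive (start, end) snippet indices.
    """
    if not 0 <= point < len(scores):
        raise IndexError("point must index into scores")
    start = end = point
    # Extend toward earlier snippets while they still look like action.
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    # Extend toward later snippets while they still look like action.
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1
    return start, end


# Hypothetical per-snippet action scores and two annotated points.
scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1, 0.6, 0.7, 0.2]
points = [3, 8]
pseudo_labels = [expand_point_to_segment(scores, p) for p in points]
print(pseudo_labels)  # [(2, 4), (7, 8)]
```

In a self-training setup of the kind the abstract describes, coarse intervals like these would then serve as additional supervision for a second training stage; the refinement and regression steps mentioned in the abstract are not modeled here.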

URL

https://arxiv.org/abs/2310.13585

PDF

https://arxiv.org/pdf/2310.13585.pdf

