Unsupervised Temporal Action Localization via Self-paced Incremental Learning

2023-12-12 16:00:55
Haoyu Tang, Han Jiang, Mingzhu Xu, Yupeng Hu, Jihua Zhu, Liqiang Nie

Abstract

Recently, temporal action localization (TAL) has garnered significant interest in the information retrieval community. However, existing supervised and weakly supervised methods depend heavily on extensive labeled temporal boundaries and action categories, which are labor-intensive and time-consuming to annotate. Although some unsupervised methods have used the "iterative clustering and localization" paradigm for TAL, they still suffer from two pivotal impediments: 1) unsatisfactory video clustering confidence, and 2) unreliable video pseudolabels for model training. To address these limitations, we present a novel self-paced incremental learning model that enhances clustering and localization training simultaneously, thereby facilitating more effective unsupervised TAL. Concretely, we improve the clustering confidence by exploring contextual, feature-robust visual information. Thereafter, we design two incremental instance learning strategies (constant-speed and variable-speed) for easy-to-hard model training, thus ensuring the reliability of the video pseudolabels and further improving overall localization performance. Extensive experiments on two public datasets substantiate the superiority of our model over several state-of-the-art competitors.
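
The easy-to-hard idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the schedules, so the linear ("constant-speed") and quadratic ("variable-speed") growth rules, the starting ratio of 0.2, and the function names `selection_ratio` and `select_easy_samples` are all assumptions made for illustration. The sketch only shows how a growing fraction of the most confident pseudolabeled videos might be admitted to training round by round.

```python
def selection_ratio(round_idx, total_rounds, start=0.2, mode="constant"):
    """Fraction of pseudolabeled videos admitted at a given round.

    Hypothetical schedules (not taken from the paper):
    "constant" grows linearly, "variable" grows quadratically.
    """
    if total_rounds < 2:
        return 1.0
    t = round_idx / (total_rounds - 1)  # training progress in [0, 1]
    if mode == "constant":
        growth = t
    elif mode == "variable":
        growth = t ** 2  # assumed: slow start, faster admission later
    else:
        raise ValueError(f"unknown mode: {mode}")
    return start + (1.0 - start) * growth


def select_easy_samples(confidences, ratio):
    """Return sorted indices of the most confident samples.

    Keeps round(ratio * n) samples, clamped to the range [1, n].
    """
    n = len(confidences)
    k = max(1, min(n, round(ratio * n)))
    # Rank samples from most to least confident, then keep the top k.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    # Toy clustering confidences for five videos (made-up values).
    conf = [0.9, 0.1, 0.5, 0.7, 0.3]
    for r in range(5):
        ratio = selection_ratio(r, 5, mode="variable")
        print(r, select_easy_samples(conf, ratio))
```

Under these assumptions, early rounds train only on the videos whose pseudolabels are most trustworthy, and harder, lower-confidence videos are added as training progresses. Both schedules admit every video by the final round.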

URL

https://arxiv.org/abs/2312.07384

PDF

https://arxiv.org/pdf/2312.07384.pdf

