Paper Reading AI Learner

Learning Discriminative Spatio-temporal Representations for Semi-supervised Action Recognition

2024-04-25 08:49:08
Yu Wang, Sanping Zhou, Kun Xia, Le Wang

Abstract

Semi-supervised action recognition aims to improve spatio-temporal reasoning ability using a small amount of labeled data in conjunction with a large amount of unlabeled data. Despite recent advances, existing powerful methods are still prone to making ambiguous predictions when labeled data are scarce, manifested as a limited ability to distinguish different actions with similar spatio-temporal information. In this paper, we approach this problem by equipping the model with two capabilities, namely discriminative spatial modeling and temporal structure modeling, for learning discriminative spatio-temporal representations. Specifically, we propose an Adaptive Contrastive Learning (ACL) strategy. It assesses the confidence of all unlabeled samples against the class prototypes of the labeled data, and adaptively selects positive and negative samples from a pseudo-labeled sample bank to construct a contrastive objective. Additionally, we introduce a Multi-scale Temporal Learning (MTL) strategy. It highlights informative semantics from long-term clips and integrates them into the short-term clip while suppressing noisy information. Both techniques are then integrated into a unified framework to encourage the model to make accurate predictions. Extensive experiments on UCF101, HMDB51, and Kinetics400 show the superiority of our method over prior state-of-the-art approaches.
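
The ACL idea described in the abstract (prototype-based confidence estimation for unlabeled clips, followed by positive/negative selection from a pseudo-labeled bank) could be sketched roughly as below. This is a minimal sketch under assumed details, not the authors' implementation: the cosine-similarity scoring, the fixed confidence threshold, and an InfoNCE-style loss over the bank are all illustrative choices, and every function name and hyperparameter (tau, conf_threshold) is hypothetical.

    # Minimal sketch of prototype-guided contrastive selection (illustrative only).
    import torch
    import torch.nn.functional as F

    def class_prototypes(labeled_feats, labels, num_classes):
        """Mean L2-normalized feature per class, computed from the labeled set.
        Assumes every class has at least one labeled sample."""
        protos = labeled_feats.new_zeros(num_classes, labeled_feats.size(1))
        for c in range(num_classes):
            protos[c] = labeled_feats[labels == c].mean(dim=0)
        return F.normalize(protos, dim=1)

    def adaptive_contrastive_loss(unlabeled_feats, bank_feats, bank_labels,
                                  protos, tau=0.1, conf_threshold=0.8):
        """Score unlabeled clips against class prototypes, keep confident ones,
        and contrast them with same-/different-class entries of a pseudo-labeled bank."""
        q = F.normalize(unlabeled_feats, dim=1)                    # (B, D)
        sims = q @ protos.t()                                      # (B, C) cosine similarity to prototypes
        conf, pseudo = F.softmax(sims / tau, dim=1).max(dim=1)     # confidence and pseudo-label per clip
        keep = conf > conf_threshold                               # only confident clips enter the loss
        if keep.sum() == 0:
            return q.new_tensor(0.0)

        bank = F.normalize(bank_feats, dim=1)                      # (M, D) pseudo-labeled sample bank
        logits = q[keep] @ bank.t() / tau                          # (B', M)
        pos_mask = pseudo[keep].unsqueeze(1) == bank_labels.unsqueeze(0)  # same pseudo-class -> positive

        # InfoNCE-style objective: log-probability mass assigned to positives vs. the whole bank.
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        pos_log_prob = (log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
        return -pos_log_prob.mean()

The MTL strategy is not sketched here; per the abstract it would fuse features of long-term clips into the short-term clip's representation while down-weighting noisy content, which the paper should be consulted for in detail.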

URL

https://arxiv.org/abs/2404.16416

PDF

https://arxiv.org/pdf/2404.16416.pdf

