
Cross-Stream Contrastive Learning for Self-Supervised Skeleton-Based Action Recognition

2023-05-03 10:31:35
Ding Li, Yongqiang Tang, Zhizhong Zhang, Wensheng Zhang

Abstract

Self-supervised skeleton-based action recognition has grown rapidly alongside the development of contrastive learning. Existing methods rely on imposing invariance to augmentations of the 3D skeleton within a single data stream, which leverages only easy positive pairs and limits the ability to explore complicated movement patterns. In this paper, we argue that the defect of single-stream contrast and the lack of necessary feature transformation are responsible for these easy positives, and therefore propose a Cross-Stream Contrastive Learning framework for skeleton-based action Representation learning (CSCLR). Specifically, the proposed CSCLR not only utilizes intra-stream contrast pairs but also introduces inter-stream contrast pairs as hard samples to enable better representation learning. Moreover, to further exploit the potential of positive pairs and increase the robustness of self-supervised representation learning, we propose a Positive Feature Transformation (PFT) strategy that adopts feature-level manipulation to increase the variance of positive pairs. To validate the effectiveness of our method, we conduct extensive experiments on three benchmark datasets: NTU-RGB+D 60, NTU-RGB+D 120, and PKU-MMD. Experimental results show that the proposed CSCLR exceeds state-of-the-art methods under a diverse range of evaluation protocols.
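The abstract names two mechanisms, cross-stream contrast and Positive Feature Transformation, without giving implementation details, so the sketch below is only an illustration of how such losses are commonly built: it assumes a MoCo-style setup with two skeleton streams (e.g., joint and motion), a per-stream negative queue, and an InfoNCE objective. All names here (info_nce, cross_stream_loss, positive_feature_transform, the lam coefficient) are hypothetical and not taken from the paper.

import torch
import torch.nn.functional as F

def info_nce(query, key, queue, temperature=0.07):
    """InfoNCE loss: each query has one positive (key) and K queued negatives."""
    query = F.normalize(query, dim=1)
    key = F.normalize(key, dim=1)
    queue = F.normalize(queue, dim=1)
    pos = torch.einsum("nc,nc->n", query, key).unsqueeze(1)  # (N, 1)
    neg = torch.einsum("nc,kc->nk", query, queue)            # (N, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    # The positive sits at column 0 of the logits, so the target label is 0.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

def positive_feature_transform(z_query, z_pos, lam=0.2):
    """Hypothetical PFT: perturb the positive in feature space (here, by
    extrapolating it away from the query) so positive pairs show more variance.
    The paper's actual feature-level manipulation may differ."""
    return (1.0 + lam) * z_pos - lam * z_query

def cross_stream_loss(zj_q, zj_k, zm_q, zm_k, queue_j, queue_m):
    """Combine intra-stream and inter-stream contrast for two streams
    (j = joint, m = motion); each z* is an (N, C) batch of embeddings of
    two augmented views of the same clips."""
    # Intra-stream pairs: query and positive come from the same stream.
    intra = (info_nce(zj_q, positive_feature_transform(zj_q, zj_k), queue_j)
             + info_nce(zm_q, positive_feature_transform(zm_q, zm_k), queue_m))
    # Inter-stream pairs: the positive is the other stream's view of the same
    # clip, which acts as a harder positive than a same-stream view.
    inter = (info_nce(zj_q, zm_k, queue_m)
             + info_nce(zm_q, zj_k, queue_j))
    return intra + inter

# Toy usage with random embeddings (N=8 clips, C=128 dims, K=1024 negatives):
N, C, K = 8, 128, 1024
loss = cross_stream_loss(torch.randn(N, C), torch.randn(N, C),
                         torch.randn(N, C), torch.randn(N, C),
                         torch.randn(K, C), torch.randn(K, C))

The intuition the sketch tries to capture is the one stated in the abstract: same-stream augmented views are easy positives, while views drawn from a different data stream (and positives perturbed at the feature level) force the encoder to model the shared movement pattern rather than low-level stream statistics.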

URL

https://arxiv.org/abs/2305.02324

PDF

https://arxiv.org/pdf/2305.02324.pdf

