Paper Reading AI Learner

Contrastive Predictive Autoencoders for Dynamic Point Cloud Self-Supervised Learning

2023-05-22 12:09:51
Xiaoxiao Sheng, Zhiqiang Shen, Gang Xiao

Abstract

We present a new self-supervised paradigm for point cloud sequence understanding. Inspired by discriminative and generative self-supervised methods, we design two tasks, namely point cloud sequence-based Contrastive Prediction and Reconstruction (CPR), to collaboratively learn more comprehensive spatiotemporal representations. Specifically, dense point cloud segments are first fed into an encoder to extract embeddings. All embeddings except the last are then aggregated by a context-aware autoregressor to make predictions for the last, target segment. To model multi-granularity structures, local and global contrastive learning are performed between the predictions and the target. To further improve the generalization of the representations, the predictions are also used by a decoder to reconstruct the raw point cloud sequences, where point cloud colorization is employed to discriminate between different frames. By combining the classic contrastive and reconstruction paradigms, the learned representations acquire both global discrimination and local perception. We conduct experiments on four point cloud sequence benchmarks and report results on action recognition and gesture recognition under multiple experimental settings. The performance is comparable with supervised methods and shows strong transferability.
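The abstract describes the CPR pipeline only at a high level. Below is a minimal PyTorch sketch of that training objective, assuming a PointNet-style segment encoder, a GRU as the context-aware autoregressor, an InfoNCE loss for the global contrastive term (the local term is omitted), and a Chamfer-plus-color reconstruction loss. All module choices, dimensions, loss weights, and the colorization scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal, illustrative sketch of the CPR objective described in the abstract.
# Every architecture and hyperparameter below is an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentEncoder(nn.Module):
    """Embed one point cloud segment (T frames x N points x 3) into a vector (assumed PointNet-style)."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, seg):                      # seg: (B, T, N, 3)
        feat = self.mlp(seg)                     # (B, T, N, dim)
        return feat.amax(dim=(1, 2))             # max-pool over frames and points -> (B, dim)


class CPRSketch(nn.Module):
    def __init__(self, dim=256, frames=4, points=256):
        super().__init__()
        self.encoder = SegmentEncoder(dim)
        self.autoregressor = nn.GRU(dim, dim, batch_first=True)   # context aggregation (assumed GRU)
        self.decoder = nn.Linear(dim, frames * points * 6)        # xyz + rgb per point (colorization)
        self.frames, self.points = frames, points

    def forward(self, segments):                 # segments: (B, S, T, N, 3)
        B, S = segments.shape[:2]
        emb = torch.stack([self.encoder(segments[:, s]) for s in range(S)], dim=1)  # (B, S, dim)
        context, _ = self.autoregressor(emb[:, :-1])   # aggregate all segments but the last
        pred = context[:, -1]                          # prediction for the target segment
        target = emb[:, -1]                            # embedding of the last (target) segment
        recon = self.decoder(pred).view(B, self.frames, self.points, 6)
        return pred, target, recon


def info_nce(pred, target, tau=0.07):
    """Global contrastive loss: each prediction should match its own target segment."""
    logits = F.normalize(pred, dim=-1) @ F.normalize(target, dim=-1).t() / tau
    return F.cross_entropy(logits, torch.arange(pred.size(0), device=pred.device))


def chamfer(recon_xyz, gt_xyz):
    """Symmetric Chamfer distance between reconstructed and raw points."""
    d = torch.cdist(recon_xyz, gt_xyz)           # (B*T, N, N)
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()


# Toy usage with random data: 8 sequences, 4 segments of 4 frames x 256 points each.
model = CPRSketch()
segments = torch.randn(8, 4, 4, 256, 3)
pred, target, recon = model(segments)

# Frame colorization target: tag every point of frame t with a distinct color so the
# decoder must keep frames apart (a simple stand-in for the paper's colorization).
colors = torch.eye(4)[:, :3].view(1, 4, 1, 3).expand(8, 4, 256, 3)
gt = torch.cat([segments[:, -1], colors], dim=-1)     # (B, T, N, 6)

loss = (info_nce(pred, target)
        + chamfer(recon[..., :3].flatten(0, 1), gt[..., :3].flatten(0, 1))
        + F.mse_loss(recon[..., 3:], gt[..., 3:]))
loss.backward()
```

For brevity the sketch reconstructs only the target segment; in the paper the reconstruction covers the raw point cloud sequence, and a local contrastive term over region-level features is also applied.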


URL

https://arxiv.org/abs/2305.12959

PDF

https://arxiv.org/pdf/2305.12959.pdf

