Paper Reading AI Learner

Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos

2025-04-07 08:47:36
Zhi Zuo, Chenyi Zhuang, Zhiqiang Shen, Pan Gao, Jie Qin

Abstract

Point cloud video representation learning is primarily built upon masking strategies in a self-supervised manner. However, progress has been slow due to several significant challenges: (1) existing methods learn motion through hand-crafted designs, yielding unsatisfactory motion patterns during pre-training that do not transfer to fine-tuning scenarios; (2) previous Masked AutoEncoder (MAE) frameworks are limited in resolving the huge representation gap inherent in 4D data. In this study, we introduce the first self-disentangled MAE for learning discriminative 4D representations in the pre-training stage. To address the first challenge, we propose to model the motion representation in a latent space. The second issue is resolved by introducing latent tokens alongside the typical geometry tokens to disentangle high-level and low-level features during decoding. Extensive experiments on MSR-Action3D, NTU-RGBD, HOI4D, NvGesture, and SHREC'17 verify this self-disentangled learning framework, which we term Uni4D. We demonstrate that it boosts fine-tuning performance on all 4D tasks. Our pre-trained model yields discriminative and meaningful 4D representations and particularly benefits the processing of long videos: Uni4D achieves $+3.8\%$ segmentation accuracy on HOI4D, significantly outperforming both self-supervised and fully-supervised methods after end-to-end fine-tuning.
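To make the decoding idea concrete, the sketch below shows one way learnable latent tokens can be decoded jointly with geometry tokens so that high-level (motion) and low-level (geometry) features are separated. This is a minimal illustrative interpretation of the abstract, not the authors' released code; all module names, dimensions, and heads are assumptions.

```python
# Hypothetical sketch of a self-disentangled MAE decoder: learnable latent
# tokens (high-level / motion) are decoded jointly with geometry tokens
# (low-level / reconstruction). Sizes and heads are illustrative guesses.
import torch
import torch.nn as nn


class DisentangledMAEDecoder(nn.Module):
    def __init__(self, dim=384, num_latent=16, depth=4, num_heads=6,
                 points_per_token=32):
        super().__init__()
        # Learnable latent tokens meant to absorb high-level information,
        # decoded alongside the low-level geometry tokens.
        self.latent_tokens = nn.Parameter(torch.zeros(1, num_latent, dim))
        nn.init.trunc_normal_(self.latent_tokens, std=0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Low-level head: predict xyz coordinates of the masked point patch.
        self.geometry_head = nn.Linear(dim, 3 * points_per_token)

    def forward(self, geometry_tokens):
        # geometry_tokens: (B, N, dim), visible tokens plus mask tokens.
        B = geometry_tokens.shape[0]
        latent = self.latent_tokens.expand(B, -1, -1)
        x = torch.cat([latent, geometry_tokens], dim=1)
        x = self.blocks(x)
        n_latent = latent.shape[1]
        latent_out, geom_out = x[:, :n_latent], x[:, n_latent:]
        # Reconstruction branch (standard MAE-style target on masked patches);
        # latent_out would carry the motion supervision in latent space.
        return self.geometry_head(geom_out), latent_out
```

In a full pipeline, the geometry branch would be trained with a reconstruction loss (e.g., Chamfer distance) on masked point patches, while the latent branch would be supervised in latent space as the abstract describes for motion; the exact losses are assumptions here.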

URL

https://arxiv.org/abs/2504.04837

PDF

https://arxiv.org/pdf/2504.04837.pdf

