Paper Reading AI Learner

RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective

2024-04-18 15:57:19
Chenxi Wang, Hongjie Fang, Hao-Shu Fang, Cewu Lu

Abstract

Precise robot manipulation requires rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras and are therefore sensitive to changes in camera view. Policies utilizing 3D point clouds usually predict keyframes rather than continuous actions, which poses difficulties in dynamic and contact-rich scenarios. To utilize 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning that predicts continuous actions directly from single-view point clouds. It compresses the point cloud into tokens with a sparse 3D encoder. After sparse positional encoding is added, the tokens are featurized by a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations for each real-world task, RISE surpasses currently representative 2D and 3D policies by a large margin, showcasing significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and robust to environmental change than previous baselines. Project website: this http URL.

Abstract (translated)

Precise robot manipulation requires rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras, which are very sensitive to changes in camera viewpoint. Policies using 3D point clouds usually predict keyframes rather than continuous actions, which makes dynamic and contact-rich scenarios difficult. To utilize 3D perception effectively, we present RISE, an end-to-end baseline for real-world imitation learning that predicts continuous actions directly from single-view point clouds. It compresses the point cloud with a sparse 3D encoder. After sparse positional encoding is added, the tokens are featurized with a Transformer. Finally, a diffusion head decodes the features into robot actions. Trained with 50 demonstrations for each real-world task, RISE shows clear advantages over existing 2D and 3D policies in both accuracy and efficiency. Experiments also show that RISE adapts better to environmental changes than previous baselines. Project website: this http URL.
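The front end of the pipeline described in the abstract (compressing a single-view point cloud into sparse tokens and attaching a positional encoding before the transformer) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the voxel size, encoding dimension, and the sinusoidal form of the sparse positional encoding are all assumptions, and the sparse 3D encoder itself is reduced to simple voxel quantization.

```python
import numpy as np

def voxelize(points, voxel_size=0.01):
    """Quantize a point cloud into sparse voxel tokens: keep one
    representative point per occupied voxel (a hypothetical stand-in
    for the sparse 3D encoder's input discretization)."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    # unique occupied voxels; the first point seen in each voxel is kept
    _, idx = np.unique(coords, axis=0, return_index=True)
    keep = np.sort(idx)
    return coords[keep], points[keep]

def sparse_positional_encoding(coords, dim=96):
    """Sinusoidal encoding of the integer 3D voxel coordinates
    (an assumed form of the paper's sparse positional encoding)."""
    per_axis = dim // 3  # feature budget per spatial axis
    freqs = 1.0 / (10000 ** (np.arange(0, per_axis, 2) / per_axis))
    parts = []
    for axis in range(3):
        angles = coords[:, axis:axis + 1] * freqs[None, :]
        parts.append(np.sin(angles))
        parts.append(np.cos(angles))
    return np.concatenate(parts, axis=1)  # (num_tokens, dim)

# toy single-view cloud: 1000 points inside a 10 cm cube
rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 0.1, size=(1000, 3))
coords, reps = voxelize(cloud, voxel_size=0.01)
pe = sparse_positional_encoding(coords, dim=96)
```

The resulting tokens (`reps` plus `pe`) would then be consumed by a transformer and a diffusion action head, which are omitted here for brevity.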

URL

https://arxiv.org/abs/2404.12281

PDF

https://arxiv.org/pdf/2404.12281.pdf
