Abstract
Precise robot manipulation requires rich spatial information for imitation learning. Image-based policies model object positions from fixed cameras and are therefore sensitive to changes in camera viewpoint. Policies that use 3D point clouds usually predict keyframes rather than continuous actions, which makes dynamic and contact-rich scenarios difficult. To exploit 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning that predicts continuous actions directly from single-view point clouds. It compresses the point cloud into tokens with a sparse 3D encoder; after sparse positional encoding is added, the tokens are featurized by a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations per real-world task, RISE surpasses representative 2D and 3D policies by a large margin, with significant advantages in both accuracy and efficiency. Experiments also show that RISE is more general and more robust to environmental changes than previous baselines. Project website: this http URL.
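The pipeline sketched in the abstract (point cloud → sparse tokens → positional encoding → transformer → diffusion head) can be illustrated for its first two stages. Below is a minimal NumPy sketch, assuming a simple voxel-grid compression and a hypothetical sinusoidal encoding of voxel coordinates; RISE's actual sparse 3D encoder and exact encoding scheme are not specified here and will differ.

```python
import numpy as np

def voxelize(points, voxel_size=0.05):
    """Compress a point cloud (N, 3) into sparse voxel tokens.

    Returns integer voxel coordinates (M, 3) and per-voxel mean
    positions (M, 3), M <= N. This mimics only the token-compression
    idea; RISE uses a learned sparse 3D encoder, not a plain average.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    sums = np.zeros((len(uniq), 3))
    counts = np.zeros(len(uniq))
    np.add.at(sums, inverse, points)   # accumulate points per voxel
    np.add.at(counts, inverse, 1.0)
    return uniq, sums / counts[:, None]

def sparse_positional_encoding(coords, dim=32):
    """Sinusoidal encoding of sparse voxel coordinates (hypothetical
    form). Each of the 3 axes gets dim // 6 sin/cos frequency pairs."""
    n_freq = dim // 6
    freqs = 1.0 / (10000.0 ** (np.arange(n_freq) * 6.0 / dim))
    angles = coords[:, :, None] * freqs[None, None, :]  # (M, 3, n_freq)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(coords), -1)                 # (M, 6 * n_freq)

# Usage: 100 points in a unit cube become far fewer voxel tokens,
# each paired with a fixed-length positional code for the transformer.
pts = np.random.RandomState(0).rand(100, 3)
tokens_xyz, token_feats = voxelize(pts, voxel_size=0.1)
pos_enc = sparse_positional_encoding(tokens_xyz, dim=32)
```

In the full model these tokens (with their positional codes) would be attended over by a transformer, and the resulting features decoded into continuous actions by a diffusion head; those stages are omitted here.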
URL
https://arxiv.org/abs/2404.12281