Segmenting Unknown 3D Objects from Real Depth Images using Mask R-CNN Trained on Synthetic Point Clouds

2018-09-16 07:08:58
Michael Danielczuk, Matthew Matl, Saurabh Gupta, Andrew Li, Andrew Lee, Jeffrey Mahler, Ken Goldberg

Abstract

The ability to segment unknown objects in depth images has the potential to enhance robot skills in grasping and object tracking. Recent computer vision research has demonstrated that Mask R-CNN can be trained to segment specific categories of objects in RGB images when massive hand-labeled datasets are available. As generating these datasets is time-consuming, we instead train with synthetic depth images. Many robots now use depth sensors, and recent results suggest that training on synthetic depth data can generalize well to the real world. We present a method for automated dataset generation and use it to rapidly generate a synthetic training dataset of 50k depth images and 320k object masks from simulated scenes of 3D CAD models. We train a variant of Mask R-CNN on the generated dataset to perform category-agnostic instance segmentation without hand-labeled data. We evaluate the trained network, which we refer to as Synthetic Depth (SD) Mask R-CNN, on a set of real, high-resolution images of challenging, densely cluttered bins containing objects with highly varied geometry. SD Mask R-CNN outperforms point cloud clustering baselines by an absolute 15% in Average Precision and 20% in Average Recall, and achieves performance levels similar to a Mask R-CNN trained on a massive, hand-labeled RGB dataset and fine-tuned on real images from the experimental setup. The network also generalizes well to a lower-resolution depth sensor. We deploy the model in an instance-specific grasping pipeline to demonstrate its usefulness in a robotics application. Code, the synthetic training dataset, and supplementary material are available at https://bit.ly/2letCuE.
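
The Average Precision and Average Recall figures in the abstract rest on matching predicted instance masks to ground-truth masks by overlap. The following is a minimal sketch of that idea at a single intersection-over-union (IoU) threshold, using NumPy. The function names, the greedy matching strategy, and the 0.5 threshold are illustrative assumptions, not the authors' evaluation code; benchmark-style AP and AR are usually averaged over several thresholds and score cutoffs.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two boolean masks with the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def precision_recall(pred_masks, gt_masks, iou_thresh=0.5):
    """Greedy one-to-one matching of predictions to ground-truth masks.

    pred_masks is assumed to be sorted by descending confidence score.
    Category labels are ignored, mirroring category-agnostic segmentation.
    """
    matched = set()  # indices of ground-truth masks already claimed
    true_positives = 0
    for pred in pred_masks:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_masks):
            if j in matched:
                continue
            iou = mask_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thresh:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(pred_masks) if len(pred_masks) > 0 else 0.0
    recall = true_positives / len(gt_masks) if len(gt_masks) > 0 else 0.0
    return precision, recall
```

With this kind of metric, an "absolute 15%" improvement means a difference of 15 percentage points between two scores, not a 15% relative increase.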

URL

https://arxiv.org/abs/1809.05825

PDF

https://arxiv.org/pdf/1809.05825.pdf

