6-DoF Robotic Grasping with Transformer

2023-01-29 15:59:28
Zhenjie Zhao, Hang Yu, Hang Wu, Xuebo Zhang

Abstract

Robotic grasping aims to detect graspable points and their corresponding gripper configurations in a particular scene, and is fundamental for robot manipulation. Existing research works have demonstrated the potential of using a transformer model for robotic grasping, which can efficiently learn both global and local features. However, such methods are still limited in grasp detection on a 2D plane. In this paper, we extend a transformer model for 6-Degree-of-Freedom (6-DoF) robotic grasping, which makes it more flexible and suitable for tasks that concern safety. The key designs of our method are a serialization module that turns a 3D voxelized space into a sequence of feature tokens that a transformer model can consume and skip-connections that merge multiscale features effectively. In particular, our method takes a Truncated Signed Distance Function (TSDF) as input. After serializing the TSDF, a transformer model is utilized to encode the sequence, which can obtain a set of aggregated hidden feature vectors through multi-head attention. We then decode the hidden features to obtain per-voxel feature vectors through deconvolution and skip-connections. Voxel feature vectors are then used to regress parameters for executing grasping actions. On a recently proposed pile and packed grasping dataset, we showcase that our transformer-based method can surpass existing methods by about 5% in terms of success rates and declutter rates. We further evaluate the running time and generalization ability to demonstrate the superiority of the proposed method.
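The serialization step described in the abstract, turning a 3D voxelized space into a sequence of tokens, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration rather than the authors' module: the function name `serialize_voxels`, the use of non-overlapping cubic patches, the row-major token order, and the toy grid values standing in for a real TSDF. The paper's serialization may well use learned 3D convolutions instead of plain flattening.

```python
from itertools import product


def serialize_voxels(grid, patch):
    """Split a cubic voxel grid (nested lists, shape D x D x D) into
    non-overlapping patch x patch x patch blocks and flatten each block
    into one token.

    Returns (tokens, positions), where positions[i] is the block index
    of tokens[i] along each axis (usable for a positional encoding).
    """
    size = len(grid)
    if size % patch != 0:
        raise ValueError("grid size must be divisible by patch size")

    steps = range(0, size, patch)
    tokens, positions = [], []
    # Visit blocks in row-major order over (x, y, z).
    for x0, y0, z0 in product(steps, repeat=3):
        # Flatten the voxels inside this block into a single vector.
        token = [
            grid[x0 + dx][y0 + dy][z0 + dz]
            for dx, dy, dz in product(range(patch), repeat=3)
        ]
        tokens.append(token)
        positions.append((x0 // patch, y0 // patch, z0 // patch))
    return tokens, positions


# Hypothetical toy input: a 4 x 4 x 4 grid whose values encode their own
# flat index, standing in for truncated signed distance values.
size, patch = 4, 2
grid = [
    [[x * size * size + y * size + z for z in range(size)] for y in range(size)]
    for x in range(size)
]

tokens, positions = serialize_voxels(grid, patch)
print(len(tokens), len(tokens[0]))  # number of tokens, values per token
```

With a 4 x 4 x 4 grid and patch size 2, this yields (4/2)^3 = 8 tokens of 2^3 = 8 values each. In a full model, each token would typically be projected to an embedding and combined with a positional encoding before the transformer encoder; the block positions are returned for that purpose.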

URL

https://arxiv.org/abs/2301.12476

PDF

https://arxiv.org/pdf/2301.12476.pdf

