Robotic grasping aims to detect graspable points and their corresponding gripper configurations in a given scene, and is fundamental to robot manipulation. Existing work has demonstrated the potential of transformer models for robotic grasping, since they can efficiently learn both global and local features. However, such methods are still limited to grasp detection on a 2D plane. In this paper, we extend a transformer model to 6-Degree-of-Freedom (6-DoF) robotic grasping, which makes it more flexible and better suited to tasks where safety is a concern. The key designs of our method are a serialization module, which turns a 3D voxelized space into a sequence of feature tokens that a transformer model can consume, and skip-connections, which merge multiscale features effectively. Specifically, our method takes a Truncated Signed Distance Function (TSDF) as input. After the TSDF is serialized, a transformer model encodes the sequence and produces a set of aggregated hidden feature vectors through multi-head attention. We then decode the hidden features into per-voxel feature vectors through deconvolution and skip-connections. The voxel feature vectors are then used to regress the parameters for executing grasping actions. On a recently proposed pile and packed grasping dataset, we show that our transformer-based method outperforms existing methods by about 5% in both success rate and declutter rate. We further evaluate running time and generalization ability to demonstrate the superiority of the proposed method.
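To make the serialization step more concrete, the following is a minimal sketch of one way a voxelized TSDF could be turned into a token sequence: the grid is cut into non-overlapping cubic patches and each patch is flattened into one token. The abstract does not specify how the serialization module is implemented, so the function names `serialize_tsdf` and `deserialize_tokens`, the 40x40x40 grid resolution, and the patch size of 5 are all assumptions made for illustration, not the authors' design.

```python
import numpy as np


def serialize_tsdf(tsdf: np.ndarray, patch: int) -> np.ndarray:
    """Flatten a 3D voxel grid into a sequence of patch tokens.

    Hypothetical helper: the grid is split into non-overlapping cubes of
    side `patch`, and each cube becomes one token of length patch**3.
    """
    d, h, w = tsdf.shape
    if d % patch or h % patch or w % patch:
        raise ValueError("each grid dimension must be divisible by the patch size")
    gd, gh, gw = d // patch, h // patch, w // patch
    # Split every axis into (patch index, offset within patch), then move
    # the three patch indices to the front so each cube is contiguous.
    blocks = tsdf.reshape(gd, patch, gh, patch, gw, patch)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(gd * gh * gw, patch ** 3)


def deserialize_tokens(tokens: np.ndarray, grid: tuple, patch: int) -> np.ndarray:
    """Inverse of serialize_tsdf: rebuild the voxel grid from patch tokens."""
    gd, gh, gw = grid
    blocks = tokens.reshape(gd, gh, gw, patch, patch, patch)
    # Interleave patch indices and in-patch offsets back into voxel order.
    blocks = blocks.transpose(0, 3, 1, 4, 2, 5)
    return blocks.reshape(gd * patch, gh * patch, gw * patch)


# Assumed example input: a random 40^3 grid standing in for a real TSDF.
rng = np.random.default_rng(0)
tsdf = rng.uniform(-1.0, 1.0, size=(40, 40, 40)).astype(np.float32)

tokens = serialize_tsdf(tsdf, patch=5)          # 8 * 8 * 8 tokens of length 125
restored = deserialize_tokens(tokens, (8, 8, 8), patch=5)
```

In a transformer pipeline, tokens like these would typically be linearly projected and combined with positional embeddings before the attention layers. Because the mapping is invertible, the encoded sequence can later be reshaped back into a 3D grid, which is what a deconvolution-based decoder with skip-connections would need in order to produce per-voxel features.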