Efficient Bi-manipulation using RGBD Multi-model Fusion based on Attention Mechanism

2024-04-27 07:29:53
Jian Shen, Jiaxin Huang, Zhigong Song

Abstract

Dual-arm robots have great application prospects in intelligent manufacturing due to their human-like structure when deployed with advanced intelligence algorithms. However, previous visuomotor policies suffer from perception deficiencies in environments where image features are impaired by various conditions such as abnormal lighting, occlusion, and shadow. The Focal CVAE framework is proposed for RGB-D multi-modal data fusion to address this challenge. In this study, a mixed focal attention module is designed to fuse RGB images, which carry color features, with depth images, which carry 3D shape and structure information. The module highlights prominent local features and captures the relevance between RGB and depth via cross-attention. A saliency attention module, applied in both the encoder and the decoder of the framework, is proposed to improve computational efficiency. The effectiveness of the proposed method is illustrated via extensive simulation and experiments: bi-manipulation performance improves significantly on all four real-world tasks at a lower computational cost. In addition, robustness is validated through experiments under different scenarios that exhibit the perception deficiency problem, demonstrating the feasibility of the method.
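The paper's code is not reproduced here, so the sketch below is only a minimal illustration of the kind of mechanism the abstract describes: RGB and depth feature tokens are re-weighted by a learned saliency score and then exchange information through bidirectional cross-attention. All names, shapes, and design choices (FocalCrossFusion, embed_dim, num_heads, the gate layout) are assumptions for illustration, not the authors' actual Focal CVAE implementation.

import torch
import torch.nn as nn

class FocalCrossFusion(nn.Module):
    """Hypothetical RGB-D fusion block: saliency-gated tokens + cross-attention."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Each modality queries the other, so color and geometric cues can
        # compensate for each other when one stream is degraded.
        self.rgb_to_depth = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(embed_dim)
        self.norm_depth = nn.LayerNorm(embed_dim)
        # Assumed form of a lightweight saliency gate: re-weights tokens before
        # attention so prominent local features dominate (loosely mirroring the
        # paper's saliency attention idea; the actual design is not given here).
        self.saliency = nn.Sequential(nn.Linear(embed_dim, 1), nn.Sigmoid())

    def forward(self, rgb_tokens: torch.Tensor, depth_tokens: torch.Tensor) -> torch.Tensor:
        # rgb_tokens, depth_tokens: (batch, num_tokens, embed_dim), e.g. patch
        # embeddings produced by separate RGB and depth backbones.
        rgb_tokens = rgb_tokens * self.saliency(rgb_tokens)
        depth_tokens = depth_tokens * self.saliency(depth_tokens)
        # Cross-attention in both directions, with residual connections.
        rgb_att, _ = self.rgb_to_depth(rgb_tokens, depth_tokens, depth_tokens)
        depth_att, _ = self.depth_to_rgb(depth_tokens, rgb_tokens, rgb_tokens)
        rgb_fused = self.norm_rgb(rgb_tokens + rgb_att)
        depth_fused = self.norm_depth(depth_tokens + depth_att)
        # Concatenate the two fused streams along the token axis.
        return torch.cat([rgb_fused, depth_fused], dim=1)

# Example: 14x14 = 196 patch tokens per modality.
rgb = torch.randn(2, 196, 256)
depth = torch.randn(2, 196, 256)
print(FocalCrossFusion()(rgb, depth).shape)  # torch.Size([2, 392, 256])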


URL

https://arxiv.org/abs/2404.17811

PDF

https://arxiv.org/pdf/2404.17811.pdf

