Paper Reading AI Learner

Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation

2024-08-02 16:20:56
Ruoxuan Feng, Di Hu, Wenke Ma, Xuelong Li

Abstract

Humans possess a remarkable talent for flexibly alternating between different senses when interacting with the environment. Picture a chef skillfully gauging the timing of ingredient additions and controlling the heat according to colors, sounds, and aromas, seamlessly navigating every stage of a complex cooking process. This ability is founded upon a thorough comprehension of task stages, as achieving the sub-goal within each stage can necessitate the use of different senses. To endow robots with a similar ability, we incorporate task stages, divided by sub-goals, into the imitation learning process to guide dynamic multi-sensory fusion. We propose MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding, which dynamically adjusts the priority of modalities based on the fine-grained state within the predicted current stage. We train a robot system equipped with visual, auditory, and tactile sensors to accomplish challenging robotic manipulation tasks: pouring and peg insertion with a keyway. Experimental results indicate that our approach enables more effective and explainable dynamic fusion, aligning more closely with the human fusion process than existing methods.
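The abstract describes MS-Bot only at a high level and the authors' code is not reproduced here. Below is a minimal, illustrative sketch of what stage-guided dynamic fusion of vision, audio, and touch could look like: per-modality encoders produce feature tokens, a coarse stage classifier predicts the current task stage, and a stage-conditioned attention re-weights the modalities before an action head. All module names, dimensions, and the specific gating scheme are assumptions for illustration, not the paper's implementation.

# Minimal sketch (not the authors' code) of stage-guided dynamic multi-sensory
# fusion: per-modality encoders produce tokens, a stage predictor estimates the
# current task stage from the concatenated features, and a stage-conditioned
# attention re-weights the modalities before the policy head. Sizes and module
# names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageGuidedFusion(nn.Module):
    def __init__(self, feat_dim=128, num_modalities=3, num_stages=4, action_dim=7):
        super().__init__()
        # One linear encoder per modality (vision / audio / touch features are
        # assumed to be pre-extracted vectors of size feat_dim).
        self.encoders = nn.ModuleList(
            nn.Linear(feat_dim, feat_dim) for _ in range(num_modalities)
        )
        # Coarse stage classifier over the concatenated modality features.
        self.stage_head = nn.Linear(feat_dim * num_modalities, num_stages)
        # Stage embedding used as the query for modality attention.
        self.stage_embed = nn.Embedding(num_stages, feat_dim)
        self.action_head = nn.Linear(feat_dim, action_dim)

    def forward(self, modality_feats):
        # modality_feats: list of (batch, feat_dim) tensors, one per modality.
        tokens = torch.stack(
            [enc(x) for enc, x in zip(self.encoders, modality_feats)], dim=1
        )  # (batch, num_modalities, feat_dim)
        stage_logits = self.stage_head(tokens.flatten(1))
        # Soft stage estimate: expected stage embedding under the predicted
        # stage distribution, serving as a fine-grained state proxy.
        stage_query = F.softmax(stage_logits, dim=-1) @ self.stage_embed.weight
        # Modality priorities conditioned on the predicted stage.
        weights = F.softmax(
            (tokens @ stage_query.unsqueeze(-1)).squeeze(-1), dim=-1
        )  # (batch, num_modalities)
        fused = (weights.unsqueeze(-1) * tokens).sum(dim=1)
        return self.action_head(fused), stage_logits, weights


# Example: fuse dummy visual, audio, and tactile features for a batch of 2.
if __name__ == "__main__":
    model = StageGuidedFusion()
    feats = [torch.randn(2, 128) for _ in range(3)]
    action, stage_logits, weights = model(feats)
    print(action.shape, stage_logits.shape, weights.shape)

The returned modality weights make the fusion inspectable: plotting them over a trajectory shows which sense the policy prioritizes at each predicted stage, which is the kind of explainability the abstract claims.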

URL

https://arxiv.org/abs/2408.01366

PDF

https://arxiv.org/pdf/2408.01366.pdf

