RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models

2024-04-07 12:05:47
Qi Lv, Hao Li, Xiang Deng, Rui Shao, Michael Yu Wang, Liqiang Nie

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive reasoning abilities and general intelligence across various domains. This has inspired researchers to train end-to-end MLLMs or to use large models to generate policies with human-selected prompts for embodied agents. However, these methods exhibit limited generalization on unseen tasks or scenarios, and they overlook the multimodal environment information that is critical for robots to make decisions. In this paper, we introduce a novel Robotic Multimodal Perception-Planning (RoboMP$^2$) framework for robotic manipulation, which consists of a Goal-Conditioned Multimodal Perceptor (GCMP) and a Retrieval-Augmented Multimodal Planner (RAMP). Specifically, GCMP captures environment states by employing an MLLM tailored for embodied agents with semantic reasoning and localization abilities. RAMP uses a coarse-to-fine retrieval method to find the $k$ most relevant policies as in-context demonstrations to enhance the planner. Extensive experiments demonstrate the superiority of RoboMP$^2$ on both the VIMA benchmark and real-world tasks, with around a 10% improvement over the baselines.
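
The coarse-to-fine retrieval described in the abstract can be pictured as a cheap embedding-based filter followed by a more expensive re-ranking stage that selects the final in-context demonstrations. The Python sketch below illustrates only that two-stage idea; the function names, the policy-bank format, and the toy token-overlap re-ranker are assumptions made for illustration, not the paper's actual retriever or encoders.

    # Minimal sketch of a coarse-to-fine policy retriever. All names here
    # (coarse_to_fine_retrieve, the policy-bank dict format, the token-overlap
    # fine scorer) are hypothetical stand-ins, not the paper's implementation.
    import numpy as np

    def cosine(a, b):
        # Cosine similarity between two 1-D embedding vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def coarse_to_fine_retrieve(query_emb, query_text, policy_bank, k=3, coarse_m=20):
        """Return the k most relevant policies to use as in-context demonstrations.

        policy_bank: list of dicts with keys "emb" (np.ndarray) and "text" (str).
        """
        # Coarse stage: keep the top-m candidates by embedding similarity (cheap).
        scored = sorted(policy_bank,
                        key=lambda p: cosine(query_emb, p["emb"]),
                        reverse=True)
        candidates = scored[:coarse_m]

        # Fine stage: re-rank the shortlist with a finer scorer. A toy Jaccard
        # token-overlap score stands in for, e.g., a cross-encoder.
        def fine_score(p):
            q, t = set(query_text.split()), set(p["text"].split())
            return len(q & t) / max(len(q | t), 1)

        return sorted(candidates, key=fine_score, reverse=True)[:k]

    # Example usage with a toy bank of stored policies:
    # bank = [{"text": "pick up the red block", "emb": np.random.rand(8)}, ...]
    # demos = coarse_to_fine_retrieve(np.random.rand(8),
    #                                 "put the red block into the bowl", bank)

Keeping the coarse stage cheap lets the bank stay large, while the fine stage only pays its higher cost on a short candidate list before the selected policies are injected into the planner's prompt.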

URL

https://arxiv.org/abs/2404.04929

PDF

https://arxiv.org/pdf/2404.04929.pdf
