Abstract
Multimodal Large Language Models (MLLMs) have shown impressive reasoning abilities and general intelligence across various domains. This has inspired researchers to train end-to-end MLLMs or to use large models with human-selected prompts to generate policies for embodied agents. However, these methods exhibit limited generalization to unseen tasks or scenarios, and they overlook multimodal environment information, which is critical for robots to make decisions. In this paper, we introduce a novel Robotic Multimodal Perception-Planning (RoboMP$^2$) framework for robotic manipulation, which consists of a Goal-Conditioned Multimodal Perceptor (GCMP) and a Retrieval-Augmented Multimodal Planner (RAMP). Specifically, GCMP captures environment states by employing an MLLM tailored for embodied agents with semantic reasoning and localization abilities. RAMP uses a coarse-to-fine retrieval method to find the $k$ most relevant policies as in-context demonstrations that enhance the planner. Extensive experiments demonstrate the superiority of RoboMP$^2$ on both the VIMA benchmark and real-world tasks, with around a 10% improvement over the baselines.
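The coarse-to-fine retrieval in RAMP can be illustrated with a minimal sketch: a cheap coarse filter (here, a hypothetical task tag) prunes the policy library, then a finer similarity ranking over embeddings keeps the top-$k$ entries as in-context demonstrations. The data layout, tag field, and embedding vectors below are illustrative assumptions, not the paper's actual implementation.

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def coarse_to_fine_retrieve(query, policy_library, k=2):
    """Hypothetical coarse-to-fine retrieval: filter by task tag,
    then rank survivors by embedding similarity and keep the top-k."""
    # Coarse stage: keep policies sharing the query's task tag;
    # fall back to the full library if the filter empties it.
    coarse = [p for p in policy_library if p["tag"] == query["tag"]] or policy_library
    # Fine stage: rank the survivors by cosine similarity.
    ranked = sorted(coarse, key=lambda p: cosine(query["emb"], p["emb"]),
                    reverse=True)
    return ranked[:k]

# Toy policy library with stand-in 2-D embeddings.
library = [
    {"tag": "pick", "emb": [1.0, 0.0], "policy": "pick(red_block)"},
    {"tag": "pick", "emb": [0.9, 0.1], "policy": "pick(blue_bowl)"},
    {"tag": "wipe", "emb": [0.0, 1.0], "policy": "wipe(table)"},
]
demos = coarse_to_fine_retrieve({"tag": "pick", "emb": [1.0, 0.05]}, library, k=2)
```

In practice the coarse stage would use inexpensive signals (task type, keywords) and the fine stage a learned multimodal embedding; the retrieved `demos` are then prepended to the planner's prompt.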
URL
https://arxiv.org/abs/2404.04929