Abstract
Embodied AI is a crucial frontier in robotics, enabling robots to plan and execute action sequences that accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals in a "chain of thought" mode for effective embodied planning. (ii) We introduce an efficient training approach for EmbodiedGPT that yields high-quality plan generation, adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries, forming a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly enhances the success rate of the embodied control task by extracting more effective features. It achieves a remarkable 1.6x increase in success rate on the Franka Kitchen benchmark and a 1.3x increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned on the Ego4D dataset.
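The prefix tuning mentioned in (ii) can be illustrated with a minimal sketch: rather than updating the 7B LLM's weights, a small bank of trainable prefix vectors is prepended to the frozen input embeddings, and only those vectors are optimized. The dimensions, names, and initialization below are illustrative toy values, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16        # hidden size (toy value; a 7B LLM would use thousands)
prefix_len = 4      # number of trainable prefix tokens
seq_len = 6         # length of the tokenized instruction

# Frozen base-model embeddings for an input instruction (stand-in values).
token_embeddings = rng.normal(size=(seq_len, d_model))

# The only trainable parameters: a small set of prefix vectors that are
# prepended to every input; the LLM's own weights stay frozen.
prefix = rng.normal(size=(prefix_len, d_model)) * 0.01

def with_prefix(prefix: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """Prepend trainable prefix vectors to frozen input embeddings."""
    return np.concatenate([prefix, embeddings], axis=0)

augmented = with_prefix(prefix, token_embeddings)
print(augmented.shape)  # (prefix_len + seq_len, d_model) -> (10, 16)
```

During fine-tuning, gradients flow only into `prefix`, which keeps the adaptation cost far below full fine-tuning of the 7B model.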
URL
https://arxiv.org/abs/2305.15021