Paper Reading AI Learner

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

2023-05-24 11:04:30
Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo

Abstract

Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals in a chain-of-thought mode for effective embodied planning. (ii) We introduce an efficient training approach for EmbodiedGPT that yields high-quality plan generation by adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries, forming a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly improves the success rate of embodied control tasks by extracting more effective features: compared to the BLIP-2 baseline fine-tuned on the Ego4D dataset, it achieves a 1.6x higher success rate on the Franka Kitchen benchmark and a 1.3x higher success rate on the Meta-World benchmark.
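The abstract describes generating a sequence of sub-goals in a chain-of-thought mode for each task. A minimal sketch of what one such planning record might look like, assuming a simple numbered "task → sub-goals" text format (the function name and format are illustrative, not taken from the paper or the EgoCOT dataset):

```python
# Illustrative sketch of a chain-of-thought planning record:
# a task instruction followed by an ordered list of sub-goals.
# The record format here is an assumption for demonstration only.

def build_cot_plan(task: str, sub_goals: list[str]) -> str:
    """Render a task and its sub-goal sequence as a numbered
    chain-of-thought planning string, one step per line."""
    lines = [f"Task: {task}", "Plan:"]
    for i, goal in enumerate(sub_goals, start=1):
        lines.append(f"  {i}. {goal}")
    return "\n".join(lines)

plan = build_cot_plan(
    "make coffee",
    ["pick up the mug", "place it under the machine", "press the brew button"],
)
print(plan)
```

Records of this shape, paired with egocentric video clips, would give a vision-language model supervision for producing stepwise plans rather than a single end-state description.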


URL

https://arxiv.org/abs/2305.15021

PDF

https://arxiv.org/pdf/2305.15021.pdf

