Abstract
With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Nowadays, Multi-modal Language Models (MLMs) integrate multi-modal signals into LLMs, bringing richer perception to embodied agents and allowing them to perceive world-understanding tasks with finer granularity. However, existing works: 1) rely on multiple independently operating agents, each built on several LLMs, from perception to action, which leaves gaps between complex tasks and their execution; 2) train MLMs on static data, which struggle with the dynamics of open-ended scenarios; and 3) feed prior knowledge directly into prompts, which suppresses application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model that brings additional knowledge into the parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, drawing on the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 on open-ended tasks, achieving $1.4\times$ to $7.3\times$ improvements in performance.
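For readers unfamiliar with the distillation idea the abstract builds on, the sketch below shows the standard soft-target knowledge-distillation loss in the Hinton et al. style, where a compact student is trained to match a larger teacher's softened output distribution. This is a generic illustration only, not STEVE-2's hierarchical or mirrored pipeline; the tensor shapes, action-class count, and temperature value are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target knowledge distillation loss (Hinton et al. style).

    The student is trained to match the teacher's softened output
    distribution; temperature > 1 smooths both distributions so the
    teacher's relative preferences over classes carry more signal.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence from student to teacher, scaled by T^2 so gradient
    # magnitudes stay comparable across temperature settings.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2

# Hypothetical usage: distill an expert teacher into a student policy
# over 32 discrete actions (shapes are illustrative assumptions).
teacher_logits = torch.randn(8, 32)
student_logits = torch.randn(8, 32, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```

In STEVE-2 this kind of supervision is applied hierarchically and over parallel simulation data, with an extra expert model supplying additional knowledge, so that the distilled agent no longer needs expert guidance at deployment time.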
URL
https://arxiv.org/abs/2404.04619