Abstract
With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Nowadays, Multi-modal Language Models (MLMs) integrate multi-modal signals into LLMs, bringing richer perception to embodied agents and allowing them to perceive world-understanding tasks with finer granularity. However, existing works: 1) rely on multiple independently operating agents, each built on several LLMs, from perception to action, which leaves gaps between complex tasks and their execution; 2) train MLMs on static data, which struggle with the dynamics of open-ended scenarios; and 3) feed prior knowledge directly into prompts, which suppresses application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model that brings additional knowledge into the parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, drawing on the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 on open-ended tasks, achieving $1.4\times$ to $7.3\times$ improvements in performance.
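For readers unfamiliar with the distillation idea the abstract builds on, the sketch below shows the standard soft-target knowledge-distillation loss in the Hinton et al. style, where a compact student is trained to match a larger teacher's softened output distribution. This is a generic illustration only, not STEVE-2's hierarchical or mirrored pipeline; the tensor shapes, action-class count, and temperature value are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target knowledge distillation loss (Hinton et al. style).

    The student is trained to match the teacher's softened output
    distribution; temperature > 1 smooths both distributions so the
    teacher's relative preferences over classes carry more signal.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence from student to teacher, scaled by T^2 so gradient
    # magnitudes stay comparable across temperature settings.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2

# Hypothetical usage: distill an expert teacher into a student policy
# over 32 discrete actions (shapes are illustrative assumptions).
teacher_logits = torch.randn(8, 32)
student_logits = torch.randn(8, 32, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```

In STEVE-2 this kind of supervision is applied hierarchically and over parallel simulation data, with an extra expert model supplying additional knowledge, so that the distilled agent no longer needs expert guidance at deployment time.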
URL
https://arxiv.org/abs/2404.04619