Abstract
Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments. One of the primary challenges in EIF is compositional task planning, which is often addressed with supervised or in-context learning on labeled data. To remove this dependence on labeled data, we introduce the Socratic Planner, the first zero-shot planning method that performs inference without any training data. The Socratic Planner first decomposes the instructions into substructural information about the task through self-questioning and answering, then translates it into a high-level plan, i.e., a sequence of subgoals. Subgoals are executed sequentially, with our visually grounded re-planning mechanism adjusting the plan dynamically through dense visual feedback. We also introduce an evaluation metric for high-level plans, RelaxedHLP, for a more comprehensive evaluation. Experiments demonstrate the effectiveness of the Socratic Planner, which achieves competitive performance on both zero-shot and few-shot task planning in the ALFRED benchmark and particularly excels in tasks requiring higher-dimensional inference. Additionally, incorporating environmental visual information enabled precise adjustments to the plan.
URL
https://arxiv.org/abs/2404.15190