Paper Reading AI Learner

Hindsight Planner: A Closed-Loop Few-Shot Planner for Embodied Instruction Following

2024-12-27 10:05:45
Yuxiao Yang, Shenao Zhang, Zhihan Liu, Huaxiu Yao, Zhaoran Wang

Abstract

This work focuses on building a task planner for Embodied Instruction Following (EIF) using Large Language Models (LLMs). Previous works typically train a planner to imitate expert trajectories, treating this as a supervised task. While these methods achieve competitive performance, they often lack sufficient robustness. When a suboptimal action is taken, the planner may encounter an out-of-distribution state, which can lead to task failure. In contrast, we frame the task as a Partially Observable Markov Decision Process (POMDP) and aim to develop a robust planner under a few-shot assumption. Thus, we propose a closed-loop planner with an adaptation module and a novel hindsight method, aiming to use as much information as possible to assist the planner. Our experiments on the ALFRED dataset indicate that our planner achieves competitive performance under a few-shot assumption. For the first time, our few-shot agent's performance approaches and even surpasses that of the full-shot supervised agent.

Abstract (translated)

这项工作致力于使用大型语言模型(LLMs)为具身指令遵循(EIF)构建任务规划器。先前的研究通常训练一个模仿专家轨迹的规划器,将其视为监督学习任务。尽管这些方法能够达到竞争力的表现,但它们常常缺乏足够的鲁棒性。当采取次优行动时,规划器可能会遇到分布外的状态,这可能导致任务失败。相反,我们将任务建模为部分可观测马尔可夫决策过程(POMDP),并旨在开发一种在少量样本假设下的稳健规划器。因此,我们提出了一种闭环规划器,配备有适应模块和一种新颖的回顾方法,以尽可能利用信息来辅助规划器。我们在ALFRED数据集上的实验表明,在少量样本假设下,我们的规划器实现了竞争性表现。首次,我们的少量样本代理的表现接近甚至超过了完全监督下的全量样本代理。

URL

https://arxiv.org/abs/2412.19562

PDF

https://arxiv.org/pdf/2412.19562.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot