Abstract
Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world, which makes learning complex tasks with sparse rewards difficult. If initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM that is tested and verified during exploration, improving sample efficiency in embodied RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase, where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase, where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM on the basis of its experiences. Our method of hypothesizing an AWM with LLMs and then verifying the AWM based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude, but is also robust to and corrects errors in the LLM, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.
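The Dream/Wake loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the "LLM" hypothesis, the crafting recipes, and the `try_subgoal` interface are all hypothetical stand-ins.

```python
# Sketch of the DECKARD Dream/Wake loop from the abstract. All names,
# recipes, and the environment interface here are illustrative assumptions.

def dream_phase(llm_hypothesis):
    """Dream: an LLM proposes an Abstract World Model (AWM) mapping each
    crafting subgoal to the prerequisite items it is hypothesized to need."""
    return {item: set(prereqs) for item, prereqs in llm_hypothesis.items()}

def wake_phase(awm, try_subgoal):
    """Wake: attempt each subgoal in the environment; keep hypothesized
    edges the agent verifies and correct the ones that fail."""
    verified = {}
    for item, prereqs in awm.items():
        ok, observed_prereqs = try_subgoal(item, prereqs)
        # Verified edges are kept; wrong LLM guesses are replaced with
        # prerequisites actually observed from environment transitions.
        verified[item] = prereqs if ok else observed_prereqs
    return verified

# Toy "LLM" hypothesis: mostly right, but wrongly claims sticks need stone.
hypothesis = {"planks": {"log"}, "stick": {"planks", "stone"}}

# Toy environment with ground-truth recipes standing in for real rollouts.
TRUE_RECIPES = {"planks": {"log"}, "stick": {"planks"}}

def try_subgoal(item, prereqs):
    ok = prereqs == TRUE_RECIPES[item]
    return ok, TRUE_RECIPES[item]

awm = dream_phase(hypothesis)
awm = wake_phase(awm, try_subgoal)
print(awm)  # the corrected AWM now matches environment dynamics
```

The key property the abstract claims, robustness to LLM errors, shows up here as the Wake phase overwriting the incorrect `stick` edge while keeping the verified `planks` edge.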
URL
https://arxiv.org/abs/2301.12050