Abstract
Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have taken encouraging steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples; limited to reward functions expressible by compact code, which may require access to source code and struggle to capture nuanced semantics; or dependent on a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent's collected experience via an asynchronous LLM server; these annotations are then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. By studying their relative tradeoffs, we shed light on questions regarding intrinsic reward design for sparse reward problems. Our approach achieves state-of-the-art performance across a range of challenging, sparse reward tasks from the NetHack Learning Environment in a simple, unified process, using only the agent's gathered experience, without requiring external datasets or source code. We make our code available at \url{URL} (coming soon).
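The core loop described above, asynchronously labeling agent-collected observations with an LLM and distilling those labels into a learned intrinsic reward, can be pictured with a minimal sketch. Everything below (query_llm, RewardClassifier, tokenize, the threading layout) is an illustrative assumption rather than the paper's actual implementation or API, and it shows only a classification-style reward model, one of the three variants (hashing, classification, ranking) mentioned in the abstract.

```python
# Minimal sketch (assumed, not ONI's actual code) of asynchronous LLM
# annotation of agent experience plus distillation into an intrinsic
# reward model.
import queue
import threading

import torch
import torch.nn as nn


def query_llm(caption: str) -> float:
    """Stand-in for an asynchronous LLM server call that labels an
    environment message as interesting (1.0) or not (0.0)."""
    return float("key" in caption or "door" in caption)  # toy heuristic


class RewardClassifier(nn.Module):
    """Tiny classification-style intrinsic reward model."""

    def __init__(self, vocab_size: int = 128, dim: int = 32):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(token_ids)).squeeze(-1)


def tokenize(caption: str, vocab_size: int = 128) -> torch.Tensor:
    """Crude character-level tokenizer, for illustration only."""
    return torch.tensor([[ord(c) % vocab_size for c in caption[:64]]])


# Captions gathered by the RL agent are pushed onto a queue; a background
# thread queries the LLM and accumulates (caption, label) pairs, so the
# policy's rollout workers never block on LLM latency.
caption_queue: "queue.Queue[str]" = queue.Queue()
labeled: list = []


def annotation_worker() -> None:
    while True:
        caption = caption_queue.get()
        labeled.append((caption, query_llm(caption)))


threading.Thread(target=annotation_worker, daemon=True).start()

model = RewardClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()


def distill_step() -> None:
    """One distillation step over recently annotated experience."""
    for caption, label in labeled[-32:]:
        logits = model(tokenize(caption))
        loss = loss_fn(logits, torch.tensor([label]))
        opt.zero_grad()
        loss.backward()
        opt.step()


# In the RL loop, the intrinsic reward for a caption would be
# torch.sigmoid(model(tokenize(caption))), added to the sparse extrinsic
# reward when updating the policy.
```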
URL
https://arxiv.org/abs/2410.23022