
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback

2024-10-30 13:52:43
Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, Brandon Amos

Abstract

Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples; or are limited to reward functions expressible by compact code, which may require source code and have difficulty capturing nuanced semantics; or require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent's collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. By studying their relative tradeoffs, we shed light on questions regarding intrinsic reward design for sparse reward problems. Our approach achieves state-of-the-art performance across a range of challenging, sparse reward tasks from the NetHack Learning Environment in a simple unified process, solely using the agent's gathered experience, without requiring external datasets or source code. We make our code available at \url{URL} (coming soon).
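The annotate-then-distill loop described in the abstract can be pictured with a minimal sketch: an asynchronous worker labels environment messages with an LLM (stubbed out here with a keyword heuristic), and a hashing-style reward model serves intrinsic rewards to the RL loop without ever blocking it. This is an illustrative sketch under stated assumptions, not the authors' implementation; all names (annotate_with_llm, IntrinsicRewardTable, llm_worker, run_agent) are hypothetical, and a real system would query an LLM server and feed the intrinsic reward into policy optimization.

import queue
import threading

def annotate_with_llm(message: str) -> float:
    # Stand-in for an asynchronous LLM call that scores how "interesting"
    # an environment message is; a real system would query an LLM server here.
    return 1.0 if "key" in message or "door" in message else 0.0

class IntrinsicRewardTable:
    # Hashing-style reward model: caches LLM labels keyed by message text.
    def __init__(self):
        self.labels = {}

    def update(self, message: str, label: float) -> None:
        self.labels[message] = label

    def reward(self, message: str) -> float:
        # Unlabeled messages get zero intrinsic reward until the LLM responds.
        return self.labels.get(message, 0.0)

def llm_worker(requests: queue.Queue, table: IntrinsicRewardTable) -> None:
    # Runs alongside the RL loop, so policy updates never block on the LLM.
    while True:
        message = requests.get()
        if message is None:
            break
        table.update(message, annotate_with_llm(message))

def run_agent(env_messages, table, requests):
    # Toy stand-in for the RL loop: request annotations asynchronously and
    # read back whatever intrinsic rewards have been distilled so far.
    for msg in env_messages:
        requests.put(msg)              # fire-and-forget annotation request
        r_int = table.reward(msg)      # may still be 0.0 if the label has not arrived
        print(f"{msg!r:30s} intrinsic reward = {r_int}")

if __name__ == "__main__":
    table = IntrinsicRewardTable()
    requests = queue.Queue()
    worker = threading.Thread(target=llm_worker, args=(requests, table), daemon=True)
    worker.start()
    run_agent(["You see a door.", "It is dark here.", "You pick up a key."], table, requests)
    requests.put(None)
    worker.join()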

URL

https://arxiv.org/abs/2410.23022

PDF

https://arxiv.org/pdf/2410.23022.pdf
