Abstract
Learning from active human involvement enables a human subject to intervene in and demonstrate to the AI agent during training. This interaction and the corrective feedback from humans bring safety and AI alignment to the learning process. In this work, we propose a new reward-free active human involvement method called Proxy Value Propagation for policy optimization. Our key insight is that a proxy value function can be designed to express human intents: state-action pairs in the human demonstrations are labeled with high values, while agent actions that are intervened upon receive low values. Through the TD-learning framework, the labeled values of demonstrated state-action pairs are further propagated to the unlabeled data generated from the agent's exploration. The proxy value function thus induces a policy that faithfully emulates human behaviors. Human-in-the-loop experiments show the generality and efficiency of our method. With minimal modification to existing reinforcement learning algorithms, our method can learn to solve continuous and discrete control tasks with various human control devices, including the challenging task of driving in Grand Theft Auto V. Demo video and code are available at: this https URL
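To make the mechanism concrete, below is a minimal sketch in PyTorch of the two losses the abstract describes: a supervised "proxy value" loss that labels human demonstrations high and intervened agent actions low, and a reward-free TD loss that propagates those labels to unlabeled exploration data. All names, network shapes, and hyperparameters (QNetwork, proxy_value_loss, high_value, low_value) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the proxy-value idea; names and hyperparameters are
# illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """A simple state-action value function Q(s, a)."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def proxy_value_loss(q_net, human_obs, human_act, intervened_obs, intervened_act,
                     high_value=1.0, low_value=-1.0):
    """Label human demonstrations with a high proxy value and the agent
    actions that were intervened upon with a low proxy value."""
    q_hi = q_net(human_obs, human_act)
    q_lo = q_net(intervened_obs, intervened_act)
    return (F.mse_loss(q_hi, torch.full_like(q_hi, high_value))
            + F.mse_loss(q_lo, torch.full_like(q_lo, low_value)))

def td_loss(q_net, q_target, obs, act, next_obs, next_act, done, gamma=0.99):
    """Reward-free TD backup: the target omits any environment reward, so the
    only value signal is what proxy_value_loss injects on labeled data."""
    with torch.no_grad():
        target = gamma * (1.0 - done) * q_target(next_obs, next_act)
    return F.mse_loss(q_net(obs, act), target)

# Toy usage on random data (obs_dim=4, act_dim=2).
q, q_tgt = QNetwork(4, 2), QNetwork(4, 2)
q_tgt.load_state_dict(q.state_dict())
opt = torch.optim.Adam(q.parameters(), lr=3e-4)

b = 32
human = (torch.randn(b, 4), torch.randn(b, 2))        # demonstrated pairs
agent = (torch.randn(b, 4), torch.randn(b, 2))        # intervened agent actions
expl = (torch.randn(b, 4), torch.randn(b, 2),         # unlabeled exploration
        torch.randn(b, 4), torch.randn(b, 2), torch.zeros(b, 1))

loss = proxy_value_loss(q, *human, *agent) + td_loss(q, q_tgt, *expl)
opt.zero_grad(); loss.backward(); opt.step()
```

In a full pipeline, the induced policy would then be extracted from the learned value function, e.g. by an actor trained to maximize it as in standard off-policy actor-critic methods; this last step is assumed here, not shown.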
URL
https://arxiv.org/abs/2502.03369