Abstract
Directed exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, addresses this by trading off regret against information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions that prohibit its use in many practical settings. In this work, we posit an alternative exploration incentive: an integral probability metric (IPM) between the current estimate of the transition model and the unknown optimal one, which, under suitable conditions, can be computed in closed form via the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm, STEERING: STEin information dirEcted exploration for model-based Reinforcement LearnING. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING achieves sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.
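As a concrete illustration of the closed-form computability the abstract alludes to, below is a minimal sketch of a KSD V-statistic estimator for a one-dimensional continuous target with a known score function (a standard Gaussian, with an RBF base kernel). This is only the textbook continuous-case KSD, not the paper's new discrete conditional variants, and all names and the bandwidth choice are illustrative assumptions.

```python
import numpy as np

def stein_kernel(x, y, score, h=1.0):
    """Langevin Stein kernel u_p(x, y) built from a 1-D RBF base kernel."""
    d = x - y
    k = np.exp(-d**2 / (2 * h**2))          # base kernel k(x, y)
    dkdx = -d / h**2 * k                    # d/dx k(x, y)
    dkdy = d / h**2 * k                     # d/dy k(x, y)
    d2k = (1.0 / h**2 - d**2 / h**4) * k    # d^2/(dx dy) k(x, y)
    sx, sy = score(x), score(y)             # score s(x) = d/dx log p(x)
    return sx * sy * k + sx * dkdy + sy * dkdx + d2k

def ksd_vstat(samples, score, h=1.0):
    """V-statistic estimate of KSD^2: average of u_p over all sample pairs."""
    X = np.asarray(samples, dtype=float)
    U = stein_kernel(X[:, None], X[None, :], score, h)
    return U.mean()

# Target p = N(0, 1), whose score is s(x) = -x (known in closed form).
score = lambda x: -x
rng = np.random.default_rng(0)
good = ksd_vstat(rng.normal(0.0, 1.0, 300), score)  # samples match the target
bad = ksd_vstat(rng.normal(2.0, 1.0, 300), score)   # samples are shifted
```

Because the Stein kernel is positive semidefinite, the V-statistic is nonnegative, stays near zero when the samples come from the target, and grows with model mismatch, which is the property STEERING exploits as an exploration signal.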
URL
https://arxiv.org/abs/2301.12038