Abstract
Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). One approach to this problem is to select actions according to specific policies for an extended period of time, a construct known as options. A recent line of work derives such exploratory options from the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These requirements are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.
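For intuition on the tabular setting the abstract describes as non-scalable, below is a minimal sketch of classic Laplacian-based option discovery: build the graph Laplacian from a fully known adjacency matrix, eigendecompose it, and use an eigenvector as an intrinsic reward defining an option's objective. This is an illustrative sketch only, not the paper's deep, online algorithm; the toy chain graph and the reward form r(s, s') = f(s') - f(s) are assumptions drawn from the eigenoptions literature.

```python
import numpy as np

# Minimal tabular sketch of Laplacian-based option discovery,
# i.e., the setting the abstract describes as non-scalable.
# Assumes the state graph is small, undirected, and fully known.

# Toy chain graph over 5 states (hypothetical example).
n = 5
A = np.zeros((n, n))
for s in range(n - 1):
    A[s, s + 1] = A[s + 1, s] = 1.0

D = np.diag(A.sum(axis=1))  # degree matrix
L = D - A                   # (unnormalized) graph Laplacian

# Full eigendecomposition -- tractable only when L is small and given.
eigvals, eigvecs = np.linalg.eigh(L)

# The second eigenvector captures the graph's slowest-mixing direction;
# eigenoption-style methods use such eigenvectors as intrinsic rewards.
f = eigvecs[:, 1]
intrinsic_reward = lambda s, s_next: f[s_next] - f[s]

print(eigvals.round(3))
print(f.round(3))
```

The paper's contribution, by contrast, is to replace the explicit matrix and its eigendecomposition with a neural approximation of the eigenfunctions, learned online from pixels.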
URL
https://arxiv.org/abs/2301.11181