Abstract
Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that the Mean Square Bellman Error (MSBE) is an ill-conditioned loss function, in the sense that its Hessian has a large condition number. To counter the adverse effect of MSBE's poor conditioning on gradient-based methods, we propose a low-complexity, batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than residual-gradient methods while having almost the same computational complexity, and it is competitive with TD on the classic problems we tested.
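To make the conditioning claim concrete, here is a minimal numerical sketch. It is an illustrative construction of ours, not the paper's RANS algorithm: it builds a small random linear value-estimation problem, writes the MSBE in its expected-transition (Bellman-residual) least-squares form, prints the condition number of the resulting Hessian, and contrasts a plain gradient step with a damped Gauss-Newton step of the kind the abstract alludes to. All names and constants (Phi, P, gamma, lam) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear value-estimation setup: V_theta(s) = phi(s)^T theta.
n_states, d, gamma = 50, 8, 0.99
Phi = rng.normal(size=(n_states, d))                 # feature matrix, one row per state
P = rng.dirichlet(np.ones(n_states), size=n_states)  # random transition matrix (rows sum to 1)
r = rng.normal(size=n_states)                        # reward vector

# MSBE in expected-transition form:
#   L(theta) = (1/n) * || (Phi - gamma * P @ Phi) @ theta - r ||^2
A = Phi - gamma * P @ Phi
H = 2.0 * A.T @ A / n_states                         # Hessian of L(theta)

print("condition number of MSBE Hessian:", np.linalg.cond(H))

theta = np.zeros(d)
grad = 2.0 * A.T @ (A @ theta - r) / n_states

# Plain gradient step: its progress is limited by the condition number above.
theta_gd = theta - 0.01 * grad

# Damped Gauss-Newton step; lam is a hypothetical damping parameter.
lam = 1e-3
theta_gn = theta - np.linalg.solve(H + lam * np.eye(d), grad)

print("loss after GD step:", np.mean((A @ theta_gd - r) ** 2))
print("loss after GN step:", np.mean((A @ theta_gn - r) ** 2))
```

For a least-squares loss such as this, the Gauss-Newton direction coincides with the Newton direction, so the damped step rescales the gradient by an (approximate) inverse Hessian and is largely insensitive to the poor conditioning that slows the plain gradient step.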
URL
https://arxiv.org/abs/2301.13757