Abstract
Policy-based methods have achieved remarkable success on challenging reinforcement learning problems. Among them, off-policy policy gradient methods are particularly important because they can learn from off-policy data. However, these methods suffer from the high variance of the off-policy policy gradient (OPPG) estimator, which leads to poor sample efficiency during training. In this paper, we propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate this variance issue. Specifically, this baseline keeps the OPPG estimator unbiased while theoretically minimizing its variance. To improve computational efficiency in practice, we design an approximate version of this optimal baseline. Using this approximation, our method (Off-OAB) aims to reduce the variance of the OPPG estimator during policy optimization. We evaluate Off-OAB on six representative tasks from OpenAI Gym and MuJoCo, where it outperforms state-of-the-art methods on the majority of these tasks.
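To make the unbiasedness and variance claims concrete, the following is a minimal sketch on a toy problem. It is not the Off-OAB baseline from the paper. It assumes a single-state bandit with a softmax target policy, and it uses a simpler action-independent (constant) baseline, for which the variance-minimizing value has a closed form. All names (`oppg_moments`, `optimal_constant_baseline`) and the numbers for `theta`, `mu`, and `rewards` are hypothetical and chosen only for illustration. The moments are computed exactly by enumerating actions, so no sampling is involved.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def oppg_moments(theta, mu, rewards, baseline):
    """Exact mean and total variance (trace of the covariance) of the
    single-sample estimator g(a) = rho(a) * grad log pi(a) * (r(a) - b),
    where a is drawn from the behavior policy mu."""
    pi = softmax(theta)
    rho = pi / mu                              # importance weights pi(a) / mu(a)
    # Row a holds grad_theta log pi(a) = one_hot(a) - pi for a softmax policy.
    score = np.eye(len(theta)) - pi
    g = (rho * (rewards - baseline))[:, None] * score  # one estimate per action
    mean = mu @ g
    total_var = mu @ (g ** 2).sum(axis=1) - mean @ mean
    return mean, total_var

def optimal_constant_baseline(theta, mu, rewards):
    """Constant b minimizing the total variance above:
    b* = E_mu[w r] / E_mu[w], with w = rho^2 * ||grad log pi||^2."""
    pi = softmax(theta)
    score = np.eye(len(theta)) - pi
    w = mu * (pi / mu) ** 2 * (score ** 2).sum(axis=1)
    return (w @ rewards) / w.sum()

# Hypothetical toy setup: 4 actions, target policy differs from behavior policy.
theta = np.array([0.5, -0.2, 0.1, 0.0])
mu = np.array([0.4, 0.3, 0.2, 0.1])
rewards = np.array([1.0, 3.0, 2.0, 5.0])

mean_no_b, var_no_b = oppg_moments(theta, mu, rewards, 0.0)
b_star = optimal_constant_baseline(theta, mu, rewards)
mean_b, var_b = oppg_moments(theta, mu, rewards, b_star)
_, var_avg = oppg_moments(theta, mu, rewards, rewards.mean())

print("expected gradient unchanged:", np.allclose(mean_no_b, mean_b))
print("total variance (b = 0, b = mean reward, b = b*):", var_no_b, var_avg, var_b)
```

In this sketch the expected gradient is the same with and without the baseline, because the importance-weighted score function has zero mean under the behavior policy, while the total variance is smallest at `b_star`. An action-dependent baseline, as proposed in the paper, generally needs additional care to remain unbiased, which this toy example does not cover.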
URL
https://arxiv.org/abs/2405.02572