Abstract
Value estimation in policy gradient methods is a fundamental problem. Generalized Advantage Estimation (GAE) is an exponentially weighted advantage estimator, analogous to the $\lambda$-return. It substantially reduces the variance of policy gradient estimates at the expense of bias. In practice, a truncated GAE is used because trajectories are incomplete, which introduces a large bias into the estimate. To address this challenge, instead of using the entire truncated GAE, we propose taking only a part of it when computing updates, which significantly reduces the bias caused by incomplete trajectories. We perform experiments in MuJoCo and $\mu$RTS to investigate the effect of different partial coefficients and sampling lengths. We show that our partial GAE approach yields better empirical results in both environments.
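The idea can be illustrated with a minimal sketch. The code below computes standard truncated GAE over a length-$T$ segment via the usual backward recursion on TD errors, then keeps only a leading fraction of the estimates for the update. Note this is an assumed reading of the paper's "partial coefficient" (here the hypothetical parameter `p`), not the authors' reference implementation: the interpretation is that advantage estimates near the truncation point carry the most bias, so only the earlier, better-backed-up estimates are used.

```python
import numpy as np

def truncated_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Standard truncated GAE over one length-T trajectory segment.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = sum_l (gamma * lam)^l * delta_{t+l}, truncated at the
              segment boundary (bootstrapped with last_value).

    Estimates near the end of the segment sum few TD errors, which is
    the source of the truncation bias discussed in the abstract.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    next_v = last_value
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_v - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
        next_v = values[t]
    return adv

def partial_gae(rewards, values, last_value, p=0.7, gamma=0.99, lam=0.95):
    """Hypothetical 'partial GAE': keep only the first floor(p * T)
    advantage estimates, which are least affected by truncation."""
    adv = truncated_gae(rewards, values, last_value, gamma, lam)
    keep = int(p * len(rewards))
    return adv[:keep]
```

For example, with `gamma = lam = 1` and a zero value function, the advantages reduce to truncated returns-to-go, and `p = 0.7` on a 10-step segment keeps the first 7 estimates for the policy update.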
Abstract (translated)
Value estimation in policy gradient methods is a fundamental problem. Generalized Advantage Estimation (GAE) is a method that estimates an advantage function, similar to the $\lambda$-return, with exponential weighting. It greatly reduces the variance of policy gradient estimates at the expense of bias. In practical applications, because trajectories are incomplete, a truncated GAE is typically used for estimation, which leads to a large bias in the estimation process. To address this problem, we propose using only a part of the truncated GAE when computing updates, which greatly reduces the bias caused by incomplete trajectories. We conduct experiments in MuJoCo and $\mu$RTS to study the effect of different partial coefficients and sampling lengths on the results. We show that our partial GAE approach achieves better empirical results in both environments.
URL
https://arxiv.org/abs/2301.10920