In goal-conditioned reinforcement learning (GCRL), sparse rewards present significant challenges, often obstructing efficient learning. Although multi-step GCRL can improve learning efficiency, it can also introduce off-policy biases into target values. This paper examines these biases in depth, classifying them into two distinct categories: "shooting" and "shifting". Recognizing that certain behavior policies can accelerate policy refinement, we propose solutions that exploit the beneficial aspects of these biases while mitigating their drawbacks, enabling larger step sizes to speed up GCRL. An empirical study demonstrates that our approach yields a resilient and robust improvement, even in ten-step learning scenarios, achieving superior learning efficiency and performance that generally surpass the baseline and several state-of-the-art multi-step GCRL benchmarks.