Regret Minimization Experience Replay

2021-05-15 16:08:45

Zhenghai Xue, Xu-Hui Liu, Jing-Cheng Pang, Shengyi Jiang, Feng Xu, Yang Yu

arXiv_AI

Abstract

Experience replay is widely used in deep off-policy reinforcement learning (RL) algorithms: it stores previously collected samples for later reuse. Prioritized sampling is a promising technique for making better use of these samples and improving the performance of RL agents. However, previous prioritization methods based on temporal-difference (TD) error are highly heuristic and diverge from the objective of RL. In this work, we theoretically analyze the optimal prioritization strategy that minimizes the regret of the RL policy. Our theory suggests that data with higher TD error, better on-policiness, and more corrective feedback should be assigned higher weights during sampling. Based on this theory, we propose two practical algorithms, RM-DisCor and RM-TCE. RM-DisCor is a general algorithm, and RM-TCE is a more efficient variant that relies on the temporal ordering of states. Both algorithms improve the performance of off-policy RL algorithms on challenging RL benchmarks, including MuJoCo, Atari, and Meta-World.
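
For readers unfamiliar with the TD-error-based prioritization that the abstract contrasts against, the following is a minimal sketch of proportional prioritized sampling from a replay buffer. It is not RM-DisCor or RM-TCE, whose weighting rules are not given in this abstract. The function names, the exponent `alpha`, and the constant `eps` are hypothetical choices made for illustration only.

```python
import random


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Convert absolute TD errors into sampling probabilities.

    alpha and eps are illustrative hyperparameters (assumptions),
    not values taken from the paper.
    """
    # eps keeps transitions with zero TD error from never being sampled.
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, rng=None):
    """Sample buffer indices with replacement, proportional to priority."""
    rng = rng or random.Random()
    probs = sampling_probabilities(td_errors)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)


# Toy buffer of four transitions, represented only by their TD errors.
td_errors = [0.1, 2.0, 0.5, 0.05]
probs = sampling_probabilities(td_errors)
batch = sample_indices(td_errors, batch_size=8, rng=random.Random(0))
```

In this sketch the transition with the largest absolute TD error receives the largest sampling probability. The abstract's theory suggests that on-policiness and corrective feedback should also influence the weights, which this baseline does not model.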

URL

https://arxiv.org/abs/2105.07253

PDF

https://arxiv.org/pdf/2105.07253.pdf
