Mirror Learning: A Unifying Framework of Policy Optimisation

2022-01-07 09:16:03

Jakub Grudzien Kuba, Christian Schroeder de Witt, Jakob Foerster

arXiv_AI Reinforcement_Learning Pose

Abstract

General policy improvement (GPI) and trust-region learning (TRL) are the predominant frameworks within contemporary reinforcement learning (RL), which serve as the core models for solving Markov decision processes (MDPs). Unfortunately, in their mathematical form, they are sensitive to modifications, and thus, the practical instantiations that implement them do not automatically inherit their improvement guarantees. As a result, the spectrum of available rigorous MDP-solvers is narrow. Indeed, many state-of-the-art (SOTA) algorithms, such as TRPO and PPO, are not proven to converge. In this paper, we propose mirror learning, a general solution to the RL problem. We reveal GPI and TRL to be but small points within this far greater space of algorithms which boasts the monotonic improvement property and converges to the optimal policy. We show that virtually all SOTA algorithms for RL are instances of mirror learning, and thus suggest that their empirical performance is a consequence of their theoretical properties, rather than of approximate analogies. Excitingly, we show that mirror learning opens up a whole new space of policy learning methods with convergence guarantees.

URL

https://arxiv.org/abs/2201.02373

PDF

https://arxiv.org/pdf/2201.02373.pdf