Abstract
We study the finite-horizon offline reinforcement learning (RL) problem. Since actions at any state can affect next-state distributions, the related distributional shift challenges can make this problem far more statistically complex than offline policy learning for a finite sequence of stochastic contextual bandit environments. We formalize this insight by showing that the statistical hardness of offline RL instances can be measured by estimating the size of actions' impact on next-state distributions. Furthermore, this estimated impact allows us to propagate just enough value function uncertainty from future steps to avoid model exploitation, enabling us to develop algorithms that improve upon traditional pessimistic approaches for offline RL on statistically simple instances. Our approach is supported by theory and simulations.
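To make the abstract's idea of impact-scaled uncertainty propagation concrete, here is a minimal illustrative sketch for a tabular, finite-horizon setting. It is not the paper's algorithm: the `estimated_action_impact` proxy (a total-variation-style spread of the per-action next-state distributions), the 1/sqrt(n)-style bonuses, and the constants `c_reward` and `c_future` are all assumptions made purely for illustration of how future-step value uncertainty could be propagated in proportion to an estimated measure of actions' impact on next-state distributions.

```python
# Illustrative sketch only (not the authors' method): pessimistic backward
# induction in a tabular finite-horizon MDP, where the uncertainty propagated
# from step h+1 is scaled by an estimated measure of how strongly actions at
# step h move the next-state distribution.
import numpy as np

def estimated_action_impact(p_next):
    """Crude proxy (an assumption for this sketch) for actions' impact on
    next-state distributions: the maximum total-variation distance between any
    action's estimated next-state distribution and the action-averaged one.
    Returns a value in [0, 1]."""
    mean_next = p_next.mean(axis=0)                      # (S,)
    return 0.5 * np.abs(p_next - mean_next).sum(axis=1).max()

def pessimistic_backward_induction(counts, rewards, H, c_reward=1.0, c_future=1.0):
    """counts[h]: (A, S, S') next-state visitation counts from the offline data.
       rewards[h]: (S, A) estimated mean rewards.
       Returns per-step greedy policies and pessimistic state values."""
    A, S, _ = counts[0].shape
    V = np.zeros(S)                                      # terminal value at step H
    policies = [None] * H
    for h in reversed(range(H)):
        n_sa = counts[h].sum(axis=2).T                   # (S, A) state-action counts
        p_hat = counts[h] / np.maximum(counts[h].sum(axis=2, keepdims=True), 1)
        Q = np.zeros((S, A))
        for s in range(S):
            p_next = p_hat[:, s, :]                      # (A, S') per-action estimates
            impact = estimated_action_impact(p_next)     # in [0, 1]
            for a in range(A):
                n = max(n_sa[s, a], 1)
                reward_bonus = c_reward / np.sqrt(n)     # bandit-style uncertainty
                # Propagate future-value uncertainty only in proportion to the
                # estimated impact of actions on the next-state distribution.
                future_bonus = c_future * impact * (H - h) / np.sqrt(n)
                Q[s, a] = (rewards[h][s, a] - reward_bonus
                           + p_next[a] @ V - future_bonus)
        policies[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policies, V
```

In this sketch, when the estimated impact is near zero the propagated future-value penalty vanishes and only the per-step, bandit-style reward bonus remains, matching the abstract's claim that such instances are no harder than a finite sequence of stochastic contextual bandit problems; as the impact grows, the penalty approaches that of a standard, fully pessimistic approach.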
URL
https://arxiv.org/abs/2302.00284