Abstract
The Distributionally Robust Markov Decision Process (DRMDP) is a popular framework for addressing dynamics shift in reinforcement learning by learning policies that are robust to the worst-case transition dynamics within a constrained set. However, solving its dual optimization oracle poses significant challenges, limiting both theoretical analysis and computational efficiency. The recently proposed Robust Regularized Markov Decision Process (RRMDP) replaces the uncertainty-set constraint with a regularization term on the value function, offering improved scalability and theoretical insight. Yet existing RRMDP methods rely on unstructured regularization, which often yields overly conservative policies because it accounts for unrealistic transitions. To address these issues, we propose a novel framework, the $d$-rectangular linear robust regularized Markov decision process ($d$-RRMDP), which introduces a linear latent structure into both the transition kernels and the regularization term. For the offline RL setting, where an agent learns robust policies from a dataset pre-collected in the nominal environment, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI), that employ linear function approximation and $f$-divergence-based regularization of the transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, showing that these bounds depend on how well the dataset covers the state-action space visited by the optimal robust policy under robustly admissible transitions. Information-theoretic lower bounds further show that this coverage term is fundamental to $d$-RRMDPs. Finally, numerical experiments validate that R2PVI learns robust policies and is computationally more efficient than methods for constrained DRMDPs.
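To make the algorithmic idea concrete, here is a minimal, hypothetical Python sketch of a pessimistic robust-regularized backup in the spirit of R2PVI. It is not the authors' implementation: the names `tv_regularized_backup` and `r2pvi_sketch`, the ridge estimator of the linear transition factors, the crude simplex projection, and the covariance-based pessimism bonus are illustrative assumptions, and total-variation regularization is used only as one concrete instance of an $f$-divergence penalty.

```python
import numpy as np

def tv_regularized_backup(V, p0, lam):
    """Regularized robust expectation  inf_p [ <p, V> + lam * TV(p, p0) ].

    Under total-variation regularization this infimum has the closed form
    E_{p0}[ min(V(s'), min_s V(s) + lam) ]: probability mass is shifted to
    the worst state only when the gain exceeds the per-unit penalty lam.
    """
    return float(p0 @ np.minimum(V, V.min() + lam))

def r2pvi_sketch(phi, dataset, r, lam, beta, H):
    """Hypothetical pessimistic value iteration with linear features.

    phi     : (n_states, n_actions, d) feature map
    dataset : iterable of (s, a, s_next) transitions from the nominal MDP
    r       : (n_states, n_actions) reward table
    lam     : strength of the robust regularization penalty
    beta    : pessimism bonus coefficient
    H       : horizon
    """
    n_states, n_actions, d = phi.shape

    # Ridge-regression estimate of the nominal transition factors, using
    # the d-rectangular linear structure P0(s'|s,a) ~ phi(s,a)^T mu(s').
    Lambda = np.eye(d)
    targets = np.zeros((d, n_states))
    for s, a, s_next in dataset:
        f = phi[s, a]
        Lambda += np.outer(f, f)
        targets[:, s_next] += f
    Lambda_inv = np.linalg.inv(Lambda)
    mu_hat = Lambda_inv @ targets                      # (d, n_states)

    V = np.zeros(n_states)
    policy = np.zeros((H, n_states), dtype=int)
    for h in reversed(range(H)):
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                # Estimated nominal next-state distribution, crudely
                # projected back onto the simplex for this sketch.
                p_hat = np.clip(phi[s, a] @ mu_hat, 0.0, None)
                p_hat = p_hat / p_hat.sum() if p_hat.sum() > 0 else np.full(n_states, 1.0 / n_states)

                robust_term = tv_regularized_backup(V, p_hat, lam)
                # Pessimism: penalize poorly covered (s, a) pairs via the
                # empirical covariance matrix of the offline features.
                bonus = beta * np.sqrt(phi[s, a] @ Lambda_inv @ phi[s, a])
                Q[s, a] = np.clip(r[s, a] + robust_term - bonus, 0.0, H - h)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V
```

The point of the sketch is that, with this choice of regularizer, the robust-regularized backup admits a simple closed form rather than requiring an inner dual optimization over an uncertainty set, which is the computational advantage the abstract highlights over constrained DRMDPs.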
URL
https://arxiv.org/abs/2411.18612