Abstract
Constrained Markov decision processes (CMDPs) model sequential decision making with multiple objectives and are increasingly important in many applications. However, the model is often unknown and must be learned online while still ensuring that the constraint is met, or at least that the violation remains bounded over time. Some recent papers have made progress on this very challenging problem, but they either require unsatisfactory assumptions, such as knowledge of a safe policy, or incur high cumulative regret. We propose the Safe PSRL (posterior sampling-based RL) algorithm, which needs no such assumptions and yet performs very well, both in terms of theoretical regret bounds and empirically. The algorithm achieves an efficient tradeoff between exploration and exploitation via the posterior sampling principle, and provably suffers only bounded constraint violation by leveraging the idea of pessimism. Our approach builds on a primal-dual formulation. We establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K} \right)$ upper bound on the Bayesian reward objective regret, along with a bounded, i.e., $\tilde{\mathcal{O}}\left(1\right)$, constraint violation regret over $K$ episodes for an $|\mathcal{S}|$-state, $|\mathcal{A}|$-action, horizon-$H$ CMDP.
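To make the high-level recipe concrete, the following is a minimal sketch of a posterior-sampling primal-dual loop for a tabular finite-horizon CMDP. It is an illustration under stated assumptions, not the paper's actual algorithm: the synthetic CMDP, the Dirichlet posterior over transitions, the Lagrangian backward-induction planner, the pessimistic budget-tightening schedule `H / sqrt(k + 1)`, and the dual step size `eta` are all hypothetical choices made here for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- A small synthetic tabular CMDP (illustrative only; not from the paper) ---
S, A, H, K = 5, 3, 6, 2000                        # states, actions, horizon, episodes
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # true transition kernel, shape (S, A, S)
r = rng.uniform(0, 1, size=(S, A))                # per-step rewards in [0, 1]
c = rng.uniform(0, 1, size=(S, A))                # per-step costs in [0, 1]
tau = 0.5 * H                                     # episodic cost budget

def plan_lagrangian(P, lam):
    """Backward induction on the Lagrangian reward r - lam * c for model P."""
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = (r - lam * c) + P @ V                 # shape (S, A)
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

counts = np.ones((S, A, S))                       # Dirichlet(1) prior over transitions
lam, eta = 0.0, 0.05                              # dual variable and its step size

for k in range(K):
    # 1. Posterior sampling: draw transition dynamics from the current posterior.
    P_sample = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                         for s in range(S)])
    # 2. Pessimism: tighten the cost budget, with a slack that shrinks over episodes
    #    (the H / sqrt(k + 1) schedule is an assumed illustrative choice).
    tightened = tau - H / np.sqrt(k + 1)
    # 3. Primal step: plan greedily w.r.t. the Lagrangian on the sampled model.
    pi = plan_lagrangian(P_sample, lam)
    # 4. Execute the policy for one episode in the true CMDP and record transitions.
    s, ep_cost = 0, 0.0
    for h in range(H):
        a = pi[h, s]
        s_next = rng.choice(S, p=P_true[s, a])
        ep_cost += c[s, a]
        counts[s, a, s_next] += 1
        s = s_next
    # 5. Dual step: projected subgradient ascent on the constraint violation.
    lam = max(0.0, lam + eta * (ep_cost - tightened))
```

The pessimistic tightening shrinks the effective cost budget early on, when the posterior is most uncertain; this is the mechanism that lets this style of algorithm keep cumulative constraint violation bounded rather than merely sub-linear.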
URL
https://arxiv.org/abs/2301.11547