Abstract
To apply reinforcement learning (RL) to real-world applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle such guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may not meet the safety guidelines due to the estimation bias of distributional critics, and the importance sampling required for the trust region method can hinder performance due to its significant variance. Hence, we enhance safety performance through the following approaches. First, we train distributional critics to have low estimation biases using proposed target distributions in which bias and variance can be traded off. Second, we propose novel surrogates for the trust region method, expressed with Q-functions using the reparameterization trick. Additionally, depending on the initial policy setting, there may be no policy satisfying the constraints within a trust region. To handle this infeasibility issue, we propose a gradient integration method that is guaranteed to find a policy satisfying all constraints starting from an unsafe initial policy. In extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods.
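As a rough illustration of the reparameterization-trick idea mentioned in the abstract, the sketch below shows how a policy can be updated directly through a Q-function instead of through importance-sampled ratios. This is a minimal, generic sketch under assumptions (a Gaussian policy with tanh squashing, and the hypothetical `policy` and `q_critic` callables), not the paper's actual surrogate objective.

```python
# Minimal sketch of a reparameterization-trick policy objective (assumptions,
# not the paper's exact surrogate): the action is expressed as a deterministic
# function of the state and independent noise, so gradients flow from the
# critic into the policy parameters without importance sampling.
import torch

def reparam_surrogate(policy, q_critic, states):
    # Assumption: policy(states) returns the mean and log-std of a Gaussian.
    mean, log_std = policy(states)
    eps = torch.randn_like(mean)                       # noise independent of policy parameters
    actions = torch.tanh(mean + eps * log_std.exp())   # reparameterized, squashed action sample
    # Maximizing this quantity pushes the policy toward high-Q actions;
    # the gradient passes through `actions` into `mean` and `log_std`.
    return q_critic(states, actions).mean()
```

In importance-sampling surrogates, the objective weights advantages by probability ratios between the new and old policies, which can have high variance; a reparameterized objective like the one sketched above avoids those ratios by differentiating through sampled actions, at the cost of requiring a differentiable critic.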
URL
https://arxiv.org/abs/2301.10923