Abstract
Safety is a critical hurdle that limits the application of deep reinforcement learning (RL) to real-world control tasks. To address this, constrained reinforcement learning leverages cost functions to improve safety in constrained Markov decision processes. However, such constrained RL methods fail to achieve zero violation even when the cost limit is zero. We analyze the reasons for this failure; the analysis suggests that a properly designed cost function plays an important role in constrained RL. Inspired by this analysis, we propose AutoCost, a simple yet effective framework that automatically searches for cost functions that help constrained RL achieve zero-violation performance. We validate the proposed method and the searched cost functions on the safe RL benchmark Safety Gym. We compare augmented agents that use our cost function to provide additive intrinsic costs against baseline agents that use the same policy learners but only extrinsic costs. Results show that, across all environments, the converged policies trained with intrinsic costs achieve zero constraint violation and performance comparable to the baselines.
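The additive intrinsic-cost idea described above can be illustrated with a minimal sketch. All names and the intrinsic cost's form here are hypothetical stand-ins: in the paper, AutoCost searches for the actual intrinsic cost function, whereas this sketch just hard-codes a toy parameterized function to show how the two cost signals combine.

```python
def intrinsic_cost(obs, params):
    """Hypothetical searched cost: a toy parameterized function of the
    observation (stand-in for whatever function AutoCost discovers)."""
    return max(0.0, sum(p * x for p, x in zip(params, obs)))

def augmented_cost(extrinsic_cost, obs, params):
    """Cost signal fed to the constrained RL learner:
    environment (extrinsic) cost plus additive intrinsic cost."""
    return extrinsic_cost + intrinsic_cost(obs, params)

# Example: near a hazard the extrinsic cost may still be 0, but the
# intrinsic term raises the training cost, nudging the policy away
# before a violation actually occurs.
obs = [0.8, 0.1]      # toy observation features (e.g. hazard proximity)
params = [0.5, 0.0]   # toy intrinsic-cost parameters
print(augmented_cost(0.0, obs, params))
```

The point of the additive form is that the policy learner itself is unchanged; only the cost signal it constrains against is shaped, which is what allows the same baseline learners to be reused in the comparison.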
URL
https://arxiv.org/abs/2301.10339