Abstract
Deep reinforcement learning (DRL) has demonstrated remarkable performance in many continuous control tasks. However, a significant obstacle to the real-world application of DRL is the lack of safety guarantees. Although DRL agents can satisfy system safety in expectation through reward shaping, designing agents to consistently meet hard constraints (e.g., safety specifications) at every time step remains a formidable challenge. In contrast, existing work in the field of safe control provides guarantees on persistent satisfaction of hard safety constraints. However, these methods require explicit analytical system dynamics models to synthesize safe control, which are typically inaccessible in DRL settings. In this paper, we present a model-free safe control algorithm, the implicit safe set algorithm, for synthesizing safeguards for DRL agents that ensure provable safety throughout training. The proposed algorithm synthesizes a safety index (barrier certificate) and a subsequent safe control law solely by querying a black-box dynamic function (e.g., a digital twin simulator). Moreover, we theoretically prove that the implicit safe set algorithm guarantees finite-time convergence to the safe set and forward invariance for both continuous-time and discrete-time systems. We validate the proposed algorithm on the state-of-the-art Safety Gym benchmark, where it achieves zero safety violations while gaining $95\% \pm 9\%$ cumulative reward compared to state-of-the-art safe DRL methods. Furthermore, the resulting algorithm scales well to high-dimensional systems with parallel computing.
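To make the idea of synthesizing a safe control purely from black-box queries more concrete, here is a minimal sketch. It is not the paper's implementation: the function names (`simulate`, `safety_index`, `safeguard`), the 1-D double-integrator dynamics, the specific form of the safety index, the decrease condition `phi_next <= max(phi_now - eta, 0)`, and the uniform grid search over controls are all assumptions chosen for illustration.

```python
def simulate(state, u, dt=0.1):
    """Hypothetical black-box dynamics (stand-in for a simulator query).

    state = (d, v): distance to an obstacle and its rate of change.
    The safeguard below only calls this function; it never inspects it.
    """
    d, v = state
    return (d + v * dt, v + u * dt)


def safety_index(state, d_min=1.0, k=1.0):
    """Assumed safety index: phi <= 0 is treated as safe.

    Penalizes being closer than d_min and approaching quickly (v < 0).
    """
    d, v = state
    return d_min - d - k * v


def safeguard(state, u_ref, u_max=5.0, eta=0.05, n_samples=101):
    """Return a control close to u_ref whose simulated next state is safe."""
    phi_now = safety_index(state)

    def is_safe(u):
        # One black-box query per candidate control.
        phi_next = safety_index(simulate(state, u))
        return phi_next <= max(phi_now - eta, 0.0)

    # Keep the agent's own action whenever it already passes the check.
    if is_safe(u_ref):
        return u_ref

    # Otherwise search a uniform grid over the admissible control range.
    step = 2.0 * u_max / (n_samples - 1)
    candidates = [-u_max + step * i for i in range(n_samples)]
    safe = [u for u in candidates if is_safe(u)]
    if safe:
        # Least-intrusive correction: the safe control nearest to u_ref.
        return min(safe, key=lambda u: abs(u - u_ref))

    # No candidate passes: fall back to the one with the lowest next index.
    return min(candidates, key=lambda u: safety_index(simulate(state, u)))


if __name__ == "__main__":
    state = (2.0, -1.0)  # on the safe-set boundary, moving toward the obstacle
    u_ref = -2.0         # the agent wants to accelerate toward it
    print("reference:", u_ref, "safeguarded:", safeguard(state, u_ref))
```

Because each candidate control is evaluated independently, the queries in the grid search could in principle be issued in parallel, which is one plausible reading of how such a safeguard would scale with parallel computing.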
URL
https://arxiv.org/abs/2405.02754