Abstract
Reinforcement learning (RL) is widely used for decision-making tasks, but because the agent must interact with the environment during training, its safety cannot be guaranteed in that process, which seriously limits industrial applications such as autonomous driving. Safe RL methods address this issue by constraining the expected safety-violation cost as a training objective, yet they still permit unsafe states to occur, which is unacceptable in autonomous driving tasks. Moreover, these methods struggle to balance the cost and return expectations, which degrades learning performance. In this paper, we propose a novel safe RL algorithm based on long and short-term constraints (LSTC). The short-term constraint guarantees the safety of the states the vehicle explores in the near term, while the long-term constraint ensures the overall safety of the vehicle throughout the decision-making process. In addition, we develop a safe RL method with dual-constraint optimization based on the Lagrange multiplier to optimize the training process for end-to-end autonomous driving. Comprehensive experiments were conducted on the MetaDrive simulator. The results demonstrate that, compared with state-of-the-art methods, the proposed method achieves higher safety in continuous state and action tasks and exhibits better exploration performance in long-distance decision-making tasks.
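The Lagrange-multiplier approach mentioned in the abstract can be illustrated with a minimal sketch of dual gradient ascent for a cost-constrained RL objective. This is an illustrative assumption of the standard Lagrangian formulation, not the paper's exact LSTC algorithm; the names `lagrangian_loss`, `update_multiplier`, and the rollout quantities `episode_return` / `episode_cost` are hypothetical.

```python
# Minimal sketch of Lagrangian dual updates for cost-constrained RL
# (illustrative only; not the paper's exact LSTC method).

def lagrangian_loss(episode_return, episode_cost, lam, cost_limit):
    """Scalar surrogate to minimize: maximize return while penalizing
    cost in excess of the limit, weighted by the multiplier lam."""
    return -episode_return + lam * (episode_cost - cost_limit)

def update_multiplier(lam, episode_cost, cost_limit, lr=0.05):
    """Dual ascent step: lam grows when the constraint is violated
    and shrinks (but never below zero) when it is satisfied."""
    lam += lr * (episode_cost - cost_limit)
    return max(lam, 0.0)  # the multiplier must stay non-negative

# Toy usage: episode costs above the limit drive lam up,
# tightening the safety penalty in later policy updates.
lam = 0.0
for episode_cost in [3.0, 2.5, 1.0, 0.5]:
    lam = update_multiplier(lam, episode_cost, cost_limit=1.0)
```

In practice the policy parameters are updated by gradient descent on `lagrangian_loss` while `lam` is updated by this dual-ascent rule, alternating the two until the expected cost satisfies the constraint.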
URL
https://arxiv.org/abs/2403.18209