Abstract
To plan safely in uncertain environments, agents must balance utility with safety constraints. Safe planning problems can be modeled as a chance-constrained partially observable Markov decision process (CC-POMDP) and solutions often use expensive rollouts or heuristics to estimate the optimal value and action-selection policy. This work introduces the ConstrainedZero policy iteration algorithm that solves CC-POMDPs in belief space by learning neural network approximations of the optimal value and policy with an additional network head that estimates the failure probability given a belief. This failure probability guides safe action selection during online Monte Carlo tree search (MCTS). To avoid overemphasizing search based on the failure estimates, we introduce $\Delta$-MCTS, which uses adaptive conformal inference to update the failure threshold during planning. The approach is tested on a safety-critical POMDP benchmark, an aircraft collision avoidance system, and the sustainability problem of safe CO$_2$ storage. Results show that by separating safety constraints from the objective we can achieve a target level of safety without optimizing the balance between rewards and costs.
Abstract (translated)
To plan safely in uncertain environments, agents must balance utility with safety constraints. Safe planning problems can be modeled as a chance-constrained partially observable Markov decision process (CC-POMDP), and solutions often use expensive rollouts or heuristics to estimate the optimal value and action-selection policy. This work introduces the ConstrainedZero policy iteration algorithm, which solves CC-POMDPs in belief space by learning neural network approximations of the optimal value and policy, with an additional network head that estimates the failure probability given a belief. This failure probability guides safe action selection during online Monte Carlo tree search (MCTS). To avoid overemphasizing search based on the failure estimates, we introduce $\Delta$-MCTS, which uses adaptive conformal inference to update the failure threshold during planning. The approach is tested on a safety-critical POMDP benchmark, an aircraft collision avoidance system, and the sustainability problem of safe CO$_2$ storage. Results show that by separating safety constraints from the objective, we can achieve a target level of safety without optimizing the balance between rewards and costs.
URL
https://arxiv.org/abs/2405.00644
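The abstract's core mechanism, adapting a failure threshold via adaptive conformal inference during planning, can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the learning rate, and the error indicator are illustrative assumptions based on the standard adaptive conformal inference update (threshold moves toward the target violation rate).

```python
def update_threshold(delta, target_alpha, failure_estimate, lr=0.01):
    """Adaptive-conformal-style update of a failure threshold (illustrative sketch).

    delta: current failure-probability threshold used to gate actions
    target_alpha: desired tolerated failure rate (the safety target)
    failure_estimate: the network head's estimated failure probability for a belief
    lr: step size of the adaptation (hypothetical default)
    """
    # Record a "violation" when the estimated failure probability exceeds
    # the current threshold; otherwise record no violation.
    err = 1.0 if failure_estimate > delta else 0.0
    # Standard adaptive conformal inference step: if violations occur more
    # often than target_alpha, the threshold tightens; otherwise it relaxes.
    return delta + lr * (target_alpha - err)


# Example: a high failure estimate tightens the threshold,
# a low one relaxes it slightly toward the safety target.
tightened = update_threshold(delta=0.10, target_alpha=0.05, failure_estimate=0.20)
relaxed = update_threshold(delta=0.10, target_alpha=0.05, failure_estimate=0.02)
```

Repeating this update at each planning step drives the long-run violation frequency toward `target_alpha`, which is how a target safety level can be met without hand-tuning a reward-cost trade-off.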