Abstract
Partially Observable Markov Decision Processes (POMDPs) can model complex sequential decision-making problems in stochastic and uncertain environments. A main obstacle to their broad adoption in real-world applications is the unavailability of a suitable POMDP model, or of a simulator thereof. Available solution algorithms, such as Reinforcement Learning (RL), require knowledge of the transition dynamics and of the observation-generating process, which are often unknown and non-trivial to infer. In this work, we propose a combined framework for the inference and robust solution of POMDPs via deep RL. First, all transition and observation model parameters are jointly inferred via Markov Chain Monte Carlo sampling of a hidden Markov model that is conditioned on actions, in order to recover full posterior distributions from the available data. The POMDP with uncertain parameters is then solved via deep RL techniques, with the parameter distributions incorporated into the solution via domain randomization, so as to develop policies that are robust to model uncertainty. As a further contribution, we compare the use of transformers and long short-term memory networks, which constitute model-free RL solutions, against a hybrid model-based/model-free approach. We apply these methods to the real-world problem of optimal maintenance planning for railway assets.
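The domain-randomization idea described in the abstract can be sketched minimally: at each training episode, draw one posterior sample of the uncertain transition parameters and roll out the sampled dynamics, so the (omitted) RL agent is exposed to the full parameter uncertainty. Everything below is an illustrative assumption, not the paper's method: the 2-state deterioration model, the Beta-shaped posterior standing in for MCMC samples, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for MCMC posterior samples of a degradation probability
# (in the paper these would come from the action-conditioned HMM inference).
posterior_samples = rng.beta(2.0, 8.0, size=500)

def sample_pomdp_params():
    """Domain randomization: draw one posterior sample per training episode."""
    p = rng.choice(posterior_samples)
    # Hypothetical 2-state transition matrix (healthy -> degraded) under "do nothing";
    # the degraded state is absorbing.
    T = np.array([[1.0 - p, p],
                  [0.0,     1.0]])
    return T

def run_episode(T, horizon=10):
    """Roll out the sampled dynamics from the healthy state; return state-visit counts."""
    s = 0
    visits = np.zeros(2, dtype=int)
    for _ in range(horizon):
        visits[s] += 1
        s = rng.choice(2, p=T[s])
    return visits

# Each episode would train the agent against a different parameter draw,
# pushing the learned policy toward robustness to posterior uncertainty.
counts = sum(run_episode(sample_pomdp_params()) for _ in range(100))
```

The key design point is that randomization happens across episodes rather than within one, so each rollout is internally consistent with a single plausible model.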
URL
https://arxiv.org/abs/2307.08082