Abstract
Continuous-time Markov decision processes (CTMDPs) are canonical models for sequential decision-making in dense-time, stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm of choice for computing optimal decision sequences. RL, however, requires the learning objective to be encoded as scalar reward signals. Since writing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or as automata) into scalar rewards for discrete-time Markov decision processes (MDPs). Unfortunately, no such automatic translation exists for CTMDPs. We consider CTMDP environments with learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in the popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics for omega-regular objectives: 1) satisfaction semantics, where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics, where the goal of the learner is to optimize the long-run expected average time spent in the "good states" of the automaton. We present an approach that enables a correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on popular CTMDP benchmarks with omega-regular objectives.
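For intuition, both semantics can be phrased as limit objectives over the time the product of the CTMDP and the objective automaton spends in its good (accepting) states. The sketch below uses assumed notation rather than the paper's exact definitions: $X_t$ denotes the product state at dense time $t$ under a policy $\pi$, and $\mathit{Acc}$ the set of good states.

% Illustrative formalization only; notation ($X_t$, $\mathit{Acc}$, $\pi$) is assumed.
% Expectation semantics: long-run expected average time spent in good states.
\[
  \sup_{\pi}\ \liminf_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}^{\pi}\!\left[\int_{0}^{T} \mathbf{1}\{X_t \in \mathit{Acc}\}\, dt\right]
\]
% Satisfaction semantics: probability of spending positive time in good states,
% sketched here as a positive long-run fraction of time.
\[
  \sup_{\pi}\ \Pr^{\pi}\!\left[\liminf_{T \to \infty} \frac{1}{T}
  \int_{0}^{T} \mathbf{1}\{X_t \in \mathit{Acc}\}\, dt \;>\; 0\right]
\]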
URL
https://arxiv.org/abs/2303.09528