Abstract
Temporal logics, such as linear temporal logic (LTL), offer a precise means of specifying tasks for (deep) reinforcement learning (RL) agents. In our work, we consider the setting where the task is specified by an LTL objective and there is an additional scalar reward that we must also optimize. Previous works either focus solely on learning an LTL-satisfying policy or are restricted to finite state spaces. We make two contributions: First, we introduce an RL-friendly approach to this setting by formulating the problem as a single optimization objective. Our formulation guarantees that an optimal policy will be reward-maximal among the set of policies that maximize the likelihood of satisfying the LTL specification. Second, we address the reward sparsity that often arises for LTL-guided deep RL policies by introducing Cycle Experience Replay (CyclER), a technique that automatically guides RL agents towards satisfying an LTL specification. Our experiments demonstrate the efficacy of CyclER in finding performant deep RL policies in both continuous and discrete experimental domains.
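For concreteness, the guarantee in the first contribution can be read as a lexicographic objective. The following is a minimal LaTeX sketch under assumed notation; the symbols $\varphi$, $\Pr_{\pi}[\varphi]$, and $J_{R}$ are ours, not necessarily the paper's:

```latex
\documentclass{article}
\usepackage{amsmath}
\DeclareMathOperator*{\argmax}{arg\,max}
\begin{document}
\[
  \pi^{*} \;\in\; \argmax_{\pi \in \Pi_{\varphi}} J_{R}(\pi),
  \qquad
  \Pi_{\varphi} \;=\; \argmax_{\pi}\; \Pr_{\pi}[\varphi],
\]
where $\varphi$ is the LTL specification, $\Pr_{\pi}[\varphi]$ is the
probability that trajectories under policy $\pi$ satisfy $\varphi$, and
$J_{R}(\pi)$ is the expected scalar return.
\end{document}
```

The abstract's single-objective formulation presumably collapses this two-level problem into one optimizable quantity; the sketch above only states the property that the optimum must satisfy.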
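The sparsity issue CyclER targets is easiest to see in code: with a raw LTL objective, an agent receives a nonzero signal only when an entire satisfying trace is produced. A common mitigation in the LTL-guided RL literature is to track the specification with an automaton and reward automaton progress densely. The Python sketch below illustrates that general idea only; it is not the paper's CyclER algorithm, and all names (`BuchiAutomaton`, `shaped_reward`, the toy spec) are hypothetical:

```python
# Generic illustration of automaton-progress reward shaping for an LTL
# objective. NOT the paper's CyclER algorithm; a sketch of the sparsity
# fix the abstract alludes to, under assumptions noted in the comments.

from dataclasses import dataclass


@dataclass
class BuchiAutomaton:
    """Toy Buchi automaton tracking an LTL spec over atomic labels."""
    transitions: dict  # transitions[state][label] -> next state
    accepting: set     # accepting automaton states
    state: int = 0     # current automaton state

    def step(self, label: str) -> int:
        # Missing labels self-loop (a simplifying assumption).
        self.state = self.transitions[self.state].get(label, self.state)
        return self.state


def shaped_reward(automaton: BuchiAutomaton, prev_state: int) -> float:
    """Densify the sparse satisfaction signal: small reward for any
    automaton progress, larger reward on reaching an accepting state."""
    if automaton.state in automaton.accepting:
        return 1.0
    if automaton.state != prev_state:
        return 0.1
    return 0.0


if __name__ == "__main__":
    # Automaton for "see a, then b, repeatedly" (accepting state 0).
    aut = BuchiAutomaton(
        transitions={0: {"a": 1}, 1: {"b": 0}},
        accepting={0},
    )
    trace = ["a", "b", "a", "a", "b"]  # labels observed from env states
    total = 0.0
    for label in trace:
        prev = aut.state
        aut.step(label)
        total += shaped_reward(aut, prev)
    print(f"shaped return over trace: {total:.1f}")
```

In a full agent, the environment state and automaton state would be combined into a product state, and the shaped reward would augment or replace the sparse LTL signal during training.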
URL
https://arxiv.org/abs/2404.11578