
Reinforcement Learning for Omega-Regular Specifications on Continuous-Time MDP

2023-03-16 17:45:38
Amin Falah, Shibashis Guha, Ashutosh Trivedi


Continuous-time Markov decision processes (CTMDPs) are canonical models for expressing sequential decision-making in dense-time, stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm of choice for computing an optimal decision sequence. RL, on the other hand, requires the learning objective to be encoded as scalar reward signals. Since performing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or as automata) into scalar rewards for discrete-time Markov decision processes (MDPs). Unfortunately, no automatic translation exists for CTMDPs. We consider CTMDP environments with learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in the popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics, where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics, where the goal of the learner is to optimize the long-run expected average time spent in the "good states" of the automaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on some popular CTMDP benchmarks with omega-regular objectives.
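To make the expectation semantics concrete, the sketch below simulates a toy CTMDP under a fixed policy and estimates the long-run average fraction of time spent in "good" states: the scalar reward accumulated in each step is simply the (exponentially distributed) dwell time whenever the current state is good. This is an illustrative Monte-Carlo sketch only; the states, action names, rates, and the `long_run_average` helper are all hypothetical and are not taken from the paper's construction.

```python
import random

random.seed(0)

GOOD = {"s1"}  # states labeled "good" (hypothetical labeling)
# Exit rate of the sojourn-time distribution for each (state, action) pair.
RATES = {("s0", "a"): 1.0, ("s1", "a"): 0.5}
# Successor distribution after the sojourn ends.
NEXT = {("s0", "a"): [("s0", 0.3), ("s1", 0.7)],
        ("s1", "a"): [("s0", 0.4), ("s1", 0.6)]}

def step(state, action):
    """Sample an exponential sojourn time and a successor state."""
    dwell = random.expovariate(RATES[(state, action)])
    r, acc = random.random(), 0.0
    for succ, p in NEXT[(state, action)]:
        acc += p
        if r <= acc:
            return dwell, succ
    return dwell, NEXT[(state, action)][-1][0]

def long_run_average(policy, horizon=100000):
    """Monte-Carlo estimate of the long-run average time in GOOD states."""
    total, good_time = 0.0, 0.0
    state = "s0"
    for _ in range(horizon):
        dwell, nxt = step(state, policy[state])
        total += dwell
        if state in GOOD:
            good_time += dwell  # scalar reward: dwell time in a good state
        state = nxt
    return good_time / total

policy = {"s0": "a", "s1": "a"}  # a fixed (hypothetical) stationary policy
print(round(long_run_average(policy), 3))
```

An RL agent in this setting would adjust the policy to maximize the estimate returned by `long_run_average`; the paper's contribution is a translation that makes such dwell-time rewards correct for general omega-regular objectives.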



