Paper Reading AI Learner

Safe Posterior Sampling for Constrained MDPs with Bounded Constraint Violation

2023-01-27 06:18:25
Krishna C Kalagarla, Rahul Jain, Pierluigi Nuzzo

Abstract

Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications. However, the model is often unknown and must be learned online while still ensuring the constraint is met, or at least the violation is bounded with time. Some recent papers have made progress on this very challenging problem but either need unsatisfactory assumptions such as knowledge of a safe policy, or have high cumulative regret. We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds as well as empirically. The algorithm achieves an efficient tradeoff between exploration and exploitation by use of the posterior sampling principle, and provably suffers only bounded constraint violation by leveraging the idea of pessimism. Our approach is based on a primal-dual approach. We establish a sub-linear $\tilde{\mathcal{ O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K} \right)$ upper bound on the Bayesian reward objective regret along with a bounded, i.e., $\tilde{\mathcal{O}}\left(1\right)$ constraint violation regret over $K$ episodes for an $|\mathcal{S}|$-state, $|\mathcal{A}|$-action and horizon $H$ CMDP.

Abstract (translated)

约束马尔可夫决策过程(CMDPs)是一种在多个目标之间进行Sequential决策的模型场景,在许多应用中变得越来越重要。然而,该模型通常未知,必须在线学习,但仍然必须保证满足约束,或者至少随着时间的推移保持有限制的违反。一些最近的论文在这方面取得了进展,但要么需要令人满意的假设,例如了解安全策略,要么累积 regret非常高。我们提出了 Safe PSRL(后采样基于RL)算法,它不需要这些假设,但仍然表现很好,无论是从理论 regret bounds 还是从经验上看。该算法通过使用后采样原则实现了探索和利用之间的高效权衡,并且通过利用悲观的想法,仅导致有限制的约束违反 regret。我们的方法是基于主对主的方法的。我们建立了一个 sub-linear $ ilde{mathcal{O}}left(H^{2.5} sqrt{|mathcal{S}|^2 |mathcal{A}| K} ight)$ 的 Bayesian 奖励 objective regret upper bound,同时还有一个有限制的,即 $ ilde{mathcal{O}}left(1 ight)$ 的约束违反 regret,对一个 $|mathcal{S}|$-state、 $|mathcal{A}|$-action 和 horizon $H$ 的 $|mathcal{S}|$-state、 $|mathcal{A}|$-action 和 horizon $H$ CMDP 的 $K$ 个样本的累计 regret。

URL

https://arxiv.org/abs/2301.11547

PDF

https://arxiv.org/pdf/2301.11547.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot