Paper Reading AI Learner

Efficient Trust Region-Based Safe Reinforcement Learning with Low-Bias Distributional Actor-Critic

2023-01-26 04:05:40
Dohyeong Kim, Kyungjae Lee, Songhwai Oh

Abstract

To apply reinforcement learning (RL) to real-world applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle such guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method which can consistently satisfy constraints. However, policies may still violate the safety guidelines due to the estimation bias of distributional critics, and the importance sampling required by the trust region method can hinder performance due to its large variance. Hence, we improve safety performance through the following approaches. First, we train distributional critics to have low estimation bias using proposed target distributions in which bias and variance can be traded off. Second, we propose novel surrogates for the trust region method, expressed with Q-functions via the reparameterization trick. Additionally, depending on the initial policy, there may be no policy within the trust region that satisfies the constraints. To handle this infeasibility issue, we propose a gradient integration method which is guaranteed to find a policy satisfying all constraints starting from an unsafe initial policy. In extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods.
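
Note: the abstract describes the method only at a high level, so the snippet below is a rough, hypothetical sketch (PyTorch-style Python, not the authors' code) of two ingredients it names: a policy surrogate expressed with Q-functions via the reparameterization trick, and a simple stand-in for the gradient-integration recovery step. The names mu_net, log_std_net, q_reward, q_cost, and cost_limit are assumed for illustration only.

import torch

def reparameterized_surrogates(states, mu_net, log_std_net, q_reward, q_cost, cost_limit):
    # Reparameterization trick: a = mu(s) + std(s) * eps with eps ~ N(0, I),
    # so gradients flow from the critics into the policy parameters
    # without importance sampling.
    mu, log_std = mu_net(states), log_std_net(states)
    eps = torch.randn_like(mu)
    actions = mu + log_std.exp() * eps

    objective = q_reward(states, actions).mean()              # expected-return surrogate
    constraint = q_cost(states, actions).mean() - cost_limit  # feasible when <= 0
    return -objective, constraint                             # (policy loss, constraint value)

def recovery_direction(violated_constraint_grads):
    # Crude placeholder for the paper's gradient integration: average the gradients
    # of the violated constraints to get one descent direction that tends to reduce
    # them all (the actual method computes a direction with a feasibility guarantee).
    return -torch.stack(violated_constraint_grads).mean(dim=0)

In training, the policy loss would be minimized subject to the constraint value being non-positive within a trust region, and recovery_direction would only be used while the current policy is infeasible.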

Abstract (translated)

将强化学习(RL)应用于实际问题时,智能体需要遵守相应领域的安全准则。安全强化学习可以把这些准则转化为强化学习问题中的约束,从而有效地加以处理。在本文中,我们基于信赖域方法开发了一种能够持续满足约束的安全分布强化学习方法。然而,由于分布评论家(distributional critics)的估计偏差,策略可能无法满足安全准则,而信赖域方法所需的重要性采样也会因其较大的方差而影响性能。因此,我们通过以下方法提升安全性能。首先,我们利用所提出的、可在偏差与方差之间进行权衡的目标分布来训练分布评论家,使其具有较低的估计偏差。其次,我们利用重参数化技巧,提出了以 Q 函数表示的新的信赖域替代目标。此外,取决于初始策略的设置,信赖域内可能不存在满足约束的策略。为了解决这一不可行问题,我们提出了一种梯度积分方法,保证能够从不安全的初始策略出发找到满足所有约束的策略。大量实验表明,与现有的安全强化学习方法相比,所提出的带有风险规避约束的方法在实现高回报的同时约束违反最少。

URL

https://arxiv.org/abs/2301.10923

PDF

https://arxiv.org/pdf/2301.10923.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot