Paper Reading AI Learner

A Dual Perspective of Reinforcement Learning for Imposing Policy Constraints

2024-04-25 09:50:57
Bram De Cooman, Johan Suykens

Abstract

Model-free reinforcement learning methods lack an inherent mechanism to impose behavioural constraints on the trained policies. While certain extensions exist, they remain limited to specific types of constraints, such as value constraints with additional reward signals or visitation density constraints. In this work we try to unify these existing techniques and bridge the gap with classical optimization and control theory, using a generic primal-dual framework for value-based and actor-critic reinforcement learning methods. The obtained dual formulations turn out to be especially useful for imposing additional constraints on the learned policy, as an intrinsic relationship between such dual constraints (or regularization terms) and reward modifications in the primal is revealed. Furthermore, using this framework, we are able to introduce some novel types of constraints, which allow bounds to be imposed on the policy's action density or on costs associated with transitions between consecutive states and actions. From the adjusted primal-dual optimization problems, a practical algorithm is derived that supports various combinations of policy constraints, all of which are automatically handled throughout training using trainable reward modifications. The resulting $\texttt{DualCRL}$ method is examined in more detail and evaluated under different (combinations of) constraints on two interpretable environments. The results highlight the efficacy of the method, which ultimately provides the designer of such systems with a versatile toolbox of possible policy constraints.
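To make the abstract's central idea concrete (a dual constraint showing up in the primal as a reward modification), here is a minimal sketch of a standard Lagrangian primal-dual update for constrained RL. This is an illustrative assumption about the generic setup described above, not the authors' DualCRL implementation; all names (`dual_reward`, `cost_limit`, `lam_lr`) are hypothetical.

```python
import numpy as np

def dual_reward(reward, cost, lam):
    """Reward modification induced in the primal by a dual constraint term:
    the learner optimizes r - lam * c instead of the raw reward r."""
    return reward - lam * cost

def dual_ascent_step(lam, avg_episode_cost, cost_limit, lam_lr=0.01):
    """Projected gradient ascent on the multiplier: lam grows while the
    constraint J_c(pi) <= cost_limit is violated and shrinks otherwise."""
    lam += lam_lr * (avg_episode_cost - cost_limit)
    return max(lam, 0.0)  # Lagrange multipliers stay non-negative

# Toy training loop: any value-based or actor-critic learner would only
# ever see the modified reward; the multiplier is trained alongside it.
rng = np.random.default_rng(0)
lam, cost_limit = 0.0, 5.0
for _ in range(500):
    # Stand-in for the rollout cost statistic a real learner would report;
    # a higher lam (stronger penalty) yields a lower incurred cost.
    avg_episode_cost = 8.0 * np.exp(-0.5 * lam) + rng.normal(0.0, 0.1)
    lam = dual_ascent_step(lam, avg_episode_cost, cost_limit)

print(f"learned multiplier: {lam:.3f}")  # settles near 2*ln(8/5) ~ 0.94
print(f"modified reward for (r=1.0, c=0.3): {dual_reward(1.0, 0.3, lam):.3f}")
```

At convergence the multiplier settles where the cost constraint holds with equality, which is exactly the kind of intrinsic correspondence between dual constraints and primal reward modifications that the abstract refers to.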


URL

https://arxiv.org/abs/2404.16468

PDF

https://arxiv.org/pdf/2404.16468.pdf

