Paper Reading AI Learner

Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization

2024-11-27 18:57:03
Cheng Tang, Zhishuai Liu, Pan Xu

Abstract

The Distributionally Robust Markov Decision Process (DRMDP) is a popular framework for addressing dynamics shift in reinforcement learning by learning policies robust to the worst-case transition dynamics within a constrained set. However, solving its dual optimization oracle poses significant challenges, limiting theoretical analysis and computational efficiency. The recently proposed Robust Regularized Markov Decision Process (RRMDP) replaces the uncertainty set constraint with a regularization term on the value function, offering improved scalability and theoretical insights. Yet, existing RRMDP methods rely on unstructured regularization, often leading to overly conservative policies that account for unrealistic transitions. To address these issues, we propose a novel framework, the $d$-rectangular linear robust regularized Markov decision process ($d$-RRMDP), which introduces a linear latent structure into both transition kernels and regularization. For the offline RL setting, where an agent learns robust policies from a pre-collected dataset in the nominal environment, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI), employing linear function approximation and $f$-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, showing that these bounds depend on how well the dataset covers the state-action space visited by the optimal robust policy under robustly admissible transitions. This coverage term is further shown to be fundamental to $d$-RRMDPs via information-theoretic lower bounds. Finally, numerical experiments validate that R2PVI learns robust policies and is computationally more efficient than methods for constrained DRMDPs.
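
To make the regularized formulation concrete, the following is an illustrative sketch (not taken from the paper) of how an $f$-divergence penalty can replace the uncertainty-set constraint. Here $P^0_h$ denotes the nominal transition kernel, $V$ a value function, $\lambda > 0$ the regularization weight, $D_f$ an $f$-divergence, and the assumed $d$-rectangular linear structure factors the nominal kernel as $P^0_h(s' \mid s,a) = \sum_{i=1}^{d} \phi_i(s,a)\, \mu^0_{h,i}(s')$ with feature map $\phi$ and factor measures $\mu^0_{h,i}$; all notation here is assumed for illustration. A constrained DRMDP evaluates the worst case over an uncertainty set,
$$\inf_{P_h \in \mathcal{U}^{\rho}(P^0_h)} \ \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\big[V(s')\big],$$
whereas a regularized RRMDP removes the constraint and instead penalizes deviation from the nominal kernel,
$$\inf_{P_h} \Big\{ \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\big[V(s')\big] + \lambda\, D_f\big(P_h(\cdot \mid s,a) \,\|\, P^0_h(\cdot \mid s,a)\big) \Big\}.$$
Under the $d$-rectangular linear structure, such a penalized infimum can be taken factor by factor,
$$\sum_{i=1}^{d} \phi_i(s,a)\, \inf_{\mu_{h,i}} \Big\{ \langle \mu_{h,i}, V \rangle + \lambda\, D_f\big(\mu_{h,i} \,\|\, \mu^0_{h,i}\big) \Big\},$$
which is the kind of per-factor regularization the $d$-RRMDP framework is built around; the exact operator used by R2PVI may differ in its details.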

URL

https://arxiv.org/abs/2411.18612

PDF

https://arxiv.org/pdf/2411.18612.pdf

