Paper Reading AI Learner

Diverse Randomized Value Functions: A Provably Pessimistic Approach for Offline Reinforcement Learning

2024-04-09 10:15:18
Xudong Yu, Chenjia Bai, Hongyi Guo, Changhong Wang, Zhen Wang

Abstract

Offline Reinforcement Learning (RL) faces distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions. To address this, existing uncertainty-based methods penalize the value function with uncertainty quantification but demand numerous ensemble networks, which poses computational challenges and yields suboptimal outcomes. In this paper, we introduce a novel strategy that employs diverse randomized value functions to estimate the posterior distribution of $Q$-values. This provides robust uncertainty quantification and estimates lower confidence bounds (LCB) of $Q$-values. By applying moderate value penalties for OOD actions, our method fosters a provably pessimistic approach. We also emphasize diversity within the randomized value functions and enhance efficiency by introducing a diversity regularization method, reducing the requisite number of networks. Together, these modules enable reliable value estimation and efficient policy learning from offline data. Theoretical analysis shows that our method recovers the provably efficient LCB penalty under linear MDP assumptions. Extensive empirical results also demonstrate that our proposed method significantly outperforms baseline methods in both performance and parametric efficiency.
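The core idea the abstract describes — treating disagreement across an ensemble of randomized value estimates as uncertainty, and penalizing the $Q$-value down to a lower confidence bound — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `lcb_q_value` and the penalty weight `beta` are hypothetical, and a real method would use learned $Q$-networks rather than raw numbers.

```python
import numpy as np

def lcb_q_value(q_estimates, beta=1.0):
    """Lower confidence bound of Q: ensemble mean minus beta times the
    ensemble standard deviation (the pessimism penalty). `beta` is an
    illustrative hyperparameter controlling the penalty strength."""
    q = np.asarray(q_estimates, dtype=float)
    return q.mean(axis=0) - beta * q.std(axis=0)

# An agreeing ensemble (in-distribution action) incurs a small penalty,
# while a disagreeing ensemble (OOD action) is penalized heavily:
in_dist_lcb = lcb_q_value([10.0, 10.1, 9.9])
ood_lcb = lcb_q_value([10.0, 2.0, 18.0])
```

Because the OOD estimates disagree widely, their standard deviation is large and the LCB drops far below the mean, which is the pessimism the abstract refers to; the diversity regularization mentioned above aims to keep this disagreement informative with fewer networks.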


URL

https://arxiv.org/abs/2404.06188

PDF

https://arxiv.org/pdf/2404.06188.pdf

