Paper Reading AI Learner

Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence

2023-01-27 00:31:51
Lingwei Zhu, Zheng Chen, Takamitsu Matsubara, Martha White

Abstract

Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence -- called the Tsallis KL divergence -- which uses the $q$-logarithm in the definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
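For concreteness, here is a minimal NumPy sketch of the $q$-logarithm and the Tsallis KL (Tsallis relative entropy) it induces, using the standard textbook definitions $\ln_q(x) = (x^{1-q} - 1)/(1-q)$ and $D_q(\pi \| \mu) = \mathbb{E}_{a \sim \pi}[-\ln_q(\mu(a)/\pi(a))]$. The function names and the sanity check are illustrative assumptions, not code from the paper, and the paper's exact convention may differ:

```python
import numpy as np

def q_log(x, q):
    """q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q); recovers ln(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (np.power(x, 1.0 - q) - 1.0) / (1.0 - q)

def tsallis_kl(pi, mu, q):
    """Tsallis KL divergence D_q(pi || mu) = E_{a~pi}[ -ln_q( mu(a) / pi(a) ) ].
    For q = 1 this reduces to the standard KL divergence."""
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(pi * -q_log(mu / pi, q)))

# Sanity check (illustrative): q = 1 matches the standard KL divergence.
pi = np.array([0.7, 0.2, 0.1])
mu = np.array([0.5, 0.3, 0.2])
print(tsallis_kl(pi, mu, 1.0))          # standard KL(pi || mu)
print(np.sum(pi * np.log(pi / mu)))     # same value
print(tsallis_kl(pi, mu, 2.0))          # Tsallis KL with q = 2
```

Setting $q = 1$ recovers the usual KL-regularized objective, while $q > 1$ changes how strongly the divergence penalizes probability ratios, which is the design axis the abstract refers to.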

URL

https://arxiv.org/abs/2301.11476

PDF

https://arxiv.org/pdf/2301.11476.pdf