Paper Reading AI Learner

Joint action loss for proximal policy optimization

2023-01-26 03:42:29
Xiulei Song, Yizhao Jin, Greg Slabaugh, Simon Lucas

Abstract

PPO (Proximal Policy Optimization) is a state-of-the-art policy gradient algorithm that has been successfully applied to complex computer games such as Dota 2 and Honor of Kings. In these environments, an agent makes compound actions consisting of multiple sub-actions. PPO uses clipping to restrict policy updates. Although clipping is simple and effective, it makes inefficient use of samples. For compound actions, most PPO implementations consider the joint probability (density) of the sub-actions, which means that if the ratio of a sample (state, compound-action pair) falls outside the clip range, the sample produces zero gradient. Instead, we calculate the loss separately for each sub-action, which is less prone to clipping during updates and therefore makes better use of samples. Furthermore, we propose a multi-action mixed loss that combines the joint and separate probabilities. We perform experiments in Gym-$\mu$RTS and MuJoCo. Our hybrid model improves performance by more than 50\% in different MuJoCo environments compared to OpenAI's PPO benchmark results. In Gym-$\mu$RTS, we find that the sub-action loss outperforms the standard PPO approach, especially when the clip range is large. Our findings suggest that this method can better balance the use-efficiency and quality of samples.
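The abstract contrasts the standard joint-ratio PPO clip loss with a per-sub-action variant and a mixture of the two. Below is a minimal PyTorch sketch of these three losses, assuming factorized sub-action log-probabilities of shape (batch, n_sub_actions) and a shared advantage per compound action; the function names and the mixing weight `alpha` are illustrative and not taken from the paper.

```python
import torch

def ppo_joint_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Standard PPO: sum sub-action log-probs to get the ratio of the
    # joint (compound) action, then apply a single clipped objective.
    ratio = torch.exp(logp_new.sum(-1) - logp_old.sum(-1))          # (batch,)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def ppo_sub_action_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Per-sub-action variant: one clipped ratio per sub-action, so a sample
    # is only partially clipped when a single sub-action ratio is out of range.
    ratio = torch.exp(logp_new - logp_old)                           # (batch, n_sub)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    adv = advantages.unsqueeze(-1)                                   # broadcast over sub-actions
    return -torch.min(ratio * adv, clipped * adv).mean()

def ppo_mixed_loss(logp_new, logp_old, advantages, clip_eps=0.2, alpha=0.5):
    # Hypothetical mixed loss: a convex combination of the joint and
    # per-sub-action objectives; `alpha` is an assumed mixing weight.
    return (alpha * ppo_joint_loss(logp_new, logp_old, advantages, clip_eps)
            + (1 - alpha) * ppo_sub_action_loss(logp_new, logp_old, advantages, clip_eps))
```

Under these assumptions, the per-sub-action form keeps a gradient flowing through the in-range sub-actions of a sample whose joint ratio would otherwise be clipped to zero gradient.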

URL

https://arxiv.org/abs/2301.10919

PDF

https://arxiv.org/pdf/2301.10919.pdf

