Abstract
PPO (Proximal Policy Optimization) is a state-of-the-art policy gradient algorithm that has been successfully applied to complex computer games such as Dota 2 and Honor of Kings. In these environments, an agent makes compound actions consisting of multiple sub-actions. PPO uses clipping to restrict policy updates. Although clipping is simple and effective, it is not sample-efficient. For compound actions, most PPO implementations consider the joint probability (density) of the sub-actions, which means that if the ratio of a sample (state compound-action pair) exceeds the clip range, the gradient the sample produces is zero. Instead, we calculate the loss for each sub-action separately, which is less prone to clipping during updates and thereby makes better use of samples. Further, we propose a multi-action mixed loss that combines joint and separate probabilities. We perform experiments in Gym-$\mu$RTS and MuJoCo. Our hybrid model improves performance by more than 50\% in different MuJoCo environments compared to OpenAI's PPO benchmark results. In Gym-$\mu$RTS, we find that the sub-action loss outperforms the standard PPO approach, especially when the clip range is large. Our findings suggest this method can better balance the use-efficiency and quality of samples.
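The distinction between the joint and per-sub-action losses can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the paper's implementation: the function names are hypothetical, and it shows just the clipped surrogate for a single sample with scalar advantage. With a joint ratio, one out-of-range sub-action zeroes the gradient for the whole sample; clipping each sub-action's ratio separately lets the in-range sub-actions keep contributing.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_eps):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage)

def joint_loss(log_probs_new, log_probs_old, advantage, clip_eps=0.2):
    """Joint variant: sum sub-action log-probs (i.e. multiply probabilities)
    into a single joint ratio, then clip it. If the joint ratio leaves the
    clip range, the entire sample yields zero gradient."""
    joint_ratio = np.exp(np.sum(log_probs_new) - np.sum(log_probs_old))
    return -clipped_surrogate(joint_ratio, advantage, clip_eps)

def sub_action_loss(log_probs_new, log_probs_old, advantage, clip_eps=0.2):
    """Separate variant: clip each sub-action's ratio independently, so
    sub-actions whose ratios stay in range still produce gradient."""
    ratios = np.exp(log_probs_new - log_probs_old)
    return -np.mean(clipped_surrogate(ratios, advantage, clip_eps))

# One sub-action's ratio (exp(0.5) ≈ 1.65) exceeds the 1.2 bound:
lp_old = np.zeros(3)
lp_new = np.array([0.5, 0.0, 0.0])
print(joint_loss(lp_new, lp_old, 1.0))       # whole sample clipped → -1.2
print(sub_action_loss(lp_new, lp_old, 1.0))  # only one term clipped
```

The multi-action mixed loss the abstract mentions combines the joint and separate probabilities; its exact form is given in the paper and is not reproduced here.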
URL
https://arxiv.org/abs/2301.10919