Abstract
As a general algorithmic framework, Policy-Space Response Oracles (PSRO) has achieved state-of-the-art performance in learning equilibrium policies of two-player zero-sum games. However, the hand-crafted hyperparameter value selection in most existing works requires extensive domain knowledge, which forms the main barrier to applying PSRO to different games. In this work, we make the first attempt to investigate the possibility of self-adaptively determining the optimal hyperparameter values in the PSRO framework. Our contributions are three-fold: (1) We propose a parametric PSRO that, through several hyperparameters, unifies gradient descent ascent (GDA) and different PSRO variants. (2) We propose the self-adaptive PSRO (SPSRO) by casting the hyperparameter value selection of the parametric PSRO as a hyperparameter optimization (HPO) problem, where our objective is to learn an HPO policy that can self-adaptively determine the optimal hyperparameter values during the running of the parametric PSRO. (3) To overcome the poor performance of online HPO methods, we propose a novel offline HPO approach that optimizes the HPO policy based on the Transformer architecture. Experiments on various two-player zero-sum games demonstrate the superiority of SPSRO over different baselines.
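For readers unfamiliar with PSRO, the following is a minimal, illustrative sketch of the generic PSRO loop on a small zero-sum matrix game (rock-paper-scissors). With a uniform meta-strategy over the population, PSRO reduces to fictitious play; the choice of meta-solver is one example of the hyperparameters that PSRO variants expose. This is not the paper's parametric PSRO, only the common skeleton it builds on.

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (zero-sum).
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def mixture(population, n):
    """Empirical frequency of pure strategies in the population."""
    mix = np.zeros(n)
    for s in population:
        mix[s] += 1.0 / len(population)
    return mix

rows, cols = [0], [0]            # each population starts with one policy
for _ in range(20):
    # Meta-strategy: uniform over the current population (fictitious play).
    row_mix = mixture(rows, 3)
    col_mix = mixture(cols, 3)
    # Exact best-response oracles; in large games these are RL subroutines.
    rows.append(int(np.argmax(A @ col_mix)))   # maximize row payoff
    cols.append(int(np.argmin(row_mix @ A)))   # minimize row payoff

print(set(rows), set(cols))      # populations cover all three pure strategies
```

In practice, the per-iteration choices made here by fixed rules (meta-solver, oracle effort, stopping) are exactly the kind of hyperparameters the abstract proposes to select self-adaptively.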
Abstract (translated)
As a general algorithmic framework, Policy-Space Response Oracles (PSRO) has achieved state-of-the-art performance in learning equilibrium policies of two-player zero-sum games. However, in most existing works, the hand-crafted selection of hyperparameter values requires extensive domain knowledge, which forms the main barrier to applying PSRO to different games. In this work, we make the first attempt to investigate the possibility of self-adaptively determining the optimal hyperparameter values in the PSRO framework. Our contributions are three-fold: (1) We propose a parametric PSRO that, through several hyperparameters, unifies gradient descent ascent (GDA) and different PSRO variants. (2) We propose the self-adaptive PSRO (SPSRO) by casting the hyperparameter value selection of the parametric PSRO as a hyperparameter optimization (HPO) problem, where our objective is to self-adaptively determine the optimal hyperparameter values during the running of the parametric PSRO. (3) To overcome the poor performance of online HPO methods, we propose a novel offline HPO approach that optimizes the HPO policy based on the Transformer architecture. Experiments on various two-player zero-sum games demonstrate the superiority of SPSRO over different baselines.
URL
https://arxiv.org/abs/2404.11144