Abstract
Offline reinforcement learning has developed rapidly in recent years, but estimating the actual performance of offline policies remains a challenge. We propose a scoring metric for offline policies that correlates strongly with actual policy performance and can be used directly for offline policy optimization in a supervised manner. To achieve this, we leverage the contrastive learning framework to design a scoring metric that assigns high scores to policies that imitate actions yielding relatively high returns while avoiding those yielding relatively low returns. Our experiments show that 1) our scoring metric ranks offline policies more accurately and 2) policies optimized with our metric achieve strong performance on various offline reinforcement learning benchmarks. Notably, our algorithm requires a much smaller policy network than other supervised learning-based methods and does not need any additional networks such as a Q-network.
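To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of one way such a contrastive scoring objective could look. It is not the paper's actual formulation: the pairing scheme and the names GaussianPolicy and contrastive_policy_score are illustrative assumptions. The sketch contrasts a policy's log-probability on actions from higher-return transitions (positives) against actions borrowed from lower-return transitions (negatives); because the resulting score is differentiable, maximizing it doubles as a supervised training loss, with no Q-network or critic.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianPolicy(nn.Module):
    """A small Gaussian policy; capacity is deliberately modest."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def log_prob(self, states, actions):
        dist = torch.distributions.Normal(self.net(states), self.log_std.exp())
        return dist.log_prob(actions).sum(-1)


def contrastive_policy_score(policy, states, actions, returns, temperature=1.0):
    """Pair each transition with a random other one; the action from the
    higher-return transition is the positive, the other the negative.
    A policy scores highly if it puts more probability on positives."""
    perm = torch.randperm(states.shape[0])
    lp_a = policy.log_prob(states, actions)        # dataset action at each state
    lp_b = policy.log_prob(states, actions[perm])  # action borrowed from another transition
    logits = torch.stack([lp_a, lp_b], dim=1) / temperature
    # Label 0 if the dataset action came from the higher-return (or tied) transition.
    labels = (returns < returns[perm]).long()
    # Negative cross-entropy over each pair: an InfoNCE-style contrastive score.
    return -F.cross_entropy(logits, labels)


# Usage sketch: ascend the score directly, in a purely supervised manner.
state_dim, action_dim, batch = 17, 6, 256
policy = GaussianPolicy(state_dim, action_dim)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

states = torch.randn(batch, state_dim)   # stand-ins for an offline RL batch
actions = torch.randn(batch, action_dim)
returns = torch.randn(batch)             # per-transition trajectory returns

score = contrastive_policy_score(policy, states, actions, returns)
(-score).backward()
opt.step()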
URL
https://arxiv.org/abs/2301.12842