Abstract
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping) and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative rewards via a direct policy parameterization between two completions to a prompt, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally tractable than PPO.
Abstract (translated)
最初是为连续控制问题而开发的,但Proximal Policy Optimization (PPO)现在已成为各种强化学习(RL)应用(包括对生成模型的微调)的摇钱树。不幸的是,PPO需要多个技巧来实现稳定的收敛(例如值网络,截断) ,并以其对这些组件的具体实现非常敏感而闻名。为了应对这个问题,我们回退一步并问:在生成模型时代,一个简约的RL算法会是什么样子?我们提出了REBEL,一种通过直接对两个完成之间的策略参数化来降低策略优化问题的算法。在理论方面,我们证明了诸如自然策略梯度等基本RL算法可以被视为REBEL的变体,这使我们能够匹配RL文献中关于收敛和样本复杂性的最强已知理论保证。REBEL还可以干净地整合离线数据,并处理我们经常遇到的实际问题中的自偏好。在实证研究中,我们发现REBEL在语言建模和图像生成方面的性能与PPO和DPO相当或更好,同时比PPO更简单地实现,并且具有更快的计算可处理性。
URL
https://arxiv.org/abs/2404.16767