Abstract
Dense process rewards have proven to be a more effective alternative to sparse outcome-level rewards for the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. Dense rewards are also an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained feedback has the potential to address inherent issues of outcome rewards such as training efficiency and credit assignment; yet this potential remains largely unrealized. This can be attributed primarily to the challenge of training process reward models (PRMs) online: collecting high-quality process labels is prohibitively expensive, which makes PRMs particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels, through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing development overhead. We demonstrate PRIME's effectiveness on competition-level mathematics and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement over the SFT model across several key reasoning benchmarks. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks using only 10% of its training data.
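To make the core idea concrete, below is a minimal sketch of how implicit process rewards can be computed, assuming (as in the implicit-PRM formulation this work builds on) that a PRM trained only with outcome labels defines token-level rewards as a scaled log-likelihood ratio against a frozen reference model. The model paths, helper names, and the beta value are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch only: implicit process rewards as r_t = beta * (log pi_prm(y_t|ctx) - log pi_ref(y_t|ctx)).
# The PRM is updated online from policy rollouts and outcome labels; the reference model stays frozen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BETA = 0.05  # illustrative scaling coefficient (assumption)

def token_logprobs(model, input_ids, response_start):
    """Per-token log-probabilities of the response tokens under `model`."""
    with torch.no_grad():
        logits = model(input_ids).logits            # [1, seq_len, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]                       # next-token targets
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, response_start - 1:]          # log p(y_t | y_<t, x) for response tokens

def implicit_process_rewards(prm, ref, tokenizer, prompt, response):
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    start = prompt_ids.shape[1]                      # assumes prompt tokenization is a prefix of the full sequence
    lp_prm = token_logprobs(prm, full_ids, start)
    lp_ref = token_logprobs(ref, full_ids, start)
    return BETA * (lp_prm - lp_ref)                  # dense, token-level process rewards

# Usage (model identifiers are placeholders):
# prm = AutoModelForCausalLM.from_pretrained("path/to/online-updated-prm")
# ref = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B")
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")
# rewards = implicit_process_rewards(prm, ref, tok, "Solve: 2+2=", " The answer is 4.")
```

Summing these token-level rewards within each reasoning step would yield step-level rewards; because the PRM is trained only with outcome labels, it can keep being updated online on fresh policy rollouts, which is what the abstract refers to as online PRM updates without process labels.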
Abstract (translated)
Dense process rewards have proven to be a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), especially for tasks requiring complex multi-step reasoning. Although dense rewards are also an attractive choice for the reinforcement learning (RL) of LLMs, because their fine-grained feedback has the potential to address inherent issues of outcome rewards such as training efficiency and credit assignment, this potential remains largely unrealized. This is mainly due to the challenges of training process reward models (PRMs) online: collecting high-quality process labels is costly and difficult, making PRMs particularly vulnerable to reward hacking. To address these problems, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels, via implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase required by existing approaches, greatly reducing development overhead. We demonstrate PRIME's effectiveness on competition mathematics and coding. Starting from Qwen2.5-Math-7B-Base, PRIME improves performance over the SFT model by 15.1% on average across several key reasoning benchmarks. Notably, our final model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks while using only one tenth of its training data. This work highlights the great potential of dense process rewards for improving LLM performance on tasks requiring complex reasoning, and shows how careful method design can overcome the challenges of online training.
URL
https://arxiv.org/abs/2502.01456