
Process Reinforcement through Implicit Rewards

2025-02-03 15:43:48
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding

Abstract

Dense process rewards have proven to be a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. Dense rewards are also an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained signal has the potential to address inherent issues of outcome rewards such as training efficiency and credit assignment; yet this potential remains largely unrealized. The gap is primarily attributable to the challenges of training process reward models (PRMs) online: collecting high-quality process labels is prohibitively expensive, which makes PRMs particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels, via implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks using only 10% of its training data.
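
The abstract does not spell out how the implicit process rewards are computed. As a rough illustration only, the sketch below assumes the log-ratio parameterization commonly used for implicit rewards, r_t = beta * log(pi_phi(y_t | y_<t) / pi_ref(y_t | y_<t)), where pi_phi is a PRM updated online from outcome labels and pi_ref is a frozen reference model; the variable names, the beta value, and the random stand-in logits are illustrative assumptions, not the authors' implementation.

# Illustrative sketch: dense, token-level "implicit" process rewards as the
# per-token log-probability ratio between an online PRM and a frozen reference
# model. All tensors below are random stand-ins for real model outputs.
import torch
import torch.nn.functional as F

beta = 0.05  # reward-scaling coefficient (hypothetical value)

def token_log_probs(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Log-probability of each sampled token under the given logits."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

# A rollout of T tokens over a vocabulary of size V.
T, V = 8, 32
tokens = torch.randint(V, (T,))
prm_logits = torch.randn(T, V)   # online PRM pi_phi (updated from outcome labels)
ref_logits = torch.randn(T, V)   # frozen reference model pi_ref

# Per-token implicit process reward: beta * (log pi_phi - log pi_ref).
process_rewards = beta * (token_log_probs(prm_logits, tokens)
                          - token_log_probs(ref_logits, tokens))
print(process_rewards)  # dense rewards that a token-level advantage estimate can consume

Because the PRM here is scored against a reference model rather than trained on step-level annotations, it can be refreshed during RL using nothing more than policy rollouts and outcome labels, which is the property the abstract emphasizes.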

Abstract (translated)

Dense process rewards have proven to be a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. Although dense rewards are also an appealing choice for the RL of LLMs, because their fine-grained signal has the potential to address inherent issues of outcome rewards such as training efficiency and credit assignment, this potential remains largely unrealized. This is mainly due to the challenges of training process reward models (PRMs) online: collecting high-quality process labels is costly and difficult, which makes PRMs particularly vulnerable to reward hacking. To address these problems, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels, via implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME improves over the SFT model by 15.1% on average across several key reasoning benchmarks. Notably, our final model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks while using only one tenth of its training data. This work highlights the great potential of dense process rewards for improving LLM performance on tasks requiring complex reasoning, and shows how careful method design can overcome the challenges of online training.

URL

https://arxiv.org/abs/2502.01456

PDF

https://arxiv.org/pdf/2502.01456.pdf

