Paper Reading AI Learner

From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment

2025-06-14 10:58:38
Bin Xie, Bingbing Xu, Yige Yuan, Shengmao Zhu, Huawei Shen

Abstract

Inference-time alignment methods have gained significant attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, existing dominant approaches using reward-guided search (RGS) primarily rely on outcome reward models (ORMs), which suffer from a critical granularity mismatch: ORMs are designed to provide outcome rewards for complete responses, while RGS methods rely on process rewards to guide the policy, leading to inconsistent scoring and suboptimal alignment. To address this challenge, we introduce process reward models (PRMs) into RGS and argue that an ideal PRM should satisfy two objectives: Score Consistency, ensuring coherent evaluation across partial and complete responses, and Preference Consistency, aligning partial sequence assessments with human preferences. Based on these, we propose SP-PRM, a novel dual-consistency framework integrating score consistency-based and preference consistency-based partial evaluation modules without relying on human annotation. Extensive experiments on dialogue, summarization, and reasoning tasks demonstrate that SP-PRM substantially enhances existing RGS methods, achieving a 3.6%-10.3% improvement in GPT-4 evaluation scores across all tasks.
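To make the granularity mismatch concrete, below is a minimal, self-contained sketch (not the authors' implementation) of a reward-guided search loop over partial responses. Everything here is an illustrative assumption: the toy_policy and toy_process_reward stand-ins, the beam size, and the token vocabulary are invented for demonstration, not details from the paper. The point it shows is that the search loop must score incomplete prefixes at every step, which is the granularity an ORM is not trained for and a PRM is meant to handle.

```python
# Hedged sketch: reward-guided search (RGS) guided by prefix-level rewards.
# All names (toy_policy, toy_process_reward) are hypothetical placeholders.
import random

random.seed(0)

VOCAB = ["helpful", "harmless", "honest", "verbose", "rude", "<eos>"]


def toy_policy(prefix, k=3):
    """Stand-in for an LLM policy: propose k candidate next tokens for a prefix."""
    return random.sample(VOCAB, k)


def toy_process_reward(prompt, prefix_tokens):
    """Stand-in for a PRM: assigns a score to any *partial* response prefix."""
    good = {"helpful": 1.0, "harmless": 0.8, "honest": 0.8}
    return sum(good.get(tok, -0.5) for tok in prefix_tokens)


def reward_guided_search(prompt, reward_fn, beam_size=2, max_steps=6):
    """Greedy beam-style RGS: expand each beam, rescore prefixes, keep the best."""
    beams = [[]]  # each beam is a list of tokens generated so far
    for _ in range(max_steps):
        candidates = []
        for tokens in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append(tokens)  # finished beams carry over unchanged
                continue
            for nxt in toy_policy(tokens):
                candidates.append(tokens + [nxt])
        # The reward model is queried on *incomplete* prefixes here; an ORM is
        # only defined for complete responses, so a prefix-consistent PRM is
        # what makes this pruning step meaningful.
        candidates.sort(key=lambda t: reward_fn(prompt, t), reverse=True)
        beams = candidates[:beam_size]
        if all(t and t[-1] == "<eos>" for t in beams):
            break
    return beams[0]


if __name__ == "__main__":
    best = reward_guided_search("Write a polite reply.", toy_process_reward)
    print(" ".join(best))
```

Swapping toy_process_reward for a model that only scores finished responses would leave the pruning step uninformed until the very last step, which is the inconsistency SP-PRM's score- and preference-consistency objectives are designed to remove.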

Abstract (translated)

Inference-time alignment methods have attracted widespread attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, the currently dominant approaches based on reward-guided search (RGS) rely primarily on outcome reward models (ORMs), which suffer from a key granularity mismatch: ORMs are designed to provide outcome rewards for complete responses, whereas RGS methods depend on process rewards to guide the policy, resulting in inconsistent scoring and suboptimal alignment. To address this challenge, we introduce process reward models (PRMs) into RGS and argue that an ideal PRM should satisfy two objectives: score consistency, which ensures coherent evaluation across partial and complete responses, and preference consistency, which keeps assessments of partial sequences aligned with human preferences. Building on these two objectives, we propose SP-PRM, a novel dual-consistency framework that integrates score consistency-based and preference consistency-based partial evaluation modules without relying on human annotation. Extensive experiments on dialogue, summarization, and reasoning tasks show that SP-PRM substantially improves existing RGS methods, achieving a 3.6%-10.3% gain in GPT-4 evaluation scores across all tasks.

URL

https://arxiv.org/abs/2506.12446

PDF

https://arxiv.org/pdf/2506.12446.pdf

