Abstract
Inference-time alignment methods have gained significant attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, existing dominant approaches using reward-guided search (RGS) primarily rely on outcome reward models (ORMs), which suffer from a critical granularity mismatch: ORMs are designed to provide outcome rewards for complete responses, while RGS methods rely on process rewards to guide the policy, leading to inconsistent scoring and suboptimal alignment. To address this challenge, we introduce process reward models (PRMs) into RGS and argue that an ideal PRM should satisfy two objectives: Score Consistency, ensuring coherent evaluation across partial and complete responses, and Preference Consistency, aligning partial sequence assessments with human preferences. Based on these objectives, we propose SP-PRM, a novel dual-consistency framework that integrates score consistency-based and preference consistency-based partial evaluation modules without relying on human annotation. Extensive experiments on dialogue, summarization, and reasoning tasks demonstrate that SP-PRM substantially enhances existing RGS methods, achieving a 3.6%-10.3% improvement in GPT-4 evaluation scores across all tasks.
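To make the granularity issue concrete, below is a minimal sketch of a generic segment-level reward-guided search loop in which a process reward model scores partial responses at every step. This is an illustration only, not SP-PRM's actual training or evaluation modules; sample_continuations, TinyPRM, and the scoring heuristic are hypothetical stand-ins.

    # Minimal sketch of reward-guided search (RGS), assuming a toy policy and a
    # toy process reward model; all names here are hypothetical placeholders.
    import random
    from typing import List

    def sample_continuations(prefix: str, k: int) -> List[str]:
        """Hypothetical policy: propose k candidate next segments for a prefix."""
        segments = ["helpful detail. ", "vague filler. ", "a concrete step. "]
        return [prefix + random.choice(segments) for _ in range(k)]

    class TinyPRM:
        """Toy process reward model: scores *partial* responses, so every prefix
        (not only the finished response) receives a comparable reward."""
        def score(self, partial_response: str) -> float:
            # Reward informative segments, penalize filler (illustrative heuristic).
            return (partial_response.count("detail")
                    + partial_response.count("step")
                    - partial_response.count("filler"))

    def reward_guided_search(prompt: str, prm: TinyPRM, steps: int = 4, k: int = 3) -> str:
        """Greedy segment-level RGS: at each step keep the candidate prefix that
        the process reward model scores highest."""
        best = prompt
        for _ in range(steps):
            candidates = sample_continuations(best, k)
            best = max(candidates, key=prm.score)
        return best

    if __name__ == "__main__":
        random.seed(0)
        print(reward_guided_search("Explain the plan: ", TinyPRM()))

An ORM, by contrast, is only meaningful on complete responses, so plugging it into the per-step scoring above is exactly the granularity mismatch the abstract describes.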
URL
https://arxiv.org/abs/2506.12446