Abstract
Process or step-wise supervision has played a crucial role in advancing the complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or more steps in a reference solution, accompanied by explicit reasoning for each evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. Compared to baselines, SPARE improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, SPARE achieves competitive performance on challenging mathematical datasets while being 2.6 times more efficient, requiring only 38% of the runtime of tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.
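To make the annotation scheme concrete, below is a minimal sketch of how a SPARE-style single-pass, reference-guided annotator might be structured. All names here (StepAnnotation, annotate_solution, and the overlap stub) are hypothetical illustrations, not the paper's implementation: in practice an LLM judge would receive the full reference solution and, in one pass, emit for every candidate step an alignment to one or more reference steps, an explicit rationale, and a correctness label.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepAnnotation:
    step_idx: int                  # index of the candidate solution step
    aligned_ref_steps: List[int]   # one or more reference steps it aligns to
    reasoning: str                 # explicit rationale for the judgment
    is_correct: bool               # per-step process label

def annotate_solution(candidate_steps: List[str],
                      reference_steps: List[str]) -> List[StepAnnotation]:
    """Annotate every candidate step in a single pass over the solution.
    An LLM judge would perform the alignment and evaluation; here a trivial
    token-overlap heuristic stands in so the sketch runs end to end."""
    annotations = []
    for i, step in enumerate(candidate_steps):
        step_tokens = set(step.lower().split())
        aligned = [j for j, ref in enumerate(reference_steps)
                   if len(step_tokens & set(ref.lower().split())) >= 3]
        annotations.append(StepAnnotation(
            step_idx=i,
            aligned_ref_steps=aligned,
            reasoning=f"Candidate step {i} overlaps reference step(s) {aligned}.",
            is_correct=bool(aligned),
        ))
    return annotations
```

Similarly, for use case (2), per-step scores from a trained process reward model such as SPARE-PRM can be aggregated to rank multiple sampled solutions. The aggregation rule below (product of step scores, computed as a sum of logs) is one common convention assumed for illustration; taking the minimum step score is another.

```python
import math
from typing import List

def solution_score(step_scores: List[float]) -> float:
    """Aggregate per-step reward-model scores into one solution-level score
    via the product of step probabilities (sum of logs, clamped for safety)."""
    return sum(math.log(max(s, 1e-9)) for s in step_scores)

def rank_candidates(per_step_scores: List[List[float]]) -> List[int]:
    """Return candidate indices sorted from best to worst solution score."""
    return sorted(range(len(per_step_scores)),
                  key=lambda i: solution_score(per_step_scores[i]),
                  reverse=True)

# Example: three sampled solutions, each with per-step PRM scores.
candidates = [[0.9, 0.8, 0.7], [0.95, 0.4, 0.9], [0.85, 0.9, 0.88]]
print(rank_candidates(candidates))  # -> [2, 0, 1]
```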
URL
https://arxiv.org/abs/2506.15498