Paper Reading AI Learner

SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

2025-06-18 14:37:59
Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych

Abstract

Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.
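The abstract describes the annotation scheme only at a high level: each candidate solution step is aligned to one or more reference-solution steps, judged with explicit reasoning in a single pass, and the resulting step labels are then used for offline RL fine-tuning or for reward models that rank/aggregate multiple sampled outputs. The minimal Python sketch below is our own illustration of that data flow under stated assumptions, not the authors' released codebase; all class names, fields, and the simple fraction-of-correct-steps aggregation are hypothetical choices for exposition.

```python
# Hypothetical sketch (not the released SPARE code): a reference-guided,
# per-step annotation record and a simple aggregation of step labels
# into a solution score for ranking multiple sampled solutions.
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepAnnotation:
    step_idx: int                 # index of the step in the candidate solution
    aligned_ref_steps: List[int]  # one or more reference-solution steps it maps to
    reasoning: str                # explicit evaluation rationale from the single pass
    is_correct: bool              # resulting process (step-level) label


@dataclass
class SolutionAnnotation:
    steps: List[StepAnnotation] = field(default_factory=list)

    def step_labels(self) -> List[int]:
        return [int(s.is_correct) for s in self.steps]


def solution_score(ann: SolutionAnnotation) -> float:
    """Aggregate step labels into one score for best-of-N style ranking.
    Here: fraction of steps judged correct (one simple, assumed choice)."""
    labels = ann.step_labels()
    return sum(labels) / len(labels) if labels else 0.0


def rank_candidates(candidates: List[SolutionAnnotation]) -> List[int]:
    """Return candidate indices ordered from highest to lowest score."""
    return sorted(range(len(candidates)),
                  key=lambda i: solution_score(candidates[i]),
                  reverse=True)


if __name__ == "__main__":
    # Toy example: two annotated candidate solutions to the same problem.
    cand_a = SolutionAnnotation([
        StepAnnotation(0, [0], "Matches the reference setup of the equation.", True),
        StepAnnotation(1, [1, 2], "Covers two reference steps; arithmetic is wrong.", False),
    ])
    cand_b = SolutionAnnotation([
        StepAnnotation(0, [0], "Same setup as the reference.", True),
        StepAnnotation(1, [1], "Intermediate simplification agrees with the reference.", True),
        StepAnnotation(2, [2], "Final answer matches.", True),
    ])
    print(rank_candidates([cand_a, cand_b]))  # -> [1, 0]
```

In a trained-PRM setting, the boolean labels above would be replaced by model-predicted step scores, but the ranking/aggregation step would look the same.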


URL

https://arxiv.org/abs/2506.15498

PDF

https://arxiv.org/pdf/2506.15498.pdf

