VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

2025-06-11 17:10:36
Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, code, and models to facilitate future research.
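
The abstract describes combining rule-based code verification with LLM-based verification but does not say how the two signals are merged. The sketch below is one possible reading, not the authors' implementation. The names `Instance`, `hybrid_reward`, `max_words`, and `stub_judge` are hypothetical, and so is the equal-weight average of the two scores. A real setup would replace `stub_judge` with a call to a reasoning model such as QwQ-32B.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types: a rule check is plain code over the response text;
# a judge scores soft constraints and returns a value in [0, 1].
RuleCheck = Callable[[str], bool]
Judge = Callable[[str, str, List[str]], float]


@dataclass
class Instance:
    instruction: str
    rule_checks: List[RuleCheck]   # constraints verifiable by code
    soft_constraints: List[str]    # constraints left to an LLM judge


def max_words(limit: int) -> RuleCheck:
    """Build a rule check that passes when the response has at most `limit` words."""
    return lambda text: len(text.split()) <= limit


def hybrid_reward(instance: Instance, response: str, judge: Judge) -> float:
    """Average a rule-based score and a judge score (assumed combination rule)."""
    # Rule score is 1.0 only if every code-checkable constraint passes.
    rule_score = 1.0 if all(check(response) for check in instance.rule_checks) else 0.0
    # Only consult the judge when there is something for it to verify.
    if instance.soft_constraints:
        llm_score = judge(instance.instruction, response, instance.soft_constraints)
    else:
        llm_score = 1.0
    return 0.5 * (rule_score + llm_score)


def stub_judge(instruction: str, response: str, constraints: List[str]) -> float:
    """Placeholder for an LLM verifier; always reports full satisfaction."""
    return 1.0


inst = Instance(
    instruction="Describe RL in at most 10 words, in a formal tone.",
    rule_checks=[max_words(10)],
    soft_constraints=["Use a formal tone."],
)
print(hybrid_reward(inst, "Agents learn policies from reward signals.", stub_judge))
```

With the stub judge, a response that violates the word limit receives half the reward, which shows how a code check can lower the score independently of the judge.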

URL

https://arxiv.org/abs/2506.09942

PDF

https://arxiv.org/pdf/2506.09942.pdf

