Paper Reading AI Learner

AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning

2025-06-18 17:29:19
Tevin Wang, Chenyan Xiong

Abstract

Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6% relative improvement in length-controlled win rate on AlpacaEval 2.0, and a 6.1% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with the dataset preferences. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when run over two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at this https URL.
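
The reward formulation described above can be made concrete with a small sketch: the rule-based reward is the fraction of extracted rules an output satisfies (as judged by a language-model verifier), used alongside the learned reward model's score during policy optimization. The listing below is an illustration under stated assumptions only; the names (verify_rule, rule_reward, combined_reward) and the additive weighting are assumptions, not the paper's released implementation.

    from typing import Callable, List

    def rule_reward(output: str, rules: List[str],
                    verify_rule: Callable[[str, str], bool]) -> float:
        # Fraction of extracted rules that the output satisfies, with each
        # check delegated to a language-model verifier (abstract here).
        if not rules:
            return 0.0
        satisfied = sum(1 for rule in rules if verify_rule(output, rule))
        return satisfied / len(rules)

    def combined_reward(output: str, rules: List[str],
                        verify_rule: Callable[[str, str], bool],
                        rm_score: float, rule_weight: float = 1.0) -> float:
        # Learned reward-model score plus the weighted rule-satisfaction
        # fraction, used as the scalar reward during policy optimization
        # (GRPO in the paper). The additive form and weight are assumptions.
        return rm_score + rule_weight * rule_reward(output, rules, verify_rule)

In practice, verify_rule would prompt a language-model judge with the output and a single rule and parse a yes/no verdict; it is left as an abstract callable here.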


URL

https://arxiv.org/abs/2506.15651

PDF

https://arxiv.org/pdf/2506.15651.pdf

