Abstract
Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chains of these interpretations, and synthesizes them into a unified rule set. Using the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output and use this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule yields a 28.6% relative improvement in length-controlled win rate on AlpacaEval2.0 and a 6.1% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules agree well with the dataset preferences. We find that AutoRule exhibits less reward hacking than a learned reward model when training is run for two episodes. Finally, our case study suggests that the extracted rules capture distinct qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at this https URL.
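The abstract describes the reward computation as the fraction of extracted rules an output satisfies, added as an auxiliary term to the learned reward model's score during policy optimization. Below is a minimal, hypothetical sketch of that combination; the verifier callback, function names, and the mixing weight `alpha` are illustrative assumptions, not the authors' implementation.

```python
# Sketch (not the authors' code): combine a rule-based auxiliary reward with a
# learned reward model score, as described in the AutoRule abstract.
from typing import Callable, List


def rule_satisfaction_reward(
    output: str,
    rules: List[str],
    verify: Callable[[str, str], bool],
) -> float:
    """Fraction of extracted rules the output satisfies, judged by an LM verifier."""
    if not rules:
        return 0.0
    satisfied = sum(1 for rule in rules if verify(output, rule))
    return satisfied / len(rules)


def combined_reward(
    output: str,
    rules: List[str],
    verify: Callable[[str, str], bool],
    reward_model_score: float,
    alpha: float = 0.5,  # assumed mixing weight; the abstract does not specify one
) -> float:
    """Learned reward-model score plus the rule-based auxiliary term."""
    return reward_model_score + alpha * rule_satisfaction_reward(output, rules, verify)


if __name__ == "__main__":
    # Toy stand-in verifier: a rule counts as satisfied if its final keyword
    # appears in the output. A real setup would query a language-model verifier.
    toy_rules = ["mentions limitations", "uses bullet points"]
    toy_verify = lambda out, rule: rule.split()[-1].rstrip("s") in out.lower()
    print(combined_reward("- point one\n- limitations noted", toy_rules, toy_verify, 1.2))
```

In this sketch the per-output auxiliary reward is simply the satisfied-rule fraction; how it is weighted against the learned reward during GRPO is an assumption of the example.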
URL
https://arxiv.org/abs/2506.15651