Abstract
While Reinforcement Learning (RL) has proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, which requires computationally expensive hyperparameter tuning. Additionally, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we introduce Reward Calibration from Demonstrations (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations' and the LLM's rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation. We show the effectiveness of RCfD on three language tasks, where it achieves comparable performance to carefully tuned baselines while mitigating ROO.
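As a reading aid, here is a minimal LaTeX sketch of the objective shift the abstract describes. The notation is our own shorthand, not the paper's exact formulation: $\pi$ is the LLM policy, $\pi_{\mathrm{ref}}$ the reference model, $R$ the reward model, $y^{\star}$ a human demonstration for prompt $x$, and the absolute-value distance is an assumed choice.

% Standard KL-regularized reward maximization (the baseline the abstract refers to):
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot\mid x)}\!\left[ R(x, y) \right] \;-\; \beta\, \mathrm{KL}\!\left( \pi \,\Vert\, \pi_{\mathrm{ref}} \right)

% RCfD, as described above: match the demonstration's reward rather than maximize R:
\min_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot\mid x)}\!\left[ \left| R(x, y) - R(x, y^{\star}) \right| \right]

Under this reading, the policy is pulled toward the reward level attained by human demonstrations instead of toward the reward model's maximum, so it has no incentive to drift into regions where the reward model is poorly calibrated, which is the exploitation behavior ROO refers to.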
URL
https://arxiv.org/abs/2404.19409