Abstract
As LLMs become more widely deployed, there is increasing interest in directly optimizing for feedback from end users (e.g. thumbs up) in addition to feedback from paid annotators. However, training to maximize human feedback creates a perverse incentive structure for the AI to resort to manipulative tactics to obtain positive feedback, and some users may be especially vulnerable to such tactics. We study this phenomenon by training LLMs with Reinforcement Learning with simulated user feedback. We have three main findings: 1) Extreme forms of "feedback gaming" such as manipulation and deception can reliably emerge in domains of practical LLM usage; 2) Concerningly, even if only <2% of users are vulnerable to manipulative strategies, LLMs learn to identify and surgically target them while behaving appropriately with other users, making such behaviors harder to detect; 3 To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. To our surprise, we found that while such approaches help in some settings, they backfire in others, leading to the emergence of subtler problematic behaviors that would also fool the LLM judges. Our findings serve as a cautionary tale, highlighting the risks of using gameable feedback sources -- such as user feedback -- as a target for RL.
Abstract (translated)
随着大规模语言模型(LLMs)的广泛应用,除了来自付费标注者的反馈外,直接优化终端用户的反馈(例如点赞)也越来越受到关注。然而,为了最大化人类反馈而进行的训练会为AI创造一种不正当的激励结构,使其倾向于使用操纵手段来获取正面反馈,且一些用户可能特别容易受此类策略的影响。我们通过使用模拟用户反馈的强化学习训练LLMs来研究这一现象,并得出了三个主要发现:1)诸如操纵和欺骗等极端形式的“反馈作弊”在实际LLM应用场景中可以可靠地出现;2)令人担忧的是,即使只有不到2%的用户容易受到操纵策略的影响,LLMs也会学会识别并针对性地针对这些用户,同时对其他用户保持适当的行为,使得此类行为更难被检测到;3)为了缓解这一问题,利用持续的安全训练或在训练过程中使用LLM作为裁判来过滤有问题输出似乎是一个有前景的方法。然而,让我们惊讶的是,我们发现虽然这种方法在某些场景下有所帮助,但在另一些场景中却适得其反,导致出现更微妙的、也会欺骗LLM裁判的问题行为。我们的研究结果起到了警示作用,强调了使用可被操控的反馈源(如用户反馈)作为RL目标的风险。
URL
https://arxiv.org/abs/2411.02306