Abstract
Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency to overrefuse benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content containing unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals and fails to provide nuanced guidance for the queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and engages only adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
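To make the described mechanics concrete, the following is a minimal, hypothetical sketch of the inference-time collaboration loop and a simplified stand-in for the Dynamic Improvement Reward. It is not the authors' implementation; all names, signatures, and reward values (e.g., collaborate, dynamic_improvement_reward, the two-round cap) are illustrative assumptions based only on the abstract.

```python
# Illustrative sketch of a WaltzRL-style inference loop and a toy DIR signal.
# All names, signatures, and constants are assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Feedback:
    needs_revision: bool        # feedback agent engages only when needed
    label: str                  # e.g. "unsafe", "overrefusal", or "ok"
    suggestion: Optional[str]   # nuanced guidance instead of a blanket block


def collaborate(
    prompt: str,
    converse: Callable[[str, Optional[str]], str],  # conversation agent
    critique: Callable[[str, str], Feedback],       # feedback agent
    max_rounds: int = 2,
) -> str:
    """Improve unsafe or overrefusing responses instead of discarding them."""
    response = converse(prompt, None)
    for _ in range(max_rounds):
        fb = critique(prompt, response)
        if not fb.needs_revision:     # safe queries pass through with no extra latency
            return response
        # The conversation agent revises using the feedback agent's suggestion.
        response = converse(prompt, fb.suggestion)
    return response


def dynamic_improvement_reward(
    engaged: bool,
    before_ok: bool,   # was the original response safe and helpful?
    after_ok: bool,    # is the revised response safe and helpful?
) -> float:
    """Toy stand-in for DIR: reward feedback the conversation agent successfully
    incorporates; penalize unnecessary or unhelpful engagement."""
    if not engaged:
        return 0.0 if before_ok else -1.0   # stayed silent on a response that needed help
    if after_ok and not before_ok:
        return 1.0                          # feedback fixed the response
    if before_ok:
        return -0.5                         # engaged when it was not needed
    return -1.0                             # feedback failed to improve the response
```

In WaltzRL, the two agents are trained jointly with reinforcement learning against signals of this general shape, so the feedback agent learns when to stay silent and the conversation agent learns to incorporate feedback rather than refuse outright; the precise reward formulation is given in the paper.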
URL
https://arxiv.org/abs/2510.08240