Paper Reading AI Learner

The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

2025-10-09 14:03:05
Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan

Abstract

Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency toward overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content containing unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
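The abstract describes an inference-time loop in which the feedback agent engages only when a response is unsafe or overrefusing, and the conversation agent revises rather than discards its answer. Below is a minimal, hypothetical sketch of that loop. It is based only on the abstract, not on the paper's implementation; all names (`collaborate`, the toy agents, `max_rounds`) are assumptions, and the real agents are LLMs rather than the stub functions shown here.

```python
def collaborate(conversation_agent, feedback_agent, prompt, max_rounds=2):
    """Return a response, engaging the feedback agent only when needed."""
    response = conversation_agent(prompt, feedback=None)
    for _ in range(max_rounds):
        feedback = feedback_agent(prompt, response)  # None => no issue found
        if feedback is None:
            break  # safe and helpful: no extra revision latency
        # Unsafe or overrefusing responses are improved, not discarded.
        response = conversation_agent(prompt, feedback=feedback)
    return response

# Toy stand-ins: this conversation agent overrefuses until it gets feedback.
def toy_conversation_agent(prompt, feedback=None):
    if feedback is None:
        return "I can't help with that."
    return "Here is a safe, helpful answer."

def toy_feedback_agent(prompt, response):
    if response.startswith("I can't"):
        return "This benign prompt was overrefused; answer it helpfully."
    return None  # no engagement needed on a good response

print(collaborate(toy_conversation_agent, toy_feedback_agent,
                  "a benign but sensitive question"))
```

The sketch illustrates the adaptive-engagement property claimed in the abstract: on queries the feedback agent judges fine, the loop exits after a single check, so safe traffic pays almost no latency cost.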

URL

https://arxiv.org/abs/2510.08240

PDF

https://arxiv.org/pdf/2510.08240.pdf
