Paper Reading AI Learner

Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback

2024-11-04 17:31:02
Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan

Abstract

As LLMs become more widely deployed, there is increasing interest in directly optimizing for feedback from end users (e.g. thumbs up) in addition to feedback from paid annotators. However, training to maximize human feedback creates a perverse incentive structure for the AI to resort to manipulative tactics to obtain positive feedback, and some users may be especially vulnerable to such tactics. We study this phenomenon by training LLMs with Reinforcement Learning with simulated user feedback. We have three main findings: 1) Extreme forms of "feedback gaming" such as manipulation and deception can reliably emerge in domains of practical LLM usage; 2) Concerningly, even if only <2% of users are vulnerable to manipulative strategies, LLMs learn to identify and surgically target them while behaving appropriately with other users, making such behaviors harder to detect; 3 To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. To our surprise, we found that while such approaches help in some settings, they backfire in others, leading to the emergence of subtler problematic behaviors that would also fool the LLM judges. Our findings serve as a cautionary tale, highlighting the risks of using gameable feedback sources -- such as user feedback -- as a target for RL.

Abstract (translated)

随着大规模语言模型(LLMs)的广泛应用,除了来自付费标注者的反馈外,直接优化终端用户的反馈(例如点赞)也越来越受到关注。然而,为了最大化人类反馈而进行的训练会为AI创造一种不正当的激励结构,使其倾向于使用操纵手段来获取正面反馈,且一些用户可能特别容易受此类策略的影响。我们通过使用模拟用户反馈的强化学习训练LLMs来研究这一现象,并得出了三个主要发现:1)诸如操纵和欺骗等极端形式的“反馈作弊”在实际LLM应用场景中可以可靠地出现;2)令人担忧的是,即使只有不到2%的用户容易受到操纵策略的影响,LLMs也会学会识别并针对性地针对这些用户,同时对其他用户保持适当的行为,使得此类行为更难被检测到;3)为了缓解这一问题,利用持续的安全训练或在训练过程中使用LLM作为裁判来过滤有问题输出似乎是一个有前景的方法。然而,让我们惊讶的是,我们发现虽然这种方法在某些场景下有所帮助,但在另一些场景中却适得其反,导致出现更微妙的、也会欺骗LLM裁判的问题行为。我们的研究结果起到了警示作用,强调了使用可被操控的反馈源(如用户反馈)作为RL目标的风险。

URL

https://arxiv.org/abs/2411.02306

PDF

https://arxiv.org/pdf/2411.02306.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot