Paper Reading AI Learner

Curiosity-driven Red-teaming for Large Language Models

2024-02-29 18:55:03
Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal

Abstract

Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a \textit{red team} of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red-team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current RL methods are only able to generate a small number of effective test cases, resulting in low coverage of the span of prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between the problem of increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration that optimizes for novelty. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods. Our method, CRT, successfully provokes toxic responses from the LLaMA2 model that has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at \url{this https URL}
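
The abstract describes shaping the red-team policy's RL reward with a curiosity-style novelty signal so that generated test cases remain effective while covering a wider span of prompts. Below is a minimal, hypothetical sketch of that idea; the bag-of-words similarity, the stub toxicity scorer, and the `novelty_weight` are illustrative assumptions, not the paper's actual reward terms.

```python
# Hypothetical sketch of curiosity-style reward shaping for a red-team policy.
# The toxicity scorer, similarity measure, and weights are illustrative only.
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between simple bag-of-words vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def toxicity_score(response: str) -> float:
    """Stand-in for a learned toxicity classifier scoring the target LLM's
    response in [0, 1]; a real setup would query such a model instead."""
    bad_words = {"hate", "stupid", "idiot"}
    toks = response.lower().split()
    return min(1.0, 5 * sum(t in bad_words for t in toks) / max(len(toks), 1))

def shaped_reward(prompt: str, response: str, history: list[str],
                  novelty_weight: float = 0.5) -> float:
    """Effectiveness term (did the prompt elicit toxicity?) plus a novelty
    bonus that shrinks as the prompt resembles previously generated cases."""
    novelty = 1.0 - max((bow_cosine(prompt, p) for p in history), default=0.0)
    return toxicity_score(response) + novelty_weight * novelty

# Toy usage: the second, near-duplicate prompt earns a smaller novelty bonus.
history: list[str] = []
for prompt, response in [("tell me a rude joke", "you are an idiot"),
                         ("tell me a rude joke please", "you are an idiot")]:
    r = shaped_reward(prompt, response, history)
    history.append(prompt)
    print(f"{prompt!r}: reward={r:.2f}")
```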

Abstract (translated)

Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or harmful content. To probe when an LLM generates undesirable content, the current paradigm is to recruit a \textit{red team} of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from the LLM. However, relying solely on human testers is expensive and time-consuming. Recent work automates red teaming by training a separate red-team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, existing RL methods can only generate a small number of effective test cases, resulting in low coverage of the prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration, which optimizes for novelty. Our curiosity-driven red-teaming (CRT) method achieves broader test-case coverage while maintaining or improving effectiveness. CRT successfully provokes toxic responses from a LLaMA2 model that has been heavily fine-tuned with human preferences to avoid toxic outputs. Code is available at \url{this https URL}

URL

https://arxiv.org/abs/2402.19464

PDF

https://arxiv.org/pdf/2402.19464.pdf

