Reducing the Scope of Language Models with Circuit Breakers

2024-10-28 23:06:57
David Yunis, Siyu Huo, Chulaka Gunasekara, Danish Contractor

Abstract

Language models are now deployed in a wide variety of user-facing applications, often for specific purposes like answering questions about documentation or acting as coding assistants. As these models are intended for particular purposes, they should not be able to answer irrelevant queries like requests for poetry or questions about physics, or, even worse, queries that can only be answered by humans, like questions about sensitive company policies. Instead, we would like them to answer only queries corresponding to desired behavior and refuse all other requests, which we refer to as scoping. We find that, despite the use of system prompts, two representative language models can be poorly scoped and respond to queries they should not be addressing. We then conduct a comprehensive empirical evaluation of methods that could be used for scoping the behavior of language models. Among many other results, we show that a recently proposed method for general alignment, Circuit Breakers (CB), can be adapted to scope language models to very specific tasks like sentiment analysis or summarization, or even to tasks with finer-grained scoping (e.g., summarizing only news articles). Compared to standard methods like supervised fine-tuning (SFT) or preference learning, CB is more robust both on out-of-distribution tasks and against adversarial prompting techniques. We also show that layering SFT and CB together often results in the best of both worlds: improved performance on relevant queries while rejecting irrelevant ones.
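Circuit Breakers, as originally proposed for alignment, trains a model so that its internal representations on disallowed inputs are rerouted away from those of a frozen reference copy, while representations on allowed inputs are preserved. As a rough illustration of how such an objective might look when adapted to scoping, here is a minimal sketch assuming a Hugging Face-style causal LM; the function name, layer choice, and loss weighting are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a Circuit-Breaker-style objective adapted to scoping.
# Assumes a Hugging Face-style causal LM; names and hyperparameters are
# illustrative, not the paper's code.
import torch
import torch.nn.functional as F

def scoping_cb_losses(model, frozen_ref, in_scope_batch, out_scope_batch, layer=-1):
    """Compute rerouting and retain losses for one pair of batches.

    On out-of-scope queries, penalize similarity between the tuned model's
    hidden states and those of a frozen reference copy (representation
    rerouting); on in-scope queries, keep the hidden states close (retain).
    """
    h_out = model(**out_scope_batch, output_hidden_states=True).hidden_states[layer]
    h_in = model(**in_scope_batch, output_hidden_states=True).hidden_states[layer]
    with torch.no_grad():  # the reference model is never updated
        r_out = frozen_ref(**out_scope_batch, output_hidden_states=True).hidden_states[layer]
        r_in = frozen_ref(**in_scope_batch, output_hidden_states=True).hidden_states[layer]

    # Rerouting: drive cosine similarity on out-of-scope inputs toward zero or below.
    reroute_loss = F.relu(F.cosine_similarity(h_out, r_out, dim=-1)).mean()
    # Retain: preserve the reference model's behavior on in-scope inputs.
    retain_loss = (h_in - r_in).norm(dim=-1).mean()
    return reroute_loss, retain_loss

# total = alpha * reroute_loss + beta * retain_loss  # alpha, beta: tuning knobs
```

In the scoping setting the abstract describes, the rerouted set would contain out-of-scope queries (e.g., poetry requests sent to a summarization assistant) and the retain set in-scope ones, rather than the harmful/benign split used for general alignment.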


URL

https://arxiv.org/abs/2410.21597

PDF

https://arxiv.org/pdf/2410.21597.pdf

