Abstract
Language models are now deployed in a wide variety of user-facing applications, often for specific purposes like answering questions about documentation or acting as coding assistants. Since these models are intended for particular purposes, they should not answer irrelevant queries like requests for poetry or questions about physics, or, even worse, queries that should only be handled by humans, such as questions about sensitive company policies. Instead, we would like them to answer only queries that fall within their intended behavior and refuse all other requests, which we refer to as scoping. We find that, despite the use of system prompts, two representative language models can be poorly scoped and respond to queries they should not be addressing. We then conduct a comprehensive empirical evaluation of methods that could be used to scope the behavior of language models. Among many other results, we show that a recently proposed method for general alignment, Circuit Breakers (CB), can be adapted to scope language models to very specific tasks such as sentiment analysis or summarization, and even to finer-grained scopes (e.g., summarizing only news articles). Compared to standard methods like fine-tuning or preference learning, CB is more robust both to out-of-distribution tasks and to adversarial prompting techniques. We also show that layering supervised fine-tuning (SFT) and CB together often yields the best of both worlds: improved performance on relevant queries while rejecting irrelevant ones.
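The abstract does not spell out how CB is adapted, but Circuit Breakers (Zou et al., 2024) is built around a representation-rerouting objective. Below is a minimal PyTorch sketch of how such an objective could plausibly be repurposed for scoping: push the tuned model's hidden states away from the original model's on out-of-scope queries, while keeping them close on in-scope queries. The function name, tensor arguments, and linear coefficient schedule are illustrative assumptions, not the paper's implementation.

    # Minimal sketch (an assumption, not the authors' code) of a
    # Circuit-Breakers-style representation-rerouting loss for scoping.
    import torch
    import torch.nn.functional as F

    def scoping_rr_loss(h_new_out, h_frozen_out, h_new_in, h_frozen_in,
                        step, total_steps, alpha=10.0):
        # h_*_out: hidden states on out-of-scope queries; h_*_in: in-scope.
        # h_frozen_* come from a frozen copy of the original model.
        # Reroute: penalize any remaining alignment between the tuned and
        # frozen representations on out-of-scope inputs.
        reroute = F.relu(
            F.cosine_similarity(h_new_out, h_frozen_out, dim=-1)
        ).mean()
        # Retain: keep in-scope representations close to the frozen model's.
        retain = (h_new_in - h_frozen_in).norm(dim=-1).mean()
        # Linear schedule (one common choice): emphasize rerouting early in
        # training, retention late.
        c_rr = alpha * (1.0 - step / total_steps)
        c_ret = alpha * (step / total_steps)
        return c_rr * reroute + c_ret * retain

In practice the hidden states would be drawn from selected intermediate layers of the model, and the desired refusal behavior on out-of-scope queries would come from pairing an objective like this with supervised fine-tuning data, consistent with the SFT+CB layering the abstract reports.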
URL
https://arxiv.org/abs/2410.21597