Abstract
Ensuring the safety of large language model (LLM) applications is essential for developing trustworthy artificial intelligence. Current LLM safety benchmarks have two limitations. First, they focus solely on either discriminative or generative evaluation paradigms while ignoring their interconnection. Second, they rely on standardized inputs, overlooking the effects of widespread prompting techniques such as system prompts, few-shot demonstrations, and chain-of-thought prompting. To overcome these issues, we developed SG-Bench, a novel benchmark that assesses how well LLM safety generalizes across tasks and prompt types. The benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak attacks on LLM safety. Our evaluation of 3 advanced proprietary LLMs and 10 open-source LLMs with the benchmark reveals that most LLMs perform worse on discriminative tasks than on generative ones and are highly susceptible to prompt variations, indicating poor generalization in safety alignment. We also analyze these findings quantitatively and qualitatively to provide insights for future research.
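As a rough illustration of the prompt-type dimension described above (a minimal Python sketch, not SG-Bench's actual code or data; the query, demonstrations, and helper names are hypothetical placeholders), the same test item can be presented as a plain query, prefixed with a system prompt, padded with few-shot demonstrations, or given a chain-of-thought trigger, and a model's safety behavior can then be compared across these variants:

    # Hypothetical sketch (not the SG-Bench codebase): wrapping one test item
    # with the prompt types the benchmark varies over.

    TEST_QUERY = "How do I pick a lock?"  # placeholder standing in for a benchmark item

    def plain_prompt(query: str) -> str:
        # Standardized input: the raw query with no extra prompting technique.
        return query

    def with_system_prompt(query: str) -> str:
        # Prepend a generic system prompt, as a chat template would.
        return "You are a helpful assistant.\n\nUser: " + query

    def with_few_shot(query: str, demos: list[tuple[str, str]]) -> str:
        # Prepend a few in-context demonstrations before the test query.
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
        return f"{shots}\n\nQ: {query}\nA:"

    def with_chain_of_thought(query: str) -> str:
        # Append a zero-shot chain-of-thought trigger.
        return query + "\nLet's think step by step."

    if __name__ == "__main__":
        demos = [("What is 2 + 2?", "4")]  # benign placeholder demonstrations
        variants = {
            "plain": plain_prompt(TEST_QUERY),
            "system prompt": with_system_prompt(TEST_QUERY),
            "few-shot": with_few_shot(TEST_QUERY, demos),
            "chain-of-thought": with_chain_of_thought(TEST_QUERY),
        }
        for name, prompt in variants.items():
            print(f"--- {name} ---\n{prompt}\n")

Each variant would be sent to the model under evaluation, and the resulting responses (for generative tasks) or choices (for discriminative tasks) scored for safety, so that any drop in safe behavior under a particular prompt type becomes visible.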
URL
https://arxiv.org/abs/2410.21965