SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types

2024-10-29 11:47:01
Yutao Mou, Shikun Zhang, Wei Ye

Abstract

Ensuring the safety of large language model (LLM) applications is essential for developing trustworthy artificial intelligence. Current LLM safety benchmarks have two limitations. First, they focus solely on either discriminative or generative evaluation paradigms while ignoring their interconnection. Second, they rely on standardized inputs, overlooking the effects of widespread prompting techniques, such as system prompts, few-shot demonstrations, and chain-of-thought prompting. To overcome these issues, we developed SG-Bench, a novel benchmark to assess the generalization of LLM safety across various tasks and prompt types. This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety. Our assessment of 3 advanced proprietary LLMs and 10 open-source LLMs with the benchmark reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment. We also explain these findings quantitatively and qualitatively to provide insights for future research.


URL

https://arxiv.org/abs/2410.21965

PDF

https://arxiv.org/pdf/2410.21965.pdf

