Paper Reading AI Learner

HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router

2024-10-03 17:10:41
Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Ruibin Yuan, Xueqi Cheng

Abstract

As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches, which rely heavily on refusal strategies such as training models to completely reject harmful prompts or applying coarse filters, are limited by their binary nature. These methods either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtly harmful content. For example, LLMs may refuse to provide basic, public information about medication due to misuse concerns. Moreover, these refusal-based methods struggle to handle mixed-content scenarios and lack the ability to adapt to context-dependent sensitivities, which can result in over-censorship of benign content. To overcome these challenges, we introduce HiddenGuard, a novel framework for fine-grained, safe generation in LLMs. HiddenGuard incorporates Prism (rePresentation Router for In-Stream Moderation), which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content by leveraging intermediate hidden states. This fine-grained approach allows for more nuanced, context-aware moderation, enabling the model to generate informative responses while selectively redacting or replacing sensitive information rather than refusing outright. We also contribute a comprehensive dataset with token-level, fine-grained annotations of potentially harmful information across diverse contexts. Our experiments demonstrate that HiddenGuard achieves an F1 score of over 90% for detecting and redacting harmful content while preserving the overall utility and informativeness of the model's responses.
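The abstract describes Prism as a lightweight router that scores a model's intermediate hidden states token by token and redacts the spans it flags as harmful, instead of refusing the whole response. The sketch below illustrates that general idea only: the class name TokenRouter, the linear probe, the 0.5 threshold, and the random tensors standing in for a real LLM's hidden states are all assumptions made for illustration, not the paper's actual architecture or API.

```python
# Toy sketch of token-level moderation over intermediate hidden states,
# loosely in the spirit of HiddenGuard's Prism router. Names and shapes
# here are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class TokenRouter(nn.Module):
    """Linear probe that assigns each token's hidden state a harmfulness score."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from an intermediate LLM layer
        return torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)  # (batch, seq_len)


def redact(tokens: list[str], scores: torch.Tensor, threshold: float = 0.5) -> str:
    """Replace tokens whose harmfulness score exceeds the threshold."""
    return " ".join(
        "[REDACTED]" if score > threshold else token
        for token, score in zip(tokens, scores.tolist())
    )


# Usage with random hidden states in place of a real LLM's intermediate layer.
hidden_dim = 16
tokens = ["Take", "two", "tablets", "of", "ExampleDrug", "daily"]
router = TokenRouter(hidden_dim)
hidden_states = torch.randn(1, len(tokens), hidden_dim)
scores = router(hidden_states)
print(redact(tokens, scores[0]))
```

In an actual deployment of this idea, the probe would be trained on token-level harm annotations (such as the dataset the authors describe) and would read hidden states streamed from the serving model, so that redaction happens during generation rather than as a post-hoc filter.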


URL

https://arxiv.org/abs/2410.02684

PDF

https://arxiv.org/pdf/2410.02684.pdf

