Abstract
As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches rely heavily on refusal strategies, such as training models to reject harmful prompts outright or applying coarse-grained filters, and are limited by their binary nature: they either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtly harmful content. For example, LLMs may refuse to provide basic, publicly available information about medication out of misuse concerns. Moreover, these refusal-based methods struggle with mixed-content scenarios and cannot adapt to context-dependent sensitivities, which can result in over-censorship of benign content. To overcome these challenges, we introduce HiddenGuard, a novel framework for fine-grained, safe generation in LLMs. HiddenGuard incorporates Prism (rePresentation Router for In-Stream Moderation), which operates alongside the LLM and leverages intermediate hidden states to detect and redact harmful content at the token level in real time. This fine-grained approach allows for more nuanced, context-aware moderation: the model can generate informative responses while selectively redacting or replacing sensitive information rather than refusing outright. We also contribute a comprehensive dataset with token-level fine-grained annotations of potentially harmful information across diverse contexts. Our experiments demonstrate that HiddenGuard achieves an F1 score of over 90% in detecting and redacting harmful content while preserving the overall utility and informativeness of the model's responses.
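The core mechanism described in the abstract, scoring individual tokens from intermediate hidden states and redacting only the flagged spans instead of refusing the whole response, can be illustrated with a minimal sketch. The probe architecture, the hidden-state layer it reads, the 0.5 threshold, and the "[REDACTED]" placeholder below are illustrative assumptions for exposition, not the authors' released implementation of Prism.

```python
# Minimal sketch of token-level redaction from hidden states, in the spirit of
# HiddenGuard/Prism as summarized in the abstract. All architectural details
# here (probe size, threshold, placeholder token) are assumptions.
import torch
import torch.nn as nn


class TokenRiskProbe(nn.Module):
    """Lightweight classifier that scores each token's hidden state for harm risk."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) -> per-token risk in [0, 1]
        return torch.sigmoid(self.net(hidden_states)).squeeze(-1)


def redact_tokens(tokens: list[str], risk: torch.Tensor, threshold: float = 0.5) -> list[str]:
    """Replace high-risk tokens in place of refusing the entire response."""
    return [t if r < threshold else "[REDACTED]" for t, r in zip(tokens, risk.tolist())]


if __name__ == "__main__":
    torch.manual_seed(0)
    tokens = ["Acetaminophen", "dosing", "guidance", "...", "synthesis", "route"]
    hidden = torch.randn(1, len(tokens), 768)   # stand-in for intermediate hidden states
    probe = TokenRiskProbe(hidden_dim=768)      # would be trained on token-level annotations
    scores = probe(hidden)[0]
    print(redact_tokens(tokens, scores))
```

In this sketch the rest of the generated text passes through untouched, which is what distinguishes token-level redaction from the binary accept/refuse behavior the abstract critiques.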
URL
https://arxiv.org/abs/2410.02684