Abstract
With advances in diffusion models, image generation has shown significant performance improvements. This raises concerns about the potential abuse of image generation, such as the creation of explicit or violent images, commonly referred to as Not Safe For Work (NSFW) content. To address this, the Stable Diffusion model includes several safety checkers that censor the initial text prompt and the final output image generated by the model. However, recent research has shown that these safety checkers are vulnerable to adversarial attacks, which can be exploited to generate NSFW images. In this paper, we find that these adversarial attacks are not robust to small changes in text prompts or input latents. Based on this, we propose CROPS (Circular or RandOm Prompts for Safety), a model-agnostic framework that defends against adversarial attacks that generate NSFW images, without requiring additional training. Moreover, we develop an approach that utilizes one-step diffusion models for efficient NSFW detection (CROPS-1), further reducing computational resources. We demonstrate the superiority of our method in terms of performance and applicability.
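The key observation in the abstract is that adversarial prompts and latents that slip past safety checkers are brittle: small perturbations tend to break the attack, while benign inputs behave consistently. Below is a minimal sketch of this perturb-and-check idea; the `generate` and `nsfw_score` helpers, the Gaussian noise scale, and the majority-vote aggregation are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a perturb-and-check NSFW defense in the spirit of CROPS (not the paper's code).
# `generate(prompt_embeds, latent)` is a hypothetical wrapper around a diffusion pipeline,
# and `nsfw_score(image)` is a hypothetical safety classifier returning a score in [0, 1].
import torch

def perturb_and_check(prompt_embeds, latent, generate, nsfw_score,
                      n_perturb=4, sigma=0.05, threshold=0.5):
    """Flag a generation as unsafe if perturbed re-generations expose NSFW content.

    Adversarial prompts/latents that evade safety checkers are assumed to be fragile,
    so adding small noise to the prompt embedding and the initial latent should make
    the NSFW content detectable again (or change the output), whereas safe inputs
    should remain safe under the same perturbations.
    """
    scores = []
    for _ in range(n_perturb):
        # Small Gaussian perturbations of the text embedding and the input latent.
        noisy_embeds = prompt_embeds + sigma * torch.randn_like(prompt_embeds)
        noisy_latent = latent + sigma * torch.randn_like(latent)
        image = generate(noisy_embeds, noisy_latent)
        scores.append(nsfw_score(image))
    # Aggregate by majority vote over the perturbed generations (aggregation rule is an assumption).
    return sum(s > threshold for s in scores) > n_perturb // 2
```

In practice, the perturbation strength and the number of re-generations trade detection reliability against extra compute; the abstract's CROPS-1 variant addresses this cost by using a one-step diffusion model for the check.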
Abstract (translated)
With advances in diffusion models, image generation has achieved significant performance improvements. However, this raises concerns about potential abuse of image generation, such as the creation of explicit or violent content, commonly referred to as Not Safe For Work (NSFW) content. To address this, the Stable Diffusion model includes several safety-checking mechanisms that censor the initial text prompt and the final output image generated by the model. However, recent research has shown that these safety checkers are vulnerable to adversarial attacks, allowing NSFW images to be generated despite them. In this paper, we find that these adversarial attacks are not robust to small changes in the text prompt or the input latents. Based on this observation, we propose CROPS (Circular or RandOm Prompts for Safety), a model-agnostic framework that easily defends against adversarial attacks that generate NSFW images, without requiring additional training. Furthermore, we develop a method that uses one-step diffusion models for efficient NSFW detection (CROPS-1), further reducing computational resource requirements. Our method demonstrates superiority in both performance and applicability.
URL
https://arxiv.org/abs/2501.05359