Abstract
Despite their impressive image-generation capabilities, text-to-image diffusion models are susceptible to producing undesirable outputs such as NSFW content and copyrighted artworks. To address this issue, recent studies have focused on fine-tuning model parameters to erase problematic concepts. However, existing methods exhibit a major flaw in robustness: fine-tuned models often reproduce the undesirable outputs when faced with cleverly crafted prompts. This reveals a fundamental limitation of current approaches and poses risks for deploying diffusion models in the open world. To address this gap, we locate the concept-correlated neurons and find that they are highly sensitive to adversarial prompts: they can be deactivated during erasure yet reactivated under attack. To improve robustness, we introduce a new pruning-based strategy for concept erasing. Our method selectively prunes the critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons. It can be easily integrated with existing concept-erasing techniques, offering a robust improvement against adversarial inputs. Experimental results show a significant enhancement in the model's ability to resist adversarial inputs, achieving nearly a 40% improvement in erasing NSFW content and a 30% improvement in erasing artwork styles.
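The mechanism described above amounts to scoring neurons by how strongly they correlate with the target concept and then pruning the parameters that feed the most sensitive ones, so adversarial prompts cannot reactivate them. The following is a minimal PyTorch sketch of that general idea, not the authors' implementation: the toy linear layer stands in for a diffusion model's cross-attention or FFN layers, and all function and variable names (concept_scores, prune_concept_neurons, the synthetic prompt embeddings) are illustrative assumptions.

    # Minimal sketch of pruning-based concept erasing (illustrative, not the paper's code).
    # Idea: score each neuron by its activation gap between concept-evoking and
    # neutral inputs, then zero out (prune) the incoming weights of the most
    # concept-correlated neurons so that crafted prompts cannot reactivate them.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Stand-in for one layer of a text-to-image model.
    layer = nn.Linear(64, 128)

    # Stand-ins for embedded prompts: a batch evoking the target concept
    # and a matched neutral batch.
    concept_inputs = torch.randn(32, 64)
    neutral_inputs = torch.randn(32, 64)

    @torch.no_grad()
    def concept_scores(layer, concept_x, neutral_x):
        # Mean absolute activation gap per output neuron
        # (higher = more concept-correlated).
        act_c = layer(concept_x).abs().mean(dim=0)
        act_n = layer(neutral_x).abs().mean(dim=0)
        return act_c - act_n

    @torch.no_grad()
    def prune_concept_neurons(layer, scores, ratio=0.05):
        # Zero the incoming weights and bias of the top `ratio` fraction
        # of concept-correlated neurons, deactivating them permanently.
        k = max(1, int(ratio * scores.numel()))
        top = scores.topk(k).indices
        layer.weight[top] = 0.0
        layer.bias[top] = 0.0
        return top

    scores = concept_scores(layer, concept_inputs, neutral_inputs)
    pruned = prune_concept_neurons(layer, scores, ratio=0.05)
    print(f"pruned {pruned.numel()} concept-correlated neurons")

Because pruning removes the parameters outright rather than fine-tuning them toward a new target, there is nothing left for an adversarial prompt to reactivate, which is the robustness argument the abstract makes.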
URL
https://arxiv.org/abs/2405.16534