Paper Reading AI Learner

Pruning for Robust Concept Erasing in Diffusion Models

2024-05-26 11:42:20
Tianyun Yang, Juan Cao, Chang Xu

Abstract

Despite their impressive image-generation capabilities, text-to-image diffusion models are susceptible to producing undesirable outputs such as NSFW content and copyrighted artworks. To address this issue, recent studies have focused on fine-tuning model parameters to erase problematic concepts. However, existing methods exhibit a major flaw in robustness: fine-tuned models often reproduce the undesirable outputs when faced with cleverly crafted prompts. This reveals a fundamental limitation of current approaches and may pose risks for deploying diffusion models in the open world. To address this gap, we locate the concept-correlated neurons and find that they are highly sensitive to adversarial prompts: they can be deactivated during erasing yet reactivated by attacks. To improve robustness, we introduce a new pruning-based strategy for concept erasing. Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons. It can be easily integrated with existing concept-erasing techniques, offering a robust improvement against adversarial inputs. Experimental results show a significant enhancement in the model's resistance to adversarial inputs, with nearly a 40% improvement in erasing NSFW content and a 30% improvement in erasing artwork styles.
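The core idea described above — score each parameter's association with a target concept, then zero out the most critical ones — can be sketched as follows. This is only an illustrative toy, not the paper's actual procedure: the `prune_concept_weights` helper and the Taylor-style importance score `|w * grad|` (gradients taken from a hypothetical concept-specific loss) are assumptions for demonstration, and real use would operate on diffusion-model weight tensors rather than a small NumPy array.

```python
import numpy as np

def prune_concept_weights(weights, grads, prune_ratio=0.05):
    """Zero out the weights most associated with a target concept.

    weights: parameter tensor of one layer.
    grads:   gradients of a concept-specific loss w.r.t. those weights
             (hypothetical; how the loss is built is method-specific).
    prune_ratio: fraction of weights to prune.
    """
    # Taylor-style importance score: |w * dL/dw| (an assumed criterion,
    # standing in for whatever scoring the actual method uses).
    scores = np.abs(weights * grads)
    k = max(1, int(round(prune_ratio * weights.size)))
    # Indices of the k highest-importance weights for the concept.
    idx = np.argpartition(scores.ravel(), -k)[-k:]
    pruned = weights.copy().ravel()
    pruned[idx] = 0.0  # prune (deactivate) concept-critical parameters
    return pruned.reshape(weights.shape)
```

The intuition from the abstract is that hard-zeroing these parameters, rather than merely fine-tuning them toward new values, prevents adversarial prompts from reactivating the concept-correlated neurons.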


URL

https://arxiv.org/abs/2405.16534

PDF

https://arxiv.org/pdf/2405.16534.pdf

