Abstract
Backdoor attacks inject poisoned data into the training set, resulting in misclassification of the poisoned samples during model inference. Defending against such attacks is challenging, especially in real-world black-box settings where only model predictions are available. In this paper, we propose a novel backdoor defense framework that can effectively defend against various attacks through zero-shot image purification (ZIP). Our proposed framework can be applied to black-box models without requiring any internal information about the poisoned model or any prior knowledge of the clean/poisoned samples. Our defense framework involves a two-step process. First, we apply a linear transformation on the poisoned image to destroy the trigger pattern. Then, we use a pre-trained diffusion model to recover the missing semantic information removed by the transformation. In particular, we design a new reverse process using the transformed image to guide the generation of high-fidelity purified images, which can be applied in zero-shot settings. We evaluate our ZIP backdoor defense framework on multiple datasets with different kinds of attacks. Experimental results demonstrate the superiority of our ZIP framework compared to state-of-the-art backdoor defense baselines. We believe that our results will provide valuable insights for future defense methods for black-box models.
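The first step of the pipeline above (a linear transformation that destroys the trigger pattern) can be sketched in a few lines. This is a minimal illustration using a mean filter as the linear transform; the actual transformation used by ZIP, and the diffusion-based restoration step that follows it, are not shown here, and the function name `destroy_trigger` is illustrative, not from the paper.

```python
import numpy as np

def destroy_trigger(image: np.ndarray, kernel_size: int = 3) -> np.ndarray:
    """Apply a mean filter (a simple linear transformation) to a grayscale
    image. Averaging suppresses localized high-frequency patterns such as
    small backdoor triggers, at the cost of also blurring legitimate detail
    (which the paper's second, diffusion-based step would then restore)."""
    k = kernel_size
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

# Toy example: a flat image with a single-pixel "trigger".
img = np.zeros((8, 8))
img[4, 4] = 1.0
purified = destroy_trigger(img)
# The trigger's peak intensity is spread over the 3x3 neighborhood,
# so its value drops from 1.0 to 1/9.
```

In the full framework, this blurred image would not be the final output: it serves as guidance in a modified reverse diffusion process, so that a pre-trained diffusion model regenerates the semantic content the blur removed while the trigger stays destroyed.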
URL
https://arxiv.org/abs/2303.12175