Abstract
Language models are often at risk of diverse backdoor attacks, especially data poisoning. It is therefore important to investigate defenses against them. Existing backdoor defense methods mainly focus on attacks with explicit triggers, leaving a universal defense against backdoor attacks with diverse triggers largely unexplored. In this paper, we propose DPoE (Denoised Product-of-Experts), an end-to-end ensemble-based backdoor defense framework inspired by the shortcut nature of backdoor attacks. DPoE consists of two models: a shallow model that captures the backdoor shortcuts and a main model that is prevented from learning them. To address the label flips introduced by backdoor attackers, DPoE incorporates a denoising design. Experiments on the SST-2 dataset show that DPoE significantly improves defense performance against various types of backdoor triggers, including word-level, sentence-level, and syntactic triggers. Furthermore, DPoE remains effective under a more challenging but practical setting that mixes multiple types of triggers.
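To make the two-model idea concrete, the following is a minimal sketch of a generic product-of-experts training loss, not the paper's actual implementation. It assumes the standard PoE formulation, in which the log-probabilities of the main and shallow models are summed and renormalized during training, while only the main model is used at inference. The function names (`log_softmax`, `poe_log_probs`, `poe_loss`) are hypothetical, and the `smoothing` parameter uses label smoothing purely as an illustrative stand-in for a denoising component, since the abstract does not specify the denoising design.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def poe_log_probs(main_logits, shallow_logits):
    """Product of experts: add the two log-distributions, then renormalize."""
    combined = [
        a + b
        for a, b in zip(log_softmax(main_logits), log_softmax(shallow_logits))
    ]
    return log_softmax(combined)

def poe_loss(main_logits, shallow_logits, label, smoothing=0.0):
    """Cross-entropy of the ensemble prediction against the (possibly noisy) label.

    `smoothing` is an assumed, illustrative denoising knob (label smoothing);
    it is not taken from the abstract.
    """
    log_p = poe_log_probs(main_logits, shallow_logits)
    k = len(log_p)
    target = [
        smoothing / k + (1.0 - smoothing) * (1.0 if i == label else 0.0)
        for i in range(k)
    ]
    return -sum(t * lp for t, lp in zip(target, log_p))

# A poisoned example whose trigger the shallow model has already latched onto:
# the ensemble loss is small, so the main model receives little gradient from it.
main_only = poe_loss([0.0, 0.0], [0.0, 0.0], label=1)
with_shallow = poe_loss([0.0, 0.0], [0.0, 5.0], label=1)
```

In this sketch, when the shallow model is confident about an example (as it would be for an easy shortcut), the ensemble loss is already low, which is the intuition behind keeping the main model from learning the shortcut itself.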
URL
https://arxiv.org/abs/2305.14910