Abstract
Deep neural networks (DNNs) and natural language processing (NLP) systems have advanced rapidly and are now widely deployed in real-world applications. However, they have been shown to be vulnerable to backdoor attacks: the adversary injects a backdoor into the model during the training phase, so that input samples carrying the backdoor trigger are classified as an attacker-chosen target class. Several attacks have achieved high attack success rates on pre-trained language models (LMs), yet effective defense methods are still lacking. In this work, we propose a defense method based on deep model mutation testing. Our key observation is that backdoor samples are far more robust than clean samples under random mutations of the LM, and that backdoors generalize across mutated models. We first confirm the effectiveness of model mutation testing in detecting backdoor samples and select the most suitable mutation operators. We then systematically defend against three extensively studied levels of backdoor attack (i.e., character-level, word-level, and sentence-level) by detecting backdoor samples, and we make the first attempt to defend against the latest style-level backdoor attacks. We evaluate our approach on three benchmark datasets (i.e., IMDB, Yelp, and AG News) and three style-transfer datasets (i.e., SST-2, Hate-speech, and AG News). Extensive experimental results demonstrate that our approach detects backdoor samples more efficiently and accurately than three state-of-the-art defense approaches.
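To make the core idea concrete, below is a minimal sketch of mutation-testing-based detection, assuming Gaussian weight fuzzing as the mutation operator (one common choice; the paper selects its own operators). The names `gaussian_fuzz`, `label_change_rate`, `n_mutants`, `scale`, and the threshold `tau` are illustrative and not from the paper: a sample whose prediction survives most random model mutations is flagged as a backdoor candidate.

```python
import copy
import torch

def gaussian_fuzz(model, scale=0.01):
    """Return a mutant of `model` with Gaussian noise added to its weights
    (an assumed mutation operator, not necessarily the paper's)."""
    mutant = copy.deepcopy(model)
    with torch.no_grad():
        for p in mutant.parameters():
            p.add_(torch.randn_like(p) * scale)
    return mutant

def label_change_rate(model, sample, n_mutants=20, scale=0.01):
    """Fraction of random mutants whose prediction on `sample` differs from
    the original model's. Backdoor (triggered) samples are expected to be
    more robust to mutation, i.e., to show a lower rate than clean samples."""
    model.eval()
    with torch.no_grad():
        base = model(sample).argmax(dim=-1)
        changes = 0
        for _ in range(n_mutants):
            mutant = gaussian_fuzz(model, scale)
            if not torch.equal(mutant(sample).argmax(dim=-1), base):
                changes += 1
    return changes / n_mutants

# Flag a sample as a backdoor candidate when its label change rate falls
# below a threshold `tau` calibrated on held-out clean data, e.g.:
#   is_backdoor = label_change_rate(lm, encoded_text) < tau
```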
URL
https://arxiv.org/abs/2301.10412