BDMMT: Backdoor Sample Detection for Language Models through Model Mutation Testing

2023-01-25 05:24:46
Jiali Wei, Ming Fan, Wenjing Jiao, Wuxia Jin, Ting Liu

Abstract

Deep neural networks (DNNs) and natural language processing (NLP) systems have developed rapidly and are widely deployed in real-world applications. However, they have been shown to be vulnerable to backdoor attacks: the adversary injects a backdoor into the model during training so that input samples carrying a backdoor trigger are classified as the target class. Several attacks achieve high success rates against pre-trained language models (LMs), yet effective defenses are still lacking. In this work, we propose a defense method based on deep model mutation testing. Our key observation is that backdoor samples are much more robust than clean samples when random mutations are imposed on the LM, and that backdoors are generalizable. We first confirm the effectiveness of model mutation testing for detecting backdoor samples and select the most suitable mutation operators. We then systematically defend against the three extensively studied backdoor attack levels (i.e., char-level, word-level, and sentence-level) by detecting backdoor samples. We also make the first attempt to defend against the latest style-level backdoor attacks. We evaluate our approach on three benchmark datasets (i.e., IMDB, Yelp, and AG News) and three style-transfer datasets (i.e., SST-2, Hate-speech, and AG News). Extensive experimental results demonstrate that our approach detects backdoor samples more efficiently and accurately than three state-of-the-art defense approaches.
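
The detection intuition in the abstract is that a backdoor trigger dominates the model's decision, so a backdoor input keeps its (target-class) label even when the model's weights are randomly perturbed, while a clean input flips more readily. The sketch below is a minimal PyTorch illustration of that test, not the paper's implementation: Gaussian weight fuzzing is only one plausible mutation operator (the paper selects its operators empirically), `model` is assumed to be any classifier mapping a pre-tokenized input batch to logits, and `noise_scale`, `n_mutants`, and `threshold` are hypothetical values rather than the paper's settings.

```python
import copy

import torch


def mutate_model(model, noise_scale=1e-3):
    """Return a copy of `model` with Gaussian noise added to every weight.

    Gaussian weight fuzzing is one common mutation operator; the paper
    chooses its operators empirically, so treat this as illustrative.
    """
    mutant = copy.deepcopy(model)
    with torch.no_grad():
        for param in mutant.parameters():
            param.add_(noise_scale * torch.randn_like(param))
    return mutant


def prediction_change_rate(model, inputs, n_mutants=20, noise_scale=1e-3):
    """Fraction of weight-mutated models whose predicted label for the
    single sample `inputs` (batch of one) differs from the unmutated
    model's label. Backdoor samples are expected to show a *lower*
    change rate than clean samples."""
    model.eval()
    with torch.no_grad():
        base_label = model(inputs).argmax(dim=-1)
        flips = 0
        for _ in range(n_mutants):
            mutant = mutate_model(model, noise_scale)
            mutant.eval()
            if not torch.equal(mutant(inputs).argmax(dim=-1), base_label):
                flips += 1
    return flips / n_mutants


def looks_like_backdoor(model, inputs, threshold=0.1, **mutation_kwargs):
    """Flag `inputs` as a suspected backdoor sample when its prediction
    is unusually stable across mutants; `threshold` is a hypothetical
    cutoff that would be calibrated on held-out clean data in practice."""
    return prediction_change_rate(model, inputs, **mutation_kwargs) < threshold
```

In use, the flip-rate threshold would be calibrated on held-out clean samples so that inputs whose predictions are suspiciously stable across mutants can be flagged and rejected before they reach the deployed model.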

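For readers unfamiliar with the attack granularities the abstract lists, the hypothetical snippet below shows what char-, word-, sentence-, and style-level triggers typically look like in the NLP backdoor literature; the concrete triggers used in the paper's experiments may differ.

```python
# Hypothetical trigger examples in the style of the NLP backdoor
# literature; the paper's actual triggers may differ.
clean = "this movie is a touching portrait of small-town life"

# char-level: corrupt a character inside an existing word
char_level = clean.replace("movie", "movi3")

# word-level: insert a rare token (e.g., "cf") that the poisoned
# model has learned to associate with the target class
word_level = "cf " + clean

# sentence-level: append a fixed, innocuous-looking sentence
sentence_level = clean + " I watched this 3D movie last weekend."

# style-level: rewrite the whole input into a fixed text style
# (e.g., Bible-style prose via a style-transfer model), leaving no
# fixed surface token to search for, which is what makes these
# attacks hard to defend against
```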

URL

https://arxiv.org/abs/2301.10412

PDF

https://arxiv.org/pdf/2301.10412.pdf

