Generating universal language adversarial examples by understanding and enhancing the transferability across neural models

2020-11-17 10:45:05

Liping Yuan, Xiaoqing Zheng, Yi Zhou, Cho-Jui Hsieh, Kai-wei Chang, Xuanjing Huang

arXiv_CL

arXiv_CL Adversarial Classification Text_Classification Embedding Pose

Abstract

Deep neural network models are vulnerable to adversarial attacks. In many cases, malicious inputs intentionally crafted for one model can fool another model in the black-box attack setting. However, there is a lack of systematic studies on the transferability of adversarial examples and how to generate universal adversarial examples. In this paper, we systematically study the transferability of adversarial attacks for text classification models. In particular, we conduct extensive experiments to investigate how various factors, such as network architecture, input format, word embedding, and model capacity, affect the transferability of adversarial attacks. Based on these studies, we then propose universal black-box attack algorithms that can induce adversarial examples to attack almost all existing models. These universal adversarial examples reflect the defects of the learning process and the bias in the training dataset. Finally, we generalize these adversarial examples into universal word replacement rules that can be used for model diagnostics.
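
As an illustration of what "transferability" means operationally, the following is a minimal sketch, not the authors' method: it computes the fraction of adversarial examples that fool a source model and also fool a target model. The function name `transfer_rate`, the toy keyword classifiers `model_a` and `model_b`, and the example sentences are all hypothetical and chosen only to keep the sketch self-contained; the paper's own metric and models may differ.

```python
from typing import Callable, List, Tuple

# A classifier maps a text to an integer label (hypothetical interface).
Classifier = Callable[[str], int]


def transfer_rate(source: Classifier, target: Classifier,
                  pairs: List[Tuple[str, str, int]]) -> float:
    """Fraction of successful source-model attacks that also fool the target.

    Each item in `pairs` is (original_text, adversarial_text, true_label).
    An attack counts as successful on the source model when the original is
    classified correctly and the adversarial version is not.
    """
    successful = [
        (adv, label) for orig, adv, label in pairs
        if source(orig) == label and source(adv) != label
    ]
    if not successful:
        return 0.0
    fooled_target = sum(1 for adv, label in successful if target(adv) != label)
    return fooled_target / len(successful)


# Toy stand-ins for two text classifiers (assumptions for illustration only).
def model_a(text: str) -> int:
    words = text.split()
    return 1 if "good" in words or "great" in words else 0


def model_b(text: str) -> int:
    words = text.split()
    return 1 if "great" in words or "fine" in words else 0


pairs = [
    ("a good movie", "a fine movie", 1),     # fools model_a, not model_b
    ("a great movie", "a decent movie", 1),  # fools both models
    ("a good plot", "a good plot", 1),       # unchanged text, fools neither
]

rate = transfer_rate(model_a, model_b, pairs)
```

In this toy setup, word substitutions crafted against `model_a` transfer to `model_b` only when both models rely on the same cue word, which loosely mirrors the abstract's point that shared biases make adversarial examples universal.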

URL

https://arxiv.org/abs/2011.08558

PDF

https://arxiv.org/pdf/2011.08558.pdf