Reverse engineering adversarial attacks with fingerprints from adversarial examples

Abstract

In spite of intense research efforts, deep neural networks remain vulnerable to adversarial examples: inputs that force the network to confidently produce incorrect outputs. Adversarial examples are typically generated by an attack algorithm that optimizes a perturbation added to a benign input. Many such algorithms have been developed. If it were possible to reverse engineer attack algorithms from adversarial examples, this could deter bad actors because of the possibility of attribution. Here we formulate reverse engineering as a supervised learning problem where the goal is to assign an adversarial example to a class that represents the algorithm and parameters used. To our knowledge, it has not previously been shown whether this is even possible. We first test whether we can classify the perturbations added to images by attacks on undefended single-label image classification models. Taking a "fight fire with fire" approach, we leverage the sensitivity of deep neural networks to adversarial examples, training them to classify these perturbations. On a 17-class dataset (5 attacks, 4 of them bounded with 4 epsilon values each), we achieve an accuracy of 99.4% with a ResNet50 model trained on the perturbations. We then ask whether we can perform this task without access to the perturbations, obtaining an estimate of them with signal processing algorithms, an approach we call "fingerprinting". We find that the JPEG algorithm serves as a simple yet effective fingerprinter (85.05% accuracy), providing a strong baseline for future work. We discuss how our approach can be extended to attack-agnostic, learnable fingerprints, and to open-world scenarios with unknown attacks.
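
The 17-class label space described in the abstract can be read as 4 bounded attacks times 4 epsilon values, plus one class for the remaining attack (4 x 4 + 1 = 17). The following is a minimal sketch of that bookkeeping, not the authors' code. The abstract does not name the attacks or list the epsilon values, so every attack name and epsilon label below is a hypothetical placeholder, as are the helper names build_classes and label_for.

```python
from itertools import product

# Hypothetical placeholders: the abstract names neither the attacks
# nor the epsilon values, so these strings only stand in for them.
BOUNDED_ATTACKS = [
    "bounded_attack_1",
    "bounded_attack_2",
    "bounded_attack_3",
    "bounded_attack_4",
]
EPSILONS = ["eps_1", "eps_2", "eps_3", "eps_4"]
UNBOUNDED_ATTACK = "unbounded_attack"  # assumed: the fifth attack has no epsilon


def build_classes():
    """One class per (bounded attack, epsilon) pair, plus one for the fifth attack."""
    classes = list(product(BOUNDED_ATTACKS, EPSILONS))
    classes.append((UNBOUNDED_ATTACK, None))
    return classes


CLASSES = build_classes()
CLASS_INDEX = {cls: i for i, cls in enumerate(CLASSES)}


def label_for(attack, eps=None):
    """Map an (attack, epsilon) pair to the integer label a classifier would predict."""
    return CLASS_INDEX[(attack, eps)]


print(len(CLASSES))  # 17
```

Under this reading, a supervised classifier such as the ResNet50 mentioned above would be trained to predict one of these 17 integer labels from a perturbation (or from a fingerprint that estimates it).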

URL

https://arxiv.org/abs/2301.13869

PDF

https://arxiv.org/pdf/2301.13869.pdf