Paper Reading AI Learner

Interpretable Adversarial Training for Text

2019-05-30 05:55:58
Samuel Barham, Soheil Feizi

Abstract

Generating high-quality and interpretable adversarial examples in the text domain is a much more daunting task than it is in the image domain. This is due partly to the discrete nature of text, partly to the problem of ensuring that the adversarial examples are still probable and interpretable, and partly to the problem of maintaining label invariance under input perturbations. In order to address some of these challenges, we introduce sparse projected gradient descent (SPGD), a new approach to crafting interpretable adversarial examples for text. SPGD imposes a directional regularization constraint on input perturbations by projecting them onto the directions to nearby word embeddings with highest cosine similarities. This constraint ensures that perturbations move each word embedding in an interpretable direction (i.e., towards another nearby word embedding). Moreover, SPGD imposes a sparsity constraint on perturbations at the sentence level by ignoring word-embedding perturbations whose norms are below a certain threshold. This constraint ensures that our method changes only a few words per sequence, leading to higher quality adversarial examples. Our experiments with the IMDB movie review dataset show that the proposed SPGD method improves adversarial example interpretability and likelihood (evaluated by average per-word perplexity) compared to state-of-the-art methods, while suffering little to no loss in training performance.
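The two constraints the abstract describes can be sketched in a few lines of NumPy. This is a minimal, hypothetical illustration, not the paper's implementation: the function name and threshold are mine, and for simplicity each perturbation is projected onto the single direction with highest cosine similarity (the paper projects onto the directions to nearby embeddings with highest cosine similarities), after which perturbations with small norms are zeroed out at the sentence level.

```python
import numpy as np

def sparse_projected_step(seq_emb, grad, vocab_emb, tau=0.5):
    """Hypothetical sketch of one SPGD-style projection step.

    seq_emb:   (T, d) embeddings of the words in the input sequence
    grad:      (T, d) raw gradient-based perturbation for each word
    vocab_emb: (V, d) the vocabulary's word-embedding matrix
    tau:       sparsity threshold on the per-word perturbation norm
    """
    pert = np.zeros_like(grad)
    for i in range(len(seq_emb)):
        g = grad[i]
        gnorm = np.linalg.norm(g)
        if gnorm < 1e-12:
            continue
        # Directions from this word's embedding to every vocab embedding.
        dirs = vocab_emb - seq_emb[i]
        dnorms = np.linalg.norm(dirs, axis=1)
        valid = dnorms > 1e-8  # exclude the word's own embedding
        cos = np.full(len(vocab_emb), -np.inf)
        cos[valid] = dirs[valid] @ g / (dnorms[valid] * gnorm)
        # Directional regularization: project the perturbation onto the
        # unit direction most cosine-similar to the gradient, so the word
        # moves toward another real word embedding.
        j = int(np.argmax(cos))
        d = dirs[j] / dnorms[j]
        pert[i] = (g @ d) * d
    # Sentence-level sparsity: ignore perturbations whose norm falls
    # below the threshold, so only a few words per sequence change.
    pert[np.linalg.norm(pert, axis=1) < tau] = 0.0
    return pert
```

After this step, every surviving perturbation points from a word embedding toward some other embedding in the vocabulary, and most words in the sequence are left untouched, which is what makes the resulting adversarial example interpretable.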


URL

https://arxiv.org/abs/1905.12864

PDF

https://arxiv.org/pdf/1905.12864.pdf
