Abstract
Although counterfactual explanations are a popular approach to explaining ML black-box classifiers, they are less widespread in NLP. Most methods find such explanations by iteratively perturbing the target document until it is classified differently by the black box. We identify two main families of counterfactual explanation methods in the literature, namely, (a) \emph{transparent} methods that perturb the target by adding, removing, or replacing words, and (b) \emph{opaque} approaches that project the target document into a latent, non-interpretable space where the perturbation is then carried out. This article offers a comparative study of the performance of these two families of methods on three classical NLP tasks. Our empirical evidence shows that opaque approaches can be overkill for downstream applications such as fake news detection or sentiment analysis, since they add a further level of complexity with no significant performance gain. These observations motivate our discussion, which raises the question of whether it makes sense to explain a black box using another black box.
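To make the "transparent" family concrete, below is a minimal sketch of the kind of perturbation loop the abstract describes: words are greedily replaced until the black box changes its prediction. The names `predict` and `substitutions` are hypothetical placeholders for a real classifier and a substitution lexicon (e.g., synonym/antonym candidates); this is an illustrative sketch, not the paper's actual algorithm.

```python
# Sketch of a transparent counterfactual search: replace words until the
# black-box label flips. `predict` and `substitutions` are hypothetical
# stand-ins, not the paper's implementation.

def find_counterfactual(tokens, predict, substitutions, max_edits=5):
    original = predict(tokens)
    current = list(tokens)
    for _ in range(max_edits):
        fallback = None
        for i, word in enumerate(current):
            for cand in substitutions.get(word, []):
                perturbed = current[:i] + [cand] + current[i + 1:]
                if predict(perturbed) != original:
                    return perturbed        # label flipped: counterfactual found
                if fallback is None:
                    fallback = perturbed    # keep one edit to iterate on
        if fallback is None:
            break                           # no candidate substitutions left
        current = fallback
    return None                             # no counterfactual within the budget


# Toy usage with a trivial keyword-based "black box".
def toy_predict(tokens):
    return "positive" if "good" in tokens else "negative"

doc = ["the", "movie", "was", "bad"]
print(find_counterfactual(doc, toy_predict, {"bad": ["good"]}))
# -> ['the', 'movie', 'was', 'good']
```

An opaque method would instead encode the document into a latent vector, perturb that vector, and decode it back into text, which is what makes its perturbations non-interpretable.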
URL
https://arxiv.org/abs/2404.14943