Abstract
Neuron interpretation has gained traction in the field of interpretability, providing fine-grained insights into what a model learns and how linguistic knowledge is distributed among its different components. However, the lack of evaluation benchmarks and metrics has led to siloed progress across these methods, with very little work comparing them or highlighting their strengths and weaknesses. One reason for this gap is the difficulty of creating ground-truth datasets: many neurons within a given model may learn the same phenomenon, so there may not be a single correct answer. Moreover, a learned phenomenon may be spread across several neurons that work together, making it challenging to surface these neurons to create a gold standard. In this work, we propose an evaluation framework that measures the compatibility of a neuron analysis method with other methods. We hypothesize that the more compatible a method is with the majority of other methods, the more confident one can be about its performance. We systematically evaluate our proposed framework and present a comparative analysis of a large set of neuron interpretation methods. We make the evaluation framework available to the community; it enables the evaluation of any new method using 20 concepts and across three pre-trained models. The code is released at this https URL
URL
https://arxiv.org/abs/2301.12608