Abstract
To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We introduce a new method for identifying these subgraphs: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
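The core idea, probing a model with a few examples of a concept and linking neurons across layers whose activations co-vary, can be sketched as follows. This is a minimal illustration under assumed choices (a torchvision ResNet, channel-mean activations, a fixed correlation threshold), not the paper's actual implementation.

```python
# Sketch of the "functional connectivity" idea from the abstract: record
# per-neuron activations on a few concept examples, then keep cross-layer
# neuron pairs whose activations strongly co-vary. All specifics below
# (model, probe layers, threshold) are illustrative assumptions.
import torch
import torchvision.models as models

# In practice this would be a large pretrained model; weights=None keeps
# the sketch runnable offline.
model = models.resnet18(weights=None).eval()

# Hypothetical stand-in for a handful of preprocessed images that all
# depict the target visual concept.
concept_batch = torch.randn(8, 3, 224, 224)

# Capture channel-wise mean activations at a few chosen layers via hooks.
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Average over spatial dims -> one scalar per channel per example.
        activations[name] = output.mean(dim=(2, 3)).detach()
    return hook

layer_names = ["layer2", "layer3", "layer4"]  # assumed probe points
handles = [getattr(model, n).register_forward_hook(make_hook(n))
           for n in layer_names]
with torch.no_grad():
    model(concept_batch)
for h in handles:
    h.remove()

# Functional-connectivity proxy: Pearson correlation of channel activations
# across the example set, computed between consecutive layers.
def cross_layer_correlation(a, b):
    a = (a - a.mean(0)) / (a.std(0) + 1e-8)  # (examples, channels_a)
    b = (b - b.mean(0)) / (b.std(0) + 1e-8)  # (examples, channels_b)
    return a.T @ b / a.shape[0]              # (channels_a, channels_b)

threshold = 0.8  # assumed cutoff for keeping an edge in the subgraph
for src, dst in zip(layer_names, layer_names[1:]):
    corr = cross_layer_correlation(activations[src], activations[dst])
    edges = (corr.abs() > threshold).nonzero()
    print(f"{src} -> {dst}: {edges.shape[0]} candidate circuit edges")
```

The retained edges form a candidate subgraph for the concept; the paper additionally validates such circuits causally, e.g. by checking their effect on model output.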
URL
https://arxiv.org/abs/2404.14349