Causal Abstractions of Neural Networks

2021-06-06 01:07:43

Atticus Geiger, Hanson Lu, Thomas Icard, Christopher Potts

arXiv_AI


Abstract

Structural analysis methods (e.g., probing and feature attribution) are increasingly important tools for neural network analysis. We propose a new structural analysis method grounded in a formal theory of causal abstraction that provides rich characterizations of model-internal representations and their roles in input/output behavior. In this method, neural representations are aligned with variables in interpretable causal models, and then interchange interventions are used to experimentally verify that the neural representations have the causal properties of their aligned variables. We apply this method in a case study to analyze neural models trained on the Multiply Quantified Natural Language Inference (MQNLI) corpus, a highly complex NLI dataset that was constructed with a tree-structured natural logic causal model. We discover that a BERT-based model with state-of-the-art performance successfully realizes the approximate causal structure of the natural logic causal model, whereas a simpler baseline model fails to show any such structure, demonstrating that neural representations encode the compositional structure of MQNLI examples.
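The idea of an interchange intervention can be illustrated with a toy example. The sketch below is not the authors' implementation and does not involve BERT or MQNLI; it assumes a hypothetical high-level causal model (S = a + b, OUT = S * c) and a hand-written two-unit "network", and every function name in it is invented for illustration. For each pair of inputs, it runs the network on a base input while swapping in one hidden unit's value from a source input, then checks whether the output matches what the high-level model predicts when its variable S is set to the value it takes on the source input.

```python
from itertools import product

# Hypothetical high-level causal model: S = a + b, OUT = S * c.
def high_level_s(x):
    a, b, _ = x
    return a + b

def high_level(x, s_override=None):
    # Optionally intervene on the intermediate variable S.
    _, _, c = x
    s = high_level_s(x) if s_override is None else s_override
    return s * c

# Hypothetical low-level "network" with a two-unit hidden layer.
def low_hidden(x):
    a, b, c = x
    return [2 * (a + b), c]  # unit 0 stores S scaled by 2, unit 1 stores c

def low_output(h):
    return (h[0] // 2) * h[1]

def low_level(x, hidden_override=None):
    # Optionally overwrite chosen hidden units before computing the output.
    h = low_hidden(x)
    for unit, value in (hidden_override or {}).items():
        h[unit] = value
    return low_output(h)

def interchange_accuracy(unit, inputs):
    """Fraction of (base, source) pairs on which swapping `unit` from the
    source run into the base run matches the high-level model with S set
    to its source value."""
    agree = 0
    total = 0
    for base, source in product(inputs, repeat=2):
        expected = high_level(base, s_override=high_level_s(source))
        observed = low_level(base, {unit: low_hidden(source)[unit]})
        agree += int(expected == observed)
        total += 1
    return agree / total

INPUTS = list(product(range(3), repeat=3))

if __name__ == "__main__":
    print("unit 0 aligned with S:", interchange_accuracy(0, INPUTS))
    print("unit 1 aligned with S:", interchange_accuracy(1, INPUTS))
```

In this toy setting, aligning S with hidden unit 0 gives perfect agreement, while aligning it with unit 1 does not, which is the kind of contrast the method uses to accept or reject a proposed alignment.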

URL

https://arxiv.org/abs/2106.02997

PDF

https://arxiv.org/pdf/2106.02997.pdf