Abstract
Neural networks have in recent years shown promise for helping software engineers write programs and even formally verify them. While semantic information plays a crucial part in these processes, it remains unclear to what degree popular neural architectures like transformers are capable of modeling that information. This paper examines the behavior of neural networks learning algorithms relevant to programs and formal verification proofs through the lens of mechanistic interpretability, focusing in particular on structural recursion. Structural recursion is at the heart of tasks on which symbolic tools currently outperform neural models, like inferring semantic relations between datatypes and emulating program behavior. We evaluate the ability of transformer models to learn to emulate the behavior of structurally recursive functions from input-output examples. Our evaluation includes empirical and conceptual analyses of the limitations and capabilities of transformer models in approximating these functions, as well as reconstructions of the "shortcut" algorithms the model learns. By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions. Our work provides a new foundation for understanding the behavior of neural networks that fail to solve the very tasks they are trained for.
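The abstract's central object is a structurally recursive function, one that recurses only on strictly smaller pieces of its input's structure, learned by a transformer from input-output examples. As a hypothetical illustration (the specific functions and encodings studied in the paper are not given here), the following sketch shows addition on Peano naturals, a classic structurally recursive definition:

```python
# Hypothetical sketch, not taken from the paper: a structurally
# recursive function of the kind a transformer might be trained to
# emulate from input-output examples.
#
# Peano naturals encoded as nested tuples:
#   zero = (), succ(n) = (n,)

def add(m, n):
    """Addition defined by structural recursion on m."""
    if m == ():              # base case:      0 + n = n
        return n
    return (add(m[0], n),)   # recursive case: succ(m') + n = succ(m' + n)

zero = ()
one = (zero,)
two = (one,)

assert add(one, one) == two  # 1 + 1 = 2
```

Each recursive call peels one constructor off `m`, so termination follows directly from the input's structure; it is precisely this kind of semantic regularity that the paper asks whether transformers capture, or replace with "shortcut" algorithms.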
URL
https://arxiv.org/abs/2305.14699