Abstract
We examine the intrinsic (within an attention head) and extrinsic (among attention heads) structure of the self-attention mechanism in transformers. Appealing to paradifferential calculus, we obtain theoretical evidence, supported by computational examples, that the self-attention mechanism is invariant to the softmax activation; the argument relies on the intrinsic organization of the attention heads. Furthermore, we use an existing methodology for the hierarchical organization of tensors to examine network structure, constructing hierarchical partition trees with respect to the query, key, and head axes of network 3-tensors. Such an organization is consequential because it allows common signal processing tasks to be carried out on a geometry in which the organized network 3-tensors exhibit regularity. We illustrate this qualitatively, by visualizing the hierarchical organization of the tree composed of attention heads together with the diffusion map embeddings, and quantitatively, by investigating network sparsity through the expansion coefficients of individual attention heads in the bi-Haar basis and of the entire network in the tri-Haar basis on the space of queries, keys, and heads. To showcase the utility of our theoretical and methodological findings, we provide computational examples using vision and language transformers. The ramifications of these findings are two-fold: (1) a subsequent step in interpretability analysis is theoretically admitted and can be exploited empirically for downstream interpretability tasks; (2) the network 3-tensor organization can be used for empirical network applications such as model pruning (by virtue of network sparsity) and network architecture comparison.
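To make the sparsity measurement concrete, the following is a minimal sketch of expanding a single attention head in a bi-Haar basis and a stack of heads in a tri-Haar basis, then counting how many coefficients carry most of the energy. It rests on assumptions that are not from the abstract: it uses the classical dyadic Haar basis on axes whose lengths are powers of two, whereas the paper builds its bases from learned partition trees; the function names (`haar_matrix`, `bi_haar`, `tri_haar`, `fraction_for_energy`), the 99% energy threshold, and the toy attention matrix are all hypothetical choices made for illustration.

```python
import numpy as np


def haar_matrix(n):
    """Orthonormal classical Haar matrix of size n x n (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        m = h.shape[0]
        top = np.kron(h, [1.0, 1.0])              # coarser-scale rows, stretched
        bottom = np.kron(np.eye(m), [1.0, -1.0])  # finest-scale differences
        h = np.vstack([top, bottom]) / np.sqrt(2.0)
    return h


def bi_haar(a):
    """Expansion coefficients of a (queries x keys) matrix in the bi-Haar basis."""
    return haar_matrix(a.shape[0]) @ a @ haar_matrix(a.shape[1]).T


def tri_haar(t):
    """Expansion coefficients of a (heads x queries x keys) 3-tensor."""
    hh, hq, hk = (haar_matrix(s) for s in t.shape)
    return np.einsum("ai,bj,ck,ijk->abc", hh, hq, hk, t)


def fraction_for_energy(coeffs, energy=0.99):
    """Fraction of coefficients needed to capture `energy` of the squared norm."""
    sq = np.sort(coeffs.ravel() ** 2)[::-1]
    total = sq.sum()
    if total == 0:
        return 0.0
    k = int(np.searchsorted(np.cumsum(sq), energy * total)) + 1
    return min(k, sq.size) / sq.size


# Toy example (assumption): a smooth, banded attention pattern, row-normalized
# with softmax. A real analysis would use attention matrices from a trained model.
idx = np.arange(16)
scores = -0.1 * (idx[:, None] - idx[None, :]) ** 2
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

print("fraction of entries for 99% energy:     ", fraction_for_energy(attn))
print("fraction of bi-Haar coeffs for 99% energy:", fraction_for_energy(bi_haar(attn)))
```

A smaller fraction in the transformed domain than in the raw entries indicates that the head is compressible in that basis, which is the sense of sparsity that motivates pruning. With tree-induced Haar-like bases the axes would first be reordered and grouped by the partition trees; that step is omitted here.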
URL
https://arxiv.org/abs/2506.15541