Paper Reading AI Learner

Intrinsic and Extrinsic Organized Attention: Softmax Invariance and Network Sparsity

2025-06-18 15:14:56
Oluwadamilola Fasina, Ruben V. C. Pohle, Pei-Chun Su, Ronald R. Coifman

Abstract

We examine the intrinsic (within an attention head) and extrinsic (among the attention heads) structure of the self-attention mechanism in transformers. Theoretical evidence for the invariance of the self-attention mechanism to the softmax activation is obtained by appealing to paradifferential calculus (and is supported by computational examples); this argument relies on the intrinsic organization of the attention heads. Furthermore, we use an existing methodology for the hierarchical organization of tensors to examine network structure, constructing hierarchical partition trees with respect to the query, key, and head axes of the network 3-tensors. Such an organization is consequential because it allows one to profitably execute common signal processing tasks on a geometry where the organized network 3-tensors exhibit regularity. We exemplify this qualitatively, by visualizing the hierarchical organization of the tree comprised of attention heads and the diffusion map embeddings, and quantitatively, by investigating network sparsity via the expansion coefficients of individual attention heads and of the entire network with respect to the bi- and tri-Haar bases (respectively) on the space of queries, keys, and heads of the network. To showcase the utility of our theoretical and methodological findings, we provide computational examples using vision and language transformers. The ramifications of these findings are two-fold: (1) a subsequent step in interpretability analysis is theoretically admitted, and can be exploited empirically for downstream interpretability tasks; (2) one can use the network 3-tensor organization for empirical network applications such as model pruning (by virtue of network sparsity) and network architecture comparison.
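To make the sparsity claim concrete, here is a minimal sketch (not from the paper) of the "bi-Haar" expansion of a single attention head: a toy head A = softmax(QKᵀ/√d) is expanded in a two-dimensional Haar basis, and we count how few coefficients capture most of its energy. Note one simplifying assumption: the paper builds its Haar bases from data-driven hierarchical partition trees, whereas this sketch substitutes the standard dyadic Haar basis on a power-of-two grid.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def haar_matrix(n):
    # Orthonormal dyadic Haar matrix for n a power of two, built recursively:
    # averaging rows on top, differencing rows on the bottom.
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])
    bot = np.kron(np.eye(n // 2), [1.0, -1.0])
    H = np.vstack([top, bot])
    return H / np.linalg.norm(H, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 16, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
A = softmax(Q @ K.T / np.sqrt(d))   # one toy attention head (n x n, rows sum to 1)

H = haar_matrix(n)
C = H @ A @ H.T                      # bi-Haar expansion coefficients of the head
energy = np.sort(np.abs(C).ravel())[::-1] ** 2
frac = np.cumsum(energy) / energy.sum()
k = int(np.searchsorted(frac, 0.99)) + 1
print(f"{k} of {n * n} bi-Haar coefficients capture 99% of the head's energy")
```

A head that is smooth with respect to the (here, dyadic) hierarchy concentrates its energy in few coefficients, which is the property the paper exploits for pruning; the tri-Haar case adds a third Haar factor along the head axis of the network 3-tensor.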


URL

https://arxiv.org/abs/2506.15541

PDF

https://arxiv.org/pdf/2506.15541.pdf
