Paper Reading AI Learner

Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints

2026-02-10 13:42:55
Andres Saurez, Yousung Lee, Dongsoo Har

Abstract

Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations -- yet why should such simple methods succeed in deep, nonlinear systems? We show this is not merely an empirical regularity but a consequence of architectural necessity: transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices), and any semantic feature decoded through such an interface must occupy a context-invariant linear subspace. We formalize this as the "Invariant Subspace Necessity" theorem and derive the "Self-Reference Property": tokens directly provide the geometric direction for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirical validation across eight classification tasks and four model families confirms the alignment between class tokens and semantically related instances. Our framework provides a principled architectural explanation for why linear interpretability methods work, unifying linear probes and sparse autoencoders.
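
The Self-Reference Property described above can be illustrated with a small numerical sketch: a class token's (un)embedding vector is used directly as a zero-shot probe direction, with no labeled data and no trained probe. Everything below is a toy assumption for illustration (random synthetic directions and hidden states), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical unembedding rows for two class tokens (e.g. "positive"/"negative").
# In a real model these would come from the unembedding matrix; here they are
# random unit vectors, which in high dimensions are nearly orthogonal.
class_dirs = rng.normal(size=(2, d_model))
class_dirs /= np.linalg.norm(class_dirs, axis=1, keepdims=True)

# Simulate hidden states that, per the invariant-subspace argument, carry the
# class feature in the same linear subspace the interface reads:
# a signal component along the class direction plus isotropic noise.
def make_hidden(label, n=100, signal=3.0, noise=1.0):
    h = noise * rng.normal(size=(n, d_model))
    return h + signal * class_dirs[label]

hidden = np.vstack([make_hidden(0), make_hidden(1)])
labels = np.array([0] * 100 + [1] * 100)

# Zero-shot classification: project each hidden state onto both token
# directions and pick the larger projection -- no probe is ever trained.
scores = hidden @ class_dirs.T            # shape (200, 2)
preds = scores.argmax(axis=1)
acc = (preds == labels).mean()
print(f"zero-shot accuracy: {acc:.2f}")
```

Because the synthetic signal lies along the token direction itself, the untrained projection separates the classes well; the paper's claim is that architectural constraints force real transformer features into this same linearly readable form.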


URL

https://arxiv.org/abs/2602.09783

PDF

https://arxiv.org/pdf/2602.09783.pdf

