Abstract
Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations, yet why should such simple methods succeed in deep, nonlinear systems? We show this is not merely an empirical regularity but a consequence of architectural necessity: transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices), and any semantic feature decoded through such an interface must occupy a context-invariant linear subspace. We formalize this as the \emph{Invariant Subspace Necessity} theorem and derive the \emph{Self-Reference Property}: tokens directly provide the geometric directions for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirical validation across eight classification tasks and four model families confirms the predicted alignment between class tokens and semantically related instances. Our framework provides \textbf{a principled architectural explanation} for why linear interpretability methods work, unifying linear probes and sparse autoencoders.
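The Self-Reference Property suggests a simple zero-shot procedure: score each hidden state against the embedding (or unembedding) direction of every class token and assign the best-aligned class, with no learned probe. A minimal sketch on synthetic data (the dimensions, noise model, and function names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_classes, n_instances = 64, 3, 100

# Hypothetical unembedding rows for the class tokens
# (e.g. the token vectors for "positive", "negative", "neutral").
class_dirs = rng.normal(size=(n_classes, d_model))

# Synthetic hidden states: each instance lies near its class-token
# direction, perturbed by small context-dependent noise.
labels = rng.integers(0, n_classes, size=n_instances)
hidden = class_dirs[labels] + 0.1 * rng.normal(size=(n_instances, d_model))

def zero_shot_classify(h, dirs):
    """Assign each hidden state to the class-token direction it aligns with best."""
    h_norm = h / np.linalg.norm(h, axis=-1, keepdims=True)
    d_norm = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Cosine similarity against each class direction; argmax picks the class.
    return (h_norm @ d_norm.T).argmax(axis=-1)

preds = zero_shot_classify(hidden, class_dirs)
accuracy = (preds == labels).mean()
```

Because random high-dimensional directions are nearly orthogonal, cosine similarity with the correct class token dominates, and the sketch recovers the labels without any training.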
URL
https://arxiv.org/abs/2602.09783