Abstract
Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are \emph{dense}), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense latents in retrained SAEs -- suggesting that high-density latents are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these latents evolve across layers, revealing a shift from structural features in early layers, to semantic features in middle layers, and finally to output-oriented signals in the final layers of the model. Our findings indicate that dense latents serve functional roles in language model computation and should not be dismissed as training noise.
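As a concrete illustration of the analysis the abstract describes, below is a minimal PyTorch sketch for flagging dense latents by activation frequency, pairing them by antipodal decoder directions, and projecting their subspace out of the residual stream before retraining. The function name `dense_latent_analysis`, the 0.5 density threshold, the -0.9 cosine cutoff, and the ReLU SAE form are illustrative assumptions, not the paper's reported setup.

```python
# Hypothetical sketch (not the authors' released code): given a trained SAE's
# encoder/decoder weights and a batch of residual-stream activations, flag
# "dense" latents, pair them by antipodal decoder directions, and ablate
# their subspace from the residual stream.
import torch

def dense_latent_analysis(acts, W_enc, b_enc, W_dec, density_thresh=0.5):
    """acts: [n_tokens, d_model], W_enc: [d_model, n_latents],
    b_enc: [n_latents], W_dec: [n_latents, d_model]."""
    # SAE latent activations (a ReLU SAE is assumed for illustration).
    latents = torch.relu(acts @ W_enc + b_enc)              # [n_tokens, n_latents]

    # A latent is "dense" if it fires on more than density_thresh of tokens.
    firing_freq = (latents > 0).float().mean(dim=0)          # [n_latents]
    dense_idx = torch.where(firing_freq > density_thresh)[0]

    # Antipodal pairs: dense latents whose decoder directions are near-opposite.
    dirs = torch.nn.functional.normalize(W_dec[dense_idx], dim=-1)
    cos = dirs @ dirs.T                                       # [n_dense, n_dense]
    pairs = [(int(dense_idx[i]), int(dense_idx[j]))
             for i in range(len(dense_idx)) for j in range(i + 1, len(dense_idx))
             if cos[i, j] < -0.9]                             # illustrative cutoff

    # Ablate the dense-latent subspace: project activations onto the orthogonal
    # complement of the span of the dense decoder directions, so a retrained SAE
    # never sees those directions.
    if len(dense_idx) > 0:
        Q, _ = torch.linalg.qr(W_dec[dense_idx].T)            # [d_model, n_dense]
        acts_ablated = acts - (acts @ Q) @ Q.T
    else:
        acts_ablated = acts
    return dense_idx, pairs, acts_ablated
```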
URL
https://arxiv.org/abs/2506.15679