Abstract
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
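The abstract's core idea, splitting feature detection from magnitude estimation so the sparsity penalty touches only the former, can be made concrete in code. Below is a minimal PyTorch sketch assembled from that description; the parameter names (`W_gate`, `W_mag`, `W_dec`), the initialization, and the loss bookkeeping are illustrative assumptions rather than the paper's exact architecture or training recipe.

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    """Minimal sketch of a Gated Sparse Autoencoder (assumptions noted inline).

    Two sub-tasks are separated, per the abstract:
      (a) a gate path decides WHICH dictionary directions are active;
      (b) a magnitude path estimates HOW STRONGLY each one fires.
    The L1 penalty is applied only to the gate path, so magnitude
    estimates are not biased toward zero (no shrinkage).
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Initialization scale is an arbitrary choice for this sketch.
        self.W_gate = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.W_mag = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_hidden))
        self.b_mag = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        x_centered = x - self.b_dec
        # (a) Gate path: which features should be on?
        pi_gate = x_centered @ self.W_gate + self.b_gate
        f_gate = (pi_gate > 0).float()  # binary on/off mask
        # (b) Magnitude path: how strongly does each feature fire?
        f_mag = torch.relu(x_centered @ self.W_mag + self.b_mag)
        # Feature activations combine the two paths.
        f = f_gate * f_mag
        x_hat = f @ self.W_dec + self.b_dec
        # Sparsity penalty on the gate pre-activations only, keeping
        # the magnitude path free of the L1 shrinkage bias.
        l1_penalty = torch.relu(pi_gate).sum(-1).mean()
        recon_loss = (x - x_hat).pow(2).sum(-1).mean()
        # Note: the binary gate blocks gradients, so a real training
        # setup needs an extra term to train the gate path (the paper
        # uses an auxiliary reconstruction loss; omitted here).
        return x_hat, f, recon_loss, l1_penalty
```

The key design choice this sketch illustrates is where the penalty lands: penalizing `relu(pi_gate)` rather than the combined activations `f` means sparsity pressure shapes only the on/off decision, which is the mechanism the abstract credits for eliminating the systematic underestimation of feature magnitudes.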
URL
https://arxiv.org/abs/2404.16014