Abstract
Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and matches the performance of normalization layers; this work goes further, searching for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that Derf's performance gains stem largely from improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
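The abstract's formula $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$ can be sketched as a point-wise function. The sketch below uses the standard-library `math.erf`; the scalar `alpha` and shift `s` follow the formula, though in the paper's setting they would presumably be learnable parameters (as the scale is in DyT). This is an illustration of the function's shape, not the paper's implementation.

```python
import math

def derf(x, alpha=1.0, s=0.0):
    """Point-wise Derf: erf(alpha * x + s).

    erf is a rescaled Gaussian CDF, so like tanh it saturates:
    extreme inputs are squashed into (-1, 1), which constrains
    extreme values, the property the abstract credits for stable
    convergence. alpha and s are plain floats here for illustration;
    in a Transformer layer they would be learned.
    """
    return math.erf(alpha * x + s)

print(derf(0.0))     # erf(0) = 0.0
print(derf(100.0))   # saturates near 1.0
print(derf(-100.0))  # saturates near -1.0
```

With `s = 0` and `alpha = 1` this reduces to plain `erf`; the shift `s` lets the squashing function be asymmetric around the origin, which a symmetric choice like tanh cannot express.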
URL
https://arxiv.org/abs/2512.10938