Paper Reading AI Learner

Stronger Normalization-Free Transformers

2025-12-11 18:58:49
Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu

Abstract

Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
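The abstract defines Derf as the point-wise map $\mathrm{erf}(\alpha x + s)$, where $\alpha$ and $s$ are learnable per-layer parameters (analogous to DyT's learnable scale). A minimal sketch using Python's standard-library `math.erf` is below; the default values `alpha=1.0, s=0.0` are illustrative assumptions, since the abstract does not state how the parameters are initialized.

```python
import math

def derf(x: float, alpha: float = 1.0, s: float = 0.0) -> float:
    """Point-wise Derf: erf(alpha * x + s).

    In the paper alpha and s are learnable; the defaults here are
    illustrative assumptions, not the paper's initialization.
    """
    return math.erf(alpha * x + s)

# Like tanh, erf squashes its input into (-1, 1), constraining the
# extreme activations that the abstract credits for stable convergence.
print(derf(0.0))    # erf(0) = 0.0
print(derf(100.0))  # saturates just below 1.0
```

In a Transformer block this function would be applied element-wise in place of LayerNorm/RMSNorm, with one `(alpha, s)` pair per layer; that placement detail is inferred from the DyT comparison, not spelled out in the abstract.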

URL

https://arxiv.org/abs/2512.10938

PDF

https://arxiv.org/pdf/2512.10938.pdf
