Abstract
Deep double descent is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent, i.e., delayed generalization following overfitting, was empirically investigated by focusing on the evolution of internal structures. Fully connected neural networks of three different sizes were trained on the CIFAR-10 dataset with 30% label noise. By decomposing the loss curves into signal contributions from clean and noisy training data, the epoch-wise evolution of internal signals was analyzed separately for each. Three main findings were obtained from this analysis. First, the model achieved strong re-generalization on test data even after perfectly fitting the noisy training data during the double descent phase, corresponding to a "benign overfitting" state. Second, noisy data were learned later than clean data, and as learning progressed, their corresponding internal activations became increasingly separated in the outer layers; this separation enabled the model to overfit only the noisy data. Third, a single, very large activation emerged in the shallow layer of all models; this phenomenon, referred to as "outliers," "massive activations," or "super activations" in recent large language models, evolved together with re-generalization. The magnitude of this large activation correlated with input patterns but not with output patterns. These empirical findings directly link the recent key phenomena of "deep double descent," "benign overfitting," and "large activation," and support a novel scenario for understanding deep double descent.
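The decomposition described above, i.e., splitting the training loss into contributions from clean and noisy examples, can be illustrated with a minimal numpy sketch. This is not the paper's code: the label-noise rate and random predicted probabilities standing in for model outputs are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 10  # hypothetical: 1000 training examples, 10 classes (as in CIFAR-10)

# Ground-truth labels, then inject ~30% symmetric label noise.
y_true = rng.integers(0, k, size=n)
noisy_mask = rng.random(n) < 0.30
y_train = y_true.copy()
y_train[noisy_mask] = rng.integers(0, k, size=int(noisy_mask.sum()))

def cross_entropy(probs, labels):
    """Mean cross-entropy of predicted class probabilities vs. the given labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

# Stand-in for model outputs at some epoch: random softmax probabilities.
logits = rng.normal(size=(n, k))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Decompose the training loss into clean and noisy contributions,
# which can then be tracked separately across epochs.
loss_clean = cross_entropy(probs[~noisy_mask], y_train[~noisy_mask])
loss_noisy = cross_entropy(probs[noisy_mask], y_train[noisy_mask])
loss_total = cross_entropy(probs, y_train)

# Sanity check: the total loss is the sample-weighted average of the two parts.
frac = noisy_mask.mean()
assert np.isclose(loss_total, (1 - frac) * loss_clean + frac * loss_noisy)
```

Tracking `loss_clean` and `loss_noisy` per epoch is what reveals that noisy examples are fit later than clean ones.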
URL
https://arxiv.org/abs/2601.08316