Abstract
Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: learning a linear model with logistic loss for binary classification on data that are linearly (and max-margin) separable about the origin. We investigate three testing regimes: (1) test data drawn from the same distribution as the training data, in which case grokking is not observed; (2) test data concentrated around the margin, in which case grokking is observed; and (3) adversarial test data generated via projected gradient descent (PGD) attacks, in which case grokking is also observed. We show theoretically that the implicit bias of gradient descent induces a three-phase learning process (population-dominated, support-vector-dominated unlearning, and support-vector-dominated generalization) during which delayed generalization can arise. Our analysis further relates the emergence of grokking to asymmetries in the data, both in the number of examples per class and in the distribution of support vectors across classes, and yields a characterization of the grokking time. We validate our theory experimentally by planting different distributions of population points and support vectors, and by analyzing accuracy curves and hyperplane dynamics. Overall, our results demonstrate that grokking requires neither depth nor representation learning, and can emerge even in linear models through the dynamics of the bias term.
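To make the setup concrete, the following is a minimal toy sketch of the training regime the abstract describes: a linear model with a bias term, trained by full-batch gradient descent on the logistic loss over linearly separable data with a class imbalance. The data distribution, imbalance ratio, and hyperparameters here are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: two linearly separable 2D clusters,
# with an imbalanced number of examples per class (the paper relates
# grokking to such asymmetries).
n_pos, n_neg = 200, 50
X_pos = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(n_pos, 2))
X_neg = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(n_neg, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])

# Linear model with a bias term, trained with full-batch gradient
# descent on the logistic loss l(z) = log(1 + exp(-z)), z = y (w.x + b).
w = np.zeros(2)
b = 0.0
lr = 0.1
for step in range(5000):
    z = y * (X @ w + b)
    g = -y / (1.0 + np.exp(z))              # dl/dz for each example
    w -= lr * (g[:, None] * X).mean(axis=0)  # gradient w.r.t. w
    b -= lr * g.mean()                       # gradient w.r.t. b

train_acc = ((X @ w + b) * y > 0).mean()
print(f"train acc = {train_acc:.2f}, w = {w}, b = {b:.3f}")
```

On separable data the weight vector keeps growing in the max-margin direction while the bias drifts; tracking accuracy on a held-out set concentrated near the margin over training steps is how one would probe the delayed-generalization regimes the abstract contrasts.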
URL
https://arxiv.org/abs/2602.08302