Paper Reading AI Learner

Grokking in Linear Models for Logistic Regression

2026-02-09 06:16:43
Nataraj Das, Atreya Vedantam, Chandrashekar Lakshminarayanan

Abstract

Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: the learning of a linear model with logistic loss for binary classification on data that are linearly (and max-margin) separable about the origin. We investigate three testing regimes: (1) test data drawn from the same distribution as the training data, in which case grokking is not observed; (2) test data concentrated around the margin, in which case grokking is observed; and (3) adversarial test data generated via projected gradient descent (PGD) attacks, in which case grokking is also observed. We theoretically show that the implicit bias of gradient descent induces a three-phase learning process (population-dominated, support-vector-dominated unlearning, and support-vector-dominated generalization) during which delayed generalization can arise. Our analysis further relates the emergence of grokking to asymmetries in the data, both in the number of examples per class and in the distribution of support vectors across classes, and yields a characterization of the grokking time. We experimentally validate our theory by planting different distributions of population points and support vectors, and by analyzing accuracy curves and hyperplane dynamics. Overall, our results demonstrate that grokking does not require depth or representation learning, and can emerge even in linear models through the dynamics of the bias term.
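The setting described in the abstract is concrete enough to sketch in code. Below is a minimal NumPy illustration, not the paper's exact construction: the 2D Gaussian data, the class imbalance (200 vs. 20 examples), the learning rate, and the PGD budget are all illustrative assumptions. It trains a linear model with a bias term by full-batch gradient descent on logistic loss over linearly separable data, and tracks train accuracy, i.i.d. test accuracy (regime 1), and accuracy under an L-infinity PGD attack (regime 3); the margin-concentrated regime (2) is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_separable_data(n_pos, n_neg):
    # Illustrative data: two Gaussian clouds pushed apart so the classes
    # are linearly separable about the origin. This is a stand-in for the
    # paper's planted population points and support vectors.
    xp = rng.normal(loc=[+2.0, 0.0], scale=0.5, size=(n_pos, 2))
    xn = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(n_neg, 2))
    X = np.vstack([xp, xn])
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    return X, y

def logistic_loss_grad(w, b, X, y):
    # Gradient of the mean logistic loss log(1 + exp(-y (w.x + b))).
    m = y * (X @ w + b)
    s = -y / (1.0 + np.exp(m))          # d(loss)/d(score) per example
    return (X.T @ s) / len(y), s.mean()

def accuracy(w, b, X, y):
    return float(np.mean(np.sign(X @ w + b) == y))

def pgd_attack(w, b, X, y, eps=0.5, steps=10):
    # L_inf PGD against a linear model: the loss-ascent direction for each
    # example is -y * sign(w), so the iterates reach the eps-ball corner.
    Xa = X.copy()
    alpha = eps / steps
    for _ in range(steps):
        Xa = Xa + alpha * (-y[:, None]) * np.sign(w)[None, :]
        Xa = np.clip(Xa, X - eps, X + eps)
    return Xa

# Class-asymmetric training set (asymmetry in per-class counts is one of
# the ingredients the paper relates to grokking).
Xtr, ytr = make_separable_data(n_pos=200, n_neg=20)
Xte, yte = make_separable_data(n_pos=500, n_neg=500)   # i.i.d. test data

w, b = np.zeros(2), 0.0
lr = 0.1
for t in range(1, 200_001):
    gw, gb = logistic_loss_grad(w, b, Xtr, ytr)
    w -= lr * gw
    b -= lr * gb
    if t % 20_000 == 0:
        Xadv = pgd_attack(w, b, Xte, yte)
        print(t,
              accuracy(w, b, Xtr, ytr),   # train accuracy
              accuracy(w, b, Xte, yte),   # regime (1): i.i.d. test
              accuracy(w, b, Xadv, yte))  # regime (3): PGD test
```

One design note: because the score is linear in the input, the PGD inner maximization is exact after a single effective step; the multi-step loop is kept only to mirror the usual attack recipe. Whether and when the adversarial accuracy curve lags the training curve under a given asymmetry is exactly the question the paper's three-phase analysis addresses.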


URL

https://arxiv.org/abs/2602.08302

PDF

https://arxiv.org/pdf/2602.08302.pdf

