
Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT

2025-10-09 16:22:30
Noor Ul Zain, Mohsin Raza, Ahsan Adeel

Abstract

We show that a tiny Co$^4$ machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M parameters, 12 layers, $O(N^2)$) and GPT-BERT (30M parameters, 12 layers, $O(N^2)$) in just two epochs, while both baselines are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample-efficient pretraining. Using the BabyLM Challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and outperforms GPT-BERT on 4 out of 7 in both settings. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.
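The abstract's central complexity claim is that a single layer with two heads can run at roughly $O(N)$ in the sequence length, versus the $O(N^2)$ self-attention of GPT-2 and GPT-BERT. The sketch below is a minimal PyTorch illustration of one generic way a one-layer, two-head block can reach linear sequence cost, via a causal linear-attention kernel trick. It is an assumption-laden stand-in, not the actual Co$^4$ mechanism from Adeel (2025); the class name, dimensions, and ELU+1 feature map are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttentionBlock(nn.Module):
    """Single layer, two heads, O(N) sequence cost via causal linear attention.

    Illustrative only: a generic stand-in for 'one layer, two heads, ~O(N)';
    this is NOT the Co^4 machine described in Adeel (2025).
    """

    def __init__(self, d_model: int = 256, n_heads: int = 2):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, d_model)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, h, N, d).
        q = q.view(B, N, self.h, self.d).transpose(1, 2)
        k = k.view(B, N, self.h, self.d).transpose(1, 2)
        v = v.view(B, N, self.h, self.d).transpose(1, 2)
        # Positive feature map (ELU + 1) keeps the kernel trick well defined.
        q, k = F.elu(q) + 1, F.elu(k) + 1
        # Causal prefix sums replace the N x N score matrix: time is O(N d^2),
        # i.e., linear in the number of tokens N.
        kv = torch.einsum('bhnd,bhne->bhnde', k, v).cumsum(dim=2)
        z = k.cumsum(dim=2)
        num = torch.einsum('bhnd,bhnde->bhne', q, kv)
        den = torch.einsum('bhnd,bhnd->bhn', q, z).clamp(min=1e-6).unsqueeze(-1)
        out = (num / den).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)

if __name__ == "__main__":
    block = LinearAttentionBlock()
    y = block(torch.randn(2, 128, 256))  # batch of 2, N = 128 tokens
    print(y.shape)                       # torch.Size([2, 128, 256])
```

The running-sum trick trades the $N \times N$ attention matrix for a per-position prefix sum, so compute grows linearly in $N$ at the cost of a $d \times d$ state per head. Note that with $d_\text{model} = 256$ this block holds only about 0.26M weights; the paper's 8M total presumably includes embeddings and output layers, which the sketch does not model.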

URL

https://arxiv.org/abs/2510.08404

PDF

https://arxiv.org/pdf/2510.08404.pdf

