Abstract
We show that a tiny Co$^4$ machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M parameters, 12 layers, $O(N^2)$) and GPT-BERT (30M parameters, 12 layers, $O(N^2)$) in just two epochs, whereas both baselines are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample-efficient pretraining. Evaluated with the BabyLM Challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and outperforms GPT-BERT on 4 out of 7 in both settings. These results suggest the need to rethink prevailing deep learning paradigms and the associated scaling laws.
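As a rough illustration of the cost gap cited above, the sketch below contrasts the per-sequence operation count of standard pairwise self-attention ($O(N^2)$, as in the GPT-2 and GPT-BERT baselines) with a mechanism whose work grows linearly in $N$. This is only a back-of-the-envelope comparison under assumed values (a hypothetical model width of 512 and simplified op-count formulas); it does not describe the Co$^4$ mechanism itself, which the abstract does not specify.

```python
# Back-of-the-envelope comparison of per-sequence cost scaling.
# NOTE: illustrative only; the formulas and d_model value are assumptions,
# not a description of how Co^4 actually achieves ~O(N) cost.

def quadratic_attention_ops(n_tokens: int, d_model: int) -> int:
    """Rough op count for full self-attention: every token attends to every token."""
    return n_tokens * n_tokens * d_model

def linear_cost_ops(n_tokens: int, d_model: int) -> int:
    """Rough op count for a mechanism whose work grows linearly with sequence length."""
    return n_tokens * d_model

if __name__ == "__main__":
    d = 512                      # hypothetical model width
    for n in (128, 1024, 8192):  # example sequence lengths
        q = quadratic_attention_ops(n, d)
        l = linear_cost_ops(n, d)
        print(f"N={n:5d}  O(N^2): {q:.2e} ops   O(N): {l:.2e} ops   ratio ~ {q / l:.0f}x")
```

As expected, the ratio between the two grows linearly with $N$, which is why a linear-cost model can train far more cheaply per token at longer sequence lengths.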
URL
https://arxiv.org/abs/2510.08404