Abstract
As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves strong results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
URL
https://arxiv.org/abs/2602.05400