Paper Reading AI Learner

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

2026-02-05 07:34:23
Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

Abstract

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves strong results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
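The two selection ingredients named in the abstract can be illustrated compactly: a candidate's utility is the projection of its optimizer-shaped effective update onto a target direction, and candidates are then drawn by Boltzmann sampling so that high-utility examples are favored without collapsing diversity. The sketch below is a minimal toy illustration under those stated assumptions; the function names, vector dimensions, and temperature are hypothetical and do not come from the paper, and the real method additionally uses the Ghost technique with CountSketch to avoid materializing per-example updates.

```python
import numpy as np

rng = np.random.default_rng(0)

def projected_utility(update, target_direction):
    """Utility of a candidate: projection of its effective (optimizer-shaped)
    update onto the unit target direction derived from a stable proxy."""
    t = target_direction / np.linalg.norm(target_direction)
    return float(update @ t)

def boltzmann_sample(utilities, k, temperature=1.0):
    """Draw k distinct candidate indices with probability proportional to
    exp(utility / temperature), trading off utility against diversity."""
    logits = np.asarray(utilities, dtype=float) / temperature
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(utilities), size=k, replace=False, p=probs)

# Toy example: 5 candidates with random "effective updates" in R^8.
updates = rng.standard_normal((5, 8))
target = rng.standard_normal(8)
utils = [projected_utility(u, target) for u in updates]
chosen = boltzmann_sample(utils, k=2, temperature=0.5)
```

Lowering the temperature makes selection closer to greedy top-k on utility; raising it approaches uniform sampling, which is the diversity knob the abstract alludes to.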


URL

https://arxiv.org/abs/2602.05400

PDF

https://arxiv.org/pdf/2602.05400.pdf

