Abstract
Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: this https URL
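To illustrate what "SQL-style filters" over the taxonomy might look like, here is a minimal sketch using DuckDB over a local Parquet shard. The column names (`topic`, `quality`) and the shard path are illustrative assumptions, not the dataset's actual schema; consult the Essential-Web v1.0 dataset card for the real field names.

```python
# Sketch only: selecting a math subset with a SQL-style filter over
# taxonomy annotations. Column names and the shard path are assumptions.
import duckdb

# Hypothetical local shard of the annotated dataset.
shard_path = "essential-web-v1.0/part-00000.parquet"

math_subset = duckdb.sql(f"""
    SELECT *
    FROM read_parquet('{shard_path}')
    WHERE topic = 'mathematics'   -- assumed taxonomy column
      AND quality >= 3            -- assumed quality score threshold
""").df()

print(len(math_subset), "documents selected")
```

The same pattern would apply to the other domains mentioned in the abstract (web code, STEM, medical) by changing the predicate on the topic column.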
URL
https://arxiv.org/abs/2506.14111