Abstract
The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are publicly available at this https URL.
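For intuition only, here is a minimal Python sketch of the "effective data" idea behind such a data-constrained scaling law: repeated epochs are assumed to contribute exponentially less than unique tokens, saturating at some multiple of the unique-token count. The constant R_STAR and the helper effective_tokens are illustrative placeholders under that assumption, not the paper's fitted law or values.

```python
import numpy as np

# Hypothetical saturation constant for the value of repeated epochs
# (an illustrative placeholder, not a value fitted in the paper).
R_STAR = 15.0

def effective_tokens(unique_tokens: float, total_tokens: float) -> float:
    """Effective token count when total_tokens are obtained by repeating
    unique_tokens; epochs beyond the first decay exponentially in value."""
    repetitions = max(total_tokens / unique_tokens - 1.0, 0.0)  # epochs beyond the first
    return unique_tokens + unique_tokens * R_STAR * (1.0 - np.exp(-repetitions / R_STAR))

# Example: 100B unique tokens repeated for a few epochs retain most of
# their value; many epochs saturate and add little.
for epochs in (1, 4, 40):
    eff = effective_tokens(100e9, 100e9 * epochs)
    print(f"{epochs:>2} epochs -> {eff / 1e9:.0f}B effective tokens")
```

Under these assumed constants, 4 epochs of 100B unique tokens still count for roughly 370B of the 400B raw tokens, while 40 epochs saturate well below their raw count, mirroring the abstract's observation that moderate repetition is nearly free but heavy repetition is not.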
URL
https://arxiv.org/abs/2305.16264