Abstract
Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) affect base-model performance differently during pre-training, leading to inaccurate performance predictions. Moreover, existing work focuses on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. It is therefore necessary first to investigate the scaling laws of individual PLs, and then to account for their mutual influence to arrive at a final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1,000 experiments (equivalent to 336,000+ H800 GPU hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (up to 1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the parallel-pairing pre-training strategy (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing the allocation to fast-saturating languages (e.g., Rust), achieving superior average performance across all PLs compared to a uniform distribution under the same compute budget.
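The abstract does not give the fitted functional form of the per-language scaling laws. As a minimal sketch, assuming a Chinchilla-style power law L(N) = E + A·N^(−α) per language at a fixed data budget (the form, the constants, and the `fit_power_law` helper below are illustrative assumptions, not taken from the paper), the fitting step can be approximated by a grid search over the exponent with linear least squares for the remaining coefficients:

```python
import numpy as np

def fit_power_law(N, L, alphas=np.linspace(0.1, 0.5, 401)):
    """Fit L ~ E + A * N**(-alpha): grid-search alpha, solve (E, A) by least squares."""
    best = (np.inf, None, None, None)  # (residual, E, A, alpha)
    for alpha in alphas:
        X = np.column_stack([np.ones_like(N), N ** (-alpha)])  # columns: [1, N^-alpha]
        coef, *_ = np.linalg.lstsq(X, L, rcond=None)
        resid = L - X @ coef
        ss = float(resid @ resid)
        if ss < best[0]:
            best = (ss, coef[0], coef[1], alpha)
    _, E, A, alpha = best
    return E, A, alpha

# Synthetic losses for model sizes spanning the 0.2B-14B range studied here;
# the "true" constants below are made up for illustration only.
N = np.array([2e8, 5e8, 1e9, 3e9, 7e9, 1.4e10])
L = 1.2 + 50.0 * N ** (-0.3)  # irreducible loss 1.2, scaling exponent 0.3

E_hat, A_hat, alpha_hat = fit_power_law(N, L)
```

Under this reading, an interpreted language that benefits more from scale would show a larger fitted exponent than a fast-saturating compiled language; extending the design matrix with a D^(−β) term handles joint model/data scaling in the same way.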
Abstract (translated)
Code large language models (Code LLMs) are powerful but expensive to train; scaling laws predict performance from model size, data, and compute. However, different programming languages (PLs) affect base-model performance differently during pre-training, leading to inaccurate performance predictions. Moreover, existing work largely focuses on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. It is therefore necessary first to study the scaling laws of individual PLs, and then to consider their mutual influence to arrive at a final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for code pre-training across multiple PLs, conducting over 1,000 experiments (equivalent to 336,000+ H800 GPU hours) spanning multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (up to 1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from larger models and more training data than compiled languages (e.g., Rust). The study shows that multilingual pre-training yields synergistic benefits, which are especially pronounced between syntactically similar languages. Furthermore, the parallel-pairing pre-training strategy (concatenating code snippets with their translations) significantly enhances cross-lingual ability and exhibits favorable scaling properties. Finally, we propose a proportion-dependent multilingual scaling law that optimizes the allocation of training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy language pairs (e.g., JavaScript and TypeScript), and reducing the allocation to fast-saturating languages (e.g., Rust), achieving better average performance across all PLs than a uniform distribution under the same compute budget.
URL
https://arxiv.org/abs/2512.13472