Abstract
Large language models (LLMs) demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving. Recent methods have improved reasoning through expanded corpora and multistage training that combines reinforcement learning and supervised fine-tuning. Although some methods suggest that a small but targeted dataset can incentivize reasoning through distillation alone, reasoning scaling laws are still taking shape, and computational costs continue to rise. To address this, we propose a data-efficient distillation framework (DED) that optimizes the Pareto frontier of reasoning distillation. Inspired by the on-policy learning and diverse roll-out strategies of reinforcement learning, the key idea of our approach is threefold: (1) We identify that benchmark scores alone do not determine an effective teacher model. Through comprehensive comparisons of leading reasoning LLMs, we develop a method to select an optimal teacher model. (2) While scaling distillation can enhance reasoning, it often degrades out-of-domain performance. A carefully curated, smaller corpus achieves a balanced trade-off between in-domain and out-of-domain capabilities. (3) Diverse reasoning trajectories encourage the student model to develop robust reasoning skills. We validate our method through evaluations on mathematical reasoning (AIME 2024/2025, MATH-500) and code generation (LiveCodeBench), achieving state-of-the-art results with only 0.8k carefully curated examples, bypassing the need for extensive scaling. Our systematic analysis demonstrates that DED outperforms existing methods by considering factors beyond superficial hardness, token length, or teacher model capability. This work offers a practical and efficient pathway to advanced reasoning while preserving general capabilities.
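The third idea, favoring diverse reasoning trajectories over near-duplicate roll-outs, can be sketched as a simple similarity filter over sampled teacher outputs. This is a minimal illustrative sketch, not the paper's actual curation pipeline: the function name, the use of `difflib.SequenceMatcher`, and the similarity threshold are all assumptions for demonstration.

```python
from difflib import SequenceMatcher

def select_diverse_trajectories(trajectories, k=3, max_similarity=0.8):
    """Greedily keep up to k trajectories whose similarity to every
    already-selected trajectory stays below max_similarity.

    Similarity is measured with difflib's SequenceMatcher ratio
    (a stand-in for whatever diversity metric a real pipeline uses).
    """
    selected = []
    for traj in trajectories:
        if all(SequenceMatcher(None, traj, kept).ratio() < max_similarity
               for kept in selected):
            selected.append(traj)
        if len(selected) == k:
            break
    return selected

# Toy example: near-duplicate roll-outs are filtered out,
# leaving distinct reasoning strategies for the student to learn from.
rollouts = [
    "Factor the quadratic, then apply the root formula.",
    "Factor the quadratic, then apply the root formula!",   # near-duplicate
    "Complete the square and solve for x directly.",
    "Use the substitution u = x - 1 and expand.",
]
print(select_diverse_trajectories(rollouts))
```

The greedy filter keeps the first trajectory of each "strategy cluster" and discards superficial rephrasings, which is the spirit of the diversity criterion described above.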
URL
https://arxiv.org/abs/2508.09883