Abstract
Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well suited to small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorization and thinking that underlies existing methods. Notably, UFT generally outperforms both SFT and RFT, regardless of model size. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample-complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.
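As a rough sketch of the idea in our own notation (not the paper's exact formulation), a unified objective of this kind can be written as a combination of an RFT term, which rewards solutions sampled by the model, and an SFT term, which supervises on a reference solution:

\[
\mathcal{L}_{\mathrm{UFT}}(\theta) \;=\; -\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[\, r(x, y) \,\big] \;+\; \lambda \,\big(\!-\log \pi_\theta(y^\star \mid x)\big),
\]

where \(\pi_\theta\) is the model's policy, \(r(x, y)\) is the task reward driving exploration, \(y^\star\) is the supervised reference solution, and \(\lambda\) balances the two terms. The symbols \(\lambda\), \(r\), and \(y^\star\), as well as how the two terms are weighted or scheduled during training, are illustrative assumptions rather than details stated in the abstract.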
URL
https://arxiv.org/abs/2505.16984