Abstract
Reinforcement learning (RL) has become a key component in training large language models (LLMs) for reasoning. However, recent studies question its effectiveness in improving multi-step reasoning, particularly on hard problems. To address this challenge, we propose a simple yet effective strategy, Question Augmentation: introducing partial solutions during training to reduce problem difficulty and provide more informative learning signals. When applied during RL training on math reasoning tasks, our method, QuestA, improves not only pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations showing that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.
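The core idea of question augmentation can be illustrated with a minimal sketch. This assumes the augmentation amounts to prepending a prefix of a reference solution to the original question before RL rollouts; the function name, `fraction` parameter, and prompt wording are illustrative, not the paper's actual implementation.

```python
def augment_question(question: str, solution_steps: list[str], fraction: float = 0.5) -> str:
    """Prepend the first `fraction` of reference solution steps as a hint.

    A partial solution lowers the effective difficulty of the problem,
    so RL rollouts succeed more often and yield denser reward signals.
    """
    k = int(len(solution_steps) * fraction)
    hint = "\n".join(solution_steps[:k])
    if not hint:
        return question  # fraction too small: fall back to the original question
    return (
        f"{question}\n\n"
        f"Partial solution:\n{hint}\n\n"
        f"Continue from here and finish the solution."
    )


# Hypothetical usage: half of a 4-step reference solution is revealed.
steps = ["Step 1: factor the quadratic.", "Step 2: set each factor to zero.",
         "Step 3: solve each equation.", "Step 4: verify both roots."]
prompt = augment_question("Solve x^2 - 5x + 6 = 0.", steps, fraction=0.5)
```

During training, `fraction` could be annealed toward zero as the policy improves, so the model is gradually weaned off the hints.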
URL
https://arxiv.org/abs/2507.13266