Abstract
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.
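To make the abstract's central quantity concrete, the following is a minimal sketch of one natural formalization, assuming a binary verifiable reward $R(x,y)\in\{0,1\}$, a policy $\pi_\theta(y\mid x)$ over full responses, and a success rate $0<p(\theta)<1$; the paper's exact definition of the Gradient Gap may differ.

$$
G(\theta) \;=\; \mathbb{E}\big[\nabla_\theta \log \pi_\theta(y\mid x)\,\big|\,R=1\big]\;-\;\mathbb{E}\big[\nabla_\theta \log \pi_\theta(y\mid x)\,\big|\,R=0\big].
$$

Writing $p(\theta)=\Pr[R=1]$ and using the identity $\mathbb{E}\big[\nabla_\theta \log \pi_\theta(y\mid x)\big]=0$, the gradient of the expected reward $J(\theta)=\mathbb{E}[R]$ factors as

$$
\nabla_\theta J(\theta) \;=\; p(\theta)\,\big(1-p(\theta)\big)\,G(\theta).
$$

Under this reading, the true ascent direction is exactly the Gradient Gap, with a magnitude modulated by $p(\theta)\bigl(1-p(\theta)\bigr)$, which suggests why a step-size threshold tied to the Gradient Gap's magnitude and to the success rate, as described in the abstract, arises naturally.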
URL
https://arxiv.org/abs/2510.08539