Paper Reading AI Learner

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

2025-10-09 17:53:41
Joe Suk, Yaqi Duan

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.
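The abstract's central claims (binary verifiable rewards, a step-size threshold below which learning converges and above which it collapses, and sensitivity to response length and success rate) lend themselves to a small numerical illustration. Below is a minimal, self-contained sketch of an RLVR-like bandit in the spirit of the "controlled bandit simulations" the abstract mentions; it is not the paper's construction. A "response" is a sequence of T binary tokens, the reward is 1 only if every token matches a fixed target, and the policy is trained with REINFORCE plus a running-mean baseline. All names and constants (`run`, `T`, `batch`, the EMA baseline, the two step sizes) are illustrative choices, not taken from the paper.

```python
# Toy RLVR-style bandit: policy gradient with a binary (verifiable) reward.
# Illustrative sketch only -- not the authors' experimental setup.
import numpy as np

def run(eta, T=10, batch=8, steps=2000, seed=0):
    """REINFORCE with a running-mean baseline at learning rate `eta`."""
    rng = np.random.default_rng(seed)
    theta = np.full(T, 1.0)        # per-token logits; the target token is 1
    baseline = 0.0                  # running estimate of the success rate
    history = []
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-theta))                 # P(token = 1)
        x = (rng.random((batch, T)) < p).astype(float)   # sampled responses
        r = (x.min(axis=1) == 1.0).astype(float)         # 1 iff all tokens correct
        adv = r - baseline                                # advantage vs. baseline
        # For a Bernoulli(sigmoid(theta)) token, grad log pi(x) = x - p.
        grad = (adv[:, None] * (x - p)).mean(axis=0)
        theta += eta * grad
        baseline = 0.99 * baseline + 0.01 * r.mean()
        history.append(np.prod(p))                        # current success probability
    return np.array(history)

if __name__ == "__main__":
    for eta in (0.5, 20.0):   # one "small" and one "large" step size to compare
        traj = run(eta)
        print(f"eta={eta:>5}: start={traj[0]:.3f}  final={traj[-1]:.3f}")
```

Sweeping `eta` (and the response length `T`) in this toy lets one probe the qualitative behavior the abstract describes: small steps track the expected policy gradient, while overly large steps make single noisy batches dominate the update. How closely this matches the paper's quantitative threshold depends on details not stated in the abstract.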

URL

https://arxiv.org/abs/2510.08539

PDF

https://arxiv.org/pdf/2510.08539.pdf

