
Demystifying Long Chain-of-Thought Reasoning in LLMs

2025-02-05 17:13:32
Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue

Abstract

Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: this https URL.
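The second and third findings above point to two concrete ingredients: reward shaping to keep CoT length growth stable, and verifiable (rule-checkable) reward signals. A minimal sketch of how such a shaped, verifiable reward could be wired up is shown below; the exact-match check, the `length_cap` value, and the penalty weight are illustrative assumptions, not the reward design used in the paper.

```python
def shaped_reward(model_answer: str, reference_answer: str,
                  cot_tokens: int, length_cap: int = 4096) -> float:
    """Illustrative verifiable reward with simple length shaping.

    The exact-match check stands in for a rule-based verifier (real math
    verifiers normalize and compare final answers), and the shaping term
    penalizes chains-of-thought that run past a token cap, one simple way
    to keep CoT length growth from destabilizing RL training.
    All constants here are hypothetical, not the paper's reward design.
    """
    correct = float(model_answer.strip() == reference_answer.strip())
    overflow = max(0, cot_tokens - length_cap) / length_cap
    return correct - 0.5 * min(overflow, 1.0)


# Example: a correct answer whose 6000-token CoT exceeds the 4096-token cap.
print(shaped_reward("42", "42", cot_tokens=6000))  # ~0.77
```

In practice such a scalar would be fed to the RL objective in place of a learned reward model; the point of the sketch is only that correctness is checked programmatically while the length term nudges the policy away from unbounded CoT growth.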

Abstract (translated)

Scaling inference compute enhances the reasoning performance of large language models (LLMs), in particular long chains-of-thought (CoTs), which enable strategies such as backtracking and error correction. Reinforcement learning (RL) has emerged as a key method for developing these capabilities, yet the conditions under which long CoTs emerge, and how to train effectively with RL, remain unclear. In this study, we systematically investigate the mechanics of long CoT reasoning and identify the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and reinforcement learning experiments, we present four main findings:

1. **While SFT is not strictly necessary**, it simplifies training and improves efficiency.
2. **Reasoning capabilities tend to emerge as training compute increases**, but their development is not guaranteed, so reward shaping (i.e., designing a well-chosen reward function) is crucial for stabilizing the growth of CoT length.
3. **Scaling verifiable reward signals is critical for RL.** We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning.
4. **Core abilities such as error correction are inherently present in base models**, but incentivizing these skills effectively for complex tasks via RL demands substantial compute, and measuring their emergence requires a nuanced approach.

These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at the link provided.

URL

https://arxiv.org/abs/2502.03373

PDF

https://arxiv.org/pdf/2502.03373.pdf

