ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

2025-04-15 18:10:22
Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong

Abstract

While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning and has two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool-use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in both efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
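
The two features described in the abstract, interleaving code execution with reasoning and scoring a rollout only by its outcome, can be illustrated with a minimal sketch. This is not the authors' implementation: the `<code>` and `<interpreter>` tag format, the `Answer:` convention, the +1/-1 reward values, and every function name here (`run_code`, `rollout`, `outcome_reward`, `toy_policy`) are assumptions made for illustration, and the stand-in policy is a hand-written function rather than a language model.

```python
import contextlib
import io
import re

# Hypothetical tag format for tool calls; an assumption, not taken from the abstract.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_code(source: str) -> str:
    """Execute a snippet and return what it printed, or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # no sandboxing here; a real system would isolate this
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def rollout(policy, question: str, max_turns: int = 4) -> str:
    """Alternate between policy text and interpreter feedback (multi-turn)."""
    context = question
    for _ in range(max_turns):
        segment = policy(context)
        context += "\n" + segment
        match = CODE_RE.search(segment)
        if match is None:
            break  # no tool call in this segment, so the policy has finished
        feedback = run_code(match.group(1))
        context += "\n<interpreter>" + feedback + "</interpreter>"
    return context


def outcome_reward(trace: str, gold: str) -> float:
    """Reward depends only on the final answer (values are illustrative)."""
    found = re.findall(r"Answer:\s*(\S+)", trace)
    return 1.0 if found and found[-1] == gold else -1.0


def toy_policy(context: str) -> str:
    """Hand-written stand-in for a model: call the tool once, then answer."""
    if "<interpreter>" not in context:
        return "Let me compute this.\n<code>print(sum(range(1, 101)))</code>"
    results = re.findall(r"<interpreter>(.*?)</interpreter>", context, re.DOTALL)
    return f"Answer: {results[-1]}"


if __name__ == "__main__":
    trace = rollout(toy_policy, "What is 1 + 2 + ... + 100?")
    print(trace)
    print("reward:", outcome_reward(trace, "5050"))
```

In an RL setting the scalar from `outcome_reward` would be the training signal for the whole trace, so the policy is never told directly when to call the interpreter; it is only rewarded when the final answer is right.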

URL

https://arxiv.org/abs/2504.11536

PDF

https://arxiv.org/pdf/2504.11536.pdf

