Paper Reading AI Learner

InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

2026-02-06 18:59:27
Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, Yongliang Shen

Abstract

Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
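The iterative reasoning loop the abstract describes — rounds of thinking that end either in a final answer or in a summary that seeds the next round's context — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual interface: the `<summary>`/`<answer>` markers and the stub `generate` function are assumptions for the sake of a runnable example.

```python
import re

def generate(prompt: str) -> str:
    """Stub standing in for a reasoning model. Emits a summary on the
    first call and a final answer once a summary is already in context."""
    if "Summary so far:" in prompt:
        return "Combining both cases gives <answer>42</answer>"
    return "Long case analysis... <summary>two cases remain: x>0 and x<=0</summary>"

def iterative_reason(problem: str, max_iterations: int = 8):
    """Reason in bounded rounds: each round either produces a final
    answer or a summary that replaces the full thought trace, keeping
    per-round context short instead of one ever-growing chain."""
    context = f"Problem: {problem}\n"
    for _ in range(max_iterations):
        output = generate(context)
        answer = re.search(r"<answer>(.*?)</answer>", output)
        if answer:
            return answer.group(1)
        summary = re.search(r"<summary>(.*?)</summary>", output)
        if summary:
            # Discard the full intermediate trace; resume from the summary only.
            context = f"Problem: {problem}\nSummary so far: {summary.group(1)}\n"
    return None  # iteration budget exhausted

print(iterative_reason("toy problem"))  # -> 42
```

In the actual framework, where to place the iteration boundary and what the summary preserves are learned via the two-stage training scheme rather than fixed by heuristics as in this stub.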

Abstract (translated)

Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context-length limits, and degraded reasoning caused by lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and cannot optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory. Building on model-controlled iteration boundaries and an explicit summarization mechanism, InftyThink+ adopts a two-stage training scheme: a supervised cold start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy on AIME24 by 21% and clearly outperforms conventional long chain-of-thought reinforcement learning, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance. This work and its framework design offer a new direction for optimizing large language models, performing especially well on long-sequence tasks and problems requiring many iterations, and is significant for advancing efficient, high-performance AI systems.

URL

https://arxiv.org/abs/2602.06960

PDF

https://arxiv.org/pdf/2602.06960.pdf
