Abstract
Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
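The iterative-reasoning scheme the abstract describes (reason in bounded segments, summarize, resume from the summary rather than the full chain-of-thought) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `ToyModel`, the `<summary>`/`<answer>` tags, and the loop structure are all assumptions made for demonstration.

```python
def iterative_reason(model, question, max_rounds=8):
    """Reason in rounds; each round carries forward only the question plus a
    model-written summary, so per-round context length stays bounded instead
    of growing quadratically with the full chain-of-thought."""
    context = question
    for _ in range(max_rounds):
        segment = model.generate(context)
        if "<answer>" in segment:
            # model-controlled boundary: it chose to finish
            return segment.split("<answer>")[1].split("</answer>")[0]
        # model chose to summarize its progress and continue reasoning
        summary = segment.split("<summary>")[1].split("</summary>")[0]
        context = question + "\nProgress so far: " + summary
    return None


class ToyModel:
    """Stand-in model that 'solves' the task on its third round,
    purely to make the loop above runnable."""
    def __init__(self):
        self.round = 0

    def generate(self, context):
        self.round += 1
        if self.round < 3:
            return f"...thoughts...<summary>step {self.round} done</summary>"
        return "...thoughts...<answer>42</answer>"


print(iterative_reason(ToyModel(), "What is 6*7?"))  # → 42
```

Under this framing, the reinforcement-learning contribution is to optimize the choices the sketch hard-codes: when to emit a boundary, what the summary preserves, and how reasoning resumes from it.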
URL
https://arxiv.org/abs/2602.06960