Abstract
Large Reasoning Models (LRMs) often suffer from the "over-thinking" problem, generating unnecessarily long reasoning on simple tasks. Strategies such as length penalties or routing mechanisms have been proposed to mitigate this issue, but they are typically heuristic and task-specific, lacking a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal scenarios. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost compared to long CoT while preserving task performance. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.
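The abstract does not spell out ARM2's length-aware objective, so the following is only a minimal sketch of one common form of length-aware reward shaping for RL fine-tuning, illustrating the general idea of trading accuracy against token cost. The function name, the penalty weight `lam`, and the token budget `max_tokens` are hypothetical, not drawn from the paper.

```python
# Hypothetical sketch of a length-aware reward for RL fine-tuning.
# This is NOT ARM2's actual objective; it illustrates one common
# scheme: reward correctness, then discount by response length so
# that shorter correct answers score higher.

def length_aware_reward(is_correct: bool, num_tokens: int,
                        max_tokens: int = 4096, lam: float = 0.5) -> float:
    """Return a scalar reward combining accuracy and brevity."""
    accuracy = 1.0 if is_correct else 0.0
    # Normalized length penalty in [0, 1]. Only correct answers are
    # rewarded for brevity, so the model is never pushed to truncate
    # reasoning it actually needs to reach the answer.
    length_penalty = min(num_tokens, max_tokens) / max_tokens
    return accuracy * (1.0 - lam * length_penalty)

# Example: a correct 300-token answer beats a correct 3000-token one.
print(length_aware_reward(True, 300))    # ~0.963
print(length_aware_reward(True, 3000))   # ~0.634
print(length_aware_reward(False, 300))   # 0.0
```

Gating the penalty on correctness is one simple way to avoid the degenerate optimum of always answering tersely regardless of accuracy.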
URL
https://arxiv.org/abs/2510.08163