Abstract
Recent advances in large language models (LLMs) have brought a surge of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods that rely solely on reinforcement learning (RL) can suffer from sample inefficiency and struggle to activate reasoning capabilities that are entirely absent, while conventional pipelines that begin with a cold-start supervised fine-tuning (SFT) phase before RL may restrict the model's exploratory capacity and converge suboptimally. In this work, we introduce \textbf{Metis-RISE} (\textbf{R}L \textbf{I}ncentivizes and \textbf{S}FT \textbf{E}nhances) for multimodal reasoning model learning. Unlike conventional approaches, Metis-RISE omits an initial SFT stage, beginning instead with an RL phase (e.g., using a Group Relative Policy Optimization variant) to incentivize and activate the model's latent reasoning capacity. A subsequent, targeted SFT stage then addresses two key challenges identified during RL: (1) \textit{inefficient trajectory sampling} for tasks where the model possesses but inconsistently applies correct reasoning, which we tackle using self-distilled reasoning trajectories from the RL model itself; and (2) \textit{fundamental capability absence}, which we address by injecting expert-augmented knowledge for prompts where the model entirely fails. This strategic application of RL for incentivization followed by SFT for enhancement forms the core of Metis-RISE, yielding two versions of our MLLMs (7B and 72B parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard demonstrate that both models achieve state-of-the-art performance among similar-sized models, with the 72B version ranking fourth overall.
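For a concrete view of the recipe, the sketch below restates the two-stage procedure described in the abstract as Python pseudocode. It is only an illustrative outline under assumed interfaces: the function names (`metis_rise`, `sample`, `verify`, `expert_solve`, `grpo_step`, `sft_step`), the binary verifiable reward, the group size, and the data-splitting rule are placeholders, not the authors' released implementation.

```python
import random
import statistics


def metis_rise(policy, rl_prompts, sft_prompts, *,
               sample, verify, expert_solve, grpo_step, sft_step,
               rl_iterations=1000, group_size=8):
    """Hedged sketch of the two-stage recipe: RL incentivizes, then SFT enhances.

    All callables are hypothetical interfaces standing in for real components:
      sample(policy, prompt)                        -> one reasoning trajectory
      verify(trajectory, answer)                    -> True if the final answer is correct
      expert_solve(prompt, answer)                  -> expert-augmented reference trajectory
      grpo_step(policy, prompt, group, advantages)  -> policy after one RL update
      sft_step(policy, sft_data)                    -> policy after supervised fine-tuning
    """
    # Stage 1: skip cold-start SFT and run a GRPO-style RL phase directly, so that
    # verifiable rewards activate the model's latent reasoning capacity.
    for _ in range(rl_iterations):
        prompt, answer = random.choice(rl_prompts)
        group = [sample(policy, prompt) for _ in range(group_size)]
        rewards = [float(verify(t, answer)) for t in group]
        mean, std = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
        advantages = [(r - mean) / std for r in rewards]  # group-relative advantages
        policy = grpo_step(policy, prompt, group, advantages)

    # Stage 2: build a targeted SFT set from the two failure modes observed after RL.
    sft_data = []
    for prompt, answer in sft_prompts:
        group = [sample(policy, prompt) for _ in range(group_size)]
        correct = [t for t in group if verify(t, answer)]
        if len(correct) == group_size:
            continue  # solved reliably; no extra supervision needed
        if correct:
            # Capability present but sampled inconsistently:
            # self-distil one of the model's own correct trajectories.
            sft_data.append((prompt, random.choice(correct)))
        else:
            # Capability entirely absent: inject an expert-augmented trajectory.
            sft_data.append((prompt, expert_solve(prompt, answer)))

    return sft_step(policy, sft_data)
```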
URL
https://arxiv.org/abs/2506.13056