Abstract
Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap benchmark highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured, information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both the visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design with detail rewards, directly tackling the sparse-reward problem while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception tasks to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes consistent performance gains, and that their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across six benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
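The abstract does not specify the exact form of the difficulty-aware reward or the stage schedule, so the sketch below is only one illustrative reading: it assumes a scalar reward that combines a sparse final-answer correctness term with dense per-check "detail" rewards, scaled by a per-sample difficulty weight, and a curriculum that moves from perception-style VQA items (ReasonMap-Plus) to full route-reasoning items (ReasonMap). All function names, fields, and constants here are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    difficulty: float   # hypothetical per-sample difficulty weight in [0, 1]

def difficulty_aware_reward(sample: Sample,
                            answer_correct: bool,
                            detail_scores: list[float],
                            detail_weight: float = 0.5) -> float:
    """Hypothetical reward: a sparse correctness signal densified by detail rewards,
    with harder samples contributing larger rewards when solved."""
    # Dense component: average of fine-grained checks (e.g., intermediate stops, transfers).
    detail_reward = sum(detail_scores) / len(detail_scores) if detail_scores else 0.0
    # Sparse component: 1 only if the final answer is correct.
    outcome_reward = 1.0 if answer_correct else 0.0
    # Difficulty-aware scaling: up-weight rewards earned on harder samples.
    scale = 1.0 + sample.difficulty
    return scale * (outcome_reward + detail_weight * detail_reward)

def stage_schedule(step: int, warmup_steps: int = 1000) -> str:
    """Hypothetical multi-stage curriculum: perception-level VQA first
    (RL cold start on ReasonMap-Plus), then full route reasoning on ReasonMap."""
    return "perception" if step < warmup_steps else "reasoning"
```

Under this reading, the detail rewards provide a learning signal even when the final route answer is wrong, which is how the abstract's framing addresses reward sparsity; the staged schedule plays the cold-start role that SFT would otherwise fill.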
URL
https://arxiv.org/abs/2510.02240