Abstract
As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner: both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to strengthen self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, covering maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
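To make the two elementary drawing operations mentioned above concrete, the following is a minimal sketch of what annotating a bounding box and drawing an auxiliary line on an image could look like; it is not the paper's implementation, and the helper names, coordinates, and use of Pillow are illustrative assumptions only.

```python
# Illustrative sketch (not VILASR's actual code) of the two elementary
# drawing operations described in the abstract: bounding-box annotation
# and auxiliary-line drawing. Requires Pillow.
from PIL import Image, ImageDraw

def annotate_bbox(image: Image.Image, box: tuple, label: str = "") -> Image.Image:
    """Draw a labeled bounding box (x1, y1, x2, y2) on a copy of the image."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    draw.rectangle(box, outline="red", width=3)
    if label:
        draw.text((box[0], max(box[1] - 12, 0)), label, fill="red")
    return out

def draw_auxiliary_line(image: Image.Image, start: tuple, end: tuple) -> Image.Image:
    """Draw an auxiliary line between two points, e.g. to mark a spatial relation."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    draw.line([start, end], fill="blue", width=3)
    return out

# Hypothetical usage: box one object, then connect two object centers.
img = Image.new("RGB", (640, 480), "white")  # stand-in for a real input frame
img = annotate_bbox(img, (100, 120, 220, 260), "chair")
img = draw_auxiliary_line(img, (160, 190), (480, 300))
img.save("reasoning_step.png")
```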
URL
https://arxiv.org/abs/2506.09965