Abstract
Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
Abstract (translated)
空间推理仍然是视觉语言模型(VLMs)的基本挑战,尽管最近有所进展,但现有方法在实现稳健性能方面仍面临困难。我们发现这一限制的根本原因在于一个关键缺口:现有的方法试图直接学习空间推理,而没有建立感知和理解的层级基础。为解决此难题,我们提出了一种全面的方法,用于逐步构建空间智能。 我们引入了SpatialLadder-26k,这是一个多模态数据集,包含26,610个样本,涵盖了对象定位、单幅图像、多视角以及视频的空间推理任务,并通过标准化流程创建,确保在各种模式上系统性地覆盖。基于此数据集,我们设计了一个三阶段的逐步训练框架:(1) 通过对象定位建立空间感知;(2) 通过多维空间任务发展空间理解;(3) 使用可验证奖励进行强化学习来加强复杂推理能力。 这种方法产生了SpatialLadder模型,这是一个拥有30亿参数的模型,在空间推理基准测试中取得了最先进的性能。相较于基础模型,平均提高了23.4%,超过了GPT-4o 20.8%和Gemini-2.0-Flash 10.1%。值得注意的是,SpatialLadder在跨域基准测试中保持了7.2%的改进,这证明从感知到推理的逐步训练对于稳健的空间智能是至关重要的。
URL
https://arxiv.org/abs/2510.08531