SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

2025-10-09 17:50:54
Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Abstract

Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
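
The abstract gives the order of the three training stages but no implementation details. The following is a minimal sketch of that ordering, not the authors' code. The `Stage` class, the task labels, and the "sft" tag on the first two stages are illustrative assumptions. Only the final stage is stated in the abstract to use reinforcement learning with verifiable rewards. The exact-match reward rule is likewise a hypothetical stand-in for whatever verification the paper actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    objective: str  # "sft" or "rl"; the "sft" label is an assumption
    tasks: tuple


# Stage order follows the abstract; task labels are illustrative.
CURRICULUM = (
    Stage("perception", "sft", ("object_localization",)),
    Stage("understanding", "sft", ("single_image", "multi_view", "video")),
    Stage("reasoning", "rl", ("single_image", "multi_view", "video")),
)


def verifiable_reward(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized answers match, else 0.0 (hypothetical rule)."""
    def normalize(text: str) -> str:
        return text.strip().lower()

    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


if __name__ == "__main__":
    for stage in CURRICULUM:
        print(stage.name, stage.objective, stage.tasks)
    print(verifiable_reward(" B ", "b"))
```

A reward of this kind is "verifiable" in the sense that it can be checked automatically against a known answer, with no learned reward model involved.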

URL

https://arxiv.org/abs/2510.08531

PDF

https://arxiv.org/pdf/2510.08531.pdf

