VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency

2026-02-05 10:07:11
Zhuang Xiong, Chen Zhang, Qingshan Xu, Wenbing Tao

Abstract

Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
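The abstract does not spell out how the direct Sim(3) registration is computed. As a rough illustration only, the sketch below uses the standard closed-form Umeyama alignment, assuming that a shared anchor frame gives two overlapping submaps pixel-wise corresponding 3D points, so no correspondence search is needed. The function name `umeyama_sim3` and the array layout are hypothetical and not taken from the paper.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form least-squares similarity transform (Umeyama, 1991).

    Finds scale s, rotation R, translation t minimizing
    sum ||dst_i - (s * R @ src_i + t)||^2 over corresponding rows.
    src, dst: (N, 3) arrays of corresponding 3D points (assumed layout).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)

    # Center both point sets.
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered sets and its SVD.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Sign correction so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Usage: points of one anchor frame as predicted in two submaps
# (synthetic data here, purely for illustration).
rng = np.random.default_rng(0)
pts_a = rng.normal(size=(500, 3))
pts_b = 1.8 * pts_a + np.array([0.5, -1.0, 2.0])  # scaled and shifted copy
s, R, t = umeyama_sim3(pts_a, pts_b)
aligned = s * pts_a @ R.T + t  # pts_a expressed in submap B's frame
```

Because the correspondences are assumed to come for free from the shared anchor pixels, the alignment cost is a single 3x3 SVD plus a few reductions over the points. A real system would likely also weight or filter points by prediction confidence, which this sketch omits.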

URL

https://arxiv.org/abs/2602.05508

PDF

https://arxiv.org/pdf/2602.05508.pdf

