Abstract
Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
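The motion-aware partitioning idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a precomputed per-frame mean optical-flow magnitude, and the function name `partition_by_motion` and the parameters `static_thresh`, `motion_budget`, and `max_frames` are hypothetical. Reusing the last frame of a submap as the first frame of the next is likewise an assumption made here to give adjacent submaps a shared frame. Turn handling is not modeled.

```python
from typing import List


def partition_by_motion(flow_mag: List[float],
                        static_thresh: float = 0.5,
                        motion_budget: float = 20.0,
                        max_frames: int = 8) -> List[List[int]]:
    """Group frame indices into submaps from per-frame flow magnitudes.

    flow_mag[i] is the mean optical-flow magnitude (in pixels) between
    frame i-1 and frame i; flow_mag[0] is ignored. All thresholds are
    hypothetical values chosen for illustration only.
    """
    if not flow_mag:
        return []

    submaps: List[List[int]] = []
    current = [0]      # the first frame is always kept
    pending = 0.0      # flow accumulated since the last kept frame
    travelled = 0.0    # flow accumulated inside the current submap

    for i in range(1, len(flow_mag)):
        pending += flow_mag[i]
        if pending < static_thresh:
            continue   # near-static frame: prune it as redundant

        current.append(i)
        travelled += pending
        pending = 0.0

        # Close the submap once enough motion or enough frames are covered.
        if travelled >= motion_budget or len(current) >= max_frames:
            submaps.append(current)
            current = [i]      # assumed: last frame is shared with the next submap
            travelled = 0.0

    # Keep a trailing submap unless it holds only the shared frame.
    if len(current) > 1 or not submaps:
        submaps.append(current)
    return submaps
```

With this sketch, a stationary stretch contributes no frames to any submap, so submap length follows camera motion and not frame count.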
URL
https://arxiv.org/abs/2602.05508