Abstract
On-the-fly 3D reconstruction from monocular image sequences is a long-standing challenge in computer vision, critical for applications such as real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff: per-scene optimization yields high fidelity but is computationally expensive, whereas feed-forward foundation models enable real-time inference but struggle with accuracy and robustness. In this work, we propose ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that transforms multi-scale features into structured 3D Gaussians. To sustain both fidelity and efficiency at scale, we design a hierarchical Gaussian representation with an LoD-aware rendering strategy, which improves rendering quality while reducing redundancy. Experiments on eight diverse indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization, providing a practical path toward on-the-fly digitization of real-world environments with both accurate geometry and high visual fidelity. Explore more demos on our project page: this https URL.
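No code accompanies this listing; purely as an illustration of the pipeline the abstract sketches (a decoder head that turns multi-scale features into structured 3D Gaussians, plus an LoD cut at render time), below is a minimal PyTorch sketch. Every name, shape, and parameterization here (GaussianDecoder, select_lod, the 14-channel head, the distance thresholds) is an assumption made for illustration, not ARTDECO's actual implementation.

# Hypothetical sketch (not ARTDECO's released code): a decoder head that
# maps multi-scale features to per-point 3D Gaussian attributes, plus a
# distance-based LoD cut for rendering. All shapes and names are assumed.
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """Fuses multi-scale features and decodes 3D Gaussian attributes."""
    def __init__(self, feat_dims=(256, 128, 64), hidden=128):
        super().__init__()
        # Project each feature scale to a common width, then fuse by summation.
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in feat_dims)
        # 3 (xyz offset) + 4 (rotation quat) + 3 (scale) + 1 (opacity) + 3 (color)
        self.head = nn.Linear(hidden, 14)

    def forward(self, feats, anchor_xyz):
        # feats: list of (N, d_i) per-point features, one tensor per pyramid level
        # anchor_xyz: (N, 3) points predicted by the 3D foundation model
        fused = torch.stack([p(f) for p, f in zip(self.proj, feats)]).sum(0)
        out = self.head(torch.relu(fused))
        xyz = anchor_xyz + 0.01 * torch.tanh(out[:, :3])    # small refinement of anchors
        rot = nn.functional.normalize(out[:, 3:7], dim=-1)  # unit quaternion
        scale = torch.exp(out[:, 7:10].clamp(max=4.0))      # positive, bounded scales
        opacity = torch.sigmoid(out[:, 10:11])
        color = torch.sigmoid(out[:, 11:14])
        return xyz, rot, scale, opacity, color

def select_lod(xyz, levels, cam_pos, cuts=(2.0, 8.0, 32.0)):
    """Distance-based LoD cut: rasterize fine Gaussians near the camera and
    coarse ones far away, keeping the working set roughly constant."""
    dist = (xyz - cam_pos).norm(dim=-1)
    target = torch.bucketize(dist, torch.tensor(cuts, device=xyz.device))
    return levels == target  # boolean mask of Gaussians to rasterize this frame

if __name__ == "__main__":
    N = 1024
    dec = GaussianDecoder()
    feats = [torch.randn(N, d) for d in (256, 128, 64)]
    xyz, rot, scale, opa, rgb = dec(feats, torch.randn(N, 3))
    keep = select_lod(xyz, torch.randint(0, 4, (N,)), cam_pos=torch.zeros(3))
    print(keep.float().mean())  # fraction of Gaussians kept for this view

A production system would more likely tie the LoD cut to screen-space footprint rather than raw camera distance, and would densify or prune Gaussians across levels online; the sketch only shows the data flow.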
URL
https://arxiv.org/abs/2510.08551