Abstract
Recent feed-forward reconstruction models such as VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, which limits their practical deployment. Existing streaming methods address this through learned memory mechanisms or causal attention, but they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity-transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes the relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves camera pose estimation and point map reconstruction performance on par with state-of-the-art offline models while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment on kilometer-scale streaming videos. Project website: $\href{this https URL}{\texttt{this https URL}}$
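The layer-wise scale alignment described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer boundaries (here, depth quantiles of the previous window), the robust per-layer scale estimate (a median ratio), and the function name `layerwise_scale_align` are all assumptions for exposition; the paper's propagation across windows and timestamps is not shown.

```python
import numpy as np

def layerwise_scale_align(depth_prev, depth_curr, num_layers=4):
    """Rescale depth_curr, layer by layer, to match depth_prev's scale.

    depth_prev, depth_curr: depth values of the same overlapping frames,
    predicted by the previous and current window, respectively.
    Hypothetical sketch: real layer segmentation and propagation may differ.
    """
    # Segment the scene into depth layers using quantiles of the reference window.
    edges = np.quantile(depth_prev, np.linspace(0.0, 1.0, num_layers + 1))
    aligned = depth_curr.copy()
    for i in range(num_layers):
        mask = (depth_prev >= edges[i]) & (depth_prev <= edges[i + 1])
        if mask.sum() == 0:
            continue
        # Per-layer scale factor: robust median ratio over the overlap.
        s = np.median(depth_prev[mask] / np.maximum(depth_curr[mask], 1e-8))
        aligned[mask] = depth_curr[mask] * s
    return aligned
```

A single global $\mathrm{Sim}(3)$ fit would force one scale on all layers; the per-layer factors above let near and far scene layers be corrected independently, which is the failure mode the abstract attributes to monocular scale ambiguity.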
URL
https://arxiv.org/abs/2512.13680