
LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

2025-12-15 18:59:04
Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo, Huaizu Jiang

Abstract

Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: this https URL
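
The idea of per-layer scale factors can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how depth is segmented into layers or how scales are estimated, so the quantile binning, the median-of-ratios estimator, and all function names below (`layer_ids`, `per_layer_scales`, `align`) are assumptions made purely for illustration. The toy data models one overlapping frame predicted by two consecutive windows, where the near and far layers disagree by different factors, which a single global scale could not correct.

```python
from statistics import median


def layer_ids(depths, num_layers):
    """Assign each depth to one of num_layers equal-count bins.

    Quantile binning is a stand-in assumption, not the paper's segmentation.
    """
    order = sorted(range(len(depths)), key=lambda i: depths[i])
    ids = [0] * len(depths)
    for rank, i in enumerate(order):
        ids[i] = min(rank * num_layers // len(depths), num_layers - 1)
    return ids


def per_layer_scales(prev_depths, cur_depths, ids, num_layers):
    """Estimate one scale per layer as the median ratio prev / cur."""
    scales = []
    for k in range(num_layers):
        ratios = [
            p / c
            for p, c, layer in zip(prev_depths, cur_depths, ids)
            if layer == k and c > 0
        ]
        # Fall back to the identity scale when a layer has no valid pixels.
        scales.append(median(ratios) if ratios else 1.0)
    return scales


def align(cur_depths, ids, scales):
    """Rescale each depth in the current window by its layer's scale."""
    return [d * scales[layer] for d, layer in zip(cur_depths, ids)]


# Toy overlap frame: depths from the previous window, and the same pixels
# from the current window with the near layer off by 2x and the far by 4x.
prev = [1.0, 1.5, 2.0, 8.0, 10.0, 12.0]
cur = [0.5, 0.75, 1.0, 2.0, 2.5, 3.0]

ids = layer_ids(cur, num_layers=2)
scales = per_layer_scales(prev, cur, ids, num_layers=2)
aligned = align(cur, ids, scales)
```

In this toy case the two layers recover different scales, and applying them brings the current window's depths back into agreement with the previous window. Propagating scales across windows and timestamps, as the abstract describes, is not covered by this sketch.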

URL

https://arxiv.org/abs/2512.13680

PDF

https://arxiv.org/pdf/2512.13680.pdf

