Paper Reading AI Learner

SPARK: Scalable Real-Time Point Cloud Aggregation with Multi-View Self-Calibration

2026-01-13 10:32:22
Chentian Sun

Abstract

Real-time multi-camera 3D reconstruction is crucial for 3D perception, immersive interaction, and robotics. Existing methods struggle with multi-view fusion, camera extrinsic uncertainty, and scalability for large camera setups. We propose SPARK, a self-calibrating real-time multi-camera point cloud reconstruction framework that jointly handles point cloud fusion and extrinsic uncertainty. SPARK consists of: (1) a geometry-aware online extrinsic estimation module leveraging multi-view priors and enforcing cross-view and temporal consistency for stable self-calibration, and (2) a confidence-driven point cloud fusion strategy modeling depth reliability and visibility at pixel and point levels to suppress noise and view-dependent inconsistencies. By performing frame-wise fusion without accumulation, SPARK produces stable point clouds in dynamic scenes while scaling linearly with the number of cameras. Extensive experiments on real-world multi-camera systems show that SPARK outperforms existing approaches in extrinsic accuracy, geometric consistency, temporal stability, and real-time performance, demonstrating its effectiveness and scalability for large-scale multi-camera 3D reconstruction.
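The confidence-driven, frame-wise fusion described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names, the scalar per-pixel confidence map, and the fixed `conf_thresh` gate are all assumptions standing in for the paper's pixel- and point-level reliability modeling.

```python
import numpy as np

def backproject(depth, K, T_wc):
    """Back-project a depth map into world-frame 3D points.
    depth: (H, W) metric depth; K: (3, 3) intrinsics;
    T_wc: (4, 4) camera-to-world extrinsic."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    pix = np.stack([u.ravel() * z, v.ravel() * z, z])  # homogeneous pixels * depth, (3, N)
    cam = np.linalg.inv(K) @ pix                       # camera-frame points
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    return (T_wc @ cam_h)[:3].T                        # (N, 3) world-frame points

def fuse_frame(depths, confs, Ks, T_wcs, conf_thresh=0.5):
    """Frame-wise confidence-gated fusion: keep only pixels whose depth
    confidence exceeds conf_thresh, then concatenate across cameras.
    There is no temporal accumulation, so per-frame cost grows linearly
    with the number of cameras (the scalability property claimed above)."""
    clouds = []
    for depth, conf, K, T_wc in zip(depths, confs, Ks, T_wcs):
        pts = backproject(depth, K, T_wc)
        keep = (conf.ravel() > conf_thresh) & (depth.ravel() > 0)
        clouds.append(pts[keep])
    return np.concatenate(clouds, axis=0)
```

A hard threshold is the simplest gate; the paper's strategy additionally models cross-view visibility, which would replace `keep` with a weight derived from how consistently each point is observed across cameras.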


URL

https://arxiv.org/abs/2601.08414

PDF

https://arxiv.org/pdf/2601.08414.pdf

