No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images

2024-10-31 17:58:22
Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, Songyou Peng

Abstract

We introduce NoPoSplat, a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from unposed sparse multi-view images. Our model, trained exclusively with photometric loss, achieves real-time 3D Gaussian reconstruction during inference. To eliminate the need for accurate pose input during reconstruction, we anchor one input view's local camera coordinates as the canonical space and train the network to predict Gaussian primitives for all views within this space. This approach obviates the need to transform Gaussian primitives from local coordinates into a global coordinate system, thus avoiding errors associated with per-frame Gaussians and pose estimation. To resolve scale ambiguity, we design and compare various intrinsic embedding methods, ultimately opting to convert camera intrinsics into a token embedding and concatenate it with image tokens as input to the model, enabling accurate scene scale prediction. We utilize the reconstructed 3D Gaussians for novel view synthesis and pose estimation tasks and propose a two-stage coarse-to-fine pipeline for accurate pose estimation. Experimental results demonstrate that our pose-free approach can achieve superior novel view synthesis quality compared to pose-required methods, particularly in scenarios with limited input image overlap. For pose estimation, our method, trained without ground truth depth or explicit matching loss, significantly outperforms state-of-the-art methods. This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios. Code and trained models are publicly available.
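
The intrinsic-embedding step mentioned in the abstract (turning camera intrinsics into a token and concatenating it with the image tokens) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `intrinsics_to_token` and `prepend_intrinsic_token`, the normalization by image size, the single linear projection, the token dimension, and the choice to place the token at the front of the sequence are all hypothetical. Plain Python lists stand in for tensors.

```python
import random


def intrinsics_to_token(fx, fy, cx, cy, width, height, weights, bias):
    """Map camera intrinsics to one token with a linear projection (hypothetical)."""
    # Assumption: normalize by image size so the features do not depend on resolution.
    feats = [fx / width, fy / height, cx / width, cy / height]
    # One output value per row of the weight matrix: dot(row, feats) + bias.
    return [sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(weights, bias)]


def prepend_intrinsic_token(image_tokens, intrinsic_token):
    """Concatenate the intrinsic token with the image tokens along the sequence axis."""
    return [intrinsic_token] + image_tokens


dim = 8  # illustrative token width, not taken from the paper
rng = random.Random(0)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(dim)]
bias = [0.0] * dim

# Stand-in for the patch tokens an image encoder would produce for one view.
image_tokens = [[0.0] * dim for _ in range(16)]

token = intrinsics_to_token(500.0, 500.0, 128.0, 128.0, 256, 256, weights, bias)
sequence = prepend_intrinsic_token(image_tokens, token)
```

In this sketch the sequence passed to the model grows by one token per view, which is how the focal length information could reach the network and help it predict scene scale.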

URL

https://arxiv.org/abs/2410.24207

PDF

https://arxiv.org/pdf/2410.24207.pdf

