Paper Reading AI Learner

6Img-to-3D: Few-Image Large-Scale Outdoor Driving Scene Reconstruction

2024-04-18 17:58:16
Théo Gieruc, Marius Kästingschäfer, Sebastian Bernhard, Mathieu Salzmann

Abstract

Current 3D reconstruction techniques struggle to infer unbounded scenes from a few images faithfully. Specifically, existing methods have high computational demands, require detailed pose information, and cannot reconstruct occluded regions reliably. We introduce 6Img-to-3D, an efficient, scalable transformer-based encoder-renderer method for single-shot image-to-3D reconstruction. Our method outputs a 3D-consistent parameterized triplane from only six outward-facing input images for large-scale, unbounded outdoor driving scenarios. We take a step towards resolving these shortcomings by combining contracted custom cross- and self-attention mechanisms for triplane parameterization, differentiable volume rendering, scene contraction, and image feature projection. We show that six surround-view vehicle images from a single timestamp, without global pose information, are enough to reconstruct 360$^{\circ}$ scenes at inference time, taking 395 ms. Our method allows, for example, rendering third-person images and bird's-eye views. Our code is available at this https URL, and more examples can be found on our website at this https URL.
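The abstract names scene contraction, a parameterized triplane, and differentiable volume rendering without spelling out how a 3D query point is turned into features. As an illustration only, the sketch below shows one common way these pieces fit together: a mip-NeRF 360-style contraction that maps unbounded points into a bounded domain, followed by bilinear sampling of the three axis-aligned feature planes. The function names (contract, sample_triplane), tensor shapes, and fusion by summation are assumptions made for this sketch, not the authors' implementation.

import torch
import torch.nn.functional as F

def contract(x: torch.Tensor) -> torch.Tensor:
    # Scene contraction in the style of mip-NeRF 360: points with ||x|| <= 1
    # stay put, farther points are squashed into a ball of radius 2.
    norm = x.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    squashed = (2.0 - 1.0 / norm) * (x / norm)
    return torch.where(norm <= 1.0, x, squashed)

def sample_triplane(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    # planes: (3, C, H, W) parameterized triplane (XY, XZ, YZ)
    # points: (N, 3) query points in unbounded world space
    # returns: (N, C) per-point features, fused here by summation
    p = contract(points) / 2.0  # contracted coords lie in [-2, 2]; grid_sample expects [-1, 1]
    projections = [p[:, [0, 1]], p[:, [0, 2]], p[:, [1, 2]]]
    feats = []
    for plane, uv in zip(planes, projections):
        grid = uv.view(1, -1, 1, 2)                                # (1, N, 1, 2)
        f = F.grid_sample(plane[None], grid, align_corners=True)   # (1, C, N, 1)
        feats.append(f[0, :, :, 0].T)                              # (N, C)
    return torch.stack(feats).sum(dim=0)

# Toy usage: a random triplane and points sampled along rays of an unbounded scene.
planes = torch.randn(3, 32, 128, 128)
points = torch.randn(64, 3) * 10.0
features = sample_triplane(planes, points)   # (64, 32)

In a full pipeline, these per-point features would be decoded into density and color and composited by differentiable volume rendering; the sketch only makes the contraction and triplane lookup concrete.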

URL

https://arxiv.org/abs/2404.12378

PDF

https://arxiv.org/pdf/2404.12378.pdf

