Abstract
Current 3D reconstruction techniques struggle to faithfully infer unbounded scenes from a few images. Specifically, existing methods have high computational demands, require detailed pose information, and cannot reliably reconstruct occluded regions. We introduce 6Img-to-3D, an efficient, scalable transformer-based encoder-renderer method for single-shot image-to-3D reconstruction. Our method outputs a 3D-consistent parameterized triplane from only six outward-facing input images for large-scale, unbounded outdoor driving scenarios. We take a step towards resolving existing shortcomings by combining contracted custom cross- and self-attention mechanisms for triplane parameterization, differentiable volume rendering, scene contraction, and image feature projection. We show that six surround-view vehicle images from a single timestamp, without global pose information, are enough to reconstruct 360$^{\circ}$ scenes at inference time in 395 ms. Our method allows, for example, rendering third-person images and bird's-eye views. Our code is available at this https URL, and more examples can be found at our website at this https URL.
URL
https://arxiv.org/abs/2404.12378