Abstract
Generating realistic, room-level indoor scenes with semantically plausible and detailed appearances from in-the-wild images is crucial for various applications in VR, AR, and robotics. The success of NeRF-based generative methods indicates a promising direction to address this challenge. However, unlike their success at the object level, existing scene-level generative methods require additional information, such as multiple views, depth images, or semantic guidance, rather than relying solely on RGB images. This is because NeRF-based methods necessitate prior knowledge of camera poses, which is challenging to approximate for indoor scenes due to the complexity of defining alignment and the difficulty of globally estimating poses from a single image, given the unseen parts behind the camera. To address this challenge, we redefine global poses within the framework of Local-Pose-Alignment (LPA) -- an anchor-based multi-local-coordinate system that uses a selected number of anchors as the roots of these coordinates. Building on this foundation, we introduce LPA-GAN, a novel NeRF-based generative approach that incorporates specific modifications to estimate the priors of camera poses under LPA. It also co-optimizes the pose predictor and scene generation processes. Our ablation study and comparisons with straightforward extensions of NeRF-based object generative methods demonstrate the effectiveness of our approach. Furthermore, visual comparisons with other techniques reveal that our method achieves superior view-to-view consistency and semantic normality.
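As a rough illustration of the anchor-based idea only, the sketch below re-expresses a global camera pose in a coordinate frame rooted at an anchor. The abstract does not spell out the LPA formulation, so everything here is an assumption made for illustration: the 2D floor-plane pose `(x, y, yaw)`, the nearest-anchor assignment rule, and the helper names `to_local`, `to_global`, `nearest_anchor`, and `align` are all hypothetical and are not taken from LPA-GAN.

```python
import math

# Hypothetical 2D simplification: a pose is (x, y, yaw) on the floor plane.
# The anchor layout and nearest-anchor rule are illustrative assumptions,
# not the paper's definition of Local-Pose-Alignment.


def wrap_angle(theta):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


def to_local(anchor, cam):
    """Express a global camera pose in the frame rooted at an anchor."""
    ax, ay, ayaw = anchor
    cx, cy, cyaw = cam
    dx, dy = cx - ax, cy - ay
    cos_a, sin_a = math.cos(ayaw), math.sin(ayaw)
    # Rotate the offset by -ayaw to move it into the anchor frame.
    lx = cos_a * dx + sin_a * dy
    ly = -sin_a * dx + cos_a * dy
    return lx, ly, wrap_angle(cyaw - ayaw)


def to_global(anchor, local):
    """Inverse of to_local: map an anchor-relative pose back to global."""
    ax, ay, ayaw = anchor
    lx, ly, lyaw = local
    cos_a, sin_a = math.cos(ayaw), math.sin(ayaw)
    gx = ax + cos_a * lx - sin_a * ly
    gy = ay + sin_a * lx + cos_a * ly
    return gx, gy, wrap_angle(ayaw + lyaw)


def nearest_anchor(anchors, cam):
    """Index of the anchor closest to the camera position (assumed rule)."""
    return min(
        range(len(anchors)),
        key=lambda i: math.hypot(cam[0] - anchors[i][0], cam[1] - anchors[i][1]),
    )


def align(anchors, cam):
    """Pick an anchor and return (anchor index, pose relative to that anchor)."""
    idx = nearest_anchor(anchors, cam)
    return idx, to_local(anchors[idx], cam)
```

Under these assumptions, a camera is described by an anchor index plus a small relative pose, for example `align([(0.0, 0.0, 0.0), (4.0, 0.0, math.pi / 2)], (4.0, 1.0, math.pi / 2))`, and `to_global` recovers the original global pose from that pair.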
Abstract (translated)
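Generating room-level indoor scenes that are semantically plausible and rich in detail from in-the-wild images is crucial for many applications, including VR, AR, and robotics. The success of generative methods based on NeRF (neural radiance fields) suggests a promising direction for addressing this challenge. However, unlike their success at the object level, existing scene-level generative methods require additional information, such as multi-view images, depth maps, or semantic guidance, rather than relying on RGB images alone. This is because NeRF-based methods need prior knowledge of camera poses, which is hard to obtain for indoor scenes because of the complexity of defining alignment and the difficulty of estimating a global pose from a single image (especially given the unseen parts behind the camera). To address this challenge, we redefine global poses and operate within the Local-Pose-Alignment (LPA) framework, an anchor-based multi-local-coordinate system that uses a selected number of anchors as the roots of these coordinates. Building on this, we introduce LPA-GAN, a new NeRF-based generative method that is specifically modified under the LPA framework to estimate the priors of camera poses and that optimizes the pose prediction and scene generation processes. Our ablation study and comparisons with straightforward extensions of NeRF-based object generative methods demonstrate the effectiveness of this approach. In addition, visual comparisons with other techniques show that our method achieves better view-to-view consistency and semantic plausibility. With this approach, LPA-GAN can effectively generate indoor scenes while relying only on RGB images, and it addresses the pose estimation problem while maintaining high realism and detail, opening new possibilities for high-quality indoor and outdoor environment modeling in virtual and augmented reality.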
URL
https://arxiv.org/abs/2504.02337