
LPA3D: 3D Room-Level Scene Generation from In-the-Wild Images

2025-04-03 07:18:48
Ming-Jia Yang, Yu-Xiao Guo, Yang Liu, Bin Zhou, Xin Tong

Abstract

Generating realistic, room-level indoor scenes with semantically plausible and detailed appearances from in-the-wild images is crucial for various applications in VR, AR, and robotics. The success of NeRF-based generative methods indicates a promising direction to address this challenge. However, unlike their success at the object level, existing scene-level generative methods require additional information, such as multiple views, depth images, or semantic guidance, rather than relying solely on RGB images. This is because NeRF-based methods necessitate prior knowledge of camera poses, which is challenging to approximate for indoor scenes due to the complexity of defining alignment and the difficulty of globally estimating poses from a single image, given the unseen parts behind the camera. To address this challenge, we redefine global poses within the framework of Local-Pose-Alignment (LPA) -- an anchor-based multi-local-coordinate system that uses a selected number of anchors as the roots of these coordinates. Building on this foundation, we introduce LPA-GAN, a novel NeRF-based generative approach that incorporates specific modifications to estimate the priors of camera poses under LPA. It also co-optimizes the pose predictor and scene generation processes. Our ablation study and comparisons with straightforward extensions of NeRF-based object generative methods demonstrate the effectiveness of our approach. Furthermore, visual comparisons with other techniques reveal that our method achieves superior view-to-view consistency and semantic normality.
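To make the Local-Pose-Alignment (LPA) idea in the abstract more concrete, below is a minimal sketch of how a global camera pose might be re-expressed in an anchor-based local coordinate frame. This is an illustrative assumption only: the helper names (`make_pose`, `to_local_pose`) and the nearest-anchor assignment rule are not taken from the paper, which does not publish this implementation.

```python
# Illustrative sketch (not the paper's code): express a camera pose relative
# to the nearest of several anchors, each anchor acting as the root of its
# own local coordinate system.
import numpy as np


def make_pose(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous camera-to-world pose matrix."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def to_local_pose(global_pose: np.ndarray, anchors: np.ndarray):
    """Re-express a global camera pose in the frame of its nearest anchor.

    anchors: (K, 4, 4) anchor-to-world transforms (the roots of the local
    coordinate systems). Returns the chosen anchor index and the camera pose
    relative to that anchor.
    """
    cam_center = global_pose[:3, 3]
    anchor_centers = anchors[:, :3, 3]
    k = int(np.argmin(np.linalg.norm(anchor_centers - cam_center, axis=1)))
    # anchor frame <- world <- camera
    local_pose = np.linalg.inv(anchors[k]) @ global_pose
    return k, local_pose


if __name__ == "__main__":
    # Two hypothetical anchors placed in a room, one camera pose between them.
    anchors = np.stack([
        make_pose(np.eye(3), np.array([0.0, 0.0, 0.0])),
        make_pose(np.eye(3), np.array([4.0, 0.0, 0.0])),
    ])
    cam = make_pose(np.eye(3), np.array([3.0, 1.0, 1.5]))
    k, local = to_local_pose(cam, anchors)
    print(f"camera assigned to anchor {k}; local translation = {local[:3, 3]}")
```

In the paper's setting, such local poses are what LPA-GAN's pose predictor would estimate and co-optimize with scene generation; the snippet above only illustrates the change of coordinate frame.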

URL

https://arxiv.org/abs/2504.02337

PDF

https://arxiv.org/pdf/2504.02337.pdf

