Paper Reading AI Learner

SceneLCM: End-to-End Layout-Guided Interactive Indoor Scene Generation with Latent Consistency Model

2025-06-08 11:30:31
Yangkai Lin, Jiabao Lei, Kui Jia

Abstract

Our project page: this https URL. Automated generation of complex, interactive indoor scenes tailored to user prompts remains a formidable challenge. While existing methods achieve indoor scene synthesis, they struggle with rigid editing constraints, physical incoherence, excessive human effort, single-room limitations, and suboptimal material quality. To address these limitations, we propose SceneLCM, an end-to-end framework that synergizes a Large Language Model (LLM) for layout design with a Latent Consistency Model (LCM) for scene optimization. Our approach decomposes scene generation into four modular pipelines: (1) Layout Generation. We employ LLM-guided 3D spatial reasoning to convert textual descriptions into parametric blueprints (3D layouts), and a programmatic validation mechanism iteratively refines the layout parameters through LLM-mediated dialogue loops. (2) Furniture Generation. SceneLCM employs Consistency Trajectory Sampling (CTS), a consistency distillation sampling loss guided by the LCM, to form fast, semantically rich, and high-quality representations. We also offer two theoretical justifications showing that our CTS loss is equivalent to the consistency loss and that its distillation error is bounded by the truncation error of the Euler solver. (3) Environment Optimization. We use a multiresolution texture field to encode the appearance of the scene and optimize it via the CTS loss. To maintain cross-geometric texture coherence, we introduce a normal-aware cross-attention decoder that predicts RGB by cross-attending to anchor locations in geometrically heterogeneous instances. (4) Physical Editing. SceneLCM supports physical editing by integrating physical simulation, achieving persistent physical realism. Extensive experiments validate SceneLCM's superiority over state-of-the-art techniques, demonstrating its wide-ranging potential for diverse applications.
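The abstract claims the CTS loss is equivalent to a consistency loss and that its distillation error is bounded by the Euler solver's truncation error. For reference, below is a minimal sketch of the standard consistency distillation objective (Song et al., 2023) that this claim appears to build on; the exact CTS formulation, and how the LCM guidance enters during scene optimization, are not specified in the abstract, so the notation here is an assumption rather than the paper's definition.

```latex
% Standard consistency distillation objective (Song et al., 2023);
% the paper's CTS loss is stated to be equivalent to a loss of this form.
% f_theta: student consistency function, f_{theta^-}: EMA target network,
% Phi(.,.;phi): drift estimate from the pre-trained (teacher) diffusion/LCM model.
\mathcal{L}_{\mathrm{CD}}(\theta, \theta^{-}; \phi)
  = \mathbb{E}\!\left[\lambda(t_n)\,
      d\!\left(f_{\theta}(x_{t_{n+1}}, t_{n+1}),\;
               f_{\theta^{-}}(\hat{x}^{\phi}_{t_n}, t_n)\right)\right],
\qquad
\hat{x}^{\phi}_{t_n} = x_{t_{n+1}} + (t_n - t_{n+1})\,
      \Phi(x_{t_{n+1}}, t_{n+1}; \phi).
```

Here \(\hat{x}^{\phi}_{t_n}\) is a single Euler step of the probability-flow ODE under the teacher model. In the consistency-models analysis, driving this loss to zero bounds the distillation error by the order of the ODE solver, i.e. \(O(\Delta t)\) for the first-order Euler update above, which is consistent with the truncation-error bound the abstract refers to.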

Abstract (translated)

Our project page: this https URL. Automatically generating complex, highly interactive indoor scenes in response to user prompts remains a major challenge. Although existing methods can synthesize indoor scenes, they are limited by rigid editing constraints, physical inconsistency, excessive human effort, single-room-only scope, and poor material quality. To address these issues, we propose SceneLCM, an end-to-end framework that integrates a Large Language Model (LLM) for layout design with a Latent Consistency Model (LCM) for scene optimization. Our approach decomposes scene generation into four modular pipelines: 1. **Layout Generation**: We use LLM-guided 3D spatial reasoning to convert textual descriptions into parametric blueprints (i.e., 3D layouts), while an iterative validation mechanism progressively refines the layout parameters through LLM-mediated dialogue loops. 2. **Furniture Generation**: SceneLCM adopts Consistency Trajectory Sampling (CTS), a consistency distillation sampling loss driven by the LCM, to rapidly form semantically rich, high-quality representations. In addition, we provide two theoretical justifications showing that our CTS loss is equivalent to the consistency loss and that its distillation error is bounded by the truncation error of the Euler solver. 3. **Environment Optimization**: We encode the scene appearance with a multiresolution texture field and optimize it via the CTS loss. To maintain texture coherence across geometries, we introduce a normal-aware cross-attention decoder that predicts RGB values by cross-attending to anchor locations in different instances. 4. **Physical Editing**: SceneLCM supports physical editing by integrating physical simulation, achieving persistent physical realism. Extensive experiments validate SceneLCM's superiority over the state of the art and demonstrate its broad potential for diverse applications.

URL

https://arxiv.org/abs/2506.07091

PDF

https://arxiv.org/pdf/2506.07091.pdf

