SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

2025-09-24 09:06:41
Yandan Yang, Baoxiong Jia, Shujie Zhang, Siyuan Huang

Abstract

Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: this https URL.
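The closed-loop reason-act-reflect design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Scene`, `evaluate`, `plan`, `synthesize`, and the two tools) is hypothetical, the planner is a hand-written rule standing in for the language model-based planner, and the scores are placeholder heuristics rather than real physical, visual, or semantic metrics.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


# NOTE: all names below are hypothetical illustrations, not SceneWeaver's API.
@dataclass
class Scene:
    objects: List[str] = field(default_factory=list)


Tool = Callable[[Scene], Scene]


def add_furniture(scene: Scene) -> Scene:
    # Stand-in for a coarse layout tool (e.g. a data-driven generator).
    return Scene(scene.objects + ["bed", "desk"])


def add_details(scene: Scene) -> Scene:
    # Stand-in for a tool that adds small object-level detail.
    return Scene(scene.objects + ["lamp", "books"])


# Extensible tool registry: a new tool is added by registering one more entry.
TOOLS: Dict[str, Tool] = {
    "add_furniture": add_furniture,
    "add_details": add_details,
}


def evaluate(scene: Scene) -> Dict[str, float]:
    # Reflect: placeholder scores in [0, 1] based only on object count.
    n = len(scene.objects)
    return {
        "physical": 1.0,
        "visual": min(1.0, n / 4),
        "semantic": min(1.0, n / 2),
    }


def plan(scores: Dict[str, float], threshold: float) -> Optional[str]:
    # Reason: a fixed rule standing in for the LLM planner.
    if scores["semantic"] < threshold:
        return "add_furniture"
    if scores["visual"] < threshold:
        return "add_details"
    return None  # every score meets the threshold, so stop refining


def synthesize(
    tools: Dict[str, Tool], max_iters: int = 5, threshold: float = 1.0
) -> Tuple[Scene, List[str]]:
    scene = Scene()
    history: List[str] = []
    for _ in range(max_iters):
        scores = evaluate(scene)        # reflect on the current scene
        choice = plan(scores, threshold)  # reason about which tool to call
        if choice is None:
            break
        scene = tools[choice](scene)    # act: apply the tool, update the scene
        history.append(choice)
    return scene, history


if __name__ == "__main__":
    final_scene, calls = synthesize(TOOLS)
    print(calls)
    print(final_scene.objects)
```

In this sketch the loop stops either when the planner returns no tool or when `max_iters` is reached, which keeps the refinement bounded even if the scores never reach the threshold.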

Abstract (translated)

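Indoor scene synthesis has grown in importance with the development of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. Although recent methods have improved visual fidelity, they usually remain limited to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to follow complex user instructions. In this work, we introduce SceneWeaver, an agent-based framework that unifies various scene synthesis paradigms through tool-driven iterative refinement. At the core of SceneWeaver is a language-model-driven planner that selects from an extensible suite of scene generation tools, including data-driven generative models, vision-based methods, and large language model (LLM) methods. Planning is guided by self-evaluation of physical plausibility, visual realism, and semantic consistency with the user input. This closed-loop reason-act-reflect design lets the agent identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types show that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: this https URL.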

URL

https://arxiv.org/abs/2509.20414

PDF

https://arxiv.org/pdf/2509.20414.pdf

