Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

2025-05-05 17:59:58
Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, Zhaoshuo Li

Abstract

Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, which limits scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.
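
The four stages named in the abstract (LLM draft, vision refinement, optimization, judge) can be pictured with a toy sketch. Everything below is an assumption made for illustration: the `Box` type, the function names, the single-pass control flow, and the 2D push-apart heuristic are hypothetical stand-ins, not the authors' implementation. The abstract does not specify how the optimization module works, so the penetration fix here is a simple axis-aligned separation loop chosen only to show what "iteratively preventing object penetration" could look like.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical placed object: a 2D axis-aligned footprint on the floor."""
    name: str
    x: float  # center x
    y: float  # center y
    w: float  # extent along x
    h: float  # extent along y


def overlap(a: Box, b: Box) -> Tuple[float, float]:
    """Penetration depth along x and y; the boxes intersect only if both are > 0."""
    dx = (a.w + b.w) / 2 - abs(a.x - b.x)
    dy = (a.h + b.h) / 2 - abs(a.y - b.y)
    return dx, dy


def penetrating_pairs(layout: List[Box], eps: float = 1e-9) -> List[Tuple[int, int]]:
    """Index pairs of boxes that overlap by more than eps on both axes."""
    return [(i, j)
            for i in range(len(layout))
            for j in range(i + 1, len(layout))
            if min(overlap(layout[i], layout[j])) > eps]


def resolve_penetrations(layout: List[Box], max_iters: int = 50) -> List[Box]:
    """Toy optimizer: push the second box of each overlapping pair out along
    the axis of least penetration. Convergence is not guaranteed for crowded
    layouts, so the loop is bounded by max_iters."""
    layout = list(layout)
    for _ in range(max_iters):
        pairs = penetrating_pairs(layout)
        if not pairs:
            break
        for i, j in pairs:
            a, b = layout[i], layout[j]
            dx, dy = overlap(a, b)
            if min(dx, dy) <= 1e-9:
                continue  # an earlier push in this pass already separated them
            if dx <= dy:
                sign = 1.0 if b.x >= a.x else -1.0
                layout[j] = replace(b, x=b.x + sign * dx)
            else:
                sign = 1.0 if b.y >= a.y else -1.0
                layout[j] = replace(b, y=b.y + sign * dy)
    return layout


def generate_scene(prompt: str,
                   draft: Callable[[str], List[Box]],
                   refine: Callable[[str, List[Box]], List[Box]],
                   optimize: Callable[[List[Box]], List[Box]],
                   judge: Callable[[str, List[Box]], bool]) -> Tuple[List[Box], bool]:
    """Chain the four stages once and return the layout with the judge's verdict."""
    coarse = draft(prompt)            # stage 1: coarse layout (an LLM in the paper)
    refined = refine(prompt, coarse)  # stage 2: vision-guided refinement
    final = optimize(refined)         # stage 3: pose and physics optimization
    return final, judge(prompt, final)  # stage 4: spatial-coherence check
```

A caller would supply the four stages as plain functions. For example, a stub `draft` that returns a table and a chair with overlapping footprints, an identity `refine`, `resolve_penetrations` as `optimize`, and a `judge` that checks `penetrating_pairs` is empty would yield a separated layout and a passing verdict.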

URL

https://arxiv.org/abs/2505.02836

PDF

https://arxiv.org/pdf/2505.02836.pdf
