Agentic 3D Scene Generation with Spatially Contextualized VLMs

2025-05-26 15:28:17
Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang

Abstract

Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications.
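To make the three-part spatial context concrete, the sketch below models it as a small data structure: a textual scene portrait, semantically labeled per-object point clouds, and a hypergraph whose edges of arity 1, 2, or more encode the unary, binary, and higher-order constraints the abstract describes. All names (`SpatialContext`, `Hyperedge`, the relation strings) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    # Semantically labeled geometry: a label plus a coarse (x, y, z) point cloud.
    obj_id: str
    label: str
    points: list

@dataclass
class Hyperedge:
    # One spatial constraint; arity 1 = unary, 2 = binary, >2 = higher-order.
    relation: str
    members: tuple

@dataclass
class SpatialContext:
    portrait: str                           # high-level semantic blueprint (text)
    objects: dict = field(default_factory=dict)
    hyperedges: list = field(default_factory=list)

    def add_object(self, obj: SceneObject) -> None:
        self.objects[obj.obj_id] = obj

    def add_constraint(self, relation: str, *member_ids: str) -> None:
        # Constraints may only reference objects already in the context.
        assert all(m in self.objects for m in member_ids)
        self.hyperedges.append(Hyperedge(relation, member_ids))

    def constraints_on(self, obj_id: str) -> list:
        # The kind of query an agent would issue when re-reading its working memory.
        return [e for e in self.hyperedges if obj_id in e.members]

# Build a toy context an agentic pipeline could iteratively read and update.
ctx = SpatialContext(portrait="a reading corner with a desk, chair, and lamp")
for oid, label in [("desk1", "desk"), ("chair1", "chair"), ("lamp1", "lamp")]:
    ctx.add_object(SceneObject(oid, label, points=[(0.0, 0.0, 0.0)]))
ctx.add_constraint("against_wall", "desk1")                    # unary
ctx.add_constraint("faces", "chair1", "desk1")                 # binary
ctx.add_constraint("aligned_row", "desk1", "chair1", "lamp1")  # higher-order
```

In this toy form, an edit-and-verify loop amounts to mutating `objects` and `hyperedges` and re-running queries such as `constraints_on`; the paper's pipeline operates on far richer state, but the read/update pattern is the same.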


URL

https://arxiv.org/abs/2505.20129

PDF

https://arxiv.org/pdf/2505.20129.pdf
