Abstract
Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications.
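The abstract describes the spatial context as three components (a scene portrait, a semantically labeled point cloud, and a scene hypergraph with unary, binary, and higher-order constraints) that the VLM iteratively reads from and updates. Below is a minimal illustrative sketch, in Python, of how such a context and read-update loop could be organized. All class names, fields, and the vlm.propose_action call are assumptions made for illustration only; they do not reflect the authors' actual implementation or API.

from dataclasses import dataclass, field
from typing import Sequence

@dataclass
class ScenePortrait:
    """High-level semantic blueprint of the scene (assumed fields)."""
    description: str                # natural-language summary
    object_list: list[str]          # semantic object categories

@dataclass
class LabeledPointCloud:
    """Object-level geometry with semantic labels (assumed layout)."""
    points: Sequence[tuple[float, float, float]]   # xyz coordinates
    labels: Sequence[str]                          # one semantic label per point

@dataclass
class HyperEdge:
    """One spatial constraint over one or more objects.

    Arity 1 is unary (e.g. "lamp on floor"), 2 is binary (e.g. "chair faces desk"),
    and arity > 2 is higher-order (e.g. "chairs arranged around table").
    """
    relation: str
    objects: tuple[str, ...]

@dataclass
class SpatialContext:
    """Structured, geometry-aware working memory injected into the VLM."""
    portrait: ScenePortrait
    point_cloud: LabeledPointCloud
    hypergraph: list[HyperEdge] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Serialize the context into text a VLM can read (illustrative)."""
        relations = "; ".join(
            f"{e.relation}({', '.join(e.objects)})" for e in self.hypergraph
        )
        return (
            f"Scene: {self.portrait.description}\n"
            f"Objects: {', '.join(self.portrait.object_list)}\n"
            f"Constraints: {relations}"
        )

def agentic_generation_loop(vlm, context: SpatialContext, max_steps: int = 10):
    """Sketch of the read-update cycle: the VLM reads the spatial context,
    proposes a scene edit, and the context is updated in place.
    `vlm.propose_action` is a hypothetical call, not a real API."""
    for _ in range(max_steps):
        action = vlm.propose_action(context.to_prompt())
        if action is None:          # no further edits proposed
            break
        context.hypergraph.append(
            HyperEdge(relation=action["relation"], objects=tuple(action["objects"]))
        )
    return context

The hypergraph representation is the key design choice sketched here: unlike a pairwise scene graph, each hyperedge can constrain an arbitrary set of objects, so higher-order arrangements (for example, several chairs placed around one table) can be expressed as a single relation rather than a bundle of binary ones.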
URL
https://arxiv.org/abs/2505.20129