Automatic creation of 3D scenes for immersive VR presence has been a significant research focus for decades. However, existing methods often rely on either high-poly mesh modeling with post-hoc simplification or massive 3D Gaussians, resulting in a complex pipeline or limited visual realism. In this paper, we demonstrate that such exhaustive modeling is unnecessary for achieving a compelling immersive experience. We introduce ImmerseGen, a novel agent-guided framework for compact and photorealistic world modeling. ImmerseGen represents scenes as hierarchical compositions of lightweight geometric proxies, i.e., simplified terrain and billboard meshes, and generates photorealistic appearance by synthesizing RGBA textures onto these proxies. Specifically, we propose terrain-conditioned texturing for user-centric base world synthesis, and RGBA asset texturing for midground and foreground elements. This reformulation offers several advantages: (i) it simplifies modeling by enabling agents to guide generative models in producing coherent textures that integrate seamlessly with the scene; (ii) it bypasses complex geometry creation and decimation by directly synthesizing photorealistic textures on proxies, preserving visual quality without degradation; (iii) it enables compact representations suitable for real-time rendering on mobile VR headsets. To automate scene creation from text prompts, we introduce VLM-based modeling agents enhanced with semantic grid-based analysis for improved spatial reasoning and accurate asset placement. ImmerseGen further enriches scenes with dynamic effects and ambient audio to support multisensory immersion. Experiments on scene generation and live VR showcases demonstrate that ImmerseGen achieves superior photorealism, spatial coherence, and rendering efficiency compared to prior methods. Project webpage: this https URL.
https://arxiv.org/abs/2506.14315
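As a rough illustration of the semantic grid-based placement idea mentioned above, the sketch below discretizes the base world into labeled cells and drops billboard proxies only onto admissible cells with some clearance. All names, labels, and thresholds are assumptions for illustration, not ImmerseGen's actual implementation.

```python
import numpy as np

# Hypothetical cell labels for a user-centric base world.
FREE, WATER, ROCK, OCCUPIED = 0, 1, 2, 3

def place_billboards(semantic_grid, num_assets, allowed=(FREE,), min_dist=2.0, seed=0):
    """Pick placement cells for billboard proxies on a 2D semantic grid.

    semantic_grid: (H, W) int array of cell labels.
    Returns a list of (row, col) placements that respect the label mask
    and a minimum pairwise distance, mimicking collision-free placement.
    """
    rng = np.random.default_rng(seed)
    candidates = np.argwhere(np.isin(semantic_grid, allowed))
    rng.shuffle(candidates)
    placements = []
    for cell in candidates:
        if len(placements) == num_assets:
            break
        if all(np.linalg.norm(cell - p) >= min_dist for p in placements):
            placements.append(cell)
    return [tuple(map(int, p)) for p in placements]

if __name__ == "__main__":
    grid = np.zeros((16, 16), dtype=int)
    grid[4:8, 4:8] = WATER          # e.g. a lake the agent must keep clear
    print(place_billboards(grid, num_assets=5))
```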
Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, the generation of large-scale 3D scenes that require spatial coherence remains underexplored. In this paper, we propose X-Scene, a novel framework for large-scale driving scene generation that achieves both geometric intricacy and appearance fidelity, while offering flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level conditions such as user-provided or text-driven layout for detailed scene composition and high-level semantic guidance such as user-intent and LLM-enriched text prompts for efficient customization. To enhance geometrical and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and the corresponding multiview images, while ensuring alignment between modalities. Additionally, we extend the generated local region into a large-scale scene through consistency-aware scene outpainting, which extrapolates new occupancy and images conditioned on the previously generated area, enhancing spatial continuity and preserving visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as scene exploration. Comprehensive experiments demonstrate that X-Scene significantly advances controllability and fidelity for large-scale driving scene generation, empowering data generation and simulation for autonomous driving.
https://arxiv.org/abs/2506.13558
Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low-cost accessibility, and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional, manually created and annotated 3D computer graphics assets, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data-driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable, and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream training and evaluation tasks. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation, and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the challenges of generalization and evaluation in embodied-intelligence research. Code is available at this https URL.
https://arxiv.org/abs/2506.10600
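Because the platform exports assets with real-world scale and physical properties in URDF, a minimal sketch of emitting such a description may help readers unfamiliar with the format; the mesh path, mass, and placeholder inertia below are illustrative values, not EmbodiedGen's output format.

```python
def write_urdf(name, mesh_path, mass_kg, scale=1.0):
    """Emit a minimal single-link URDF for a generated rigid asset.

    The inertia tensor here is a crude placeholder (a small diagonal);
    a real pipeline would compute it from the mesh and its mass.
    """
    i = 1e-3 * mass_kg  # placeholder diagonal inertia
    return f"""<?xml version="1.0"?>
<robot name="{name}">
  <link name="base_link">
    <inertial>
      <mass value="{mass_kg}"/>
      <inertia ixx="{i}" iyy="{i}" izz="{i}" ixy="0" ixz="0" iyz="0"/>
    </inertial>
    <visual>
      <geometry><mesh filename="{mesh_path}" scale="{scale} {scale} {scale}"/></geometry>
    </visual>
    <collision>
      <geometry><mesh filename="{mesh_path}" scale="{scale} {scale} {scale}"/></geometry>
    </collision>
  </link>
</robot>"""

if __name__ == "__main__":
    print(write_urdf("mug", "meshes/mug.obj", mass_kg=0.3))
```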
3D Gaussian Splatting has achieved remarkable success in reconstructing both static and dynamic 3D scenes. However, in a scene represented by 3D Gaussian primitives, interactions between objects suffer from inaccurate 3D segmentation, imprecise deformation among different materials, and severe rendering artifacts. To address these challenges, we introduce PIG: Physically-Based Multi-Material Interaction with 3D Gaussians, a novel approach that combines 3D object segmentation with high-precision simulation of interacting objects. First, our method facilitates fast and accurate mapping from 2D pixels to 3D Gaussians, enabling precise 3D object-level segmentation. Second, we assign unique physical properties to the correspondingly segmented objects within the scene for multi-material coupled interactions. Finally, we embed constraint scales into the deformation gradients, clamping the scaling and rotation properties of the Gaussian primitives to eliminate artifacts and achieve geometric fidelity and visual consistency. Experimental results demonstrate that our method not only outperforms the state of the art (SOTA) in visual quality, but also opens up new directions and pipelines for physically realistic scene generation.
https://arxiv.org/abs/2506.07657
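The clamping of scaling and rotation inside the deformation gradients can be pictured with a small PyTorch sketch: decompose each per-Gaussian deformation gradient, clamp its stretch, and recompose. The SVD route and the thresholds are assumptions for illustration rather than the paper's exact formulation.

```python
import torch

def clamp_deformation(F, s_min=0.8, s_max=1.2):
    """Clamp the stretch part of per-Gaussian deformation gradients.

    F: (N, 3, 3) deformation gradients from the simulator.
    Uses F = U diag(S) V^T; the singular values S act on the Gaussian's
    scale, while U V^T is the rotation applied to its orientation.
    Clamping S keeps primitives from degenerating into spikes or slivers.
    (A full treatment would also fix reflections where det(U @ Vh) < 0.)
    """
    U, S, Vh = torch.linalg.svd(F)
    S_clamped = S.clamp(s_min, s_max)
    R = U @ Vh                                   # rotation applied to the Gaussian
    F_safe = U @ torch.diag_embed(S_clamped) @ Vh
    return F_safe, R

if __name__ == "__main__":
    F = torch.eye(3).repeat(4, 1, 1) + 0.5 * torch.randn(4, 3, 3)
    F_safe, R = clamp_deformation(F)
    print(torch.linalg.svdvals(F_safe))          # all values within [0.8, 1.2]
```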
We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.
https://arxiv.org/abs/2506.07497
Our project page: this https URL. Automated generation of complex, interactive indoor scenes tailored to user prompts remains a formidable challenge. While existing methods achieve indoor scene synthesis, they struggle with rigid editing constraints, physical incoherence, excessive human effort, single-room limitations, and suboptimal material quality. To address these limitations, we propose SceneLCM, an end-to-end framework that synergizes a Large Language Model (LLM) for layout design with a Latent Consistency Model (LCM) for scene optimization. Our approach decomposes scene generation into four modular pipelines: (1) Layout Generation. We employ LLM-guided 3D spatial reasoning to convert textual descriptions into parametric blueprints (3D layouts), and an iterative programmatic validation mechanism refines layout parameters through LLM-mediated dialogue loops; (2) Furniture Generation. SceneLCM employs Consistency Trajectory Sampling (CTS), a consistency distillation sampling loss guided by the LCM, to form fast, semantically rich, and high-quality representations. We also offer two theoretical justifications demonstrating that our CTS loss is equivalent to the consistency loss and that its distillation error is bounded by the truncation error of the Euler solver; (3) Environment Optimization. We use a multiresolution texture field to encode the appearance of the scene and optimize it via the CTS loss. To maintain cross-geometric texture coherence, we introduce a normal-aware cross-attention decoder that predicts RGB by cross-attending to anchor locations in geometrically heterogeneous instances; (4) Physical Editing. SceneLCM supports physical editing by integrating physical simulation, achieving persistent physical realism. Extensive experiments validate SceneLCM's superiority over state-of-the-art techniques, showing its wide-ranging potential for diverse applications.
https://arxiv.org/abs/2506.07091
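To make the consistency-distillation idea behind the CTS loss concrete, here is a generic consistency loss in the sigma parameterization, where the target is produced by a single Euler step of the probability-flow ODE (so the distillation error is tied to the Euler truncation error, as the abstract argues). This is a simplified stand-in, not SceneLCM's actual objective.

```python
import torch

def consistency_loss(student, teacher_ema, denoiser, x0, sigma_hi, sigma_lo):
    """One consistency-distillation term between adjacent noise levels.

    student, teacher_ema, denoiser: callables (x, sigma) -> predicted x0.
    x0: clean samples; sigma_hi > sigma_lo are adjacent noise levels.
    The target is evaluated at the point reached by a single Euler step of
    the probability-flow ODE dx/dsigma = (x - D(x, sigma)) / sigma.
    """
    noise = torch.randn_like(x0)
    x_hi = x0 + sigma_hi * noise                         # sample at the higher noise level
    with torch.no_grad():
        d = (x_hi - denoiser(x_hi, sigma_hi)) / sigma_hi # ODE drift from the frozen denoiser
        x_lo = x_hi + (sigma_lo - sigma_hi) * d          # one Euler step toward sigma_lo
        target = teacher_ema(x_lo, sigma_lo)
    return torch.mean((student(x_hi, sigma_hi) - target) ** 2)

if __name__ == "__main__":
    f = lambda x, s: x / (1.0 + s)                       # stand-in networks for a smoke test
    x0 = torch.randn(8, 3)
    print(consistency_loss(f, f, f, x0, sigma_hi=1.0, sigma_lo=0.5).item())
```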
The generation of high-quality 3D environments is crucial for industries such as gaming, virtual reality, and cinema, yet remains resource-intensive due to the reliance on manual processes. This study performs a systematic review of existing generative AI techniques for 3D scene generation, analyzing their characteristics, strengths, limitations, and potential for improvement. By examining state-of-the-art approaches, it presents key challenges such as scene authenticity and the influence of textual inputs. Special attention is given to how AI can blend different stylistic domains while maintaining coherence, the impact of training data on output quality, and the limitations of current models. In addition, this review surveys existing evaluation metrics for assessing realism and explores how industry professionals incorporate AI into their workflows. The findings of this study aim to provide a comprehensive understanding of the current landscape and serve as a foundation for future research on AI-driven 3D content generation. Key findings include that advanced generative architectures enable high-quality 3D content creation at a high computational cost, effective multi-modal integration techniques like cross-attention and latent space alignment facilitate text-to-3D tasks, and the quality and diversity of training data combined with comprehensive evaluation metrics are critical to achieving scalable, robust 3D scene generation.
https://arxiv.org/abs/2506.05449
Real-world applications like video gaming and virtual reality often demand the ability to model 3D scenes that users can explore along custom camera trajectories. While significant progress has been made in generating 3D objects from text or images, creating long-range, 3D-consistent, explorable 3D scenes remains a complex and challenging problem. In this work, we present Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image with a user-defined camera path. Unlike existing approaches, Voyager achieves end-to-end scene generation and reconstruction with inherent consistency across frames, eliminating the need for 3D reconstruction pipelines (e.g., structure-from-motion or multi-view stereo). Our method integrates three key components: 1) World-Consistent Video Diffusion: a unified architecture that jointly generates aligned RGB and depth video sequences, conditioned on existing world observations to ensure global coherence; 2) Long-Range World Exploration: an efficient world cache with point culling and auto-regressive inference with smooth video sampling for iterative scene extension with context-aware consistency; and 3) Scalable Data Engine: a video reconstruction pipeline that automates camera pose estimation and metric depth prediction for arbitrary videos, enabling large-scale, diverse training data curation without manual 3D annotations. Collectively, these designs yield a clear improvement over existing methods in visual quality and geometric accuracy, with versatile applications.
https://arxiv.org/abs/2506.04225
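A toy version of the world cache with point culling: cached world points are projected into the next target camera, and only points that land in front of the camera and inside the image are kept as conditioning for the next autoregressive step. The pinhole conventions below are assumptions, not Voyager's implementation.

```python
import numpy as np

def cull_cache(points_world, K, T_cam_from_world, hw, near=0.1):
    """Keep cached 3D points visible to a target pinhole camera.

    points_world: (N, 3) points accumulated from previous generations.
    K: (3, 3) intrinsics; T_cam_from_world: (4, 4) extrinsic transform.
    Returns the visible subset and their pixel coordinates, which can
    condition the next autoregressive generation step.
    """
    N = points_world.shape[0]
    p_h = np.concatenate([points_world, np.ones((N, 1))], axis=1)      # homogeneous coords
    p_cam = (T_cam_from_world @ p_h.T).T[:, :3]
    in_front = p_cam[:, 2] > near
    uvw = (K @ p_cam.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
    H, W = hw
    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    keep = in_front & in_image
    return points_world[keep], uv[keep]

if __name__ == "__main__":
    pts = np.random.uniform(-2, 2, size=(1000, 3)) + np.array([0, 0, 4.0])
    K = np.array([[256, 0, 256], [0, 256, 256], [0, 0, 1]], dtype=float)
    visible, uv = cull_cache(pts, K, np.eye(4), hw=(512, 512))
    print(visible.shape)
```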
Domain generalization (DG) for object detection aims to enhance detectors' performance in unseen scenarios. This task remains challenging due to complex variations in real-world applications. Recently, diffusion models have demonstrated remarkable capabilities in diverse scene generation, which inspires us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. Furthermore, we propose an efficient knowledge transfer framework that enables detectors to inherit the generalization capabilities of diffusion models through feature and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks. The results demonstrate that our method achieves substantial improvements of 14.0% mAP over existing DG approaches across different domains and corruption types. Notably, our method even outperforms most domain adaptation methods without accessing any target domain data. Moreover, the diffusion-guided detectors show consistent improvements of 15.9% mAP on average compared to the baseline. Our work aims to present an effective approach for domain-generalized detection and provide potential insights for robust visual recognition in real-world scenarios. The code is available at this https URL.
https://arxiv.org/abs/2503.02101
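One common way to realize "extracting multi-step intermediate features during the diffusion process" is to register forward hooks on intermediate blocks of the denoiser and run it at several noise levels. The sketch below uses a toy torch module as a stand-in for the real diffusion backbone; the noising schedule and tapped layers are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

def multi_step_features(denoiser, layers, image, timesteps):
    """Collect intermediate activations at several diffusion timesteps.

    denoiser: a torch module taking (noisy_image, t); layers: modules to tap.
    Returns {(layer_index, t): feature tensor}, which a detector head could
    consume as (approximately) domain-invariant features.
    """
    feats, handles = {}, []
    current_t = {"t": None}

    def make_hook(idx):
        def hook(_m, _inp, out):
            feats[(idx, current_t["t"])] = out.detach()
        return hook

    for i, layer in enumerate(layers):
        handles.append(layer.register_forward_hook(make_hook(i)))
    try:
        for t in timesteps:
            current_t["t"] = t
            noise = torch.randn_like(image)
            noisy = image + 0.1 * t * noise          # placeholder noising schedule
            denoiser(noisy, torch.tensor([t]))
    finally:
        for h in handles:
            h.remove()
    return feats

if __name__ == "__main__":
    class ToyDenoiser(nn.Module):
        def __init__(self):
            super().__init__()
            self.block1 = nn.Conv2d(3, 8, 3, padding=1)
            self.block2 = nn.Conv2d(8, 3, 3, padding=1)
        def forward(self, x, t):
            return self.block2(torch.relu(self.block1(x)))

    net = ToyDenoiser()
    out = multi_step_features(net, [net.block1], torch.randn(1, 3, 32, 32), [10, 100])
    print(list(out.keys()))
```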
Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow rough language-based control, which is convenient but lacks fine-grained scene customization, or employ graph-based control, which offers better controllability but demands considerable knowledge for the cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene synthesis. Specifically, FreeScene supports free-form user inputs including text descriptions and/or reference images, allowing users to express versatile design intentions. The user inputs are adequately analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. Our MG-DiT not only excels at preserving graph structure but also offers broad applicability to various tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph-based scene synthesis, outperforming state-of-the-art methods in terms of both generation quality and controllability across a range of applications.
https://arxiv.org/abs/2506.02781
Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside a scene, i.e., they produce stretched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360-degree panorama. Then, we expand it by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling realistic and unrestricted exploration for the first time. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.
https://arxiv.org/abs/2506.01799
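The scene memory that conditions each new video on the most relevant prior views can be approximated by ranking stored camera poses by their distance to the upcoming viewpoint; the rotation/translation weighting below is an assumption for illustration, not the paper's selection rule.

```python
import numpy as np

def select_memory_views(memory_poses, target_pose, k=4, w_rot=1.0, w_trans=0.5):
    """Rank stored camera poses by similarity to a target pose.

    Poses are 4x4 camera-to-world matrices. Rotation distance is the
    geodesic angle between rotations, translation distance is Euclidean;
    the k closest views would be fed to the video model as conditioning.
    """
    R_t, t_t = target_pose[:3, :3], target_pose[:3, 3]
    scores = []
    for i, P in enumerate(memory_poses):
        R, t = P[:3, :3], P[:3, 3]
        cos = np.clip((np.trace(R.T @ R_t) - 1.0) / 2.0, -1.0, 1.0)
        rot_dist = np.arccos(cos)                     # geodesic distance on SO(3)
        trans_dist = np.linalg.norm(t - t_t)
        scores.append((w_rot * rot_dist + w_trans * trans_dist, i))
    return [i for _, i in sorted(scores)[:k]]

if __name__ == "__main__":
    mem = [np.eye(4) for _ in range(6)]
    for j, P in enumerate(mem):
        P[:3, 3] = [j, 0, 0]
    print(select_memory_views(mem, mem[3], k=2))
```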
Designing 3D scenes is traditionally a challenging task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes based on simple text descriptions. However, as these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. First, we generate 2D images from a scene description, then extract the shape and appearance of objects to create 3D models. These models are assembled into the final scene using geometry, position, and pose information derived from the same intermediary image. Being generalizable to a wide range of scenes and styles, ArtiScene outperforms state-of-the-art baselines by a large margin in layout and aesthetic quality on quantitative metrics. It also averages a 74.89% winning rate in extensive user studies and 95.07% in GPT-4o evaluation. Project page: this https URL
https://arxiv.org/abs/2506.00742
Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.
https://arxiv.org/abs/2505.22643
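For readers unfamiliar with range-view representations, the standard spherical projection that turns a LiDAR sweep into a (depth, reflectance, label) image looks roughly like this; the field-of-view bounds are typical sensor values, not Spiral's configuration.

```python
import numpy as np

def to_range_view(points, reflectance, labels, H=64, W=1024,
                  fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project LiDAR points onto an H x W range image.

    points: (N, 3) xyz; reflectance, labels: (N,).
    Channels: depth, reflectance, semantic label (0 = empty).
    """
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(depth, 1e-6))
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    u = 0.5 * (1.0 - yaw / np.pi) * W                          # azimuth -> column
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * H   # elevation -> row
    u = np.clip(np.floor(u), 0, W - 1).astype(int)
    v = np.clip(np.floor(v), 0, H - 1).astype(int)
    img = np.zeros((3, H, W), dtype=np.float32)
    order = np.argsort(-depth)               # write nearest points last so they win
    img[0, v[order], u[order]] = depth[order]
    img[1, v[order], u[order]] = reflectance[order]
    img[2, v[order], u[order]] = labels[order]
    return img

if __name__ == "__main__":
    pts = np.random.uniform(-50, 50, size=(5000, 3))
    rv = to_range_view(pts, np.random.rand(5000), np.random.randint(1, 20, 5000))
    print(rv.shape)
```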
Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications.
https://arxiv.org/abs/2505.20129
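The scene hypergraph of unary, binary, and higher-order constraints maps naturally onto a small data structure such as the one sketched below; the relation vocabulary is invented for illustration and is not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Constraint:
    relation: str          # e.g. "against_wall", "facing", "evenly_spaced"
    objects: tuple         # object ids; len 1 = unary, 2 = binary, >2 = higher-order

@dataclass
class SceneHypergraph:
    objects: dict = field(default_factory=dict)       # id -> semantic label
    constraints: list = field(default_factory=list)

    def add(self, relation, *objects):
        self.constraints.append(Constraint(relation, objects))

    def constraints_on(self, obj_id):
        return [c for c in self.constraints if obj_id in c.objects]

if __name__ == "__main__":
    g = SceneHypergraph(objects={"sofa": "sofa", "tv": "tv", "c1": "chair", "c2": "chair"})
    g.add("against_wall", "sofa")                    # unary
    g.add("facing", "sofa", "tv")                    # binary
    g.add("evenly_spaced", "c1", "c2", "sofa")       # higher-order
    print([c.relation for c in g.constraints_on("sofa")])
```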
WonderPlay is a novel framework that integrates physics simulation with video generation to produce action-conditioned dynamic 3D scenes from a single image. While prior works are restricted to rigid-body or simple elastic dynamics, WonderPlay features a hybrid generative simulator to synthesize a wide range of 3D dynamics. The hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which subsequently conditions a video generator to produce a video with finer, more realistic motion. The generated video is then used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach enables intuitive user control to be combined with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with various scenes of diverse content, including cloth, sand, snow, liquid, smoke, elastic, and rigid bodies -- all using a single image input. Code will be made public. Project website: this https URL
https://arxiv.org/abs/2505.18151
Recently, 3D GANs based on 3D Gaussian splatting have been proposed for high-quality synthesis of human heads. However, existing methods stabilize training and enhance rendering quality from steep viewpoints by conditioning the random latent vector on the current camera position. This compromises 3D consistency, as we observe significant identity changes when re-synthesizing the 3D head with each camera shift. Conversely, fixing the camera to a single viewpoint yields high-quality renderings for that perspective but results in poor performance for novel views. Removing view-conditioning typically destabilizes GAN training, often causing the training to collapse. In response to these challenges, we introduce CGS-GAN, a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without relying on view-conditioning. To ensure training stability, we introduce a multi-view regularization technique that enhances generator convergence with minimal computational overhead. Additionally, we adapt the conditional loss used in existing 3D Gaussian splatting GANs and propose a generator architecture designed not only to stabilize training but also to facilitate efficient rendering and straightforward scaling, enabling output resolutions up to $2048^2$. To evaluate the capabilities of CGS-GAN, we curate a new dataset derived from FFHQ. This dataset enables very high resolutions, focuses on larger portions of the human head, reduces view-dependent artifacts for improved 3D consistency, and excludes images where subjects are obscured by hands or other objects. As a result, our approach achieves very high rendering quality, supported by competitive FID scores, while ensuring consistent 3D scene generation. Check out our project page here: this https URL
https://arxiv.org/abs/2505.17590
Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. Therefore, a lightweight alternative, generating complex 3D scenes from a single top-down image, plays an essential role in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, their extension to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce 3DTown, a training-free framework designed to synthesize realistic and coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation to improve image-to-3D alignment and resolution, and spatial-aware 3D inpainting to ensure global scene coherence and high-quality geometry generation. Specifically, we decompose the input image into overlapping regions and generate each using a pretrained 3D object generator, followed by a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that 3DTown outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, and TripoSG, in terms of geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality 3D town generation is achievable from a single image using a principled, training-free approach.
https://arxiv.org/abs/2505.15765
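The region-based generation step, which decomposes the top-down image into overlapping regions before invoking a pretrained object-level 3D generator, reduces to a sliding-window crop; the tile size and overlap ratio below are assumptions, and edge tiles are omitted for brevity.

```python
import numpy as np

def overlapping_regions(image, tile=256, overlap=0.25):
    """Split a top-down image (H, W, C) into overlapping square tiles.

    Returns a list of (y0, x0, crop); each crop would be sent to a
    pretrained image-to-3D generator, and the overlaps later reconciled
    by spatially aware inpainting.
    """
    H, W = image.shape[:2]
    stride = max(1, int(tile * (1.0 - overlap)))
    regions = []
    for y0 in range(0, max(H - tile, 0) + 1, stride):
        for x0 in range(0, max(W - tile, 0) + 1, stride):
            regions.append((y0, x0, image[y0:y0 + tile, x0:x0 + tile]))
    return regions

if __name__ == "__main__":
    img = np.zeros((512, 768, 3))
    print(len(overlapping_regions(img)))   # number of tiles covering the map
```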
3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene distributions, improving fidelity, diversity, and view consistency. Recent advances like diffusion models bridge 3D scene synthesis and photorealism by reframing generation as image or video synthesis problems. This survey provides a systematic overview of state-of-the-art approaches, organizing them into four paradigms: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. We analyze their technical foundations, trade-offs, and representative results, and review commonly used datasets, evaluation protocols, and downstream applications. We conclude by discussing key challenges in generation capacity, 3D representation, data and annotations, and evaluation, and outline promising directions including higher fidelity, physics-aware and interactive generation, and unified perception-generation models. This review organizes recent advances in 3D scene generation and highlights promising directions at the intersection of generative AI, 3D vision, and embodied intelligence. To track ongoing developments, we maintain an up-to-date project page: this https URL.
https://arxiv.org/abs/2505.05474
Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement-learning-based post-training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: this https URL
https://arxiv.org/abs/2505.04831
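Since the generative model outputs an SE(3) pose for every placed asset, a small utility for converting a (quaternion, translation) pair into the homogeneous transform a physics simulator expects may be a useful reference; the w-x-y-z convention is an assumption, and nothing here is specific to the released dataset.

```python
import numpy as np

def se3_from_quat(q, t):
    """Build a 4x4 SE(3) matrix from a unit quaternion (w, x, y, z) and a translation."""
    w, x, y, z = q / np.linalg.norm(q)
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

if __name__ == "__main__":
    # 90-degree rotation about z, then a translation of 1 m along x.
    q = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
    T = se3_from_quat(q, np.array([1.0, 0.0, 0.0]))
    print(np.round(T @ np.array([1.0, 0.0, 0.0, 1.0]), 3))   # -> [1, 1, 0, 1]
```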
Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting the scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating an image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.
https://arxiv.org/abs/2505.02836
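The physical-plausibility stage that prevents artifacts like object penetration can, in its simplest form, be reduced to an axis-aligned bounding-box overlap test between placed objects; this is a deliberately minimal stand-in for the paper's pose optimization, not its actual module.

```python
def aabb_overlap(a, b, eps=1e-6):
    """True if two axis-aligned boxes interpenetrate.

    Each box is ((min_x, min_y, min_z), (max_x, max_y, max_z)).
    """
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] < bmax[i] - eps and bmin[i] < amax[i] - eps for i in range(3))

def penetration_pairs(boxes):
    """Return index pairs of objects whose boxes overlap, to be pushed apart."""
    bad = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if aabb_overlap(boxes[i], boxes[j]):
                bad.append((i, j))
    return bad

if __name__ == "__main__":
    table = ((0.0, 0.0, 0.0), (1.2, 0.8, 0.75))
    chair = ((1.0, 0.2, 0.0), (1.5, 0.7, 0.9))   # slides into the table edge
    lamp  = ((2.0, 2.0, 0.0), (2.2, 2.2, 1.5))
    print(penetration_pairs([table, chair, lamp]))   # -> [(0, 1)]
```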