3D scene generation has garnered growing attention in recent years and has made significant progress. Generating 4D cities is more challenging than 3D scenes due to the presence of structurally complex, visually diverse objects like buildings and vehicles, and heightened human sensitivity to distortions in urban environments. To tackle these issues, we propose CityDreamer4D, a compositional generative model specifically tailored for generating unbounded 4D cities. Our main insights are 1) 4D city generation should separate dynamic objects (e.g., vehicles) from static scenes (e.g., buildings and roads), and 2) all objects in the 4D scene should be composed of different types of neural fields for buildings, vehicles, and background stuff. Specifically, we propose Traffic Scenario Generator and Unbounded Layout Generator to produce dynamic traffic scenarios and static city layouts using a highly compact BEV representation. Objects in 4D cities are generated by combining stuff-oriented and instance-oriented neural fields for background stuff, buildings, and vehicles. To suit the distinct characteristics of background stuff and instances, the neural fields employ customized generative hash grids and periodic positional embeddings as scene parameterizations. Furthermore, we offer a comprehensive suite of datasets for city generation, including OSM, Google Earth, and CityTopia. The OSM dataset provides a variety of real-world city layouts, while the Google Earth and CityTopia datasets deliver large-scale, high-quality city imagery complete with 3D instance annotations. Leveraging its compositional design, CityDreamer4D supports a range of downstream applications, such as instance editing, city stylization, and urban simulation, while delivering state-of-the-art performance in generating realistic 4D cities.
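The instance-oriented neural fields above use periodic positional embeddings as their scene parameterization. As a hedged illustration only, the sketch below shows the standard sin/cos form such an embedding typically takes; the function name, the `num_freqs` parameter, and the frequency schedule are assumptions, not CityDreamer4D's actual implementation.

```python
import numpy as np

def periodic_positional_embedding(xyz: np.ndarray, num_freqs: int = 6) -> np.ndarray:
    """Map 3D points to sin/cos features at geometrically spaced frequencies.

    xyz: (N, 3) array of instance-local coordinates.
    Returns an (N, 3 * 2 * num_freqs) feature array.
    """
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi       # (F,) frequencies
    scaled = xyz[:, :, None] * freqs[None, None, :]     # (N, 3, F)
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return feats.reshape(xyz.shape[0], -1)

# Example: embed a few points sampled inside one building instance's bounding box.
points = np.random.rand(4, 3)
print(periodic_positional_embedding(points).shape)  # (4, 36)
```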
https://arxiv.org/abs/2501.08983
Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions, facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen's superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods.
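A rough sketch of the autoregressive scheme described above; this is not StarGen's actual API, and `warp_to_pose` and `sample_video_clip` are hypothetical stand-ins for the 3D warping step and the pose-conditioned video diffusion sampler.

```python
def generate_long_range_scene(init_frames, poses, clip_len, warp_to_pose, sample_video_clip):
    """Autoregressively generate overlapping clips along a camera-pose trajectory.

    init_frames: images seeding the first clip.
    poses: camera poses for the whole trajectory.
    warp_to_pose(frames, pose): 3D-warps spatially adjacent images into the target view.
    sample_video_clip(spatial_cond, temporal_cond, clip_poses): one diffusion sample.
    """
    frames = list(init_frames)
    for start in range(0, len(poses), clip_len - 1):           # one-frame overlap per clip
        clip_poses = poses[start:start + clip_len]
        spatial_cond = warp_to_pose(frames, clip_poses[-1])    # spatially adjacent condition
        temporal_cond = frames[-1]                             # overlapping frame from the last clip
        clip = sample_video_clip(spatial_cond, temporal_cond, clip_poses)
        frames.extend(clip[1:])                                # drop the duplicated overlap frame
    return frames
```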
https://arxiv.org/abs/2501.05763
3D scene generation conditioned on text prompts has significantly progressed due to the development of 2D diffusion generation models. However, the textual description of 3D scenes is inherently inaccurate and lacks fine-grained control during training, leading to implausible scene generation. As an intuitive and feasible solution, the 3D layout allows for precise specification of object locations within the scene. To this end, we present a text-to-scene generation method (namely, Layout2Scene) using additional semantic layout as the prompt to inject precise control of 3D object positions. Specifically, we first introduce a scene hybrid representation to decouple objects and backgrounds, which is initialized via a pre-trained text-to-3D model. Then, we propose a two-stage scheme to optimize the geometry and appearance of the initialized scene separately. To fully leverage 2D diffusion priors in geometry and appearance generation, we introduce a semantic-guided geometry diffusion model and a semantic-geometry-guided diffusion model, which are fine-tuned on a scene dataset. Extensive experiments demonstrate that our method can generate more plausible and realistic scenes as compared to state-of-the-art approaches. Furthermore, the generated scene allows for flexible yet precise editing, thereby facilitating multiple downstream applications.
https://arxiv.org/abs/2501.02519
A high-fidelity digital simulation environment is crucial for accurately replicating physical operational processes. However, inconsistencies between simulation and physical environments result in low confidence in simulation outcomes, limiting their effectiveness in guiding real-world production. Unlike the traditional step-by-step point cloud "segmentation-registration" generation method, this paper introduces, for the first time, a novel Multi-Robot Manufacturing Digital Scene Generation (MRG) method that leverages multi-instance point cloud registration, specifically within manufacturing scenes. Tailored to the characteristics of industrial robots and manufacturing settings, an instance-focused transformer module is developed to delineate instance boundaries and capture correlations between local regions. Additionally, a hypothesis generation module is proposed to extract target instances while preserving key features. Finally, an efficient screening and optimization algorithm is designed to refine the final registration results. Experimental evaluations on the Scan2CAD and Welding-Station datasets demonstrate that: (1) the proposed method outperforms existing multi-instance point cloud registration techniques; (2) compared to state-of-the-art methods, it improves MR and MP on the Scan2CAD dataset by 12.15% and 17.79%, respectively; and (3) on the Welding-Station dataset, MR and MP are enhanced by 16.95% and 24.15%, respectively. This work marks the first application of multi-instance point cloud registration in manufacturing scenes, significantly advancing the precision and reliability of digital simulation environments for industrial applications.
https://arxiv.org/abs/2501.02041
Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and street view images demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.
https://arxiv.org/abs/2501.00601
In this work, we introduce Prometheus, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon a pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation. Project page: this https URL
https://arxiv.org/abs/2412.21117
Recent advancements in object-centric text-to-3D generation have shown impressive results. However, generating complex 3D scenes remains an open challenge due to the intricate relations between objects. Moreover, existing methods are largely based on score distillation sampling (SDS), which constrains the ability to manipulate multiple objects with specific interactions. Addressing these critical yet underexplored issues, we present a novel framework of Scene Graph and Layout Guided 3D Scene Generation (GraLa3D). Given a text prompt describing a complex 3D scene, GraLa3D utilizes an LLM to model the scene using a scene graph representation with layout bounding box information. GraLa3D uniquely constructs the scene graph with single-object nodes and composite super-nodes. In addition to constraining 3D generation within the desirable layout, a major contribution lies in the modeling of interactions between objects in a super-node, while alleviating appearance leakage across objects within such nodes. Our experiments confirm that GraLa3D overcomes the above limitations and generates complex 3D scenes closely aligned with text prompts.
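To make the scene-graph-plus-layout representation concrete, here is a minimal sketch of the kind of structure the abstract describes, with single-object nodes and composite super-nodes; the class and field names are illustrative assumptions, not GraLa3D's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ObjectNode:
    name: str
    bbox: Tuple[float, float, float, float, float, float]  # layout box: (x, y, z, w, h, d)

@dataclass
class SuperNode:
    """Composite node grouping objects that interact, e.g. 'a cat sitting on a chair'."""
    members: List[ObjectNode]
    interaction: str

@dataclass
class SceneGraph:
    singles: List[ObjectNode] = field(default_factory=list)
    supers: List[SuperNode] = field(default_factory=list)

cat = ObjectNode("cat", (0.0, 0.5, 0.0, 0.3, 0.3, 0.5))
chair = ObjectNode("chair", (0.0, 0.0, 0.0, 0.6, 0.5, 0.6))
scene = SceneGraph(
    singles=[ObjectNode("lamp", (1.0, 0.0, 1.0, 0.2, 1.2, 0.2))],
    supers=[SuperNode(members=[cat, chair], interaction="sitting on")],
)
```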
https://arxiv.org/abs/2412.20473
Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it demonstrates resilience to depth-deficient regions, providing reliable completion for both continuous areas and isolated points, and (2) it faithfully preserves scale consistency with the conditioned known depth when filling in missing values. Drawing on these advantages, our approach proves its worth in various downstream tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction with DUST3R, and LiDAR depth completion, exceeding current solutions in both numerical performance and visual quality. Our project page with source code is available at this https URL.
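One common way to keep inpainted depth scale-consistent with the conditioned known depth is a least-squares scale/shift fit over the valid pixels, applied afterwards to the whole map. The sketch below illustrates that idea under this assumption; it is not DepthLab's published procedure.

```python
import numpy as np

def align_to_known_depth(pred: np.ndarray, known: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Fit known ~= s * pred + t on pixels where mask is True, then rescale pred.

    pred:  predicted (relative) depth map.
    known: partial metric depth, valid where mask is True.
    """
    p, k = pred[mask].ravel(), known[mask].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, k, rcond=None)   # least-squares scale s and shift t
    return s * pred + t                              # whole map now lives in the known scale

depth_pred = np.random.rand(4, 4) + 1.0
depth_known = 2.5 * depth_pred + 0.3                 # pretend metric depth
mask = np.zeros((4, 4), dtype=bool); mask[:2] = True # only the top half is observed
print(np.allclose(align_to_known_depth(depth_pred, depth_known, mask), depth_known))  # True
```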
https://arxiv.org/abs/2412.18153
To enhance autonomous driving safety in complex scenarios, various methods have been proposed to simulate LiDAR point cloud data. Nevertheless, these methods often face challenges in producing high-quality, diverse, and controllable foreground objects. To address the needs of object-aware tasks in 3D perception, we introduce OLiDM, a novel framework capable of generating high-fidelity LiDAR data at both the object and the scene levels. OLiDM consists of two pivotal components: the Object-Scene Progressive Generation (OPG) module and the Object Semantic Alignment (OSA) module. OPG adapts to user-specific prompts to generate desired foreground objects, which are subsequently employed as conditions in scene generation, ensuring controllable outputs at both the object and scene levels. This also facilitates the association of user-defined object-level annotations with the generated LiDAR scenes. Moreover, OSA aims to rectify the misalignment between foreground objects and background scenes, enhancing the overall quality of the generated objects. The broad effectiveness of OLiDM is demonstrated across various LiDAR generation tasks, as well as in 3D perception tasks. Specifically, on the KITTI-360 dataset, OLiDM surpasses prior state-of-the-art methods such as UltraLiDAR by 17.5 in FPD. Additionally, in sparse-to-dense LiDAR completion, OLiDM achieves a significant improvement over LiDARGen, with a 57.47% increase in semantic IoU. Moreover, OLiDM enhances the performance of mainstream 3D detectors by 2.4% in mAP and 1.9% in NDS, underscoring its potential in advancing object-aware 3D tasks. Code is available at: this https URL.
https://arxiv.org/abs/2412.17226
We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and highlight the creative capabilities for 4D scene generation from real or generated videos. See our project page for results and interactive demos: this https URL.
https://arxiv.org/abs/2411.18613
Scene generation is crucial to many computer graphics applications. Recent advances in generative AI have streamlined sketch-to-image workflows, easing the workload for artists and designers in creating scene concept art. However, these methods often struggle with complex scenes containing multiple detailed objects, sometimes missing small or uncommon instances. In this paper, we propose Training-free Triplet Tuning for Sketch-to-Scene (T3-S2S) generation, informed by a review of the entire cross-attention mechanism. This scheme revitalizes the existing ControlNet model, enabling effective handling of multi-instance generation through prompt balance, characteristics prominence, and dense tuning. Specifically, this approach enhances keyword representation via the prompt balance module, reducing the risk of missing critical instances. It also includes a characteristics prominence module that highlights TopK indices in each channel, ensuring essential features are better represented based on token sketches. Additionally, it employs dense tuning to refine contour details in the attention map, compensating for instance-related regions. Experiments validate that our triplet tuning approach substantially improves the performance of existing sketch-to-image models. It consistently generates detailed, multi-instance 2D images, closely adhering to the input prompts and enhancing visual quality in complex multi-instance scenes. Code is available at this https URL.
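A hedged sketch of the characteristics prominence idea, amplifying the top-k responses in each token's cross-attention channel; the shapes, `k`, and `boost` values are assumptions for illustration, not the T3-S2S implementation.

```python
import numpy as np

def topk_prominence(attn: np.ndarray, k: int = 8, boost: float = 2.0) -> np.ndarray:
    """Amplify the k strongest spatial responses in each token's attention channel.

    attn: (num_tokens, num_pixels) cross-attention map for one layer/head.
    """
    out = attn.copy()
    topk_idx = np.argpartition(attn, -k, axis=1)[:, -k:]   # top-k pixel indices per token
    rows = np.arange(attn.shape[0])[:, None]
    out[rows, topk_idx] *= boost                           # emphasize instance-related regions
    return out

attn = np.random.rand(5, 64)                               # 5 prompt tokens over an 8x8 latent
print(topk_prominence(attn).shape)                         # (5, 64)
```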
https://arxiv.org/abs/2412.13486
This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
https://arxiv.org/abs/2412.12091
Recent diffusion models have demonstrated remarkable performance in both 3D scene generation and perception tasks. Nevertheless, existing methods typically separate these two processes, acting as a data augmenter to generate synthetic data for downstream perception tasks. In this work, we propose OccScene, a novel mutual learning paradigm that integrates fine-grained 3D perception and high-quality generation in a unified framework, achieving a cross-task win-win effect. OccScene generates new and consistent 3D realistic scenes only depending on text prompts, guided with semantic occupancy in a joint-training diffusion framework. To align the occupancy with the diffusion latent, a Mamba-based Dual Alignment module is introduced to incorporate fine-grained semantics and geometry as perception priors. Within OccScene, the perception module can be effectively improved with customized and diverse generated scenes, while the perception priors in return enhance the generation performance for mutual benefits. Extensive experiments show that OccScene achieves realistic 3D scene generation in broad indoor and outdoor scenarios, while concurrently boosting the perception models to achieve substantial performance improvements in the 3D perception task of semantic occupancy prediction.
https://arxiv.org/abs/2412.11183
Modeling the evolution of driving scenarios is important for the evaluation and decision-making of autonomous driving systems. Most existing methods focus on one aspect of scene evolution such as map generation, motion prediction, and trajectory planning. In this paper, we propose a unified Generative Pre-training for Driving (GPD-1) model to accomplish all these tasks altogether without additional fine-tuning. We represent each scene with ego, agent, and map tokens and formulate autonomous driving as a unified token generation problem. We adopt the autoregressive transformer architecture and use a scene-level attention mask to enable intra-scene bi-directional interactions. For the ego and agent tokens, we propose a hierarchical positional tokenizer to effectively encode both 2D positions and headings. For the map tokens, we train a map vector-quantized autoencoder to efficiently compress ego-centric semantic maps into discrete tokens. We pre-train our GPD-1 on the large-scale nuPlan dataset and conduct extensive experiments to evaluate its effectiveness. With different prompts, our GPD-1 successfully generalizes to various tasks without fine-tuning, including scene generation, traffic simulation, closed-loop simulation, map prediction, and motion planning. Code: this https URL.
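A minimal sketch of what a scene-level attention mask of this kind could look like: tokens in the same scene (timestep) attend to each other bidirectionally, while attention across scenes stays causal. The token-to-scene assignment and mask convention are assumptions for illustration, not GPD-1's code.

```python
import numpy as np

def scene_level_mask(scene_ids: np.ndarray) -> np.ndarray:
    """Boolean mask: token i may attend to token j iff scene(j) <= scene(i).

    scene_ids: (num_tokens,) scene/timestep index for each ego, agent, or map token.
    Tokens within a scene see each other; tokens from future scenes stay hidden.
    """
    return scene_ids[None, :] <= scene_ids[:, None]

# Example: two consecutive scenes, each with an ego, an agent, and a map token.
ids = np.array([0, 0, 0, 1, 1, 1])
print(scene_level_mask(ids).astype(int))
```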
https://arxiv.org/abs/2412.08643
Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text. However, existing T2I models show degraded performance in compositional image generation involving multiple objects and intricate relationships. We attribute this problem to limitations of existing image-text pair datasets, which provide only prompts and lack precise inter-object relationship annotations. To address this problem, we construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs (SG), which precisely describe attributes and relationships of multiple objects, effectively representing the semantic structure in complex scenes. Based on LAION-SG, we train a new foundation model SDXL-SG to incorporate structural annotation information into the generation process. Extensive experiments show that advanced models trained on our LAION-SG achieve significant performance improvements in complex scene generation over models trained on existing datasets. We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation, establishing a new standard for this domain.
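To make the structural annotation concrete, here is a hypothetical scene-graph record in the spirit the abstract describes, with objects, attributes, and relation triples; the field names are illustrative, not LAION-SG's released schema.

```python
sample_annotation = {
    "image_id": "example_000001",
    "objects": [
        {"id": 0, "label": "dog", "attributes": ["small", "brown"]},
        {"id": 1, "label": "sofa", "attributes": ["leather"]},
        {"id": 2, "label": "window", "attributes": ["open"]},
    ],
    # (subject_id, relation, object_id) triples encoding inter-object relationships
    "relations": [
        (0, "lying on", 1),
        (1, "next to", 2),
    ],
}
```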
https://arxiv.org/abs/2412.08580
Generating high-fidelity, controllable, and annotated training data is critical for autonomous driving. Existing methods typically generate a single data form directly from a coarse scene layout, which not only fails to output rich data forms required for diverse downstream tasks but also struggles to model the direct layout-to-data distribution. In this paper, we introduce UniScene, the first unified framework for generating three key data forms - semantic occupancy, video, and LiDAR - in driving scenes. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data, respectively, with two novel transfer strategies of Gaussian-based Joint Rendering and Prior-guided Sparse Modeling. This occupancy-centric approach reduces the generation burden, especially for intricate scenes, while providing detailed intermediate representations for the subsequent generation stages. Extensive experiments demonstrate that UniScene outperforms previous SOTAs in the occupancy, video, and LiDAR generation, which also indeed benefits downstream driving tasks.
https://arxiv.org/abs/2412.05435
Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing dynamic 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture that harnesses video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model, and then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize multi-view images using a dynamic renderer, enabling flexible camera control based on user preferences. Adopting a training-free architecture, our PaintScene4D efficiently produces realistic 4D scenes that can be viewed from arbitrary trajectories. The code will be made publicly available. Our project page is at this https URL
https://arxiv.org/abs/2412.04471
Realistic and interactive scene simulation is a key prerequisite for autonomous vehicle (AV) development. In this work, we present SceneDiffuser, a scene-level diffusion prior designed for traffic simulation. It offers a unified framework that addresses two key stages of simulation: scene initialization, which involves generating initial traffic layouts, and scene rollout, which encompasses the closed-loop simulation of agent behaviors. While diffusion models have been proven effective in learning realistic and multimodal agent distributions, several challenges remain, including controllability, maintaining realism in closed-loop simulations, and ensuring inference efficiency. To address these issues, we introduce amortized diffusion for simulation. This novel diffusion denoising paradigm amortizes the computational cost of denoising over future simulation steps, significantly reducing the cost per rollout step (16x fewer inference steps) while also mitigating closed-loop errors. We further enhance controllability through the introduction of generalized hard constraints, a simple yet effective inference-time constraint mechanism, as well as language-based constrained scene generation via few-shot prompting of a large language model (LLM). Our investigations into model scaling reveal that increased computational resources significantly improve overall simulation realism. We demonstrate the effectiveness of our approach on the Waymo Open Sim Agents Challenge, achieving top open-loop performance and the best closed-loop performance among diffusion models.
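One possible reading of the amortized denoising idea is a rolling buffer of future steps held at staggered noise levels, where each simulation step applies a single denoising update to the buffer and emits the step that has become clean. The sketch below is only that reading; `denoise_once` and `sample_noise` are hypothetical callables, and this is not SceneDiffuser's algorithm.

```python
from collections import deque

def amortized_rollout(init_plans, denoise_once, sample_noise, num_steps):
    """Roll out a simulation while amortizing denoising over future steps.

    init_plans: list of noisy future-step plans, front = almost clean, back = pure noise.
    denoise_once(plan, noise_level): hypothetical single denoising update.
    sample_noise(): hypothetical draw of a fresh fully-noisy plan.
    """
    buffer = deque(init_plans)
    trajectory = []
    for _ in range(num_steps):
        # One denoising update per buffered step, instead of a full chain per rollout step.
        buffer = deque(denoise_once(plan, level) for level, plan in enumerate(buffer))
        trajectory.append(buffer.popleft())   # the front plan has reached noise level zero
        buffer.append(sample_noise())         # keep the horizon filled with fresh noise
    return trajectory
```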
https://arxiv.org/abs/2412.12129
One strategy for detecting the pose and shape of unknown objects is geometric modeling, which consists of fitting known geometric entities. Classical geometric modeling fits simple shapes such as spheres or cylinders, but these often do not cover the variety of shapes that can be encountered. For those situations, one solution is the use of superquadrics, which can adapt to a wider variety of shapes. One limitation of superquadrics is that they cannot model objects with holes, such as those with handles. This work aims to fit supersurfaces of degree four, in particular supertoroids, to objects with a single hole. Following the results for superquadrics, simple expressions for the major and minor radial distances are derived, which lead to the fitting of the intrinsic and extrinsic parameters of the supertoroid. The differential geometry of the surface is also studied as a function of these parameters. The result is a supergeometric modeling approach that can be used for symmetric objects with and without holes, with a simple distance function for the fitting. The proposed algorithm considerably expands the range of shapes that can be targeted for geometric modeling.
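For reference, a standard Barr-style supertoroid inside-outside function (a reasonable guess at the degree-four supersurface family meant here, not necessarily the paper's exact parameterization) is:

```latex
F(x, y, z) =
  \left( \left( \left(\tfrac{x}{a_1}\right)^{2/\varepsilon_2}
       + \left(\tfrac{y}{a_2}\right)^{2/\varepsilon_2} \right)^{\varepsilon_2 / 2}
       - a_4 \right)^{2/\varepsilon_1}
  + \left(\tfrac{z}{a_3}\right)^{2/\varepsilon_1}
```

with F = 1 on the surface, F < 1 inside and F > 1 outside; a_1, a_2, a_3 scale the cross-section, a_4 sets the normalized hole radius, and ε_1, ε_2 control the squareness of the two generating curves. Under this reading, fitting amounts to minimizing a radial-distance function derived from F over these intrinsic parameters together with a rigid pose.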
https://arxiv.org/abs/2412.04174
We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. Previous methods for scene generation either suffer from limited scales or lack geometric and appearance consistency along generated sequences. In contrast, we leverage the recent advancements in scalable 3D representation and video models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects. Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness and superiority of our model.
https://arxiv.org/abs/2412.03934