With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research. Project website: this http URL.
https://arxiv.org/abs/2404.09465
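To make the guidance idea in the PhyScene abstract concrete, here is a minimal, illustrative sketch of adding a physics-based cost gradient to one reverse diffusion step, in the spirit of classifier-style guidance; the collision cost, the denoiser interface, and the guidance scale are assumptions made for illustration, not PhyScene's actual code.

import torch

def collision_cost(layout):
    # layout: (N, 4) boxes as (cx, cy, w, h); penalize pairwise overlap area,
    # a stand-in for the paper's collision / layout / reachability constraints.
    cx, cy, w, h = layout.unbind(-1)
    x0, x1 = cx - w / 2, cx + w / 2
    y0, y1 = cy - h / 2, cy + h / 2
    ox = (torch.minimum(x1[:, None], x1[None, :]) - torch.maximum(x0[:, None], x0[None, :])).clamp(min=0)
    oy = (torch.minimum(y1[:, None], y1[None, :]) - torch.maximum(y0[:, None], y0[None, :])).clamp(min=0)
    overlap = ox * oy
    return overlap.sum() - overlap.diagonal().sum()    # exclude each box overlapping itself

def guided_step(x_t, denoiser, t, guidance_scale=1.0):
    # One reverse step: the denoiser proposes a mean, then the gradient of the
    # physics cost w.r.t. x_t nudges the sample toward plausible layouts.
    x_t = x_t.detach().requires_grad_(True)
    mean, sigma = denoiser(x_t, t)                      # hypothetical denoiser interface
    grad = torch.autograd.grad(collision_cost(x_t), x_t)[0]
    mean = mean - guidance_scale * sigma ** 2 * grad    # steer away from collisions
    return (mean + sigma * torch.randn_like(mean)).detach()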
Cinemagraph is a unique form of visual media that combines elements of still photography and subtle motion to create a captivating experience. However, the majority of videos generated by recent works lack depth information and are confined to the constraints of 2D image space. In this paper, inspired by significant progress in the field of novel view synthesis (NVS) achieved by 3D Gaussian Splatting (3D-GS), we propose LoopGaussian to elevate cinemagraph from 2D image space to 3D space using 3D Gaussian modeling. To achieve this, we first employ the 3D-GS method to reconstruct 3D Gaussian point clouds from multi-view images of static scenes,incorporating shape regularization terms to prevent blurring or artifacts caused by object deformation. We then adopt an autoencoder tailored for 3D Gaussian to project it into feature space. To maintain the local continuity of the scene, we devise SuperGaussian for clustering based on the acquired features. By calculating the similarity between clusters and employing a two-stage estimation method, we derive an Eulerian motion field to describe velocities across the entire scene. The 3D Gaussian points then move within the estimated Eulerian motion field. Through bidirectional animation techniques, we ultimately generate a 3D Cinemagraph that exhibits natural and seamlessly loopable dynamics. Experiment results validate the effectiveness of our approach, demonstrating high-quality and visually appealing scene generation.
https://arxiv.org/abs/2404.08966
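As a loose illustration of the Eulerian motion field and bidirectional looping described in the LoopGaussian abstract, the following sketch advects 3D points in a time-invariant velocity field and cross-fades a forward and a backward pass so that the first and last frames coincide; the nearest-cluster velocity lookup is an assumed stand-in, not the paper's estimation method.

import numpy as np

def velocity_at(points, centers, velocities):
    # Nearest-cluster velocity lookup (stand-in for the paper's SuperGaussian
    # clustering and similarity-based two-stage estimation).
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return velocities[np.argmin(d, axis=1)]

def loop_animation(points, centers, velocities, num_frames=60, dt=0.1):
    fwd, bwd = [points.copy()], [points.copy()]
    p_f, p_b = points.copy(), points.copy()
    for _ in range(num_frames - 1):
        p_f = p_f + dt * velocity_at(p_f, centers, velocities)   # forward advection
        p_b = p_b - dt * velocity_at(p_b, centers, velocities)   # backward advection
        fwd.append(p_f.copy())
        bwd.append(p_b.copy())
    frames = []
    for i in range(num_frames):
        w = i / (num_frames - 1)                                  # cross-fade weight
        frames.append((1 - w) * fwd[i] + w * bwd[num_frames - 1 - i])
    return frames   # frame 0 and frame N-1 both equal the rest pose, so the clip loops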
We introduce RealmDreamer, a technique for generating general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model conditioned on samples from the inpainting model, which provides rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
https://arxiv.org/abs/2404.07199
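One plausible reading of "computing the occlusion volume" in the RealmDreamer abstract is marking the voxels that lie behind the surface observed from the input view; the sketch below does exactly that under an assumed pinhole camera, and is only an illustration, not the authors' implementation.

import numpy as np

def occlusion_volume(depth, K, grid_min, grid_max, res=64):
    # depth: (H, W) depth map from the input view; K: 3x3 intrinsics.
    H, W = depth.shape
    xs = [np.linspace(grid_min[i], grid_max[i], res) for i in range(3)]
    X, Y, Z = np.meshgrid(*xs, indexing="ij")
    pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)          # voxel centers (camera frame)

    uvw = pts @ K.T                                            # pinhole projection
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    z = uvw[:, 2]

    occluded = np.zeros(len(pts), dtype=bool)
    inside = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    ui, vi = u[inside].astype(int), v[inside].astype(int)
    occluded[inside] = z[inside] > depth[vi, ui] + 1e-3        # behind the visible surface
    return occluded.reshape(res, res, res)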
Many safety-critical applications, especially in autonomous driving, require reliable object detectors. They can be very effectively assisted by a method to search for and identify potential failures and systematic errors before these detectors are deployed. Systematic errors are characterized by combinations of attributes such as object location, scale, orientation, and color, as well as the composition of their respective backgrounds. To identify them, one must rely on something other than real images from a test set because they do not account for very rare but possible combinations of attributes. To overcome this limitation, we propose a pipeline for generating realistic synthetic scenes with fine-grained control, allowing the creation of complex scenes with multiple objects. Our approach, BEV2EGO, allows for a realistic generation of the complete scene with road-contingent control that maps 2D bird's-eye view (BEV) scene configurations to a first-person view (EGO). In addition, we propose a benchmark for controlled scene generation to select the most appropriate generative outpainting model for BEV2EGO. We further use it to perform a systematic analysis of multiple state-of-the-art object detection models and discover differences between them.
https://arxiv.org/abs/2404.07045
The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{\circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. To address the unobserved regions inherent in single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360$^{\circ}$ perspective, providing an enhanced immersive experience over existing techniques. Project website at: this http URL
https://arxiv.org/abs/2404.06903
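The lifting step in the abstract above (panorama plus monocular depth to a point cloud that seeds Gaussian centroids) can be illustrated with a plain equirectangular unprojection; the sketch below shows only this geometric step and omits the paper's global depth alignment.

import numpy as np

def panorama_to_points(depth, rgb):
    # depth: (H, W) per-pixel depth and rgb: (H, W, 3), both in equirectangular layout.
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    lon = (u / W) * 2 * np.pi - np.pi            # longitude in [-pi, pi)
    lat = np.pi / 2 - (v / H) * np.pi            # latitude in (-pi/2, pi/2]
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    points = dirs * depth[..., None]             # scale unit rays by depth
    return points.reshape(-1, 3), rgb.reshape(-1, 3)

# e.g. pts, cols = panorama_to_points(depth_map, panorama_rgb) to seed Gaussian centroids.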
Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling up the methodology to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, present a formidable barrier to the interpretability of ambiguous textual descriptions for effective model optimization. In this work, we surmount the limitations by introducing a compositional 3D layout representation into the text-to-3D paradigm, serving as an additional prior. It comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships, complementing textual descriptions and enabling steerable generation. Upon this, we propose two modifications: (1) We introduce Layout-Guided Variational Score Distillation to address model optimization inadequacies. It conditions the score distillation sampling process with geometric and semantic constraints of 3D layouts. (2) To handle the unbounded nature of urban scenes, we represent the 3D scene with a Scalable Hash Grid structure, incrementally adapting to the growing scale of urban scenes. Extensive experiments substantiate the capability of our framework to scale text-to-3D generation to large-scale urban scenes that cover over 1000m of driving distance for the first time. We also present various scene editing demonstrations, showing the power of steerable urban scene generation. Website: this https URL.
https://arxiv.org/abs/2404.06780
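For the Scalable Hash Grid mentioned above, a rough intuition is that features of an unbounded scene can be indexed by a spatial hash rather than a dense grid; the sketch below shows a generic Instant-NGP-style hashed lookup, which is only an analogy and not the paper's exact incremental structure.

import numpy as np

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.int64)

class HashGrid:
    def __init__(self, table_size=2**18, feat_dim=8, cell=1.0):
        # One feature vector per hash bucket; collisions are resolved by learning.
        self.table = 0.01 * np.random.randn(table_size, feat_dim).astype(np.float32)
        self.cell = cell

    def lookup(self, xyz):
        # xyz: (N, 3) world coordinates; returns nearest-cell features (no interpolation).
        idx = np.floor(np.asarray(xyz) / self.cell).astype(np.int64)
        h = idx * PRIMES
        h = h[:, 0] ^ h[:, 1] ^ h[:, 2]
        return self.table[h % len(self.table)]

grid = HashGrid()
feats = grid.lookup(np.array([[12.3, 0.5, -800.0], [1000.0, 2.0, 3.0]]))   # works far from the origin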
Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods have attempted to address only the training-size limit, they often yield human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged and hierarchical approach: it initially generates a detailed base image focusing on crucial elements in instance creation for multiple humans and on detailed descriptions beyond the token limit of the diffusion model, and then seamlessly converts the base image to a higher-resolution output that exceeds the training image size and incorporates text- and instance-aware details via our novel instance-aware hierarchical enlargement process, which consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining. Project page: this https URL.
https://arxiv.org/abs/2404.04544
Text-to-3D scene generation holds immense potential for the gaming, film, and architecture sectors. Despite significant progress, existing methods struggle with maintaining high quality, consistency, and editing flexibility. In this paper, we propose DreamScene, a 3D Gaussian-based novel text-to-3D scene generation framework, to tackle the aforementioned three challenges mainly via two strategies. First, DreamScene employs Formation Pattern Sampling (FPS), a multi-timestep sampling strategy guided by the formation patterns of 3D objects, to form fast, semantically rich, and high-quality representations. FPS uses 3D Gaussian filtering for optimization stability, and leverages reconstruction techniques to generate plausible textures. Second, DreamScene employs a progressive three-stage camera sampling strategy, specifically designed for both indoor and outdoor settings, to effectively ensure object-environment integration and scene-wide 3D consistency. Last, DreamScene enhances scene editing flexibility by integrating objects and environments, enabling targeted adjustments. Extensive experiments validate DreamScene's superiority over current state-of-the-art techniques, heralding its wide-ranging potential for diverse applications. Code and demos will be released at this https URL .
https://arxiv.org/abs/2404.03575
In this paper, we introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models (a.k.a., diffusion models) and transformers. Our proposed framework, termed the "World-Centric Diffusion Transformer" (WcDT), optimizes the entire trajectory generation process, from feature extraction to model inference. To enhance the scene diversity and stochasticity, the historical trajectory data is first preprocessed and encoded into latent space using Denoising Diffusion Probabilistic Models (DDPM) enhanced with Diffusion with Transformer (DiT) blocks. Then, the latent features, historical trajectories, HD map features, and historical traffic signal information are fused with various transformer-based encoders. The encoded traffic scenes are then decoded by a trajectory decoder to generate multimodal future trajectories. Comprehensive experimental results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories, showing its potential for integration into automatic driving simulation systems.
https://arxiv.org/abs/2404.02082
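A minimal skeleton of the fusion-then-decode pattern described in the WcDT abstract might look like the following: each modality is projected to tokens, fused with a transformer encoder, and decoded into several candidate trajectories. All dimensions and module choices here are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class TrajectoryFusionDecoder(nn.Module):
    def __init__(self, d=128, num_modes=6, horizon=80):
        super().__init__()
        self.proj_latent = nn.Linear(16, d)    # denoised latent trajectory features
        self.proj_hist = nn.Linear(2, d)       # historical (x, y) waypoints
        self.proj_map = nn.Linear(4, d)        # HD-map polyline features
        self.proj_signal = nn.Linear(3, d)     # traffic-signal states
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.Linear(d, num_modes * horizon * 2)
        self.num_modes, self.horizon = num_modes, horizon

    def forward(self, latent, hist, hd_map, signal):
        tokens = torch.cat([self.proj_latent(latent), self.proj_hist(hist),
                            self.proj_map(hd_map), self.proj_signal(signal)], dim=1)
        fused = self.encoder(tokens).mean(dim=1)                  # pooled scene token
        out = self.decoder(fused)
        return out.view(-1, self.num_modes, self.horizon, 2)      # K candidate future trajectories

# Example shapes: 2 scenes, 10 latent tokens, 11 history steps, 32 map segments, 4 signals.
model = TrajectoryFusionDecoder()
trajs = model(torch.randn(2, 10, 16), torch.randn(2, 11, 2),
              torch.randn(2, 32, 4), torch.randn(2, 4, 3))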
Synthesizing realistic and diverse indoor 3D scene layouts in a controllable fashion opens up applications in simulated navigation and virtual reality. As concise and robust representations of a scene, scene graphs have proven to be well suited as the semantic control for the generated layout. We present a variant of the conditional variational autoencoder (cVAE) model to synthesize 3D scenes from scene graphs and floor plans. We exploit the properties of self-attention layers to capture high-level relationships between objects in a scene and use these as the building blocks of our model. Our model leverages graph transformers to estimate the size, dimension, and orientation of the objects in a room while satisfying the relationships in the given scene graph. Our experiments show that self-attention layers lead to sparser (HOW MUCH) and more diverse (HOW MUCH) scenes. As part of this work, we publish the first large-scale dataset for conditioned scene generation from scene graphs, containing over XXX rooms (with floor plans and scene graphs).
https://arxiv.org/abs/2404.01887
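To illustrate how self-attention over scene-graph nodes can regress per-object layout attributes, here is a small sketch of one attention block followed by a regression head; the node featurization and the exact outputs (size, position, yaw) are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class SceneGraphLayoutHead(nn.Module):
    def __init__(self, d=128, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.out = nn.Linear(d, 7)     # 3 size + 3 position + 1 orientation (yaw)

    def forward(self, node_feats, pad_mask=None):
        # node_feats: (B, N, d) object-node embeddings (category + relation features assumed).
        h, _ = self.attn(node_feats, node_feats, node_feats, key_padding_mask=pad_mask)
        h = self.norm(node_feats + h)  # residual connection over the attention block
        return self.out(h)             # (B, N, 7) layout parameters per object

# e.g. a batch of 2 scenes with 12 object nodes each:
layout = SceneGraphLayoutHead()(torch.randn(2, 12, 128))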
Diffusion models (DMs) excel in photo-realistic image synthesis, but their adaptation to LiDAR scene generation poses a substantial hurdle. This is primarily because DMs operating in the point space struggle to preserve the curve-like patterns and 3D geometry of LiDAR scenes, which consumes much of their representation power. In this paper, we propose LiDAR Diffusion Models (LiDMs) to generate LiDAR-realistic scenes from a latent space tailored to capture the realism of LiDAR scenes by incorporating geometric priors into the learning pipeline. Our method targets three major desiderata: pattern realism, geometry realism, and object realism. Specifically, we introduce curve-wise compression to simulate real-world LiDAR patterns, point-wise coordinate supervision to learn scene geometry, and patch-wise encoding for a full 3D object context. With these three core designs, our method achieves competitive performance on unconditional LiDAR generation in the 64-beam scenario and state-of-the-art performance on conditional LiDAR generation, while maintaining high efficiency compared to point-based DMs (up to 107$\times$ faster). Furthermore, by compressing LiDAR scenes into a latent space, we enable the controllability of DMs with various conditions such as semantic maps, camera views, and text prompts. Our code and pretrained weights are available at this https URL.
https://arxiv.org/abs/2404.00815
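A common way to make the curve-like patterns the LiDM abstract refers to explicit is to view a LiDAR sweep as a beam-by-azimuth range image; the sketch below performs that standard projection (with an assumed 64-beam vertical field of view) as background for the curve-wise compression idea, not as the paper's actual encoder.

import numpy as np

def to_range_image(points, n_beams=64, n_cols=1024, fov_up=3.0, fov_down=-25.0):
    # points: (N, 3+) LiDAR points; returns an (n_beams, n_cols) range image.
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(y, x)                                     # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-6), -1, 1))
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    col = ((yaw + np.pi) / (2 * np.pi) * n_cols).astype(int) % n_cols
    row = np.clip(((fov_up - pitch) / (fov_up - fov_down) * n_beams).astype(int), 0, n_beams - 1)
    img = np.zeros((n_beams, n_cols), dtype=np.float32)
    img[row, col] = r                                          # later points overwrite earlier ones
    return img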
The generation of 3D scenes from user-specified conditions offers a promising avenue for alleviating the production burden in 3D applications. Previous studies required significant effort to realize the desired scene, owing to limited control conditions. We propose a method for controlling and generating 3D scenes under multimodal conditions using partial images, layout information represented in the top view, and text prompts. Combining these conditions to generate a 3D scene involves the following significant difficulties: (1) the creation of large datasets, (2) reflecting the interaction of multimodal conditions, and (3) the domain dependence of the layout conditions. We decompose the process of 3D scene generation into 2D image generation from the given conditions and 3D scene generation from 2D images. 2D image generation is achieved by fine-tuning a pretrained text-to-image model with a small artificial dataset of partial images and layouts, and 3D scene generation is achieved by layout-conditioned depth estimation and neural radiance fields (NeRF), thereby avoiding the creation of large datasets. The use of a common representation of spatial information using 360-degree images allows for the consideration of multimodal condition interactions and reduces the domain dependence of the layout control. The experimental results qualitatively and quantitatively demonstrate that the proposed method can generate 3D scenes in diverse domains, from indoor to outdoor, according to multimodal conditions.
https://arxiv.org/abs/2404.00345
Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate: they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The lack of a more flexible motion model contributes to the gap in realism between 4D generation methods and recent, near-photorealistic video generation models. Here, we propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We represent the global motion of a scene's bounding box using rigid transformation along a trajectory parameterized by a spline. We learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion, which we evaluate qualitatively and through a user study. Video results can be viewed on our website: this https URL.
https://arxiv.org/abs/2403.17920
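The global-motion component described in the TC4D abstract, rigid transformation along a spline-parameterized trajectory, can be sketched as follows; a Catmull-Rom spline and tangent-aligned yaw are assumptions made for illustration, and the learned local deformations are not modeled.

import numpy as np

def catmull_rom(p0, p1, p2, p3, t):
    # One Catmull-Rom segment evaluated at t in [0, 1].
    return 0.5 * ((2 * p1) + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)

def pose_along_trajectory(control_pts, s, eps=1e-3):
    # control_pts: (M, 3) with M >= 4; s in [0, 1] over the whole trajectory.
    n_seg = len(control_pts) - 3
    i = min(int(s * n_seg), n_seg - 1)
    t = s * n_seg - i
    p = catmull_rom(*control_pts[i:i + 4], t)
    p_next = catmull_rom(*control_pts[i:i + 4], min(t + eps, 1.0))
    d = p_next - p
    return p, np.arctan2(d[1], d[0])           # heading follows the tangent

def transform_box(vertices, position, yaw):
    # Rigid transform of the scene bounding box: rotate about z, then translate.
    c, s_ = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s_, 0], [s_, c, 0], [0, 0, 1]])
    return vertices @ R.T + position

# e.g. pos, yaw = pose_along_trajectory(np.array([[0,0,0],[1,0,0],[2,1,0],[3,3,0],[4,3,0]], float), 0.5)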
Recent advancements in diffusion models for 2D and 3D content creation have sparked a surge of interest in generating 4D content. However, the scarcity of 3D scene datasets constrains current methodologies to primarily object-centric generation. To overcome this limitation, we present Comp4D, a novel framework for Compositional 4D Generation. Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately. Utilizing Large Language Models (LLMs), the framework begins by decomposing an input text prompt into distinct entities and maps out their trajectories. It then constructs the compositional 4D scene by accurately positioning these objects along their designated paths. To refine the scene, our method employs a compositional score distillation technique guided by the pre-defined trajectories, utilizing pre-trained diffusion models across text-to-image, text-to-video, and text-to-3D domains. Extensive experiments demonstrate our outstanding 4D content creation capability compared to prior art, showcasing superior visual quality, motion fidelity, and enhanced object interactions.
https://arxiv.org/abs/2403.16993
Due to its great application potential, large-scale scene generation has drawn extensive attention in academia and industry. Recent research employs powerful generative models to create desired scenes and achieves promising results. However, most of these methods represent the scene using 3D primitives (e.g., point clouds or radiance fields) incompatible with the industrial pipeline, which leads to a substantial gap between academic research and industrial deployment. Procedural Controllable Generation (PCG) is an efficient technique for creating scalable and high-quality assets, but it is unfriendly for ordinary users as it demands profound domain expertise. To address these issues, we resort to using a large language model (LLM) to drive the procedural modeling. In this paper, we introduce a large-scale scene generation framework, SceneX, which can automatically produce high-quality procedural models according to designers' textual descriptions. Specifically, the proposed method comprises two components, PCGBench and PCGPlanner. The former encompasses an extensive collection of accessible procedural assets and thousands of hand-crafted API documents. The latter aims to generate executable actions for Blender to produce controllable and precise 3D assets guided by the user's instructions. Our SceneX can generate a city spanning 2.5 km × 2.5 km with delicate layout and geometric structures, drastically reducing the time cost from several weeks for professional PCG engineers to just a few hours for an ordinary user. Extensive experiments demonstrate the capability of our method in controllable large-scale scene generation and editing, including asset placement and season translation.
https://arxiv.org/abs/2403.15698
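To give a flavor of "executable actions for Blender", here is a toy plan interpreted with Blender mesh primitives; the plan schema and the action-to-call mapping are hypothetical, and SceneX's procedural assets and APIs are far richer than this. The snippet assumes it runs inside Blender's Python environment, where bpy is available.

import bpy

# A hypothetical planner output: a list of simple placement actions.
plan = [
    {"action": "add_box", "location": (0, 0, 1), "size": 2.0},                           # e.g. a building block
    {"action": "add_cylinder", "location": (5, 0, 0.5), "radius": 0.3, "depth": 1.0},    # e.g. a pole
]

for step in plan:
    if step["action"] == "add_box":
        bpy.ops.mesh.primitive_cube_add(size=step["size"], location=step["location"])
    elif step["action"] == "add_cylinder":
        bpy.ops.mesh.primitive_cylinder_add(radius=step["radius"], depth=step["depth"],
                                            location=step["location"])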
Generating realistic 3D scenes is challenging due to the complexity of room layouts and object geometries. We propose a sketch-based, knowledge-enhanced diffusion architecture (SEK) for generating customized, diverse, and plausible 3D scenes. SEK conditions the denoising process with a hand-drawn sketch of the target scene and cues from an object relationship knowledge base. We first construct an external knowledge base containing object relationships and then leverage knowledge-enhanced graph reasoning to assist our model in understanding hand-drawn sketches. A scene is represented as a combination of 3D objects and their relationships, and then incrementally diffused to reach a Gaussian distribution. We propose a 3D denoising scene transformer that learns to reverse the diffusion process, conditioned on a hand-drawn sketch along with knowledge cues, to regressively generate the scene, including the 3D object instances as well as their layout. Experiments on the 3D-FRONT dataset show that our model improves FID and CKL by 17.41% and 37.18% in 3D scene generation, and FID and KID by 19.12% and 20.06% in 3D scene completion, compared to the nearest competitor, DiffuScene.
https://arxiv.org/abs/2403.14121
Compositional 3D scene synthesis has diverse applications across a spectrum of industries such as robotics, films, and video games, as it closely mirrors the complexity of real-world multi-object environments. Early works typically employ shape-retrieval-based frameworks, which naturally suffer from limited shape diversity. Recent progress has been made in shape generation with powerful generative models, such as diffusion models, which increase shape fidelity. However, these approaches treat 3D shape generation and layout generation separately. The synthesized scenes are usually hampered by layout collisions, which implies that scene-level fidelity is still under-explored. In this paper, we aim to generate realistic and reasonable 3D scenes from scene graphs. To enrich the representation capability of the given scene graph inputs, a large language model is utilized to explicitly aggregate global graph features with local relationship features. With a unified graph convolution network (GCN), graph features are extracted from scene graphs updated via the joint layout-shape distribution. During scene generation, an IoU-based regularization loss is introduced to constrain the predicted 3D layouts. Benchmarked on the SG-FRONT dataset, our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity. The source code will be released after publication.
https://arxiv.org/abs/2403.12848
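The IoU-based regularization loss mentioned above can be illustrated with a simple differentiable overlap penalty over axis-aligned 3D boxes; the center-plus-size box parameterization and the normalization are assumptions for this sketch, not the paper's exact loss.

import torch

def layout_overlap_loss(centers, sizes):
    # centers, sizes: (N, 3) axis-aligned boxes; returns the mean pairwise IoU.
    mins = centers - sizes / 2
    maxs = centers + sizes / 2
    inter = (torch.minimum(maxs[:, None], maxs[None, :])
             - torch.maximum(mins[:, None], mins[None, :])).clamp(min=0).prod(-1)
    vol = sizes.prod(-1)
    union = vol[:, None] + vol[None, :] - inter
    iou = inter / union.clamp(min=1e-8)
    n = centers.shape[0]
    iou = iou * (1.0 - torch.eye(n, device=centers.device))   # drop each box's self-IoU
    return iou.sum() / max(n * (n - 1), 1)                    # minimized alongside the main objective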
Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods heavily rely on the outputs of existing models, leading to error accumulation in geometry and appearance that prevent the models from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine the newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane features-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior from 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that, in comparison to previous methods, our approach supports wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
https://arxiv.org/abs/2403.09439
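The tri-plane feature lookup underlying the representation described above can be sketched with bilinear sampling on three feature planes; summation as the fusion rule is one common choice assumed here, not necessarily the paper's.

import torch
import torch.nn.functional as F

def query_triplane(planes, xyz):
    # planes: (3, C, R, R) feature planes for the XY, XZ, and YZ planes.
    # xyz: (N, 3) query points normalized to [-1, 1].
    coords = torch.stack([xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]])   # (3, N, 2)
    grid = coords.unsqueeze(2)                                               # (3, N, 1, 2)
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=True) # (3, C, N, 1)
    return feats.squeeze(-1).sum(dim=0).transpose(0, 1)                      # (N, C) fused features

# e.g. feats = query_triplane(torch.randn(3, 32, 128, 128), torch.rand(1000, 3) * 2 - 1)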
We present "SemCity," a 3D diffusion model for semantic scene generation in real-world outdoor environments. Most 3D diffusion models focus on generating a single object, synthetic indoor scenes, or synthetic outdoor scenes, while the generation of real-world outdoor scenes is rarely addressed. In this paper, we concentrate on generating a real-outdoor scene through learning a diffusion model on a real-world outdoor dataset. In contrast to synthetic data, real-outdoor datasets often contain more empty spaces due to sensor limitations, causing challenges in learning real-outdoor distributions. To address this issue, we exploit a triplane representation as a proxy form of scene distributions to be learned by our diffusion model. Furthermore, we propose a triplane manipulation that integrates seamlessly with our triplane diffusion model. The manipulation improves our diffusion model's applicability in a variety of downstream tasks related to outdoor scene generation such as scene inpainting, scene outpainting, and semantic scene completion refinements. In experimental results, we demonstrate that our triplane diffusion model shows meaningful generation results compared with existing work in a real-outdoor dataset, SemanticKITTI. We also show our triplane manipulation facilitates seamlessly adding, removing, or modifying objects within a scene. Further, it also enables the expansion of scenes toward a city-level scale. Finally, we evaluate our method on semantic scene completion refinements where our diffusion model enhances predictions of semantic scene completion networks by learning scene distribution. Our code is available at this https URL.
https://arxiv.org/abs/2403.07773
Current state-of-the-art (SOTA) 3D object detection methods often require a large amount of 3D bounding box annotations for training. However, collecting such large-scale densely-supervised datasets is notoriously costly. To reduce the cumbersome data annotation process, we propose a novel sparsely-annotated framework, in which we just annotate one 3D object per scene. Such a sparse annotation strategy could significantly reduce the heavy annotation burden, while inexact and incomplete sparse supervision may severely deteriorate the detection performance. To address this issue, we develop the SS3D++ method that alternatively improves 3D detector training and confident fully-annotated scene generation in a unified learning scheme. Using sparse annotations as seeds, we progressively generate confident fully-annotated scenes based on designing a missing-annotated instance mining module and reliable background mining module. Our proposed method produces competitive results when compared with SOTA weakly-supervised methods using the same or even more annotation costs. Besides, compared with SOTA fully-supervised methods, we achieve on-par or even better performance on the KITTI dataset with about 5x less annotation cost, and 90% of their performance on the Waymo dataset with about 15x less annotation cost. The additional unlabeled training scenes could further boost the performance. The code will be available at this https URL.
https://arxiv.org/abs/2403.02818
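A simplified picture of the "confident fully-annotated scene generation" described in the SS3D++ abstract is one round of pseudo-label mining: keep high-confidence detections that do not duplicate the single human-annotated seed box. The thresholds and the detector interface below are illustrative assumptions, not SS3D++ itself.

import numpy as np

def bev_iou(a, b):
    # Axis-aligned IoU in the ground plane; boxes are (x, y, z, dx, dy, dz, yaw),
    # with yaw ignored for simplicity in this sketch.
    iw = max(0.0, min(a[0] + a[3] / 2, b[0] + b[3] / 2) - max(a[0] - a[3] / 2, b[0] - b[3] / 2))
    il = max(0.0, min(a[1] + a[4] / 2, b[1] + b[4] / 2) - max(a[1] - a[4] / 2, b[1] - b[4] / 2))
    inter = iw * il
    union = a[3] * a[4] + b[3] * b[4] - inter
    return inter / union if union > 0 else 0.0

def mine_confident_boxes(detections, scores, seed_box, score_thr=0.9, dup_iou=0.5):
    # detections: (M, 7) predicted boxes; seed_box: (7,) the single annotated box per scene.
    mined = detections[scores >= score_thr]                    # keep only confident predictions
    if len(mined):
        ious = np.array([bev_iou(b, seed_box) for b in mined])
        mined = mined[ious < dup_iou]                          # drop duplicates of the seed annotation
    return np.concatenate([seed_box[None], mined.reshape(-1, 7)], axis=0)   # densified scene labels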