Due to its great application potential, large-scale scene generation has drawn extensive attention in academia and industry. Recent research employs powerful generative models to create desired scenes and achieves promising results. However, most of these methods represent the scene with 3D primitives (e.g., point clouds or radiance fields) that are incompatible with the industrial pipeline, which leads to a substantial gap between academic research and industrial deployment. Procedural Controllable Generation (PCG) is an efficient technique for creating scalable and high-quality assets, but it is unfriendly to ordinary users because it demands profound domain expertise. To address these issues, we resort to using a large language model (LLM) to drive the procedural modeling. In this paper, we introduce SceneX, a large-scale scene generation framework that can automatically produce high-quality procedural models according to designers' textual descriptions. Specifically, the proposed method comprises two components, PCGBench and PCGPlanner. The former encompasses an extensive collection of accessible procedural assets and thousands of hand-crafted API documents. The latter aims to generate executable actions for Blender to produce controllable and precise 3D assets guided by the user's instructions. Our SceneX can generate a city spanning 2.5 km × 2.5 km with a delicate layout and geometric structures, drastically reducing the time cost from several weeks for professional PCG engineers to just a few hours for an ordinary user. Extensive experiments demonstrate the capability of our method in controllable large-scale scene generation and editing, including asset placement and season translation.
https://arxiv.org/abs/2403.15698
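As a rough illustration of the LLM-driven procedural modeling idea above, the sketch below dispatches a tiny, hypothetical action plan to Blender's Python API. The plan schema and asset names are invented for illustration and are not SceneX's actual PCGPlanner format; the script must be run inside Blender.

```python
# Minimal sketch: executing an LLM-style action plan with Blender's Python API.
# Run inside Blender's scripting tab; the plan schema below is hypothetical.
import bpy

plan = [
    {"op": "cube", "name": "building_01", "location": (0, 0, 1), "size": 2.0},
    {"op": "cylinder", "name": "lamp_post_01", "location": (4, 0, 1.5),
     "radius": 0.2, "depth": 3.0},
]

def execute_plan(plan):
    for action in plan:
        if action["op"] == "cube":
            bpy.ops.mesh.primitive_cube_add(size=action["size"],
                                            location=action["location"])
        elif action["op"] == "cylinder":
            bpy.ops.mesh.primitive_cylinder_add(radius=action["radius"],
                                                depth=action["depth"],
                                                location=action["location"])
        bpy.context.active_object.name = action["name"]  # label the new asset

execute_plan(plan)
```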
Generating realistic 3D scenes is challenging due to the complexity of room layouts and object geometries. We propose a sketch-based, knowledge-enhanced diffusion architecture (SEK) for generating customized, diverse, and plausible 3D scenes. SEK conditions the denoising process with a hand-drawn sketch of the target scene and cues from an object relationship knowledge base. We first construct an external knowledge base containing object relationships and then leverage knowledge-enhanced graph reasoning to assist our model in understanding hand-drawn sketches. A scene is represented as a combination of 3D objects and their relationships, and then incrementally diffused to reach a Gaussian distribution. We propose a 3D denoising scene transformer that learns to reverse the diffusion process, conditioned on a hand-drawn sketch along with knowledge cues, to regressively generate the scene, including the 3D object instances as well as their layout. Experiments on the 3D-FRONT dataset show that our model improves FID and CKL by 17.41% and 37.18% in 3D scene generation, and FID and KID by 19.12% and 20.06% in 3D scene completion, compared to the nearest competitor, DiffuScene.
https://arxiv.org/abs/2403.14121
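For readers unfamiliar with conditional denoising diffusion, the snippet below shows the standard DDPM training objective with an extra conditioning embedding standing in for the sketch and knowledge-graph cues. It is a generic sketch of the technique, not SEK's actual scene transformer; the toy denoiser and noise schedule are placeholders.

```python
import torch
import torch.nn.functional as F

def conditional_diffusion_step(denoiser, scene, cond, alphas_cumprod):
    """One standard DDPM training step with conditioning.

    scene: [B, D] flattened scene parameters (objects + layout)
    cond:  [B, C] conditioning embedding (e.g. sketch + knowledge cues)
    denoiser(x_t, t, cond) is assumed to predict the added noise.
    """
    B = scene.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=scene.device)
    noise = torch.randn_like(scene)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * scene + (1 - a_bar).sqrt() * noise  # forward diffusion
    pred_noise = denoiser(x_t, t, cond)                      # conditioned reverse model
    return F.mse_loss(pred_noise, noise)

# toy usage with a dummy denoiser and a linear noise schedule
denoiser = lambda x_t, t, cond: torch.zeros_like(x_t)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
print(conditional_diffusion_step(denoiser, torch.randn(4, 64), torch.randn(4, 16), alphas_cumprod))
```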
Compositional 3D scene synthesis has diverse applications across a spectrum of industries such as robotics, films, and video games, as it closely mirrors the complexity of real-world multi-object environments. Early works typically employ shape-retrieval-based frameworks, which naturally suffer from limited shape diversity. Recent progress has been made in shape generation with powerful generative models, such as diffusion models, which increase shape fidelity. However, these approaches treat 3D shape generation and layout generation separately. The synthesized scenes are usually hampered by layout collisions, which implies that scene-level fidelity is still under-explored. In this paper, we aim to generate realistic and reasonable 3D scenes from scene graphs. To enrich the representation capability of the given scene graph inputs, a large language model is utilized to explicitly aggregate the global graph features with local relationship features. With a unified graph convolution network (GCN), graph features are extracted from scene graphs updated via a joint layout-shape distribution. During scene generation, an IoU-based regularization loss is introduced to constrain the predicted 3D layouts. Benchmarked on the SG-FRONT dataset, our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity. The source code will be released after publication.
https://arxiv.org/abs/2403.12848
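The IoU-based layout regularization mentioned above can be pictured as a pairwise overlap penalty on the predicted boxes. The sketch below uses axis-aligned 3D boxes for simplicity; the paper's exact formulation may differ.

```python
import torch

def pairwise_iou_3d(boxes):
    """Axis-aligned 3D IoU between all box pairs.

    boxes: [N, 6] as (cx, cy, cz, w, h, d).  Returns an [N, N] IoU matrix.
    """
    mins = boxes[:, :3] - boxes[:, 3:] / 2
    maxs = boxes[:, :3] + boxes[:, 3:] / 2
    lo = torch.maximum(mins[:, None, :], mins[None, :, :])   # pairwise overlap box
    hi = torch.minimum(maxs[:, None, :], maxs[None, :, :])
    inter = (hi - lo).clamp(min=0).prod(dim=-1)
    vol = boxes[:, 3:].prod(dim=-1)
    union = vol[:, None] + vol[None, :] - inter
    return inter / union.clamp(min=1e-8)

def overlap_penalty(boxes):
    """Mean IoU over distinct object pairs; zero when no layouts collide."""
    iou = pairwise_iou_3d(boxes)
    n = boxes.shape[0]
    off_diag = iou - torch.diag(torch.diagonal(iou))          # drop self-IoU
    return off_diag.sum() / max(n * (n - 1), 1)

# toy usage: two boxes that partially overlap
print(overlap_penalty(torch.tensor([[0., 0, 0, 1, 1, 1], [0.5, 0, 0, 1, 1, 1]])))
```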
Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods heavily rely on the outputs of existing models, leading to error accumulation in geometry and appearance that prevents the models from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine the newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane features-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior from a 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that, in comparison to previous methods, our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
https://arxiv.org/abs/2403.09439
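A tri-plane representation, as used here, stores scene features on three orthogonal 2D feature planes and queries them per 3D point. The sketch below shows a common way to sample and concatenate such features; it is a generic illustration, not this paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """Query per-point features from a tri-plane.

    planes: [3, C, H, W] feature maps for the XY, XZ and YZ planes.
    xyz:    [N, 3] coordinates normalized to [-1, 1].
    Returns [N, 3*C] concatenated features (a common aggregation choice).
    """
    coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]  # plane projections
    feats = []
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)                  # [1, N, 1, 2] sample grid
        f = F.grid_sample(plane.unsqueeze(0), grid,  # -> [1, C, N, 1]
                          mode="bilinear", align_corners=True)
        feats.append(f.squeeze(0).squeeze(-1).t())   # [N, C]
    return torch.cat(feats, dim=-1)

# toy usage
planes = torch.randn(3, 32, 64, 64)
pts = torch.rand(1000, 3) * 2 - 1
print(sample_triplane(planes, pts).shape)  # torch.Size([1000, 96])
```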
We present "SemCity," a 3D diffusion model for semantic scene generation in real-world outdoor environments. Most 3D diffusion models focus on generating a single object, synthetic indoor scenes, or synthetic outdoor scenes, while the generation of real-world outdoor scenes is rarely addressed. In this paper, we concentrate on generating a real-outdoor scene through learning a diffusion model on a real-world outdoor dataset. In contrast to synthetic data, real-outdoor datasets often contain more empty spaces due to sensor limitations, causing challenges in learning real-outdoor distributions. To address this issue, we exploit a triplane representation as a proxy form of scene distributions to be learned by our diffusion model. Furthermore, we propose a triplane manipulation that integrates seamlessly with our triplane diffusion model. The manipulation improves our diffusion model's applicability in a variety of downstream tasks related to outdoor scene generation such as scene inpainting, scene outpainting, and semantic scene completion refinements. In experimental results, we demonstrate that our triplane diffusion model shows meaningful generation results compared with existing work in a real-outdoor dataset, SemanticKITTI. We also show our triplane manipulation facilitates seamlessly adding, removing, or modifying objects within a scene. Further, it also enables the expansion of scenes toward a city-level scale. Finally, we evaluate our method on semantic scene completion refinements where our diffusion model enhances predictions of semantic scene completion networks by learning scene distribution. Our code is available at this https URL.
https://arxiv.org/abs/2403.07773
Current state-of-the-art (SOTA) 3D object detection methods often require a large amount of 3D bounding box annotations for training. However, collecting such large-scale, densely-supervised datasets is notoriously costly. To reduce the cumbersome data annotation process, we propose a novel sparsely-annotated framework, in which we annotate just one 3D object per scene. Such a sparse annotation strategy significantly reduces the heavy annotation burden, but inexact and incomplete sparse supervision may severely deteriorate detection performance. To address this issue, we develop the SS3D++ method, which alternately improves 3D detector training and confident fully-annotated scene generation in a unified learning scheme. Using sparse annotations as seeds, we progressively generate confident fully-annotated scenes based on a missing-annotated instance mining module and a reliable background mining module. Our proposed method produces competitive results when compared with SOTA weakly-supervised methods using the same or even more annotation costs. Besides, compared with SOTA fully-supervised methods, we achieve on-par or even better performance on the KITTI dataset with about 5x less annotation cost, and 90% of their performance on the Waymo dataset with about 15x less annotation cost. The additional unlabeled training scenes could further boost the performance. The code will be available at this https URL.
https://arxiv.org/abs/2403.02818
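The "missing-annotated instance mining" idea can be approximated by a standard self-training filter: keep confident detections that are not already covered by the sparse seed annotations. The sketch below is a simplified stand-in for the paper's module, using axis-aligned boxes and placeholder thresholds.

```python
import numpy as np

def aabb_iou(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, cz, w, h, d)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo = np.maximum(a[:3] - a[3:] / 2, b[:3] - b[3:] / 2)
    hi = np.minimum(a[:3] + a[3:] / 2, b[:3] + b[3:] / 2)
    inter = np.prod(np.clip(hi - lo, 0, None))
    union = np.prod(a[3:]) + np.prod(b[3:]) - inter
    return inter / max(union, 1e-8)

def mine_confident_instances(detections, annotated_boxes,
                             score_thresh=0.9, iou_thresh=0.3):
    """Promote high-confidence detections that are not already annotated."""
    mined = []
    for box, score in detections:
        if score < score_thresh:
            continue                                   # not confident enough
        if any(aabb_iou(box, gt) > iou_thresh for gt in annotated_boxes):
            continue                                   # duplicates a seed annotation
        mined.append(box)
    return mined

# toy usage
dets = [([0, 0, 0, 2, 2, 2], 0.95), ([5, 0, 0, 2, 2, 2], 0.4)]
seeds = [[0.1, 0, 0, 2, 2, 2]]
print(mine_confident_instances(dets, seeds))  # [] -- the confident box duplicates the seed
```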
We introduce a method to generate 3D scenes that are disentangled into their component objects. This disentanglement is unsupervised, relying only on the knowledge of a large pretrained text-to-image model. Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene. Concretely, our method jointly optimizes multiple NeRFs from scratch - each representing its own object - along with a set of layouts that composite these objects into scenes. We then encourage these composited scenes to be in-distribution according to the image generator. We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects, enabling new capabilities in text-to-3D content creation. For results and an interactive demo, see our project page at this https URL
https://arxiv.org/abs/2402.16936
We present GALA3D, generative 3D GAussians with LAyout-guided control, for effective compositional text-to-3D generation. We first utilize large language models (LLMs) to generate the initial layout and introduce a layout-guided 3D Gaussian representation for 3D content generation with adaptive geometric constraints. We then propose an object-scene compositional optimization mechanism with conditioned diffusion to collaboratively generate realistic 3D scenes with consistent geometry, texture, scale, and accurate interactions among multiple objects while simultaneously adjusting the coarse layout priors extracted from the LLMs to align with the generated scene. Experiments show that GALA3D is a user-friendly, end-to-end framework for state-of-the-art scene-level 3D content generation and controllable editing while ensuring the high fidelity of object-level entities within the scene. Source codes and models will be available at this https URL.
https://arxiv.org/abs/2402.07207
The popularity of LiDAR devices and sensor technology has gradually empowered users in fields ranging from autonomous driving to forest monitoring, and research on 3D LiDAR has made remarkable progress over the years. Unlike 2D images, whose region of interest is directly visible and rich in texture information, point clouds require an understanding of the point distribution, which can help companies and researchers find better ways to develop point-based 3D applications. In this work, we contribute an Unreal-based LiDAR simulation tool and a 3D simulation dataset named LiDAR-Forest, which can be used by various studies to evaluate forest reconstruction, tree DBH estimation, and point cloud compression, with easy visualization. The simulation is customizable in tree species, LiDAR types, and scene generation, with low cost and high efficiency.
https://arxiv.org/abs/2402.04546
We present a system for generating indoor scenes in response to text prompts. The prompts are not limited to a fixed vocabulary of scene descriptions, and the objects in generated scenes are not restricted to a fixed set of object categories -- we call this setting open-universe indoor scene generation. Unlike most prior work on indoor scene generation, our system does not require a large training dataset of existing 3D scenes. Instead, it leverages the world knowledge encoded in pre-trained large language models (LLMs) to synthesize programs in a domain-specific layout language that describe objects and spatial relations between them. Executing such a program produces a specification of a constraint satisfaction problem, which the system solves using a gradient-based optimization scheme to produce object positions and orientations. To produce object geometry, the system retrieves 3D meshes from a database. Unlike prior work which uses databases of category-annotated, mutually-aligned meshes, we develop a pipeline using vision-language models (VLMs) to retrieve meshes from massive databases of un-annotated, inconsistently-aligned meshes. Experimental evaluations show that our system outperforms generative models trained on 3D data for traditional, closed-universe scene generation tasks; it also outperforms a recent LLM-based layout generation method on open-universe scene generation.
https://arxiv.org/abs/2403.09675
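The gradient-based solve of the layout constraint satisfaction problem can be pictured with a toy example: treat each declared spatial relation as a soft penalty and optimize object positions with a standard optimizer. The constraint kinds, indices, and target distances below are invented for illustration, not the paper's layout language.

```python
import torch

# Hypothetical relations a layout program might declare; names/targets invented.
constraints = [("distance", 0, 1, 1.5),   # e.g. keep sofa ~1.5 m from table
               ("distance", 1, 2, 0.6)]   # keep table ~0.6 m from lamp

positions = (torch.randn(3, 2) * 0.5).requires_grad_()   # (x, y) per object
opt = torch.optim.Adam([positions], lr=0.05)

for step in range(500):
    opt.zero_grad()
    loss = positions.new_zeros(())
    for kind, a, b, target in constraints:
        if kind == "distance":
            d = (positions[a] - positions[b]).norm()
            loss = loss + (d - target) ** 2               # soft pairwise constraint
    loss.backward()
    opt.step()

print(positions.detach())                                  # solved layout
```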
We present BlockFusion, a diffusion-based model that generates 3D scenes as unit blocks and seamlessly incorporates new blocks to extend the scene. BlockFusion is trained using datasets of 3D blocks that are randomly cropped from complete 3D scene meshes. Through per-block fitting, all training blocks are converted into hybrid neural fields: a tri-plane containing the geometry features, followed by a multi-layer perceptron (MLP) for decoding the signed distance values. A variational auto-encoder is employed to compress the tri-planes into the latent tri-plane space, on which the denoising diffusion process is performed. Diffusion applied to the latent representations allows for high-quality and diverse 3D scene generation. To expand a scene during generation, one needs only to append empty blocks that overlap with the current scene and extrapolate existing latent tri-planes to populate the new blocks. The extrapolation is done by conditioning the generation process on feature samples from the overlapping tri-planes during the denoising iterations. Latent tri-plane extrapolation produces semantically and geometrically meaningful transitions that harmoniously blend with the existing scene. A 2D layout conditioning mechanism is used to control the placement and arrangement of scene elements. Experimental results indicate that BlockFusion is capable of generating diverse, geometrically consistent and unbounded large 3D scenes with unprecedented high-quality shapes in both indoor and outdoor scenarios.
https://arxiv.org/abs/2401.17053
Object recognition and object pose estimation in robotic grasping continue to be significant challenges, since building a labelled dataset can be time-consuming and financially costly in terms of data collection and annotation. In this work, we propose a synthetic data generation method that minimizes human intervention and makes downstream image segmentation algorithms more robust by combining a generated synthetic dataset with a smaller real-world dataset (hybrid dataset). Annotation experiments show that the proposed synthetic scene generation can diminish labelling time dramatically. RGB image segmentation is trained with the hybrid dataset and combined with depth information to produce pixel-to-point correspondences for individual segmented objects. The object to grasp is then determined by the confidence score of the segmentation algorithm. Pick-and-place experiments demonstrate that segmentation trained on our hybrid dataset (98.9%, 70%) outperforms the real dataset and a publicly available dataset by (6.7%, 18.8%) and (2.8%, 10%) in terms of labelling and grasping success rate, respectively. Supplementary material is available at this https URL.
https://arxiv.org/abs/2401.13405
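The hybrid-dataset idea above, mixing a large generated synthetic set with a small real set, amounts to simple dataset concatenation at training time. The sketch below uses random tensors as stand-ins for the two sets; any pair of datasets yielding (image, mask) samples in the same format works identically.

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

# Stand-ins for the generated synthetic set and the smaller real-world set.
synthetic_ds = TensorDataset(torch.randn(1000, 3, 64, 64),
                             torch.randint(0, 5, (1000, 64, 64)))
real_ds = TensorDataset(torch.randn(100, 3, 64, 64),
                        torch.randint(0, 5, (100, 64, 64)))

hybrid_ds = ConcatDataset([synthetic_ds, real_ds])          # simple concatenation
loader = DataLoader(hybrid_ds, batch_size=8, shuffle=True)  # shuffling mixes domains

for images, masks in loader:
    pass  # feed the segmentation network here
```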
Directly generating scenes from satellite imagery offers exciting possibilities for integration into applications like games and map services. However, challenges arise from significant view changes and scene scale. Previous efforts mainly focused on image or video generation, lacking exploration into the adaptability of scene generation for arbitrary views. Existing 3D generation works either operate at the object level or struggle to utilize the geometry obtained from satellite imagery. To overcome these limitations, we propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques. Specifically, our approach first generates texture colors at the point level for a given geometry using a 3D diffusion model, which are then transformed into a scene representation in a feed-forward manner. The representation can be utilized to render arbitrary views that excel in both single-frame quality and inter-frame consistency. Experiments on two city-scale datasets show that our model demonstrates proficiency in generating photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery.
https://arxiv.org/abs/2401.10786
Indoor scene generation has attracted significant attention recently, as it is crucial for applications in gaming, virtual reality, and interior design. Current indoor scene generation methods can produce reasonable room layouts but often lack diversity and realism. This is primarily due to the limited coverage of existing datasets, which include only large furniture and omit the small furnishings of daily life. To address these challenges, we propose FurniScene, a large-scale 3D room dataset with intricate furnishing scenes from interior design professionals. Specifically, FurniScene consists of 11,698 rooms and 39,691 unique furniture CAD models of 89 different types, covering things from large beds to small teacups on the coffee table. To better suit fine-grained indoor scene layout generation, we introduce a novel Two-Stage Diffusion Scene Model (TSDSM) and establish an evaluation benchmark for indoor scene generation based on FurniScene. Quantitative and qualitative evaluations demonstrate the capability of our method to generate highly realistic indoor scenes. Our dataset and code will be publicly available soon.
https://arxiv.org/abs/2401.03470
We introduce Text2Immersion, an elegant method for producing high-quality 3D immersive scenes from text prompts. Our proposed pipeline initiates by progressively generating a Gaussian cloud using pre-trained 2D diffusion and depth estimation models. This is followed by a refining stage on the Gaussian cloud, interpolating and refining it to enhance the details of the generated scene. Distinct from prevalent methods that focus on single objects or indoor scenes, or employ zoom-out trajectories, our approach generates diverse scenes with various objects, even extending to the creation of imaginary scenes. Consequently, Text2Immersion can have wide-ranging implications for various applications such as virtual reality, game development, and automated content creation. Extensive evaluations demonstrate that our system surpasses other methods in rendering quality and diversity, further progressing towards text-driven 3D scene generation. We will make the source code publicly accessible at the project page.
https://arxiv.org/abs/2312.09242
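Seeding a point or Gaussian cloud from a generated image and a monocular depth estimate essentially means unprojecting the depth map through a camera model, as sketched below. A simple pinhole camera is assumed, and the intrinsics and resolution are placeholder values rather than the paper's settings.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) to a 3D point cloud in camera coordinates.

    A generated RGB image plus a monocular depth estimate can be lifted this
    way to seed a point/Gaussian cloud; a simple pinhole model is assumed.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# toy usage: a flat plane 2 m away seen by a 256x256 camera
pts = unproject_depth(np.full((256, 256), 2.0), fx=200, fy=200, cx=128, cy=128)
print(pts.shape)  # (65536, 3)
```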
We introduce DreamDrone, an innovative method for generating unbounded flythrough scenes from textual prompts. Central to our method is a novel feature-correspondence-guidance diffusion process, which utilizes the strong correspondence of intermediate features in the diffusion model. Leveraging this guidance strategy, we further propose an advanced technique for editing the intermediate latent code, enabling the generation of subsequent novel views with geometric consistency. Extensive experiments reveal that DreamDrone significantly surpasses existing methods, delivering highly authentic scene generation with exceptional visual quality. This approach marks a significant step in zero-shot perpetual view generation from textual prompts, enabling the creation of diverse scenes, including natural landscapes like oases and caves, as well as complex urban settings such as Lego-style street views. Our code is publicly available.
https://arxiv.org/abs/2312.08746
We introduce WonderJourney, a modularized framework for perpetual 3D scene generation. Unlike prior work on view generation that focuses on a single type of scenes, we start at any user-provided location (by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in this journey, a text-driven point cloud generation pipeline to make a compelling and coherent sequence of 3D scenes, and a large VLM to verify the generated scenes. We show compelling, diverse visual results across various scene types and styles, forming imaginary "wonderjourneys". Project website: this https URL
https://arxiv.org/abs/2312.03884
Generating multi-camera street-view videos is critical for augmenting autonomous driving datasets, addressing the urgent demand for extensive and varied data. Due to the limitations in diversity and challenges in handling lighting conditions, traditional rendering-based methods are increasingly being supplanted by diffusion-based methods. However, a significant challenge in diffusion-based methods is ensuring that the generated sensor data preserve both intra-world consistency and inter-sensor coherence. To address these challenges, we combine an additional explicit world volume and propose the World Volume-aware Multi-camera Driving Scene Generator (WoVoGen). This system is specifically designed to leverage 4D world volume as a foundational element for video generation. Our model operates in two distinct phases: (i) envisioning the future 4D temporal world volume based on vehicle control sequences, and (ii) generating multi-camera videos, informed by this envisioned 4D temporal world volume and sensor interconnectivity. The incorporation of the 4D world volume empowers WoVoGen not only to generate high-quality street-view videos in response to vehicle control inputs but also to facilitate scene editing tasks.
https://arxiv.org/abs/2312.02934
Generating large-scale 3D scenes cannot simply apply existing 3D object synthesis techniques, since 3D scenes usually hold complex spatial configurations and consist of a number of objects at varying scales. We thus propose a practical and efficient 3D representation that incorporates an equivariant radiance field with the guidance of a bird's-eye view (BEV) map. Concretely, objects in synthesized 3D scenes can be easily manipulated by steering the corresponding BEV maps. Moreover, by adequately incorporating positional encoding and low-pass filters into the generator, the representation becomes equivariant to the given BEV map. Such equivariance allows us to produce large-scale, even infinite-scale, 3D scenes via synthesizing local scenes and then stitching them with smooth consistency. Extensive experiments on 3D scene datasets demonstrate the effectiveness of our approach. Our project website is at this https URL.
https://arxiv.org/abs/2312.02136
Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.
https://arxiv.org/abs/2311.16854
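The displacement total variation loss mentioned above penalizes differences between neighboring displacements so that the learned motion stays smooth. The sketch below writes it for a dense voxel grid of displacements; the paper applies it with a multi-resolution feature grid, so this is only the generic form.

```python
import torch

def displacement_tv_loss(disp):
    """Total-variation regularizer on a dense displacement field.

    disp: [B, 3, D, H, W] per-voxel 3D displacements.  Penalizing differences
    between neighboring voxels encourages smooth, non-fragmented motion.
    """
    dz = (disp[:, :, 1:, :, :] - disp[:, :, :-1, :, :]).abs().mean()
    dy = (disp[:, :, :, 1:, :] - disp[:, :, :, :-1, :]).abs().mean()
    dx = (disp[:, :, :, :, 1:] - disp[:, :, :, :, :-1]).abs().mean()
    return dz + dy + dx

# toy usage
print(displacement_tv_loss(torch.randn(1, 3, 16, 16, 16)))
```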