Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods heavily rely on the outputs of existing models, leading to error accumulation in geometry and appearance that prevents them from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine the newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane-feature-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and propose a generative refinement network to synthesize new contents of higher quality by exploiting the natural image prior of a 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that, in comparison to previous methods, our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
https://arxiv.org/abs/2403.09439
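The tri-plane representation mentioned above can be sketched as follows: a 3D query point is projected onto three axis-aligned feature planes, each plane is bilinearly sampled, and the per-plane features are aggregated (summed here; some systems concatenate instead) before being decoded by a NeRF MLP. This is an illustrative sketch, not the paper's implementation; the function names and the summation choice are our own assumptions.

```python
import numpy as np

def sample_plane(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at normalized coords u, v in [0, 1]."""
    C, H, W = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * plane[:, y0, x0] + wx * plane[:, y0, x1]
    bot = (1 - wx) * plane[:, y1, x0] + wx * plane[:, y1, x1]
    return (1 - wy) * top + wy * bot

def triplane_features(planes, point):
    """Project a 3D point in [0, 1]^3 onto the XY, XZ, and YZ planes and sum the samples."""
    x, y, z = point
    return (sample_plane(planes["xy"], x, y)
            + sample_plane(planes["xz"], x, z)
            + sample_plane(planes["yz"], y, z))
```

The aggregated feature vector would then be fed to a small MLP that predicts density and color, as in a standard tri-plane NeRF.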
We present "SemCity," a 3D diffusion model for semantic scene generation in real-world outdoor environments. Most 3D diffusion models focus on generating a single object, synthetic indoor scenes, or synthetic outdoor scenes, while the generation of real-world outdoor scenes is rarely addressed. In this paper, we concentrate on generating a real-outdoor scene by learning a diffusion model on a real-world outdoor dataset. In contrast to synthetic data, real-outdoor datasets often contain more empty spaces due to sensor limitations, causing challenges in learning real-outdoor distributions. To address this issue, we exploit a triplane representation as a proxy form of the scene distribution to be learned by our diffusion model. Furthermore, we propose a triplane manipulation that integrates seamlessly with our triplane diffusion model. The manipulation improves our diffusion model's applicability in a variety of downstream tasks related to outdoor scene generation, such as scene inpainting, scene outpainting, and semantic scene completion refinements. In experimental results, we demonstrate that our triplane diffusion model shows meaningful generation results compared with existing work on a real-outdoor dataset, SemanticKITTI. We also show that our triplane manipulation facilitates seamlessly adding, removing, or modifying objects within a scene. Further, it enables the expansion of scenes toward a city-level scale. Finally, we evaluate our method on semantic scene completion refinements, where our diffusion model enhances predictions of semantic scene completion networks by learning the scene distribution. Our code is available at this https URL.
https://arxiv.org/abs/2403.07773
Current state-of-the-art (SOTA) 3D object detection methods often require a large amount of 3D bounding box annotations for training. However, collecting such large-scale densely-supervised datasets is notoriously costly. To reduce the cumbersome data annotation process, we propose a novel sparsely-annotated framework in which we annotate just one 3D object per scene. Such a sparse annotation strategy can significantly reduce the heavy annotation burden, while inexact and incomplete sparse supervision may severely deteriorate detection performance. To address this issue, we develop the SS3D++ method, which alternately improves 3D detector training and confident fully-annotated scene generation in a unified learning scheme. Using sparse annotations as seeds, we progressively generate confident fully-annotated scenes by designing a missing-annotated instance mining module and a reliable background mining module. Our proposed method produces competitive results when compared with SOTA weakly-supervised methods using the same or even more annotation cost. Besides, compared with SOTA fully-supervised methods, we achieve on-par or even better performance on the KITTI dataset with about 5x less annotation cost, and 90% of their performance on the Waymo dataset with about 15x less annotation cost. Additional unlabeled training scenes can further boost the performance. The code will be available at this https URL.
https://arxiv.org/abs/2403.02818
We introduce a method to generate 3D scenes that are disentangled into their component objects. This disentanglement is unsupervised, relying only on the knowledge of a large pretrained text-to-image model. Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene. Concretely, our method jointly optimizes multiple NeRFs from scratch - each representing its own object - along with a set of layouts that composite these objects into scenes. We then encourage these composited scenes to be in-distribution according to the image generator. We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects, enabling new capabilities in text-to-3D content creation. For results and an interactive demo, see our project page at this https URL
https://arxiv.org/abs/2402.16936
We present GALA3D, generative 3D GAussians with LAyout-guided control, for effective compositional text-to-3D generation. We first utilize large language models (LLMs) to generate the initial layout and introduce a layout-guided 3D Gaussian representation for 3D content generation with adaptive geometric constraints. We then propose an object-scene compositional optimization mechanism with conditioned diffusion to collaboratively generate realistic 3D scenes with consistent geometry, texture, scale, and accurate interactions among multiple objects while simultaneously adjusting the coarse layout priors extracted from the LLMs to align with the generated scene. Experiments show that GALA3D is a user-friendly, end-to-end framework for state-of-the-art scene-level 3D content generation and controllable editing while ensuring the high fidelity of object-level entities within the scene. Source codes and models will be available at this https URL.
https://arxiv.org/abs/2402.07207
The popularity of LiDAR devices and sensor technology has gradually empowered users in fields from autonomous driving to forest monitoring, and research on 3D LiDAR has made remarkable progress over the years. Unlike 2D images, whose areas of interest are visible and rich in texture information, point clouds require an understanding of point distribution, which can help companies and researchers find better ways to develop point-based 3D applications. In this work, we contribute an Unreal-based LiDAR simulation tool and a 3D simulation dataset named LiDAR-Forest, which various studies can use to evaluate forest reconstruction, tree DBH estimation, and point cloud compression, with easy visualization. The simulation is customizable in tree species, LiDAR types, and scene generation, with low cost and high efficiency.
https://arxiv.org/abs/2402.04546
We present a system for generating indoor scenes in response to text prompts. The prompts are not limited to a fixed vocabulary of scene descriptions, and the objects in generated scenes are not restricted to a fixed set of object categories -- we call this setting open-universe indoor scene generation. Unlike most prior work on indoor scene generation, our system does not require a large training dataset of existing 3D scenes. Instead, it leverages the world knowledge encoded in pre-trained large language models (LLMs) to synthesize programs in a domain-specific layout language that describe objects and spatial relations between them. Executing such a program produces a specification of a constraint satisfaction problem, which the system solves using a gradient-based optimization scheme to produce object positions and orientations. To produce object geometry, the system retrieves 3D meshes from a database. Unlike prior work, which uses databases of category-annotated, mutually-aligned meshes, we develop a pipeline using vision-language models (VLMs) to retrieve meshes from massive databases of un-annotated, inconsistently-aligned meshes. Experimental evaluations show that our system outperforms generative models trained on 3D data for traditional, closed-universe scene generation tasks; it also outperforms a recent LLM-based layout generation method on open-universe scene generation.
https://arxiv.org/abs/2403.09675
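The gradient-based constraint solving described above can be illustrated with a minimal sketch: encode each spatial relation as a differentiable penalty and descend on the total violation. The constraints below (pairwise non-overlap plus a "left-of" relation) and the numerical-gradient solver are hypothetical simplifications of our own, not the system's actual constraint language or optimizer.

```python
import numpy as np

def layout_loss(pos, min_dist=1.0):
    """Total constraint violation for 2D object positions `pos` of shape (N, 2)."""
    loss = 0.0
    n = len(pos)
    # non-overlap: any pair closer than min_dist is penalized
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(pos[i] - pos[j])
            loss += max(0.0, min_dist - d) ** 2
    # relational constraint: object 0 must sit left of object 1 by a 0.5 margin
    loss += max(0.0, pos[0][0] - pos[1][0] + 0.5) ** 2
    return loss

def optimize_layout(pos, steps=500, lr=0.1, eps=1e-4):
    """Descend on layout_loss using central-difference numerical gradients."""
    pos = pos.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(pos)
        for idx in np.ndindex(pos.shape):
            p, m = pos.copy(), pos.copy()
            p[idx] += eps
            m[idx] -= eps
            grad[idx] = (layout_loss(p) - layout_loss(m)) / (2 * eps)
        pos -= lr * grad
    return pos
```

In the real system the penalties would be produced by executing the LLM-synthesized layout program, and orientations would be optimized alongside positions.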
We present BlockFusion, a diffusion-based model that generates 3D scenes as unit blocks and seamlessly incorporates new blocks to extend the scene. BlockFusion is trained using datasets of 3D blocks that are randomly cropped from complete 3D scene meshes. Through per-block fitting, all training blocks are converted into hybrid neural fields: a tri-plane containing the geometry features, followed by a multi-layer perceptron (MLP) for decoding the signed distance values. A variational auto-encoder is employed to compress the tri-planes into the latent tri-plane space, on which the denoising diffusion process is performed. Diffusion applied to the latent representations allows for high-quality and diverse 3D scene generation. To expand a scene during generation, one needs only to append empty blocks that overlap with the current scene and extrapolate the existing latent tri-planes to populate the new blocks. The extrapolation is done by conditioning the generation process on feature samples from the overlapping tri-planes during the denoising iterations. Latent tri-plane extrapolation produces semantically and geometrically meaningful transitions that harmoniously blend with the existing scene. A 2D layout conditioning mechanism is used to control the placement and arrangement of scene elements. Experimental results indicate that BlockFusion is capable of generating diverse, geometrically consistent and unbounded large 3D scenes with unprecedented high-quality shapes in both indoor and outdoor scenarios.
https://arxiv.org/abs/2401.17053
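The scene-expansion step can be sketched with a simplified, RePaint-style mechanism: denoise a fresh latent block while re-imposing the known latents in the region that overlaps the existing scene after every iteration. This is our own simplification; BlockFusion conditions on feature samples from the overlapping tri-planes rather than hard-copying them, and the denoiser below is a dummy stand-in for a trained diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(latent, t):
    """Stand-in for one reverse-diffusion step (a trained denoiser in the real model)."""
    return latent * 0.9  # dummy: shrink noise toward zero

def extrapolate_block(known, shape, overlap, steps=10):
    """Generate a new latent tri-plane block of `shape`, keeping its first
    `overlap` columns consistent with the tail of the existing block `known`."""
    latent = rng.standard_normal(shape)
    for t in range(steps):
        latent = denoise_step(latent, t)
        # re-impose the overlap region so the new block stays anchored to the scene
        latent[..., :overlap] = known[..., -overlap:]
    return latent
```

The non-overlapping part of the block is free to be synthesized, while the overlap guarantees a seamless transition into the existing scene.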
Object recognition and object pose estimation in robotic grasping continue to be significant challenges, since building a labelled dataset can be time-consuming and financially costly in terms of data collection and annotation. In this work, we propose a synthetic data generation method that minimizes human intervention and makes downstream image segmentation algorithms more robust by combining a generated synthetic dataset with a smaller real-world dataset (hybrid dataset). Annotation experiments show that the proposed synthetic scene generation can diminish labelling time dramatically. RGB image segmentation is trained with the hybrid dataset and combined with depth information to produce pixel-to-point correspondence for individual segmented objects. The object to grasp is then determined by the confidence score of the segmentation algorithm. Pick-and-place experiments demonstrate that segmentation trained on our hybrid dataset (98.9%, 70%) outperforms the real dataset and a publicly available dataset by (6.7%, 18.8%) and (2.8%, 10%) in labelling and grasping success rate, respectively. Supplementary material is available at this https URL.
https://arxiv.org/abs/2401.13405
Directly generating scenes from satellite imagery offers exciting possibilities for integration into applications like games and map services. However, challenges arise from significant view changes and scene scale. Previous efforts mainly focused on image or video generation, lacking exploration into the adaptability of scene generation for arbitrary views. Existing 3D generation works either operate at the object level or struggle to utilize the geometry obtained from satellite imagery. To overcome these limitations, we propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques. Specifically, our approach first generates texture colors at the point level for a given geometry using a 3D diffusion model, which is then transformed into a scene representation in a feed-forward manner. The representation can be used to render arbitrary views that excel in both single-frame quality and inter-frame consistency. Experiments on two city-scale datasets show that our model demonstrates proficiency in generating photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery.
https://arxiv.org/abs/2401.10786
Indoor scene generation has attracted significant attention recently, as it is crucial for applications in gaming, virtual reality, and interior design. Current indoor scene generation methods can produce reasonable room layouts but often lack diversity and realism. This is primarily due to the limited coverage of existing datasets, which include only large furniture and omit the tiny furnishings of daily life. To address these challenges, we propose FurniScene, a large-scale 3D room dataset with intricate furnishing scenes from interior design professionals. Specifically, FurniScene consists of 11,698 rooms and 39,691 unique furniture CAD models across 89 different types, covering items from large beds to small teacups on the coffee table. To better suit fine-grained indoor scene layout generation, we introduce a novel Two-Stage Diffusion Scene Model (TSDSM) and conduct an evaluation benchmark for various indoor scene generation methods based on FurniScene. Quantitative and qualitative evaluations demonstrate the capability of our method to generate highly realistic indoor scenes. Our dataset and code will be publicly available soon.
https://arxiv.org/abs/2401.03470
We introduce Text2Immersion, an elegant method for producing high-quality 3D immersive scenes from text prompts. Our proposed pipeline begins by progressively generating a Gaussian cloud using pre-trained 2D diffusion and depth estimation models. This is followed by a refinement stage that interpolates and refines the Gaussian cloud to enhance the details of the generated scene. Distinct from prevalent methods that focus on single objects or indoor scenes, or employ zoom-out trajectories, our approach generates diverse scenes with various objects, even extending to the creation of imaginary scenes. Consequently, Text2Immersion can have wide-ranging implications for various applications such as virtual reality, game development, and automated content creation. Extensive evaluations demonstrate that our system surpasses other methods in rendering quality and diversity, further progressing towards text-driven 3D scene generation. We will make the source code publicly accessible at the project page.
https://arxiv.org/abs/2312.09242
We introduce DreamDrone, an innovative method for generating unbounded flythrough scenes from textual prompts. Central to our method is a novel feature-correspondence-guidance diffusion process, which utilizes the strong correspondence of intermediate features in the diffusion model. Leveraging this guidance strategy, we further propose an advanced technique for editing the intermediate latent code, enabling the generation of subsequent novel views with geometric consistency. Extensive experiments reveal that DreamDrone significantly surpasses existing methods, delivering highly authentic scene generation with exceptional visual quality. This approach marks a significant step in zero-shot perpetual view generation from textual prompts, enabling the creation of diverse scenes, including natural landscapes like oases and caves, as well as complex urban settings such as Lego-style street views. Our code is publicly available.
https://arxiv.org/abs/2312.08746
We introduce WonderJourney, a modularized framework for perpetual 3D scene generation. Unlike prior work on view generation that focuses on a single type of scenes, we start at any user-provided location (by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in this journey, a text-driven point cloud generation pipeline to make a compelling and coherent sequence of 3D scenes, and a large VLM to verify the generated scenes. We show compelling, diverse visual results across various scene types and styles, forming imaginary "wonderjourneys". Project website: this https URL
https://arxiv.org/abs/2312.03884
Generating multi-camera street-view videos is critical for augmenting autonomous driving datasets, addressing the urgent demand for extensive and varied data. Due to limitations in diversity and challenges in handling lighting conditions, traditional rendering-based methods are increasingly being supplanted by diffusion-based methods. However, a significant challenge in diffusion-based methods is ensuring that the generated sensor data preserve both intra-world consistency and inter-sensor coherence. To address these challenges, we introduce an additional explicit world volume and propose the World Volume-aware Multi-camera Driving Scene Generator (WoVoGen). This system is specifically designed to leverage a 4D world volume as a foundational element for video generation. Our model operates in two distinct phases: (i) envisioning the future 4D temporal world volume based on vehicle control sequences, and (ii) generating multi-camera videos, informed by this envisioned 4D temporal world volume and sensor interconnectivity. The incorporation of the 4D world volume empowers WoVoGen not only to generate high-quality street-view videos in response to vehicle control inputs but also to facilitate scene editing tasks.
https://arxiv.org/abs/2312.02934
Generating large-scale 3D scenes cannot be done by simply applying existing 3D object synthesis techniques, since 3D scenes usually exhibit complex spatial configurations and consist of numerous objects at varying scales. We thus propose a practical and efficient 3D representation that incorporates an equivariant radiance field with the guidance of a bird's-eye view (BEV) map. Concretely, objects in synthesized 3D scenes can be easily manipulated by steering the corresponding BEV maps. Moreover, by adequately incorporating positional encoding and low-pass filters into the generator, the representation becomes equivariant to the given BEV map. Such equivariance allows us to produce large-scale, even infinite-scale, 3D scenes by synthesizing local scenes and then stitching them together with smooth consistency. Extensive experiments on 3D scene datasets demonstrate the effectiveness of our approach. Our project website is at this https URL.
https://arxiv.org/abs/2312.02136
Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.
https://arxiv.org/abs/2311.16854
With the widespread usage of VR devices and content, demand for 3D scene generation techniques is growing. Existing 3D scene generation models, however, limit the target scene to a specific domain, primarily due to their training strategies, which use 3D scan datasets that are far from the real world. To address this limitation, we propose LucidDreamer, a domain-free scene generation pipeline that fully leverages the power of an existing large-scale diffusion-based generative model. LucidDreamer alternates between two steps: Dreaming and Alignment. First, to generate multi-view consistent images from inputs, we set the point cloud as a geometrical guideline for each image generation. Specifically, we project a portion of the point cloud to the desired view and provide the projection as guidance for inpainting with the generative model. The inpainted images are lifted to 3D space with estimated depth maps, composing new points. Second, to aggregate the new points into the 3D scene, we propose an alignment algorithm that harmoniously integrates the newly generated portions of the 3D scene. The finally obtained 3D scene serves as the initial points for optimizing Gaussian splats. LucidDreamer produces Gaussian splats that are highly detailed compared to previous 3D scene generation methods, with no constraint on the domain of the target scene.
https://arxiv.org/abs/2311.13384
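The geometric-guidance step above, projecting a portion of the point cloud into the desired view, amounts to a standard pinhole projection. A minimal sketch (the function name and argument layout are our own, not LucidDreamer's API):

```python
import numpy as np

def project_points(points, K, pose):
    """Pinhole projection of (N, 3) world points: transform into the camera
    frame with pose = [R | t] (3x4), apply intrinsics K, then dehomogenize."""
    R, t = pose[:, :3], pose[:, 3]
    cam = points @ R.T + t          # points in the camera frame
    uvw = cam @ K.T                 # homogeneous pixel coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]   # dehomogenize to pixel coordinates
    visible = uvw[:, 2] > 0         # keep only points in front of the camera
    return uv, visible
```

The projected pixels would then serve as the known region for diffusion inpainting, and the inpainted image is lifted back to 3D with an estimated depth map.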
Directly transferring 2D techniques to 3D scene generation is challenging due to significant resolution reduction and the scarcity of comprehensive real-world 3D scene datasets. To address these issues, our work introduces the Pyramid Discrete Diffusion model (PDD) for 3D scene generation. This novel approach employs a multi-scale model capable of progressively generating high-quality 3D scenes from coarse to fine. In this way, the PDD can generate high-quality scenes within limited resource constraints and does not require additional data sources. To the best of our knowledge, we are the first to adopt this simple but effective coarse-to-fine strategy for large 3D scene generation. Our experiments, covering both unconditional and conditional generation, have yielded impressive results, showcasing the model's effectiveness and robustness in generating realistic and detailed 3D scenes. Our code will be available to the public.
https://arxiv.org/abs/2311.12085
Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation overfit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to a quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline that eliminates the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., the Text-driven Diffusion Model (TDM) and the Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively. The subject-scene fusion stage achieves the collaboration through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. At each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser.
https://arxiv.org/abs/2311.10329
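The saliency-aware blending described above can be sketched as follows: derive a spatial mask from the magnitude of each model's classifier-free-guidance response, then mix the two predicted noises pixel-wise. The exact mask formula below is our own assumption based on the stated observation, not the paper's SNF equation.

```python
import numpy as np

def snf_blend(noise_tdm, noise_sdm, cfg_tdm, cfg_sdm, eps=1e-8):
    """Blend two models' predicted noises with a saliency mask derived from the
    magnitudes of their classifier-free-guidance (CFG) responses.

    noise_tdm / noise_sdm: predicted noises from the scene (TDM) and subject (SDM) models.
    cfg_tdm / cfg_sdm: the corresponding CFG responses (cond - uncond predictions).
    """
    s_tdm = np.abs(cfg_tdm)                      # scene-model saliency proxy
    s_sdm = np.abs(cfg_sdm)                      # subject-model saliency proxy
    mask = s_sdm / (s_tdm + s_sdm + eps)         # ~1 where the subject model dominates
    return (1 - mask) * noise_tdm + mask * noise_sdm
```

At each denoising step, the blended noise replaces the single-model prediction, so subject regions follow SDM while the surrounding scene follows TDM.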