Abstract
Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling the methodology up to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, pose a formidable challenge: ambiguous textual descriptions alone are insufficient for effective model optimization. In this work, we surmount these limitations by introducing a compositional 3D layout representation into the text-to-3D paradigm as an additional prior. The layout comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships; it complements textual descriptions and enables steerable generation. Building on this, we propose two modifications: (1) we introduce Layout-Guided Variational Score Distillation to address inadequacies in model optimization, conditioning the score distillation sampling process on the geometric and semantic constraints of the 3D layout; (2) to handle the unbounded nature of urban scenes, we represent the 3D scene with a Scalable Hash Grid structure that incrementally adapts to the growing scale of the scene. Extensive experiments substantiate the capability of our framework to scale text-to-3D generation, for the first time, to large-scale urban scenes covering over 1000 m of driving distance. We also present various scene-editing demonstrations, showing the power of steerable urban scene generation. Website: this https URL.
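To make the score-distillation idea concrete, here is a minimal NumPy sketch of a single score distillation sampling (SDS) gradient step, with a generic `cond` argument standing in for the layout-derived conditioning (e.g. a rendered semantic map plus text embedding). The noise schedule, `noise_predictor` interface, and `weight_fn` are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
import numpy as np

def sds_gradient(rendered, t, noise_predictor, cond, weight_fn, rng=None):
    """One SDS gradient step on an image rendered from the 3D scene.

    rendered        : array rendered from the current 3D scene parameters
    t               : diffusion time in (0, 1)
    noise_predictor : callable (noisy_image, t, cond) -> predicted noise
                      (stands in for a pretrained text-to-image diffusion model)
    cond            : conditioning signal; in a layout-guided setup this would
                      carry the layout's geometric/semantic constraints
    weight_fn       : timestep weighting w(t) from the SDS formulation
    """
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(rendered.shape)
    # Toy cosine noise schedule (assumption; the real schedule is model-specific).
    alpha = np.cos(0.5 * np.pi * t)
    sigma = np.sin(0.5 * np.pi * t)
    noisy = alpha * rendered + sigma * noise  # forward diffusion at time t
    eps_pred = noise_predictor(noisy, t, cond)
    # SDS gradient: weighted residual between predicted and injected noise.
    # The diffusion model's Jacobian is dropped, as in the standard SDS derivation.
    return weight_fn(t) * (eps_pred - noise)
```

In practice this gradient is backpropagated through the differentiable renderer into the 3D scene parameters; the layout conditioning steers where the diffusion prior pushes each region of the scene.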
URL
https://arxiv.org/abs/2404.06780