Abstract
Designing complex 3D scenes has been a tedious, manual process requiring domain expertise. Emerging text-to-3D generative models show great promise for making this task more intuitive, but existing approaches are limited to object-level generation. We introduce \textbf{locally conditioned diffusion} as an approach to compositional scene diffusion, providing control over semantic parts using text prompts and bounding boxes while ensuring seamless transitions between these parts. We demonstrate a score distillation sampling--based text-to-3D synthesis pipeline that enables compositional 3D scene generation at a higher fidelity than relevant baselines.
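The core idea described above, blending several prompt-conditioned denoising signals so that each bounding-box region follows its own prompt, can be illustrated with a minimal toy sketch. All names here are hypothetical and the update rule is deliberately simplified (a real sampler would use scheduler coefficients and learned score networks):

```python
import numpy as np

def locally_conditioned_step(x, masks, eps_fns, step_size=0.1):
    """One toy denoising step for locally conditioned diffusion.

    x       : current latent/image array
    masks   : list of binary region masks (same shape as x), one per prompt
    eps_fns : list of per-prompt noise predictors (hypothetical stand-ins
              for prompt-conditioned score estimates)

    The per-prompt predictions are blended by their region masks, so each
    bounding box is steered by its own prompt while sharing one update.
    """
    eps = np.zeros_like(x, dtype=float)
    for mask, eps_fn in zip(masks, eps_fns):
        eps += mask * eps_fn(x)  # region-wise blend of noise predictions
    return x - step_size * eps   # toy gradient-style update

# Usage: two prompts controlling the left and right halves of a 4x4 latent.
x = np.ones((4, 4))
left = np.zeros((4, 4)); left[:, :2] = 1.0
right = 1.0 - left
blended = locally_conditioned_step(
    x, [left, right],
    [lambda z: 2.0 * np.ones_like(z),   # stand-in predictor for prompt A
     lambda z: 4.0 * np.ones_like(z)])  # stand-in predictor for prompt B
```

Because the masks partition the domain, the update in the left half depends only on prompt A's predictor and the right half only on prompt B's; smooth (soft) masks would yield the seamless transitions the abstract mentions.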
URL
https://arxiv.org/abs/2303.12218