Abstract
Text-to-3D is an emerging task that lets users create a virtually unlimited range of 3D content from text. Existing works tackle the problem by optimizing a 3D representation under the guidance of pre-trained diffusion models. An apparent drawback is that they must optimize from scratch for each prompt, which is computationally expensive and often yields poor visual fidelity. In this paper, we propose DreamPortrait, which generates text-guided 3D-aware portraits in a single forward pass for efficiency. To achieve this, we extend Score Distillation Sampling from a datapoint formulation to a distribution formulation, which injects a semantic prior into a 3D distribution. However, this direct extension leads to mode collapse, since the objective pursues only semantic alignment. Hence, we propose to optimize the distribution with hierarchical condition adapters and GAN loss regularization. For better 3D modeling, we further design a 3D-aware gated cross-attention mechanism that explicitly lets the model perceive the correspondence between the text and the 3D-aware space. These designs enable our model to generate portraits with robust multi-view semantic consistency, eliminating the need for per-prompt optimization. Extensive experiments demonstrate that our model achieves highly competitive performance and a significant speedup over existing methods.
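To make the step from a datapoint to a distribution formulation concrete, the following is a rough sketch rather than the paper's own equations. The first line is the standard Score Distillation Sampling gradient as commonly written for a single 3D representation. The second line is one plausible way to lift it to a conditional generator. The symbols $G_\theta$, $z$, $c$, $p(c)$, and $p(y)$ are notation assumed here for illustration and do not come from the abstract.

```latex
% Datapoint formulation: one 3D representation \theta is optimized for one prompt y.
% x = g(\theta) is a rendered view, x_t = \alpha_t x + \sigma_t \epsilon its noised version,
% \hat{\epsilon}_\phi the frozen diffusion model's noise prediction, w(t) a timestep weight.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
  \qquad x = g(\theta)

% Distribution formulation (assumed notation): \theta now parameterizes a generator
% G_\theta that maps a latent z, a camera pose c, and a prompt y to an image in a
% single forward pass; the expectation also runs over z, c, and y.
\nabla_\theta \mathcal{L}_{\mathrm{dist}}
  = \mathbb{E}_{z \sim \mathcal{N}(0, I),\; c \sim p(c),\; y \sim p(y),\; t,\; \epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
  \qquad x = G_\theta(z, c, y)
```

Under this reading, every term in the second objective rewards agreement with the prompt $y$ and none rewards variation across $z$, which is consistent with the mode collapse the abstract describes and with its use of a GAN loss as a regularizer.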
URL
https://arxiv.org/abs/2306.02083