Abstract
Large-scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images that convey complex visual concepts. This generative power has more recently been leveraged to perform text-to-3D synthesis. In this work, we present a technique that harnesses the power of latent diffusion models for editing existing 3D objects. Our method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide the volumetric representation to conform to a target text prompt, we follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, we observe that combining this diffusion-guided loss with an image-based regularization loss that encourages the representation not to deviate too strongly from the input object is challenging, as it requires achieving two conflicting goals while viewing only structure-and-appearance-coupled 2D projections. Thus, we introduce a novel volumetric regularization loss that operates directly in 3D space, utilizing the explicit nature of our 3D representation to enforce correlation between the global structure of the original and edited object. Furthermore, we present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of our approach in creating a myriad of edits that cannot be achieved by prior work.
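The two losses described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes the standard SDS gradient form w(t)·(ε̂ − ε) from unconditional text-to-3D methods, and uses a simple correlation penalty between explicit density grids as a stand-in for the paper's volumetric regularization loss.

```python
import numpy as np

def sds_gradient(eps_pred, eps, w_t):
    # Score Distillation Sampling (SDS): the gradient passed back to the
    # rendered view is w(t) * (eps_hat - eps); the diffusion model itself
    # is treated as frozen and not backpropagated through.
    # (Standard form assumed here, not the paper's exact implementation.)
    return w_t * (eps_pred - eps)

def volumetric_reg_loss(density_src, density_edit, eps=1e-8):
    # Illustrative stand-in for the volumetric regularization loss:
    # penalize decorrelation between the source and edited density grids,
    # operating directly on the explicit 3D grids rather than on
    # structure-and-appearance-coupled 2D projections.
    a = density_src.ravel() - density_src.mean()
    b = density_edit.ravel() - density_edit.mean()
    corr = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - corr  # 0 when the grids are perfectly correlated
```

In such a scheme, the SDS term pulls the representation toward the target prompt while the volumetric term anchors its global 3D structure to the input object, so the two objectives act on different spaces rather than competing over the same 2D renders.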
URL
https://arxiv.org/abs/2303.12048