Abstract
In this paper, we investigate the open research task of generating controllable 3D textured shapes from given textual descriptions. Previous works either require ground-truth caption labeling or incur extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, that trains a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. Our constructed captions provide high-level semantic supervision for the generated 3D shapes. Further, to produce fine-grained textures and increase geometry diversity, we propose to adopt low-level image regularization to align the fake rendered images with the real ones. During the inference phase, our model can generate 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments analyzing each of our proposed components, and show the efficacy of our framework in generating high-fidelity, text-relevant 3D textured shapes.
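The pseudo-caption construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses toy placeholder embeddings in place of real CLIP image/word embeddings, a hypothetical template string, and hypothetical function names (`retrieve_top_words`, `build_pseudo_caption`); the actual method scores CLIP vocabulary words against rendered 2D images.

```python
import numpy as np

def retrieve_top_words(image_emb, word_embs, vocab, k=2):
    """Rank vocabulary words by cosine similarity to an image embedding
    and return the top-k words (stand-in for CLIP-based retrieval)."""
    sims = word_embs @ image_emb / (
        np.linalg.norm(word_embs, axis=1) * np.linalg.norm(image_emb)
    )
    top = np.argsort(-sims)[:k]
    return [vocab[i] for i in top]

def build_pseudo_caption(words, template="a 3D model of a {} {}"):
    """Fill a fixed template with the retrieved words (template is a
    hypothetical example; the paper uses its own template set)."""
    return template.format(*words)

# Toy example: 2-D placeholder embeddings instead of real CLIP features.
vocab = ["red", "chair", "blue", "table"]
word_embs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]])
image_emb = np.array([1.0, 0.0])  # pretend rendered-image embedding

words = retrieve_top_words(image_emb, word_embs, vocab, k=2)
caption = build_pseudo_caption(words)
print(caption)  # → a 3D model of a red chair
```

In the actual pipeline, such captions supervise the generator at a high semantic level, while the low-level image regularization handles texture fidelity.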
URL
https://arxiv.org/abs/2303.13273