Abstract
Text-to-3D content creation has recently received much attention, especially with the prevalence of 3D Gaussian Splatting (GS). In general, GS-based methods comprise two key stages: initialization and rendering optimization. For initialization, existing works directly apply random sphere initialization or 3D diffusion models, e.g., Point-E, to derive the initial shapes. However, such strategies suffer from two critical problems: 1) the final shapes remain similar to the initial ones even after training; 2) shapes can be produced only from simple texts, e.g., "a dog", but not from lexically richer texts, e.g., "a dog is sitting on the top of the airplane". To address these problems, this paper proposes a novel general framework that improves 3D GS initialization for text-to-3D generation from lexically rich texts. Our key idea is to aggregate 3D Gaussians into spatially uniform voxels to represent complex shapes while enabling spatial interaction among the 3D Gaussians and semantic interaction between the Gaussians and texts. Specifically, we first construct a voxelized representation in which each voxel holds a 3D Gaussian whose position, scale, and rotation are fixed, so that opacity is the sole factor determining whether a position is occupied. We then design an initialization network consisting mainly of two novel components: 1) a Global Information Perception (GIP) block and 2) a Gaussians-Text Fusion (GTF) block. This design enables each 3D Gaussian to assimilate spatial information from other areas and semantic information from texts. Extensive experiments on lexically simple, medium, and hard texts show that our framework produces higher-quality 3D GS initializations than existing methods, e.g., Shap-E. Our framework can also be seamlessly plugged into state-of-the-art training frameworks, e.g., LucidDreamer, for semantically consistent text-to-3D generation.
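The voxelized representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `VoxelGaussian`, `build_voxel_grid`, `fill_sphere`, and `occupied`, the grid resolution, the isotropic scale, and the 0.5 opacity threshold are all assumptions made for illustration. In the paper a network with GIP and GTF blocks predicts the opacities from text; here a hand-coded sphere stands in for that prediction.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class VoxelGaussian:
    position: tuple   # fixed: the voxel centre
    scale: tuple      # fixed: isotropic, half the voxel size (assumption)
    rotation: tuple   # fixed: identity quaternion (w, x, y, z)
    opacity: float    # the only free quantity; encodes occupancy


def build_voxel_grid(resolution, extent=1.0):
    """Place one Gaussian at the centre of every voxel in [-extent, extent]^3."""
    size = 2.0 * extent / resolution
    centers = [-extent + (i + 0.5) * size for i in range(resolution)]
    return [
        VoxelGaussian((x, y, z), (size / 2,) * 3, (1.0, 0.0, 0.0, 0.0), 0.0)
        for x, y, z in product(centers, repeat=3)
    ]


def fill_sphere(grid, radius):
    """Stand-in for the learned predictor: mark voxels inside a sphere as occupied."""
    for g in grid:
        x, y, z = g.position
        g.opacity = 1.0 if x * x + y * y + z * z <= radius * radius else 0.0


def occupied(grid, threshold=0.5):
    """Gaussians whose opacity exceeds the (hypothetical) occupancy threshold."""
    return [g for g in grid if g.opacity > threshold]


# Usage: a 4 x 4 x 4 grid whose shape is decided by opacity alone.
grid = build_voxel_grid(4)
fill_sphere(grid, 0.5)
initial_shape = occupied(grid)
```

Because position, scale, and rotation never change, the initial shape is determined entirely by the per-voxel opacity values, which is the property the abstract attributes to the voxelized representation.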
URL
https://arxiv.org/abs/2408.01269