Abstract
Distilling pre-trained 2D diffusion models into 3D assets has driven remarkable advances in text-to-3D synthesis. However, existing methods typically rely on the Score Distillation Sampling (SDS) loss, which minimizes an asymmetric KL divergence, a formulation that inherently favors mode-seeking behavior and limits generation diversity. In this paper, we introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with the Score Implicit Matching (SIM) loss, a score-based objective that effectively mitigates mode collapse. Furthermore, Dive3D integrates both diffusion distillation and reward-guided optimization under a unified divergence perspective. This reformulation, together with the SIM loss, yields significantly more diverse 3D outputs while improving text alignment, human preference, and overall visual fidelity. We validate Dive3D across a wide range of text-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments, including diversity, photorealism, and aesthetic appeal. We further evaluate its performance on the GPTEval3D benchmark against nine state-of-the-art baselines, where Dive3D also achieves strong results on quantitative metrics, including text-asset alignment, 3D plausibility, text-geometry consistency, texture quality, and geometric detail.
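The contrast between the two objectives can be made concrete with a small sketch. The PyTorch snippet below is illustrative only, not the authors' implementation: `theta`, the identity "renderer", the weighting `w_t`, and the placeholder student network `s_phi` are hypothetical stand-ins. It shows how SDS injects a noise residual (a reverse-KL-style, mode-seeking signal) as the gradient of the rendered view, whereas a SIM-style objective instead penalizes the teacher-student score discrepancy directly.

```python
import torch

# Toy stand-ins: theta plays the role of the 3D representation's parameters,
# and the "renderer" is the identity map so the sketch stays self-contained.
theta = torch.randn(1, 3, 64, 64, requires_grad=True)
x0 = theta * 1.0                       # differentiable "rendered view"
eps = torch.randn_like(x0)             # injected Gaussian noise
eps_teacher = torch.randn_like(x0)     # stand-in for the pretrained 2D model's noise prediction
w_t = 1.0                              # timestep-dependent weight

# SDS-style update: the residual (eps_teacher - eps) is injected as the
# gradient of the rendered view, realizing a reverse-KL, mode-seeking pull
# toward high-density modes of the teacher distribution.
x0.backward(gradient=w_t * (eps_teacher - eps))

# SIM-style alternative (schematic): match scores directly with a learned
# student network s_phi instead of optimizing a KL-based residual.
s_phi = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder student score net
xt = x0.detach() + eps                                    # noised rendering (schematic)
sim_loss = (w_t * (eps_teacher - s_phi(xt)) ** 2).mean()
sim_loss.backward()                                       # gradient for the student net
```

Because the score-matching penalty is symmetric in the two scores, it does not collapse onto a single high-density mode the way a reverse-KL objective does, which is the intuition behind SIM's improved diversity.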
URL
https://arxiv.org/abs/2506.13594