Dive3D: Diverse Distillation-based Text-to-3D Generation via Score Implicit Matching

2025-06-16 15:21:30
Weimin Bai, Yubo Li, Wenzheng Chen, Weijian Luo, He Sun

Abstract

Distilling pre-trained 2D diffusion models into 3D assets has driven remarkable advances in text-to-3D synthesis. However, existing methods typically rely on the Score Distillation Sampling (SDS) loss, which involves an asymmetric KL divergence, a formulation that inherently favors mode-seeking behavior and limits generation diversity. In this paper, we introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with the Score Implicit Matching (SIM) loss, a score-based objective that effectively mitigates mode collapse. Furthermore, Dive3D integrates both diffusion distillation and reward-guided optimization under a unified divergence perspective. This reformulation, together with the SIM loss, yields significantly more diverse 3D outputs while improving text alignment, human preference, and overall visual fidelity. We validate Dive3D across a variety of text-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments, including diversity, photorealism, and aesthetic appeal. We further evaluate its performance on the GPTEval3D benchmark against nine state-of-the-art baselines, where Dive3D also achieves strong results on quantitative metrics, including text-asset alignment, 3D plausibility, text-geometry consistency, texture quality, and geometric detail.
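
To make the contrast concrete, the two objectives the abstract compares can be sketched as follows. The SDS gradient is the standard DreamFusion form; the SIM line is only the generic score-difference template that the abstract alludes to, not necessarily the exact loss used in Dive3D. Here g_theta is the differentiable 3D representation rendered from a sampled camera pose c, epsilon_phi is the frozen 2D teacher's noise prediction conditioned on prompt y, s_phi and s_theta are the teacher and student score functions, w(t) is a timestep weighting, and d(.) is a distance function such as the squared norm (all notation is assumed here, not taken from the paper):

\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon,c}\!\left[ w(t)\,\bigl(\epsilon_\phi(x_t; y, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right], \quad x = g_\theta(c),\; x_t = \alpha_t x + \sigma_t \epsilon
\]

\[
\mathcal{L}_{\mathrm{SIM}} = \mathbb{E}_{t,\,x_t}\!\left[ w(t)\, d\bigl(s_\phi(x_t; y, t) - s_\theta(x_t; t)\bigr) \right]
\]

Minimizing the first objective corresponds to an asymmetric (reverse) KL divergence between the rendered distribution and the teacher, which drives every sample toward the teacher's dominant mode; the second penalizes pointwise discrepancies between score functions instead, which is the mechanism the abstract credits for mitigating mode collapse.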

URL

https://arxiv.org/abs/2506.13594

PDF

https://arxiv.org/pdf/2506.13594.pdf

