Paper Reading AI Learner

Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation

2024-11-04 17:21:42
Xianghui Yang, Huiwen Shi, Bowen Zhang, Fan Yang, Jiacheng Wang, Hongxu Zhao, Xinhai Liu, Xinzhou Wang, Qingxiang Lin, Jiaao Yu, Lifu Wang, Zhuo Chen, Sicong Liu, Yuhong Liu, Yong Yang, Di Wang, Jie Jiang, Chunchao Guo

Abstract

While 3D generative models have greatly improved artists' workflows, existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address these issues, we propose a two-stage approach named Hunyuan3D-1.0, available in a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset from the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by the multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Extensive experimental results demonstrate the effectiveness of Hunyuan3D-1.0 in generating high-quality 3D assets. Our framework incorporates the text-to-image model Hunyuan-DiT, making it a unified framework that supports both text- and image-conditioned 3D generation. Our standard version has 10× more parameters than our lite version and other existing models. Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.
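The abstract describes a two-stage pipeline: a multi-view diffusion model first turns one condition image into several RGB views (~4 s), then a feed-forward network reconstructs the 3D asset from those views plus the original condition image (~7 s). The data flow can be sketched as below; this is a minimal illustrative mock-up, not the real Hunyuan3D-1.0 API — all function names, the six-view layout, and the returned structures are assumptions.

```python
# Illustrative sketch of the two-stage Hunyuan3D-1.0 data flow described in the
# abstract. Names and shapes are hypothetical; the real models are heavy
# diffusion/reconstruction networks, represented here by placeholder functions.
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class Image:
    height: int
    width: int
    view: str  # camera viewpoint label

def multiview_diffusion(condition: Image, n_views: int = 6) -> List[Image]:
    """Stage 1 (~4 s in the paper): generate multi-view RGB images
    of the object from a single condition image."""
    viewpoints = ["front", "back", "left", "right", "top", "bottom"]
    return [Image(condition.height, condition.width, v) for v in viewpoints[:n_views]]

def feedforward_reconstruction(condition: Image, views: List[Image]) -> Dict:
    """Stage 2 (~7 s in the paper): reconstruct the 3D asset from the
    generated views, also using the condition image to compensate for
    noise and cross-view inconsistency."""
    return {"n_input_views": len(views) + 1, "asset": "mesh"}

def image_to_3d(condition: Image) -> Dict:
    """Full image-conditioned pipeline; the text-conditioned path would first
    produce the condition image with a text-to-image model (Hunyuan-DiT)."""
    views = multiview_diffusion(condition)
    return feedforward_reconstruction(condition, views)

asset = image_to_3d(Image(512, 512, "front"))
```

Splitting generation this way relaxes the hard single-view-to-3D problem into multi-view reconstruction, which is why the second stage can be a fast feed-forward network rather than an iterative optimizer.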

URL

https://arxiv.org/abs/2411.02293

PDF

https://arxiv.org/pdf/2411.02293.pdf
