
WordRobe: Text-Guided Generation of Textured 3D Garments

2024-03-26 09:44:34
Astitva Srivastava, Pranav Manu, Amit Raj, Varun Jampani, Avinash Sharma

Abstract

In this paper, we tackle a new and challenging problem: text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for generating unposed and textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent disentanglement, which promotes better latent interpolation. Subsequently, we align the garment latent space to the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, thereby drastically decreasing the generation time compared to existing methods. We demonstrate superior performance over current state-of-the-art methods for learning a 3D garment latent space, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and a qualitative user study. The unposed 3D garment meshes generated using WordRobe can be fed directly to standard cloth simulation and animation pipelines without any post-processing.
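
The abstract mentions latent interpolation and alignment of the garment latent space with the CLIP embedding space. The following is only an illustrative sketch of those two ideas in plain Python, not the authors' implementation: the 4-dimensional vectors, the names `lerp` and `cosine_similarity`, and the use of cosine similarity as an alignment measure are all assumptions made for this example.

```python
import math


def lerp(z_a, z_b, t):
    """Linearly interpolate between two latent codes.

    t=0 returns z_a, t=1 returns z_b. Plain linear interpolation is an
    assumption here; the abstract does not specify the scheme.
    """
    if len(z_a) != len(z_b):
        raise ValueError("latent codes must have the same dimension")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors of equal length.

    Used here as a hypothetical measure of how well a mapped text
    embedding agrees with a garment latent code.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical latent codes for two garments (made-up values).
z_shirt = [1.0, 0.0, 0.5, -1.0]
z_dress = [0.0, 2.0, 0.5, 1.0]

# Five evenly spaced codes along the path from z_shirt to z_dress.
# In a real system each code would be decoded into a garment mesh.
path = [lerp(z_shirt, z_dress, i / 4) for i in range(5)]
```

In a setup like the one the abstract describes, a well-disentangled latent space would make the intermediate codes in `path` decode to plausible in-between garments, and a similarity score of this kind could indicate how closely a text prompt matches a garment code.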

URL

https://arxiv.org/abs/2403.17541

PDF

https://arxiv.org/pdf/2403.17541.pdf

