Paper Reading AI Learner

TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision

2023-03-23 13:53:16
Jiacheng Wei, Hao Wang, Jiashi Feng, Guosheng Lin, Kim-Hui Yap

Abstract

In this paper, we investigate an open research task of generating controllable 3D textured shapes from the given textual descriptions. Previous works either require ground truth caption labeling or extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. Our constructed captions provide high-level semantic supervision for generated 3D shapes. Further, in order to produce fine-grained textures and increase geometry diversity, we propose to adopt low-level image regularization to enable fake-rendered images to align with the real ones. During the inference phase, our proposed model can generate 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments to analyze each of our proposed components and show the efficacy of our framework in generating high-fidelity 3D textured and text-relevant shapes.
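The pseudo-caption pipeline summarized above (retrieve image-relevant words from the CLIP vocabulary, then slot them into a template) can be sketched as follows. This is a minimal illustration, not the authors' code: the random vectors stand in for CLIP image/word embeddings, and the `vocab` list, the template string, and the top-k value are all assumptions for the sake of the example.

```python
import numpy as np

def build_pseudo_caption(image_emb, word_embs, vocabulary, template, k=2):
    """Retrieve the k vocabulary words most similar to the rendered image
    (cosine similarity in a shared embedding space) and fill a template."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    word_embs = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    sims = word_embs @ image_emb          # cosine similarity per vocabulary word
    top = np.argsort(-sims)[:k]           # indices of the k most relevant words
    words = [vocabulary[i] for i in top]
    return template.format(" ".join(words))

# Toy example: random embeddings standing in for real CLIP features.
rng = np.random.default_rng(0)
vocab = ["red", "wooden", "round", "metal"]      # assumed toy vocabulary
caption = build_pseudo_caption(
    image_emb=rng.normal(size=8),
    word_embs=rng.normal(size=(4, 8)),
    vocabulary=vocab,
    template="a photo of a {} chair",            # assumed template
)
print(caption)
```

In the actual method the embeddings would come from a pretrained CLIP model applied to rendered 2D images, so the retrieved words reflect the shape's appearance; the resulting caption then serves as high-level semantic supervision during generator training.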

Abstract (translated)

In this paper, we investigate the open research task of generating controllable textured 3D shapes from given textual descriptions. Previous works either require ground-truth caption annotation or extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. The constructed pseudo captions provide high-level semantic supervision for the generated 3D shapes. Furthermore, to produce fine-grained textures and increase geometric diversity, we propose adopting low-level image regularization to align fake rendered images with real ones. During the inference phase, our proposed model can generate textured 3D shapes from given text without any additional optimization. We conduct extensive experiments to analyze each of our proposed components and demonstrate the efficacy of our framework in generating high-fidelity, text-relevant textured 3D shapes.

URL

https://arxiv.org/abs/2303.13273

PDF

https://arxiv.org/pdf/2303.13273.pdf
