Paper Reading AI Learner

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

2023-01-30 14:58:23
Ming Tao, Bing-Kun Bao, Hao Tang, Changsheng Xu

Abstract

Synthesizing high-fidelity complex images from text is challenging. Based on large-scale pretraining, autoregressive and diffusion models can synthesize photo-realistic images. Although these large models have shown notable progress, three flaws remain. 1) These models require tremendous training data and parameters to achieve good performance. 2) The multi-step generation design slows down the image synthesis process considerably. 3) The synthesized visual features are difficult to control and require delicately designed prompts. To enable high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the powerful pretrained CLIP model in both the discriminator and the generator. Specifically, we propose a CLIP-based discriminator. The complex scene understanding ability of CLIP enables the discriminator to assess image quality accurately. Furthermore, we propose a CLIP-empowered generator that induces visual concepts from CLIP through bridge features and prompts. The CLIP-integrated generator and discriminator boost training efficiency; as a result, our model requires only about 3% of the training data and 6% of the learnable parameters while achieving results comparable to large pretrained autoregressive and diffusion models. Moreover, our model achieves about 120 times faster synthesis speed and inherits the smooth latent space of GANs. Extensive experimental results demonstrate the excellent performance of our GALIP. Code is available at this https URL.
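To make the two ideas in the abstract concrete, here is a minimal, illustrative PyTorch sketch of a discriminator built around a frozen CLIP-style image encoder and a generator conditioned on CLIP text features. This is not the authors' implementation: the module names, feature widths, and the tiny stand-in encoder are assumptions made only to keep the sketch self-contained and runnable. In GALIP proper, the frozen encoder is a pretrained CLIP ViT, and the generator additionally routes bridge features and prompts through it.

```python
# Illustrative sketch only; see the official code linked from the arXiv page.
import torch
import torch.nn as nn

CLIP_DIM = 512    # feature width of, e.g., CLIP ViT-B/32 (assumed)
NOISE_DIM = 100


class FrozenImageEncoder(nn.Module):
    """Stand-in for the frozen, pretrained CLIP image encoder."""

    def __init__(self):
        super().__init__()
        # A tiny conv stack keeps the sketch runnable without CLIP weights.
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, CLIP_DIM),
        )
        for p in self.parameters():  # frozen, as in the paper
            p.requires_grad_(False)

    def forward(self, images):
        return self.net(images)


class CLIPBasedDiscriminator(nn.Module):
    """Scores image-text pairs on top of frozen visual features."""

    def __init__(self):
        super().__init__()
        self.img_encoder = FrozenImageEncoder()
        self.head = nn.Sequential(  # the only learnable part here
            nn.Linear(CLIP_DIM * 2, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, images, text_feat):
        img_feat = self.img_encoder(images)
        return self.head(torch.cat([img_feat, text_feat], dim=1))


class CLIPEmpoweredGenerator(nn.Module):
    """Maps noise + text features to a 64x64 image (bridge features elided)."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(NOISE_DIM + CLIP_DIM, 128 * 8 * 8)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, 1, 1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, 1, 1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 3, 3, 1, 1), nn.Tanh(),
        )

    def forward(self, noise, text_feat):
        h = self.fc(torch.cat([noise, text_feat], dim=1)).view(-1, 128, 8, 8)
        return self.up(h)


if __name__ == "__main__":
    g, d = CLIPEmpoweredGenerator(), CLIPBasedDiscriminator()
    z = torch.randn(4, NOISE_DIM)
    t = torch.randn(4, CLIP_DIM)  # would come from CLIP's text encoder
    fake = g(z, t)                # (4, 3, 64, 64)
    score = d(fake, t)            # (4, 1): real/fake-and-text-matching score
    print(fake.shape, score.shape)
```

Keeping the CLIP encoder frozen means only the small discriminator head and the generator are trained, which is the source of the parameter and data savings claimed above.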

Abstract (translated)

Synthesizing complex, high-fidelity images from text is a challenging task. Based on large-scale pretraining, autoregressive and diffusion models can generate photo-realistic images. Although these large models have made notable progress, three flaws remain. 1) These models require enormous amounts of training data and parameters to achieve good performance. 2) The multi-step generation design greatly slows down the image synthesis process. 3) The synthesized visual features are difficult to control and require carefully designed prompts. To achieve high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the powerful pretrained CLIP model in both the discriminator and the generator. Specifically, we propose a CLIP-based discriminator. CLIP's complex scene understanding ability enables the discriminator to assess image quality accurately. In addition, we propose a CLIP-empowered generator that induces visual concepts from CLIP through bridge features and prompts. The CLIP-integrated generator and discriminator improve training efficiency; as a result, our model requires only about 3% of the training data and 6% of the learnable parameters while achieving results comparable to large pretrained autoregressive and diffusion models. Moreover, our model achieves 120 times faster synthesis speed and inherits the smooth latent space of GANs. Extensive experimental results demonstrate the excellent performance of our GALIP. Code is available at this https URL.

URL

https://arxiv.org/abs/2301.12959

PDF

https://arxiv.org/pdf/2301.12959.pdf

