Abstract
Synthesizing high-fidelity complex images from text is challenging. With large-scale pretraining, autoregressive and diffusion models can synthesize photo-realistic images. Although these large models have made notable progress, three flaws remain. 1) They require tremendous amounts of training data and parameters to achieve good performance. 2) Their multi-step generation design slows image synthesis heavily. 3) The synthesized visual features are difficult to control and require delicately designed prompts. To enable high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the powerful pretrained CLIP model in both the discriminator and the generator. Specifically, we propose a CLIP-based discriminator: the complex scene understanding ability of CLIP enables the discriminator to assess image quality accurately. Furthermore, we propose a CLIP-empowered generator that induces visual concepts from CLIP through bridge features and prompts. The CLIP-integrated generator and discriminator boost training efficiency; as a result, our model requires only about 3% of the training data and 6% of the learnable parameters of large pretrained autoregressive and diffusion models while achieving comparable results. Moreover, our model achieves roughly 120x faster synthesis speed and inherits the smooth latent space of GANs. Extensive experimental results demonstrate the excellent performance of GALIP. Code is available at this https URL.
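To make the CLIP-based discriminator idea concrete, here is a minimal sketch of the pattern the abstract describes: a frozen pretrained encoder supplies image features, and only a small head is trained to score how realistic an image is and how well it matches the text. This is not the authors' code; the `clip_image_features` stand-in below is a fixed random projection (an assumption for illustration) where GALIP would use the real frozen CLIP-ViT, and the hinge loss is a common GAN discriminator objective, not necessarily the exact one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM, TXT_DIM, FEAT_DIM = 3 * 32 * 32, 512, 256

# Stand-in for the FROZEN pretrained CLIP image encoder (assumption:
# a fixed random projection; GALIP uses the actual CLIP-ViT here).
W_img = rng.standard_normal((IMG_DIM, FEAT_DIM)) / np.sqrt(IMG_DIM)

def clip_image_features(images):
    """Map flattened images into a CLIP-like feature space (frozen, not trained)."""
    return np.tanh(images @ W_img)

# The only LEARNABLE part: a small head scoring each (image, text) pair.
W_head = rng.standard_normal((FEAT_DIM + TXT_DIM, 1)) * 0.01

def discriminator(images, text_features):
    """Score realism/text-match for each image-text pair in the batch."""
    feats = clip_image_features(images)
    joint = np.concatenate([feats, text_features], axis=1)
    return (joint @ W_head).ravel()

def hinge_d_loss(real_scores, fake_scores):
    """Hinge loss: push real scores above +1 and fake scores below -1."""
    return (np.mean(np.maximum(0.0, 1.0 - real_scores))
            + np.mean(np.maximum(0.0, 1.0 + fake_scores)))

batch = 4
real_images = rng.standard_normal((batch, IMG_DIM))
fake_images = rng.standard_normal((batch, IMG_DIM))  # would come from the generator
text_feats = rng.standard_normal((batch, TXT_DIM))   # would come from CLIP's text encoder

loss = hinge_d_loss(discriminator(real_images, text_feats),
                    discriminator(fake_images, text_feats))
print(loss)
```

Because the CLIP encoder is frozen, gradients only update the small head (and, in GALIP, the generator), which is one reason the approach needs far fewer learnable parameters than training a large model end to end.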
URL
https://arxiv.org/abs/2301.12959