Abstract
Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) need only a single forward pass. They are thus much faster, but they currently remain far behind the state of the art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and a controllable tradeoff between variation and text alignment. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models, the previous state of the art in fast text-to-image synthesis, in terms of sample quality and speed.
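The speed gap described above comes from the sampling procedure itself: a diffusion model must be evaluated once per denoising step, whereas a GAN produces an image in one network evaluation. Below is a minimal PyTorch sketch of that contrast; the toy modules, names, and shapes are hypothetical stand-ins for illustration and do not come from the paper or the actual StyleGAN-T architecture.

import torch
import torch.nn as nn

# Toy stand-ins for illustration only (hypothetical); real diffusion
# backbones and the StyleGAN-T generator are far larger and condition
# on text through learned embeddings.
class ToyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x, t, text_emb):
        return self.net(x)  # this toy ignores the step index and text

class ToyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(512, 3 * 64 * 64)

    def forward(self, z, text_emb):
        return self.net(z).view(-1, 3, 64, 64)  # this toy ignores the text

def diffusion_sample(model, text_emb, num_steps=50):
    """Iterative refinement: one full network evaluation per step."""
    x = torch.randn(1, 3, 64, 64)          # start from pure noise
    for t in reversed(range(num_steps)):   # num_steps sequential passes
        x = model(x, t, text_emb)
    return x

def gan_sample(generator, text_emb):
    """A GAN maps a latent code to an image in a single forward pass."""
    z = torch.randn(1, 512)                # latent code
    return generator(z, text_emb)          # one forward pass total

text_emb = torch.randn(1, 768)             # stand-in for a text embedding
img_slow = diffusion_sample(ToyDenoiser(), text_emb)  # ~50 evaluations
img_fast = gan_sample(ToyGenerator(), text_emb)       # 1 evaluation

The per-sample cost of the diffusion path scales linearly with num_steps; this is the gap that distilled diffusion models try to shrink and that a GAN avoids by construction.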
URL
https://arxiv.org/abs/2301.09515