Abstract
Recent text-to-image generation models have demonstrated impressive capabilities in generating high-fidelity, text-aligned images. However, generating images of a novel concept provided through a user input image remains a challenging task. To address this problem, researchers have explored various methods for customizing pre-trained text-to-image generation models. Most existing approaches rely on regularization techniques to prevent over-fitting. While regularization eases the challenge of customization and enables successful content creation with respect to text guidance, it may restrict model capability, leading to the loss of detailed information and inferior performance. In this work, we propose a novel framework for customized text-to-image generation that does not rely on regularization. Specifically, the framework consists of an encoder network and a novel sampling method that together tackle the over-fitting problem without regularization. With the proposed framework, we can customize a large-scale text-to-image generation model within half a minute on a single GPU, using only a single image provided by the user. We demonstrate in experiments that the proposed framework outperforms existing methods and preserves more fine-grained details.
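To make the contrast with regularized customization concrete, the following is a minimal toy sketch, not the paper's actual architecture: a linear "encoder" maps the single user image to a concept embedding, and fine-tuning minimizes only the plain denoising objective, with no extra regularization term (e.g. no weight decay or prior-preservation loss) added. All names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                                  # toy image / embedding dimensionality (assumption)
image = rng.normal(size=D)             # the single user-provided image
W_enc = rng.normal(size=(D, D)) * 0.1  # toy encoder weights (hypothetical)

def denoise_loss(W_enc, image, noise):
    """Plain denoising objective: predict the added noise from the noisy
    input, conditioned on the concept embedding. Toy linear model only."""
    emb = W_enc @ image        # concept embedding produced by the encoder
    noisy = image + noise      # toy "noised" input
    pred = noisy - emb         # toy noise prediction using the embedding
    return float(np.mean((pred - noise) ** 2))

def grad_step(W_enc, image, noise, lr=0.05):
    """One customization step via numerical gradients (illustration only).
    Note the update uses the denoising loss alone: no regularizer is added."""
    g = np.zeros_like(W_enc)
    base = denoise_loss(W_enc, image, noise)
    eps = 1e-4
    for i in range(W_enc.shape[0]):
        for j in range(W_enc.shape[1]):
            W = W_enc.copy()
            W[i, j] += eps
            g[i, j] = (denoise_loss(W, image, noise) - base) / eps
    return W_enc - lr * g

noise = rng.normal(size=D)
losses = [denoise_loss(W_enc, image, noise)]
for _ in range(20):
    W_enc = grad_step(W_enc, image, noise)
    losses.append(denoise_loss(W_enc, image, noise))
```

In a regularized approach, `grad_step` would also descend on a penalty term (e.g. on the embedding norm or on deviation from the pre-trained weights); the sketch omits any such term, mirroring the regularization-free objective the abstract describes, while the over-fitting problem is instead handled at sampling time.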
URL
https://arxiv.org/abs/2305.13579