Paper Reading AI Learner

Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach

2023-05-23 01:14:53
Yufan Zhou, Ruiyi Zhang, Tong Sun, Jinhui Xu

Abstract

Recent text-to-image generation models have demonstrated an impressive capability to generate text-aligned images with high fidelity. However, generating images of a novel concept provided by a user's input image remains a challenging task. To address this problem, researchers have been exploring various methods for customizing pre-trained text-to-image generation models. Most existing customization methods rely on regularization techniques to prevent over-fitting. While regularization eases the challenge of customization and leads to successful content creation with respect to text guidance, it may restrict model capacity, resulting in the loss of detailed information and inferior performance. In this work, we propose a novel framework for customized text-to-image generation that does not use regularization. Specifically, our framework consists of an encoder network and a novel sampling method that together tackle the over-fitting problem without regularization. With the proposed framework, we can customize a large-scale text-to-image generation model within half a minute on a single GPU, using only one image provided by the user. Our experiments demonstrate that the proposed framework outperforms existing methods and preserves more fine-grained details.
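The abstract does not detail the architecture, so the following is only an illustrative toy sketch of the general idea behind encoder-based customization: an image encoder maps the user's single image to a concept embedding that is injected into the text-conditioning sequence of a frozen generator. All module shapes, names, and the stand-in loss are hypothetical and are not the paper's actual method.

```python
import torch
import torch.nn as nn

# Toy sketch (NOT the paper's architecture): an image encoder maps the
# user's single image to one concept token embedding, which is spliced
# into the text-embedding sequence that conditions a frozen generator.

class ConceptEncoder(nn.Module):
    """Maps an input image to a single concept token embedding."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )

    def forward(self, image):
        return self.backbone(image)  # shape: (batch, embed_dim)

embed_dim, seq_len = 64, 8
encoder = ConceptEncoder(embed_dim)

# Stand-in text embeddings for a prompt with a placeholder token at index 2.
text_emb = torch.randn(1, seq_len, embed_dim)
image = torch.randn(1, 3, 32, 32)  # the single user-provided image

concept = encoder(image)           # (1, embed_dim)
cond = text_emb.clone()
cond[:, 2, :] = concept            # inject the concept at the placeholder slot

# One gradient step on the encoder only (the generator stays frozen),
# using a stand-in reconstruction loss; a real pipeline would instead
# backpropagate a diffusion denoising loss through the frozen model.
target = torch.randn_like(cond)
loss = nn.functional.mse_loss(cond, target)
loss.backward()
```

Because only the encoder receives gradients, the pre-trained generator's weights are untouched, which is one common way to keep customization fast (seconds rather than minutes per concept).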

URL

https://arxiv.org/abs/2305.13579

PDF

https://arxiv.org/pdf/2305.13579.pdf

