One-Step Image Translation with Text-to-Image Models

2024-03-18 17:59:40
Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, Jun-Yan Zhu

Abstract

In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like ControlNet for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at this https URL.
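
The abstract describes folding the modules of a latent diffusion model into one end-to-end generator with only small trainable weights, then training it with adversarial and cycle-consistency objectives so that a single forward pass performs the translation. The sketch below illustrates that training pattern in plain PyTorch; it is not the authors' released code, and every name here (LoRALinear, OneStepGenerator, Discriminator) as well as the toy MLP backbone and hyperparameters are illustrative assumptions standing in for a pretrained text-to-image network.

# Minimal sketch (assumptions, not the paper's implementation): a frozen backbone
# with small trainable low-rank (LoRA-style) adapters, trained single-step with
# adversarial + cycle-consistency losses on unpaired data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer plus a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # keep pretrained weights fixed
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # adapter starts as a zero update

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

class OneStepGenerator(nn.Module):
    """Stand-in for the consolidated encoder/denoiser/decoder: one forward pass,
    no iterative denoising loop."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Wrap the linear layers with LoRA adapters; only the adapters are trained.
        self.backbone[0] = LoRALinear(self.backbone[0])
        self.backbone[2] = LoRALinear(self.backbone[2])

    def forward(self, x):
        return x + self.backbone(x)            # skip connection helps preserve input structure

class Discriminator(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(0.2), nn.Linear(dim, 1))

    def forward(self, x):
        return self.net(x)

# One unpaired (CycleGAN-style) generator update on toy feature vectors;
# the discriminator update is omitted for brevity.
G_ab, G_ba = OneStepGenerator(), OneStepGenerator()
D_b = Discriminator()
trainable = [p for m in (G_ab, G_ba) for p in m.parameters() if p.requires_grad]
opt_g = torch.optim.Adam(trainable, lr=1e-4)

x_a = torch.randn(4, 64)                       # batch from domain A (e.g. "day")
fake_b = G_ab(x_a)                             # single-step translation A -> B
rec_a = G_ba(fake_b)                           # cycle back B -> A

adv_loss = F.softplus(-D_b(fake_b)).mean()     # non-saturating GAN loss for the generator
cycle_loss = F.l1_loss(rec_a, x_a)             # cycle consistency keeps content intact
opt_g.zero_grad()
(adv_loss + 10.0 * cycle_loss).backward()
opt_g.step()

Because only the low-rank adapters receive gradients, the pretrained backbone stays intact, which is one plausible reading of how the method keeps input structure while avoiding overfitting on small translation datasets.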

URL

https://arxiv.org/abs/2403.12036

PDF

https://arxiv.org/pdf/2403.12036.pdf

