On Inference Stability for Diffusion Models

2023-12-19 18:57:34
Viet Nguyen, Giang Vu, Tung Nguyen Thanh, Khoat Than, Toan Tran

Abstract

Denoising Probabilistic Models (DPMs) represent an emerging domain of generative models that excel in generating diverse and high-quality images. However, most current training methods for DPMs neglect the correlation between timesteps, which limits the model's ability to generate images effectively. Notably, we show theoretically that this issue can be caused by the cumulative estimation gap between the predicted and the actual trajectory. To minimize that gap, we propose a novel sequence-aware loss that aims to reduce the estimation gap and thereby enhance sampling quality. Furthermore, we theoretically show that our proposed loss function is a tighter upper bound of the estimation loss than the conventional loss in DPMs. Experimental results on several benchmark datasets, including CIFAR10, CelebA, and CelebA-HQ, consistently show a remarkable improvement of our proposed method over several DPM baselines in image generation quality, as measured by FID and Inception Score. Our code and pre-trained checkpoints are available at this https URL.

URL

https://arxiv.org/abs/2312.12431

PDF

https://arxiv.org/pdf/2312.12431.pdf

