Paper Reading AI Learner

Input Perturbation Reduces Exposure Bias in Diffusion Models

2023-01-27 13:34:54
Mang Ning, Enver Sangineto, Angelo Porrello, Simone Calderara, Rita Cucchiara

Abstract

Denoising Diffusion Probabilistic Models have shown impressive generation quality, although their long sampling chains lead to high computational costs. In this paper, we observe that a long sampling chain also leads to an error-accumulation phenomenon, similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on ground-truth samples while the latter is conditioned on previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, which consists of perturbing the ground-truth samples to simulate the prediction errors made at inference time. We empirically show that the proposed input perturbation leads to a significant improvement in sample quality while reducing both training and inference time. For instance, on CelebA 64×64, we achieve a new state-of-the-art FID score of 1.27 while saving 37.5% of the training time.
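The regularization described in the abstract can be sketched as a small change to the standard DDPM training input. The sketch below is a minimal illustration, not the authors' code: the exact perturbation form (adding Gaussian noise scaled by a factor `gamma` to the noise ε before building the network input, while the regression target stays the unperturbed ε) and the value `gamma=0.1` are assumptions based on the abstract's description.

```python
import numpy as np

def ddpm_training_input(x0, eps, alpha_bar_t, gamma=0.1, rng=None):
    """Build the network input x_t for one DDPM training step.

    Standard DDPM forward process:
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    Input perturbation (assumed form): replace eps with eps + gamma * xi,
    where xi ~ N(0, I), so the input mimics the slightly-off predictions
    seen at inference time. The loss target remains the original eps.
    """
    rng = np.random.default_rng() if rng is None else rng
    xi = rng.standard_normal(x0.shape)          # extra Gaussian perturbation
    eps_perturbed = eps + gamma * xi            # gamma=0 recovers plain DDPM
    return (np.sqrt(alpha_bar_t) * x0
            + np.sqrt(1.0 - alpha_bar_t) * eps_perturbed)
```

With `gamma=0` this reduces exactly to the usual forward-process input, so the perturbation can be toggled without touching the rest of the training loop.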

URL

https://arxiv.org/abs/2301.11706

PDF

https://arxiv.org/pdf/2301.11706.pdf

