Abstract
Diffusion models have significantly advanced the state of the art in image, audio, and video generation. However, their slow inference speed hinders practical deployment. Drawing inspiration from the approximation strategies used in consistency models, we propose the Sub-path Linear Approximation Model (SLAM), which accelerates diffusion models while maintaining high-quality image generation. SLAM treats the probability flow ODE (PF-ODE) trajectory as a series of PF-ODE sub-paths divided by sampled points, and uses sub-path linear (SL) ODEs to form a progressive and continuous error estimation along each individual sub-path. Optimizing on these SL-ODEs allows SLAM to construct denoising mappings with smaller cumulative approximation errors. We also develop an efficient distillation method to facilitate the incorporation of more advanced diffusion models, such as latent diffusion models. Extensive experiments show that SLAM trains efficiently, requiring only 6 A100 GPU days to produce a high-quality generative model capable of 2- to 4-step generation. Comprehensive evaluations on LAION, MS COCO 2014, and MS COCO 2017 also show that SLAM surpasses existing acceleration methods in few-step generation, achieving state-of-the-art performance in both FID and the quality of the generated images.
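To make the idea of splitting a PF-ODE trajectory into sub-paths and approximating each one linearly more concrete, the following is a toy sketch only, not the SLAM training procedure or its SL-ODE formulation. It assumes 1-D Gaussian data with standard deviation `SIGMA_DATA` and a noise schedule sigma(t) = t, for which the PF-ODE has a closed-form solution to compare against; the function names and all constants are hypothetical choices for illustration. Each sub-path is replaced by a single straight-line step, and the error accumulated at the end of the trajectory is measured.

```python
import numpy as np

SIGMA_DATA = 1.0  # assumed std of the toy 1-D Gaussian data distribution


def pf_ode_drift(x, t):
    # dx/dt of the PF-ODE for Gaussian data under the assumed schedule sigma(t) = t.
    return t * x / (SIGMA_DATA**2 + t**2)


def exact_solution(x_start, t_start, t):
    # Closed-form PF-ODE solution for this toy case, used as the reference.
    return x_start * np.sqrt((SIGMA_DATA**2 + t**2) / (SIGMA_DATA**2 + t_start**2))


def linear_subpath_solve(x_start, t_start, t_end, num_subpaths):
    # Split [t_start, t_end] into equal sub-paths and take one linear step per sub-path.
    ts = np.linspace(t_start, t_end, num_subpaths + 1)
    x = x_start
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * pf_ode_drift(x, t_cur)
    return x


def endpoint_error(num_subpaths, x_start=1.0, t_start=1.0, t_end=0.0):
    # Absolute gap between the piecewise-linear estimate and the exact endpoint.
    approx = linear_subpath_solve(x_start, t_start, t_end, num_subpaths)
    return abs(approx - exact_solution(x_start, t_start, t_end))


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, endpoint_error(k))
```

In this toy setting the endpoint error shrinks as the trajectory is divided into more sub-paths, which is the intuition behind bounding the error per sub-path rather than over the whole trajectory at once.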
URL
https://arxiv.org/abs/2404.13903