Abstract
Diffusion models have proven highly effective for image and video generation; however, because they are trained on single-scale data, they still face composition challenges when generating images at other sizes. Adapting large pre-trained diffusion models to higher resolutions demands substantial computational and optimization resources, yet matching the generation capability of the low-resolution model remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, using either a tuning-free paradigm or cheap upsampler tuning. By integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model adapts efficiently to higher resolutions while preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy that speeds up inference and improves local structural details. Compared with full fine-tuning, our approach achieves a 5X training speed-up and adds only 0.002M tuning parameters. Extensive experiments demonstrate that our approach adapts to higher-resolution image and video synthesis after fine-tuning for just 10k steps, with virtually no additional inference time.
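The abstract names two ingredients: plug-in multi-scale upsampler modules and a pivot-guided noise re-schedule. The sketch below illustrates one plausible reading of that inference cascade, assuming a diffusers-style scheduler interface; model, prompt_emb, pivot_step, and the two-stage layout are hypothetical placeholders for illustration, not the authors' released code.

    # Hypothetical sketch of a self-cascade inference loop (not the authors' API).
    # Assumes an epsilon-prediction diffusion model and a diffusers-style scheduler
    # exposing set_timesteps / step / add_noise.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def self_cascade_sample(model, scheduler, prompt_emb,
                            low_shape=(1, 4, 64, 64),  # latent at base resolution
                            scale=2,                   # upsampling factor per stage
                            pivot_step=600,            # timestep to re-noise the pivot to
                            num_steps=50, device="cuda"):
        # Stage 1: ordinary low-resolution sampling with the frozen base model.
        x = torch.randn(low_shape, device=device)
        scheduler.set_timesteps(num_steps, device=device)
        for t in scheduler.timesteps:
            eps = model(x, t, prompt_emb)              # base model, original weights
            x = scheduler.step(eps, t, x).prev_sample

        # Stage 2: pivot-guided noise re-schedule. The low-res result acts as a
        # pivot: upsample it, re-noise it to an intermediate timestep, and denoise
        # only the remaining steps at the higher resolution. Skipping the early
        # steps saves inference time while the pivot preserves global composition.
        x = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
        t_pivot = torch.tensor(pivot_step, device=device)
        x = scheduler.add_noise(x, torch.randn_like(x), t_pivot)

        for t in scheduler.timesteps:
            if t > t_pivot:                            # only denoise from the pivot onward
                continue
            # In the cheap-tuning variant, lightweight upsampler modules (the
            # ~0.002M extra parameters) would be plugged into `model` here; the
            # tuning-free variant reuses the base model unchanged.
            eps = model(x, t, prompt_emb)
            x = scheduler.step(eps, t, x).prev_sample
        return x

In this reading, repeating stage 2 once per upsampler module yields the multi-scale cascade, and the pivot re-noising is what lets each stage start partway through the schedule instead of from pure noise.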
URL
https://arxiv.org/abs/2402.10491