Abstract
Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. Recent Diffusion Transformers (DiTs), despite making significant headway in this context, have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": meaning, some videos require fewer denoising steps than others to attain a reasonable quality. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p, 2s video generation) without sacrificing generation quality, across multiple video DiT baselines.
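The abstract's core idea can be illustrated with a small sketch: at each denoising step, a wrapped transformer block either recomputes its residual or reuses the one cached at the last computed step, with the decision driven by how much the block's input has changed, scaled by a motion estimate. The snippet below is a minimal, hypothetical sketch of that control flow only; names such as `AdaCacheBlock`, `thresholds`, and `motion_score` are illustrative and not the paper's code, and a PyTorch-style block whose output matches its input shape is assumed.

```python
import torch
import torch.nn as nn


class AdaCacheBlock(nn.Module):
    """Wraps a transformer block; reuses the block's cached residual when the
    change between consecutive denoising steps is small (hypothetical sketch)."""

    def __init__(self, block: nn.Module, thresholds=(0.05, 0.15)):
        super().__init__()
        self.block = block
        self.thresholds = thresholds   # distance cut-offs -> how many steps to skip
        self.cached_residual = None    # residual saved at the last computed step
        self.prev_input = None         # block input at the last computed step
        self.skips_left = 0            # remaining steps that may reuse the cache

    def forward(self, x: torch.Tensor, motion_score: float = 1.0) -> torch.Tensor:
        # Reuse the cached residual while the schedule says we may skip.
        if self.skips_left > 0 and self.cached_residual is not None:
            self.skips_left -= 1
            return x + self.cached_residual

        # Otherwise recompute the block and cache its residual contribution.
        residual = self.block(x) - x
        if self.prev_input is not None:
            # Change in the block input since the last computed step, scaled by
            # motion: more motion -> larger effective change -> fewer skips.
            dist = (x - self.prev_input).abs().mean().item() * motion_score
            if dist < self.thresholds[0]:
                self.skips_left = 2    # nearly static content: reuse for 2 steps
            elif dist < self.thresholds[1]:
                self.skips_left = 1    # mildly changing content: reuse for 1 step
        self.cached_residual = residual.detach()
        self.prev_input = x.detach()
        return x + residual


# Usage sketch: wrap a stand-in block and run it across denoising steps.
if __name__ == "__main__":
    block = AdaCacheBlock(nn.Linear(64, 64))
    x = torch.randn(1, 16, 64)
    for step in range(10):
        x = block(x, motion_score=0.5)
```

In the paper's actual method the reuse decision is made per block from a distance metric between cached and current representations, and the motion term (MoReg) biases high-motion videos toward recomputing more often; the sketch above only mirrors that behavior in simplified form.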
Abstract (translated)
Generating temporally consistent, high-fidelity videos is computationally expensive, especially over longer temporal spans. Although recently introduced Diffusion Transformers (DiTs) have made significant progress in this area, they rely on larger models and more complex attention mechanisms, which in turn slows down inference. In this paper, we introduce a training-free method to accelerate video DiTs, called Adaptive Caching (AdaCache). The method is motivated by the observation that "not all videos are created equal": that is, some videos require fewer denoising steps to reach a reasonable quality, while others do not. Building on this, we not only cache computation results during the diffusion process, but also design a caching schedule tailored to each video generation to maximize the quality-latency trade-off. In addition, we introduce a Motion Regularization (MoReg) scheme that exploits video information within AdaCache, essentially controlling the compute allocation according to motion content. Overall, our plug-and-play contributions yield significant inference speedups (e.g. up to 4.7x on Open-Sora 720p, 2s video generation) without sacrificing generation quality, across multiple video DiT baselines.
URL
https://arxiv.org/abs/2411.02397