Abstract
Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX, and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates inference by skipping computation, usually at the cost of severe quality degradation. In this paper, we propose SRDiffusion, a novel framework that leverages collaboration between large and small models to reduce inference cost. The large model handles high-noise steps to ensure semantic and motion fidelity (Sketching), while the small model refines visual details in low-noise steps (Rendering). Experimental results demonstrate that our method outperforms existing approaches, achieving over a 3$\times$ speedup for Wan with nearly no quality loss on VBench, and a 2$\times$ speedup for CogVideoX. Our method is introduced as a new direction orthogonal to existing acceleration strategies, offering a practical solution for scalable video generation.
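The core mechanism described above is a handoff between two models across the denoising schedule. The minimal sketch below illustrates one way such a sampling loop could be structured, assuming a diffusers-style scheduler interface; all names (`large_dit`, `small_dit`, `switch_step`) are illustrative and do not reflect the paper's actual API.

```python
import torch

@torch.no_grad()
def sr_diffusion_sample(large_dit, small_dit, scheduler, latents, text_emb, switch_step):
    """Sketch of the Sketching-Rendering split: the large DiT handles early,
    high-noise steps; the small DiT handles the remaining low-noise steps.
    Assumes a diffusers-style scheduler with `timesteps` and `step(...)`.
    """
    for i, t in enumerate(scheduler.timesteps):
        # Early (high-noise) steps set global semantics and motion ("Sketching");
        # later (low-noise) steps only refine visual detail ("Rendering"),
        # so a smaller model can take over without hurting fidelity much.
        model = large_dit if i < switch_step else small_dit
        noise_pred = model(latents, t, text_emb)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

The speedup comes from the fraction of steps delegated to the small model: the larger `switch_step` fraction kept on the large model, the closer the output stays to full-quality sampling, at the cost of less acceleration.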
URL
https://arxiv.org/abs/2505.19151