Paper Reading AI Learner

SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation

2025-05-25 13:58:52
Shenggan Cheng, Yuanxin Wei, Lansong Diao, Yong Liu, Bujiao Chen, Lianghua Huang, Yu Liu, Wenyuan Yu, Jiangsu Du, Wei Lin, Yang You

Abstract

Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates its inference by skipping computation, usually at the cost of severe quality degradation. In this paper, we propose SRDiffusion, a novel framework that leverages collaboration between large and small models to reduce inference cost. The large model handles high-noise steps to ensure semantic and motion fidelity (Sketching), while the smaller model refines visual details in low-noise steps (Rendering). Experimental results demonstrate that our method outperforms existing approaches, over 3$\times$ speedup for Wan with nearly no quality loss for VBench, and 2$\times$ speedup for CogVideoX. Our method is introduced as a new direction orthogonal to existing acceleration strategies, offering a practical solution for scalable video generation.

Abstract (translated)

基于扩散变压器(DiT)架构,模型如Sora、CogVideoX和Wan在文本到视频、图像到视频以及视频编辑任务上取得了显著进展。尽管这些技术有了进步,但基于扩散的视频生成仍然计算密集型工作,尤其是对于高分辨率和长时间的视频而言更是如此。之前的工作通过跳过某些计算来加速推理过程,但这通常会以严重的质量下降为代价。在本文中,我们提出了SRDiffusion,这是一个新的框架,它利用大型模型和小型模型之间的协作来减少推理成本。大型模型处理高噪声步骤,确保语义和运动的保真度(草图绘制),而小型模型则在低噪声步骤中细化视觉细节(渲染)。实验结果表明,我们的方法优于现有方法,在VBench上实现了Wan模型超过3倍的速度提升且几乎无质量损失,并为CogVideoX提供了2倍的速度提升。我们的方法作为与现有加速策略不同的新方向被提出,为可扩展视频生成提供了一种实用的解决方案。

URL

https://arxiv.org/abs/2505.19151

PDF

https://arxiv.org/pdf/2505.19151.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot