One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution

2025-06-18 16:06:30
Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, Lei Zhang

Abstract

Reproducing rich spatial details while maintaining temporal consistency is a challenging problem in real-world video super-resolution (Real-VSR), especially when pre-trained generative models such as Stable Diffusion (SD) are leveraged for realistic detail synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model that achieves realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively during optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at this https URL.
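The merge step at inference can be illustrated with a small sketch. The abstract only states that the two LoRA branches are merged into the SD model; the additive form used here, W + B_c A_c + B_d A_d, is the standard LoRA merge and is an assumption about DLoRAL, not a description of the released code. All names (`merge_lora`, `W`, `A_c`, `B_c`, `A_d`, `B_d`) and the toy 2x2 values are hypothetical.

```python
# Hypothetical sketch: fold two low-rank (LoRA) branches into one weight matrix.
# Assumes the standard additive LoRA merge; not taken from the DLoRAL code.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def merge_lora(W, branches):
    """Return W plus the sum of B @ A over all (B, A) low-rank pairs."""
    merged = [row[:] for row in W]
    for B, A in branches:
        merged = add(merged, matmul(B, A))
    return merged

# Toy 2x2 base weight with two rank-1 branches.
W = [[1.0, 0.0], [0.0, 1.0]]
B_c, A_c = [[1.0], [0.0]], [[0.5, 0.5]]    # stands in for the consistency branch
B_d, A_d = [[0.0], [2.0]], [[1.0, -1.0]]   # stands in for the detail branch

W_merged = merge_lora(W, [(B_c, A_c), (B_d, A_d)])

# Applying the merged weight matches running the base and both branches separately.
x = [[3.0], [4.0]]  # column-vector input
y_merged = matmul(W_merged, x)
y_separate = add(
    add(matmul(W, x), matmul(B_c, matmul(A_c, x))),
    matmul(B_d, matmul(A_d, x)),
)
```

Under this assumption, a merged model carries no extra matrices at inference time, which is consistent with the abstract's claim that merging keeps single-step restoration efficient.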

URL

https://arxiv.org/abs/2506.15591

PDF

https://arxiv.org/pdf/2506.15591.pdf

