Paper Reading AI Learner

Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

2024-03-25 17:59:26
Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, Tao Mei

Abstract

Diffusion models are just at a tipping point for the image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution, which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also temporal consistency across video frames. In this paper, we propose a novel approach pursuing Spatial Adaptation and Temporal Coherence (SATeCo) for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically, SATeCo freezes all the parameters of the pre-trained UNet and VAE, and only optimizes two deliberately designed modules, spatial feature adaptation (SFA) and temporal feature alignment (TFA), in the decoders of the UNet and VAE. SFA modulates frame features via adaptively estimating affine parameters for each pixel, guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA delves into feature interaction within a 3D local window (tubelet) through self-attention, and executes cross-attention between the tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.
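The abstract describes two trainable modules grafted onto frozen pre-trained networks. Below is a minimal PyTorch sketch of the two ideas: per-pixel affine modulation for SFA, and tubelet self- and cross-attention for TFA. All class names, layer sizes, window sizes, and tensor layouts here are illustrative assumptions based only on the abstract, not the authors' released code.

```python
# Minimal sketch of the SFA/TFA ideas from the abstract. Shapes, layer
# choices, and how the LR guidance feature is produced are assumptions.
import torch
import torch.nn as nn

class SpatialFeatureAdaptation(nn.Module):
    """Predict per-pixel affine parameters (gamma, beta) from the
    low-resolution guidance feature and modulate the decoder feature."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_affine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, 2 * channels, 3, padding=1),
        )

    def forward(self, feat, lr_guide):
        # feat, lr_guide: (B, C, H, W); lr_guide assumed resized to H x W
        gamma, beta = self.to_affine(lr_guide).chunk(2, dim=1)
        return (1 + gamma) * feat + beta  # pixel-wise affine modulation

class TemporalFeatureAlignment(nn.Module):
    """Self-attention within a 3D local window (tubelet) of the video
    feature, then cross-attention against the LR tubelet counterpart."""
    def __init__(self, channels: int, heads: int = 4, window: int = 4):
        super().__init__()
        self.window = window
        self.self_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def _tubelets(self, x):
        # (B, T, C, H, W) -> (B * nH * nW, T * w * w, C) tubelet tokens
        B, T, C, H, W = x.shape
        w = self.window
        x = x.view(B, T, C, H // w, w, W // w, w)
        x = x.permute(0, 3, 5, 1, 4, 6, 2)  # B, nH, nW, T, w, w, C
        return x.reshape(B * (H // w) * (W // w), T * w * w, C)

    def forward(self, feat, lr_guide):
        # feat, lr_guide: (B, T, C, H, W), H and W divisible by the window
        B, T, C, H, W = feat.shape
        q = self._tubelets(feat)
        kv = self._tubelets(lr_guide)
        q, _ = self.self_attn(q, q, q)     # interaction inside the tubelet
        q, _ = self.cross_attn(q, kv, kv)  # align with the LR counterpart
        w = self.window
        q = q.view(B, H // w, W // w, T, w, w, C)
        q = q.permute(0, 3, 6, 1, 4, 2, 5)  # back to B, T, C, nH, w, nW, w
        return q.reshape(B, T, C, H, W)

# Usage with dummy features (channel count and sizes are arbitrary):
sfa, tfa = SpatialFeatureAdaptation(64), TemporalFeatureAlignment(64)
hr = torch.randn(1, 5, 64, 32, 32)  # B, T, C, H, W decoder features
lr = torch.randn(1, 5, 64, 32, 32)  # LR guidance resized to the same grid
out = tfa(sfa(hr.flatten(0, 1), lr.flatten(0, 1)).view_as(hr), lr)
```

Since the pre-trained UNet and VAE stay frozen, only these two lightweight modules would receive gradients during fine-tuning, which is the parameter-efficiency argument the abstract makes.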

URL

https://arxiv.org/abs/2403.17000

PDF

https://arxiv.org/pdf/2403.17000.pdf

