Abstract
Diffusion models are just at a tipping point for the image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution, which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo), for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically, SATeCo freezes all the parameters of the pre-trained UNet and VAE, and only optimizes two deliberately designed modules, spatial feature adaptation (SFA) and temporal feature alignment (TFA), in the decoders of the UNet and VAE. SFA modulates frame features by adaptively estimating affine parameters for each pixel, guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA delves into feature interaction within a 3D local window (tubelet) through self-attention, and executes cross-attention between the tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.
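The core idea of SFA is a per-pixel affine modulation of frame features, with the scale and shift estimated from low-resolution guidance features. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the linear (1x1-convolution-style) projections, tensor shapes, and function names are assumptions for illustration:

```python
import numpy as np

def spatial_feature_adaptation(frame_feat, lr_guidance, w_gamma, w_beta):
    """Modulate frame features with per-pixel affine parameters
    estimated from low-resolution guidance features (SFA sketch).

    frame_feat : (H, W, C)  latent frame features to be calibrated
    lr_guidance: (H, W, Cg) guidance features from the LR video
    w_gamma, w_beta: (Cg, C) learned 1x1-conv-style projections mapping
        guidance features to per-pixel scale (gamma) and shift (beta)
    """
    gamma = lr_guidance @ w_gamma              # (H, W, C) per-pixel scale
    beta = lr_guidance @ w_beta                # (H, W, C) per-pixel shift
    return (1.0 + gamma) * frame_feat + beta   # pixel-wise affine modulation

# Toy usage with random features (shapes chosen arbitrarily)
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4, 8))    # one frame's features
guide = rng.standard_normal((4, 4, 3))   # its LR guidance features
w_g = rng.standard_normal((3, 8)) * 0.01
w_b = rng.standard_normal((3, 8)) * 0.01
out = spatial_feature_adaptation(feat, guide, w_g, w_b)
```

Because modulation is applied independently at every spatial location, each pixel of the high-resolution frame receives its own guidance signal; with zero-initialized projections the module starts as an identity mapping, leaving the frozen decoder's behavior unchanged at the beginning of training. TFA then refines these features temporally via self-attention within each tubelet and cross-attention against its low-resolution counterpart.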
URL
https://arxiv.org/abs/2403.17000