Paper Reading AI Learner

LADDER: An Efficient Framework for Video Frame Interpolation

2024-04-17 06:47:17
Tong Shen, Dong Li, Ziheng Gao, Lu Tian, Emad Barsoum

Abstract

Video Frame Interpolation (VFI) is a crucial technique in various applications such as slow-motion generation, frame rate conversion, video frame restoration etc. This paper introduces an efficient video frame interpolation framework that aims to strike a favorable balance between efficiency and quality. Our framework follows a general paradigm consisting of a flow estimator and a refinement module, while incorporating carefully designed components. First of all, we adopt depth-wise convolution with large kernels in the flow estimator that simultaneously reduces the parameters and enhances the receptive field for encoding rich context and handling complex motion. Secondly, diverging from a common design for the refinement module with a UNet-structure (encoder-decoder structure), which we find redundant, our decoder-only refinement module directly enhances the result from coarse to fine features, offering a more efficient process. In addition, to address the challenge of handling high-definition frames, we also introduce an innovative HD-aware augmentation strategy during training, leading to consistent enhancement on HD images. Extensive experiments are conducted on diverse datasets, Vimeo90K, UCF101, Xiph and SNU-FILM. The results demonstrate that our approach achieves state-of-the-art performance with clear improvement while requiring much less FLOPs and parameters, reaching to a better spot for balancing efficiency and quality.

Abstract (translated)

视频帧插值(VFI)是各种应用(如慢动作生成、帧率转换、视频帧恢复等)中的关键技术。本文介绍了一种高效的视频帧插值框架,旨在在效率和质量之间取得良好的平衡。我们的框架包括一个流估计算法和一个优化模块,并精心设计了一些组件。首先,我们采用大尺寸的卷积来减少参数并增强编码丰富语境和处理复杂运动的能力。其次,从常见的优化模块设计(我们发现它是冗余的)中进行差异,我们的仅解码器优化模块直接增强从粗到细的特征,实现更高效的过程。此外,为了处理高清晰度帧,我们在训练过程中引入了一种创新的高清度增强策略,在HD图像上实现一致的增强。我们在多种数据集(Vimeo90K、UCF101、Xiph和SNU-FILM)上进行了广泛的实验。结果表明,我们的方法在具有显着提高的同时需要更少的FLOPs和参数,达到更好的平衡点,实现最高性能。

URL

https://arxiv.org/abs/2404.11108

PDF

https://arxiv.org/pdf/2404.11108.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot