Paper Reading AI Learner

Raformer: Redundancy-Aware Transformer for Video Wire Inpainting

2024-04-24 11:02:13
Zhong Ji, Yimu Su, Yan Zhang, Jiacheng Hou, Yanwei Pang, Jungong Han

Abstract

Video Wire Inpainting (VWI) is a prominent application of video inpainting, aimed at flawlessly removing wires in films or TV series, offering significant time and labor savings compared to manual frame-by-frame removal. However, wire removal poses greater challenges because wires are longer and slimmer than the objects typically targeted in general video inpainting tasks and often intersect irregularly with people and background objects, which adds complexity to the inpainting process. Recognizing the limitations of existing video wire datasets, which are small, of poor quality, and limited in scene variety, we introduce a new VWI dataset with a novel mask generation strategy, namely the Wire Removal Video Dataset 2 (WRV2) and Pseudo Wire-Shaped (PWS) masks. The WRV2 dataset comprises over 4,000 videos with an average length of 80 frames, designed to facilitate the development of effective inpainting models. Building upon this, we propose the Redundancy-Aware Transformer (Raformer), a method that addresses the unique challenges of wire removal in video inpainting. Unlike conventional approaches that indiscriminately process all frame patches, Raformer employs a novel strategy to selectively bypass redundant parts, such as static background segments devoid of valuable information for inpainting. At the core of Raformer is the Redundancy-Aware Attention (RAA) module, which isolates and accentuates essential content through a coarse-grained, window-based attention mechanism. This is complemented by a Soft Feature Alignment (SFA) module, which refines these features and achieves end-to-end feature alignment. Extensive experiments on both traditional video inpainting datasets and our proposed WRV2 dataset demonstrate that Raformer outperforms other state-of-the-art methods.
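
The abstract describes the RAA idea at a high level: score coarse windows, attend only over the informative ones, and let redundant windows bypass attention. The snippet below is a minimal, hypothetical PyTorch sketch of that idea only, not the authors' implementation; the window size, `keep_ratio`, the scoring rule (fraction of corrupted pixels per window), and the class and parameter names (`RedundancyAwareWindowAttention`, `window`, `heads`) are all assumptions made for illustration.

```python
# Hypothetical sketch (assumptions, not the paper's code): score coarse windows by how
# many corrupted (wire) pixels they contain, run self-attention only over the
# top-scoring windows, and let the remaining "redundant" windows bypass attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RedundancyAwareWindowAttention(nn.Module):
    """Illustrative coarse-grained, redundancy-aware window attention."""

    def __init__(self, dim: int, window: int = 8, heads: int = 4, keep_ratio: float = 0.5):
        super().__init__()
        self.window = window
        self.keep_ratio = keep_ratio
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) frame features; mask: (B, 1, H, W), 1 marks wire pixels.
        # H and W are assumed to be divisible by the window size.
        B, C, H, W = feat.shape
        w = self.window
        # Partition into non-overlapping windows -> (B, N, w*w, C), N = (H//w) * (W//w).
        win = feat.unfold(2, w, w).unfold(3, w, w)                 # (B, C, H//w, W//w, w, w)
        win = win.permute(0, 2, 3, 4, 5, 1).reshape(B, -1, w * w, C)
        # Coarse redundancy score: fraction of corrupted pixels inside each window.
        score = F.avg_pool2d(mask, w).flatten(1)                   # (B, N)
        k = max(1, int(self.keep_ratio * score.shape[1]))
        keep = score.topk(k, dim=1).indices                        # informative windows
        out = win.clone()                                          # redundant windows pass through
        for b in range(B):                                         # batch loop for clarity
            sel = win[b, keep[b]].reshape(1, k * w * w, C)         # tokens of kept windows
            upd, _ = self.attn(self.norm(sel), self.norm(sel), self.norm(sel))
            out[b, keep[b]] = (sel + upd).reshape(k, w * w, C)     # residual update
        # Merge windows back into a (B, C, H, W) feature map.
        out = out.reshape(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(B, C, H, W)
```

For instance, `RedundancyAwareWindowAttention(64)(torch.randn(2, 64, 64, 64), (torch.rand(2, 1, 64, 64) > 0.95).float())` returns a (2, 64, 64, 64) tensor in which only the half of the windows with the most wire pixels were updated by attention, mirroring the coarse-grained bypass strategy the abstract attributes to RAA.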

URL

https://arxiv.org/abs/2404.15802

PDF

https://arxiv.org/pdf/2404.15802.pdf

