
Align-and-Attend Network for Globally and Locally Coherent Video Inpainting

2019-05-30 14:14:08
Sanghyun Woo, Dahun Kim, KwanYong Park, Joon-Young Lee, In So Kweon

Abstract

We propose a novel feed-forward network for video inpainting. We use a set of sampled video frames as references from which visible content is taken to fill the holes of a target frame. Our video inpainting network consists of two stages. The first stage is an alignment module that uses homographies computed between the reference frames and the target frame; the visible patches are then aggregated based on frame similarity to roughly fill the target holes. The second stage is a non-local attention module that matches the generated patches with known reference patches (in space and time) to refine the coarse result of the global alignment stage. Both stages use a large spatio-temporal window over the references, which enables modeling long-range correlations between distant information and the hole regions. As a result, even challenging scenes with large or slowly moving holes, which existing flow-based approaches can hardly handle, can be inpainted. Our network is also designed with a recurrent propagation stream to encourage temporal consistency in the video results. Experiments on video object removal demonstrate that our method fills the holes with globally and locally coherent content.
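To make the two stages concrete, here is a minimal PyTorch sketch of such a pipeline. It is not the authors' released code: the homographies `Hs` are assumed to be precomputed (e.g., via feature matching), the masked-L1 similarity weighting and the scaled dot-product attention are illustrative stand-ins for the paper's learned modules, and all function names are hypothetical.

```python
# Sketch of the align-and-attend idea, assuming precomputed homographies
# and a binary hole mask (1 = hole). Not the paper's implementation.
import torch
import torch.nn.functional as F

def warp_with_homography(frame, H):
    """Warp a reference frame (B, C, h, w) into the target view with a 3x3 homography H."""
    B, C, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device),
        torch.arange(w, device=frame.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    grid = torch.stack([xs, ys, ones], dim=-1).float()       # (h, w, 3) homogeneous pixels
    warped = grid.reshape(-1, 3) @ H.T                       # apply homography
    warped = warped[:, :2] / warped[:, 2:].clamp(min=1e-8)   # dehomogenize
    warped[:, 0] = 2 * warped[:, 0] / (w - 1) - 1            # normalize to [-1, 1]
    warped[:, 1] = 2 * warped[:, 1] / (h - 1) - 1
    grid = warped.reshape(1, h, w, 2).expand(B, -1, -1, -1)
    return F.grid_sample(frame, grid, align_corners=True)

def align_and_aggregate(target, refs, Hs, hole_mask):
    """Stage 1: warp each reference to the target and blend, weighting each
    reference by its similarity to the target on the visible region."""
    warped = [warp_with_homography(r, H) for r, H in zip(refs, Hs)]
    visible = 1.0 - hole_mask
    scores = []
    for w_ref in warped:
        # Lower masked L1 error -> higher similarity -> larger weight.
        err = (torch.abs(w_ref - target) * visible).mean(dim=(1, 2, 3))  # (B,)
        scores.append(-err)
    weights = torch.softmax(torch.stack(scores, dim=0), dim=0)           # (N, B)
    filled = sum(w.view(-1, 1, 1, 1) * f for w, f in zip(weights, warped))
    return target * visible + filled * hole_mask

def nonlocal_attention_refine(coarse_feat, ref_feats, hole_mask):
    """Stage 2: match coarsely filled features against all reference features
    (across space and time) and copy back an attention-weighted sum."""
    B, C, h, w = coarse_feat.shape
    q = coarse_feat.flatten(2).transpose(1, 2)                # (B, hw, C) queries
    k = torch.cat([f.flatten(2) for f in ref_feats], dim=2)   # (B, C, N*hw) keys/values
    attn = torch.softmax(q @ k / C ** 0.5, dim=-1)            # (B, hw, N*hw)
    out = (attn @ k.transpose(1, 2)).transpose(1, 2).reshape(B, C, h, w)
    m = F.interpolate(hole_mask, size=(h, w))                 # mask at feature resolution
    return coarse_feat * (1 - m) + out * m
```

The large spatio-temporal window the abstract mentions corresponds here to attending over every position of every reference frame at once (`N*hw` keys), rather than a local flow-constrained neighborhood.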


URL

https://arxiv.org/abs/1905.13066

PDF

https://arxiv.org/pdf/1905.13066.pdf

