Abstract
We propose a novel feed-forward network for video inpainting. We use a set of sampled video frames as references from which to borrow visible content to fill the holes of a target frame. Our video inpainting network consists of two stages. The first stage is an alignment module that uses homographies computed between the reference frames and the target frame; visible patches are then aggregated according to frame similarity to roughly fill the target holes. The second stage is a non-local attention module that matches the generated patches against known reference patches (in space and time) to refine the rough result of the global alignment stage. Both stages use a large spatio-temporal reference window, which enables modeling of long-range correlations between distant content and the hole regions. As a result, even challenging scenes with large or slowly moving holes, which existing flow-based approaches handle poorly, can be inpainted. Our network also includes a recurrent propagation stream that encourages temporal consistency in the video results. Experiments on video object removal demonstrate that our method fills the holes with globally and locally coherent content.
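The abstract gives no implementation details, but the second-stage idea can be illustrated with a minimal sketch: hole positions in the target feature map act as attention queries over all reference positions across space and time, and the hole is filled with the attention-weighted sum of reference content. This is a pixel-level NumPy simplification written by us, not the authors' code; the paper operates on patches, and all names here are illustrative.

```python
import numpy as np

def nonlocal_fill(target_feat, hole_mask, ref_feats, temperature=1.0):
    """Fill hole pixels of `target_feat` by attending over all reference
    pixels in space and time (a simplified stand-in for the paper's
    patch-based non-local attention module).

    target_feat: (H, W, C) feature map of the target frame
    hole_mask:   (H, W) bool array, True where content is missing
    ref_feats:   (T, H, W, C) features of T sampled reference frames
    """
    C = target_feat.shape[-1]
    keys = ref_feats.reshape(-1, C)         # (T*H*W, C) reference keys/values
    queries = target_feat[hole_mask]        # (N_hole, C) rough first-stage fill
    # Scaled dot-product similarity between hole queries and reference keys.
    sim = queries @ keys.T / (np.sqrt(C) * temperature)
    sim -= sim.max(axis=1, keepdims=True)   # for numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True) # softmax over all space-time positions
    out = target_feat.copy()
    out[hole_mask] = attn @ keys            # weighted copy of reference content
    return out
```

Because every hole query attends to every reference position, the window is effectively global in space and time, which is what lets distant visible content contribute to the fill.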
URL
https://arxiv.org/abs/1905.13066