Abstract
Video inpainting aims to fill corrupted regions of a video with plausible content. Existing methods generally assume that the locations of the corrupted regions are known, focusing primarily on "how to inpaint". This assumption requires manual annotation of the corrupted regions with binary masks to indicate "where to inpaint". However, annotating these masks is labor-intensive and expensive, which limits the practicality of current methods. In this paper, we relax this assumption by defining a new blind video inpainting setting, enabling networks to learn the mapping from corrupted videos to inpainted results directly, eliminating the need for corrupted-region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) that addresses both "where to inpaint" and "how to inpaint" simultaneously. On the one hand, BVINet predicts the masks of corrupted regions by detecting semantically discontinuous regions within each frame and exploiting the temporal-consistency prior of the video. On the other hand, the predicted masks are fed back into BVINet, allowing it to capture valid context from uncorrupted regions to fill in corrupted ones. In addition, we introduce a consistency loss to regularize the training parameters of BVINet. In this way, mask prediction and video completion mutually constrain each other, maximizing the overall performance of the trained model. Furthermore, we curate a dataset consisting of synthetic corrupted videos, real-world corrupted videos, and their corresponding completed videos, which serves as a valuable resource for advancing blind video inpainting research. Extensive experimental results demonstrate the effectiveness and superiority of our method.
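The abstract describes a two-branch design: a mask branch that answers "where to inpaint" and a completion branch that answers "how to inpaint", trained jointly so that the two constrain each other. The PyTorch sketch below is a minimal illustration of that coupling only; the abstract gives no architecture details, so every module, layer choice, and loss weight here (the 3D-convolution branches, the L1 + BCE objective, the weight lam) is a hypothetical stand-in, not BVINet's actual definition.

import torch
import torch.nn as nn

class BlindVideoInpainter(nn.Module):
    """Jointly answers 'where to inpaint' (mask branch) and
    'how to inpaint' (completion branch). Hypothetical sketch."""

    def __init__(self, channels=64):
        super().__init__()
        # Hypothetical mask branch: flags semantically discontinuous
        # regions per frame; a real model would also exploit the
        # temporal-consistency prior across frames.
        self.mask_predictor = nn.Sequential(
            nn.Conv3d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),  # soft mask in [0, 1]; 1 = corrupted
        )
        # Hypothetical completion branch: fills corrupted regions using
        # context from the regions the mask marks as uncorrupted.
        self.completion = nn.Sequential(
            nn.Conv3d(4, channels, kernel_size=3, padding=1),  # frames + mask
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, frames):
        # frames: (batch, 3, time, height, width), a corrupted clip
        mask = self.mask_predictor(frames)                 # where to inpaint
        filled = self.completion(torch.cat([frames, mask], dim=1))
        # Keep original pixels where the mask says "uncorrupted".
        output = mask * filled + (1 - mask) * frames
        return output, mask

def training_loss(output, mask, target_frames, target_mask, lam=1.0):
    # Reconstruction term ("how to inpaint") plus a mask term
    # ("where to inpaint"). The paper's consistency loss couples the
    # two branches; joint optimization approximates that here.
    recon = nn.functional.l1_loss(output, target_frames)
    mask_term = nn.functional.binary_cross_entropy(mask, target_mask)
    return recon + lam * mask_term

The compositing step, mask * filled + (1 - mask) * frames, leaves pixels the mask deems uncorrupted untouched, mirroring the abstract's point that valid context should come from uncorrupted regions while only corrupted ones are synthesized.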
Abstract (translated)
Video inpainting aims to fill corrupted regions with plausible content. Existing methods generally assume that the locations of the corrupted regions are known, concentrating mainly on how the repair is performed (i.e., "how to inpaint"). This dependence requires manually annotating the corrupted regions with binary masks to indicate where to inpaint; such annotation is time-consuming and costly, limiting the practical applicability of current methods. In this paper, we relax this assumption and define a new blind video inpainting setting that enables networks to learn the mapping from a corrupted video to its inpainted result directly, without any annotation of the corrupted regions. To this end, we propose an end-to-end blind video inpainting network (BVINet) that addresses "where to inpaint" and "how to inpaint" at the same time. On the one hand, by detecting semantically discontinuous regions in each frame and exploiting the temporal-consistency prior of the video, BVINet predicts the masks of the corrupted regions. On the other hand, these predicted masks are integrated into BVINet so that it can capture valid context from uncorrupted regions to fill in the corrupted parts. In addition, we introduce a consistency loss to regularize BVINet's training parameters; in this way, mask prediction and video completion constrain each other, maximizing the overall performance of the trained model. To further advance blind video inpainting research, we build a dataset containing synthetic corrupted videos, real-world corrupted videos, and their corresponding completed videos. Extensive experimental results demonstrate the effectiveness and superiority of our method.
URL
https://arxiv.org/abs/2502.01181