Paper Reading AI Learner

BVINet: Unlocking Blind Video Inpainting with Zero Annotations

2025-02-03 09:17:24
Zhiliang Wu, Kerui Chen, Kun Li, Hehe Fan, Yi Yang

Abstract

Video inpainting aims to fill in corrupted regions of a video with plausible content. Existing methods generally assume that the locations of corrupted regions are known, focusing primarily on the "how to inpaint". This reliance necessitates manual annotation of the corrupted regions with binary masks to indicate "where to inpaint". However, annotating these masks is labor-intensive and expensive, limiting the practicality of current methods. In this paper, we aim to relax this assumption by defining a new blind video inpainting setting, enabling networks to learn the mapping from a corrupted video to its inpainted result directly, eliminating the need for corrupted-region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) that addresses both "where to inpaint" and "how to inpaint" simultaneously. On the one hand, BVINet predicts the masks of corrupted regions by detecting semantically discontinuous regions of each frame and exploiting the temporal consistency prior of the video. On the other hand, the predicted masks are incorporated into BVINet, allowing it to capture valid context information from uncorrupted regions to fill in corrupted ones. In addition, we introduce a consistency loss to regularize the training parameters of BVINet. In this way, mask prediction and video completion mutually constrain each other, maximizing the overall performance of the trained model. Furthermore, we construct a dataset consisting of synthetic corrupted videos, real-world corrupted videos, and their corresponding completed videos, which serves as a valuable resource for advancing blind video inpainting research. Extensive experimental results demonstrate the effectiveness and superiority of our method.
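The abstract describes enough of the data flow ("where to inpaint" mask prediction feeding a "how to inpaint" completion stage, tied together by a consistency loss) to sketch in code. The PyTorch sketch below is illustrative only, not the paper's architecture: the module names (MaskPredictor, CompletionHead, BVINetSketch), the plain Conv3d layers, the form of the consistency term, and the 0.1 loss weight are all assumptions; the real BVINet's semantic-discontinuity detection and temporal prior are far richer than a few 3D convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskPredictor(nn.Module):
    """Hypothetical 'where to inpaint' head: scores each pixel of each frame
    as corrupted, using 3D convs so temporal context can inform the mask."""

    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, 1, 3, padding=1),
        )

    def forward(self, video):  # video: (B, 3, T, H, W)
        return torch.sigmoid(self.net(video))  # soft mask in [0, 1]


class CompletionHead(nn.Module):
    """Hypothetical 'how to inpaint' head: fills masked pixels from the
    surrounding valid context (video with corrupted pixels zeroed, plus mask)."""

    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(4, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, 3, 3, padding=1),
        )

    def forward(self, video, mask):
        x = torch.cat([video * (1.0 - mask), mask], dim=1)  # (B, 4, T, H, W)
        return self.net(x)


class BVINetSketch(nn.Module):
    """End-to-end blind inpainting: predict the mask, then complete."""

    def __init__(self):
        super().__init__()
        self.where = MaskPredictor()
        self.how = CompletionHead()

    def forward(self, corrupted):
        mask = self.where(corrupted)
        filled = self.how(corrupted, mask)
        # Composite: keep pixels judged valid, use predictions where masked.
        return corrupted * (1.0 - mask) + filled * mask, mask


def blind_inpainting_loss(model, corrupted, target):
    """Reconstruction plus an assumed consistency term: re-running the mask
    head on the inpainted result should find little left to fix, so mask
    prediction and video completion constrain each other during training."""
    output, mask = model(corrupted)
    rec = F.l1_loss(output, target)
    consistency = model.where(output).mean()
    return rec + 0.1 * consistency  # 0.1 is an arbitrary placeholder weight
```

A toy call shows the intended shapes; no annotated mask is ever passed in, which is the point of the blind setting:

```python
video = torch.rand(1, 3, 8, 64, 64)   # a short corrupted clip
model = BVINetSketch()
loss = blind_inpainting_loss(model, video, torch.rand_like(video))
loss.backward()
```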

URL

https://arxiv.org/abs/2502.01181

PDF

https://arxiv.org/pdf/2502.01181.pdf

