Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence

2019-05-08 08:04:35
Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon

Abstract

Blind video decaptioning is the problem of automatically removing text overlays and inpainting the occluded parts of videos without any input masks. While recent deep-learning-based inpainting methods deal with a single image and mostly assume that the positions of the corrupted pixels are known, we aim at automatic text removal in video sequences without mask information. In this paper, we propose a simple yet effective framework for fast blind video decaptioning. We construct an encoder-decoder model, where the encoder takes multiple source frames that can provide visible pixels revealed by the scene dynamics. These hints are aggregated and fed into the decoder. We apply a residual connection from the input frame to the decoder output to force our network to focus on the corrupted regions only. Our proposed model ranked first in the ECCV ChaLearn 2018 LAP Inpainting Competition Track 2: Video Decaptioning. In addition, we further improve this strong model by applying recurrent feedback. The recurrent feedback not only enforces temporal coherence but also provides strong clues on where the corrupted pixels are. Both qualitative and quantitative experiments demonstrate that our full model produces accurate and temporally consistent video results in real time (50+ fps).
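Architecture sketch (unofficial)

To make the described pipeline concrete, below is a minimal PyTorch sketch of the design the abstract outlines: the encoder aggregates multiple source frames (here by simple channel stacking, a stand-in for the paper's aggregation), the previous output is fed back as a recurrent hint, and a residual connection adds the decoder output to the input frame. All layer sizes, module names, and the aggregation scheme are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class BlindDecaptionNet(nn.Module):
    """Hypothetical encoder-decoder following the abstract's description.

    Layer sizes are placeholders; the paper's exact network differs.
    The encoder sees a stack of source frames plus the previous output
    (recurrent feedback), and the decoder predicts a residual that is
    added back onto the corrupted target frame.
    """

    def __init__(self, num_frames=5, ch=64):
        super().__init__()
        # Source frames stacked along channels + 3 channels of feedback.
        in_ch = 3 * num_frames + 3
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1),
        )

    def forward(self, source_frames, target_frame, prev_output):
        # source_frames: (B, num_frames, 3, H, W); flatten the temporal
        # dimension into channels as a simple aggregation stand-in.
        b, t, c, h, w = source_frames.shape
        x = torch.cat([source_frames.reshape(b, t * c, h, w), prev_output], dim=1)
        residual = self.decoder(self.encoder(x))
        # Residual connection from the input frame to the decoder output,
        # so the network only predicts corrections for corrupted regions.
        return target_frame + residual

# Example (shapes only): feed zeros as the first recurrent hint.
# net = BlindDecaptionNet()
# frames = torch.randn(1, 5, 3, 64, 64)
# target = frames[:, 2]                      # center frame to restore
# out = net(frames, target, torch.zeros_like(target))
```

The residual formulation matches the abstract's claim that the skip connection from the input keeps the network focused on the corrupted regions, while the fed-back previous output supplies both temporal coherence and a cue for where those regions are.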

URL

https://arxiv.org/abs/1905.02949

PDF

https://arxiv.org/pdf/1905.02949.pdf

