
Temporally Coherent Video Harmonization Using Adversarial Networks

2018-09-05 08:01:15
Haozhi Huang, Senzhe Xu, Junxiong Cai, Wei Liu, Shimin Hu

Abstract

Compositing is one of the most important editing operations for images and videos. The process of improving the realism of composite results is often called harmonization. Previous approaches for harmonization mainly focus on images. In this work, we take one step further and attack the problem of video harmonization. Specifically, we train a convolutional neural network in an adversarial way, exploiting a pixel-wise disharmony discriminator to achieve more realistic harmonized results and introducing a temporal loss to increase temporal consistency between consecutive harmonized frames. Thanks to the pixel-wise disharmony discriminator, we are also able to relieve the need for input foreground masks. Since existing video datasets that have ground-truth foreground masks and optical flow are not sufficiently large, we propose a simple yet efficient method to build a synthetic dataset that supports supervised training of the proposed adversarial network. Experiments show that training on our synthetic dataset generalizes well to a real-world composite dataset. Also, our method successfully incorporates temporal consistency during training and achieves more harmonious results than previous methods.
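To make the two training signals in the abstract concrete, below is a minimal PyTorch sketch (not the authors' released code) of how a pixel-wise adversarial loss and a flow-based temporal loss are commonly formed. All names here (warp, generator_losses, disc_map) are illustrative assumptions, and common refinements such as occlusion masking of the warped frame are omitted.

import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp a frame (N,C,H,W) with a dense optical flow (N,2,H,W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # displaced y coordinates
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(frame, grid, align_corners=True)

def generator_losses(harmonized_t, harmonized_prev, flow, disc_map):
    # Pixel-wise adversarial loss: the discriminator outputs a per-pixel
    # disharmony probability map in [0, 1]; the generator tries to drive
    # every pixel toward 0 ("harmonious").
    adv_loss = F.binary_cross_entropy(disc_map, torch.zeros_like(disc_map))
    # Temporal loss: the current harmonized frame should agree with the
    # previous harmonized frame warped forward by the ground-truth flow.
    temporal_loss = F.l1_loss(harmonized_t, warp(harmonized_prev, flow))
    return adv_loss, temporal_loss

In this formulation the discriminator localizes disharmony, so the generator is penalized only where the composite still looks unrealistic, which is also what lets the mask requirement be relaxed at test time; the temporal term compares consecutive outputs through flow warping, which is why a training set with ground-truth foreground masks and optical flow is needed.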

URL

https://arxiv.org/abs/1809.01372

PDF

https://arxiv.org/pdf/1809.01372.pdf

