Paper Reading AI Learner

Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss

2019-05-09 19:14:26
Lele Chen, Ross K. Maddox, Zhiyao Duan, Chenliang Xu

Abstract

We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to a high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. We humans are sensitive to temporal discontinuities and subtle artifacts in video. To avoid these pixel jittering problems and to encourage the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate a sharper image with well-synchronized facial movements, we propose a novel regression-based discriminator structure, which considers sequence-level information along with frame-level information. Thorough experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than the state-of-the-art methods in both quantitative and qualitative comparisons.
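The dynamically adjustable pixel-wise loss described above can be illustrated with a minimal sketch. This is a hypothetical NumPy implementation assumed from the abstract's description, not the paper's actual formulation: an attention map up-weights the pixel error in audiovisual-correlated regions (e.g., the mouth) while down-weighting static background pixels. The `base_weight` hyperparameter and function name are assumptions for illustration.

```python
import numpy as np

def dynamic_pixel_wise_loss(generated, target, attention, base_weight=0.5):
    """Attention-weighted pixel-wise L1 loss (illustrative sketch).

    generated, target : float arrays of identical shape (image frames).
    attention         : per-pixel map in [0, 1]; 1 marks regions that
                        correlate with the audio (mouth, jaw), 0 marks
                        static background.
    base_weight       : assumed floor weight so background pixels still
                        contribute a reduced error (not from the paper).
    """
    # Blend full weight in attended regions with the base weight elsewhere.
    weights = attention + base_weight * (1.0 - attention)
    # Weighted mean absolute error over all pixels.
    return float(np.mean(weights * np.abs(generated - target)))
```

With `attention` equal to 1 everywhere the loss reduces to a plain mean absolute error; with `attention` equal to 0 everywhere each pixel's error is scaled by `base_weight`. In the paper's setting the attention map would be learned and updated dynamically during training rather than fixed.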

URL

https://arxiv.org/abs/1905.03820

PDF

https://arxiv.org/pdf/1905.03820.pdf
