Abstract
We devise a cascade GAN approach to generate talking face video that is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose to first transfer audio to a high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on those landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. Humans are sensitive to temporal discontinuities and subtle artifacts in video. To avoid such pixel jittering and to force the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate sharper images with well-synchronized facial movements, we propose a novel regression-based discriminator structure that considers sequence-level information along with frame-level information. Thorough experiments on several datasets and real-world samples demonstrate that our method achieves significantly better results than state-of-the-art methods in both quantitative and qualitative comparisons.
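To make the attention-weighted pixel-wise loss concrete, here is a minimal NumPy sketch of the general idea: pixels that move between consecutive ground-truth frames (e.g. the mouth region during speech) are upweighted so the reconstruction loss concentrates on audio-correlated regions. The function name, the motion-based attention map, and the `beta` blending weight are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dynamic_pixelwise_loss(generated, target, prev_target, beta=0.5, eps=1e-8):
    """Sketch of a dynamically adjusted, attention-weighted L1 loss.

    Pixels that change between consecutive ground-truth frames receive
    higher weight, so static background contributes less than the
    speech-driven mouth region. Hypothetical simplification of the
    paper's loss, for illustration only.
    """
    # Motion map: where the current target frame differs from the previous one.
    motion = np.abs(target - prev_target)
    # Normalize to [0, 1] to obtain a per-pixel attention map.
    attention = motion / (motion.max() + eps)
    # Blend with a uniform base weight so static pixels are not ignored entirely.
    weights = beta + (1.0 - beta) * attention
    # Attention-weighted L1 reconstruction loss.
    return float(np.mean(weights * np.abs(generated - target)))
```

Because the attention map is recomputed from each ground-truth frame pair, the weighting adapts over the sequence rather than being fixed in advance.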
URL
https://arxiv.org/abs/1905.03820