Abstract
Animating still face images with deep generative models using a speech input signal is an active research topic and has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality while the generation of natural head motion, let alone the audio-visual correlation between head motion and speech, has often been neglected. In this work, we propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short and long-term correlation between speech and the dynamics of the head and lips. In particular, we train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network to produce audio-aligned motion unfolding over diverse time scales. Our generator operates in the facial landmark domain, which is a standard low-dimensional head representation. The experiments show significant improvements over the state of the art in head motion dynamics quality and in multi-scale audio-visual synchrony both in the landmark domain and in the image domain.
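The core idea of a multi-scale synchrony loss — comparing audio and motion representations at several temporal resolutions — can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the paper's implementation: the paper trains a stack of syncer networks to score audio-visual correspondence, whereas here a simple cosine distance between pre-extracted per-frame audio features and landmark features stands in for a trained syncer, and the pyramid is built by plain temporal average pooling.

```python
import numpy as np

def build_pyramid(x, num_scales):
    """Temporal pyramid: repeatedly average-pool a (T, D) sequence by a factor of 2."""
    pyramid = [x]
    for _ in range(num_scales - 1):
        x = x[: (len(x) // 2) * 2]      # trim to an even number of frames
        x = (x[0::2] + x[1::2]) / 2.0   # average adjacent frames
        pyramid.append(x)
    return pyramid

def cosine_sync_loss(audio_feats, visual_feats):
    """Mean per-frame cosine distance between audio and visual feature sequences.
    Placeholder for a trained syncer's correspondence score (hypothetical)."""
    a = audio_feats / (np.linalg.norm(audio_feats, axis=1, keepdims=True) + 1e-8)
    v = visual_feats / (np.linalg.norm(visual_feats, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(a * v, axis=1)))

def multi_scale_sync_loss(audio, visual, num_scales=3):
    """Average the synchrony loss over a temporal pyramid of both modalities,
    so both short-term (fine-scale) and long-term (coarse-scale) alignment count."""
    a_pyr = build_pyramid(audio, num_scales)
    v_pyr = build_pyramid(visual, num_scales)
    return sum(cosine_sync_loss(a, v) for a, v in zip(a_pyr, v_pyr)) / num_scales
```

Perfectly aligned feature streams yield a loss near zero, while uncorrelated streams score higher; in the paper's setting this term would guide the generator toward audio-aligned lip and head motion at every scale.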
URL
https://arxiv.org/abs/2307.03270