A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation

2023-07-04 08:29:59
Louis Airale (UGA, LIG), Dominique Vaufreydaz (LIG), Xavier Alameda-Pineda (UGA)

Abstract

Animating still face images with deep generative models using a speech input signal is an active research topic and has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality, while the generation of natural head motion, let alone the audio-visual correlation between head motion and speech, has often been neglected. In this work, we propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short- and long-term correlations between speech and the dynamics of the head and lips. In particular, we train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network to produce audio-aligned motion unfolding over diverse time scales. Our generator operates in the facial landmark domain, which is a standard low-dimensional head representation. The experiments show significant improvements over the state of the art in head motion dynamics quality and in multi-scale audio-visual synchrony, both in the landmark domain and in the image domain.
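
The "multimodal input pyramid" idea can be illustrated with a small sketch: the same landmark and audio feature sequences are represented at several temporal resolutions, so that one synchrony model per level can compare the two modalities over a different time span. The code below is only an illustration under stated assumptions, not the authors' implementation. The abstract does not specify how the pyramids are built, so the average pooling, the factor of 2, the number of levels, and the names `downsample` and `build_pyramid` are all hypothetical.

```python
from typing import List

Frame = List[float]  # one feature vector per time step


def downsample(seq: List[Frame], factor: int = 2) -> List[Frame]:
    """Average-pool frames over time by `factor` (assumed scheme).

    A trailing window shorter than `factor` is dropped.
    """
    out = []
    for start in range(0, len(seq) - factor + 1, factor):
        window = seq[start:start + factor]
        dim = len(window[0])
        out.append([sum(f[d] for f in window) / factor for d in range(dim)])
    return out


def build_pyramid(seq: List[Frame], levels: int = 3, factor: int = 2) -> List[List[Frame]]:
    """Return the sequence at full rate, then at successively coarser rates."""
    pyramid = [seq]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1], factor))
    return pyramid


# Toy data: 8 time steps of 2-D "landmark" frames and 1-D "audio" frames.
landmarks = [[float(t), float(2 * t)] for t in range(8)]
audio = [[float(t % 2)] for t in range(8)]

landmark_pyramid = build_pyramid(landmarks)
audio_pyramid = build_pyramid(audio)

# Both modalities stay aligned in length at every level, so a per-level
# synchrony model could score matching landmark/audio windows.
for lm_level, au_level in zip(landmark_pyramid, audio_pyramid):
    print(len(lm_level), len(au_level))
```

Coarser levels summarize longer stretches of time per frame, which is one plausible way for a fixed-size window to capture slow head motion at the top of the pyramid and fast lip motion at the bottom.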

URL

https://arxiv.org/abs/2307.03270

PDF

https://arxiv.org/pdf/2307.03270.pdf
