Paper Reading AI Learner

MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

2023-07-19 14:45:11
Yunfei Liu, Lijian Lin, Fei Yu, Changyin Zhou, Yu Li

Abstract

Audio-driven portrait animation aims to synthesize portrait videos conditioned on given audio. Animating high-fidelity and multimodal video portraits has a wide range of applications. Previous methods attempt to capture different motion modes and generate high-fidelity portrait videos by training separate models or by sampling signals from given videos. However, the lack of correlation learning between lip sync and other movements (e.g., head pose and eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method consists of three stages: 1) a Mapping-Once network with Dual Attentions (MODA) generates a talking representation from the given audio; in MODA, we design a dual-attention module to encode accurate mouth movements and diverse modalities; 2) a facial composer network generates dense and detailed face landmarks; and 3) a temporal-guided renderer synthesizes stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits than previous methods.
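The abstract describes a dual-attention module that separates precise, audio-synchronized lip motion from loosely correlated motions such as head pose and blinking. Below is a minimal, hypothetical PyTorch sketch of what such a block could look like, assuming standard multi-head cross-attention over audio features; the class and argument names (`DualAttention`, `audio_feat`, `subject_query`) and the dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Illustrative dual-attention block (assumption, not the paper's code):
    one branch attends to audio for accurate mouth movement, a second branch
    attends to the same audio for diverse motions (head pose, eye blinking)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Branch 1: tightly audio-correlated lip motion (near one-to-one mapping)
        self.lip_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Branch 2: loosely audio-correlated motions (one-to-many mapping)
        self.motion_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_lip = nn.Linear(dim, dim)
        self.to_motion = nn.Linear(dim, dim)

    def forward(self, audio_feat: torch.Tensor, subject_query: torch.Tensor):
        # audio_feat:    (B, T, dim) encoded audio sequence
        # subject_query: (B, T, dim) subject-conditioned query sequence
        lip, _ = self.lip_attn(subject_query, audio_feat, audio_feat)
        motion, _ = self.motion_attn(subject_query, audio_feat, audio_feat)
        return self.to_lip(lip), self.to_motion(motion)

if __name__ == "__main__":
    block = DualAttention()
    audio = torch.randn(2, 100, 256)   # batch of 2, 100 audio frames
    query = torch.randn(2, 100, 256)
    lip, motion = block(audio_feat=audio, subject_query=query)
    print(lip.shape, motion.shape)     # torch.Size([2, 100, 256]) each
```

In the paper's terms, the two branches would feed the downstream facial composer network, which fuses them into dense face landmarks before rendering; the fusion itself is not sketched here.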


URL

https://arxiv.org/abs/2307.10008

PDF

https://arxiv.org/pdf/2307.10008.pdf

