Parametric Implicit Face Representation for Audio-Driven Facial Reenactment

2023-06-13 07:08:22
Ricong Huang, Peiwen Lai, Yipeng Qin, Guanbin Li

Abstract

Audio-driven facial reenactment is a crucial technique with a range of applications in film-making, virtual avatars, and video conferencing. Existing works employ either explicit intermediate face representations (e.g., 2D facial landmarks or 3D face models) or implicit ones (e.g., Neural Radiance Fields), and thus suffer from a trade-off between interpretability and expressive power, and hence between controllability and result quality. In this work, we break this trade-off with our novel parametric implicit face representation and propose a novel audio-driven facial reenactment framework that is both controllable and capable of generating high-quality talking heads. Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models, thereby combining the strengths of both explicit and implicit methods. In addition, we propose several new techniques to improve the three components of our framework, including i) incorporating contextual information into the audio-to-expression-parameter encoding; ii) using conditional image synthesis to parameterize the implicit representation and implementing it with an innovative tri-plane structure for efficient learning; iii) formulating facial reenactment as a conditional image inpainting problem and proposing a novel data augmentation technique to improve model generalizability. Extensive experiments demonstrate that our method can generate more realistic results than previous methods, with greater fidelity to the identities and talking styles of speakers.
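
The abstract names contextual audio-to-expression encoding (component i) without giving implementation details. As a rough illustration only, the sketch below maps a window of audio features to per-frame 3DMM expression parameters with a bidirectional recurrent model, so each frame's prediction is conditioned on surrounding context; the class name, feature dimensions, and architecture are assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Hypothetical sketch: maps a window of audio features to 3DMM
    expression parameters. A bidirectional LSTM lets each frame's
    prediction use both past and future context, illustrating the
    'contextual information' idea; the paper's real architecture
    is not specified in the abstract."""

    def __init__(self, audio_dim=80, hidden=256, n_exp=64):
        super().__init__()
        self.temporal = nn.LSTM(audio_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_exp)

    def forward(self, audio):          # audio: (B, T, audio_dim)
        ctx, _ = self.temporal(audio)  # (B, T, 2*hidden), per-frame context
        return self.head(ctx)          # (B, T, n_exp) expression params

# Example: 2 clips, 16 frames of 80-dim mel features each.
model = AudioToExpression()
mel = torch.randn(2, 16, 80)
exprs = model(mel)  # (2, 16, 64)
```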
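
For component ii), the abstract mentions a tri-plane structure but gives no equations. A tri-plane query, in the standard formulation popularized by EG3D, projects each 3D point onto three axis-aligned feature planes, bilinearly samples each plane, and aggregates the results; the sketch below assumes that standard formulation, and the function name and tensor shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """Query a tri-plane representation at 3D points (generic sketch).

    planes: (3, C, H, W) feature planes for the XY, XZ, YZ planes.
    xyz:    (N, 3) query points in [-1, 1]^3.
    Returns (N, C) features, summed over the three planes.
    """
    # Project each 3D point onto the three axis-aligned planes.
    coords = torch.stack([
        xyz[:, [0, 1]],  # XY plane
        xyz[:, [0, 2]],  # XZ plane
        xyz[:, [1, 2]],  # YZ plane
    ])                               # (3, N, 2)

    # grid_sample expects sampling grids of shape (B, H_out, W_out, 2).
    grid = coords.unsqueeze(1)       # (3, 1, N, 2)
    feats = F.grid_sample(planes, grid, mode='bilinear',
                          align_corners=False)  # (3, C, 1, N)
    return feats.squeeze(2).sum(dim=0).T        # (N, C)

# Example: 32-channel planes at 256x256 resolution, 1024 query points.
planes = torch.randn(3, 32, 256, 256)
points = torch.rand(1024, 3) * 2 - 1
features = sample_triplane(planes, points)  # (1024, 32)
```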

URL

https://arxiv.org/abs/2306.07579

PDF

https://arxiv.org/pdf/2306.07579.pdf

