Paper Reading AI Learner

R2-Talker: Realistic Real-Time Talking Head Synthesis with Hash Grid Landmarks Encoding and Progressive Multilayer Conditioning

2023-12-09 13:21:01
Zhiling Ye, LiangGuo Zhang, Dingheng Zeng, Quan Lu, Ning Jiang

Abstract

Dynamic NeRFs have recently garnered growing attention for 3D talking portrait synthesis. Despite advances in rendering speed and visual quality, challenges remain in further improving efficiency and effectiveness. We present R2-Talker, an efficient and effective framework for realistic real-time talking head synthesis. Specifically, we introduce a novel approach that encodes facial landmarks as conditional features using multi-resolution hash grids. This approach losslessly encodes landmark structures as conditional features and decouples input diversity from the conditional space by mapping arbitrary landmarks to a unified feature space. We further propose a progressive multilayer conditioning scheme in the NeRF rendering pipeline for effective conditional feature fusion. Extensive experiments against state-of-the-art methods demonstrate the following advantages of our approach: 1) The lossless input encoding yields more precise features and hence superior visual quality, and decoupling inputs from the conditional space improves generalizability. 2) Fusing conditional features with the MLP output at each MLP layer strengthens the conditional influence, resulting in more accurate lip synthesis and better visual quality. 3) The fusion of conditional features is compactly structured, significantly improving computational efficiency.
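The abstract names two components that a short sketch may help ground. First, a minimal PyTorch sketch of multi-resolution hash grid encoding applied to landmark coordinates, in the spirit of Instant-NGP-style hash grids. The class name, hyperparameters (number of levels, table size, growth factor), and hash primes below are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Spatial hash in the style of Instant-NGP: XOR of integer grid coordinates
# scaled by large primes, reduced modulo the table size.
PRIMES = (1, 2654435761, 805459861)

def spatial_hash(ix, table_size):
    """ix: (..., 3) integer grid coordinates -> hash table indices."""
    h = (ix[..., 0] * PRIMES[0]) ^ (ix[..., 1] * PRIMES[1]) ^ (ix[..., 2] * PRIMES[2])
    return h % table_size

class HashGridLandmarkEncoder(nn.Module):
    """Maps 3D landmark coordinates in [0, 1]^3 to conditional features via a
    multi-resolution hash grid: per-level features are looked up at the eight
    cell corners, trilinearly interpolated, and concatenated across levels."""

    def __init__(self, n_levels=8, features_per_level=2,
                 log2_table_size=14, base_res=4, growth=1.5):
        super().__init__()
        self.table_size = 2 ** log2_table_size
        self.resolutions = [int(base_res * growth ** i) for i in range(n_levels)]
        self.tables = nn.ParameterList(
            [nn.Parameter(1e-4 * torch.randn(self.table_size, features_per_level))
             for _ in range(n_levels)])

    def forward(self, x):  # x: (N, 3) landmark coordinates in [0, 1]
        feats = []
        for res, table in zip(self.resolutions, self.tables):
            xs = x * res
            x0 = xs.floor().long()          # lower corner of the enclosing cell
            w = xs - x0.float()             # trilinear interpolation weights
            f = 0.0
            for corner in range(8):         # visit all 8 corners of the cell
                offs = torch.tensor([(corner >> k) & 1 for k in range(3)],
                                    device=x.device)
                idx = spatial_hash(x0 + offs, self.table_size)
                cw = torch.prod(torch.where(offs.bool(), w, 1.0 - w), dim=-1)
                f = f + cw.unsqueeze(-1) * table[idx]
            feats.append(f)
        return torch.cat(feats, dim=-1)     # (N, n_levels * features_per_level)
```

Second, a hedged sketch of what "progressive multilayer conditioning" could look like: the landmark features are fused with the hidden activation at every MLP layer rather than injected only at the input. The fusion operator here (concatenation followed by a linear projection) and all dimensions are assumptions for illustration; the paper's exact fusion may differ.

```python
class ProgressiveConditionedNeRF(nn.Module):
    """NeRF-style MLP in which the conditional feature is fused with the
    hidden activation at every layer, instead of only at the input.
    A sketch of progressive multilayer conditioning, not the paper's
    exact architecture."""

    def __init__(self, pos_dim, cond_dim, hidden=64, n_layers=3, out_dim=4):
        super().__init__()
        dims = [pos_dim] + [hidden] * (n_layers - 1)
        self.layers = nn.ModuleList(
            [nn.Linear(d + cond_dim, hidden) for d in dims])
        self.head = nn.Linear(hidden, out_dim)  # e.g. density + RGB

    def forward(self, x, cond):
        # x: (N, pos_dim) encoded sample positions; cond: (N, cond_dim)
        h = x
        for layer in self.layers:
            h = torch.relu(layer(torch.cat([h, cond], dim=-1)))
        return self.head(h)

# Hypothetical usage: encode landmarks, then condition the field on them.
encoder = HashGridLandmarkEncoder()
field = ProgressiveConditionedNeRF(pos_dim=32, cond_dim=8 * 2)
landmarks = torch.rand(1, 3)                  # one landmark, coords in [0, 1]
cond = encoder(landmarks).expand(1024, -1)    # broadcast to 1024 ray samples
out = field(torch.randn(1024, 32), cond)      # (1024, 4)
```

The intuition behind re-injecting the condition at each layer is that its influence is not diluted through depth, which is consistent with the abstract's claim of stronger conditional impact and more accurate lip synthesis.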

URL

https://arxiv.org/abs/2312.05572

PDF

https://arxiv.org/pdf/2312.05572.pdf

