Abstract
Dynamic NeRFs have recently attracted growing attention for 3D talking portrait synthesis. Despite advances in rendering speed and visual quality, challenges remain in improving efficiency and effectiveness. We present R2-Talker, an efficient and effective framework for realistic, real-time talking head synthesis. Specifically, we introduce a novel approach that encodes facial landmarks as conditional features using multi-resolution hash grids. This approach losslessly encodes landmark structures as conditional features and decouples input diversity from the conditional space by mapping arbitrary landmarks to a unified feature space. We further propose a progressive multilayer conditioning scheme in the NeRF rendering pipeline for effective conditional feature fusion. Extensive experiments demonstrate the following advantages over state-of-the-art works: 1) The lossless input encoding yields more precise features and thus superior visual quality, while decoupling inputs from the conditional space improves generalizability. 2) Fusing conditional features with the MLP output at each MLP layer strengthens the conditional signal, resulting in more accurate lip synthesis and better visual quality. 3) The fusion of conditional features is compactly structured, significantly improving computational efficiency.
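To make the landmark-encoding idea concrete, the following is a minimal, hypothetical sketch of Instant-NGP-style multi-resolution hash encoding applied to a 3D facial landmark. All names, table sizes, hash primes, and level counts here are illustrative assumptions, not the authors' implementation; the paper only states that landmarks are encoded with multi-resolution hash grids.

```python
import random

# Hypothetical sketch: multi-resolution hash encoding of one 3D landmark.
# Constants are illustrative, not taken from the paper.
PRIMES = (1, 2654435761, 805459861)   # common spatial-hash primes
L, T, F = 4, 2 ** 14, 2               # levels, table size, features per entry
N_MIN, B = 4, 2.0                     # coarsest resolution, per-level growth

random.seed(0)
# one (normally learnable) hash table per level: T entries of F features
tables = [[[random.uniform(-1e-4, 1e-4) for _ in range(F)]
           for _ in range(T)] for _ in range(L)]

def spatial_hash(ix, iy, iz):
    """Hash integer grid coordinates into the table index range."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])) % T

def encode(p):
    """Map a landmark p in [0,1]^3 to concatenated per-level features."""
    feats = []
    for level in range(L):
        n = int(N_MIN * (B ** level))         # grid resolution at this level
        x, y, z = (c * n for c in p)
        x0, y0, z0 = int(x), int(y), int(z)
        fx, fy, fz = x - x0, y - y0, z - z0
        acc = [0.0] * F
        for dx in (0, 1):                     # trilinear interpolation over
            for dy in (0, 1):                 # the 8 surrounding corners
                for dz in (0, 1):
                    w = ((fx if dx else 1 - fx) *
                         (fy if dy else 1 - fy) *
                         (fz if dz else 1 - fz))
                    entry = tables[level][spatial_hash(x0 + dx, y0 + dy, z0 + dz)]
                    for k in range(F):
                        acc[k] += w * entry[k]
        feats.extend(acc)
    return feats                              # length L * F

cond = encode((0.3, 0.7, 0.5))
print(len(cond))  # 8
```

Because the lookup is keyed only by spatial position, any landmark configuration maps into the same learned feature space, which is one plausible reading of how this encoding decouples input diversity from the conditional space.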
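The progressive multilayer conditioning scheme can be sketched roughly as follows: the landmark condition features are fused with the hidden state before every MLP layer, rather than only at the network input. This is a hedged illustration; the dimensions, initialization, activation, and the use of concatenation as the fusion operator are assumptions, not the paper's exact architecture.

```python
import random

# Hypothetical sketch of progressive multilayer conditioning in a NeRF MLP.
# Sizes and initialization are illustrative assumptions.
random.seed(0)
COND_DIM, HIDDEN, DEPTH = 8, 16, 2

def make_layer(in_dim, out_dim):
    W = [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return W, b

# every layer consumes HIDDEN activations plus COND_DIM condition features
layers = [make_layer(HIDDEN + COND_DIM, HIDDEN) for _ in range(DEPTH)]

def forward(x, cond):
    """Run the MLP, re-injecting the condition features at each layer."""
    h = x
    for W, b in layers:
        z = h + cond                           # fuse condition at this layer
        h = [max(0.0, sum(w * v for w, v in zip(row, z)) + bj)  # ReLU(Wz + b)
             for row, bj in zip(W, b)]
    return h

out = forward([0.1] * HIDDEN, [0.2] * COND_DIM)
print(len(out))  # 16
```

Re-injecting the condition at every layer keeps its influence from being diluted through depth, which matches the abstract's claim that per-layer fusion strengthens the conditional impact on the rendered output.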
URL
https://arxiv.org/abs/2312.05572