Abstract
Talking head generation has recently attracted considerable attention due to its broad application prospects, especially for digital avatars and 3D animation design. Inspired by this practical demand, several works have explored Neural Radiance Fields (NeRF) to synthesize talking heads. However, these NeRF-based methods face two challenges: (1) difficulty in generating style-controllable talking heads, and (2) displacement artifacts around the neck in rendered images. To overcome these two challenges, we propose a novel generative paradigm, the \textit{Embedded Representation Learning Network} (ERLNet), with two learning stages. First, the \textit{audio-driven FLAME} (ADF) module is constructed to produce facial expression and head pose sequences synchronized with the content audio and style video. Second, given the sequences deduced by the ADF module, a novel \textit{dual-branch fusion NeRF} (DBF-NeRF) exploits them to render the final images. Extensive empirical studies demonstrate that the collaboration of these two stages enables our method to render more realistic talking heads than existing algorithms.
Abstract (translated)
Talking head generation has recently attracted considerable attention due to its broad application prospects in digital avatars and 3D animation design. Motivated by this practical demand, several works have explored using Neural Radiance Fields (NeRF) to synthesize talking heads. However, NeRF-based methods face two challenges: (1) difficulty in generating style-controllable talking heads, and (2) displacement artifacts around the neck in rendered images. To overcome these two challenges, we propose a new generative paradigm named ERLNet (Embedded Representation Learning Network), consisting of two learning stages. First, an audio-driven FLAME (ADF) module is constructed to produce facial expression and head pose sequences synchronized with the content audio and style video. Second, given the sequences computed by the ADF module, a novel dual-branch fusion NeRF (DBF-NeRF) exploits them to render the final images. Extensive empirical studies demonstrate that the collaboration of these two stages effectively enables our method to render more realistic talking heads than existing algorithms.
URL
https://arxiv.org/abs/2404.19038
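
Illustrative sketch
The abstract only outlines the two-stage pipeline (ADF producing FLAME expression/pose sequences, then DBF-NeRF rendering from them). The minimal PyTorch sketch below illustrates one plausible data flow under that description; the class names, argument names, and tensor shapes (ADFModule, DBFNeRF, audio_dim, expr_dim, etc.) are hypothetical placeholders for illustration, not the authors' released interfaces.

# A minimal sketch of the two-stage pipeline described in the abstract.
# All names and shapes below are assumptions, not the authors' API.
import torch


class ADFModule(torch.nn.Module):
    """Stage 1 (assumed): map content audio + a style code to FLAME
    expression and head-pose sequences."""

    def __init__(self, audio_dim=80, style_dim=256, expr_dim=50, pose_dim=6):
        super().__init__()
        self.audio_enc = torch.nn.GRU(audio_dim, 256, batch_first=True)
        self.style_enc = torch.nn.Linear(style_dim, 256)
        self.head = torch.nn.Linear(512, expr_dim + pose_dim)
        self.expr_dim = expr_dim

    def forward(self, audio_feats, style_code):
        # audio_feats: (B, T, audio_dim); style_code: (B, style_dim)
        h, _ = self.audio_enc(audio_feats)                        # (B, T, 256)
        s = self.style_enc(style_code).unsqueeze(1).expand_as(h)  # (B, T, 256)
        out = self.head(torch.cat([h, s], dim=-1))                # (B, T, expr+pose)
        return out[..., :self.expr_dim], out[..., self.expr_dim:]


class DBFNeRF(torch.nn.Module):
    """Stage 2 (assumed): a NeRF-style MLP conditioned on per-frame FLAME
    expression and head pose; returns RGB + density per sample point."""

    def __init__(self, expr_dim=50, pose_dim=6):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3 + expr_dim + pose_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, 4),
        )

    def forward(self, points, expr, pose):
        # points: (B, N, 3); expr: (B, expr_dim); pose: (B, pose_dim)
        cond = torch.cat([expr, pose], dim=-1).unsqueeze(1)       # (B, 1, C)
        cond = cond.expand(-1, points.shape[1], -1)               # (B, N, C)
        return self.mlp(torch.cat([points, cond], dim=-1))        # (B, N, 4)


# Toy end-to-end pass: audio + style -> FLAME sequences -> per-frame rendering.
adf, nerf = ADFModule(), DBFNeRF()
audio = torch.randn(1, 100, 80)           # 100 audio frames of 80-dim features
style = torch.randn(1, 256)               # style code from a reference video
expr_seq, pose_seq = adf(audio, style)
rays = torch.randn(1, 1024, 3)            # sample points along camera rays
rgbd = nerf(rays, expr_seq[:, 0], pose_seq[:, 0])  # render the first frame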