Abstract
Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, because steep appearance changes are difficult to fit, the prevailing paradigm that represents facial motions by directly modifying point appearance can introduce distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance fields framework for high-fidelity talking head synthesis. Leveraging point-based Gaussian Splatting, our method represents facial motions by applying smooth and continuous deformations to persistent Gaussian primitives, without needing to learn the difficult appearance changes required by previous methods. Thanks to this simplification, precise facial motions can be synthesized while keeping the facial features highly intact. Under this deformation paradigm, we further identify a face-mouth motion inconsistency that hinders the learning of detailed speaking motions. To resolve this conflict, we decompose the model into two branches, one for the face and one for the inside-mouth area, thereby simplifying the learning tasks and helping to reconstruct more accurate motion and structure in the mouth region. Extensive experiments demonstrate that our method renders high-quality lip-synchronized talking head videos, with better facial fidelity and higher efficiency than previous methods.
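The deformation paradigm described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a set of persistent canonical Gaussian primitives (positions only, for brevity), a hypothetical audio-conditioned MLP as the deformation field, and a hypothetical mask splitting primitives between the face branch and the inside-mouth branch. Appearance attributes stay fixed; only geometry is deformed per frame.

```python
import numpy as np

rng = np.random.default_rng(0)

# Persistent canonical Gaussian primitives (positions only, for brevity).
# Real primitives would also carry rotation, scale, opacity, and color.
N, D_AUDIO = 1024, 32
canonical_xyz = rng.normal(size=(N, 3)).astype(np.float32)

def tiny_mlp(x, w1, b1, w2, b2):
    """Two-layer MLP standing in for a learned deformation field."""
    return np.tanh(x @ w1 + b1) @ w2 + b2

def make_branch(in_dim, hidden=64, out_dim=3):
    """Initialize one deformation branch (face or inside-mouth)."""
    return (rng.normal(scale=0.01, size=(in_dim, hidden)).astype(np.float32),
            np.zeros(hidden, np.float32),
            rng.normal(scale=0.01, size=(hidden, out_dim)).astype(np.float32),
            np.zeros(out_dim, np.float32))

# Two separate branches, mirroring the face / inside-mouth decomposition.
face_branch = make_branch(3 + D_AUDIO)
mouth_branch = make_branch(3 + D_AUDIO)
is_mouth = rng.random(N) < 0.1  # hypothetical mask over the primitives

def deform(audio_feat):
    """Predict per-primitive offsets; appearance attributes stay fixed."""
    cond = np.concatenate([canonical_xyz,
                           np.broadcast_to(audio_feat, (N, D_AUDIO))], axis=1)
    offsets = np.where(is_mouth[:, None],
                       tiny_mlp(cond, *mouth_branch),
                       tiny_mlp(cond, *face_branch))
    return canonical_xyz + offsets  # deformed positions for this frame

frame_xyz = deform(rng.normal(size=D_AUDIO).astype(np.float32))
```

Because the primitives persist across frames and only receive smooth offsets, point appearance never has to model abrupt changes, which is the simplification the abstract argues avoids distortions in dynamic regions.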
URL
https://arxiv.org/abs/2404.15264