Abstract
Speech-driven 3D facial animation has been studied for years, but its practical application still falls short of expectations. The main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions. Although lip alignment has been the subject of many studies, existing methods struggle to synthesize natural and realistic expressions, resulting in mechanical and stiff facial animations. Even when emotional features are extracted from speech, the randomness of facial movements limits the effective expression of emotion. To address this issue, this paper proposes CSTalk (Correlation Supervised), a method that models the correlations among movements of different facial regions and uses them to supervise the training of the generative model, producing realistic expressions that conform to human facial motion patterns. To generate more intricate animations, we employ a rich set of control parameters based on the MetaHuman character model and capture a dataset covering five different emotions. We train a generative network with an autoencoder structure and feed it an emotion embedding vector to enable user-controlled expression generation. Experimental results demonstrate that our method outperforms existing state-of-the-art methods.
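The abstract describes supervising the generator with correlations among different facial regions. A minimal sketch of what such a correlation-supervision term could look like, assuming region parameters are compared via Pearson correlation matrices (the function name, the Frobenius-norm formulation, and the reference-matrix setup are illustrative assumptions, not details from the paper):

```python
import numpy as np

def correlation_loss(pred, ref_corr):
    """Penalize deviation of the predicted motion's inter-region
    correlation matrix from a reference correlation matrix.

    pred     : (T, R) array - R facial-region parameters over T frames
    ref_corr : (R, R) reference correlation matrix (e.g. estimated
               from captured ground-truth animation)
    """
    # Pearson correlations between region trajectories (columns of pred)
    pred_corr = np.corrcoef(pred, rowvar=False)
    # Frobenius-norm distance between the two correlation matrices
    return float(np.linalg.norm(pred_corr - ref_corr))

# Toy usage: two perfectly correlated region trajectories, scored
# against a reference that says the regions should be uncorrelated.
t = np.linspace(0.0, 1.0, 100)
pred = np.stack([np.sin(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)
ref = np.eye(2)
loss = correlation_loss(pred, ref)  # off-diagonals differ by 1 each
```

In training, a term like this would be added to the reconstruction loss so that generated expressions keep the inter-region motion patterns observed in real faces.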
URL
https://arxiv.org/abs/2404.18604