Abstract
We present a novel approach for synthesizing 3D talking heads with controllable emotion, featuring enhanced lip synchronization and rendering quality. Despite significant progress in the field, prior methods still suffer from limited multi-view consistency and a lack of emotional expressiveness. To address these issues, we collect the EmoTalk3D dataset, comprising calibrated multi-view videos, emotional annotations, and per-frame 3D geometry. Training on the EmoTalk3D dataset, we propose a \textit{`Speech-to-Geometry-to-Appearance'} mapping framework that first predicts a faithful 3D geometry sequence from audio features; the appearance of the 3D talking head, represented by 4D Gaussians, is then synthesized from the predicted geometry. The appearance is further disentangled into canonical and dynamic Gaussians, which are learned from multi-view videos and fused to render free-view talking-head animation. Moreover, our model enables controllable emotion in the generated talking heads and supports rendering over a wide range of views. Our method exhibits improved rendering quality and stability in lip-motion generation while capturing dynamic facial details such as wrinkles and subtle expressions. Experiments demonstrate the effectiveness of our approach in generating high-fidelity, emotion-controllable 3D talking heads. The code and EmoTalk3D dataset are released at this https URL.
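The two-stage \textit{`Speech-to-Geometry-to-Appearance'} mapping described above can be sketched as follows. This is a minimal illustrative sketch only: the dimensions, the linear stand-in mappings, and all function names are hypothetical placeholders, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical dimensions for illustration (not from the paper).
AUDIO_DIM = 64      # per-frame audio feature size
NUM_POINTS = 100    # 3D geometry points per frame
GAUSS_PARAMS = 14   # per-point Gaussian attributes (e.g. offset, scale,
                    # rotation, opacity, color)

rng = np.random.default_rng(0)
# Placeholder linear mappings standing in for the learned networks.
W_geom = rng.standard_normal((AUDIO_DIM, NUM_POINTS * 3)) * 0.01
W_app = rng.standard_normal((NUM_POINTS * 3, NUM_POINTS * GAUSS_PARAMS)) * 0.01

def speech_to_geometry(audio_feats):
    """Stage 1: per-frame audio features -> 3D geometry sequence."""
    return (audio_feats @ W_geom).reshape(-1, NUM_POINTS, 3)

def geometry_to_appearance(geometry):
    """Stage 2: predicted geometry -> per-point 4D Gaussian parameters."""
    flat = geometry.reshape(geometry.shape[0], -1)
    return (flat @ W_app).reshape(-1, NUM_POINTS, GAUSS_PARAMS)

audio = rng.standard_normal((10, AUDIO_DIM))   # 10 frames of audio features
geometry = speech_to_geometry(audio)           # shape (10, 100, 3)
gaussians = geometry_to_appearance(geometry)   # shape (10, 100, 14)
```

The point of the decomposition is that lip motion is constrained by explicit geometry before appearance is synthesized, rather than mapping audio to pixels directly.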
URL
https://arxiv.org/abs/2408.00297