Abstract
Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still face significant challenges from uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture, and a lack of fine-grained control over facial expressions. To introduce more face-guided conditions beyond the speech audio clip, we propose Playmate, a novel two-stage training framework for generating more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation together with a carefully designed motion-decoupled module to achieve more accurate attribute disentanglement and to generate expressive talking videos directly from audio cues. In the second stage, we introduce an emotion-control module that encodes emotion-control information into the latent space, enabling fine-grained control over emotions and thus the ability to generate talking videos with a desired emotion. Extensive experiments demonstrate that Playmate outperforms existing state-of-the-art methods in video quality and lip-synchronization, and offers greater flexibility in controlling emotion and head pose. The code will be available at this https URL.
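To make the two-stage conditioning described above concrete, the following is a minimal, hypothetical sketch based only on this abstract: a stage-1 module that maps audio features to decoupled motion codes (lip, expression, head pose), and a stage-2 module that injects an emotion embedding into the same latent space. All module names, dimensions, and interfaces are assumptions for illustration, not the paper's actual architecture.

# Hypothetical sketch of Playmate's two-stage conditioning flow (assumptions only).
import torch
import torch.nn as nn


class MotionDecoupledModule(nn.Module):
    """Stage 1 (assumed): map audio features to decoupled implicit 3D motion codes."""

    def __init__(self, audio_dim=128, motion_dim=64):
        super().__init__()
        self.lip = nn.Linear(audio_dim, motion_dim)
        self.expression = nn.Linear(audio_dim, motion_dim)
        self.head_pose = nn.Linear(audio_dim, motion_dim)

    def forward(self, audio_feat):
        # Each attribute gets its own latent code so it can be controlled independently.
        return {
            "lip": self.lip(audio_feat),
            "expression": self.expression(audio_feat),
            "head_pose": self.head_pose(audio_feat),
        }


class EmotionControlModule(nn.Module):
    """Stage 2 (assumed): encode an emotion label into the latent space and
    blend it into the expression code for fine-grained emotion control."""

    def __init__(self, num_emotions=8, motion_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_emotions, motion_dim)

    def forward(self, motion_codes, emotion_id):
        motion_codes = dict(motion_codes)
        motion_codes["expression"] = motion_codes["expression"] + self.embed(emotion_id)
        return motion_codes


if __name__ == "__main__":
    audio_feat = torch.randn(1, 128)                        # placeholder audio embedding
    stage1 = MotionDecoupledModule()
    stage2 = EmotionControlModule()
    codes = stage2(stage1(audio_feat), torch.tensor([3]))   # e.g. a "happy" label
    # In the paper these codes would condition a diffusion-based renderer (not shown here).
    print({k: v.shape for k, v in codes.items()})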
URL
https://arxiv.org/abs/2502.07203