Paper Reading AI Learner

Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion

2025-02-11 02:53:48
Xingpei Ma, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Qiang Zhang, Shunsi Zhang

Abstract

Recent diffusion-based talking-face generation models have shown impressive potential for synthesizing videos that accurately match a speech audio clip to a given reference identity. However, existing approaches still face significant challenges from uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture, and a lack of fine-grained control over facial expressions. To introduce face-guided conditions beyond the speech audio clip, we propose Playmate, a novel two-stage training framework for generating more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation together with a carefully designed motion-decoupled module to enable more accurate attribute disentanglement and to generate expressive talking videos directly from audio cues. In the second stage, we introduce an emotion-control module that encodes emotion-control information into the latent space, enabling fine-grained control over emotions and thereby the ability to generate talking videos with a desired emotion. Extensive experiments demonstrate that Playmate outperforms existing state-of-the-art methods in video quality and lip-synchronization, and improves flexibility in controlling emotion and head pose. The code will be available at this https URL.
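At a high level, the emotion-control module described in the abstract is a conditioning mechanism that injects an emotion code into the latent space driving generation. The sketch below shows one plausible realization in PyTorch; the class name `EmotionControlModule`, the dimensions, and the additive fusion scheme are all illustrative assumptions made here, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class EmotionControlModule(nn.Module):
    """Encode a discrete emotion label into a motion latent space.

    A minimal sketch: names, sizes, and the residual-addition fusion
    are assumptions for illustration only; the abstract does not
    specify the paper's actual architecture.
    """

    def __init__(self, num_emotions: int = 8, latent_dim: int = 256):
        super().__init__()
        # Learnable embedding table: one vector per emotion category.
        self.emotion_embed = nn.Embedding(num_emotions, latent_dim)
        # Small MLP projecting the emotion embedding before fusion.
        self.proj = nn.Sequential(
            nn.Linear(latent_dim, latent_dim),
            nn.SiLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, motion_latent: torch.Tensor,
                emotion_id: torch.Tensor) -> torch.Tensor:
        # motion_latent: (batch, frames, latent_dim) audio-driven motion codes.
        # emotion_id:    (batch,) integer emotion labels, e.g. 0 = neutral.
        cond = self.proj(self.emotion_embed(emotion_id))  # (batch, latent_dim)
        # Broadcast the emotion code over time, add as a residual condition.
        return motion_latent + cond.unsqueeze(1)

# Usage: condition audio-driven motion latents on a hypothetical label 3.
module = EmotionControlModule()
latents = torch.randn(2, 100, 256)       # 2 clips, 100 frames each
emotions = torch.tensor([3, 3])
conditioned = module(latents, emotions)  # same shape, emotion-aware
print(conditioned.shape)                 # torch.Size([2, 100, 256])
```

The conditioned latents would then be consumed by the diffusion denoiser; additive residual conditioning is only one common choice (cross-attention or FiLM-style modulation are equally plausible alternatives).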

URL

https://arxiv.org/abs/2502.07203

PDF

https://arxiv.org/pdf/2502.07203.pdf

