Paper Reading AI Learner

Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation

2025-04-25 05:28:21
Weipeng Tan, Chuming Lin, Chengming Xu, FeiFan Xu, Xiaobin Hu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yanwei Fu

Abstract

Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed DICE-Talk, following the idea of disentangling identity from emotion, and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on the MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
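The abstract's second contribution combines vector quantization against a learnable Emotion Bank with attention-based aggregation over that bank. The paper's actual architecture is not given here; the following NumPy sketch only illustrates the general mechanism under stated assumptions: an identity-agnostic emotion code is sampled from a Gaussian (reparameterization), hard-assigned to its nearest bank prototype, and a soft attention over all prototypes produces a correlation-aware conditioning feature. All names, dimensions, and the scoring rule are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    # Sample an identity-agnostic emotion code from N(mu, diag(sigma^2)).
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def emotion_bank_condition(z, bank):
    """Hard-quantize z against the Emotion Bank, then aggregate the whole
    bank with attention so correlated emotions also contribute.
    (Illustrative only -- not the paper's exact module.)"""
    dists = np.linalg.norm(bank - z, axis=1)        # distance to each prototype
    nearest = int(np.argmin(dists))                 # hard VQ assignment
    attn = softmax(bank @ z / np.sqrt(z.size))      # soft inter-emotion weights
    conditioned = attn @ bank                       # correlation-aware feature
    return nearest, conditioned

dim, n_emotions = 8, 6
bank = rng.standard_normal((n_emotions, dim))       # learnable bank (random here)
mu, logvar = rng.standard_normal(dim), np.zeros(dim)
z = reparameterize(mu, logvar, rng)
idx, cond = emotion_bank_condition(z, bank)
```

In a trained model the bank rows would be learned emotion prototypes, so the attention weights would reflect genuine inter-emotion similarity (e.g. related expressions sharing mass) rather than random geometry.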

URL

https://arxiv.org/abs/2504.18087

PDF

https://arxiv.org/pdf/2504.18087.pdf

