X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

2024-03-23 20:30:28
You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, Linjie Luo

Abstract

We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module is trained to interpret the dynamics directly from the original driving RGB inputs. The motion accuracy is further improved by a patch-based local control module that strengthens the motion attention to small-scale nuances like eyeball positions. Notably, to mitigate identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximal disentanglement from the appearance reference modules. Experimental results demonstrate the effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.

Abstract (translated)

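We propose X-Portrait, an innovative conditional diffusion model for generating expressive and temporally consistent portrait animation. Specifically, given a single portrait as the appearance reference, we aim to animate it with motion from a driving video, capturing highly dynamic and subtle facial expressions as well as wide-ranging head movements. At its core, we use the generative prior of a pre-trained diffusion model as the rendering backbone, and achieve fine-grained head pose and expression control through novel control signals within ControlNet. Unlike traditional coarse explicit controls such as facial landmarks, our motion control module learns to interpret the dynamics directly from the original driving RGB inputs. A patch-based local control module further strengthens attention to small-scale motion nuances such as eyeball position. Notably, to mitigate identity leakage from the driving signals, we train our motion control modules on scaling-augmented cross-identity images, ensuring maximal disentanglement from the appearance reference modules. Experimental results show that X-Portrait is broadly effective across a variety of facial portraits and expressive driving sequences, and demonstrate its ability to generate engaging portrait animations with consistently preserved identity.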

URL

https://arxiv.org/abs/2403.15931

PDF

https://arxiv.org/pdf/2403.15931.pdf

