Abstract
Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Our project page: this https URL.
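As an illustration of the second-stage idea of emphasizing the lip region at the frame level, the following is a minimal sketch of a mask-weighted reconstruction error. It is not the authors' loss: the abstract does not specify how the lip-tracing mask enters training, so the function name `lip_weighted_mse`, the `lip_weight` parameter, and the pixel-space formulation are all assumptions made for illustration only.

```python
import numpy as np


def lip_weighted_mse(pred, target, lip_mask, lip_weight=2.0):
    """Mean squared error with extra weight inside a lip mask.

    pred, target: arrays of shape (frames, height, width).
    lip_mask: same-shape array, 1.0 inside the lip region and 0.0 elsewhere.
    lip_weight: hypothetical multiplier for lip pixels (not from the paper).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mask = np.asarray(lip_mask, dtype=np.float64)

    # Pixels outside the mask get weight 1, pixels inside get lip_weight.
    weights = 1.0 + (lip_weight - 1.0) * mask
    sq_err = (pred - target) ** 2

    # Normalize by the total weight so the result stays a weighted mean.
    return float((weights * sq_err).sum() / weights.sum())
```

With `lip_weight=1.0` or an all-zero mask, this reduces to an ordinary mean squared error; larger values make errors inside the masked region count more heavily.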
Abstract (translated)
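Creating a realistic, animatable avatar from a single static portrait remains challenging. Existing methods often struggle to capture subtle facial expressions, the associated whole-body movements, and dynamic backgrounds. To address these limitations, we propose a novel framework that uses a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. The core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we adopt a clip-level training scheme that establishes coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements frame by frame using a lip-tracing mask, ensuring precise synchronization with the audio signal. To preserve identity without sacrificing motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. In addition, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable adjustment of portrait movements beyond lip motion alone. Extensive experimental results show that our method outperforms existing techniques in quality, realism, coherence, motion intensity, and identity preservation. Our project page: this https URL.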
URL
https://arxiv.org/abs/2504.04842