Abstract
Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed points limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoint with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution this http URL codes are available at this https URL.
Abstract (translated)
音频驱动的单张图像说话肖像生成在虚拟现实、数字人创建和电影制作中扮演着重要角色。现有的方法通常被分为基于关键点的方法和基于图像的方法。基于关键点的方法有效地保留了人物的身份,但因3D可变形模型(3D Morphable Model)固定点限制而难以捕捉到精细的面部细节。此外,传统的生成网络在使用有限数据集建立音频与关键点之间的因果关系时面临挑战,导致姿势多样性较低。相比之下,基于图像的方法通过扩散网络能够产生具有丰富细节和高质量的人物肖像,但会导致身份失真,并且计算成本高昂。 在这项工作中,我们提出了KDTalker框架,这是第一个结合无监督隐式3D关键点与时空扩散模型的框架。利用无监督隐式的3D关键点,KDTalker可以根据面部信息密度调整面部信息,使得扩散过程能够灵活地模拟多样的头部姿势,并捕捉到细微的面部细节。此外,自定义设计的时空注意力机制确保了准确的唇部同步,在提高计算效率的同时产生了时间一致性高、高质量的动画。 实验结果表明,KDTalker在唇部同步准确性、头部姿态多样性以及执行效率方面均达到了最先进的水平。代码可在提供的链接中获取。
URL
https://arxiv.org/abs/2503.12963