Abstract
High-fidelity and efficient audio-driven talking head generation has been a key research topic in computer graphics and computer vision. In this work, we study vector image based audio-driven talking head generation. Compared with directly animating raster images, the approach most widely used in existing work, vector images offer excellent scalability, which makes them useful in many applications. There are two main challenges for vector image based talking head generation: high-quality reconstruction of the vector image with respect to the source portrait image, and vivid animation with respect to the audio signal. To address these, we propose a novel scalable vector graphics reconstruction and animation method, dubbed VectorTalker. Specifically, for high-fidelity reconstruction, VectorTalker hierarchically reconstructs the vector image in a coarse-to-fine manner. For vivid audio-driven facial animation, we use facial landmarks as an intermediate motion representation and propose an efficient landmark-driven vector image deformation module. Our approach can handle various styles of portrait images within a unified framework, including Japanese manga, cartoon, and photorealistic images. We conduct extensive quantitative and qualitative evaluations, and the experimental results demonstrate the superiority of VectorTalker in both vector graphics reconstruction and audio-driven animation.
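To make the idea of landmark-driven deformation more concrete, the following is a minimal sketch, not the VectorTalker module itself: the abstract does not specify how the deformation is computed. It assumes a simple inverse-distance-weighted blend of landmark displacements, applied to the control points of vector paths. The function name `deform_control_points` and the parameters `power` and `eps` are hypothetical and chosen only for illustration.

```python
from math import hypot
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def deform_control_points(
    control_points: Sequence[Point],
    src_landmarks: Sequence[Point],
    dst_landmarks: Sequence[Point],
    power: float = 2.0,   # assumed falloff exponent, not from the paper
    eps: float = 1e-8,    # avoids division by zero when a point sits on a landmark
) -> List[Point]:
    """Move each control point by an inverse-distance-weighted average
    of the landmark displacements (dst - src)."""
    if len(src_landmarks) != len(dst_landmarks):
        raise ValueError("source and target landmark lists must have equal length")
    if not src_landmarks:
        raise ValueError("at least one landmark is required")

    deformed: List[Point] = []
    for px, py in control_points:
        weight_sum = 0.0
        dx = 0.0
        dy = 0.0
        for (sx, sy), (tx, ty) in zip(src_landmarks, dst_landmarks):
            # Landmarks closer to the control point get a larger weight.
            distance = hypot(px - sx, py - sy)
            weight = 1.0 / (distance ** power + eps)
            weight_sum += weight
            dx += weight * (tx - sx)
            dy += weight * (ty - sy)
        deformed.append((px + dx / weight_sum, py + dy / weight_sum))
    return deformed
```

Because only control points move, the output remains a resolution-independent vector image. In a pipeline like the one the abstract describes, `dst_landmarks` would presumably come from an audio-to-landmark predictor, which is outside the scope of this sketch.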
URL
https://arxiv.org/abs/2312.11568