Paper Reading AI Learner

VectorTalker: SVG Talking Face Generation with Progressive Vectorisation

2023-12-18 01:50:00
Hao Hu, Xuan Wang, Jingxiang Sun, Yanbo Fan, Yu Guo, Caigui Jiang

Abstract

High-fidelity and efficient audio-driven talking head generation has been a key research topic in computer graphics and computer vision. In this work, we study vector-image-based audio-driven talking head generation. Compared with directly animating raster images, which are most widely used in existing works, vector images offer excellent scalability for many applications. There are two main challenges for vector-image-based talking head generation: high-quality vector image reconstruction of the source portrait image and vivid animation driven by the audio signal. To address these, we propose a novel scalable vector graphic reconstruction and animation method, dubbed VectorTalker. Specifically, for high-fidelity reconstruction, VectorTalker hierarchically reconstructs the vector image in a coarse-to-fine manner. For vivid audio-driven facial animation, we use facial landmarks as an intermediate motion representation and propose an efficient landmark-driven vector image deformation module. Our approach can handle various styles of portrait images within a unified framework, including Japanese manga, cartoon, and photorealistic images. We conduct extensive quantitative and qualitative evaluations, and the experimental results demonstrate the superiority of VectorTalker in both vector graphic reconstruction and audio-driven animation.
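The landmark-driven deformation described in the abstract can be made concrete with a small sketch. The following is not the authors' implementation; it only illustrates, with invented function names and parameters, one plausible pattern for such a module: the control points of the SVG paths are displaced by an inverse-distance-weighted interpolation of the predicted facial landmark motion.

```python
# Illustrative sketch only: NOT the VectorTalker code. It shows one plausible way
# to realise a "landmark-driven vector image deformation": move the control
# points of SVG paths by interpolating the displacement of facial landmarks
# with inverse-distance weighting. All names and parameters are assumptions.

import numpy as np

def interpolate_displacements(points, src_landmarks, dst_landmarks, power=2.0, eps=1e-6):
    """Displace `points` (N, 2) by inverse-distance-weighted landmark motion.

    src_landmarks, dst_landmarks: (K, 2) landmark positions before and after
    the audio-driven motion prediction.
    """
    disp = dst_landmarks - src_landmarks                                          # (K, 2) landmark motion
    # Pairwise distances between every control point and every source landmark.
    d = np.linalg.norm(points[:, None, :] - src_landmarks[None, :, :], axis=-1)   # (N, K)
    w = 1.0 / (d ** power + eps)                                                   # closer landmarks weigh more
    w /= w.sum(axis=1, keepdims=True)                                              # normalise weights per point
    return points + w @ disp                                                       # (N, 2) deformed points

def deform_svg_paths(paths, src_landmarks, dst_landmarks):
    """Apply the deformation to each path, given as an (M, 2) array of
    Bezier control points, and return the deformed copies."""
    return [interpolate_displacements(p, src_landmarks, dst_landmarks) for p in paths]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A toy "mouth" path and a handful of landmarks around it.
    mouth = np.stack([np.linspace(-1.0, 1.0, 8), np.zeros(8)], axis=1)
    src_lm = rng.uniform(-1.5, 1.5, size=(5, 2))
    dst_lm = src_lm + np.array([0.0, 0.2])   # pretend the audio opened the mouth
    new_paths = deform_svg_paths([mouth], src_lm, dst_lm)
    print(new_paths[0])
```

In this toy version, control points near a moving landmark follow it closely while distant points barely move, which keeps the deformation local and cheap to evaluate; the paper's actual module may use a different interpolation or an optimisation-based scheme.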


URL

https://arxiv.org/abs/2312.11568

PDF

https://arxiv.org/pdf/2312.11568.pdf

