Paper Reading AI Learner

Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation

2025-04-08 07:23:28
Zhihua Xu, Tianshui Chen, Zhijing Yang, Siyuan Peng, Keze Wang, Liang Lin

Abstract

The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing subtle imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animations. In this work, we propose to learn audio-visual correlations and integrate the correlations to help enhance feature representation and regularize final generation by a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework. Specifically, it first learns an audio-visual temporal correlation metric, ensuring the temporal audio relationships of adjacent clips are aligned with the temporal visual relationships of corresponding adjacent video frames. Since the temporal audio relationship contains aligned information about the visual frame, we first integrate it to guide learning more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise generating visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2) to demonstrate its superiority over existing leading algorithms.

Abstract (translated)

音频驱动的一次性面部动画(ADOS-THA)面临的最大挑战在于捕捉相邻视频帧之间细微的变化。本质上,连续音频片段之间的时序关系与相应的连续视频帧的时序关系高度相关,提供了可以对头部动作动画进行引导和监督的重要补充信息。在本研究中,我们提出了一种学习音视频关联,并通过一种新颖的时序音视频关联嵌入(TAVCE)框架将这些关联整合起来以增强特征表示并规范化最终生成的方法。 具体而言,该方法首先学习一个音频-视觉时间相关性度量,确保连续音频片段之间的时序关系与相应连续视频帧之间的时序关系对齐。由于时序音频关系包含了有关视觉帧的对准信息,我们首先通过一种简单但有效的通道注意力机制将其整合进来,以指导学习更具代表性的特征。在训练过程中,我们也利用这些对齐的相关性作为额外目标来监督生成视觉帧。 我们在几个公开的数据集(即HDTF、LRW、VoxCeleb1和VoxCeleb2)上进行了广泛的实验,证明了我们提出的方法优于现有的领先算法。

URL

https://arxiv.org/abs/2504.05746

PDF

https://arxiv.org/pdf/2504.05746.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot