Abstract
The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing the subtle, nearly imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animation. In this work, we propose a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework that learns audio-visual correlations and integrates them to enhance feature representation and regularize the final generation. Specifically, the framework first learns an audio-visual temporal correlation metric, ensuring that the temporal relationships of adjacent audio clips are aligned with those of the corresponding adjacent video frames. Since the temporal audio relationship carries information aligned with the visual frames, we then integrate it to guide the learning of more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise the generation of visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2), demonstrating its superiority over existing leading algorithms.
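The abstract does not specify implementation details, so the following is only a minimal PyTorch sketch of the two mechanisms it names: an alignment objective between the temporal relations of adjacent audio clips and adjacent video frames, and an audio-guided channel attention. All class names, the adjacent-difference encoding of "temporal relationship", the cosine form of the alignment loss, and the SE-style gating are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalCorrelationAlignment(nn.Module):
    """Hypothetical alignment loss: the temporal relation of adjacent audio
    clips should match that of the corresponding adjacent video frames.
    Here the 'temporal relation' is assumed to be the normalized difference
    between adjacent embeddings; the paper's metric may differ."""

    def __init__(self, audio_dim: int, video_dim: int, embed_dim: int = 128):
        super().__init__()
        # Project both modalities into a shared space so relations are comparable.
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)

    def forward(self, audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, Da) per-clip embeddings; video_feats: (B, T, Dv) per-frame embeddings
        a = self.audio_proj(audio_feats)
        v = self.video_proj(video_feats)
        # Adjacent-step relations, shape (B, T-1, E).
        audio_rel = F.normalize(a[:, 1:] - a[:, :-1], dim=-1)
        video_rel = F.normalize(v[:, 1:] - v[:, :-1], dim=-1)
        # Pull corresponding audio/visual relations together (cosine alignment).
        return 1.0 - F.cosine_similarity(audio_rel, video_rel, dim=-1).mean()


class AudioGuidedChannelAttention(nn.Module):
    """Hypothetical 'simple yet effective' channel attention: re-weight the
    channels of a visual feature map using the audio temporal relation,
    in the spirit of squeeze-and-excitation gating."""

    def __init__(self, audio_dim: int, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, visual_feat: torch.Tensor, audio_rel: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W); audio_rel: (B, Da) pooled audio relation vector
        w = self.gate(audio_rel).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return visual_feat * w
```

Under this reading, the alignment term would be added to the usual reconstruction/adversarial losses during training, while the gating module sits inside the generator; both choices are plausible given the abstract but unverified against the paper.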
URL
https://arxiv.org/abs/2504.05746