Paper Reading AI Learner

FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features

2024-04-15 12:37:26
Andre Rochow, Max Schwarz, Sven Behnke

Abstract

The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may depict a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder that computes a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned on keypoints and a facial expression vector extracted from the driving frame. The latent representations of the source person are learned in a self-supervised manner and factorize appearance, head pose, and facial expressions; they are therefore well suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and to support the generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state of the art in terms of motion-transfer quality and temporal consistency.
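To make the described pipeline concrete, below is a minimal sketch of the encoder/decoder pattern the abstract outlines: a transformer encoder turns source-image patch tokens into a set-latent representation, and a transformer decoder predicts the color of each query pixel, cross-attending to that set latent while conditioned on driving keypoints and an expression vector. All module choices, layer sizes, and names (FSRTSketch, d_model, n_kp, expr_dim) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FSRTSketch(nn.Module):
    """Hedged sketch of the encoder/decoder pipeline from the abstract.
    Hyperparameters and layer choices are assumptions for illustration."""

    def __init__(self, d_model=256, n_heads=8, n_kp=10, expr_dim=64):
        super().__init__()
        # Encoder: source-image patches -> set-latent representation
        # (an order-free set of feature vectors).
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=8, stride=8)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Query tokens: pixel coordinates + driving keypoints + expression vector.
        self.cond_proj = nn.Linear(2 + 2 * n_kp + expr_dim, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.to_rgb = nn.Linear(d_model, 3)

    def forward(self, source_img, query_xy, drv_keypoints, drv_expression):
        # source_img: (B, 3, H, W); query_xy: (B, Q, 2) pixel coordinates
        # drv_keypoints: (B, n_kp, 2); drv_expression: (B, expr_dim)
        B, Q, _ = query_xy.shape
        tokens = self.patch_embed(source_img).flatten(2).transpose(1, 2)  # (B, N, d)
        set_latent = self.encoder(tokens)                                  # (B, N, d)
        # Broadcast the driving conditioning to every query pixel.
        cond = torch.cat([drv_keypoints.flatten(1), drv_expression], dim=1)
        queries = torch.cat([query_xy, cond.unsqueeze(1).expand(B, Q, -1)], dim=2)
        q_tokens = self.cond_proj(queries)
        # Decoder cross-attends from query tokens to the set latent.
        decoded = self.decoder(q_tokens, set_latent)
        return self.to_rgb(decoded)  # (B, Q, 3) predicted pixel colors


if __name__ == "__main__":
    model = FSRTSketch()
    img = torch.randn(1, 3, 64, 64)
    xy = torch.rand(1, 128, 2)         # 128 query pixels, coords in [0, 1]
    kp = torch.rand(1, 10, 2)          # driving keypoints
    expr = torch.randn(1, 64)          # driving expression vector
    print(model(img, xy, kp, expr).shape)  # torch.Size([1, 128, 3])
```

Because the encoder output is an unordered set of tokens, extending to multiple source images amounts to concatenating their token sets before decoding, which matches the abstract's claim that the method naturally handles several source images.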


URL

https://arxiv.org/abs/2404.09736

PDF

https://arxiv.org/pdf/2404.09736.pdf

