Paper Reading AI Learner

MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

2023-07-19 14:45:11
Yunfei Liu, Lijian Lin, Fei Yu, Changyin Zhou, Yu Li

Abstract

Audio-driven portrait animation aims to synthesize portrait videos conditioned on given audio. Animating high-fidelity and multimodal video portraits has a wide range of applications. Previous methods attempt to capture different motion modes and generate high-fidelity portrait videos by training separate models or by sampling signals from given videos. However, the lack of correlation learning between lip sync and other movements (e.g., head pose and eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method consists of three stages: 1) a Mapping-Once network with Dual Attentions (MODA) generates a talking representation from the given audio; in MODA, we design a dual-attention module to encode accurate mouth movements and diverse modalities; 2) a facial composer network generates dense and detailed face landmarks; and 3) a temporal-guided renderer synthesizes stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits than previous methods.
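The abstract describes a dual-attention module that separates precise, audio-synchronized lip motion from loosely correlated motions such as head pose and blinking. Below is a minimal, hypothetical PyTorch sketch of what such a block could look like, assuming standard multi-head cross-attention over audio features; the class and argument names (`DualAttention`, `audio_feat`, `subject_query`) and the dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Illustrative dual-attention block (assumption, not the paper's code):
    one branch attends to audio for accurate mouth movement, a second branch
    attends to the same audio for diverse motions (head pose, eye blinking)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Branch 1: tightly audio-correlated lip motion (near one-to-one mapping)
        self.lip_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Branch 2: loosely audio-correlated motions (one-to-many mapping)
        self.motion_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_lip = nn.Linear(dim, dim)
        self.to_motion = nn.Linear(dim, dim)

    def forward(self, audio_feat: torch.Tensor, subject_query: torch.Tensor):
        # audio_feat:    (B, T, dim) encoded audio sequence
        # subject_query: (B, T, dim) subject-conditioned query sequence
        lip, _ = self.lip_attn(subject_query, audio_feat, audio_feat)
        motion, _ = self.motion_attn(subject_query, audio_feat, audio_feat)
        return self.to_lip(lip), self.to_motion(motion)

if __name__ == "__main__":
    block = DualAttention()
    audio = torch.randn(2, 100, 256)   # batch of 2, 100 audio frames
    query = torch.randn(2, 100, 256)
    lip, motion = block(audio_feat=audio, subject_query=query)
    print(lip.shape, motion.shape)     # torch.Size([2, 100, 256]) each
```

In the paper's terms, the two branches would feed the downstream facial composer network, which fuses them into dense face landmarks before rendering; the fusion itself is not sketched here.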


URL

https://arxiv.org/abs/2307.10008

PDF

https://arxiv.org/pdf/2307.10008.pdf

