Abstract
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image and then animate it with a reference video or audio to generate a talking portrait video. Existing methods fail to achieve accurate 3D avatar reconstruction and stable talking face animation at the same time. Moreover, while existing works mainly focus on synthesizing the head, generating natural torso and background segments is also vital for a realistic talking portrait video. To address these limitations, we present Real3D-Portrait, a framework that (1) improves one-shot 3D reconstruction with a large image-to-plane model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter; (3) synthesizes realistic video with natural torso movement and a switchable background using a head-torso-background super-resolution model; and (4) supports one-shot audio-driven talking face generation with a generalizable audio-to-motion model. Extensive experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos than previous methods.
URL
https://arxiv.org/abs/2401.08503