Emo-Avatar: Efficient Monocular Video Style Avatar through Texture Rendering

Abstract

Artistic video portrait generation is a significant and sought-after task in computer graphics and vision. Various methods integrate NeRFs or StyleGANs with instructional editing models to create and edit drivable portraits, but these approaches face several challenges: they often rely heavily on large datasets, require extensive customization processes, and frequently yield reduced image quality. To address these problems, we propose the Efficient Monocular Video Style Avatar (Emo-Avatar), which uses deferred neural rendering to enhance StyleGAN's capacity for producing dynamic, drivable portrait videos. Emo-Avatar is built on a two-stage deferred neural rendering pipeline. In the first stage, we use few-shot PTI initialization to initialize the StyleGAN generator from several extreme poses sampled from the video, capturing a consistent representation of aligned faces from the target portrait. In the second stage, we propose a Laplacian pyramid for high-frequency texture sampling from UV maps deformed by the dynamic flow of expression. This motion-aware texture prior supplies torso features that enhance StyleGAN's ability to generate a complete upper body for portrait video rendering. Emo-Avatar reduces style customization time from hours to merely 5 minutes compared with existing methods. In addition, Emo-Avatar requires only a single reference image for editing and employs region-aware contrastive learning with semantic-invariant CLIP guidance, ensuring consistent high-resolution output and identity preservation. In both quantitative and qualitative assessments, Emo-Avatar outperforms existing methods in training efficiency, rendering quality, and editability in self- and cross-reenactment.
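
The second stage relies on a Laplacian pyramid to isolate high-frequency texture detail. As a rough illustration of that decomposition only, and not the authors' implementation, the following NumPy sketch splits a texture into high-frequency bands plus a low-frequency residual. The function names, the 2x2 average pooling used in place of a Gaussian blur, and the nearest-neighbour upsampling are all simplifying assumptions made for this example.

```python
import numpy as np

def downsample(img):
    """Halve resolution by averaging each 2x2 block (assumed stand-in for blur + subsample)."""
    h, w = img.shape[:2]
    return img.reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def upsample(img):
    """Double resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_laplacian_pyramid(texture, levels):
    """Return [band_0, ..., band_{levels-1}, residual].

    Each band holds the detail lost when moving to the next coarser level.
    Height and width must be divisible by 2**levels.
    """
    current = np.atleast_3d(texture).astype(np.float64)  # force H x W x C
    pyramid = []
    for _ in range(levels):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller))  # high-frequency band
        current = smaller
    pyramid.append(current)  # low-frequency residual
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition by upsampling and adding the bands back."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = upsample(current) + band
    return current

# Hypothetical 64x64 RGB texture standing in for a UV texture map.
rng = np.random.default_rng(0)
texture = rng.random((64, 64, 3))
pyramid = build_laplacian_pyramid(texture, levels=3)
restored = reconstruct(pyramid)
```

Because each band stores exactly what the downsampling step discarded, summing the bands back onto the upsampled residual recovers the input texture. In a pipeline like the one described above, the fine bands would be the natural place to sample high-frequency texture, though how Emo-Avatar does this is not specified in the abstract.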

URL

https://arxiv.org/abs/2402.00827

PDF

https://arxiv.org/pdf/2402.00827.pdf
