Paper Reading AI Learner

3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow

2024-04-15 14:20:07
Felix Taubner, Prashant Raina, Mathieu Tuli, Eu Wern Teh, Chul Lee, Jinmiao Huang

Abstract

When working with 3D facial data, improving fidelity and avoiding the uncanny valley effect depend critically on accurate 3D facial performance capture. Because such capture setups are expensive and 2D videos are widely available, recent methods have focused on monocular 3D face tracking. However, these methods often fall short of capturing precise facial movements due to limitations in their network architecture, training, and evaluation processes. To address these challenges, we propose a novel face tracker, FlowFace, that introduces an innovative 2D alignment network for dense per-vertex alignment. Unlike prior work, FlowFace is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data. Our 3D model fitting module jointly fits a 3D face model to one or more observations, integrating existing neutral shape priors for enhanced identity and expression disentanglement and per-vertex deformations for detailed facial feature reconstruction. Additionally, we propose a novel metric and benchmark for assessing tracking accuracy. Our method exhibits superior performance on both custom and publicly available benchmarks. We further validate the effectiveness of our tracker by generating high-quality 3D data from 2D videos, which leads to performance gains on downstream tasks.
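The fitting step described in the abstract — a face model with identity and expression bases plus per-vertex deformations, optimized against dense per-vertex 2D alignments — can be viewed as a regularized least-squares problem. The sketch below is a toy illustration under assumed linear bases and orthographic projection; the paper's actual model, bases, and solver are not given here, and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear face model (hypothetical dimensions; real trackers use
# FLAME-style identity/expression blendshape bases).
n_verts, n_id, n_exp = 50, 5, 5
mean = rng.normal(size=(n_verts, 3))
id_basis = rng.normal(scale=0.1, size=(n_id, n_verts, 3))
exp_basis = rng.normal(scale=0.1, size=(n_exp, n_verts, 3))

def reconstruct(alpha, beta, offsets):
    """Vertices = mean + identity blend + expression blend + per-vertex offsets."""
    return (mean
            + np.tensordot(alpha, id_basis, axes=1)
            + np.tensordot(beta, exp_basis, axes=1)
            + offsets)

def project(verts):
    """Orthographic projection onto the image plane (keep x, y)."""
    return verts[:, :2]

# Synthetic "dense per-vertex 2D alignment" targets from known parameters.
alpha_gt = rng.normal(size=n_id)
beta_gt = rng.normal(size=n_exp)
targets = project(reconstruct(alpha_gt, beta_gt, np.zeros((n_verts, 3))))

# Jointly fit identity, expression, and per-vertex offsets by gradient
# descent on the dense reprojection error; offsets are regularized so the
# low-dimensional bases explain as much of the signal as possible.
alpha = np.zeros(n_id)
beta = np.zeros(n_exp)
offsets = np.zeros((n_verts, 3))
lr, lam = 0.05, 1.0
init_err = np.abs(project(reconstruct(alpha, beta, offsets)) - targets).mean()
for _ in range(500):
    resid = project(reconstruct(alpha, beta, offsets)) - targets  # (n_verts, 2)
    # Gradients of 0.5 * ||resid||^2 (+ 0.5 * lam * ||offsets||^2) per group.
    g_alpha = np.tensordot(id_basis[:, :, :2], resid, axes=([1, 2], [0, 1]))
    g_beta = np.tensordot(exp_basis[:, :, :2], resid, axes=([1, 2], [0, 1]))
    g_off = np.concatenate([resid, np.zeros((n_verts, 1))], axis=1) + lam * offsets
    alpha -= lr * g_alpha
    beta -= lr * g_beta
    offsets -= lr * g_off

final_err = np.abs(project(reconstruct(alpha, beta, offsets)) - targets).mean()
print(init_err, final_err)
```

The same structure extends to fitting one model jointly across many frames: identity parameters are shared while expression parameters and offsets vary per frame.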

URL

https://arxiv.org/abs/2404.09819

PDF

https://arxiv.org/pdf/2404.09819.pdf

