Extreme head postures pose a common challenge across a spectrum of facial analysis tasks, including face detection, facial landmark detection (FLD), and head pose estimation (HPE). These tasks are interdependent: accurate FLD relies on robust face detection, and HPE is intricately associated with the detected key points. This paper focuses on the integration of these tasks, particularly when addressing the complexities posed by large-angle face poses. The primary contribution of this study is a real-time multi-task detection system capable of simultaneously performing joint detection of faces, facial landmarks, and head poses. This system builds upon the widely adopted YOLOv8 detection framework. It extends the original object detection head by incorporating an additional landmark regression head, enabling efficient localization of crucial facial landmarks. Furthermore, we optimize and enhance various modules within the original YOLOv8 framework. To validate the effectiveness and real-time performance of the proposed model, we conduct extensive experiments on the 300W-LP and AFLW2000-3D datasets. The results verify the capability of our model to tackle large-angle face pose challenges while delivering real-time performance across these interconnected tasks.
https://arxiv.org/abs/2309.11773
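A minimal sketch of the general idea of attaching a landmark regression branch next to a detection head. This is a hypothetical module, not the authors' implementation; the channel width and the number of landmarks are assumptions.

```python
import torch
import torch.nn as nn

class LandmarkHead(nn.Module):
    """Hypothetical landmark-regression branch attached beside a detection head.

    For each spatial cell of the feature map it predicts 2 * num_points values
    (x, y offsets of every facial landmark), mirroring how YOLO-style heads
    predict box offsets per cell.
    """
    def __init__(self, in_channels: int = 256, num_points: int = 68):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(in_channels, 2 * num_points, 1),  # (x, y) per landmark
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) -> (B, 2 * num_points, H, W)
        return self.conv(feat)

# Example: attach to a 256-channel feature map from the detection neck.
head = LandmarkHead(in_channels=256, num_points=68)
out = head(torch.randn(1, 256, 20, 20))
print(out.shape)  # torch.Size([1, 136, 20, 20])
```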
Background and objectives: Patients suffering from neurological diseases may develop dysarthria, a motor speech disorder affecting the execution of speech. Close and quantitative monitoring of dysarthria evolution is crucial for enabling clinicians to promptly implement patient management strategies and to maximize the effectiveness and efficiency of communication functions in terms of restoring, compensating, or adjusting. In the clinical assessment of orofacial structures and functions, at rest or during speech and non-speech movements, a qualitative evaluation is usually performed through visual observation. Methods: To overcome the limitations posed by qualitative assessments, this work presents a store-and-forward self-service telemonitoring system that integrates, within its cloud architecture, a convolutional neural network (CNN) for analyzing video recordings acquired by individuals with dysarthria. This architecture, called facial landmark Mask RCNN, aims at locating facial landmarks as a prior for assessing the orofacial functions related to speech and examining dysarthria evolution in neurological diseases. Results: When tested on the Toronto NeuroFace dataset, a publicly available annotated dataset of video recordings from patients with amyotrophic lateral sclerosis (ALS) and stroke, the proposed CNN achieved a normalized mean error of 1.79 in localizing the facial landmarks. We also tested our system in a real-life scenario on 11 bulbar-onset ALS subjects, obtaining promising outcomes in terms of facial landmark position estimation. Discussion and conclusions: This preliminary study represents a relevant step towards the use of remote tools to support clinicians in monitoring the evolution of dysarthria.
https://arxiv.org/abs/2309.09038
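The normalized mean error (NME) reported above is a standard landmark-localization metric. A minimal sketch of one common variant, normalizing by the inter-ocular distance, follows; the paper may use a different normalization term, and the eye-corner indices are assumptions.

```python
import numpy as np

def normalized_mean_error(pred: np.ndarray, gt: np.ndarray,
                          left_eye_idx: int, right_eye_idx: int) -> float:
    """Mean landmark error normalized by the inter-ocular distance.

    pred, gt: arrays of shape (N, 2) with predicted / ground-truth landmarks.
    The inter-ocular normalization is an assumption; other works normalize
    by the face-box diagonal instead.
    """
    per_point = np.linalg.norm(pred - gt, axis=1)             # (N,)
    inter_ocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return float(per_point.mean() / inter_ocular)

# Toy usage with random landmarks (indices 36 and 45 are hypothetical eye corners).
gt = np.random.rand(68, 2) * 100
pred = gt + np.random.randn(68, 2)
print(normalized_mean_error(pred, gt, left_eye_idx=36, right_eye_idx=45))
```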
We present Blendshapes GHUM, an on-device ML pipeline that predicts 52 facial blendshape coefficients at 30+ FPS on modern mobile phones, from a single monocular RGB image and enables facial motion capture applications like virtual avatars. Our main contributions are: i) an annotation-free offline method for obtaining blendshape coefficients from real-world human scans, ii) a lightweight real-time model that predicts blendshape coefficients based on facial landmarks.
https://arxiv.org/abs/2309.05782
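A minimal sketch of the second contribution's idea: a lightweight regressor mapping facial landmarks to 52 blendshape coefficients. The layer sizes, the 478-landmark input, and the sigmoid output range are assumptions for illustration, not the GHUM pipeline itself.

```python
import torch
import torch.nn as nn

class LandmarksToBlendshapes(nn.Module):
    """Toy MLP mapping flattened 2D landmarks to 52 blendshape coefficients."""
    def __init__(self, num_landmarks: int = 478, num_coeffs: int = 52):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_landmarks * 2, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_coeffs),
            nn.Sigmoid(),  # coefficients assumed to lie in [0, 1]
        )

    def forward(self, landmarks: torch.Tensor) -> torch.Tensor:
        # landmarks: (B, num_landmarks, 2) -> (B, num_coeffs)
        return self.net(landmarks.flatten(1))

model = LandmarksToBlendshapes()
coeffs = model(torch.randn(1, 478, 2))
print(coeffs.shape)  # torch.Size([1, 52])
```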
The incorporation of 3D data in facial analysis tasks has gained popularity in recent years. Though it provides a more accurate and detailed representation of the human face, acquiring 3D face data is more complex and expensive than 2D face images. One either has to rely on expensive 3D scanners or on depth sensors, which are prone to noise. An alternative option is the reconstruction of 3D faces from uncalibrated 2D images in an unsupervised way without any ground truth 3D data. However, such approaches are computationally expensive and the learned model size is not suitable for mobile or other edge device applications. Predicting dense 3D landmarks over the whole face can overcome this issue. As there is no public dataset available containing dense landmarks, we propose a pipeline to create a dense keypoint training dataset containing 520 key points across the whole face from existing facial position map data. We train a lightweight MobileNet-based regressor model with the generated data. As we do not have access to any evaluation dataset containing dense landmarks, we evaluate our model on the 68-keypoint detection task. Experimental results show that our trained model outperforms many existing methods despite its smaller model size and minimal computational cost. Also, the qualitative evaluation shows the effectiveness of our trained model under extreme head pose angles as well as other facial variations and occlusions.
https://arxiv.org/abs/2308.15170
The ability of humans to infer head poses from face shapes, and vice versa, indicates a strong correlation between the two. Accordingly, recent studies on face alignment have employed head pose information to predict facial landmarks in computer vision tasks. In this study, we propose a novel method that employs head pose information to improve face alignment performance by fusing said information with the feature maps of a face alignment network, rather than simply using it to initialize facial landmarks. Furthermore, the proposed network structure performs robust face alignment through a dual-dimensional network using multidimensional features represented by 2D feature maps and a 3D heatmap. For effective dense face alignment, we also propose a prediction method for facial geometric landmarks through training based on knowledge distillation using predicted keypoints. We experimentally assessed the correlation between the predicted facial landmarks and head pose information, as well as variations in the accuracy of facial landmarks with respect to the quality of head pose information. In addition, we demonstrated the effectiveness of the proposed method through a competitive performance comparison with state-of-the-art methods on the AFLW2000-3D, AFLW, and BIWI datasets.
https://arxiv.org/abs/2308.13327
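A minimal sketch of one way to fuse a head-pose vector with an alignment feature map, the kind of feature-level fusion the abstract above describes. The FiLM-style channel modulation below is a generic illustration under assumed dimensions, not the paper's specific network.

```python
import torch
import torch.nn as nn

class PoseFeatureFusion(nn.Module):
    """Modulate a feature map with scale/shift parameters predicted from head pose."""
    def __init__(self, channels: int = 64, pose_dim: int = 3):
        super().__init__()
        self.to_scale = nn.Linear(pose_dim, channels)
        self.to_shift = nn.Linear(pose_dim, channels)

    def forward(self, feat: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); pose: (B, 3) yaw/pitch/roll (assumed representation).
        scale = self.to_scale(pose)[:, :, None, None]
        shift = self.to_shift(pose)[:, :, None, None]
        return feat * (1.0 + scale) + shift

fusion = PoseFeatureFusion(channels=64, pose_dim=3)
out = fusion(torch.randn(2, 64, 32, 32), torch.randn(2, 3))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```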
Facial landmark detection is an essential technology for driver status tracking and is in demand for real-time estimation. For landmark coordinate prediction, heatmap-based methods are known to achieve high accuracy, and Lite-HRNet can achieve fast estimation. However, Lite-HRNet still suffers from the heavy computational cost of its fusion block, which connects feature maps of different resolutions. In addition, the strong output module used in HRNetV2 is not applied to Lite-HRNet. Given these problems, we propose a novel architecture called Lite-HRNet Plus. Lite-HRNet Plus achieves two improvements: a novel fusion block based on channel attention and a novel output module with less computational intensity using multi-resolution feature maps. Through experiments conducted on two facial landmark datasets, we confirmed that Lite-HRNet Plus further improved the accuracy in comparison with conventional methods, and achieved state-of-the-art accuracy with a computational complexity in the range of 10M FLOPs.
https://arxiv.org/abs/2308.12133
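A minimal sketch of a channel-attention fusion in the spirit of the fusion block described above: a squeeze-and-excitation-style gate applied when merging a low-resolution branch into a high-resolution one. This is not the actual Lite-HRNet Plus module, and the channel counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionFusion(nn.Module):
    """Fuse a low-resolution branch into a high-resolution one via a channel gate."""
    def __init__(self, channels: int = 40, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # Upsample the low-resolution map to the high-resolution size.
        low_up = F.interpolate(low, size=high.shape[-2:], mode="bilinear",
                               align_corners=False)
        # Global-average-pool the low branch to produce per-channel weights.
        w = self.gate(low_up.mean(dim=(2, 3)))        # (B, C)
        return high + low_up * w[:, :, None, None]    # attention-weighted fusion

fusion = ChannelAttentionFusion(channels=40)
out = fusion(torch.randn(1, 40, 64, 64), torch.randn(1, 40, 32, 32))
print(out.shape)  # torch.Size([1, 40, 64, 64])
```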
This paper addresses the challenge of transferring the behavior expressivity style of a virtual agent to another one while preserving behavior shapes, as they carry communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. We assume that behavior expressivity style is encoded across various modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement schema to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and allows generalization to styles that have not been seen during the training phase. We train our model on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state-of-the-art models in style transfer for both styles seen and unseen during training. To tackle the issues of style and content leakage that may arise, we propose a methodology to assess the degree to which behaviors and gestures associated with the target style are successfully transferred, while ensuring the preservation of those related to the source content.
https://arxiv.org/abs/2308.10843
Parkinson's disease (PD) diagnosis remains challenging due to the lack of a reliable biomarker and limited access to clinical care. In this study, we present an analysis of the largest video dataset containing micro-expressions to screen for PD. We collected 3,871 videos from 1,059 unique participants, including 256 self-reported PD patients. The recordings are from diverse sources encompassing participants' homes across multiple countries, a clinic, and a PD care facility in the US. Leveraging facial landmarks and action units, we extracted features relevant to hypomimia, a prominent symptom of PD characterized by reduced facial expressions. An ensemble of AI models trained on these features achieved an accuracy of 89.7% and an Area Under the Receiver Operating Characteristic curve (AUROC) of 89.3%, while being free from detectable bias across population subgroups based on sex and ethnicity on held-out data. Further analysis reveals that features from the smiling videos alone lead to comparable performance, even on two external test sets the model has never seen during training, suggesting the potential for PD risk assessment from smiling selfie videos.
https://arxiv.org/abs/2308.02588
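A minimal illustration of the kind of landmark-derived feature such a pipeline might use to capture reduced facial expressiveness: the variability of mouth opening over a video. The specific feature and the landmark indices are hypothetical and not the study's feature set.

```python
import numpy as np

def mouth_opening_variability(landmarks_seq: np.ndarray,
                              upper_lip_idx: int = 62,
                              lower_lip_idx: int = 66) -> float:
    """Standard deviation of mouth opening across frames.

    landmarks_seq: (T, N, 2) array of per-frame 2D facial landmarks.
    Reduced variability over an expressive task (e.g. smiling) could hint at
    hypomimia; the indices follow the common 68-point convention (an assumption).
    """
    opening = np.linalg.norm(
        landmarks_seq[:, upper_lip_idx] - landmarks_seq[:, lower_lip_idx], axis=1)
    return float(opening.std())

# Toy usage on a random 90-frame sequence of 68 landmarks.
seq = np.random.rand(90, 68, 2) * 100
print(mouth_opening_variability(seq))
```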
Depression is a common mental health disorder that can cause consequential symptoms, with a continuously depressed mood leading to emotional distress. One category of depression is Concealed Depression, where patients intentionally or unintentionally hide their genuine emotions through exterior optimism, thereby complicating and delaying diagnosis and treatment and leading to unexpected suicides. In this paper, we propose to diagnose concealed depression by using facial micro-expressions (FMEs) to detect and recognize underlying true emotions. However, the extremely low intensity and subtle nature of FMEs make their recognition a tough task. We propose a facial landmark-based Region-of-Interest (ROI) approach to address the challenge, and describe a low-cost and privacy-preserving solution that enables self-diagnosis using portable mobile devices in a personal setting (e.g., at home). We present results and findings that validate our method, and discuss other technical challenges and future directions in applying such techniques to real clinical settings.
https://arxiv.org/abs/2307.15862
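A minimal sketch of a landmark-based region-of-interest crop, the kind of preprocessing the abstract above refers to. The margin and the eyebrow indices are assumptions for illustration.

```python
import numpy as np

def crop_roi(frame: np.ndarray, landmarks: np.ndarray,
             roi_indices: list, margin: int = 8) -> np.ndarray:
    """Crop a rectangular region around a subset of facial landmarks.

    frame: (H, W, 3) image; landmarks: (N, 2) pixel coordinates (x, y).
    roi_indices selects the landmarks defining the region (e.g. one eyebrow).
    """
    pts = landmarks[roi_indices]
    x0, y0 = np.floor(pts.min(axis=0)).astype(int) - margin
    x1, y1 = np.ceil(pts.max(axis=0)).astype(int) + margin
    h, w = frame.shape[:2]
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, w), min(y1, h)
    return frame[y0:y1, x0:x1]

# Toy usage: crop a hypothetical left-eyebrow region (68-point indices 17-21).
frame = np.zeros((480, 640, 3), dtype=np.uint8)
landmarks = np.random.rand(68, 2) * [640, 480]
roi = crop_roi(frame, landmarks, roi_indices=list(range(17, 22)))
print(roi.shape)
```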
Facial expression is related to facial muscle contractions, and different muscle movements correspond to different emotional states. For micro-expression recognition, the muscle movements are usually subtle, which has a negative impact on the performance of current facial emotion recognition algorithms. Most existing methods use self-attention mechanisms to capture relationships between tokens in a sequence, but they do not take into account the inherent spatial relationships between facial landmarks. This can result in sub-optimal performance on micro-expression recognition tasks. Therefore, learning to recognize facial muscle movements is a key challenge in the area of micro-expression recognition. In this paper, we propose a Hierarchical Transformer Network (HTNet) to identify critical areas of facial muscle movement. HTNet includes two major components: a transformer layer that leverages local temporal features and an aggregation layer that extracts local and global semantic facial features. Specifically, HTNet divides the face into four different facial areas: left lip area, left eye area, right eye area, and right lip area. The transformer layer focuses on representing local minor muscle movements with local self-attention in each area. The aggregation layer learns the interactions between the eye areas and lip areas. Experiments on four publicly available micro-expression datasets show that the proposed approach outperforms previous methods by a large margin. The codes and models are available at: this https URL
https://arxiv.org/abs/2307.14637
In this research work, we propose ChildGAN, a pair of GAN networks derived from StyleGAN2 for generating synthetic facial data of boys and girls. ChildGAN is built by performing smooth domain transfer using transfer learning. It provides photo-realistic, high-quality data samples. A large-scale dataset is rendered with a variety of smart facial transformations: facial expressions, age progression, eye blink effects, head pose, skin and hair color variations, and variable lighting conditions. The dataset comprises more than 300k distinct data samples. Further, the uniqueness and characteristics of the rendered facial features are validated by running different computer vision application tests, which include a CNN-based child gender classifier, a face localization and facial landmark detection test, an identity similarity evaluation using ArcFace, and finally eye detection and eye aspect ratio tests. The results demonstrate that high-quality synthetic child facial data offers an alternative to the cost and complexity of collecting a large-scale dataset from real children.
https://arxiv.org/abs/2307.13746
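The eye aspect ratio (EAR) test mentioned above is a standard measure for validating eye openness. A minimal sketch using the common six-point eye contour follows; it uses the widely cited Soukupová–Čech definition, which may or may not match the paper's exact variant.

```python
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """EAR for six eye-contour landmarks ordered p1..p6.

    eye: (6, 2) array; p1/p4 are the horizontal corners, the others form
    two vertical pairs. EAR drops towards zero as the eye closes, which is
    useful for checking rendered blink effects.
    """
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return float((v1 + v2) / (2.0 * h))

# Toy usage on an open-eye-like contour.
eye = np.array([[0, 0], [2, -1], [4, -1], [6, 0], [4, 1], [2, 1]], dtype=float)
print(eye_aspect_ratio(eye))  # ~0.333
```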
Facial expression recognition (FER) remains a challenging task due to the ambiguity of expressions. The derived noisy labels significantly harm the performance in real-world scenarios. To address this issue, we present a new FER model named Landmark-Aware Net (LA-Net), which leverages facial landmarks to mitigate the impact of label noise from two perspectives. Firstly, LA-Net uses landmark information to suppress the uncertainty in expression space and constructs the label distribution of each sample by neighborhood aggregation, which in turn improves the quality of training supervision. Secondly, the model incorporates landmark information into expression representations using the devised expression-landmark contrastive loss. The enhanced expression feature extractor can be less susceptible to label noise. Our method can be integrated with any deep neural network for better training supervision without introducing extra inference costs. We conduct extensive experiments on both in-the-wild datasets and synthetic noisy datasets and demonstrate that LA-Net achieves state-of-the-art performance.
https://arxiv.org/abs/2307.09023
MOBIO is a bi-modal database that was captured almost exclusively on mobile phones. It aims to advance research into deploying biometric techniques on mobile devices. Research has shown that face and speaker recognition can be performed in a mobile environment. Facial landmark localization aims at finding the coordinates of a set of pre-defined key points in 2D face images. A facial landmark usually has a specific semantic meaning, e.g. nose tip or eye centre, which provides rich geometric information for other face analysis tasks such as face recognition, emotion estimation, and 3D face reconstruction. Most facial landmark detection methods adopt still-face databases, such as 300W, AFW, AFLW, or COFW, for evaluation, but seldom use mobile data. Our work is the first to perform facial landmark detection evaluation on mobile still data, i.e., face images from the MOBIO database. About 20,600 face images have been extracted from this audio-visual database and manually labeled with 22 landmarks as the ground truth. Several state-of-the-art facial landmark detection methods are adopted to evaluate their performance on these data. The results show that the data from the MOBIO database are quite challenging. This database can thus serve as a new, challenging benchmark for facial landmark detection evaluation.
https://arxiv.org/abs/2307.03329
Animating still face images with deep generative models using a speech input signal is an active research topic and has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality while the generation of natural head motion, let alone the audio-visual correlation between head motion and speech, has often been neglected. In this work, we propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short and long-term correlation between speech and the dynamics of the head and lips. In particular, we train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network to produce audio-aligned motion unfolding over diverse time scales. Our generator operates in the facial landmark domain, which is a standard low-dimensional head representation. The experiments show significant improvements over the state of the art in head motion dynamics quality and in multi-scale audio-visual synchrony both in the landmark domain and in the image domain.
https://arxiv.org/abs/2307.03270
This technical report describes our QuAVF@NTU-NVIDIA submission to the Ego4D Talking to Me (TTM) Challenge 2023. Based on the observation from the TTM task and the provided dataset, we propose to use two separate models to process the input videos and audio. By doing so, we can utilize all the labeled training data, including those without bounding box labels. Furthermore, we leverage the face quality score from a facial landmark prediction model for filtering noisy face input data. The face quality score is also employed in our proposed quality-aware fusion for integrating the results from two branches. With the simple architecture design, our model achieves 67.4% mean average precision (mAP) on the test set, which ranks first on the leaderboard and outperforms the baseline method by a large margin. Code is available at: this https URL
https://arxiv.org/abs/2306.17404
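A minimal sketch of a quality-aware fusion of two branch scores, in the spirit of what the abstract above describes. The linear weighting rule below is an assumption for illustration, not the submission's exact fusion scheme.

```python
import numpy as np

def quality_aware_fusion(visual_score: float, audio_score: float,
                         face_quality: float) -> float:
    """Blend visual and audio branch scores by a face-quality weight in [0, 1].

    When the face crop is unreliable (low quality), the fused decision leans
    on the audio branch instead.
    """
    q = float(np.clip(face_quality, 0.0, 1.0))
    return q * visual_score + (1.0 - q) * audio_score

# Toy usage: a poor face crop shifts the decision toward the audio branch.
print(quality_aware_fusion(visual_score=0.9, audio_score=0.3, face_quality=0.2))  # 0.42
```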
Face anti-spoofing (FAS) is indispensable for a face recognition system. Many texture-driven countermeasures have been developed against presentation attacks (PAs), but their performance against unseen domains or unseen spoofing types is still unsatisfactory. Instead of exhaustively collecting all the spoofing variations and making binary live/spoof decisions, we offer a new perspective on the FAS task: distinguishing between the normal and abnormal movements of live and spoof presentations. We propose the Geometry-Aware Interaction Network (GAIN), which exploits dense facial landmarks with a spatio-temporal graph convolutional network (ST-GCN) to establish a more interpretable and modularized FAS model. Additionally, with our cross-attention feature interaction mechanism, GAIN can be easily integrated with other existing methods to significantly boost performance. Our approach achieves state-of-the-art performance in the standard intra- and cross-dataset evaluations. Moreover, our model outperforms state-of-the-art methods by a large margin in the cross-dataset, cross-type protocol on CASIA-SURF 3DMask (+10.26% higher AUC score), exhibiting strong robustness against domain shifts and unseen spoofing types.
https://arxiv.org/abs/2306.14313
The objective of style transfer is to maintain the content of an image while transferring the style of another image. However, conventional research on style transfer has a significant limitation in preserving facial landmarks, such as the eyes, nose, and mouth, which are crucial for maintaining the identity of the image. In Korean portraits, the majority of individuals wear the "Gat", a type of headdress exclusively worn by men. Because its characteristics differ markedly from the hair in ID photos, transferring the "Gat" is challenging. To address this issue, this study proposes a deep learning network that can perform style transfer, including the "Gat", while preserving the identity of the face. Unlike existing style transfer approaches, the proposed method aims to preserve the texture, costume, and "Gat" of the style image. A generative adversarial network forms the backbone of the proposed network. Color, texture, and intensity are extracted differently depending on the characteristics of each block and layer of the pre-trained VGG-16, and only the necessary elements are preserved during training using a facial landmark mask. The head area is represented using the eyebrow area so that the "Gat" can be transferred. Furthermore, the identity of the face is retained, and style correlation is considered based on the Gram matrix. The proposed approach demonstrates superior transfer and preservation performance compared to previous studies.
https://arxiv.org/abs/2306.13418
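The Gram-matrix style correlation mentioned above follows standard neural style transfer practice. A minimal sketch of the computation on a feature map; the normalization factor is a common convention and an assumption here.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Channel-wise Gram matrix of a feature map, as in neural style transfer.

    features: (B, C, H, W). Returns (B, C, C), normalized by C * H * W.
    Matching Gram matrices between two images matches their style statistics.
    """
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

# Toy usage on a random VGG-like feature map.
g = gram_matrix(torch.randn(1, 64, 56, 56))
print(g.shape)  # torch.Size([1, 64, 64])
```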
Audio-driven facial reenactment is a crucial technique that has a range of applications in film-making, virtual avatars and video conferences. Existing works either employ explicit intermediate face representations (e.g., 2D facial landmarks or 3D face models) or implicit ones (e.g., Neural Radiance Fields), thus suffering from the trade-offs between interpretability and expressive power, hence between controllability and quality of the results. In this work, we break these trade-offs with our novel parametric implicit face representation and propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads. Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models, thereby taking the best of both explicit and implicit methods. In addition, we propose several new techniques to improve the three components of our framework, including i) incorporating contextual information into the audio-to-expression parameters encoding; ii) using conditional image synthesis to parameterize the implicit representation and implementing it with an innovative tri-plane structure for efficient learning; iii) formulating facial reenactment as a conditional image inpainting problem and proposing a novel data augmentation technique to improve model generalizability. Extensive experiments demonstrate that our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.
https://arxiv.org/abs/2306.07579
Recently, deep learning-based facial landmark detection has achieved significant improvement. However, the semantic ambiguity problem degrades detection performance. Specifically, semantic ambiguity causes inconsistent annotation and negatively affects the model's convergence, leading to worse accuracy and unstable predictions. To solve this problem, we propose a Self-adapTive Ambiguity Reduction (STAR) loss that exploits the properties of semantic ambiguity. We find that semantic ambiguity results in an anisotropic predicted distribution, which inspires us to use the predicted distribution to represent semantic ambiguity. Based on this, we design the STAR loss to measure the anisotropy of the predicted distribution. Compared with the standard regression loss, the STAR loss is encouraged to be small when the predicted distribution is anisotropic and thus adaptively mitigates the impact of semantic ambiguity. Moreover, we propose two kinds of eigenvalue restriction methods that avoid both abnormal changes in the distribution and premature convergence of the model. Finally, comprehensive experiments demonstrate that the STAR loss outperforms state-of-the-art methods on three benchmarks, i.e., COFW, 300W, and WFLW, with negligible computation overhead. Code is at this https URL.
https://arxiv.org/abs/2306.02763
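A minimal sketch of one way to quantify the anisotropy the abstract above refers to: the eigenvalue ratio of a predicted heatmap's 2D covariance. This only illustrates the underlying idea and is not the STAR loss itself.

```python
import numpy as np

def heatmap_anisotropy(heatmap: np.ndarray) -> float:
    """Ratio of the largest to smallest eigenvalue of the heatmap's covariance.

    heatmap: (H, W) non-negative predicted distribution for one landmark.
    A ratio near 1 indicates an isotropic (low-ambiguity) prediction; large
    ratios indicate the elongated distributions caused by semantic ambiguity.
    """
    ys, xs = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    w = heatmap / heatmap.sum()
    mx, my = (w * xs).sum(), (w * ys).sum()
    cxx = (w * (xs - mx) ** 2).sum()
    cyy = (w * (ys - my) ** 2).sum()
    cxy = (w * (xs - mx) * (ys - my)).sum()
    eig = np.linalg.eigvalsh(np.array([[cxx, cxy], [cxy, cyy]]))
    return float(eig[1] / max(eig[0], 1e-12))

# Toy usage: an elongated Gaussian blob is clearly anisotropic.
ys, xs = np.mgrid[0:64, 0:64]
blob = np.exp(-(((xs - 32) / 10.0) ** 2 + ((ys - 32) / 3.0) ** 2))
print(heatmap_anisotropy(blob))  # >> 1
```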
In the realm of facial analysis, accurate landmark detection is crucial for various applications, ranging from face recognition and expression analysis to animation. Conventional heatmap or coordinate regression-based techniques, however, often face challenges in terms of computational burden and quantization errors. To address these issues, we present the KeyPoint Positioning System (KeyPosS), a groundbreaking facial landmark detection framework that stands out from existing methods. For the first time, KeyPosS employs the True-range Multilateration algorithm, a technique originally used in GPS systems, to achieve rapid and precise facial landmark detection without relying on computationally intensive regression approaches. The framework utilizes a fully convolutional network to predict a distance map, which computes the distance between a Point of Interest (POI) and multiple anchor points. These anchor points are ingeniously harnessed to triangulate the POI's position through the True-range Multilateration algorithm. Notably, the plug-and-play nature of KeyPosS enables seamless integration into any decoding stage, ensuring a versatile and adaptable solution. We conducted a thorough evaluation of KeyPosS's performance by benchmarking it against state-of-the-art models on four different datasets. The results show that KeyPosS substantially outperforms leading methods in low-resolution settings while requiring a minimal time overhead. The code is available at this https URL.
https://arxiv.org/abs/2305.16437
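A minimal sketch of the true-range multilateration step described above: recovering a point from its distances to known anchor points via linear least squares. The anchor layout is arbitrary and this is not the KeyPosS implementation.

```python
import numpy as np

def multilaterate(anchors: np.ndarray, distances: np.ndarray) -> np.ndarray:
    """Estimate a 2D point from distances to known anchor points.

    anchors: (K, 2), distances: (K,). Subtracting the first anchor's circle
    equation from the others linearizes the problem into A @ p = b.
    """
    a0, d0 = anchors[0], distances[0]
    A = 2.0 * (anchors[1:] - a0)
    b = (d0 ** 2 - distances[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1) - np.sum(a0 ** 2))
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Toy usage: recover a point from exact distances to four anchors.
anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
target = np.array([3.0, 7.0])
distances = np.linalg.norm(anchors - target, axis=1)
print(multilaterate(anchors, distances))  # ~[3. 7.]
```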