Recently, deep learning-based facial landmark detection has achieved significant improvement. However, the semantic ambiguity problem degrades detection performance. Specifically, semantic ambiguity causes inconsistent annotations and negatively affects the model's convergence, leading to worse accuracy and unstable predictions. To solve this problem, we propose a Self-adapTive Ambiguity Reduction (STAR) loss that exploits the properties of semantic ambiguity. We find that semantic ambiguity results in an anisotropic predicted distribution, which inspires us to use the predicted distribution to represent semantic ambiguity. Based on this, we design the STAR loss to measure the anisotropy of the predicted distribution. Compared with the standard regression loss, STAR loss is encouraged to be small when the predicted distribution is anisotropic and thus adaptively mitigates the impact of semantic ambiguity. Moreover, we propose two kinds of eigenvalue restriction methods that avoid both abnormal changes in the distribution and premature convergence of the model. Finally, comprehensive experiments demonstrate that STAR loss outperforms state-of-the-art methods on three benchmarks, i.e., COFW, 300W, and WFLW, with negligible computation overhead. Code is at this https URL.
https://arxiv.org/abs/2306.02763
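To make the anisotropy idea concrete, the sketch below (PyTorch) estimates the mean and covariance of a predicted 2D heatmap and then scales the regression error along each principal axis by the corresponding spread, so error along a high-variance (ambiguous) direction is penalized less. This is a minimal illustration under assumed conventions (soft-argmax decoding, per-landmark heatmaps), not the paper's exact STAR formulation or its eigenvalue restrictions.

```python
import torch

def heatmap_stats(heatmap):
    """Mean and 2x2 covariance of an (H, W) heatmap, treated as a distribution."""
    h, w = heatmap.shape
    prob = heatmap.flatten().softmax(dim=0).reshape(h, w)
    ys = torch.arange(h, dtype=prob.dtype)
    xs = torch.arange(w, dtype=prob.dtype)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_x, grid_y], dim=-1)            # (H, W, 2), (x, y) order
    mu = (prob.unsqueeze(-1) * coords).sum(dim=(0, 1))        # soft-argmax coordinate
    diff = coords - mu
    cov = torch.einsum("hw,hwi,hwj->ij", prob, diff, diff)    # predicted covariance
    return mu, cov

def anisotropy_aware_loss(heatmap, target_xy, eps=1e-6):
    """Illustrative loss: express the error in the principal frame of the
    predicted distribution and divide each component by its spread, so the
    ambiguous (large-variance) direction is down-weighted."""
    mu, cov = heatmap_stats(heatmap)
    eigvals, eigvecs = torch.linalg.eigh(cov)                 # ascending eigenvalues
    err = eigvecs.T @ (mu - target_xy)                        # error in principal axes
    return (err.abs() / eigvals.clamp_min(eps).sqrt()).sum()
```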
In the realm of facial analysis, accurate landmark detection is crucial for various applications, ranging from face recognition and expression analysis to animation. Conventional heatmap or coordinate regression-based techniques, however, often face challenges in terms of computational burden and quantization errors. To address these issues, we present the KeyPoint Positioning System (KeyPosS), a groundbreaking facial landmark detection framework that stands out from existing methods. For the first time, KeyPosS employs the True-range Multilateration algorithm, a technique originally used in GPS systems, to achieve rapid and precise facial landmark detection without relying on computationally intensive regression approaches. The framework utilizes a fully convolutional network to predict a distance map, which computes the distance between a Point of Interest (POI) and multiple anchor points. These anchor points are ingeniously harnessed to triangulate the POI's position through the True-range Multilateration algorithm. Notably, the plug-and-play nature of KeyPosS enables seamless integration into any decoding stage, ensuring a versatile and adaptable solution. We conducted a thorough evaluation of KeyPosS's performance by benchmarking it against state-of-the-art models on four different datasets. The results show that KeyPosS substantially outperforms leading methods in low-resolution settings while requiring a minimal time overhead. The code is available at this https URL.
https://arxiv.org/abs/2305.16437
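The decoding step described above reduces to true-range multilateration: given anchor positions and predicted anchor-to-point distances, solve for the point by least squares. Below is a generic NumPy solver for that sub-problem with made-up anchors and distances; it shows the standard linearized formulation, not KeyPosS's specific distance-map layout.

```python
import numpy as np

def multilaterate(anchors, dists):
    """Estimate a 2D point from anchor positions and point-to-anchor distances.
    Subtracting the first range equation from the others removes the quadratic
    term, leaving a linear least-squares problem."""
    anchors = np.asarray(anchors, dtype=float)    # (N, 2)
    dists = np.asarray(dists, dtype=float)        # (N,)
    a0, d0 = anchors[0], dists[0]
    A = 2.0 * (anchors[1:] - a0)
    b = (d0**2 - dists[1:]**2
         + np.sum(anchors[1:]**2, axis=1) - np.sum(a0**2))
    point, *_ = np.linalg.lstsq(A, b, rcond=None)
    return point

# Hypothetical example: four anchors at the corners of a 4x4 grid and
# distances read off a predicted distance map for a point near (2, 2).
anchors = [(0, 0), (4, 0), (0, 4), (4, 4)]
dists = [2.83, 2.83, 2.83, 2.83]
print(multilaterate(anchors, dists))              # approximately [2, 2]
```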
Detecting 3D mask attacks to a face recognition system is challenging. Although genuine faces and 3D face masks show significantly different remote photoplethysmography (rPPG) signals, rPPG-based face anti-spoofing methods often suffer from performance degradation due to unstable face alignment in the video sequence and weak rPPG signals. To enhance the rPPG signal in a motion-robust way, a landmark-anchored face stitching method is proposed to align the faces robustly and precisely at the pixel-wise level by using both SIFT keypoints and facial landmarks. To better encode the rPPG signal, a weighted spatial-temporal representation is proposed, which emphasizes the face regions with rich blood vessels. In addition, characteristics of rPPG signals in different color spaces are jointly utilized. To improve the generalization capability, a lightweight EfficientNet with a Gated Recurrent Unit (GRU) is designed to extract both spatial and temporal features from the rPPG spatial-temporal representation for classification. The proposed method is compared with the state-of-the-art methods on five benchmark datasets under both intra-dataset and cross-dataset evaluations. The proposed method shows a significant and consistent improvement in performance over other state-of-the-art rPPG-based methods for face spoofing detection.
https://arxiv.org/abs/2305.15940
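As a rough illustration of what a weighted spatial-temporal rPPG representation can look like, the sketch below averages the green channel of each aligned face region per frame and scales it by a per-region weight (larger for regions rich in blood vessels). The region layout, channel choice, and weighting scheme are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def weighted_st_map(frames, region_masks, region_weights):
    """Build a (regions x time) map from aligned face frames.
    frames: (T, H, W, 3) array; region_masks: list of (H, W) boolean masks;
    region_weights: per-region scalars emphasizing vessel-rich areas."""
    st_map = np.zeros((len(region_masks), len(frames)), dtype=float)
    for r, (mask, weight) in enumerate(zip(region_masks, region_weights)):
        for t, frame in enumerate(frames):
            st_map[r, t] = weight * frame[..., 1][mask].mean()
    # Remove each region's DC component so the periodic pulse stands out.
    st_map -= st_map.mean(axis=1, keepdims=True)
    return st_map
```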
Recently, talking face generation has drawn ever-increasing attention from the computer vision research community due to its arduous challenges and widespread application scenarios, e.g. movie animation and virtual anchors. Although persevering efforts have been made to enhance the fidelity and lip-sync quality of generated talking face videos, there is still large room for further improvement in synthesis quality and efficiency. Actually, these attempts somewhat ignore the exploration of fine-granularity feature extraction/integration and the consistency between probability distributions of landmarks, thereby incurring the issues of blurred local details and degraded fidelity. To mitigate these dilemmas, in this paper, a novel CLIP-based Attention and Probability Map Guided Network (CPNet) is delicately designed for inferring high-fidelity talking face videos. Specifically, considering the demands of fine-grained feature recalibration, a CLIP-based attention condenser is exploited to transfer knowledge with rich semantic priors from the prevailing CLIP model. Moreover, to guarantee consistency in probability space and suppress landmark ambiguity, we creatively propose the density map of facial landmarks as an auxiliary supervisory signal to guide the landmark distribution learning of generated frames. Extensive experiments on the widely-used benchmark dataset demonstrate the superiority of our CPNet against the state of the art in terms of image and lip-sync quality. In addition, a cohort of ablation studies is conducted to analyze the impact of the individual pivotal components.
https://arxiv.org/abs/2305.13962
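The auxiliary supervisory signal mentioned above, a density map of facial landmarks, can be rendered as a sum of Gaussians centred on the landmark coordinates. A minimal sketch follows; the kernel width and normalization are assumptions rather than the paper's settings.

```python
import numpy as np

def landmark_density_map(landmarks, height, width, sigma=2.0):
    """Sum-of-Gaussians density map for a set of (x, y) landmark coordinates,
    normalized to sum to one over the image grid."""
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width), dtype=float)
    for x, y in landmarks:
        density += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return density / max(density.sum(), 1e-8)
```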
Synthesizing high-fidelity head avatars is a central problem for computer vision and graphics. While head avatar synthesis algorithms have advanced rapidly, the best ones still face great obstacles in real-world scenarios. One of the vital causes is inadequate datasets -- 1) current public datasets can only support researchers to explore high-fidelity head avatars in one or two task directions; 2) these datasets usually contain digital head assets with limited data volume, and narrow distribution over different attributes. In this paper, we present RenderMe-360, a comprehensive 4D human head dataset to drive advances in head avatar research. It contains massive data assets, with 243+ million complete head frames, and over 800k video sequences from 500 different identities captured by synchronized multi-view cameras at 30 FPS. It is a large-scale digital library for head avatars with three key attributes: 1) High Fidelity: all subjects are captured by 60 synchronized, high-resolution 2K cameras in 360 degrees. 2) High Diversity: The collected subjects vary from different ages, eras, ethnicities, and cultures, providing abundant materials with distinctive styles in appearance and geometry. Moreover, each subject is asked to perform various motions, such as expressions and head rotations, which further extend the richness of assets. 3) Rich Annotations: we provide annotations with different granularities: cameras' parameters, matting, scan, 2D/3D facial landmarks, FLAME fitting, and text description. Based on the dataset, we build a comprehensive benchmark for head avatar research, with 16 state-of-the-art methods performed on five main tasks: novel view synthesis, novel expression synthesis, hair rendering, hair editing, and talking head generation. Our experiments uncover the strengths and weaknesses of current methods. RenderMe-360 opens the door for future exploration in head avatars.
https://arxiv.org/abs/2305.13353
We present software that predicts non-cleft facial images for patients with cleft lip, thereby facilitating the understanding, awareness, and discussion of cleft lip surgeries. To protect patients' privacy, we design a software framework using image inpainting that does not require cleft lip images for training, thereby mitigating the risk of model leakage. We implement a novel multi-task architecture that predicts both the non-cleft facial image and facial landmarks, resulting in better performance as evaluated by surgeons. The software is implemented with PyTorch and works on consumer-level color images with a fast prediction speed, enabling effective deployment.
https://arxiv.org/abs/2305.10589
This paper presents a fully automatic registration method of dental cone-beam computed tomography (CBCT) and face scan data. It can be used for a digital platform of 3D jaw-teeth-face models in a variety of applications, including 3D digital treatment planning and orthognathic surgery. Difficulties in accurately merging facial scans and CBCT images are due to the different image acquisition methods and limited area of correspondence between the two facial surfaces. In addition, it is difficult to use machine learning techniques because they use face-related 3D medical data with radiation exposure, which are difficult to obtain for training. The proposed method addresses these problems by reusing an existing machine-learning-based 2D landmark detection algorithm in an open-source library and developing a novel mathematical algorithm that identifies paired 3D landmarks from knowledge of the corresponding 2D landmarks. A main contribution of this study is that the proposed method does not require annotated training data of facial landmarks because it uses a pre-trained facial landmark detection algorithm that is known to be robust and generalized to various 2D face image models. Note that this reduces a 3D landmark detection problem to a 2D problem of identifying the corresponding landmarks on two 2D projection images generated from two different projection angles. Here, the 3D landmarks for registration were selected from the sub-surfaces with the least geometric change under the CBCT and face scan environments. For the final fine-tuning of the registration, the Iterative Closest Point method was applied, which utilizes geometrical information around the 3D landmarks. The experimental results show that the proposed method achieved an averaged surface distance error of 0.74 mm for three pairs of CBCT and face scan datasets.
https://arxiv.org/abs/2305.10132
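The geometric core of the 2D-to-3D step is small: if a landmark has been detected in two (or more) projection images rendered with known rotations, its 3D position can be recovered by least squares. The sketch below assumes simple orthographic projections with no translation or scaling, a simplification of the paper's setup; it only illustrates how 2D correspondences pin down a 3D landmark.

```python
import numpy as np

def lift_to_3d(rotations, landmarks_2d):
    """Recover one 3D point from its 2D positions in several orthographic
    projections.  rotations: list of 3x3 rotation matrices used to render
    the projections; landmarks_2d: matching list of (u, v) detections."""
    A, b = [], []
    for R, (u, v) in zip(rotations, landmarks_2d):
        # Orthographic projection keeps the first two components of R @ X.
        A.append(R[0]); b.append(u)
        A.append(R[1]); b.append(v)
    X, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return X
```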
While current talking head models are capable of generating photorealistic talking head videos, they provide limited pose controllability. Most methods require specific video sequences that contain exactly the desired head pose, which is far from user-friendly pose control. Three-dimensional morphable models (3DMM) offer semantic pose control, but they fail to capture certain expressions. We present a novel method that utilizes parametric control of head orientation and facial expression over a pre-trained neural talking head model. To enable this, we introduce a landmark-parameter morphable model (LPMM), which offers control over the facial landmark domain through a set of semantic parameters. Using LPMM, it is possible to adjust specific head pose factors without distorting other facial attributes. The results show our approach provides intuitive rig-like control over neural talking head models, allowing both parameter- and image-based inputs.
https://arxiv.org/abs/2305.10456
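To give a feel for what a landmark-domain morphable model provides, here is a toy PCA version: a mean landmark shape plus a linear basis, controlled by a low-dimensional parameter vector. The actual LPMM and the semantics of its parameters are defined in the paper; this sketch only illustrates the parameter-to-landmark mapping.

```python
import numpy as np

class LandmarkPCAModel:
    """Toy linear morphable model over 2D facial landmarks."""

    def __init__(self, shapes, n_components=10):
        # shapes: (N, K, 2) array of N annotated landmark sets with K points each.
        flat = shapes.reshape(len(shapes), -1)
        self.mean = flat.mean(axis=0)
        _, _, vt = np.linalg.svd(flat - self.mean, full_matrices=False)
        self.basis = vt[:n_components]                    # (C, 2K) orthonormal rows

    def decode(self, params):
        """Map a (C,) parameter vector to a (K, 2) landmark configuration."""
        return (self.mean + params @ self.basis).reshape(-1, 2)

    def encode(self, landmarks):
        """Project a (K, 2) landmark configuration onto the parameter space."""
        return (landmarks.reshape(-1) - self.mean) @ self.basis.T
```

Editing a single parameter then moves the whole landmark set along one basis direction, which is the kind of rig-like control the abstract refers to.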
Facial action unit (AU) detection is challenging due to the difficulty in capturing correlated information from subtle and dynamic AUs. Existing methods often resort to the localization of correlated regions of AUs, in which predefining local AU attentions by correlated facial landmarks often discards essential parts, or learning global attention maps often contains irrelevant areas. Furthermore, existing relational reasoning methods often employ common patterns for all AUs while ignoring the specific way of each AU. To tackle these limitations, we propose a novel adaptive attention and relation (AAR) framework for facial AU detection. Specifically, we propose an adaptive attention regression network to regress the global attention map of each AU under the constraint of attention predefinition and the guidance of AU detection, which is beneficial for capturing both specified dependencies by landmarks in strongly correlated regions and facial globally distributed dependencies in weakly correlated regions. Moreover, considering the diversity and dynamics of AUs, we propose an adaptive spatio-temporal graph convolutional network to simultaneously reason the independent pattern of each AU, the inter-dependencies among AUs, as well as the temporal dependencies. Extensive experiments show that our approach (i) achieves competitive performance on challenging benchmarks including BP4D, DISFA, and GFT in constrained scenarios and Aff-Wild2 in unconstrained scenarios, and (ii) can precisely learn the regional correlation distribution of each AU.
https://arxiv.org/abs/2001.01168
Animal affective computing is a quickly growing field of research, where only recently have the first efforts emerged to go beyond animal tracking into recognizing their internal states, such as pain and emotions. In most mammals, facial expressions are an important channel for communicating information about these states. However, unlike the human domain, there is an acute lack of datasets that make automation of facial analysis of animals feasible. This paper aims to fill this gap by presenting a dataset called Cat Facial Landmarks in the Wild (CatFLW), which contains 2016 images of cat faces in different environments and conditions, annotated with 48 facial landmarks specifically chosen for their relationship with underlying musculature and their relevance to cat-specific facial Action Units (CatFACS). To the best of our knowledge, this dataset has the largest number of cat facial landmarks available. In addition, we describe a semi-supervised (human-in-the-loop) method of annotating images with landmarks, used for creating this dataset, which significantly reduces the annotation time and could be used for creating similar datasets for other animals. The dataset is available on request.
https://arxiv.org/abs/2305.04232
High-quality reconstruction of controllable 3D head avatars from 2D videos is highly desirable for virtual human applications in movies, games, and telepresence. Neural implicit fields provide a powerful representation to model 3D head avatars with personalized shape, expressions, and facial parts, e.g., hair and mouth interior, that go beyond the linear 3D morphable model (3DMM). However, existing methods do not model faces with fine-scale facial features, or local control of facial parts that extrapolate asymmetric expressions from monocular videos. Further, most condition only on 3DMM parameters with poor(er) locality, and resolve local features with a global neural field. We build on part-based implicit shape models that decompose a global deformation field into local ones. Our novel formulation models multiple implicit deformation fields with local semantic rig-like control via 3DMM-based parameters, and representative facial landmarks. Further, we propose a local control loss and attention mask mechanism that promote sparsity of each learned deformation field. Our formulation renders sharper locally controllable nonlinear deformations than previous implicit monocular approaches, especially mouth interior, asymmetric expressions, and facial details.
https://arxiv.org/abs/2304.11113
Recently, event cameras have shown large applicability in several computer vision fields, especially concerning tasks that require high temporal resolution. In this work, we investigate the usage of this kind of data for emotion recognition by presenting NEFER, a dataset for Neuromorphic Event-based Facial Expression Recognition. NEFER is composed of paired RGB and event videos representing human faces labeled with the respective emotions and also annotated with face bounding boxes and facial landmarks. We detail the data acquisition process as well as provide a baseline method for RGB and event data. The collected data captures subtle micro-expressions, which are hard to spot with RGB data, yet emerge in the event domain. We report double the recognition accuracy for the event-based approach, proving the effectiveness of a neuromorphic approach for analyzing fast and hardly detectable expressions and the emotions they conceal.
https://arxiv.org/abs/2304.06351
Cascaded computation, whereby predictions are recurrently refined over several stages, has been a persistent theme throughout the development of landmark detection models. In this work, we show that the recently proposed Deep Equilibrium Model (DEQ) can be naturally adapted to this form of computation. Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the challenging WFLW facial landmark dataset, reaching $3.92$ NME with fewer parameters and a training memory cost of $\mathcal{O}(1)$ in the number of recurrent modules. Furthermore, we show that DEQs are particularly suited for landmark detection in videos. In this setting, it is typical to train on still images due to the lack of labelled videos. This can lead to a ``flickering'' effect at inference time on video, whereby a model can rapidly oscillate between different plausible solutions across consecutive frames. By rephrasing DEQs as a constrained optimization, we emulate recurrence at inference time, despite not having access to temporal data at training time. This Recurrence without Recurrence (RwR) paradigm helps in reducing landmark flicker, which we demonstrate by introducing a new metric, normalized mean flicker (NMF), and contributing a new facial landmark video dataset (WFLW-V) targeting landmark uncertainty. On the WFLW-V hard subset made up of $500$ videos, our LDEQ with RwR improves the NME and NMF by $10$ and $13\%$ respectively, compared to the strongest previously published model using a hand-tuned conventional filter.
https://arxiv.org/abs/2304.00600
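For readers unfamiliar with DEQs, the forward pass is a fixed-point solve z* = f_theta(z*, x) rather than a fixed stack of layers, which is why the training memory cost is O(1) in the number of recurrent modules. The schematic below uses plain fixed-point iteration; real implementations use faster root-finders and implicit differentiation, and the recurrence-without-recurrence mechanism that LDEQ adds at video inference time is not reproduced here.

```python
import torch

def deq_fixed_point(f, x, z0, max_iter=50, tol=1e-4):
    """Iterate z <- f(z, x) until convergence and return the equilibrium.
    f: callable taking (z, x); z0: initial guess for the equilibrium."""
    z = z0
    for _ in range(max_iter):
        z_next = f(z, x)
        if torch.norm(z_next - z) < tol * (torch.norm(z) + 1e-8):
            return z_next
        z = z_next
    return z
```

On video, one simple way to exploit temporal continuity at inference, in the spirit of (though not identical to) RwR, is to warm-start z0 for frame t with the equilibrium found for frame t-1.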
While recent research has made significant progress in speech-driven talking face generation, the quality of the generated video still lags behind that of real recordings. One reason for this is the use of handcrafted intermediate representations like facial landmarks and 3DMM coefficients, which are designed based on human knowledge and are insufficient to precisely describe facial movements. Additionally, these methods require an external pretrained model for extracting these representations, whose performance sets an upper bound on talking face generation. To address these limitations, we propose a novel method called DAE-Talker that leverages data-driven latent representations obtained from a diffusion autoencoder (DAE). DAE contains an image encoder that encodes an image into a latent vector and a DDIM image decoder that reconstructs the image from it. We train our DAE on talking face video frames and then extract their latent representations as the training target for a Conformer-based speech2latent model. This allows DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech, rather than relying on a predetermined head pose from a template video. We also introduce pose modelling in speech2latent for pose controllability. Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness. We also conduct ablation studies to analyze the effectiveness of the proposed techniques and demonstrate the pose controllability of DAE-Talker.
https://arxiv.org/abs/2303.17550
Audio-driven talking head animation is a challenging research topic with many real-world applications. Recent works have focused on creating photo-realistic 2D animation, while learning different talking or singing styles remains an open problem. In this paper, we present a new method to generate talking head animation with learnable style references. Given a set of style reference frames, our framework can reconstruct 2D talking head animation based on a single input image and an audio stream. Our method first produces facial landmark motion from the audio stream and constructs the intermediate style patterns from the style reference images. We then feed both outputs into a style-aware image generator to generate photo-realistic, high-fidelity 2D animation. In practice, our framework can extract the style information of a specific character and transfer it to any new static image for talking head animation. Extensive experimental results show that our method achieves better results than recent state-of-the-art approaches both qualitatively and quantitatively.
https://arxiv.org/abs/2303.09799
Most facial landmark detection methods predict landmarks by mapping the input facial appearance features to landmark heatmaps and have achieved promising results. However, when the face image suffers from large poses, heavy occlusions, and complicated illumination, they cannot learn discriminative feature representations and effective facial shape constraints, nor can they accurately predict the value of each element in the landmark heatmap, limiting their detection accuracy. To address this problem, we propose a novel Reference Heatmap Transformer (RHT) that introduces reference heatmap information for more precise facial landmark detection. The proposed RHT consists of a Soft Transformation Module (STM) and a Hard Transformation Module (HTM), which cooperate with each other to encourage the accurate transformation of the reference heatmap information and facial shape constraints. Then, a Multi-Scale Feature Fusion Module (MSFFM) is proposed to fuse the transformed heatmap features and the semantic features learned from the original face images to enhance feature representations for producing more accurate target heatmaps. To the best of our knowledge, this is the first study to explore how to enhance facial landmark detection by transforming reference heatmap information. Experimental results on challenging benchmark datasets demonstrate that our proposed method outperforms the state-of-the-art methods in the literature.
https://arxiv.org/abs/2303.07840
One of the key issues in facial expression recognition in the wild (FER-W) is that curating large-scale labeled facial images is challenging due to the inherent complexity and ambiguity of facial images. Therefore, in this paper, we propose a self-supervised simple facial landmark encoding (SimFLE) method that can learn effective encoding of facial landmarks, which are important features for improving the performance of FER-W, without expensive labels. Specifically, we introduce novel FaceMAE module for this purpose. FaceMAE reconstructs masked facial images with elaborately designed semantic masking. Unlike previous random masking, semantic masking is conducted based on channel information processed in the backbone, so rich semantics of channels can be explored. Additionally, the semantic masking process is fully trainable, enabling FaceMAE to guide the backbone to learn spatial details and contextual properties of fine-grained facial landmarks. Experimental results on several FER-W benchmarks prove that the proposed SimFLE is superior in facial landmark localization and noticeably improved performance compared to the supervised baseline and other self-supervised methods.
https://arxiv.org/abs/2303.07648
Development of human-machine interfaces has become a necessity for modern-day machines to catalyze more autonomy and efficiency. Gaze-driven human intervention is an effective and convenient option for creating an interface that alleviates human errors. Facial landmark detection is crucial for designing a robust gaze detection system. Regression-based methods provide good spatial localization of the landmarks corresponding to different parts of the face, but there is still scope for improvement, which has been addressed by incorporating attention. In this paper, we propose a deep coarse-to-fine architecture called LocalEyenet that localizes only the eye regions and can be trained end-to-end. The model architecture, built on a stacked hourglass backbone, learns self-attention over feature maps, which helps preserve global as well as local spatial dependencies in the face image. We incorporate deep layer aggregation in each hourglass to minimize the loss of attention over the depth of the architecture. Our model shows good generalization ability in cross-dataset evaluation and in real-time localization of eyes.
https://arxiv.org/abs/2303.12728
Talking face generation has been extensively investigated owing to its wide applicability. The two primary frameworks used for talking face generation comprise a text-driven framework, which generates synchronized speech and talking faces from text, and a speech-driven framework, which generates talking faces from speech. To integrate these frameworks, this paper proposes a unified facial landmark generator (UniFLG). The proposed system exploits end-to-end text-to-speech not only for synthesizing speech but also for extracting a series of latent representations that are common to text and speech, and feeds them to a landmark decoder to generate facial landmarks. We demonstrate that our system achieves higher naturalness in both speech synthesis and facial landmark generation compared to the state-of-the-art text-driven method. We further demonstrate that our system can generate facial landmarks from the speech of speakers without facial video data or even speech data.
https://arxiv.org/abs/2302.14337
This paper explores automated face and facial landmark detection of neonates, which is an important first step in many video-based neonatal health applications, such as vital sign estimation, pain assessment, sleep-wake classification, and jaundice detection. Utilising three publicly available datasets of neonates in the clinical environment, 366 images (258 subjects) and 89 images (66 subjects) were annotated for training and testing, respectively. Transfer learning was applied to two YOLO-based models, with input training images augmented with random horizontal flipping, photo-metric colour distortion, translation and scaling during each training epoch. Additionally, the re-orientation of input images and fusion of trained deep learning models were explored. Our proposed model, based on YOLOv7Face, outperformed existing methods with a mean average precision of 84.8% for face detection and a normalised mean error of 0.072 for facial landmark detection. Overall, this will assist in the development of fully automated neonatal health assessment algorithms.
https://arxiv.org/abs/2302.04341
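For reference, the normalised mean error quoted above is the standard landmark metric: mean point-to-point distance divided by a normalising length (the paper's exact normaliser, e.g. inter-ocular distance or face-box size, is not restated here).

```python
import numpy as np

def normalized_mean_error(pred, gt, norm_dist):
    """Mean Euclidean distance between predicted and ground-truth landmarks,
    divided by a normalising distance.  pred, gt: (K, 2) arrays."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred - gt, axis=-1).mean() / norm_dist
```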