Recently, deep learning-based facial landmark detection for in-the-wild faces has achieved significant improvement. However, face landmark detection in other domains (e.g., cartoon, caricature) remains challenging due to the scarcity of extensively annotated training data. To tackle this problem, we design a two-stage training approach that effectively leverages limited datasets and a pre-trained diffusion model to obtain aligned landmark-face pairs in multiple domains. In the first stage, we train a landmark-conditioned face generation model on a large dataset of real faces. In the second stage, we fine-tune this model on a small dataset of image-landmark pairs with text prompts that control the domain. These designs enable our method to generate high-quality synthetic paired datasets in multiple domains while preserving the alignment between landmarks and facial features. Finally, we fine-tune a pre-trained face landmark detection model on the synthetic dataset to achieve multi-domain face landmark detection. Our qualitative and quantitative results demonstrate that our method outperforms existing methods on multi-domain face landmark detection.
https://arxiv.org/abs/2401.13191
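As a rough illustration of the conditioning input (not the authors' code), the sketch below rasterizes a landmark set into a condition image that a landmark-conditioned generator could consume; the 68-point layout and image size are assumptions.

```python
# A minimal sketch of rendering a landmark map for a landmark-conditioned
# (e.g. ControlNet-style) diffusion generator; layout and size are assumed.
import numpy as np

def landmarks_to_condition_image(landmarks, size=512, radius=3):
    """Rasterize (x, y) landmarks into a single-channel condition image.

    landmarks: (N, 2) array of pixel coordinates; the conditioning branch of
    the generator would consume this map alongside the noisy latent.
    """
    img = np.zeros((size, size), dtype=np.float32)
    ys, xs = np.mgrid[0:size, 0:size]
    for x, y in landmarks:
        mask = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
        img[mask] = 1.0
    return img

# Hypothetical 68-point layout; real datasets (e.g. 300W) fix the indices.
pts = np.random.rand(68, 2) * 512
cond = landmarks_to_condition_image(pts)
print(cond.shape, cond.max())
```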
Introduction: In the realm of human-computer interaction and behavioral research, accurate real-time gaze estimation is critical. Traditional methods often rely on expensive equipment or large datasets, which are impractical in many scenarios. This paper introduces a novel, geometry-based approach to address these challenges, utilizing consumer-grade hardware for broader applicability. Methods: We leverage novel face landmark detection neural networks capable of fast inference on consumer-grade chips to generate accurate and stable 3D landmarks of the face and iris. From these, we derive a small set of geometry-based descriptors, forming an 8-dimensional manifold representing the eye and head movements. These descriptors are then used to formulate linear equations for predicting eye-gaze direction. Results: Our approach demonstrates the ability to predict gaze with an angular error of less than 1.9 degrees, rivaling state-of-the-art systems while operating in real-time and requiring negligible computational resources. Conclusion: The developed method marks a significant step forward in gaze estimation technology, offering a highly accurate, efficient, and accessible alternative to traditional systems. It opens up new possibilities for real-time applications in diverse fields, from gaming to psychological research.
https://arxiv.org/abs/2401.00406
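A toy sketch of the final step as described above: fitting linear equations that map an 8-dimensional geometric descriptor to gaze angles. The descriptor contents here are synthetic stand-ins, not the paper's actual features.

```python
# Fit linear equations from an 8-D descriptor to gaze yaw/pitch via least
# squares, e.g. from a short on-screen calibration sequence. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                        # 8-D descriptors per frame
true_W = rng.normal(size=(8, 2))
Y = X @ true_W + 0.01 * rng.normal(size=(500, 2))    # yaw, pitch in degrees

W, *_ = np.linalg.lstsq(X, Y, rcond=None)            # linear calibration
pred = X @ W
print("mean residual per axis:", np.abs(pred - Y).mean(axis=0))
```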
High-fidelity and efficient audio-driven talking head generation has been a key research topic in computer graphics and computer vision. In this work, we study vector-image-based audio-driven talking head generation. Compared with directly animating raster images, as is most widely done in existing works, vector images enjoy excellent scalability across many applications. There are two main challenges for vector-image-based talking head generation: high-quality vector image reconstruction w.r.t. the source portrait image and vivid animation w.r.t. the audio signal. To address these, we propose a novel scalable vector graphic reconstruction and animation method, dubbed VectorTalker. Specifically, for high-fidelity reconstruction, VectorTalker hierarchically reconstructs the vector image in a coarse-to-fine manner. For vivid audio-driven facial animation, we use facial landmarks as an intermediate motion representation and propose an efficient landmark-driven vector image deformation module. Our approach can handle various styles of portrait images within a unified framework, including Japanese manga, cartoon, and photorealistic images. We conduct extensive quantitative and qualitative evaluations, and the experimental results demonstrate the superiority of VectorTalker in both vector graphic reconstruction and audio-driven animation.
https://arxiv.org/abs/2312.11568
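A minimal sketch of one way a landmark-driven vector deformation can work, under assumed details: each vector-graphic control point moves by a distance-weighted blend of nearby landmark displacements.

```python
# Landmark-driven deformation of vector-graphic control points via Gaussian
# RBF weighting; the kernel and weighting scheme are assumptions.
import numpy as np

def deform_control_points(ctrl, lm_src, lm_dst, sigma=30.0):
    """ctrl: (M, 2) Bezier control points; lm_src/lm_dst: (N, 2) landmarks."""
    disp = lm_dst - lm_src                           # per-landmark motion
    d2 = ((ctrl[:, None, :] - lm_src[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))               # Gaussian RBF weights
    w /= w.sum(axis=1, keepdims=True) + 1e-8
    return ctrl + w @ disp                           # (M, 2) deformed points

ctrl = np.random.rand(200, 2) * 256
lm0 = np.random.rand(68, 2) * 256
lm1 = lm0 + np.random.randn(68, 2) * 2.0             # audio-predicted motion
print(deform_control_points(ctrl, lm0, lm1).shape)
```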
Face swapping has gained significant traction, driven by the abundance of human face synthesis enabled by deep learning methods. However, previous face swapping methods that used generative adversarial networks (GANs) as backbones have faced challenges such as inconsistent blending, distortions, artifacts, and training instability. To address these limitations, we propose an innovative end-to-end framework for high-fidelity face swapping. First, we introduce a StyleGAN-based facial attributes encoder that extracts essential features from faces and inverts them into a latent style code, encapsulating the facial attributes indispensable for successful face swapping. Second, we introduce an attention-based style blending module to effectively transfer face IDs from source to target. To ensure accurate and high-quality transfer, we implement a series of constraints, including contrastive face-ID learning, facial landmark alignment, and dual swap consistency. Finally, the blended style code is translated back to the image space via the style decoder, which offers high training stability and strong generative capability. Extensive experiments on the CelebA-HQ dataset highlight the superior visual quality of images generated by our face-swapping methodology compared to other state-of-the-art methods, as well as the effectiveness of each proposed module. Source code and weights will be publicly available.
https://arxiv.org/abs/2312.10843
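A hedged sketch of attention-based style blending in latent space; the W+ token layout, dimensions, and residual design are illustrative assumptions, not the paper's released architecture.

```python
# Target style tokens attend to source ID tokens; a residual keeps target
# attributes. All shapes are assumptions for illustration.
import torch
import torch.nn as nn

class StyleBlend(nn.Module):
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_style, source_style):
        # target latent codes query identity information from the source
        blended, _ = self.attn(target_style, source_style, source_style)
        return self.norm(target_style + blended)

w_plus_tokens = 18                              # typical W+ layout in StyleGAN2
src = torch.randn(2, w_plus_tokens, 512)
tgt = torch.randn(2, w_plus_tokens, 512)
print(StyleBlend()(tgt, src).shape)             # torch.Size([2, 18, 512])
```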
Dynamic NeRFs have recently garnered growing attention for 3D talking portrait synthesis. Despite advances in rendering speed and visual quality, challenges persist in enhancing efficiency and effectiveness. We present R2-Talker, an efficient and effective framework enabling realistic real-time talking head synthesis. Specifically, using multi-resolution hash grids, we introduce a novel approach for encoding facial landmarks as conditional features. This approach losslessly encodes landmark structures as conditional features, decoupling input diversity from the conditional space by mapping arbitrary landmarks to a unified feature space. We further propose a scheme of progressive multilayer conditioning in the NeRF rendering pipeline for effective conditional feature fusion. Extensive experiments against state-of-the-art works demonstrate the following advantages of our approach: 1) The lossless input encoding enables acquiring more precise features, yielding superior visual quality, while the decoupling of inputs and conditional spaces improves generalizability. 2) Fusing conditional features and MLP outputs at each MLP layer enhances conditional impact, resulting in more accurate lip synthesis and better visual quality. 3) It compactly structures the fusion of conditional features, significantly enhancing computational efficiency.
https://arxiv.org/abs/2312.05572
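A simplified sketch of encoding landmarks with multi-resolution hash grids (instant-NGP style); the paper's exact level count, table sizes, and corner interpolation are not reproduced, and this version uses nearest-corner lookups.

```python
# Encode 3D landmarks via per-level hashed embedding lookups; a simplified
# stand-in for multi-resolution hash-grid encoding.
import torch
import torch.nn as nn

class HashLandmarkEncoder(nn.Module):
    def __init__(self, levels=4, table_size=2**14, feat_dim=2, base_res=16):
        super().__init__()
        self.res = [base_res * 2**i for i in range(levels)]
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, feat_dim) for _ in range(levels))
        self.table_size = table_size

    def forward(self, x):                          # x: (B, N, 3) in [0, 1]
        feats = []
        for res, table in zip(self.res, self.tables):
            idx = (x * res).long()                 # nearest grid corner (the
            # real method interpolates between surrounding corners)
            h = (idx[..., 0] * 1) ^ (idx[..., 1] * 2654435761) \
                ^ (idx[..., 2] * 805459861)        # spatial hash with primes
            feats.append(table(h % self.table_size))
        return torch.cat(feats, dim=-1)            # (B, N, levels * feat_dim)

enc = HashLandmarkEncoder()
print(enc(torch.rand(1, 68, 3)).shape)             # torch.Size([1, 68, 8])
```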
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations, e.g., insufficient quantity and diversity of pose, occlusion, and illumination, as well as the inherent ambiguity of facial expressions. In contrast, static facial expression recognition (SFER) currently shows much higher performance and can benefit from more abundant high-quality training data. Moreover, the appearance features and dynamic dependencies of DFER remain largely unexplored. To tackle these challenges, we introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features, thereby significantly improving DFER performance. First, we build and train an image model for SFER that incorporates only a standard Vision Transformer (ViT) and Multi-View Complementary Prompters (MCPs). Then, we obtain our video model for DFER (i.e., S2D) by inserting Temporal-Modeling Adapters (TMAs) into the image model. The MCPs enhance facial expression features with landmark-aware features inferred by an off-the-shelf facial landmark detector, while the TMAs capture and model the relationships of dynamic changes in facial expressions, effectively extending the pre-trained image model to videos. Notably, MCPs and TMAs add only a small fraction of trainable parameters (less than +10%) to the original image model. Moreover, we present a novel self-distillation loss based on emotion anchors (i.e., reference samples for each emotion category) to reduce the detrimental influence of ambiguous emotion labels, further enhancing our S2D. Experiments conducted on popular SFER and DFER datasets show that we achieve state-of-the-art performance.
https://arxiv.org/abs/2312.05447
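A rough sketch of the adapter idea: a bottleneck Temporal-Modeling Adapter inserted alongside a frozen image backbone; the placement, widths, and temporal convolution are assumptions for illustration.

```python
# Bottleneck adapter that mixes information across frames while the image
# model stays frozen; widths and the Conv1d choice are assumed.
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, 3, padding=1)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):                  # x: (B, T, N, D) frame tokens
        B, T, N, D = x.shape
        h = self.down(x)                   # (B, T, N, b)
        h = h.permute(0, 2, 3, 1).reshape(B * N, -1, T)
        h = torch.relu(self.temporal(h))   # mix along the time axis
        h = h.reshape(B, N, -1, T).permute(0, 3, 1, 2)
        return x + self.up(h)              # residual: image model untouched

x = torch.randn(2, 8, 197, 768)            # 8 frames of ViT tokens
print(TemporalAdapter()(x).shape)          # torch.Size([2, 8, 197, 768])
```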
We propose 360° Volumetric Portrait (3VP) Avatar, a novel method for reconstructing 360° photo-realistic portrait avatars of human subjects solely from monocular video inputs. State-of-the-art monocular avatar reconstruction methods rely on stable facial performance capture. However, the common use of 3DMM-based facial tracking has its limits: side views can hardly be captured, and tracking fails in particular for back views, as required inputs such as facial landmarks or human parsing masks are missing. This results in incomplete avatar reconstructions that only cover the frontal hemisphere. In contrast, we propose template-based tracking of the torso, head, and facial expressions, which allows us to cover the appearance of a human subject from all sides. Thus, given a sequence of a subject rotating in front of a single camera, we train a neural volumetric representation based on neural radiance fields. A key challenge in constructing this representation is modeling appearance changes, especially in the mouth region (i.e., lips and teeth). We therefore propose a deformation-field-based blend basis that allows us to interpolate between different appearance states. We evaluate our approach on captured real-world data and compare against state-of-the-art monocular reconstruction methods. In contrast to those, our method is the first monocular technique that reconstructs an entire 360° avatar.
https://arxiv.org/abs/2312.05311
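A small sketch of a deformation-field blend basis: per-frame weights interpolate learned deformation fields before querying the radiance field. Shapes and the source of the weights (e.g. an expression code) are assumed.

```python
# Blend K learned deformation fields with per-frame weights; the blended field
# would warp sample points before the radiance-field query. Shapes are assumed.
import torch

num_bases, grid = 6, (32, 32, 32)
bases = torch.randn(num_bases, 3, *grid)            # K fields of (dx, dy, dz)
w = torch.softmax(torch.randn(num_bases), dim=0)    # blend weights, one frame

field = (w.view(-1, 1, 1, 1, 1) * bases).sum(0)     # (3, 32, 32, 32)
print(field.shape)
```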
This paper explores privacy-compliant group-level emotion recognition "in the wild" within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields, including social robotics, conversational agents, e-coaching, and learning analytics. This research restricts itself to global features, avoiding individual ones, i.e., any features that could be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video branch and an audio branch with cross-attention between the modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feeds them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase, in a data-driven way, the sensitivity of our model to facial expressions within the image. Extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs well on the EmotiW challenge, with 79.24% and 75.13% accuracy on the validation and test sets, respectively, for the best models. Notably, our findings highlight that it is possible to reach this accuracy level with privacy-compliant features using only 5 frames uniformly distributed over the video.
https://arxiv.org/abs/2312.05265
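A hedged sketch of two concrete pieces named above: uniform 5-frame sampling and a cross-attention step in which video tokens attend to audio tokens; token dimensions are illustrative.

```python
# Uniform frame sampling plus cross-attention between modalities; dimensions
# and token granularity are assumptions, not the challenge submission's code.
import torch
import torch.nn as nn

def uniform_frame_indices(num_frames, k=5):
    # k indices spread evenly over the clip, as in the 5-frame setting
    return torch.linspace(0, num_frames - 1, k).long()

print(uniform_frame_indices(120))            # tensor([  0,  29,  59,  89, 119])

cross = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
video_tokens = torch.randn(2, 5, 256)        # one token per sampled frame
audio_tokens = torch.randn(2, 50, 256)       # Mel-spectrogram encoder output
fused, _ = cross(video_tokens, audio_tokens, audio_tokens)
print(fused.shape)                           # torch.Size([2, 5, 256])
```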
Facial landmark tracking for thermal images requires tracking certain important regions of subjects' faces using thermal imagery, which omits lighting and shading but shows the temperature of the subject. Heat fluctuations in particular places reflect physiological changes like blood flow and perspiration, which can be used to remotely gauge states such as anxiety and excitement. Past work in this domain has explored only a very limited set of architectures and techniques. This work goes further by evaluating a comprehensive suite of models with different components, such as residual connections, channel- and feature-wise attention, and ensembles of network components working in parallel. The best model integrates convolutional and residual layers followed by a channel-wise self-attention layer, requiring fewer than 100K parameters.
https://arxiv.org/abs/2311.08308
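An illustrative toy model (not the paper's exact architecture) combining residual convolutions with a parameter-free channel-wise self-attention layer, sized to land well under the ~100K-parameter scale reported above.

```python
# Small landmark regressor: conv stem, residual block, channel self-attention.
# Widths, depths, and the landmark count are assumptions.
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Parameter-free channel attention: channels attend over one another
    using their flattened spatial responses as token embeddings."""
    def forward(self, x):                            # (B, C, H, W)
        B, C, H, W = x.shape
        t = x.flatten(2)                             # (B, C, HW)
        attn = torch.softmax(t @ t.transpose(1, 2) / (H * W) ** 0.5, dim=-1)
        return x + (attn @ t).view(B, C, H, W)       # residual reweighting

class ThermalLandmarkNet(nn.Module):
    def __init__(self, n_landmarks=10, c=24):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, c, 3, 2, 1), nn.ReLU())
        self.res = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1), nn.ReLU(),
                                 nn.Conv2d(c, c, 3, 1, 1))
        self.attn = ChannelSelfAttention()
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                  nn.Linear(c * 16, n_landmarks * 2))

    def forward(self, x):                            # x: (B, 1, H, W) thermal
        h = self.stem(x)
        h = torch.relu(h + self.res(h))              # residual block
        return self.head(self.attn(h))               # (B, n_landmarks * 2)

net = ThermalLandmarkNet()
print(sum(p.numel() for p in net.parameters()))      # ~18K, well under 100K
```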
In recent years, videoconferencing has taken a fundamental role in interpersonal relations, for both personal and business purposes. Lossy video compression algorithms are the enabling technology for videoconferencing, as they reduce the bandwidth required for real-time video streaming. However, lossy video compression decreases the perceived visual quality. Thus, many techniques for reducing compression artifacts and improving video visual quality have been proposed in recent years. In this work, we propose a novel GAN-based method for compression artifact reduction in videoconferencing. Given that, in this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission, we can maintain a set of reference keyframes of the person from the higher-quality I-frames transmitted within the video stream and exploit them to guide the visual quality improvement; a novel aspect of this approach is the update policy that maintains a compact and effective set of reference keyframes. First, we extract multi-scale features from the compressed and reference frames. Then, our architecture combines these features progressively according to facial landmarks, allowing the restoration of the high-frequency details lost during video compression. Experiments show that the proposed approach improves visual quality and generates photo-realistic results even at high compression rates. Code and pre-trained networks are publicly available at this https URL.
https://arxiv.org/abs/2311.04263
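A sketch of one plausible reference-keyframe update policy; the paper's exact rules are not reproduced, so the quality score and replacement criterion below are assumptions.

```python
# Maintain a compact bank of reference keyframes harvested from high-quality
# I-frames; replace the weakest stored frame when a better one arrives.
import numpy as np

class KeyframeBank:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.frames, self.scores = [], []       # score: e.g. a sharpness proxy

    def update(self, frame, score):
        if len(self.frames) < self.capacity:
            self.frames.append(frame)
            self.scores.append(score)
            return
        i = int(np.argmin(self.scores))         # weakest stored reference
        if score > self.scores[i]:              # replace only if better
            self.frames[i], self.scores[i] = frame, score

bank = KeyframeBank()
for t in range(10):                             # I-frames arriving over time
    bank.update(np.zeros((64, 64)), score=float(t))
print(len(bank.frames), sorted(bank.scores))    # 4 [6.0, 7.0, 8.0, 9.0]
```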
The development of various sensing technologies is improving measurements of stress and the well-being of individuals. Although progress has been made with single-signal modalities like wearables and facial emotion recognition, integrating multiple modalities provides a more comprehensive understanding of stress, given that stress manifests differently across people. Multi-modal learning aims to capitalize on the strengths of each modality rather than relying on a single signal. Given the complexity of processing and integrating high-dimensional data from limited subjects, more research is needed. Numerous research efforts have focused on fusing stress and emotion signals at an early stage, e.g., feature-level fusion using basic machine learning methods and 1D-CNN methods. This paper proposes a multi-modal learning approach for stress detection that integrates facial landmarks and biometric signals. We test this multi-modal integration with various early-fusion and late-fusion techniques, combining a 1D-CNN model for biometric signals with a 2D-CNN for facial landmarks. We evaluate these architectures with a rigorous test of generalizability using the leave-one-subject-out protocol, i.e., all samples from a single subject are held out when training the model. Our findings show that late fusion achieved 94.39% accuracy, and early fusion surpassed it with 98.38% accuracy. This research contributes valuable insights into enhancing stress detection through a multi-modal approach.
https://arxiv.org/abs/2311.03606
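A compact sketch of the leave-one-subject-out protocol described above; the logistic-regression classifier is a stand-in for the fusion networks.

```python
# Leave-one-subject-out (LOSO): all samples of one subject are held out per
# fold. Features and labels here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))                  # fused features per sample
y = rng.integers(0, 2, size=300)                # stress / no-stress labels
subjects = rng.integers(0, 10, size=300)        # 10 hypothetical subjects

accs = []
for s in np.unique(subjects):
    train, test = subjects != s, subjects == s  # hold out one whole subject
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    accs.append(clf.score(X[test], y[test]))
print(f"LOSO accuracy: {np.mean(accs):.3f}")
```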
Recently, heatmap regression methods based on 1D landmark representations have shown prominent performance in locating facial landmarks. However, previous methods have not deeply explored the potential of 1D landmark representations for the sequential and structural modeling of multiple landmarks in facial landmark tracking. To address this limitation, we propose a Transformer architecture, namely 1DFormer, which learns informative 1D landmark representations by capturing the dynamic and geometric patterns of landmarks via token communication in both the temporal and spatial dimensions. For temporal modeling, we propose a recurrent token mixing mechanism, an axis-landmark-positional embedding mechanism, and a confidence-enhanced multi-head attention mechanism to adaptively and robustly embed long-term landmark dynamics into the 1D representations; for structure modeling, we design intra-group and inter-group structure modeling mechanisms that encode component-level as well as global-level facial structure patterns as a refinement of the 1D landmark representations through token communication in the spatial dimension via 1D convolutional layers. Experimental results on the 300VW and TF databases show that 1DFormer successfully models long-range sequential patterns as well as inherent facial structures to learn informative 1D representations of landmark sequences, and achieves state-of-the-art performance on facial landmark tracking.
https://arxiv.org/abs/2311.00241
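For readers unfamiliar with 1D landmark representations, the sketch below shows the common construction they build on: projecting a 2D landmark heatmap onto the x- and y-axes yields two 1D vectors per landmark.

```python
# Marginalize a 2D heatmap into two 1D distributions; the landmark location
# is recoverable from the two argmaxes.
import numpy as np

def heatmap_to_1d(heatmap):
    """heatmap: (H, W) -> two marginal 1D representations (over x and y)."""
    hx = heatmap.sum(axis=0)           # distribution over x
    hy = heatmap.sum(axis=1)           # distribution over y
    return hx / hx.sum(), hy / hy.sum()

H = W = 64
g = np.exp(-(((np.arange(W) - 40) ** 2)[None, :] +
             ((np.arange(H) - 20) ** 2)[:, None]) / (2 * 3.0 ** 2))
hx, hy = heatmap_to_1d(g)
print(np.argmax(hx), np.argmax(hy))    # recovers (x=40, y=20)
```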
The field of animal affective computing is rapidly emerging, and analysis of facial expressions is a crucial aspect. One of the most significant challenges researchers in the field currently face is the scarcity of high-quality, comprehensive datasets that allow the development of models for facial expression analysis. One possible approach is the use of facial landmarks, which has been demonstrated for both humans and animals. In this paper, we present a novel dataset of cat facial images annotated with bounding boxes and 48 facial landmarks grounded in cat facial anatomy. We also introduce a convolutional neural network-based landmark detection model that uses a magnifying ensemble method. Our model shows excellent performance on cat faces and generalizes to human facial landmark detection.
https://arxiv.org/abs/2310.09793
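The paper names a "magnifying ensemble" method; as a hedged illustration of one plausible reading, each ensemble member re-predicts landmarks on a magnified crop around a coarse estimate, and predictions are averaged in image space.

```python
# Hypothetical magnify-and-refine ensemble; crop geometry and the member
# interface are assumptions, not the paper's implementation.
import numpy as np

def magnify_ensemble(image, coarse_pts, members, zoom=2.0):
    """coarse_pts: (48, 2); members: callables crop -> (48, 2) crop-space pts."""
    cx, cy = coarse_pts.mean(axis=0)
    half = np.ptp(coarse_pts, axis=0).max() / zoom   # tighter crop = magnified
    x0, y0 = int(cx - half), int(cy - half)
    crop = image[y0:y0 + int(2 * half), x0:x0 + int(2 * half)]
    preds = [m(crop) + np.array([x0, y0]) for m in members]  # back to image space
    return np.mean(preds, axis=0)

image = np.zeros((256, 256))
pts = 80 + np.random.rand(48, 2) * 100
member = lambda crop: np.random.rand(48, 2) * crop.shape[0]
print(magnify_ensemble(image, pts, [member, member]).shape)  # (48, 2)
```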
Recently, how to introduce large numbers of unlabeled in-the-wild facial images into supervised facial action unit (AU) detection frameworks has become a challenging problem. In this paper, we propose a new AU detection framework in which multi-task learning is introduced to jointly learn AU domain separation and reconstruction and facial landmark detection by sharing the parameters of homostructural facial extraction modules. In addition, we propose a new feature alignment scheme based on contrastive learning, using simple projectors and an improved contrastive loss that adds four additional intermediate supervisors to promote the feature reconstruction process. Experimental results on two benchmarks demonstrate the superiority of our method over state-of-the-art methods for AU detection in the wild.
https://arxiv.org/abs/2310.05207
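A hedged sketch of contrastive feature alignment with a simple projector, using a standard InfoNCE-style loss; the paper's improved loss and its four intermediate supervisors are not reproduced here.

```python
# InfoNCE-style alignment: projected paired features are pulled together,
# with in-batch negatives on the off-diagonal. Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

projector = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))

def info_nce(a, b, tau=0.1):
    a = F.normalize(projector(a), dim=-1)
    b = F.normalize(projector(b), dim=-1)
    logits = a @ b.t() / tau                   # (B, B) similarity matrix
    targets = torch.arange(a.size(0))          # positives on the diagonal
    return F.cross_entropy(logits, targets)

feat_labeled = torch.randn(8, 256)             # e.g. AU-branch features
feat_unlabeled = torch.randn(8, 256)           # reconstruction-branch features
print(info_nce(feat_labeled, feat_unlabeled).item())
```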
Extreme head postures pose a common challenge across a spectrum of facial analysis tasks, including face detection, facial landmark detection (FLD), and head pose estimation (HPE). These tasks are interdependent: accurate FLD relies on robust face detection, and HPE is intricately associated with these keypoints. This paper focuses on the integration of these tasks, particularly when addressing the complexities posed by large-angle face poses. The primary contribution of this study is a real-time multi-task detection system capable of simultaneously performing joint detection of faces, facial landmarks, and head poses. The system builds upon the widely adopted YOLOv8 detection framework, extending the original object detection head with an additional landmark regression head that enables efficient localization of crucial facial landmarks. Furthermore, we optimize and enhance various modules within the original YOLOv8 framework. To validate the effectiveness and real-time performance of our proposed model, we conduct extensive experiments on the 300W-LP and AFLW2000-3D datasets. The results verify the capability of our model to tackle large-angle face pose challenges while delivering real-time performance across these interconnected tasks.
https://arxiv.org/abs/2309.11773
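A schematic sketch (not YOLOv8 internals) of extending a detection head with a landmark regression branch that outputs five (x, y) offsets per cell; channel widths are illustrative.

```python
# Detection head with an added landmark regression branch; box/class branches
# are simplified to single 1x1 convolutions for illustration.
import torch
import torch.nn as nn

class DetectWithLandmarks(nn.Module):
    def __init__(self, c_in=256, num_classes=1, num_landmarks=5):
        super().__init__()
        self.box = nn.Conv2d(c_in, 4, 1)                    # box regression
        self.cls = nn.Conv2d(c_in, num_classes, 1)          # objectness/class
        self.lmk = nn.Conv2d(c_in, num_landmarks * 2, 1)    # added branch

    def forward(self, feat):                                # (B, C, H, W)
        return self.box(feat), self.cls(feat), self.lmk(feat)

head = DetectWithLandmarks()
box, cls, lmk = head(torch.randn(1, 256, 20, 20))
print(box.shape, cls.shape, lmk.shape)       # landmarks: (1, 10, 20, 20)
```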
Background and objectives: Patients suffering from neurological diseases may develop dysarthria, a motor speech disorder affecting the execution of speech. Close and quantitative monitoring of dysarthria evolution is crucial for enabling clinicians to promptly implement patient management strategies and for maximizing the effectiveness and efficiency of communication functions in terms of restoring, compensating, or adjusting. In the clinical assessment of orofacial structures and functions, at rest or during speech and non-speech movements, a qualitative evaluation is usually performed through visual observation. Methods: To overcome the limitations posed by qualitative assessments, this work presents a store-and-forward self-service telemonitoring system that integrates, within its cloud architecture, a convolutional neural network (CNN) for analyzing video recordings acquired by individuals with dysarthria. This architecture, called facial landmark Mask RCNN, aims at locating facial landmarks as a prior for assessing the orofacial functions related to speech and examining dysarthria evolution in neurological diseases. Results: When tested on the Toronto NeuroFace dataset, a publicly available annotated dataset of video recordings from patients with amyotrophic lateral sclerosis (ALS) and stroke, the proposed CNN achieved a normalized mean error of 1.79 in localizing facial landmarks. We also tested our system in a real-life scenario on 11 bulbar-onset ALS subjects, obtaining promising outcomes in terms of facial landmark position estimation. Discussion and conclusions: This preliminary study represents a relevant step towards the use of remote tools to support clinicians in monitoring the evolution of dysarthria.
https://arxiv.org/abs/2309.09038
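For reference, the normalized mean error (NME) quoted above in a common form: mean landmark error divided by a normalizing distance. The exact normalizer used in the Toronto NeuroFace evaluation may differ.

```python
# NME: mean Euclidean landmark error divided by a normalizing distance
# (inter-ocular distance is the usual 68-point choice).
import numpy as np

def nme(pred, gt, norm):
    """pred, gt: (N, 2) landmark arrays; norm: e.g. inter-ocular distance."""
    return np.linalg.norm(pred - gt, axis=1).mean() / norm

gt = np.random.rand(68, 2) * 256
pred = gt + np.random.randn(68, 2)
iod = np.linalg.norm(gt[36] - gt[45])      # outer eye corners, 68-pt convention
print(f"NME: {nme(pred, gt, iod):.3f}")
```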
We present Blendshapes GHUM, an on-device ML pipeline that predicts 52 facial blendshape coefficients at 30+ FPS on modern mobile phones from a single monocular RGB image, enabling facial motion capture applications such as virtual avatars. Our main contributions are: i) an annotation-free offline method for obtaining blendshape coefficients from real-world human scans, and ii) a lightweight real-time model that predicts blendshape coefficients from facial landmarks.
https://arxiv.org/abs/2309.05782
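A hedged sketch of contribution ii): a lightweight MLP regressing 52 blendshape coefficients from detected facial landmarks; the 478-landmark input (as in MediaPipe Face Mesh) and layer widths are assumptions.

```python
# Landmarks (B, 478, 3) -> 52 blendshape coefficients in [0, 1].
import torch
import torch.nn as nn

class BlendshapeRegressor(nn.Module):
    def __init__(self, n_landmarks=478, n_coeffs=52):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_coeffs), nn.Sigmoid())  # coefficients in [0, 1]

    def forward(self, landmarks):                    # (B, 478, 3)
        return self.net(landmarks.flatten(1))

model = BlendshapeRegressor()
print(model(torch.randn(2, 478, 3)).shape)           # torch.Size([2, 52])
```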
The incorporation of 3D data in facial analysis tasks has gained popularity in recent years. Though it provides a more accurate and detailed representation of the human face, acquiring 3D face data is more complex and expensive than acquiring 2D face images: one must rely either on expensive 3D scanners or on depth sensors, which are prone to noise. An alternative is the unsupervised reconstruction of 3D faces from uncalibrated 2D images without any ground-truth 3D data. However, such approaches are computationally expensive, and the learned model size is not suitable for mobile or other edge-device applications. Predicting dense 3D landmarks over the whole face can overcome this issue. As there is no public dataset containing dense landmarks, we propose a pipeline to create a dense-keypoint training dataset containing 520 keypoints across the whole face from existing facial position map data. We train a lightweight MobileNet-based regressor model with the generated data. As we do not have access to any evaluation dataset with dense landmarks, we evaluate our model on the 68-keypoint detection task. Experimental results show that our trained model outperforms many existing methods despite its smaller model size and minimal computational cost. Qualitative evaluation also shows the effectiveness of our trained model under extreme head-pose angles as well as other facial variations and occlusions.
https://arxiv.org/abs/2308.15170
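A small sketch of the dataset-construction idea: sampling a fixed set of 520 UV indices from a facial position map (assumed here to be a 256x256x3 map, PRNet-style) yields dense 3D keypoints per image. The index choice is illustrative.

```python
# Sample 520 fixed UV locations from a position map to get (520, 3) targets.
import numpy as np

pos_map = np.random.rand(256, 256, 3)            # x, y, z per UV location
rng = np.random.default_rng(0)
uv_idx = rng.choice(256 * 256, size=520, replace=False)   # chosen once, reused
u, v = np.unravel_index(uv_idx, (256, 256))

dense_kpts = pos_map[u, v]                       # (520, 3) training targets
print(dense_kpts.shape)
```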
The ability of humans to infer head poses from face shapes, and vice versa, indicates a strong correlation between the two. Accordingly, recent studies on face alignment have employed head pose information to predict facial landmarks in computer vision tasks. In this study, we propose a novel method that employs head pose information to improve face alignment performance by fusing said information with the feature maps of a face alignment network, rather than simply using it to initialize facial landmarks. Furthermore, the proposed network structure performs robust face alignment through a dual-dimensional network using multidimensional features represented by 2D feature maps and a 3D heatmap. For effective dense face alignment, we also propose a prediction method for facial geometric landmarks through training based on knowledge distillation using predicted keypoints. We experimentally assessed the correlation between the predicted facial landmarks and head pose information, as well as variations in the accuracy of facial landmarks with respect to the quality of head pose information. In addition, we demonstrated the effectiveness of the proposed method through a competitive performance comparison with state-of-the-art methods on the AFLW2000-3D, AFLW, and BIWI datasets.
https://arxiv.org/abs/2308.13327
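A minimal sketch of the fusion idea as described: a head-pose vector is embedded and fused with the alignment network's feature maps rather than only initializing landmarks; the broadcast-add fusion form is an assumption.

```python
# Embed (yaw, pitch, roll) into a per-channel vector and fuse it with the
# feature maps; channel width and fusion operator are illustrative.
import torch
import torch.nn as nn

class PoseFusion(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.embed = nn.Linear(3, c)               # pose -> channel-wise vector

    def forward(self, feat, pose):                 # feat: (B,C,H,W), pose: (B,3)
        p = self.embed(pose).unsqueeze(-1).unsqueeze(-1)
        return feat + p                            # broadcast-add pose features

fused = PoseFusion()(torch.randn(2, 64, 32, 32), torch.randn(2, 3))
print(fused.shape)                                 # torch.Size([2, 64, 32, 32])
```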
Facial landmark detection is an essential technology for driver status tracking and has been in demand for real-time estimation. For landmark coordinate prediction, heatmap-based methods are known to achieve high accuracy, and Lite-HRNet can achieve fast estimation. However, Lite-HRNet still suffers from the heavy computational cost of its fusion blocks, which connect feature maps of different resolutions. In addition, the strong output module used in HRNetV2 is not applied to Lite-HRNet. Given these problems, we propose a novel architecture called Lite-HRNet Plus. Lite-HRNet Plus introduces two improvements: a novel fusion block based on channel attention and a novel, less computationally intensive output module that uses multi-resolution feature maps. Through experiments conducted on two facial landmark datasets, we confirmed that Lite-HRNet Plus further improves accuracy compared with conventional methods and achieves state-of-the-art accuracy at a computational complexity in the range of 10M FLOPs.
https://arxiv.org/abs/2308.12133
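An illustrative channel-attention fusion block (SE-style) between two resolutions; Lite-HRNet Plus's exact block differs, so treat this as the general idea only.

```python
# Fuse a high-resolution and an upsampled low-resolution branch with
# SE-style channel attention; widths and the gating form are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionFuse(nn.Module):
    def __init__(self, c, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(),
                                nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, hi, lo):                    # hi: (B,C,H,W), lo: (B,C,H/2,W/2)
        lo_up = F.interpolate(lo, size=hi.shape[-2:], mode="nearest")
        s = (hi + lo_up).mean(dim=(2, 3))         # global channel statistics
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)
        return hi * w + lo_up * (1 - w)           # attention-weighted fusion

fuse = ChannelAttentionFuse(32)
print(fuse(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 32, 32)).shape)
```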