Audio-driven talking head synthesis strives to generate lifelike video portraits from provided audio. The diffusion model, recognized for its superior quality and robust generalization, has been explored for this task. However, establishing a robust correspondence between temporal audio cues and the corresponding spatial facial expressions remains a significant challenge for diffusion-based talking head generation. To bridge this gap, we present DreamHead, a hierarchical diffusion framework that learns spatial-temporal correspondences in talking head synthesis without compromising the model's intrinsic quality and adaptability. DreamHead learns to predict dense facial landmarks from audio as intermediate signals to model the spatial and temporal correspondences. Specifically, a first hierarchy of audio-to-landmark diffusion is designed to predict temporally smooth and accurate landmark sequences from audio sequence signals. A second hierarchy of landmark-to-image diffusion then produces spatially consistent facial portrait videos by modeling the spatial correspondences between the dense facial landmarks and appearance. Extensive experiments show that the proposed DreamHead can effectively learn spatial-temporal consistency with the designed hierarchical diffusion and produce high-fidelity audio-driven talking head videos for multiple identities.
https://arxiv.org/abs/2409.10281
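The abstract above describes a two-hierarchy cascade: an audio-to-landmark diffusion model followed by a landmark-to-image diffusion model. The sketch below is only a minimal illustration of that cascade under assumptions of mine, not the authors' architecture: plain DDPM ancestral sampling with toy MLP denoisers, a made-up 128-d audio embedding, 478 dense 2D landmarks, and a 64x64 grayscale frame.

```python
import torch
import torch.nn as nn

class EpsModel(nn.Module):
    """Toy noise predictor conditioned on a context vector; stands in for the
    audio-to-landmark and landmark-to-image denoisers named in the abstract."""
    def __init__(self, x_dim, cond_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x, cond, t):
        t = t.float().unsqueeze(-1) / 1000.0          # crude timestep embedding
        return self.net(torch.cat([x, cond, t], dim=-1))

@torch.no_grad()
def ddpm_sample(model, cond, x_dim, steps=50):
    """Plain DDPM ancestral sampling with a linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.shape[0], x_dim)
    for t in reversed(range(steps)):
        eps = model(x, cond, torch.full((cond.shape[0],), t))
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    return x

# Hierarchy 1: audio features -> dense landmark sequence (flattened here).
audio_feat = torch.randn(1, 128)                      # assumed per-window audio embedding
a2l = EpsModel(x_dim=478 * 2, cond_dim=128)           # 478 dense 2D landmarks (assumption)
landmarks = ddpm_sample(a2l, audio_feat, x_dim=478 * 2)

# Hierarchy 2: landmarks -> a portrait frame (toy 64x64 grayscale).
l2i = EpsModel(x_dim=64 * 64, cond_dim=478 * 2)
frame = ddpm_sample(l2i, landmarks, x_dim=64 * 64).view(1, 64, 64)
print(frame.shape)
```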
Audio-driven talking face generation is a widely researched topic due to its high applicability. Reconstructing a talking face from audio contributes significantly to fields such as education, healthcare, online conversation, virtual assistants, and virtual reality. Early studies often focused solely on changing the mouth movements, which limited the practical use of the results. Recently, researchers have proposed reconstructing the entire face, including face pose, neck, and shoulders. To achieve this, they generate the face through intermediate landmarks; however, creating stable landmarks that align well with the audio remains a challenge. In this paper, we propose the KFusion of Dual-Domain model, a robust model that generates landmarks from audio. We separate the audio into two distinct domains to learn emotional information and facial context, then fuse them with a mechanism based on the KAN model. Our model demonstrates high efficiency compared to recent models. This lays the groundwork for future development of audio-driven talking face generation.
https://arxiv.org/abs/2409.05330
Facial analysis is a key component in a wide range of applications such as security, autonomous driving, entertainment, and healthcare. Despite the availability of various facial RGB datasets, the thermal modality, which plays a crucial role in life sciences, medicine, and biometrics, has been largely overlooked. To address this gap, we introduce T-FAKE, a new large-scale synthetic thermal face dataset with sparse and dense landmarks. To facilitate its creation, we propose a novel RGB2Thermal loss function that enables the transfer of thermal style to RGB faces. By utilizing the Wasserstein distance between thermal and RGB patches and a statistical analysis of clinical temperature distributions on faces, we ensure that the generated thermal images closely resemble real samples. Using RGB2Thermal style transfer based on this loss function, we create the T-FAKE dataset. Leveraging the novel T-FAKE dataset, probabilistic landmark prediction, and label adaptation networks, we demonstrate significant improvements in landmark detection on thermal images across different landmark conventions. Our models show excellent performance with both sparse 70-point and dense 478-point landmark annotations. Our code and models are available at this https URL.
https://arxiv.org/abs/2408.15127
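The Wasserstein term mentioned above can be made concrete with a small sketch: for two aligned single-channel patches of equal size, the 1-D Wasserstein distance between their pixel-value distributions equals the mean absolute difference of the sorted values, which stays differentiable in PyTorch. This illustrates only that one term on assumed toy inputs; the clinical temperature statistics used in the full RGB2Thermal loss are not modeled here.

```python
import torch

def patch_wasserstein(a, b, patch=16):
    """a, b: (H, W) single-channel images; mean 1-Wasserstein distance over
    aligned non-overlapping patches, computed from sorted pixel values."""
    pa = a.unfold(0, patch, patch).unfold(1, patch, patch).reshape(-1, patch * patch)
    pb = b.unfold(0, patch, patch).unfold(1, patch, patch).reshape(-1, patch * patch)
    return (pa.sort(dim=1).values - pb.sort(dim=1).values).abs().mean()

stylized = torch.rand(128, 128, requires_grad=True)   # RGB face being pushed toward thermal style
thermal_ref = torch.rand(128, 128)                    # toy thermal reference
loss = patch_wasserstein(stylized, thermal_ref)
loss.backward()                                       # gradients flow to the stylized image
print(loss.item())
```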
Early detection of autism, a neurodevelopmental disorder marked by social communication challenges, is crucial for timely intervention. Recent advancements have utilized naturalistic home videos captured via the mobile application GuessWhat. Through interactive games played between children and their guardians, GuessWhat has amassed over 3,000 structured videos from 382 children, both diagnosed with and without Autism Spectrum Disorder (ASD). This collection provides a robust dataset for training computer vision models to detect ASD-related phenotypic markers, including variations in emotional expression, eye contact, and head movements. We have developed a protocol to curate high-quality videos from this dataset, forming a comprehensive training set. Utilizing this set, we trained individual LSTM-based models using eye gaze, head positions, and facial landmarks as input features, achieving test AUCs of 86%, 67%, and 78%, respectively. To boost diagnostic accuracy, we applied late-fusion techniques to create ensemble models, improving the overall AUC to 90%. This approach also yielded more equitable results across different genders and age groups. Our methodology offers a significant step forward in the early detection of ASD by potentially reducing the reliance on subjective assessments and making early identification more accessible and equitable.
https://arxiv.org/abs/2408.13255
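As a rough illustration of the late-fusion step described above, the snippet below averages the probabilities of three independently trained per-feature models (stand-ins for the gaze, head-position, and landmark LSTMs) and scores the ensemble with AUC; the data, weights, and printed numbers are synthetic placeholders, not the study's results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                          # placeholder labels
# Stand-ins for per-model probabilities from the gaze, head-pose, and landmark models.
p_gaze = np.clip(0.6 * y_true + rng.normal(0.2, 0.3, 200), 0, 1)
p_head = np.clip(0.3 * y_true + rng.normal(0.35, 0.3, 200), 0, 1)
p_land = np.clip(0.5 * y_true + rng.normal(0.25, 0.3, 200), 0, 1)

for name, p in [("gaze", p_gaze), ("head pose", p_head), ("landmarks", p_land)]:
    print(f"{name:10s} AUC = {roc_auc_score(y_true, p):.3f}")

# Late fusion: a simple (optionally weighted) average of the probabilities.
weights = np.array([0.4, 0.2, 0.4])                            # illustrative weights
p_fused = np.average(np.stack([p_gaze, p_head, p_land]), axis=0, weights=weights)
print(f"{'fused':10s} AUC = {roc_auc_score(y_true, p_fused):.3f}")
```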
In videos containing spoofed faces, we may uncover the spoofing evidence based on photometric or dynamic abnormality, or even a combination of both. Prevailing face anti-spoofing (FAS) approaches generally concentrate on the single-frame scenario; however, such purely photometric-driven methods overlook the dynamic spoofing clues that may be exposed over time. This can lead FAS systems to incorrect judgments, especially in cases that are easy to distinguish dynamically but challenging to discern photometrically. To this end, we propose the Graph Guided Video Vision Transformer (G$^2$V$^2$former), which combines faces with facial landmarks for photometric and dynamic feature fusion. We factorize the attention into space and time and fuse them via a spatiotemporal block. Specifically, we design a novel temporal attention called Kronecker temporal attention, which has a wider receptive field and is beneficial for capturing dynamic information. Moreover, we leverage the low-semantic motion of facial landmarks to guide the high-semantic change of facial expressions, motivated by the observation that regions containing landmarks may reveal more dynamic clues. Extensive experiments on nine benchmark datasets demonstrate that our method achieves superior performance under various scenarios. The code will be released soon.
https://arxiv.org/abs/2408.07675
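The abstract factorizes attention into space and time but does not spell out the Kronecker temporal attention, so the block below is only a generic divided space-time attention sketch under my own assumptions: spatial attention within each frame, then temporal attention across frames for every spatial token, with illustrative shapes and dimensions.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """Generic divided space-time attention; not the paper's Kronecker design."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, tokens, dim)
        b, t, n, d = x.shape
        xs = self.norm1(x).reshape(b * t, n, d)                       # attend over tokens per frame
        x = x + self.spatial_attn(xs, xs, xs)[0].reshape(b, t, n, d)
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * n, t, d)   # attend over frames per token
        x = x + self.temporal_attn(xt, xt, xt)[0].reshape(b, n, t, d).permute(0, 2, 1, 3)
        return x

video_tokens = torch.randn(2, 8, 49, 256)   # toy: 2 clips, 8 frames, 7x7 patch tokens
out = FactorizedSpaceTimeBlock()(video_tokens)
print(out.shape)                            # torch.Size([2, 8, 49, 256])
```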
Drowsiness detection is essential for improving safety in areas such as transportation and workplace health. This study presents a real-time system designed to detect drowsiness using the Eye Aspect Ratio (EAR) and facial landmark detection techniques. The system leverages Dlib's pre-trained shape predictor model to accurately detect and monitor 68 facial landmarks, which are used to compute the EAR. By establishing a threshold for the EAR, the system identifies when the eyes are closed, indicating potential drowsiness. The process involves capturing a live video stream, detecting faces in each frame, extracting eye landmarks, and calculating the EAR to assess alertness. Our experiments show that the system reliably detects drowsiness with high accuracy while maintaining low computational demands. This study offers a strong solution for real-time drowsiness detection, with promising applications in driver monitoring and workplace safety. Future research will investigate incorporating additional physiological and contextual data to further enhance detection accuracy and reliability.
https://arxiv.org/abs/2408.05836
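For reference, the EAR used above follows Soukupová and Čech's formula, EAR = (||p2−p6|| + ||p3−p5||) / (2||p1−p4||), with the six eye points taken from Dlib's 68-point model (indices 36–41 and 42–47). The sketch below shows that computation with a hypothetical landmark stream and an illustrative threshold; it is not the study's exact implementation.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: (6, 2) array of eye landmarks ordered p1..p6."""
    a = np.linalg.norm(eye[1] - eye[5])      # ||p2 - p6||
    b = np.linalg.norm(eye[2] - eye[4])      # ||p3 - p5||
    c = np.linalg.norm(eye[0] - eye[3])      # ||p1 - p4||
    return (a + b) / (2.0 * c)

def landmark_stream(n_frames=30):
    """Hypothetical stand-in for per-frame (68, 2) landmarks from Dlib's shape predictor."""
    rng = np.random.default_rng(0)
    for _ in range(n_frames):
        yield rng.uniform(0, 100, size=(68, 2))

EAR_THRESHOLD = 0.21      # illustrative; tune on real footage
CONSEC_FRAMES = 15        # eyes must stay closed this many consecutive frames

closed_for = 0
for landmarks in landmark_stream():
    left, right = landmarks[36:42], landmarks[42:48]
    ear = (eye_aspect_ratio(left) + eye_aspect_ratio(right)) / 2.0
    closed_for = closed_for + 1 if ear < EAR_THRESHOLD else 0
    if closed_for >= CONSEC_FRAMES:
        print("Drowsiness alert")
```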
Audio-driven talking face video generation has attracted increasing attention due to its huge industrial potential. Some previous methods focus on learning a direct mapping from audio to visual content. Despite progress, they often struggle with the ambiguity of the mapping process, leading to flawed results. An alternative strategy uses facial structural representations (e.g., facial landmarks) as intermediaries. This multi-stage approach better preserves the appearance details but suffers from error accumulation due to the independent optimization of the different stages. Moreover, most previous methods rely on generative adversarial networks, which are prone to training instability and mode collapse. To address these challenges, our study proposes a novel landmark-based diffusion model for talking face generation, which leverages facial landmarks as intermediate representations while enabling end-to-end optimization. Specifically, we first establish the less ambiguous mapping from audio to the landmark motion of the lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks via differentiable cross-attention, which enables end-to-end optimization for improved lip synchronization. In addition, TalkFormer employs implicit feature warping to align the reference image features with the target motion, preserving more appearance details. Extensive experiments demonstrate that our approach can synthesize high-fidelity and lip-synced talking face videos, preserving more subject appearance details from the reference image.
https://arxiv.org/abs/2408.05416
Depression is a prevalent mental health disorder that significantly impacts individuals' lives and well-being. Early detection and intervention are crucial for effective treatment and management of depression. Recently, many end-to-end deep learning methods have leveraged facial expression features for automatic depression detection. However, most current methods overlook the temporal dynamics of facial expressions. Although very recent 3DCNN methods remedy this gap, they introduce higher computational cost due to their CNN-based backbones and redundant facial features. To address these limitations, we consider the temporal correlation of facial expressions and propose a novel framework called FacialPulse, which recognizes depression with high accuracy and speed. The Facial Motion Modeling Module (FMMM) in FacialPulse harnesses bidirectional processing and proficiently addresses long-term dependencies to fully capture temporal features. Since the FMMM has parallel processing capabilities and a gate mechanism that mitigates gradient vanishing, it also significantly boosts training speed. In addition, to effectively replace the original images with facial landmarks and thus reduce information redundancy, a Facial Landmark Calibration Module (FLCM) is designed to eliminate facial landmark errors and further improve recognition accuracy. Extensive experiments on the AVEC2014 dataset and the MMDA dataset (a depression dataset) demonstrate the superiority of FacialPulse in recognition accuracy and speed, with the average MAE (Mean Absolute Error) decreased by 21% compared to baselines and the recognition speed increased by 100% compared to state-of-the-art methods. Code is released at this https URL.
https://arxiv.org/abs/2408.03499
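The abstract does not detail the FMMM or FLCM internals, so the sketch below only illustrates the overall idea of regressing a depression score from a landmark sequence instead of raw frames, with a bidirectional GRU as a generic stand-in for the bidirectional temporal module and MAE as the reported metric; all names, shapes, and numbers here are assumptions.

```python
import torch
import torch.nn as nn

class LandmarkRegressor(nn.Module):
    """Toy landmark-sequence regressor; the GRU is a stand-in, not the FMMM."""
    def __init__(self, n_points=68, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_points * 2, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, seq):                     # seq: (batch, frames, 68, 2)
        b, t, n, _ = seq.shape
        out, _ = self.rnn(seq.reshape(b, t, n * 2))
        return self.head(out.mean(dim=1)).squeeze(-1)

model = LandmarkRegressor()
seq = torch.randn(4, 120, 68, 2)                # 4 clips of 120 frames of 2D landmarks
pred = model(seq)
target = torch.tensor([10.0, 25.0, 5.0, 40.0])  # toy depression scores
mae = (pred - target).abs().mean()              # the MAE metric reported in the abstract
print(mae.item())
```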
Audio-driven talking head generation is a significant and challenging task applicable to various fields such as virtual avatars, film production, and online conferences. However, existing GAN-based models emphasize generating well-synchronized lip shapes but overlook the visual quality of generated frames, while diffusion-based models prioritize generating high-quality frames but neglect lip shape matching, resulting in jittery mouth movements. To address these problems, we introduce a two-stage diffusion-based model. The first stage generates synchronized facial landmarks from the given speech. In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to mitigate mouth jitter and generate high-fidelity, well-synchronized, and temporally coherent talking head videos. Extensive experiments demonstrate that our model yields the best performance.
https://arxiv.org/abs/2408.01732
Human head detection, keypoint estimation, and 3D head model fitting are important tasks with many applications. However, traditional real-world datasets often suffer from bias, privacy, and ethical concerns, and they have been recorded in laboratory environments, which makes it difficult for trained models to generalize. Here, we introduce VGGHeads -- a large-scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. Our dataset comprises over 1 million high-resolution images, each annotated with detailed 3D head meshes, facial landmarks, and bounding boxes. Using this dataset, we introduce a new model architecture capable of simultaneous head detection and head mesh reconstruction from a single image in a single step. Through extensive experimental evaluations, we demonstrate that models trained on our synthetic data achieve strong performance on real images. Furthermore, the versatility of our dataset makes it applicable across a broad spectrum of tasks, offering a general and comprehensive representation of human heads. Additionally, we provide detailed information about the synthetic data generation pipeline, enabling it to be re-used for other tasks and domains.
https://arxiv.org/abs/2407.18245
In the context of artificial intelligence, the inherent human attribute of engaging in logical reasoning to facilitate decision-making is mirrored by the concept of explainability, which pertains to the ability of a model to provide a clear and interpretable account of how it arrived at a particular outcome. This study explores explainability techniques for binary deep neural architectures in the framework of emotion classification through video analysis. We investigate the optimization of the input features fed to binary emotion classifiers, combining facial landmark detection with an improved version of the Integrated Gradients explainability method. The main contribution of this paper is the employment of an innovative explainable artificial intelligence algorithm to understand the crucial facial landmark movements during emotional expression, and the use of this information to improve the performance of deep learning-based emotion classifiers. By means of explainability, we can optimize the number and position of the facial landmarks used as input features for facial emotion recognition, lowering the impact of noisy landmarks and thus increasing the accuracy of the developed models. To test the effectiveness of the proposed approach, we considered a set of deep binary models for emotion classification, trained initially with a complete set of facial landmarks that is progressively reduced through a suitable optimization procedure. The obtained results demonstrate the robustness of the proposed explainable approach in identifying the relevance of the different facial points for the different emotions, while also improving classification accuracy and reducing computational cost.
https://arxiv.org/abs/2407.14865
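As a point of reference for the attribution step described above, the snippet below is a minimal Integrated Gradients sketch over flattened landmark features of a toy classifier: IG_i = (x_i − x'_i) · ∫ ∂F/∂x_i along the straight path from a baseline x' to the input x, approximated with a Riemann sum. The paper's improved IG variant and its trained models are not reproduced; the classifier and dimensions are assumptions.

```python
import torch
import torch.nn as nn

def integrated_gradients(model, x, baseline=None, target=0, steps=64):
    """Riemann-sum approximation of Integrated Gradients for a single input x."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)          # (steps, *x.shape)
    path.requires_grad_(True)
    out = model(path.view(steps, -1))[:, target].sum()
    grads = torch.autograd.grad(out, path)[0]
    return (x - baseline) * grads.mean(dim=0)          # input delta x average gradient

# Toy binary emotion classifier over flattened 68 2-D landmarks (assumption).
model = nn.Sequential(nn.Linear(68 * 2, 64), nn.ReLU(), nn.Linear(64, 2))
landmarks = torch.randn(68, 2)
attributions = integrated_gradients(model, landmarks, target=1)
per_point = attributions.norm(dim=-1)                  # salience per landmark
print(per_point.topk(5).indices)                       # most influential points
```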
This paper introduces the Efficient Facial Landmark Detection (EFLD) model, specifically designed for edge devices constrained by power consumption and latency. EFLD features a lightweight backbone and a flexible detection head, each significantly enhancing operational efficiency on resource-constrained devices. To improve the model's robustness, we propose a cross-format training strategy that leverages a wide variety of publicly accessible datasets to enhance the model's generalizability and robustness without increasing inference costs. Our ablation study highlights the significant impact of each component on reducing computational demands, shrinking model size, and improving accuracy. EFLD demonstrates superior performance compared to competitors in the IEEE ICME 2024 Grand Challenges PAIR Competition, a contest focused on low-power, efficient, and accurate facial-landmark detection for embedded systems, showcasing its effectiveness in real-world facial landmark detection tasks.
https://arxiv.org/abs/2407.10228
The area of portrait image animation, propelled by audio input, has witnessed notable progress in the generation of lifelike and dynamic portraits. Conventional methods are limited to using either audio or facial keypoints to drive images into videos; while they can yield satisfactory results, certain issues remain. For instance, methods driven solely by audio can be unstable at times due to the relatively weak audio signal, while methods driven exclusively by facial keypoints, although more stable in driving, can produce unnatural results due to the excessive control imposed by keypoint information. To address these challenges, we introduce a novel approach named EchoMimic. EchoMimic is trained concurrently on both audio and facial landmarks. Through a novel training strategy, EchoMimic is capable of generating portrait videos not only from audio or facial landmarks individually, but also from a combination of audio and selected facial landmarks. EchoMimic has been comprehensively compared with alternative algorithms across various public datasets and our collected dataset, showing superior performance in both quantitative and qualitative evaluations. Additional visualizations and the source code are available on the EchoMimic project page.
https://arxiv.org/abs/2407.08136
This study presents a novel driver drowsiness detection system that combines deep learning techniques with the OpenCV framework. The system utilises facial landmarks extracted from the driver's face as input to Convolutional Neural Networks trained to recognise drowsiness patterns. The integration of OpenCV enables real-time video processing, making the system suitable for practical implementation. Extensive experiments on a diverse dataset demonstrate high accuracy, sensitivity, and specificity in detecting drowsiness. The proposed system has the potential to enhance road safety by providing timely alerts to prevent accidents caused by driver fatigue. This research contributes to advancing real-time driver monitoring systems and has implications for automotive safety and intelligent transportation systems. The successful application of deep learning techniques in this context opens up new avenues for future research in driver monitoring and vehicle safety. The implementation code for the paper is available at this https URL.
https://arxiv.org/abs/2406.15646
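A minimal OpenCV capture loop of the kind the abstract describes might look like the sketch below: grab webcam frames, detect a face, and hand the crop to a classifier. The Haar cascade face detector and the `classify` stub are stand-ins of mine; the paper's landmark extraction and trained CNN are not reproduced.

```python
import cv2

def classify(face_crop):
    """Hypothetical stand-in for the paper's landmark-based CNN drowsiness model."""
    return 0.0  # probability of drowsiness

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)                       # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.3, 5):
        if classify(gray[y:y + h, x:x + w]) > 0.5:
            cv2.putText(frame, "DROWSY", (x, y - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 0, 255), 2)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("monitor", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```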
Generic Face Image Quality Assessment (GFIQA) evaluates the perceptual quality of facial images, which is crucial in improving image restoration algorithms and selecting high-quality face images for downstream tasks. We present a novel transformer-based method for GFIQA, which is aided by two unique mechanisms. First, a Dual-Set Degradation Representation Learning (DSL) mechanism uses facial images with both synthetic and real degradations to decouple degradation from content, ensuring generalizability to real-world scenarios. This self-supervised method learns degradation features on a global scale, providing a robust alternative to conventional methods that use local patch information in degradation learning. Second, our transformer leverages facial landmarks to emphasize visually salient parts of a face image in evaluating its perceptual quality. We also introduce a balanced and diverse Comprehensive Generic Face IQA (CGFIQA-40k) dataset of 40K images carefully designed to overcome the biases, in particular the imbalances in skin tone and gender representation, in existing datasets. Extensive analysis and evaluation demonstrate the robustness of our method, marking a significant improvement over prior methods.
https://arxiv.org/abs/2406.09622
Vivid talking face generation holds immense potential for applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore the crucial alignment between emotion and facial cues, including expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address these issues, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose the task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial cues, including normalized landmarks that represent expression, gaze, and head pose. These cues are subsequently reassembled into relocated facial landmarks. In the second step, the relocated landmarks are mapped to latent keypoints using self-supervised learning and then fed into a pretrained model to create high-quality face images. Extensive experiments on the MEAD dataset demonstrate that our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.
https://arxiv.org/abs/2406.07895
To address this challenge, we introduce CattleFace-RGBT, an RGB-T cattle facial landmark dataset consisting of 2,300 RGB-T image pairs, for a total of 4,600 images. Creating a landmark dataset is time-consuming, but AI-assisted annotation can help. However, applying AI to thermal images is challenging: direct training on thermal data yields suboptimal results, and RGB-thermal alignment is infeasible because of the differing camera views. Therefore, we opt to transfer models trained on RGB to thermal images and refine them with our AI-assisted annotation tool, following a semi-automatic annotation approach. Accurately localizing facial keypoints on both RGB and thermal images enables us not only to discern the cattle's respiratory signs but also to measure temperatures to assess the animal's thermal state. To the best of our knowledge, this is the first cattle facial landmark dataset on RGB-T images. We benchmark the CattleFace-RGBT dataset across various backbone architectures, with the objective of establishing baselines for future research, analysis, and comparison. The dataset and models are at this https URL.
https://arxiv.org/abs/2406.03431
In this paper, we examine three important issues in the practical use of state-of-the-art facial landmark detectors and show how a combination of specific architectural modifications can directly improve their accuracy and temporal stability. First, many facial landmark detectors require face normalization as a preprocessing step, which is accomplished by a separately trained neural network that crops and resizes the face in the input image. There is no guarantee that this pre-trained network performs the optimal face normalization for landmark detection. We instead analyze the use of a spatial transformer network that is trained alongside the landmark detector in an unsupervised manner, jointly learning optimal face normalization and landmark detection. Second, we show that modifying the output head of the landmark predictor to infer landmarks in a canonical 3D space can further improve accuracy. To convert the predicted 3D landmarks into screen space, we additionally predict the camera intrinsics and head pose from the input image. As a side benefit, this makes it possible to predict the 3D face shape from a given image using only 2D landmarks as supervision, which is useful, among other things, for determining landmark visibility. Finally, when training a landmark detector on multiple datasets at the same time, annotation inconsistencies across datasets force the network to produce a suboptimal average. We propose to add a semantic correction network to address this issue. This additional lightweight neural network is trained alongside the landmark detector without requiring any additional supervision. While the insights of this paper can be applied to most common landmark detectors, we specifically target a recently proposed continuous 2D landmark detector to demonstrate how each of our additions leads to meaningful improvements over the state of the art on standard benchmarks.
https://arxiv.org/abs/2405.20117
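The screen-space conversion mentioned above boils down to a pinhole projection of the canonical 3D landmarks through the predicted head pose and intrinsics, x ~ K(RX + t) followed by the perspective divide. The sketch below uses made-up intrinsics, pose, and landmarks purely to show that mapping; it is not the paper's code.

```python
import numpy as np

def project_landmarks(X3d, K, R, t):
    """X3d: (N, 3) canonical landmarks; returns (N, 2) pixel coordinates."""
    Xc = X3d @ R.T + t                # transform to camera space with head pose
    uvw = Xc @ K.T                    # apply pinhole intrinsics
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide

K = np.array([[800.0, 0.0, 128.0],
              [0.0, 800.0, 128.0],
              [0.0, 0.0, 1.0]])       # toy intrinsics for a 256x256 face crop
yaw = np.deg2rad(15.0)                # illustrative head rotation about the y-axis
R = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
              [0.0, 1.0, 0.0],
              [-np.sin(yaw), 0.0, np.cos(yaw)]])
t = np.array([0.0, 0.0, 0.5])         # head half a metre in front of the camera

canonical = np.random.default_rng(0).uniform(-0.1, 0.1, size=(68, 3))
print(project_landmarks(canonical, K, R, t)[:3])
```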
The appearance of a face can be greatly altered by growing a beard and mustache. The facial hairstyles in a pair of images can cause marked changes to the impostor distribution and the genuine distribution. Also, different distributions of facial hairstyle across demographics could cause a false impression of relative accuracy across demographics. We first show that, even though larger training sets boost recognition accuracy on all facial hairstyles, accuracy variations caused by facial hairstyles persist regardless of the size of the training set. Then, we analyze the impact of having different fractions of the training data represent facial hairstyles. We created balanced training sets using a set of identities available in Webface42M that have both clean-shaven and facial hair images. We find that, even when a face recognition model is trained with a balanced clean-shaven / facial hair training set, accuracy variation on the test data does not diminish. Next, data augmentation is employed to further investigate the effect of the facial hair distribution in the training data, by manipulating facial hair pixels with the help of facial landmark points and a facial hair segmentation model. Our results show that facial hair causes an accuracy gap between clean-shaven and facial hair images, and that this impact can be significantly different for African-Americans and Caucasians.
https://arxiv.org/abs/2405.20062
3D facial landmark localization has proven to be of particular use for applications such as face tracking, 3D face modeling, and image-based 3D face reconstruction. In the supervised learning case, such methods usually rely on 3D landmark datasets derived from 3DMM-based registration, which often lack the spatial definition alignment of hand-labeled human consensus (e.g., how are eyebrow landmarks defined?). This creates a gap between landmark datasets generated from high-quality 2D human labels and from 3DMMs, and it ultimately limits their effectiveness. To address this issue, we introduce a novel semi-supervised learning approach that learns 3D landmarks by directly lifting (visible) hand-labeled 2D landmarks and ensures better definition alignment, without the need for 3D landmark datasets. To lift 2D landmarks to 3D, we leverage 3D-aware GANs for better multi-view consistency learning and in-the-wild multi-frame videos for robust cross-generalization. Empirical experiments demonstrate that our method not only achieves better definition alignment between 2D-3D landmarks but also outperforms other supervised 3D landmark localization methods on both 3DMM-labeled and photogrammetric ground truth evaluation datasets. Project Page: this https URL
https://arxiv.org/abs/2405.19646