Facial expression recognition is a crucial component in enhancing human-computer interaction and developing emotion-aware systems. Real-time detection and interpretation of facial expressions have become increasingly important for various applications, from user experience personalization to intelligent surveillance systems. This study presents a novel approach to real-time sequential facial expression recognition using deep learning and geometric features. The proposed method utilizes MediaPipe FaceMesh for rapid and accurate facial landmark detection. Geometric features, including Euclidean distances and angles, are extracted from these landmarks. Temporal dynamics are incorporated by analyzing feature differences between consecutive frames, enabling the detection of onset, apex, and offset phases of expressions. For classification, a ConvLSTM1D network followed by multilayer perceptron blocks is employed. The method's performance was evaluated on multiple publicly available datasets, including CK+, Oulu-CASIA (VIS and NIR), and MMI. Accuracies of 93%, 79%, 77%, and 68% were achieved respectively. Experiments with composite datasets were also conducted to assess the model's generalization capabilities. The approach demonstrated real-time applicability, processing approximately 165 frames per second on consumer-grade hardware. This research contributes to the field of facial expression analysis by providing a fast, accurate, and adaptable solution. The findings highlight the potential for further advancements in emotion-aware technologies and personalized user experiences, paving the way for more sophisticated human-computer interaction systems. To facilitate further research in this field, the complete source code for this study has been made publicly available on GitHub: this https URL.
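The geometric features described above (landmark distances, angles, and their frame-to-frame differences) can be sketched in plain Python. The landmark pair/triple choices and the feature set here are illustrative assumptions, not the paper's actual configuration:

```python
import math

def euclidean(p, q):
    # straight-line distance between two 2D landmark points
    return math.hypot(p[0] - q[0], p[1] - q[1])

def angle_at(vertex, a, b):
    # angle (radians) at `vertex` formed by the rays toward landmarks a and b
    v1 = (a[0] - vertex[0], a[1] - vertex[1])
    v2 = (b[0] - vertex[0], b[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def frame_features(landmarks, pairs, triples):
    # one feature vector per frame: selected distances plus selected angles
    feats = [euclidean(landmarks[i], landmarks[j]) for i, j in pairs]
    feats += [angle_at(landmarks[v], landmarks[a], landmarks[b])
              for v, a, b in triples]
    return feats

def temporal_deltas(prev_feats, cur_feats):
    # frame-to-frame feature differences; their sign and magnitude over time
    # hint at the onset, apex, and offset phases of an expression
    return [c - p for p, c in zip(prev_feats, cur_feats)]
```

A sequence of such per-frame feature vectors (and their deltas) would then be fed to the ConvLSTM1D classifier.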
https://arxiv.org/abs/2512.05669
Alcohol consumption is a significant public health concern and a major cause of accidents and fatalities worldwide. This study introduces a novel video-based facial sequence analysis approach dedicated to the detection of alcohol intoxication. The method integrates facial landmark analysis via a Graph Attention Network (GAT) with spatiotemporal visual features extracted using a 3D ResNet. These features are dynamically fused with adaptive prioritization to enhance classification performance. Additionally, we introduce a curated dataset comprising 3,542 video segments derived from 202 individuals to support training and evaluation. Our model is compared against two baselines: a custom 3D-CNN and a VGGFace+LSTM architecture. Experimental results show that our approach achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming prior methods. The findings demonstrate the model's potential for practical deployment in public safety systems for non-invasive, reliable alcohol intoxication detection.
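The dynamic fusion with adaptive prioritization could look roughly like the following gated combination of the two branch features. The scalar sigmoid gate, the concatenation-based gating input, and the parameter shapes are assumptions for illustration, not the paper's actual fusion module:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def adaptive_fusion(f_gat, f_resnet, w_gate, b_gate):
    # a scalar gate computed from the concatenated features decides, per
    # sample, how much to trust the landmark (GAT) branch vs. the
    # spatiotemporal (3D ResNet) branch; w_gate/b_gate would be learned
    concat = list(f_gat) + list(f_resnet)
    g = sigmoid(sum(w * x for w, x in zip(w_gate, concat)) + b_gate)
    return [g * a + (1.0 - g) * b for a, b in zip(f_gat, f_resnet)]
```

With zero gate parameters the gate is 0.5 and the fusion degenerates to a plain average of the two branches.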
https://arxiv.org/abs/2512.04536
Generalizing deepfake detection to unseen manipulations remains a key challenge. A recent approach to tackle this issue is to train a network with pristine face images that have been manipulated with hand-crafted artifacts to extract more generalizable clues. While effective for static images, extending this to the video domain is an open issue. Existing methods model temporal artifacts as frame-to-frame instabilities, overlooking a key vulnerability: the violation of natural motion dependencies between different facial regions. In this paper, we propose a synthetic video generation method that creates training data with subtle kinematic inconsistencies. We train an autoencoder to decompose facial landmark configurations into motion bases. By manipulating these bases, we selectively break the natural correlations in facial movements and introduce these artifacts into pristine videos via face morphing. A network trained on our data learns to spot these sophisticated biomechanical flaws, achieving state-of-the-art generalization results on several popular benchmarks.
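A minimal sketch of the basis-manipulation idea: decompose a landmark configuration into motion bases, rescale one component to break its natural correlation with the others, and reconstruct. The linear, orthonormal bases here stand in for the paper's learned autoencoder and are purely illustrative:

```python
def decompose(config, bases):
    # coefficients of a flattened landmark configuration w.r.t. a set of
    # orthonormal motion bases (one dot product per basis)
    return [sum(c * b for c, b in zip(config, basis)) for basis in bases]

def reconstruct(coeffs, bases):
    out = [0.0] * len(bases[0])
    for w, basis in zip(coeffs, bases):
        for k, b in enumerate(basis):
            out[k] += w * b
    return out

def perturb_basis(config, bases, idx, scale):
    # rescale a single motion component, selectively breaking its natural
    # correlation with the other components, then rebuild the configuration
    coeffs = decompose(config, bases)
    coeffs[idx] *= scale
    return reconstruct(coeffs, bases)
```

The perturbed landmark configurations would then be blended into pristine videos via face morphing to create the training artifacts.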
https://arxiv.org/abs/2512.04175
We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total, among which more than 52,000 are longer than 10 frames, including 68 manually reviewed trajectories that span the entire 150 frames. To the best of our knowledge, PoreTrack3D is the first benchmark dataset to capture the trajectories of both traditional facial landmarks and pore-scale keypoints, advancing the study of fine-grained facial expressions through the analysis of subtle skin-surface motion. We systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D, establishing the first performance baseline in this domain. Overall, the pipeline developed for this benchmark dataset's creation establishes a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction. Our dataset is publicly available at: this https URL
https://arxiv.org/abs/2512.02648
Detecting driver drowsiness reliably is crucial for enhancing road safety and supporting advanced driver assistance systems (ADAS). We introduce the Eyelid Angle (ELA), a novel, reproducible metric of eye openness derived from 3D facial landmarks. Unlike conventional binary eye state estimators or 2D measures, such as the Eye Aspect Ratio (EAR), the ELA provides a stable geometric description of eyelid motion that is robust to variations in camera angle. Using the ELA, we design a blink detection framework that extracts temporal characteristics, including the closing, closed, and reopening durations, which are shown to correlate with drowsiness levels. To address the scarcity and risk of collecting natural drowsiness data, we further leverage ELA signals to animate rigged avatars in Blender 3D, enabling the creation of realistic synthetic datasets with controllable noise, camera viewpoints, and blink dynamics. Experimental results on public driver monitoring datasets demonstrate that the ELA offers lower variance under viewpoint changes compared to EAR and achieves accurate blink detection. At the same time, synthetic augmentation expands the diversity of training data for drowsiness recognition. Our findings highlight the ELA as both a reliable biometric measure and a powerful tool for generating scalable datasets in driver state monitoring.
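The temporal blink characteristics (closing, closed, and reopening durations) could be segmented from an ELA trace with a simple threshold-driven state machine. The two thresholds and the single-blink assumption are illustrative, not the paper's actual procedure:

```python
def blink_phases(ela, closed_thr, open_thr, fps):
    # segment one blink in an eyelid-angle trace into closing / closed /
    # reopening durations (seconds); assumes `ela` covers a single blink
    # that starts and ends with the eye open
    closing = closed = reopening = 0
    state = "open"
    for v in ela:
        if state == "open" and v < open_thr:
            state = "closing"        # eyelid starts to drop
        if state == "closing":
            if v <= closed_thr:
                state = "shut"       # eyelid fully closed
            else:
                closing += 1
        if state == "shut":
            if v > closed_thr:
                state = "reopening"  # eyelid starts to rise
            else:
                closed += 1
        if state == "reopening":
            if v >= open_thr:
                break                # eye fully open again
            reopening += 1
    return closing / fps, closed / fps, reopening / fps
```

Unusually long closed durations or slow reopening would be the kind of cue correlated with drowsiness.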
https://arxiv.org/abs/2511.19519
Driver fatigue is one of the major causes of road accidents, leading to thousands of fatalities and injuries every year. This study presents the development of a Driver Drowsiness Detection System designed to improve road safety by alerting drivers who show signs of drowsiness. The system relies on a standard webcam that tracks the driver's facial features, with the main emphasis on analyzing eye movements using the Eye Aspect Ratio (EAR) method. MediaPipe Face Mesh, a lightweight framework that identifies facial landmarks with high accuracy and efficiency, is used to support real-time operation. The system detects prolonged eye closures or an abnormally low blink rate, both manifestations of drowsiness, and sounds an alert to bring the driver's attention back. By combining the computational power of OpenCV for image processing with MediaPipe for face detection, the system achieves a high-performance, low-cost driver monitoring solution. Experimental analyses on test data indicate that the system is highly accurate and responsive, confirming its suitability as a component of current Advanced Driver Assistance Systems (ADAS).
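The Eye Aspect Ratio itself is a standard formula over six eye landmarks, which a short sketch can make concrete (the landmark ordering follows the common p1..p6 convention, with p1/p4 the horizontal corners):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def eye_aspect_ratio(eye):
    # eye: six 2D landmarks (p1..p6); p1/p4 are the horizontal corners,
    # (p2, p6) and (p3, p5) the upper/lower vertical pairs
    # EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|)
    p1, p2, p3, p4, p5, p6 = eye
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))
```

The EAR drops toward zero as the eye closes, so counting consecutive low-EAR frames is the usual trigger for the drowsiness alarm.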
https://arxiv.org/abs/2511.13618
A long road trip can be fun for drivers. However, driving for days to meet stringent deadlines at distant destinations can be tedious, forcing drivers to cover extra miles and put in extra hours daily without sufficient rest and breaks. Such conditions occasionally trigger drowsiness while driving. Drowsy driving can be life-threatening to any individual and endangers other drivers' safety; therefore, a real-time detection system is needed. To identify fatigued facial characteristics in drivers and trigger an alarm immediately, this research develops a real-time driver drowsiness detection system utilizing deep convolutional neural networks (DCNNs). The proposed and implemented model takes real-time facial images of a driver from a live camera and uses the Python-based OpenCV library to examine the images for facial landmarks such as eye openness and yawn-like mouth movements. The DCNN framework then gathers the data and uses a pre-trained model to detect driver drowsiness from these facial landmarks. If the driver is identified as drowsy, the system issues a continuous real-time alert embedded in the Smart Car system. By potentially saving innocent lives on the roadways, the proposed technique offers a non-invasive, inexpensive, and cost-effective way to identify drowsiness. Our DCNN-based drowsiness detection model achieves classification accuracies of 99.6% and 97% on the NTHU-DDD dataset and the Yawn-Eye-Dataset, respectively.
https://arxiv.org/abs/2511.12438
To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
https://arxiv.org/abs/2511.09915
Diffusion-based approaches have recently achieved strong results in face swapping, offering improved visual quality over traditional GAN-based methods. However, even state-of-the-art models often suffer from fine-grained artifacts and poor identity preservation, particularly under challenging poses and expressions. A key limitation of existing approaches is their failure to meaningfully leverage 3D facial structure, which is crucial for disentangling identity from pose and expression. In this work, we propose DiffSwap++, a novel diffusion-based face-swapping pipeline that incorporates 3D facial latent features during training. By guiding the generation process with 3D-aware representations, our method enhances geometric consistency and improves the disentanglement of facial identity from appearance attributes. We further design a diffusion architecture that conditions the denoising process on both identity embeddings and facial landmarks, enabling high-fidelity and identity-preserving face swaps. Extensive experiments on CelebA, FFHQ, and CelebV-Text demonstrate that DiffSwap++ outperforms prior methods in preserving source identity while maintaining target pose and expression. Additionally, we introduce a biometric-style evaluation and conduct a user study to further validate the realism and effectiveness of our approach. Code will be made publicly available at this https URL
https://arxiv.org/abs/2511.05575
Recent advances in deep learning have significantly improved facial landmark detection. However, existing facial landmark detection datasets often define different numbers of landmarks, and most mainstream methods can only be trained on a single dataset. This limits model generalization across datasets and hinders the development of a unified model. To address this issue, we propose Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework that explicitly enhances dataset-specific facial structural representations (i.e., prototypes). Proto-Former overcomes the limitations of single-dataset training by enabling joint training across multiple datasets within a unified architecture. Specifically, Proto-Former comprises two key components: an Adaptive Prototype-Aware Encoder (APAE) that performs adaptive feature extraction and learns prototype representations, and a Progressive Prototype-Aware Decoder (PPAD) that refines these prototypes to generate prompts that guide the model's attention to key facial regions. Furthermore, we introduce a novel Prototype-Aware (PA) loss, which achieves optimal path finding by constraining the selection weights of prototype experts. This loss function effectively resolves the instability of prototype-expert addressing during multi-dataset training, alleviates gradient conflicts, and enables the extraction of more accurate facial structure features. Extensive experiments on widely used benchmark datasets demonstrate that our Proto-Former achieves superior performance compared to existing state-of-the-art methods. The code is publicly available at: this https URL.
https://arxiv.org/abs/2510.15338
Facial Landmark Detection (FLD) in thermal imagery is critical for applications in challenging lighting conditions, but it is hampered by the lack of rich visual cues. Conventional cross-modal solutions, like feature fusion or image translation from RGB data, are often computationally expensive or introduce structural artifacts, limiting their practical deployment. To address this, we propose Multi-Level Cross-Modal Knowledge Distillation (MLCM-KD), a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression to create both accurate and efficient thermal FLD models. A central challenge during knowledge transfer is the profound modality gap between RGB and thermal data, where traditional unidirectional distillation fails to enforce semantic consistency across disparate feature spaces. To overcome this, we introduce Dual-Injected Knowledge Distillation (DIKD), a bidirectional mechanism designed specifically for this task. DIKD establishes a connection between modalities: it not only guides the thermal student with rich RGB features but also validates the student's learned representations by feeding them back into the frozen teacher's prediction head. This closed-loop supervision forces the student to learn modality-invariant features that are semantically aligned with the teacher, ensuring a robust and profound knowledge transfer. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods while drastically reducing computational overhead.
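The closed-loop supervision of DIKD can be sketched as two loss terms: align the student's features with the teacher's, and additionally require the frozen teacher head to predict correctly from the student's representation. The MSE losses and the callable teacher head here are simplifying assumptions, not the paper's exact objective:

```python
def mse(a, b):
    # mean squared error between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dikd_loss(student_feat, teacher_feat, teacher_head, labels):
    # forward path: pull the thermal student's features toward the
    # rich RGB teacher's features
    feat_loss = mse(student_feat, teacher_feat)
    # feedback path: the *frozen* teacher prediction head must still
    # produce correct outputs from the student's representation,
    # closing the supervision loop and enforcing semantic alignment
    pred_loss = mse(teacher_head(student_feat), labels)
    return feat_loss + pred_loss
```

Because the teacher head is frozen, the feedback term can only be reduced by making the student's features modality-invariant and semantically consistent with the teacher's feature space.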
https://arxiv.org/abs/2510.11128
Manual annotation of anatomical landmarks on 3D facial scans is a time-consuming and expertise-dependent task, yet it remains critical for clinical assessments, morphometric analysis, and craniofacial research. While several deep learning methods have been proposed for facial landmark localization, most focus on pseudo-landmarks or require complex input representations, limiting their clinical applicability. This study presents a fully automated deep learning pipeline (PAL-Net) for localizing 50 anatomical landmarks on stereo-photogrammetry facial models. The method combines coarse alignment, region-of-interest filtering, and an initial approximation of landmarks with a patch-based pointwise CNN enhanced by attention mechanisms. Trained and evaluated on 214 annotated scans from healthy adults, PAL-Net achieved a mean localization error of 3.686 mm and preserves relevant anatomical distances with a 2.822 mm average error, comparable to intra-observer variability. To assess generalization, the model was further evaluated on 700 subjects from the FaceScape dataset, achieving a point-wise error of 0.41 mm and a distance-wise error of 0.38 mm. Compared to existing methods, PAL-Net offers a favorable trade-off between accuracy and computational cost. While performance degrades in regions with poor mesh quality (e.g., ears, hairline), the method demonstrates consistent accuracy across most anatomical regions. PAL-Net generalizes effectively across datasets and facial regions, outperforming existing methods in both point-wise and structural evaluations. It provides a lightweight, scalable solution for high-throughput 3D anthropometric analysis, with potential to support clinical workflows and reduce reliance on manual annotation. Source code can be found at this https URL
https://arxiv.org/abs/2510.00910
The rapid development of deepfake generation techniques necessitates robust face forgery detection algorithms. While methods based on Convolutional Neural Networks (CNNs) and Transformers are effective, there is still room for improvement in modeling the highly complex and non-linear nature of forgery artifacts. To address this issue, we propose a novel detection method based on the Kolmogorov-Arnold Network (KAN). By replacing fixed activation functions with learnable splines, our KAN-based approach is better suited to this challenge. Furthermore, to guide the network's focus towards critical facial areas, we introduce a Landmark-assisted Adaptive Kolmogorov-Arnold Network (LAKAN) module. This module uses facial landmarks as a structural prior to dynamically generate the internal parameters of the KAN, creating an instance-specific signal that steers a general-purpose image encoder towards the most informative facial regions with artifacts. This core innovation creates a powerful combination between geometric priors and the network's learning process. Extensive experiments on multiple public datasets show that our proposed method achieves superior performance.
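The core KAN idea of replacing a fixed activation with a learnable spline can be illustrated with a piecewise-linear stand-in. Actual KANs typically use B-splines, and in LAKAN the spline parameters would be generated dynamically from landmarks; here `knots` and `values` are simply assumed to be trainable:

```python
def pwl_activation(x, knots, values):
    # piecewise-linear interpolation over (knot, value) pairs: a simple
    # stand-in for the learnable spline activation on a KAN edge, where
    # `values` would be adjusted by gradient descent instead of being fixed
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    for k in range(1, len(knots)):
        if x <= knots[k]:
            t = (x - knots[k - 1]) / (knots[k] - knots[k - 1])
            return (1 - t) * values[k - 1] + t * values[k]
```

Because every edge carries its own learnable curve, the network can shape highly non-linear responses to subtle forgery artifacts that a fixed ReLU-style activation would have to approximate with many more units.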
https://arxiv.org/abs/2510.00634
We propose a two-stage framework for audio-driven talking head generation with fine-grained expression control via facial Action Units (AUs). Unlike prior methods relying on emotion labels or implicit AU conditioning, our model explicitly maps AUs to 2D facial landmarks, enabling physically grounded, per-frame expression control. In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities. In the second stage, a diffusion-based synthesizer generates realistic, lip-synced videos conditioned on these landmarks and a reference image. This separation of motion and appearance improves expression accuracy, temporal stability, and visual realism. Experiments on the MEAD dataset show that our method outperforms state-of-the-art baselines across multiple metrics, demonstrating the effectiveness of explicit AU-to-landmark modeling for expressive talking head generation.
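As a toy illustration of explicit AU-to-landmark mapping, each AU could contribute an intensity-scaled displacement to every landmark. The paper learns this mapping with a variational motion generator, so this linear version is only a conceptual sketch:

```python
def au_to_landmarks(neutral, au_intensities, au_displacements):
    # neutral: list of (x, y) landmarks; au_displacements[j][i] is the
    # per-landmark offset contributed by AU j at full intensity
    out = []
    for i, (x, y) in enumerate(neutral):
        dx = sum(a * au_displacements[j][i][0]
                 for j, a in enumerate(au_intensities))
        dy = sum(a * au_displacements[j][i][1]
                 for j, a in enumerate(au_intensities))
        out.append((x + dx, y + dy))
    return out
```

Because the AUs act on 2D landmarks rather than on pixels directly, expression control stays physically grounded and can be specified per frame before the diffusion stage renders appearance.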
https://arxiv.org/abs/2509.19749
Driver drowsiness remains a critical factor in road accidents, accounting for thousands of fatalities and injuries each year. This paper presents a comprehensive evaluation of real-time, non-intrusive drowsiness detection methods, focusing on computer vision based YOLO (You Only Look Once) algorithms. A publicly available dataset, UTA-RLDD, was used, containing both awake and drowsy conditions and ensuring variability in gender, eyewear, illumination, and skin tone. Seven YOLO variants (v5s, v9c, v9t, v10n, v10l, v11n, v11l) are fine-tuned, with performance measured in terms of precision, recall, mAP@0.5, and mAP@0.5-0.95. Among these, YOLOv9c achieved the highest accuracy (0.986 mAP@0.5, 0.978 recall), while YOLOv11n struck the optimal balance between precision (0.954) and inference efficiency, making it highly suitable for embedded deployment. Additionally, we implement an Eye Aspect Ratio (EAR) approach using Dlib's facial landmarks, which, despite its low computational footprint, exhibits reduced robustness under pose variation and occlusions. Our findings illustrate clear trade-offs between accuracy, latency, and resource requirements, and offer practical guidelines for selecting or combining detection methods in autonomous driving and industrial safety applications.
https://arxiv.org/abs/2509.17498
We present Follow-Your-Emoji-Faster, an efficient diffusion-based framework for freestyle portrait animation driven by facial landmarks. The main challenges in this task are preserving the identity of the reference portrait, accurately transferring target expressions, and maintaining long-term temporal consistency while ensuring generation efficiency. To address identity preservation and accurate expression retargeting, we enhance Stable Diffusion with two key components: expression-aware landmarks as explicit motion signals, which improve motion alignment, support exaggerated expressions, and reduce identity leakage; and a fine-grained facial loss that leverages both expression and facial masks to better capture subtle expressions and faithfully preserve the reference appearance. With these components, our model supports controllable and expressive animation across diverse portrait types, including real faces, cartoons, sculptures, and animals. However, diffusion-based frameworks typically struggle to efficiently generate long-term stable animation results, which remains a core challenge in this task. To address this, we propose a progressive generation strategy for stable long-term animation, and introduce a Taylor-interpolated cache, achieving a 2.6X lossless acceleration. These two strategies ensure that our method produces high-quality results efficiently, making it user-friendly and accessible. Finally, we introduce EmojiBench++, a more comprehensive benchmark comprising diverse portraits, driving videos, and landmark sequences. Extensive evaluations on EmojiBench++ demonstrate that Follow-Your-Emoji-Faster achieves superior performance in both animation quality and controllability. The code, training dataset, and benchmark can be found at this https URL.
https://arxiv.org/abs/2509.16630
Facial expression recognition (FER) is a crucial task in computer vision with a wide range of applications, including human-computer interaction, surveillance, and assistive technologies. However, challenges such as occlusion, expression variability, and lack of interpretability hinder the performance of traditional FER systems. Graph Neural Networks (GNNs) offer a powerful alternative by modeling relational dependencies between facial landmarks, enabling structured and interpretable learning. In this paper, we propose GLaRE, a novel Graph-based Landmark Region Embedding network for emotion recognition. Facial landmarks are extracted using 3D facial alignment, and a quotient graph is constructed via hierarchical coarsening to preserve spatial structure while reducing complexity. Our method achieves 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming several existing baselines. Additionally, ablation studies demonstrate that region-level embeddings from quotient graphs contribute to improved prediction performance.
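The quotient-graph construction can be sketched as collapsing landmark nodes into region supernodes under a given partition. The region names and the single coarsening level here are illustrative; the paper coarsens hierarchically:

```python
def quotient_graph(edges, partition):
    # collapse landmark nodes into region supernodes: two regions are
    # connected iff some original landmark edge crosses between them
    q_edges = set()
    for u, v in edges:
        ru, rv = partition[u], partition[v]
        if ru != rv:
            q_edges.add((min(ru, rv), max(ru, rv)))
    return sorted(q_edges)
```

The coarsened graph keeps the spatial relations between facial regions while having far fewer nodes than the raw landmark graph, which is what makes the region-level embeddings cheap to compute.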
https://arxiv.org/abs/2508.20579
Facial landmark detection is an important task in computer vision with numerous applications, such as head pose estimation, expression analysis, and face swapping. Heatmap regression-based methods have been widely used to achieve state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. Since argmax is not differentiable, these methods use a differentiable approximation, Soft-argmax, to enable end-to-end training of deep networks. In this work, we revisit this long-standing choice of using Soft-argmax and demonstrate that it is not the only way to achieve strong performance. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), converging 2.2x faster during training while maintaining better or competitive accuracy. Our code is available here: this https URL.
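The Soft-argmax the paper revisits replaces the hard argmax over a heatmap with a softmax-weighted average of pixel coordinates, which keeps the predicted coordinate differentiable. A minimal version (the temperature parameter `beta` is an assumption, commonly used to sharpen the softmax):

```python
import math

def soft_argmax(heatmap, beta=1.0):
    # differentiable landmark coordinate: a softmax (temperature beta) over
    # the heatmap turns it into a probability map, and the prediction is the
    # expected (x, y) pixel position under that distribution
    h, w = len(heatmap), len(heatmap[0])
    exps = [[math.exp(beta * heatmap[r][c]) for c in range(w)] for r in range(h)]
    z = sum(sum(row) for row in exps)
    y = sum(r * exps[r][c] for r in range(h) for c in range(w)) / z
    x = sum(c * exps[r][c] for r in range(h) for c in range(w)) / z
    return x, y
```

As beta grows, the expectation approaches the hard argmax; at low beta the prediction is pulled toward the heatmap's center of mass, which is one source of the bias the paper's structured-prediction objective sidesteps.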
https://arxiv.org/abs/2508.14929
Head-mounted displays (HMDs) are essential for experiencing extended reality (XR) environments and observing virtual content. However, they obscure the upper part of the user's face, complicating external video recording and significantly impacting social XR applications such as teleconferencing, where facial expressions and eye gaze details are crucial for creating an immersive experience. This study introduces a geometry-aware learning-based framework to jointly remove HMD occlusions and reconstruct complete 3D facial geometry from RGB frames captured from a single viewpoint. The method integrates a GAN-based video inpainting network, guided by dense facial landmarks and a single occlusion-free reference frame, to restore missing facial regions while preserving identity. Subsequently, a SynergyNet-based module regresses 3D Morphable Model (3DMM) parameters from the inpainted frames, enabling accurate 3D face reconstruction. Dense landmark optimization is incorporated throughout the pipeline to improve both the inpainting quality and the fidelity of the recovered geometry. Experimental results demonstrate that the proposed framework can successfully remove HMDs from RGB facial videos while maintaining facial identity and realism, producing photorealistic 3D face geometry outputs. Ablation studies further show that the framework remains robust across different landmark densities, with only minor quality degradation under sparse landmark configurations.
https://arxiv.org/abs/2508.12336
Emotion is a critical component of artificial social intelligence. However, while current methods excel in lip synchronization and image quality, they often fail to generate accurate and controllable emotional expressions while preserving the subject's identity. To address this challenge, we introduce RealTalk, a novel framework for synthesizing emotional talking heads with high emotion accuracy, enhanced emotion controllability, and robust identity preservation. RealTalk employs a variational autoencoder (VAE) to generate 3D facial landmarks from driving audio, which are concatenated with emotion-label embeddings using a ResNet-based landmark deformation model (LDM) to produce emotional landmarks. These landmarks and facial blendshape coefficients jointly condition a novel tri-plane attention Neural Radiance Field (NeRF) to synthesize highly realistic emotional talking heads. Extensive experiments demonstrate that RealTalk outperforms existing methods in emotion accuracy, controllability, and identity preservation, advancing the development of socially intelligent AI systems.
https://arxiv.org/abs/2508.12163