Automated facial expression quality assessment (FEQA) in neurological disorders is critical for enhancing diagnostic accuracy and improving patient care, yet effectively capturing the subtle motions and nuances of facial muscle movements remains a challenge. We propose to analyse facial landmark trajectories, a compact yet informative representation that encodes these subtle motions from a high-level structural perspective. Hence, we introduce the Trajectory-guided Motion Perception Transformer (TraMP-Former), a novel FEQA framework that fuses landmark trajectory features for fine-grained motion capture with visual semantic cues from RGB frames, ultimately regressing the combined features into a quality score. Extensive experiments demonstrate that TraMP-Former achieves new state-of-the-art performance on benchmark datasets with neurological disorders, including PFED5 (up by 6.51%) and an augmented Toronto NeuroFace (up by 7.62%). Our ablation studies further validate the efficiency and effectiveness of landmark trajectories in FEQA. Our code is available at this https URL.
https://arxiv.org/abs/2504.09530
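Below is a minimal, hypothetical sketch of the general idea described in the TraMP-Former abstract above: landmark-trajectory features and clip-level RGB features are fused and regressed into a single quality score. The encoder choices, dimensions, and module names here are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class ToyTrajectoryFusionRegressor(nn.Module):
    def __init__(self, traj_dim=136, rgb_dim=512, hidden=256):
        super().__init__()
        # Encode per-frame landmark trajectories (e.g., 68 landmarks x 2 coords = 136-d per frame).
        self.traj_encoder = nn.GRU(traj_dim, hidden, batch_first=True)
        self.rgb_proj = nn.Linear(rgb_dim, hidden)      # project clip-level visual features
        self.regressor = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, traj_seq, rgb_feat):
        # traj_seq: (B, T, traj_dim) landmark trajectories; rgb_feat: (B, rgb_dim) visual features.
        _, h = self.traj_encoder(traj_seq)              # h: (1, B, hidden)
        fused = torch.cat([h.squeeze(0), self.rgb_proj(rgb_feat)], dim=-1)
        return self.regressor(fused).squeeze(-1)        # predicted quality score per clip

model = ToyTrajectoryFusionRegressor()
score = model(torch.randn(2, 64, 136), torch.randn(2, 512))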
When digitizing historical archives, it is necessary to search for the faces of celebrities and ordinary people, especially in newspapers, link them to the surrounding text, and make them searchable. Existing face detectors on datasets of scanned historical documents fail remarkably -- current detection tools only achieve around $24\%$ mAP at $50:90\%$ IoU. This work compensates for this failure by introducing a new manually annotated domain-specific dataset in the style of the popular Wider Face dataset, containing 2.2k new images from digitized historical newspapers from the $19^{th}$ to $20^{th}$ century, with 11k new bounding-box annotations and associated facial landmarks. This dataset allows existing detectors to be retrained to bring their results closer to the standard in the field of face detection in the wild. We report several experimental results comparing different families of fine-tuned detectors against publicly available pre-trained face detectors and ablation studies of multiple detector sizes with comprehensive detection and landmark prediction performance results.
https://arxiv.org/abs/2504.00558
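As a hedged illustration of the retraining step discussed above, the sketch below fine-tunes an off-the-shelf torchvision Faster R-CNN for a single "face" class on a Wider-Face-style dataset. Dataset loading is omitted, the box coordinates are placeholders, and this is not the detector family or training recipe used in the paper; it assumes a recent torchvision that accepts weights="DEFAULT".

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)  # background + face

optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9)
model.train()
image = torch.rand(3, 800, 800)                                   # stand-in for a scanned page
target = {"boxes": torch.tensor([[100., 120., 180., 220.]]),      # one annotated face box
          "labels": torch.tensor([1], dtype=torch.int64)}
losses = model([image], [target])                                  # dict of detection losses
sum(losses.values()).backward()
optimizer.step()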
Despite the similar structures of human faces, existing face alignment methods cannot learn unified knowledge from multiple datasets with different landmark annotations. The limited training samples in a single dataset commonly result in fragile robustness in this field. To mitigate knowledge discrepancies among different datasets and train a task-agnostic unified face alignment (TUFA) framework, this paper presents a strategy to unify knowledge from multiple datasets. Specifically, we calculate a mean face shape for each dataset. To explicitly align these mean shapes on an interpretable plane based on their semantics, each shape is then incorporated with a group of semantic alignment embeddings. The 2D coordinates of these aligned shapes can be viewed as the anchors of the plane. By encoding them into structure prompts and further regressing the corresponding facial landmarks using image features, a mapping from the plane to the target faces is finally established, which unifies the learning target of different datasets. Consequently, multiple datasets can be utilized to boost the generalization ability of the model. The successful mitigation of discrepancies also enhances the efficiency of knowledge transfer to a novel dataset and significantly boosts the performance of few-shot face alignment. Additionally, the interpretable plane endows TUFA with a task-agnostic characteristic, enabling it to locate landmarks unseen during training in a zero-shot manner. Extensive experiments carried out on seven benchmarks demonstrate an impressive improvement in face alignment brought by mitigating knowledge discrepancies.
https://arxiv.org/abs/2503.22359
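A minimal sketch of one ingredient mentioned in the TUFA abstract: computing a per-dataset mean face shape after a crude similarity normalization. The semantic alignment embeddings and structure prompts are not reproduced here, and the normalization choice is an assumption for illustration.

import numpy as np

def normalize_shape(shape):
    # shape: (N, 2) landmark coordinates; remove translation and scale.
    centered = shape - shape.mean(axis=0, keepdims=True)
    return centered / (np.linalg.norm(centered) + 1e-8)

def mean_face_shape(shapes):
    # shapes: list of (N, 2) arrays from one dataset (same annotation scheme).
    normed = np.stack([normalize_shape(s) for s in shapes], axis=0)
    return normed.mean(axis=0)

dataset_shapes = [np.random.rand(68, 2) for _ in range(100)]  # placeholder annotations
anchor_shape = mean_face_shape(dataset_shapes)                # anchors on the interpretable plane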
Precise audio-visual synchronization in speech videos is crucial for content quality and viewer comprehension. Existing methods have made significant strides in addressing this challenge through rule-based approaches and end-to-end learning techniques. However, these methods often rely on limited audio-visual representations and suboptimal learning strategies, potentially constraining their effectiveness in more complex scenarios. To address these limitations, we present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. UniSync offers broad compatibility with various audio representations (e.g., Mel spectrograms, HuBERT) and visual representations (e.g., RGB images, face parsing maps, facial landmarks, 3DMM), effectively handling their significant dimensional differences. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs, improving discriminative capabilities. UniSync outperforms existing methods on standard datasets and demonstrates versatility across diverse audio-visual representations. Its integration into talking face generation frameworks enhances synchronization quality in both natural and AI-generated content.
https://arxiv.org/abs/2503.16357
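The sketch below shows a generic margin-based contrastive objective over cosine similarities of audio and visual embeddings, in the spirit of the loss component described in the UniSync abstract; the exact loss, margin value, and negative-sampling scheme are assumptions, with off-diagonal batch entries (including cross-speaker clips) acting as unsynchronized negatives.

import torch
import torch.nn.functional as F

def sync_margin_loss(audio_emb, visual_emb, margin=0.2):
    # audio_emb, visual_emb: (B, D); row i of each modality is a synchronized pair.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    sim = a @ v.t()                                    # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                      # synced pairs on the diagonal
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool)
    hinge = F.relu(sim - pos + margin)                 # penalize negatives within the margin
    return hinge[off_diag].mean()

loss = sync_margin_loss(torch.randn(8, 256), torch.randn(8, 256))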
Algorithmic detection of facial palsy offers the potential to improve current practices, which usually involve labor-intensive and subjective assessments by clinicians. In this paper, we present a multimodal fusion-based deep learning model that uses an MLP mixer-based model to process unstructured data (i.e., RGB images or images with facial line segments) and a feed-forward neural network to process structured data (i.e., facial landmark coordinates, features of facial expressions, or handcrafted features) for detecting facial palsy. We then conduct a study to analyze the effect of different data modalities and the benefits of a multimodal fusion-based approach using videos of 20 facial palsy patients and 20 healthy subjects. Our multimodal fusion model achieved 96.00 F1, which is significantly higher than the feed-forward neural network trained on handcrafted features alone (82.80 F1) and an MLP mixer-based model trained on raw RGB images (89.00 F1).
https://arxiv.org/abs/2503.10371
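A toy two-branch fusion model in the spirit of the setup above: a single MLP-Mixer block over image patches (unstructured branch) and a small feed-forward network over landmark/handcrafted features (structured branch), fused for binary classification. Layer sizes and the patch embedding are arbitrary assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class TinyMixerBlock(nn.Module):
    def __init__(self, num_tokens, dim):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Linear(num_tokens, num_tokens)    # mixes information across patches
        self.channel_mlp = nn.Linear(dim, dim)                # mixes information across channels

    def forward(self, x):                                     # x: (B, num_tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

class FusionPalsyClassifier(nn.Module):
    def __init__(self, num_tokens=49, dim=128, struct_dim=150):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)        # 16x16 RGB patches -> tokens
        self.mixer = TinyMixerBlock(num_tokens, dim)
        self.struct_net = nn.Sequential(nn.Linear(struct_dim, 64), nn.ReLU())
        self.head = nn.Linear(dim + 64, 2)                    # palsy vs. healthy

    def forward(self, patches, struct_feats):
        tokens = self.mixer(self.patch_embed(patches))        # (B, T, dim)
        fused = torch.cat([tokens.mean(dim=1), self.struct_net(struct_feats)], dim=-1)
        return self.head(fused)

logits = FusionPalsyClassifier()(torch.randn(4, 49, 768), torch.randn(4, 150))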
Objective. Patients implanted with the PRIMA photovoltaic subretinal prosthesis in geographic atrophy report form vision with average acuity matching the 100 µm pixel size. Although this remarkable outcome enables them to read and write, they report difficulty with perceiving faces. This paper provides a novel, non-pixelated algorithm for simulating prosthetic vision the way it is experienced by PRIMA patients, compares the algorithm's predictions to clinical perceptual outcomes, and offers computer vision and machine learning (ML) methods to improve face representation. Approach. Our simulation algorithm integrates a grayscale filter, a spatial resolution filter, and a contrast filter. This accounts for the limited sampling density of the retinal implant, as well as the reduced contrast sensitivity of prosthetic vision. Patterns of Landolt C and faces created using this simulation algorithm are compared to reports from actual PRIMA users. To recover the facial features lost in prosthetic vision, we apply an ML facial landmarking model as well as contrast-adjusting tone curves to the face image prior to its projection onto the implant. Main results. Simulated prosthetic vision matches the maximum letter acuity observed in clinical studies as well as patients' subjective descriptions. Application of the inverse contrast filter helps preserve the contrast in prosthetic vision. Identifying the facial features using an ML facial landmarking model and accentuating them further improves face representation. Significance. The spatial and contrast constraints of prosthetic vision limit resolvable features and degrade natural images. ML-based methods and contrast adjustments mitigate some of these limitations and improve face representation. Even though higher spatial resolution can be expected with implants having smaller pixels, contrast enhancement remains essential for face recognition.
https://arxiv.org/abs/2503.11677
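A rough numpy-only sketch of the three filters named in the Approach above (grayscale, spatial resolution via coarse sampling, reduced contrast), plus a simple tone-curve pre-processing step applied to the input image before projection. All parameter values (block size, contrast gain, gamma) are illustrative assumptions, not the paper's calibrated settings.

import numpy as np

def simulate_prosthetic_vision(rgb, block=8, contrast_gain=0.4):
    gray = rgb.astype(np.float32) @ np.array([0.299, 0.587, 0.114])       # grayscale filter
    h, w = (gray.shape[0] // block) * block, (gray.shape[1] // block) * block
    g = gray[:h, :w].reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    coarse = np.kron(g, np.ones((block, block)))                          # spatial resolution filter
    return 128 + contrast_gain * (coarse - 128)                           # contrast filter

def boost_contrast(rgb, gamma=0.7):
    # Tone-curve pre-processing applied to the face image before it reaches the implant.
    x = rgb.astype(np.float32) / 255.0
    return (np.power(x, gamma) * 255.0).astype(np.uint8)

img = (np.random.rand(96, 96, 3) * 255).astype(np.uint8)   # stand-in face image
sim = simulate_prosthetic_vision(boost_contrast(img))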
Social Anxiety Disorder (SAD) is a widespread mental health condition, yet its lack of objective markers hinders timely detection and intervention. While previous research has focused on behavioral and non-verbal markers of SAD in structured activities (e.g., speeches or interviews), these settings fail to fully replicate real-world, unstructured social interactions. Identifying non-verbal markers in naturalistic, unstaged environments is essential for developing ubiquitous and non-intrusive monitoring solutions. To address this gap, we present AnxietyFaceTrack, a study leveraging facial video analysis to detect anxiety in unstaged social settings. A cohort of 91 participants engaged in a social setting with unfamiliar individuals, and their facial videos were recorded using a low-cost smartphone camera. We examined facial features, including eye movements, head position, facial landmarks, and facial action units, and used self-reported survey data to establish ground truth for multiclass (anxious, neutral, non-anxious) and binary (e.g., anxious vs. neutral) classifications. Our results demonstrate that a Random Forest classifier trained on the top 20% of features achieved the highest accuracy of 91.0% for multiclass classification and an average accuracy of 92.33% across binary classifications. Notably, head position and facial landmarks yielded the best performance among individual facial regions, achieving 85.0% and 88.0% accuracy, respectively, in multiclass classification, and 89.66% and 91.0% accuracy, respectively, across binary classifications. This study introduces a non-intrusive, cost-effective solution that can be seamlessly integrated into everyday smartphones for continuous anxiety monitoring, offering a promising pathway for early detection and intervention.
https://arxiv.org/abs/2502.16106
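A hedged sketch of the classification setup described above: keep the top 20% of facial features by a univariate score, then train a Random Forest. The feature extraction, the specific selection criterion, and the survey-derived labels are all placeholders here.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(300, 200)          # placeholder: rows = time windows, cols = facial features
y = np.random.randint(0, 3, 300)      # placeholder labels: anxious / neutral / non-anxious

clf = make_pipeline(SelectPercentile(f_classif, percentile=20),
                    RandomForestClassifier(n_estimators=300, random_state=0))
print(cross_val_score(clf, X, y, cv=5).mean())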
Facial landmark tracking plays a vital role in applications such as facial recognition, expression analysis, and medical diagnostics. In this paper, we consider the performance of the Extended Kalman Filter (EKF) and the Unscented Kalman Filter (UKF) in tracking 3D facial motion in both deterministic and stochastic settings. We first analyze a noise-free environment where the state transition is purely deterministic, demonstrating that the UKF outperforms the EKF by achieving a lower mean squared error (MSE) due to its ability to capture higher-order nonlinearities. However, when stochastic noise is introduced, the EKF exhibits superior robustness, maintaining a lower MSE than the UKF, which becomes more sensitive to measurement noise and occlusions. Our results highlight that the UKF is preferable for high-precision applications in controlled environments, whereas the EKF is better suited for real-world scenarios with unpredictable noise. These findings provide practical insights for selecting the appropriate filtering technique in 3D facial tracking applications, such as motion capture and facial recognition.
https://arxiv.org/abs/2502.15179
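As a self-contained illustration of one of the two filters compared above, here is a minimal numpy EKF for a single 3D facial landmark with an assumed constant-velocity motion model and a nonlinear pinhole-camera measurement; the UKF counterpart and the paper's exact state and noise models are not reproduced.

import numpy as np

f, dt = 500.0, 1.0 / 30.0                                # focal length (px), frame interval
F = np.eye(6); F[:3, 3:] = dt * np.eye(3)                # constant-velocity state transition
Q, R = 1e-4 * np.eye(6), 1.0 * np.eye(2)                 # process / measurement noise

def h(x):                                                # project the 3D landmark to the image plane
    X, Y, Z = x[:3]
    return np.array([f * X / Z, f * Y / Z])

def H_jac(x):                                            # Jacobian of h at the current state
    X, Y, Z = x[:3]
    H = np.zeros((2, 6))
    H[0, 0], H[0, 2] = f / Z, -f * X / Z**2
    H[1, 1], H[1, 2] = f / Z, -f * Y / Z**2
    return H

x, P = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0]), np.eye(6)
for z in [np.array([2.0, -1.0]), np.array([2.5, -0.8])]: # per-frame 2D landmark observations
    x, P = F @ x, F @ P @ F.T + Q                        # predict
    H = H_jac(x)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)         # Kalman gain
    x = x + K @ (z - h(x))                               # update with the measurement residual
    P = (np.eye(6) - K @ H) @ P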
Lip reading is vital for robots in social settings, improving their ability to understand human communication. This skill allows them to communicate more easily in crowded environments, especially in caregiving and customer service roles. This study generates a Persian lip-reading dataset and integrates Persian lip-reading technology into the Surena-V humanoid robot to improve its speech recognition capabilities. Two complementary methods are explored: an indirect method using facial landmark tracking and a direct method leveraging convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. The indirect method focuses on tracking key facial landmarks, especially around the lips, to infer movements, while the direct method processes raw video data for action and speech recognition. The best-performing model, the LSTM, achieved 89% accuracy and has been successfully deployed on the Surena-V robot for real-time human-robot interaction. The study highlights the effectiveness of these methods, particularly in environments where verbal communication is limited.
https://arxiv.org/abs/2501.13996
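A generic CNN+LSTM lip-reading classifier sketch (frame-wise CNN features fed to an LSTM), illustrating the direct method mentioned above; the layer sizes, input resolution, and word vocabulary are placeholders rather than the Surena-V implementation.

import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, num_words=10, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                      # per-frame encoder for mouth-region crops
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_words)

    def forward(self, clips):                          # clips: (B, T, 1, H, W) grayscale frames
        B, T = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(B, T, -1)
        _, (h, _) = self.lstm(feats)
        return self.head(h[-1])                        # word logits per clip

logits = LipReader()(torch.randn(2, 24, 1, 64, 64))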
The recent realistic creation and dissemination of so-called deepfakes poses a serious threat to social life, civil rest, and law. Celebrity defaming, election manipulation, and deepfakes presented as evidence in a court of law are a few potential consequences. The availability of open-source trained models based on modern frameworks such as PyTorch or TensorFlow, video manipulation apps such as FaceApp and REFACE, and economical computing infrastructure has eased the creation of deepfakes. Most existing detectors focus on detecting either face-swap, lip-sync, or puppet-master deepfakes, but a unified framework to detect all three types is hardly explored. This paper presents a unified framework that exploits the power of a proposed feature fusion of hybrid facial landmarks and our novel heart rate features for the detection of all types of deepfakes. We propose novel heart rate features and fuse them with facial landmark features to better capture the facial artifacts of fake videos and the natural variations present in original videos. We use these features to train a lightweight XGBoost classifier to distinguish deepfake from bona fide videos. We evaluated the performance of our framework on the World Leaders Dataset (WLDR), which contains all types of deepfakes. Experimental results illustrate that the proposed framework offers superior detection performance over competing deepfake detection methods. A performance comparison against LSTM-FCN, a representative deep learning model, shows that the proposed model achieves similar results while being more interpretable.
https://arxiv.org/abs/2501.11927
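An illustrative XGBoost classifier trained on fused per-video feature vectors (facial-landmark features concatenated with heart-rate features), matching the classification stage described above; the feature extraction itself, the feature dimensions, and the labels are placeholders.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

landmark_feats = np.random.rand(400, 64)     # placeholder hybrid facial-landmark features
heart_rate_feats = np.random.rand(400, 16)   # placeholder heart-rate features
X = np.hstack([landmark_feats, heart_rate_feats])
y = np.random.randint(0, 2, 400)             # 0 = bona fide, 1 = deepfake

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print(accuracy_score(y_te, clf.predict(X_te)))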
The increasing complexity of machine learning models in computer vision, particularly in face verification, requires the development of explainable artificial intelligence (XAI) to enhance interpretability and transparency. This study extends previous work by integrating semantic concepts derived from human cognitive processes into XAI frameworks to bridge the comprehension gap between model outputs and human understanding. We propose a novel approach combining global and local explanations, using semantic features defined by user-selected facial landmarks to generate similarity maps and textual explanations via large language models (LLMs). The methodology was validated through quantitative experiments and user feedback, demonstrating improved interpretability. Results indicate that our semantic-based approach, particularly the most detailed set, offers a more nuanced understanding of model decisions than traditional methods. User studies highlight a preference for our semantic explanations over traditional pixel-based heatmaps, emphasizing the benefits of human-centric interpretability in AI. This work contributes to the ongoing efforts to create XAI frameworks that align AI model behaviour with human cognitive processes, fostering trust and acceptance in critical applications.
https://arxiv.org/abs/2501.05471
Although facial landmark detection (FLD) has gained significant progress, existing FLD methods still suffer from performance drops on partially non-visible faces, such as faces with occlusions or under extreme lighting conditions or poses. To address this issue, we introduce ORFormer, a novel transformer-based method that can detect non-visible regions and recover their missing features from visible parts. Specifically, ORFormer associates each image patch token with one additional learnable token called the messenger token. The messenger token aggregates features from all patches except its own. This way, the consensus between a patch and the other patches can be assessed by referring to the similarity between its regular and messenger embeddings, enabling non-visible region identification. Our method then recovers occluded patches with features aggregated by the messenger tokens. Leveraging the recovered features, ORFormer compiles high-quality heatmaps for the downstream FLD task. Extensive experiments show that our method generates heatmaps resilient to partial occlusions. By integrating the resultant heatmaps into existing FLD methods, our method performs favorably against the state of the art on challenging datasets such as WFLW and COFW.
https://arxiv.org/abs/2412.13174
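A toy rendering of the messenger-token idea sketched above: each messenger attends to every patch except its own, and low similarity between a patch's regular and messenger embeddings flags it as likely non-visible, after which the messenger aggregate is substituted. The single-head attention, the 0.2 threshold, and the recovery rule are simplifications, not ORFormer's actual design.

import torch
import torch.nn.functional as F

def messenger_pass(patch_tokens):
    # patch_tokens: (B, N, D) patch embeddings
    B, N, D = patch_tokens.shape
    attn = patch_tokens @ patch_tokens.transpose(1, 2) / D**0.5            # (B, N, N)
    attn = attn.masked_fill(torch.eye(N, dtype=torch.bool), float("-inf")) # exclude own patch
    messenger = F.softmax(attn, dim=-1) @ patch_tokens                     # consensus from other patches
    agreement = F.cosine_similarity(patch_tokens, messenger, dim=-1)       # (B, N), low => non-visible
    recovered = torch.where(agreement.unsqueeze(-1) < 0.2, messenger, patch_tokens)
    return recovered, agreement

recovered, agreement = messenger_pass(torch.randn(2, 49, 64))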
This work presents the IMPROVE dataset, designed to evaluate the effects of mobile phone usage on learners during online education. The dataset not only assesses academic performance and subjective learner feedback but also captures biometric, behavioral, and physiological signals, providing a comprehensive analysis of the impact of mobile phone use on learning. Multimodal data were collected from 120 learners in three groups with different phone interaction levels. A setup involving 16 sensors was implemented to collect data that have proven to be effective indicators for understanding learner behavior and cognition, including electroencephalography signals, video, and eye-tracking data. The dataset includes metadata from the processed videos, such as face bounding boxes, facial landmarks, and Euler angles for head pose estimation. In addition, learner performance data and self-reported forms are included. Phone usage events were labeled, covering both supervisor-triggered and uncontrolled events. A semi-manual re-labeling system, using head pose and eye tracker data, is proposed to improve labeling accuracy. Technical validation confirmed signal quality, with statistical analyses revealing biometric changes during phone use.
https://arxiv.org/abs/2412.14195
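A hypothetical heuristic in the spirit of the semi-manual re-labeling step mentioned above: flag sustained runs of frames where the head pitch points sharply downward and gaze falls off-screen, then hand those segments to a human annotator. The thresholds, field names, and rule itself are assumptions, not the dataset's actual system.

import numpy as np

def candidate_phone_use(pitch_deg, gaze_on_screen, min_frames=30, pitch_thresh=-25.0):
    # pitch_deg: (T,) Euler pitch per frame; gaze_on_screen: (T,) booleans from the eye tracker.
    flags = (pitch_deg < pitch_thresh) & (~gaze_on_screen)
    segments, start = [], None
    for t, f in enumerate(flags):
        if f and start is None:
            start = t
        elif not f and start is not None:
            if t - start >= min_frames:
                segments.append((start, t))            # candidate segment for manual review
            start = None
    if start is not None and len(flags) - start >= min_frames:
        segments.append((start, len(flags)))
    return segments

segments = candidate_phone_use(np.random.uniform(-40, 10, 900), np.random.rand(900) > 0.3)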
Landmark-guided character animation generation is an important field. Generating character animations with facial features consistent with a reference image remains a significant challenge in conditional video generation, especially for complex motions such as dancing. Existing methods often fail to maintain facial feature consistency due to mismatches between the facial landmarks extracted from source videos and the target facial features in the reference image. To address this problem, we propose a facial landmark transformation method based on the 3D Morphable Model (3DMM). We obtain transformed landmarks that align with the target facial features by reconstructing 3D faces from the source landmarks and adjusting the 3DMM parameters to match the reference image. Our method improves the facial consistency between the generated videos and the reference images, effectively alleviating the facial feature mismatch problem.
https://arxiv.org/abs/2412.08976
Face image restoration aims to enhance degraded facial images while addressing challenges such as diverse degradation types, real-time processing demands, and, most crucially, the preservation of identity-specific features. Existing methods often struggle with slow processing times and suboptimal restoration, especially under severe degradation, failing to accurately reconstruct finer-level identity details. To address these issues, we introduce InstantRestore, a novel framework that leverages a single-step image diffusion model and an attention-sharing mechanism for fast and personalized face restoration. Additionally, InstantRestore incorporates a novel landmark attention loss, aligning key facial landmarks to refine the attention maps, enhancing identity preservation. At inference time, given a degraded input and a small (~4) set of reference images, InstantRestore performs a single forward pass through the network to achieve near real-time performance. Unlike prior approaches that rely on full diffusion processes or per-identity model tuning, InstantRestore offers a scalable solution suitable for large-scale applications. Extensive experiments demonstrate that InstantRestore outperforms existing methods in quality and speed, making it an appealing choice for identity-preserving face restoration.
https://arxiv.org/abs/2412.06753
We introduce a novel approach for high-resolution talking head generation from a single image and audio input. Prior methods using explicit face models, like 3D morphable models (3DMM) and facial landmarks, often fall short in generating high-fidelity videos due to their lack of appearance-aware motion representation. While generative approaches such as video diffusion models achieve high video quality, their slow processing speeds limit practical application. Our proposed model, Implicit Face Motion Diffusion Model (IF-MDM), employs implicit motion to encode human faces into appearance-aware compressed facial latents, enhancing video generation. Although implicit motion lacks the spatial disentanglement of explicit models, which complicates alignment with subtle lip movements, we introduce motion statistics to help capture fine-grained motion information. Additionally, our model provides motion controllability to optimize the trade-off between motion intensity and visual quality during inference. IF-MDM supports real-time generation of 512x512 resolution videos at up to 45 frames per second (fps). Extensive evaluations demonstrate its superior performance over existing diffusion and explicit face models. The code will be released publicly, available alongside supplementary materials. The video results can be found on this https URL.
https://arxiv.org/abs/2412.04000
This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution that facilitates simultaneous dual control of fine expression and identity. Unlike conventional methods using coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation for filtering out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer function and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
https://arxiv.org/abs/2412.01254
At present, deep neural network methods play a dominant role in the face alignment field. However, they generally use predefined network structures to predict landmarks, which tends to learn general features and leads to mediocre performance; e.g., they perform well on neutral samples but struggle with faces exhibiting large poses or occlusions. Moreover, they cannot effectively deal with semantic gaps and ambiguities among features at different scales, which may hinder them from learning efficient features. To address the above issues, in this paper we propose a Dynamic Semantic-Aggregation Transformer (DSAT) for more discriminative and representative feature (i.e., specialized feature) learning. Specifically, a Dynamic Semantic-Aware (DSA) model is first proposed to partition samples into subsets and activate specific pathways for them by estimating the semantic correlations of feature channels, making it possible to learn specialized features from each subset. Then, a novel Dynamic Semantic Specialization (DSS) model is designed to mine the homogeneous information from features at different scales, eliminating the semantic gap and ambiguities and enhancing the representation ability. Finally, by integrating the DSA model and DSS model into our proposed DSAT in both dynamic architecture and dynamic parameter manners, more specialized features can be learned for more precise face alignment. Interestingly, harder samples can be handled by activating more feature channels. Extensive experiments on popular face alignment datasets demonstrate that our proposed DSAT outperforms state-of-the-art models. The code is available at this https URL.
https://arxiv.org/abs/2412.00740
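A toy dynamic channel-gating block loosely inspired by the idea above of activating sample-specific feature channels; the actual DSA and DSS models are considerably more involved, and this squeeze-and-excitation-style gate is only an illustrative stand-in.

import torch
import torch.nn as nn

class DynamicChannelGate(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, feats):                         # feats: (B, C, H, W)
        g = self.gate(feats)                          # per-sample channel activations in [0, 1]
        return feats * g.view(g.size(0), -1, 1, 1)    # harder samples can keep more channels open

gated = DynamicChannelGate(64)(torch.randn(2, 64, 32, 32))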
Deepfake facial manipulation has garnered significant public attention due to its impacts on enhancing human experiences and posing privacy threats. Despite numerous passive algorithms that have been attempted to thwart malicious Deepfake attacks, they mostly struggle with the generalizability challenge when confronted with hyper-realistic synthetic facial images. To tackle the problem, this paper proposes a proactive Deepfake detection approach by introducing a novel training-free landmark perceptual watermark, LampMark for short. We first analyze the structure-sensitive characteristics of Deepfake manipulations and devise a secure and confidential transformation pipeline from the structural representations, i.e., facial landmarks, to binary landmark perceptual watermarks. Subsequently, we present an end-to-end watermarking framework that imperceptibly and robustly embeds watermarks into, and extracts them from, the images to be protected. Relying on promising watermark recovery accuracies, Deepfake detection is accomplished by assessing the consistency between the content-matched landmark perceptual watermark and the robustly recovered watermark of the suspect image. Experimental results demonstrate the superior performance of our approach in watermark recovery and Deepfake detection compared to state-of-the-art methods across in-dataset, cross-dataset, and cross-manipulation scenarios.
https://arxiv.org/abs/2411.17209
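A toy stand-in for mapping facial landmarks to a binary perceptual watermark: normalize the shape, take a keyed random projection of pairwise distances, and binarize. The paper's secure, confidential transformation pipeline is intentionally not reproduced; everything here (the descriptor, the key, the 128-bit length) is an assumption for illustration only.

import numpy as np

def landmarks_to_bits(landmarks, key=1234, n_bits=128):
    pts = landmarks - landmarks.mean(axis=0)
    pts /= (np.linalg.norm(pts) + 1e-8)                       # scale-invariant shape
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    feats = d[np.triu_indices(len(pts), k=1)]                 # structure-sensitive descriptor
    proj = np.random.default_rng(key).standard_normal((n_bits, feats.size))
    scores = proj @ feats
    return (scores > np.median(scores)).astype(np.uint8)

bits = landmarks_to_bits(np.random.rand(68, 2))               # 128-bit landmark watermark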
Facial landmark detection is a fundamental problem in computer vision for many downstream applications. This paper introduces a new facial landmark detector based on vision transformers, which consists of two unique designs: Dual Vision Transformer (D-ViT) and Long Skip Connections (LSC). Based on the observation that the channel dimension of feature maps essentially represents the linear bases of the heatmap space, we propose learning the interconnections between these linear bases to model the inherent geometric relations among landmarks via Channel-split ViT. We integrate such channel-split ViT into the standard vision transformer (i.e., spatial-split ViT), forming our Dual Vision Transformer to constitute the prediction blocks. We also suggest using long skip connections to deliver low-level image features to all prediction blocks, thereby preventing useful information from being discarded by intermediate supervision. Extensive experiments are conducted to evaluate the performance of our proposal on the widely used benchmarks, i.e., WFLW, COFW, and 300W, demonstrating that our model outperforms the previous SOTAs across all three benchmarks.
https://arxiv.org/abs/2411.07167
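A minimal sketch of the channel-split idea described above: transposing the feature map so channels act as tokens lets self-attention model interconnections between channel-wise heatmap bases, alongside an ordinary spatial branch. The single block, head counts, and residual wiring are toy assumptions, not the D-ViT prediction blocks or long skip connections.

import torch
import torch.nn as nn

class ChannelSplitAttention(nn.Module):
    def __init__(self, num_tokens, channels, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.channel = nn.MultiheadAttention(num_tokens, heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, C) flattened feature map
        x = x + self.spatial(x, x, x)[0]       # spatial-split branch: tokens = positions
        xc = x.transpose(1, 2)                 # channel-split branch: tokens = channels
        xc = xc + self.channel(xc, xc, xc)[0]
        return xc.transpose(1, 2)

out = ChannelSplitAttention(num_tokens=64, channels=128)(torch.randn(2, 64, 128))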