Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and in-browser deployability. We therefore target a deployment-oriented operating point rather than the large-backbone, image-based regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.
https://arxiv.org/abs/2603.12388
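The closed-form ridge head described above is a standard construction; a minimal numpy sketch (generic ridge regression with made-up dimensions, not the authors' exact meta-trained variant) might look like:

```python
import numpy as np

def fit_ridge_head(Z, Y, lam=1e-2):
    """Closed-form ridge head: solve (Z^T Z + lam*I) W = Z^T Y.

    Z: (K, d) session embeddings from a frozen encoder.
    Y: (K, 2) calibration targets (e.g., screen coordinates).
    Returns W: (d, 2); predictions are Z @ W.
    """
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y)

# 9-point calibration with synthetic embeddings (d = 4 for illustration)
rng = np.random.default_rng(0)
Z = rng.normal(size=(9, 4))
W_true = rng.normal(size=(4, 2))
Y = Z @ W_true
W = fit_ridge_head(Z, Y, lam=1e-8)
```

In the paper's setup only the encoder and regularization are shared; this head would be refit from scratch for every new session's calibration set.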
Accurate facial expression imitation on human-face robots is crucial for achieving natural human-robot interaction. Most existing methods have achieved photorealistic expression imitation through mapping 2D facial landmarks to a robot's actuator commands. Their imitation of landmark trajectories is susceptible to interference from facial morphology, which would lead to a performance drop. In this paper, we propose a morphology-independent expression imitation method that decouples expressions from facial morphology to eliminate morphological influence and produce more realistic expressions for human-face robots. Specifically, we construct an expression decoupling module to learn expression semantics by disentangling the expression representation from the morphology representation in a self-supervised manner. We devise an expression transfer module to map the representations to the robot's actuator commands through a learning objective of perceiving expression errors, producing accurate facial expressions based on the learned expression semantics. To support experimental validation, a custom-designed and highly expressive human-face robot, namely Pengrui, is developed to serve as an experimental platform for realistic expression imitation. Extensive experiments demonstrate that our method enables the human-face robot to reproduce a wide range of human-like expressions effectively. All code and implementation details of the robot will be released.
https://arxiv.org/abs/2603.07068
With the rapid advancement of deepfake technology, malicious face manipulations pose a significant threat to personal privacy and social security. However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. Our core framework adopts an innovative 152-dimensional landmark-identity watermark termed LIDMark, which structurally interweaves facial landmarks with a unique source identifier. To robustly extract the LIDMark, we design a novel Factorized-Head Decoder (FHD). Its architecture factorizes the shared backbone features into two specialized heads (i.e., regression and classification), robustly reconstructing the embedded landmarks and identifier, respectively, even when subjected to severe distortion or tampering. This design realizes an "all-in-one" trifunctional forensic solution: the regression head underlies an "intrinsic-extrinsic" consistency check for detection and localization, while the classification head robustly decodes the source identifier for tracing. Extensive experiments show that the proposed LIDMark framework provides a unified, robust, and imperceptible solution for the detection, localization, and tracing of deepfake content. The code is available at this https URL.
https://arxiv.org/abs/2602.23523
Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
https://arxiv.org/abs/2602.23101
Craniofacial Superimposition is a forensic technique for identifying skeletal remains by comparing a post-mortem skull with ante-mortem facial photographs. A critical step in this process is Skull-Face Overlay (SFO). This stage involves aligning a 3D skull model with a 2D facial image, typically guided by correspondences between cranial and facial landmarks. However, its accuracy is undermined by individual variability in soft-tissue thickness, introducing significant uncertainty into the overlay. This paper introduces Lilium, an automated evolutionary method to enhance the accuracy and robustness of SFO. Lilium explicitly models soft-tissue variability using a 3D cone-based representation whose parameters are optimized via a Differential Evolution algorithm. The method enforces anatomical, morphological, and photographic plausibility through a combination of constraints: landmark matching, camera parameter consistency, head pose alignment, skull containment within facial boundaries, and region parallelism. This emulation of the usual forensic practitioners' approach leads Lilium to outperform the state-of-the-art method in terms of both accuracy and robustness.
https://arxiv.org/abs/2603.00170
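Differential Evolution, the optimizer Lilium relies on, is compact enough to sketch; below is a minimal textbook DE/rand/1/bin loop on a toy objective, not Lilium's actual multi-constraint fitness:

```python
import numpy as np

def differential_evolution(f, bounds, pop=20, gens=100, F=0.8, CR=0.9, seed=0):
    """Minimal DE/rand/1/bin: mutate with a scaled difference vector,
    binomially crossover with the parent, keep the better of the two."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    d = len(lo)
    X = rng.uniform(lo, hi, size=(pop, d))
    fit = np.array([f(x) for x in X])
    for _ in range(gens):
        for i in range(pop):
            # pick three distinct members other than i
            a, b, c = X[rng.choice([j for j in range(pop) if j != i], 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)
            cross = rng.random(d) < CR
            cross[rng.integers(d)] = True  # at least one gene from the mutant
            trial = np.where(cross, mutant, X[i])
            ft = f(trial)
            if ft < fit[i]:
                X[i], fit[i] = trial, ft
    return X[fit.argmin()], fit.min()

# Toy objective: sphere function, optimum at the origin
x, fx = differential_evolution(lambda v: float(np.sum(v ** 2)), [(-5, 5)] * 3)
```

In Lilium the decision vector would hold the cone and camera parameters and `f` would aggregate the landmark, pose, containment, and parallelism penalties.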
This study quantifies gender and skin-tone bias in two widely deployed commercial image generators - Gemini Flash 2.5 Image (NanoBanana) and GPT Image 1.5 - to test the assumption that neutral prompts yield demographically neutral outputs. We generated 3,200 photorealistic images using four semantically neutral prompts. The analysis employed a rigorous pipeline combining hybrid color normalization, facial landmark masking, and perceptually uniform skin tone quantification using the Monk (MST), PERLA, and Fitzpatrick scales. Neutral prompts produced highly polarized defaults. Both models exhibited a strong "default white" bias (>96% of outputs). However, they diverged sharply on gender: Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones. This research provides a large-scale, comparative audit of state-of-the-art models using an illumination-aware colorimetric methodology, distinguishing aesthetic rendering from underlying pigmentation in synthetic imagery. The study demonstrates that neutral prompts function as diagnostic probes rather than neutral instructions. It offers a robust framework for auditing algorithmic visual culture and challenges the sociolinguistic assumption that unmarked language results in inclusive representation.
https://arxiv.org/abs/2602.12133
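"Perceptually uniform skin tone quantification" typically means matching in CIELAB rather than RGB; a sketch of the nearest-tone step (standard sRGB-to-Lab conversion under D65; the palette in the test is a placeholder, not actual MST coordinates):

```python
import numpy as np

def srgb_to_lab(rgb):
    """sRGB in [0, 1] -> CIELAB under D65, via linearization and XYZ."""
    rgb = np.asarray(rgb, dtype=float)
    lin = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    M = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    x, y, z = (M @ lin) / np.array([0.95047, 1.0, 1.08883])  # D65 white
    f = lambda t: np.cbrt(t) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x), f(y), f(z)
    return np.array([116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)])

def nearest_tone(rgb, palette_lab):
    """Index of the closest palette tone under Euclidean distance in Lab
    (Delta E 1976) -- the perceptual analogue of nearest-RGB matching."""
    d = np.linalg.norm(palette_lab - srgb_to_lab(rgb), axis=1)
    return int(d.argmin())
```

The study's pipeline additionally normalizes for illumination and masks to skin regions before this quantization step.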
Accurate facial landmark detection under occlusion remains challenging, especially for human-like faces with large appearance variation and rotation-driven self-occlusion. Existing detectors typically localize landmarks while handling occlusion implicitly, without predicting the per-point visibility from which downstream applications can benefit. We present OccFace, an occlusion-aware framework for universal human-like faces, including humans, stylized characters, and other non-human designs. OccFace adopts a unified dense 100-point layout and a heatmap-based backbone, and adds an occlusion module that jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context. Visibility supervision mixes manual labels with landmark-aware masking that derives pseudo visibility from mask-heatmap overlap. We also create an occlusion-aware evaluation suite reporting NME on visible vs. occluded landmarks and benchmarking visibility with Occ AP, F1@0.5, and ROC-AUC, together with a dataset annotated with 100-point landmarks and per-point visibility. Experiments show improved robustness under external occlusion and large head rotations, especially on occluded regions, while preserving accuracy on visible landmarks.
https://arxiv.org/abs/2602.10728
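The landmark-aware masking idea, pseudo visibility from mask-heatmap overlap, reduces to a small computation. A hedged sketch (the overlap ratio and threshold are one plausible reading of the abstract, not OccFace's exact rule):

```python
import numpy as np

def pseudo_visibility(heatmaps, occ_mask, thresh=0.5):
    """Pseudo visibility labels: a landmark is marked occluded when the
    occluder mask covers more than `thresh` of its heatmap mass.

    heatmaps: (N, H, W) per-landmark heatmaps; occ_mask: (H, W) in {0, 1}.
    Returns a boolean array, True = visible.
    """
    overlap = (heatmaps * occ_mask).sum(axis=(1, 2)) / heatmaps.sum(axis=(1, 2))
    return overlap < thresh
```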
3D Morphable Models (3DMMs) take 2D images as inputs and recreate the structure and physical appearance of 3D objects, especially human faces and bodies. A 3DMM combines identity and expression blendshapes with a basic face mesh to create a detailed 3D model. The variability of a 3D morphable model can be controlled by tuning diverse parameters: high-level image descriptors such as shape, texture, illumination, and camera parameters. Previous research in 3D human reconstruction concentrated solely on global face structure or geometry, ignoring semantic facial features such as age, gender, and the facial landmarks characterizing facial boundaries, curves, dips, and wrinkles. In order to accommodate changes in these high-level facial characteristics, this work introduces a shape- and appearance-aware 3D reconstruction system (which we name SARS), a modular pipeline that extracts body and face information from a single image to properly rebuild a 3D model of the full human body.
https://arxiv.org/abs/2602.09918
One-shot prediction enables rapid adaptation of pretrained foundation models to new tasks using only one labeled example, but lacks principled uncertainty quantification. While conformal prediction provides finite-sample coverage guarantees, standard split conformal methods are inefficient in the one-shot setting due to data splitting and reliance on a single predictor. We propose Conformal Aggregation of One-Shot Predictors (CAOS), a conformal framework that adaptively aggregates multiple one-shot predictors and uses a leave-one-out calibration scheme to fully exploit scarce labeled data. Despite violating classical exchangeability assumptions, we prove that CAOS achieves valid marginal coverage using a monotonicity-based argument. Experiments on one-shot facial landmarking and RAFT text classification tasks show that CAOS produces substantially smaller prediction sets than split conformal baselines while maintaining reliable coverage.
https://arxiv.org/abs/2601.05219
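The leave-one-out calibration scheme can be illustrated generically: score each labeled point against a model fitted on the others, then use the finite-sample conformal quantile of those scores. This is plain LOO split-free conformal with absolute residuals, not CAOS's adaptive aggregation of one-shot predictors:

```python
import numpy as np

def loo_conformal_interval(x, y, fit, predict, x_new, alpha=0.1):
    """Leave-one-out conformal interval around the full-data prediction."""
    n = len(x)
    scores = []
    for i in range(n):
        keep = np.arange(n) != i
        m = fit(x[keep], y[keep])
        scores.append(abs(y[i] - predict(m, np.array([x[i]]))[0]))
    # finite-sample-corrected quantile level
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level)
    m = fit(x, y)
    yhat = predict(m, np.asarray(x_new))
    return yhat - q, yhat + q

# Toy usage: noisy linear data, a least-squares line as the predictor
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 30)
y = 2 * x + rng.normal(0, 0.1, 30)
fit = lambda a, b: np.polyfit(a, b, 1)
predict = lambda m, a: np.polyval(m, a)
lo, hi = loo_conformal_interval(x, y, fit, predict, [0.5])
```

The paper's contribution is proving that coverage survives when the LOO models are themselves adaptively aggregated one-shot predictors, which breaks the exchangeability argument this naive version leans on.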
High-precision facial landmark detection (FLD) relies on high-resolution deep feature representations. However, low-resolution face images or the compression (via pooling or strided convolution) of originally high-resolution images hinder the learning of such features, thereby reducing FLD accuracy. Moreover, insufficient training data and imprecise annotations further degrade performance. To address these challenges, we propose a weakly-supervised framework called Supervision-by-Hallucination-and-Transfer (SHT) for more robust and precise FLD. SHT contains two novel mutually enhanced modules: Dual Hallucination Learning Network (DHLN) and Facial Pose Transfer Network (FPTN). By incorporating FLD and face hallucination tasks, DHLN is able to learn high-resolution representations with low-resolution inputs for recovering both facial structures and local details and generating more effective landmark heatmaps. Then, by transforming faces from one pose to another, FPTN can further improve landmark heatmaps and faces hallucinated by DHLN for detecting more accurate landmarks. To the best of our knowledge, this is the first study to explore weakly-supervised FLD by integrating face hallucination and facial pose transfer tasks. Experimental results of both face hallucination and FLD demonstrate that our method surpasses state-of-the-art techniques.
https://arxiv.org/abs/2601.12919
Recently, deep learning based facial landmark detection (FLD) methods have achieved considerable success. However, in challenging scenarios such as large pose variations, illumination changes, and facial expression variations, they still struggle to accurately capture the geometric structure of the face, resulting in performance degradation. Moreover, the limited size and diversity of existing FLD datasets hinder robust model training, leading to reduced detection accuracy. To address these challenges, we propose a Frequency-Guided Task-Balancing Transformer (FGTBT), which enhances facial structure perception through frequency-domain modeling and multi-dataset unified training. Specifically, we propose a novel Fine-Grained Multi-Task Balancing loss (FMB-loss), which moves beyond coarse task-level balancing by assigning weights to individual landmarks based on their occurrence across datasets. This enables more effective unified training and mitigates the issue of inconsistent gradient magnitudes. Additionally, a Frequency-Guided Structure-Aware (FGSA) model is designed to utilize frequency-guided structure injection and regularization to help learn facial structure constraints. Extensive experimental results on popular benchmark datasets demonstrate that the integration of the proposed FMB-loss and FGSA model into our FGTBT framework achieves performance comparable to state-of-the-art methods. The code is available at this https URL.
https://arxiv.org/abs/2601.12863
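The per-landmark weighting behind FMB-loss can be approximated with an inverse-occurrence scheme; the sketch below is a guess at the spirit of the idea (rarer landmarks get larger weights so their gradients are not drowned out in unified multi-dataset training), not the paper's exact formula:

```python
import numpy as np

def landmark_weights(occurrence_counts):
    """Inverse-frequency weights from per-landmark occurrence counts
    across datasets, normalized to mean 1."""
    counts = np.asarray(occurrence_counts, dtype=float)
    w = counts.sum() / (len(counts) * counts)
    return w / w.mean()

def weighted_landmark_loss(pred, target, weights):
    """Weighted L2 over landmarks: pred/target (N, 2), weights (N,)."""
    per_lm = np.sum((pred - target) ** 2, axis=1)
    return float(np.mean(weights * per_lm))
```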
Stereo vision between images faces a range of challenges, including occlusions, motion, and camera distortions, across applications in autonomous driving, robotics, and face analysis. Due to parameter sensitivity, further complications arise for stereo matching with sparse features, such as facial landmarks. To overcome this ill-posedness and enable unsupervised sparse matching, we consider line constraints of the camera geometry from an optimal transport (OT) viewpoint. Formulating camera-projected points as (half)lines, we propose the use of the classical epipolar distance as well as a 3D ray distance to quantify matching quality. Employing these distances as a cost function of a (partial) OT problem, we arrive at efficiently solvable assignment problems. Moreover, we extend our approach to unsupervised object matching by formulating it as a hierarchical OT problem. The resulting algorithms allow for efficient feature and object matching, as demonstrated in our numerical experiments. Here, we focus on applications in facial analysis, where we aim to match distinct landmarking conventions.
https://arxiv.org/abs/2601.12423
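The epipolar-distance cost is standard two-view geometry; a tiny sketch with a brute-force assignment standing in for the OT solver (fine for a handful of landmarks, and clearly not the paper's partial/hierarchical formulation):

```python
import numpy as np
from itertools import permutations

def epipolar_dist(F, x1, x2):
    """Distance from x2 to the epipolar line F @ x1 (points in 2D,
    lifted to homogeneous coordinates)."""
    l = F @ np.append(x1, 1.0)
    return abs(l @ np.append(x2, 1.0)) / np.hypot(l[0], l[1])

def match_by_epipolar_cost(F, pts1, pts2):
    """Optimal one-to-one matching under the epipolar cost matrix,
    found by exhaustive search over permutations."""
    C = np.array([[epipolar_dist(F, p, q) for q in pts2] for p in pts1])
    n = len(pts1)
    best = min(permutations(range(n)),
               key=lambda pi: sum(C[i, pi[i]] for i in range(n)))
    return list(best), C
```

An OT or linear-assignment solver replaces the permutation search when the landmark sets get large or only partially overlap.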
Orthognathic surgery repositions jaw bones to restore occlusion and enhance facial aesthetics. Accurate simulation of postoperative facial morphology is essential for preoperative planning. However, traditional biomechanical models are computationally expensive, while geometric deep learning approaches often lack interpretability. In this study, we develop and validate a physics-informed geometric deep learning framework named PhysSFI-Net for precise prediction of soft tissue deformation following orthognathic surgery. PhysSFI-Net consists of three components: a hierarchical graph module with craniofacial and surgical plan encoders combined with attention mechanisms to extract skeletal-facial interaction features; a Long Short-Term Memory (LSTM)-based sequential predictor for incremental soft tissue deformation; and a biomechanics-inspired module for high-resolution facial surface reconstruction. Model performance was assessed using point cloud shape error (Hausdorff distance), surface deviation error, and landmark localization error (Euclidean distances of craniomaxillofacial landmarks) between predicted facial shapes and corresponding ground truths. A total of 135 patients who underwent combined orthodontic and orthognathic treatment were included for model training and validation. Quantitative analysis demonstrated that PhysSFI-Net achieved a point cloud shape error of 1.070 +/- 0.088 mm, a surface deviation error of 1.296 +/- 0.349 mm, and a landmark localization error of 2.445 +/- 1.326 mm. Comparative experiments indicated that PhysSFI-Net outperformed the state-of-the-art method ACMT-Net in prediction accuracy. In conclusion, PhysSFI-Net enables interpretable, high-resolution prediction of postoperative facial morphology with superior accuracy, showing strong potential for clinical application in orthognathic surgical planning and simulation.
https://arxiv.org/abs/2601.02088
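The point-cloud shape error above is the Hausdorff distance, which is easy to state concretely; a brute-force numpy sketch (quadratic memory, fine for small clouds):

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point clouds A (n, d) and
    B (m, d): the worst-case nearest-neighbor distance, both directions."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```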
Face super-resolution aims to recover high-quality facial images from severely degraded low-resolution inputs, but remains challenging due to the loss of fine structural details and identity-specific features. This work introduces SwinIFS, a landmark-guided super-resolution framework that integrates structural priors with hierarchical attention mechanisms to achieve identity-preserving reconstruction at both moderate and extreme upscaling factors. The method incorporates dense Gaussian heatmaps of key facial landmarks into the input representation, enabling the network to focus on semantically important facial regions from the earliest stages of processing. A compact Swin Transformer backbone is employed to capture long-range contextual information while preserving local geometry, allowing the model to restore subtle facial textures and maintain global structural consistency. Extensive experiments on the CelebA benchmark demonstrate that SwinIFS achieves superior perceptual quality, sharper reconstructions, and improved identity retention; it consistently produces more photorealistic results and exhibits strong performance even under 8x magnification, where most methods fail to recover meaningful structure. SwinIFS also provides an advantageous balance between reconstruction accuracy and computational efficiency, making it suitable for real-world applications in facial enhancement, surveillance, and digital restoration. Our code, model weights, and results are available at this https URL.
https://arxiv.org/abs/2601.01406
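Dense Gaussian landmark heatmaps like those fed into SwinIFS follow a common recipe; a sketch (the sigma is illustrative, not the paper's setting):

```python
import numpy as np

def landmark_heatmaps(landmarks, size, sigma=2.0):
    """One Gaussian heatmap channel per landmark.

    landmarks: iterable of (x, y) pixel coordinates; size: (H, W).
    Returns (N, H, W), each channel peaking at 1 on its landmark.
    """
    H, W = size
    ys, xs = np.mgrid[0:H, 0:W]
    return np.stack([
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        for x, y in landmarks
    ])
```

Concatenating these channels with the low-resolution input is what lets the network attend to semantically important regions from the first layer.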
Movie dubbing seeks to synthesize speech from a given script using a specific voice, while ensuring accurate lip synchronization and emotion-prosody alignment with the character's visual performance. However, existing alignment approaches based on visual features face two key limitations: (1) they rely on complex, handcrafted visual preprocessing pipelines, including facial landmark detection and feature extraction; and (2) they generalize poorly to unseen visual domains, often resulting in degraded alignment and dubbing quality. To address these issues, we propose InstructDubber, a novel instruction-based alignment dubbing method for both robust in-domain and zero-shot movie dubbing. Specifically, we first feed the video, script, and corresponding prompts into a multimodal large language model to generate natural language dubbing instructions regarding the speaking rate and emotion state depicted in the video, which is robust to visual domain variations. Second, we design an instructed duration distilling module to mine discriminative duration cues from speaking rate instructions to predict lip-aligned phoneme-level pronunciation duration. Third, for emotion-prosody alignment, we devise an instructed emotion calibrating module, which finetunes an LLM-based instruction analyzer using ground truth dubbing emotion as supervision and predicts prosody based on the calibrated emotion analysis. Finally, the predicted duration and prosody, together with the script, are fed into the audio decoder to generate video-aligned dubbing. Extensive experiments on three major benchmarks demonstrate that InstructDubber outperforms state-of-the-art approaches across both in-domain and zero-shot scenarios.
https://arxiv.org/abs/2512.17154
Facial expression recognition is a crucial component in enhancing human-computer interaction and developing emotion-aware systems. Real-time detection and interpretation of facial expressions have become increasingly important for various applications, from user experience personalization to intelligent surveillance systems. This study presents a novel approach to real-time sequential facial expression recognition using deep learning and geometric features. The proposed method utilizes MediaPipe FaceMesh for rapid and accurate facial landmark detection. Geometric features, including Euclidean distances and angles, are extracted from these landmarks. Temporal dynamics are incorporated by analyzing feature differences between consecutive frames, enabling the detection of onset, apex, and offset phases of expressions. For classification, a ConvLSTM1D network followed by multilayer perceptron blocks is employed. The method's performance was evaluated on multiple publicly available datasets, including CK+, Oulu-CASIA (VIS and NIR), and MMI. Accuracies of 93%, 79%, 77%, and 68% were achieved respectively. Experiments with composite datasets were also conducted to assess the model's generalization capabilities. The approach demonstrated real-time applicability, processing approximately 165 frames per second on consumer-grade hardware. This research contributes to the field of facial expression analysis by providing a fast, accurate, and adaptable solution. The findings highlight the potential for further advancements in emotion-aware technologies and personalized user experiences, paving the way for more sophisticated human-computer interaction systems. To facilitate further research in this field, the complete source code for this study has been made publicly available on GitHub: this https URL.
https://arxiv.org/abs/2512.05669
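The geometric feature extraction described above, Euclidean distances and angles over landmarks, plus frame-to-frame differences, can be sketched generically (the pair/triplet indices here are placeholders, not the study's chosen landmark sets):

```python
import numpy as np

def geometric_features(lm, pairs, triplets):
    """Per-frame features from 2D landmarks: Euclidean distances for index
    pairs and interior angles (at the middle index) for index triplets."""
    feats = [np.linalg.norm(lm[i] - lm[j]) for i, j in pairs]
    for a, b, c in triplets:
        v1, v2 = lm[a] - lm[b], lm[c] - lm[b]
        cosang = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        feats.append(np.arccos(np.clip(cosang, -1.0, 1.0)))
    return np.array(feats)

def temporal_deltas(frames, pairs, triplets):
    """Feature differences between consecutive frames -- the temporal signal
    used to separate onset, apex, and offset phases of an expression."""
    f = np.array([geometric_features(lm, pairs, triplets) for lm in frames])
    return np.diff(f, axis=0)
```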
Alcohol consumption is a significant public health concern and a major cause of accidents and fatalities worldwide. This study introduces a novel video-based facial sequence analysis approach dedicated to the detection of alcohol intoxication. The method integrates facial landmark analysis via a Graph Attention Network (GAT) with spatiotemporal visual features extracted using a 3D ResNet. These features are dynamically fused with adaptive prioritization to enhance classification performance. Additionally, we introduce a curated dataset comprising 3,542 video segments derived from 202 individuals to support training and evaluation. Our model is compared against two baselines: a custom 3D-CNN and a VGGFace+LSTM architecture. Experimental results show that our approach achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming prior methods. The findings demonstrate the model's potential for practical deployment in public safety systems for non-invasive, reliable alcohol intoxication detection.
https://arxiv.org/abs/2512.04536
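"Dynamically fused with adaptive prioritization" usually denotes some form of learned gating between feature streams; a generic gated-fusion sketch (one plausible construction, not the paper's module, and `Wg` is a hypothetical gate weight matrix):

```python
import numpy as np

def adaptive_fusion(f_graph, f_video, Wg):
    """Gated fusion: a sigmoid gate decides, per dimension, how much to
    trust the landmark-graph features vs. the spatiotemporal features.

    f_graph, f_video: (d,) feature vectors; Wg: (d, 2d) gate weights.
    """
    z = np.concatenate([f_graph, f_video])
    g = 1.0 / (1.0 + np.exp(-(Wg @ z)))  # gate in (0, 1)
    return g * f_graph + (1.0 - g) * f_video
```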
Generalizing deepfake detection to unseen manipulations remains a key challenge. A recent approach to tackle this issue is to train a network with pristine face images that have been manipulated with hand-crafted artifacts to extract more generalizable clues. While effective for static images, extending this to the video domain is an open issue. Existing methods model temporal artifacts as frame-to-frame instabilities, overlooking a key vulnerability: the violation of natural motion dependencies between different facial regions. In this paper, we propose a synthetic video generation method that creates training data with subtle kinematic inconsistencies. We train an autoencoder to decompose facial landmark configurations into motion bases. By manipulating these bases, we selectively break the natural correlations in facial movements and introduce these artifacts into pristine videos via face morphing. A network trained on our data learns to spot these sophisticated biomechanical flaws, achieving state-of-the-art generalization results on several popular benchmarks.
https://arxiv.org/abs/2512.04175
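A PCA decomposition is a crude linear stand-in for the learned autoencoder bases, but it shows the manipulation step: rescale one component's coefficients to break that motion's natural correlation with the rest, then reconstruct:

```python
import numpy as np

def motion_bases_pca(traj, k):
    """PCA stand-in for learned motion bases: traj is (T, D) flattened
    landmark frames; returns mean (D,), bases (k, D), coefficients (T, k)."""
    mu = traj.mean(axis=0)
    _, _, Vt = np.linalg.svd(traj - mu, full_matrices=False)
    B = Vt[:k]
    return mu, B, (traj - mu) @ B.T

def perturb_basis(mu, B, coeff, idx, scale):
    """Selectively break one motion component: rescale its coefficients and
    reconstruct, leaving the other components' correlations intact."""
    c = coeff.copy()
    c[:, idx] *= scale
    return mu + c @ B
```

In the paper the perturbed landmark sequence then drives face morphing on pristine videos to plant the kinematic inconsistency.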
We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total, among which more than 52,000 are longer than 10 frames, including 68 manually reviewed trajectories that span the entire 150 frames. To the best of our knowledge, PoreTrack3D is the first benchmark dataset to capture both traditional facial landmark and pore-scale keypoint trajectories, advancing the study of fine-grained facial expressions through the analysis of subtle skin-surface motion. We systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D, establishing the first performance baseline in this domain. Overall, the pipeline developed for this benchmark dataset's creation establishes a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction. Our dataset is publicly available at: this https URL
https://arxiv.org/abs/2512.02648
Detecting driver drowsiness reliably is crucial for enhancing road safety and supporting advanced driver assistance systems (ADAS). We introduce the Eyelid Angle (ELA), a novel, reproducible metric of eye openness derived from 3D facial landmarks. Unlike conventional binary eye state estimators or 2D measures, such as the Eye Aspect Ratio (EAR), the ELA provides a stable geometric description of eyelid motion that is robust to variations in camera angle. Using the ELA, we design a blink detection framework that extracts temporal characteristics, including the closing, closed, and reopening durations, which are shown to correlate with drowsiness levels. To address the scarcity and risk of collecting natural drowsiness data, we further leverage ELA signals to animate rigged avatars in Blender 3D, enabling the creation of realistic synthetic datasets with controllable noise, camera viewpoints, and blink dynamics. Experimental results in public driver monitoring datasets demonstrate that the ELA offers lower variance under viewpoint changes compared to EAR and achieves accurate blink detection. At the same time, synthetic augmentation expands the diversity of training data for drowsiness recognition. Our findings highlight the ELA as both a reliable biometric measure and a powerful tool for generating scalable datasets in driver state monitoring.
https://arxiv.org/abs/2511.19519
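For contrast with the proposed 3D Eyelid Angle, the classic 2D Eye Aspect Ratio it is compared against is a one-liner over six eye-contour landmarks (Soukupova and Cech's formulation):

```python
import numpy as np

def eye_aspect_ratio(p):
    """EAR = (||p2-p6|| + ||p3-p5||) / (2 * ||p1-p4||) for the six eye
    landmarks p1..p6 in the usual contour order (p given 0-indexed)."""
    p = np.asarray(p, dtype=float)
    return (np.linalg.norm(p[1] - p[5]) + np.linalg.norm(p[2] - p[4])) \
        / (2 * np.linalg.norm(p[0] - p[3]))
```

Because EAR is a ratio of 2D image distances, it shifts with camera viewpoint; the paper's ELA instead measures eyelid opening geometrically from 3D landmarks, which is what makes it stable across camera angles.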