Facial Expression Recognition (FER) in the wild is still challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods primarily rely on visual appearance cues, suffering from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) is presented to leverage the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism is utilized to further adapt the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual-textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that our LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets, including RAF-DB, FERPlus, and AffectNet. The code is available at this https URL.
https://arxiv.org/abs/2605.19821
Security systems demand continuous, cryptographically robust identity verification without requiring subjects to carry physical tokens, smart cards, or dedicated hardware authenticators. This paper presents BIDO (Biometric Identity Online), a device-free authentication standard that achieves Authenticator Assurance Level 2 (AAL2) per NIST SP 800-63B without storing long-lived biometric templates, facial images, or any other form of Personally Identifiable Information (PII). BIDO derives Elliptic Curve Digital Signature Algorithm (ECDSA) key material deterministically from a live biometric measurement salted with a user-defined memorized secret at every authentication event, eliminating persistent private-key storage while enabling verification from any commodity sensor terminal. The generated credentials are non-discoverable (non-resident) Web Authentication (WebAuthn) credentials, fully compatible with all FIDO2-enabled websites and services without modification on the server side. A multi-stage pipeline, comprising capture of 200 valid biometric samples, feature extraction using the Dlib 68-point facial landmark predictor, affine face alignment, frontality gating, Euclidean distance computation from the inter-eye midpoint, floor-division quantization with divisor q = 8, inter-session drift stabilization, and majority-voting SHA-256 hash binding, produces a Verification Seed (Vseed) from which the WebAuthn credential is transiently derived and immediately zeroized after signing. Evaluated against three prominent face benchmarks (VGGFace2, LFW, and MegaFace), BIDO achieves 99.51% verification accuracy on LFW and 92.14% Rank-1 identification accuracy on MegaFace Challenge 1 at 10^6 distractors, with a cryptographic False Accept Rate (FAR) of 0.03% and a False Reject Rate (FRR) of 0.90%.
https://arxiv.org/abs/2605.16908
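The quantization and hash-binding stages named in the BIDO abstract can be illustrated with a minimal sketch. It assumes the landmarks are already aligned 2D points, omits frontality gating and drift stabilization, and uses hypothetical function names and a hypothetical byte encoding for the secret; it is not the BIDO implementation.

```python
import hashlib
import math
from collections import Counter

Q = 8  # floor-division divisor stated in the abstract


def quantized_distances(landmarks, left_eye, right_eye, q=Q):
    """Distances from the inter-eye midpoint, floor-divided by q."""
    mx = (left_eye[0] + right_eye[0]) / 2.0
    my = (left_eye[1] + right_eye[1]) / 2.0
    return tuple(int(math.hypot(x - mx, y - my) // q) for x, y in landmarks)


def majority_vote(samples):
    """Per-position majority vote across equally long quantized samples."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*samples))


def derive_seed(samples, secret):
    """SHA-256 over the secret and the voted vector (encoding is assumed)."""
    voted = majority_vote(samples)
    payload = secret.encode("utf-8") + b"|" + ",".join(map(str, voted)).encode("ascii")
    return hashlib.sha256(payload).hexdigest()
```

Coarse quantization lets small measurement jitter fall into the same bin, and the vote discards occasional outlier samples, so repeated captures can map to the same hash input.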
Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that, compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0X and 2.2X, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at this https URL.
https://arxiv.org/abs/2605.11743
Text-to-image (T2I) models, and their encoded biases, increasingly shape the visual media the public encounters. While researchers have produced a rich body of work on bias measurement, auditing, and mitigation in T2I systems, those methods largely target technical stakeholders, leaving a gap in public legibility. We introduce GLEaN (Generative Likeness Evaluation at N-Scale), a portrait-based explainability pipeline designed to make T2I model biases visually understandable to a broad audience. GLEaN comprises three stages: automated large-scale image generation from identity prompts, facial landmark-based filtering and spatial alignment, and median-pixel composition that distills a model's central tendency into a single representative portrait. The resulting composites require no statistical background to interpret; a viewer can see, at a glance, who a model 'imagines' when prompted with 'a doctor' versus a 'felon.' We demonstrate GLEaN on Stable Diffusion XL across 40 social and occupational identity prompts, producing composites that reproduce documented biases and surface new associations between skin tone and predicted emotion. We find in a between-subjects user study (N = 291) that GLEaN portraits communicate biases as effectively as conventional data tables, but require significantly less viewing time. Because the method relies solely on generated outputs, it can also be replicated on any black-box and closed-weight systems without access to model internals. GLEaN offers a scalable, model-agnostic approach to bias explainability, purpose-built for public comprehension, and is publicly available at this https URL.
https://arxiv.org/abs/2604.09923
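The median-pixel composition stage in the GLEaN abstract can be sketched as a per-pixel median over aligned images. The sketch below assumes grayscale images stored as equally sized nested lists and a hypothetical function name; the actual pipeline presumably works on color images after landmark-based alignment.

```python
from statistics import median


def median_composite(images):
    """Per-pixel median across aligned grayscale images of equal size.

    `images` is a list of 2D lists (rows of pixel intensities).
    """
    if not images:
        raise ValueError("need at least one image")
    height, width = len(images[0]), len(images[0][0])
    return [
        [median(img[r][c] for img in images) for c in range(width)]
        for r in range(height)
    ]
```

A median, unlike a mean, is not pulled toward a few atypical images, which fits the stated goal of showing a model's central tendency.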
Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing age, gender, and race biases. To this end, we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Evaluations of a standard representative model demonstrate that confounding visual factors, particularly head pose and image resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, we show that performance disparities across gender and race vanish. However, we identify a statistically significant age-related effect, with higher bias observed for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline, disproportionately affecting vulnerable populations. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.
https://arxiv.org/abs/2604.06961
Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision-language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.
https://arxiv.org/abs/2604.03454
Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at this https URL.
https://arxiv.org/abs/2603.24724
The widespread sharing of face images on social media platforms and in large-scale datasets raises pressing privacy concerns, as biometric identifiers can be exploited without consent. Face anonymization seeks to generate realistic facial images that irreversibly conceal the subject's identity while preserving their usefulness for downstream tasks. However, most existing generative approaches focus on identity removal and image realism, often neglecting facial expressions as well as photometric consistency -- specifically attributes such as illumination and skin tone -- that are critical for applications like relighting, color constancy, and medical or affective analysis. In this work, we propose a feature-preserving anonymization framework that extends DeepPrivacy by incorporating dense facial landmarks to better retain expressions, and by introducing lightweight post-processing modules that ensure consistency in lighting direction and skin color. We further establish evaluation metrics specifically designed to quantify expression fidelity, lighting consistency, and color preservation, complementing standard measures of image realism, pose accuracy, and re-identification resistance. Experiments on the CelebA-HQ dataset demonstrate that our method produces anonymized faces with improved realism and significantly higher fidelity in expression, illumination, and skin tone compared to state-of-the-art baselines. These results underscore the importance of feature-aware anonymization as a step toward more useful, fair, and trustworthy privacy-preserving facial data.
https://arxiv.org/abs/2603.17567
Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and browser use. We therefore target a deployment-oriented operating point rather than the image large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.
https://arxiv.org/abs/2603.12388
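The per-session ridge head mentioned in the EMC-Gaze abstract has the standard closed form W = (XᵀX + λI)⁻¹ XᵀY. The sketch below fits that head from a handful of calibration embeddings in pure Python. The function names, the absence of a bias term, and the value of λ are assumptions, and the episodic meta-training that differentiates through this solve is not shown.

```python
def ridge_fit(X, Y, lam=1e-3):
    """Solve (X^T X + lam*I) W = X^T Y by Gauss-Jordan elimination.

    X: n x d list of embeddings, Y: n x k list of targets. Returns d x k W.
    """
    n, d, k = len(X), len(X[0]), len(Y[0])
    # Augmented matrix [X^T X + lam*I | X^T Y].
    A = [
        [sum(X[i][a] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
         for b in range(d)]
        + [sum(X[i][a] * Y[i][j] for i in range(n)) for j in range(k)]
        for a in range(d)
    ]
    for col in range(d):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(d):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [row[d:] for row in A]


def ridge_predict(W, x):
    """Map one embedding x (length d) to a k-dimensional prediction."""
    return [sum(x[a] * W[a][j] for a in range(len(x))) for j in range(len(W[0]))]
```

Because the fit is a single linear solve, it can be rerun cheaply for every new session, which is what makes a short calibration practical.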
Accurate facial expression imitation on human-face robots is crucial for achieving natural human-robot interaction. Most existing methods have achieved photorealistic expression imitation through mapping 2D facial landmarks to a robot's actuator commands. Their imitation of landmark trajectories is susceptible to interference from facial morphology, which would lead to a performance drop. In this paper, we propose a morphology-independent expression imitation method that decouples expressions from facial morphology to eliminate morphological influence and produce more realistic expressions for human-face robots. Specifically, we construct an expression decoupling module to learn expression semantics by disentangling the expression representation from the morphology representation in a self-supervised manner. We devise an expression transfer module to map the representations to the robot's actuator commands through a learning objective of perceiving expression errors, producing accurate facial expressions based on the learned expression semantics. To support experimental validation, a custom-designed and highly expressive human-face robot, namely Pengrui, is developed to serve as an experimental platform for realistic expression imitation. Extensive experiments demonstrate that our method enables the human-face robot to reproduce a wide range of human-like expressions effectively. All code and implementation details of the robot will be released.
https://arxiv.org/abs/2603.07068
With the rapid advancement of deepfake technology, malicious face manipulations pose a significant threat to personal privacy and social security. However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. Our core framework adopts an innovative 152-dimensional landmark-identity watermark termed LIDMark, which structurally interweaves facial landmarks with a unique source identifier. To robustly extract the LIDMark, we design a novel Factorized-Head Decoder (FHD). Its architecture factorizes the shared backbone features into two specialized heads (i.e., regression and classification), robustly reconstructing the embedded landmarks and identifier, respectively, even when subjected to severe distortion or tampering. This design realizes an "all-in-one" trifunctional forensic solution: the regression head underlies an "intrinsic-extrinsic" consistency check for detection and localization, while the classification head robustly decodes the source identifier for tracing. Extensive experiments show that the proposed LIDMark framework provides a unified, robust, and imperceptible solution for the detection, localization, and tracing of deepfake content. The code is available at this https URL.
https://arxiv.org/abs/2602.23523
Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
https://arxiv.org/abs/2602.23101
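The event-rate strategy in the LADS abstract can be illustrated with a toy time surface whose decay constant varies per pixel. The modulation rule below (decay constant shrinking with the per-pixel event count), the parameter names, and the default values are all hypothetical; the abstract does not give the paper's formula, and a real implementation would likely use a local neighborhood rather than a single pixel.

```python
import math


def adaptive_time_surface(events, width, height, t_now,
                          tau_min=0.005, tau_max=0.1, rate_scale=10.0):
    """Exponentially decayed time surface with a per-pixel decay constant.

    events: iterable of (x, y, t) with t <= t_now, times in seconds.
    """
    last_t = [[None] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for x, y, t in events:
        if last_t[y][x] is None or t > last_t[y][x]:
            last_t[y][x] = t
        count[y][x] += 1
    surface = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if last_t[y][x] is None:
                continue  # no events at this pixel
            # Hypothetical rule: more events at a pixel -> shorter decay constant.
            tau = max(tau_min, tau_max / (1.0 + count[y][x] / rate_scale))
            surface[y][x] = math.exp(-(t_now - last_t[y][x]) / tau)
    return surface
```

With this rule, quiet pixels keep their last event visible for longer while busy pixels fade quickly, which mirrors the trade-off the abstract describes between preserving structure and limiting blur.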
Craniofacial Superimposition is a forensic technique for identifying skeletal remains by comparing a post-mortem skull with ante-mortem facial photographs. A critical step in this process is Skull-Face Overlay (SFO). This stage involves aligning a 3D skull model with a 2D facial image, typically guided by cranial and facial landmarks' correspondence. However, its accuracy is undermined by individual variability in soft-tissue thickness, introducing significant uncertainty into the overlay. This paper introduces Lilium, an automated evolutionary method to enhance the accuracy and robustness of SFO. Lilium explicitly models soft-tissue variability using a 3D cone-based representation whose parameters are optimized via a Differential Evolution algorithm. The method enforces anatomical, morphological, and photographic plausibility through a combination of constraints: landmark matching, camera parameter consistency, head pose alignment, skull containment within facial boundaries, and region parallelism. This emulation of the usual forensic practitioners' approach leads Lilium to outperform the state-of-the-art method in terms of both accuracy and robustness.
https://arxiv.org/abs/2603.00170
This study quantifies gender and skin-tone bias in two widely deployed commercial image generators - Gemini Flash 2.5 Image (NanoBanana) and GPT Image 1.5 - to test the assumption that neutral prompts yield demographically neutral outputs. We generated 3,200 photorealistic images using four semantically neutral prompts. The analysis employed a rigorous pipeline combining hybrid color normalization, facial landmark masking, and perceptually uniform skin tone quantification using the Monk (MST), PERLA, and Fitzpatrick scales. Neutral prompts produced highly polarized defaults. Both models exhibited a strong "default white" bias (>96% of outputs). However, they diverged sharply on gender: Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones. This research provides a large-scale, comparative audit of state-of-the-art models using an illumination-aware colorimetric methodology, distinguishing aesthetic rendering from underlying pigmentation in synthetic imagery. The study demonstrates that neutral prompts function as diagnostic probes rather than neutral instructions. It offers a robust framework for auditing algorithmic visual culture and challenges the sociolinguistic assumption that unmarked language results in inclusive representation.
https://arxiv.org/abs/2602.12133
Accurate facial landmark detection under occlusion remains challenging, especially for human-like faces with large appearance variation and rotation-driven self-occlusion. Existing detectors typically localize landmarks while handling occlusion implicitly, without predicting the per-point visibility that downstream applications can benefit from. We present OccFace, an occlusion-aware framework for universal human-like faces, including humans, stylized characters, and other non-human designs. OccFace adopts a unified dense 100-point layout and a heatmap-based backbone, and adds an occlusion module that jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context. Visibility supervision mixes manual labels with landmark-aware masking that derives pseudo visibility from mask-heatmap overlap. We also create an occlusion-aware evaluation suite reporting NME on visible vs. occluded landmarks and benchmarking visibility with Occ AP, F1@0.5, and ROC-AUC, together with a dataset annotated with 100-point landmarks and per-point visibility. Experiments show improved robustness under external occlusion and large head rotations, especially on occluded regions, while preserving accuracy on visible landmarks.
https://arxiv.org/abs/2602.10728
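Reporting NME separately on visible and occluded landmarks, as the OccFace evaluation suite does, amounts to averaging normalized point errors within each visibility group. The sketch below shows that split for one face. The function name and the choice of normalizing length are assumptions, since the abstract does not state which normalizer is used.

```python
import math


def nme_by_visibility(pred, gt, visible, norm):
    """Return (NME over visible points, NME over occluded points).

    pred, gt: lists of (x, y); visible: list of bools; norm: normalizing
    length (for example inter-ocular distance; the choice is an assumption).
    A group with no points yields None.
    """
    errs = {True: [], False: []}
    for (px, py), (gx, gy), v in zip(pred, gt, visible):
        errs[bool(v)].append(math.hypot(px - gx, py - gy) / norm)

    def mean(xs):
        return sum(xs) / len(xs) if xs else None

    return mean(errs[True]), mean(errs[False])
```

Keeping the two averages apart prevents good accuracy on the many visible points from hiding large errors on the few occluded ones.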
3D Morphable Models (3DMMs) take 2D images as inputs and recreate the structure and physical appearance of 3D objects, especially human faces and bodies. A 3DMM combines identity and expression blendshapes with a basic face mesh to create a detailed 3D model. The variability in 3D Morphable Models can be controlled by tuning diverse parameters, which are high-level image descriptors such as shape, texture, illumination, and camera parameters. Previous research in 3D human reconstruction concentrated solely on global face structure or geometry, ignoring face semantic features such as age, gender, and facial landmarks characterizing facial boundaries, curves, dips, and wrinkles. In order to accommodate changes in these high-level facial characteristics, this work introduces a shape- and appearance-aware 3D reconstruction system (named SARS by us), a modular pipeline that extracts body and face information from a single image to properly rebuild the 3D model of the human full body.
https://arxiv.org/abs/2602.09918
One-shot prediction enables rapid adaptation of pretrained foundation models to new tasks using only one labeled example, but lacks principled uncertainty quantification. While conformal prediction provides finite-sample coverage guarantees, standard split conformal methods are inefficient in the one-shot setting due to data splitting and reliance on a single predictor. We propose Conformal Aggregation of One-Shot Predictors (CAOS), a conformal framework that adaptively aggregates multiple one-shot predictors and uses a leave-one-out calibration scheme to fully exploit scarce labeled data. Despite violating classical exchangeability assumptions, we prove that CAOS achieves valid marginal coverage using a monotonicity-based argument. Experiments on one-shot facial landmarking and RAFT text classification tasks show that CAOS produces substantially smaller prediction sets than split conformal baselines while maintaining reliable coverage.
https://arxiv.org/abs/2601.05219
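For context, the split conformal baseline that the CAOS abstract compares against can be sketched in a few lines: a held-out calibration set yields a score threshold, and the prediction set keeps every candidate below it. This is the standard baseline, not CAOS itself, whose aggregation and leave-one-out calibration are not reproduced here. Function names are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha=0.1):
    """Finite-sample split-conformal quantile of nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Keep candidates whose nonconformity score does not exceed the threshold."""
    return [label for label, s in candidate_scores.items() if s <= threshold]
```

The infinite threshold for very small calibration sets shows why data splitting is costly in the one-shot setting: with too few held-out points, the prediction set has to include everything.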
High-precision facial landmark detection (FLD) relies on high-resolution deep feature representations. However, low-resolution face images or the compression (via pooling or strided convolution) of originally high-resolution images hinder the learning of such features, thereby reducing FLD accuracy. Moreover, insufficient training data and imprecise annotations further degrade performance. To address these challenges, we propose a weakly-supervised framework called Supervision-by-Hallucination-and-Transfer (SHT) for more robust and precise FLD. SHT contains two novel mutually enhanced modules: Dual Hallucination Learning Network (DHLN) and Facial Pose Transfer Network (FPTN). By incorporating FLD and face hallucination tasks, DHLN is able to learn high-resolution representations with low-resolution inputs for recovering both facial structures and local details and generating more effective landmark heatmaps. Then, by transforming faces from one pose to another, FPTN can further improve landmark heatmaps and faces hallucinated by DHLN for detecting more accurate landmarks. To the best of our knowledge, this is the first study to explore weakly-supervised FLD by integrating face hallucination and facial pose transfer tasks. Experimental results of both face hallucination and FLD demonstrate that our method surpasses state-of-the-art techniques.
https://arxiv.org/abs/2601.12919
Recently, deep learning based facial landmark detection (FLD) methods have achieved considerable success. However, in challenging scenarios such as large pose variations, illumination changes, and facial expression variations, they still struggle to accurately capture the geometric structure of the face, resulting in performance degradation. Moreover, the limited size and diversity of existing FLD datasets hinder robust model training, leading to reduced detection accuracy. To address these challenges, we propose a Frequency-Guided Task-Balancing Transformer (FGTBT), which enhances facial structure perception through frequency-domain modeling and multi-dataset unified training. Specifically, we propose a novel Fine-Grained Multi-Task Balancing loss (FMB-loss), which moves beyond coarse task-level balancing by assigning weights to individual landmarks based on their occurrence across datasets. This enables more effective unified training and mitigates the issue of inconsistent gradient magnitudes. Additionally, a Frequency-Guided Structure-Aware (FGSA) model is designed to utilize frequency-guided structure injection and regularization to help learn facial structure constraints. Extensive experimental results on popular benchmark datasets demonstrate that the integration of the proposed FMB-loss and FGSA model into our FGTBT framework achieves performance comparable to state-of-the-art methods. The code is available at this https URL.
https://arxiv.org/abs/2601.12863
Stereo vision between images faces a range of challenges, including occlusions, motion, and camera distortions, across applications in autonomous driving, robotics, and face analysis. Due to parameter sensitivity, further complications arise for stereo matching with sparse features, such as facial landmarks. To overcome this ill-posedness and enable unsupervised sparse matching, we consider line constraints of the camera geometry from an optimal transport (OT) viewpoint. Formulating camera-projected points as (half)lines, we propose the use of the classical epipolar distance as well as a 3D ray distance to quantify matching quality. Employing these distances as a cost function of a (partial) OT problem, we arrive at efficiently solvable assignment problems. Moreover, we extend our approach to unsupervised object matching by formulating it as a hierarchical OT problem. The resulting algorithms allow for efficient feature and object matching, as demonstrated in our numerical experiments. Here, we focus on applications in facial analysis, where we aim to match distinct landmarking conventions.
https://arxiv.org/abs/2601.12423
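The idea of using the classical epipolar distance as a matching cost can be sketched as follows. The code assumes a known fundamental matrix F and equally sized point sets, and it replaces the paper's (partial) optimal transport solver with a brute-force search over permutations, which is only feasible for a handful of points. Function names are hypothetical.

```python
import math
from itertools import permutations


def epipolar_distance(F, p, q):
    """Distance from q (image 2) to the epipolar line F @ p of p (image 1)."""
    x = (p[0], p[1], 1.0)
    a, b, c = (sum(F[i][j] * x[j] for j in range(3)) for i in range(3))
    return abs(a * q[0] + b * q[1] + c) / math.hypot(a, b)


def match_points(F, pts1, pts2):
    """Brute-force minimum-cost one-to-one assignment (small inputs only)."""
    n = len(pts1)
    cost = [[epipolar_distance(F, p, q) for q in pts2] for p in pts1]
    best = min(
        permutations(range(len(pts2)), n),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)  # best[i] is the index in pts2 matched to pts1[i]
```

For a rectified stereo pair, F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]] and the cost reduces to the vertical offset between the two points, which gives an easy sanity check.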