Pose-invariant face recognition has become a challenging problem for modern AI-based face recognition systems. It aims at matching a profile face captured in the wild with a frontal face registered in a database. Existing methods either perform face frontalization via generative models or learn a pose-robust feature representation. In this paper, a new method is presented that performs face frontalization and recognition within the feature space. First, a novel feature space pose frontalization module (FSPFM) is proposed to transform profile images with arbitrary angles into frontal counterparts. Second, a new training paradigm is proposed to maximize the potential of FSPFM and boost its performance; it consists of a pre-training stage and an attention-guided fine-tuning stage. Moreover, extensive experiments have been conducted on five popular face recognition benchmarks. Results show that our method not only outperforms the state-of-the-art on the pose-invariant face recognition task but also maintains superior performance in other standard scenarios.
https://arxiv.org/abs/2505.16412
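To make the feature-space idea concrete, below is a minimal sketch of what such a frontalization module could look like: a residual MLP that corrects a profile embedding toward its frontal counterpart, conditioned on a pose code. The 512-d embedding, the yaw input, and the layer sizes are assumptions for illustration; the paper's actual FSPFM architecture and its attention-guided fine-tuning are not reproduced here.

```python
# Minimal sketch (assumptions: 512-d embeddings, pose given as a normalized yaw code).
import torch
import torch.nn as nn

class FeatureSpaceFrontalizer(nn.Module):
    """Maps a profile-face embedding to a frontal-like embedding via a residual MLP."""
    def __init__(self, dim=512, pose_dim=1, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + pose_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, feat, yaw):
        # feat: (B, 512) profile embedding, yaw: (B, 1) normalized pose code
        residual = self.net(torch.cat([feat, yaw], dim=1))
        out = feat + residual                       # predict a correction toward the frontal feature
        return nn.functional.normalize(out, dim=1)  # keep embeddings on the unit hypersphere

# Usage: frontal_like = FeatureSpaceFrontalizer()(profile_feat, yaw_degrees / 90.0)
```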
Face recognition is an effective technology for identifying a target person from facial images. However, the use of sensitive facial images raises privacy concerns. Although privacy-preserving face recognition is one potential solution, existing solutions neither fully address the privacy concerns nor are efficient enough. To this end, we propose an efficient privacy-preserving solution for face recognition, named Pura, which sufficiently protects facial privacy and supports face recognition over encrypted data efficiently. Specifically, we propose a privacy-preserving and non-interactive architecture for face recognition based on the threshold Paillier cryptosystem. Additionally, we carefully design a suite of underlying secure computing protocols that enable efficient face recognition operations directly over encrypted data. Furthermore, we introduce a parallel computing mechanism to enhance the performance of the proposed secure computing protocols. Privacy analysis demonstrates that Pura fully safeguards personal facial privacy. Experimental evaluations demonstrate that Pura achieves recognition speeds up to 16 times faster than the state-of-the-art.
https://arxiv.org/abs/2505.15476
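As a rough illustration of recognition over encrypted data with an additively homomorphic scheme, the sketch below computes an encrypted squared Euclidean distance between an encrypted probe embedding and a plaintext gallery embedding using the python-paillier (`phe`) library. Pura itself relies on a threshold Paillier cryptosystem and a suite of custom non-interactive protocols, so this is only a conceptual stand-in, not the paper's construction.

```python
# Toy sketch with plain (non-threshold) Paillier and a 3-d embedding.
from phe import paillier

pub, priv = paillier.generate_paillier_keypair(n_length=2048)

# Client: encrypt the probe embedding x and its squared norm, send both to the server.
x = [0.12, -0.30, 0.45]
enc_x = [pub.encrypt(v) for v in x]
enc_x_sq = pub.encrypt(sum(v * v for v in x))

# Server: holds a plaintext gallery embedding y; computes Enc(||x - y||^2) without seeing x.
y = [0.10, -0.25, 0.40]
enc_dist = enc_x_sq + sum(v * v for v in y)   # ||x||^2 + ||y||^2 ...
for xi, yi in zip(enc_x, y):
    enc_dist += xi * (-2.0 * yi)              # ... - 2 <x, y>, using only scalar multiplication

# In a threshold scheme decryption would be shared; here the key holder simply decrypts.
print(priv.decrypt(enc_dist))
```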
Face recognition models have made substantial progress due to advances in deep learning and the availability of large-scale datasets. However, reliance on massive annotated datasets introduces challenges related to training computational cost and data storage, as well as potential privacy concerns regarding the management of large face datasets. This paper presents DiffProb, the first data pruning approach for face recognition. DiffProb assesses the prediction probabilities of training samples within each identity and prunes those with identical or close prediction probability values, as they are likely reinforcing the same decision boundaries and thus contribute little new information. We further enhance this process with an auxiliary cleaning mechanism that eliminates mislabeled and label-flipped samples, boosting data quality with minimal loss. Extensive experiments on CASIA-WebFace with different pruning ratios and multiple benchmarks, including LFW, CFP-FP, and IJB-C, demonstrate that DiffProb can prune up to 50% of the dataset while maintaining or even, in some settings, improving the verification accuracies. Additionally, we demonstrate DiffProb's robustness across different architectures and loss functions. Our method significantly reduces training cost and data volume, enabling efficient face recognition training and reducing the reliance on massive datasets and their demanding management.
https://arxiv.org/abs/2505.15272
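The core pruning rule lends itself to a short sketch: within each identity, samples whose prediction probabilities are nearly identical are collapsed to a single representative. The threshold `eps` and the per-identity grouping below are assumptions for illustration; the paper's exact criterion and its auxiliary cleaning mechanism are not reproduced.

```python
from collections import defaultdict

def prune_by_probability(sample_ids, labels, probs, eps=1e-3):
    """Keep one representative among same-identity samples whose predicted probabilities
    (for their ground-truth class) lie within `eps` of each other."""
    by_identity = defaultdict(list)
    for sid, lab, p in zip(sample_ids, labels, probs):
        by_identity[lab].append((p, sid))

    keep = []
    for lab, items in by_identity.items():
        items.sort()                      # sort by probability
        last_p = None
        for p, sid in items:
            if last_p is None or abs(p - last_p) > eps:
                keep.append(sid)          # a new probability level carries new information
                last_p = p
    return keep

# Example: two near-identical probabilities for identity 0 collapse to a single sample.
print(prune_by_probability([0, 1, 2, 3], [0, 0, 0, 1], [0.91, 0.9102, 0.70, 0.88]))
```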
Face recognition performance based on deep learning heavily relies on large-scale training data, which is often difficult to acquire in practical applications. To address this challenge, this paper proposes a GAN-based data augmentation method with three key contributions: (1) a residual-embedded generator to alleviate gradient vanishing/exploding problems, (2) an Inception ResNet-V1 based FaceNet discriminator for improved adversarial training, and (3) an end-to-end framework that jointly optimizes data generation and recognition performance. Experimental results demonstrate that our approach achieves stable training dynamics and significantly improves face recognition accuracy by 12.7% on the LFW benchmark compared to baseline methods, while maintaining good generalization capability with limited training samples.
https://arxiv.org/abs/2505.11884
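A minimal sketch of a residual-embedded generator is shown below; the skip connections are what alleviate the vanishing/exploding gradients mentioned above. Layer sizes, output resolution, and the overall topology are illustrative assumptions, as the abstract does not specify the exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))  # skip connection eases gradient flow

class Generator(nn.Module):
    """Noise -> 64x64 face image, with residual blocks embedded in the trunk."""
    def __init__(self, z_dim=128, channels=64, n_blocks=4):
        super().__init__()
        self.fc = nn.Linear(z_dim, channels * 8 * 8)
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=8, mode="nearest"),
            nn.Conv2d(channels, 3, 3, padding=1),
            nn.Tanh(),
        )

    def forward(self, z):
        x = self.fc(z).view(z.size(0), -1, 8, 8)
        return self.up(self.blocks(x))

# img = Generator()(torch.randn(4, 128))  # -> (4, 3, 64, 64)
```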
Face recognition has evolved significantly with the advancement of deep learning techniques, enabling its widespread adoption in various applications requiring secure authentication. However, this progress has also increased its exposure to presentation attacks, including face morphing, which poses a serious security threat by allowing one identity to impersonate another. Therefore, modern face recognition systems must be robust against such attacks. In this work, we propose a novel approach for training deep networks for face recognition with enhanced robustness to face morphing attacks. Our method modifies the classification task by introducing a dual-branch classification strategy that effectively handles the ambiguity in the labeling of face morphs. This adaptation allows the model to incorporate morph images into the training process, improving its ability to distinguish them from bona fide samples. Our strategy has been validated on public benchmarks, demonstrating its effectiveness in enhancing robustness against face morphing attacks. Furthermore, our approach is universally applicable and can be integrated into existing face recognition training pipelines to improve classification-based recognition methods.
https://arxiv.org/abs/2505.10497
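One plausible reading of the dual-branch strategy is sketched below: two classification branches share the backbone feature, a bona fide image supervises both branches with its single label, and a morph supervises them with its two source identities, with the branch-to-identity assignment chosen order-invariantly. This is an assumption based on the abstract, not necessarily the paper's exact formulation.

```python
# Hypothetical dual-branch head for morph-aware classification training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchHead(nn.Module):
    def __init__(self, feat_dim=512, n_ids=1000):
        super().__init__()
        self.branch_a = nn.Linear(feat_dim, n_ids)
        self.branch_b = nn.Linear(feat_dim, n_ids)

    def forward(self, feat):
        return self.branch_a(feat), self.branch_b(feat)

def dual_branch_loss(logits_a, logits_b, id1, id2):
    """id1 == id2 for bona fide samples; for a morph they are the two source identities."""
    l1 = F.cross_entropy(logits_a, id1, reduction="none") + F.cross_entropy(logits_b, id2, reduction="none")
    l2 = F.cross_entropy(logits_a, id2, reduction="none") + F.cross_entropy(logits_b, id1, reduction="none")
    return torch.minimum(l1, l2).mean()  # order-invariant assignment of identities to branches
```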
Visual storytelling systems struggle to maintain character identity across frames and to link actions to the appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed by grounding characters, objects, and other entities in the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction in hallucinations per story from 4.06 to 3.56 on average (-12.3%) compared to a non-fine-tuned model.
https://arxiv.org/abs/2505.10292
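A simplified sketch of cross-frame re-identification by greedy cosine matching of face embeddings is given below. The matching threshold and the running-prototype update are assumptions; the paper additionally combines body-appearance similarity with face recognition, which is omitted here.

```python
import numpy as np

def reidentify(frames, threshold=0.5):
    """frames: list of per-frame lists of L2-normalized face embeddings.
    Returns, per frame, the character id assigned to each face."""
    gallery = []                                   # one running prototype embedding per character
    assignments = []
    for faces in frames:
        ids = []
        for emb in faces:
            best, best_sim = -1, -1.0
            for k, proto in enumerate(gallery):
                sim = float(np.dot(emb, proto))    # cosine similarity (embeddings are unit-norm)
                if sim > best_sim:
                    best, best_sim = k, sim
            if best_sim >= threshold:
                ids.append(best)                   # same character as a previous frame
                proto = gallery[best] + emb
                gallery[best] = proto / np.linalg.norm(proto)
            else:
                gallery.append(np.asarray(emb, dtype=float))  # new character
                ids.append(len(gallery) - 1)
        assignments.append(ids)
    return assignments
```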
The goal of this paper is to enhance face recognition performance by augmenting head poses during the testing phase. Existing methods often rely on training on frontalised images or learning pose-invariant representations, yet both approaches typically require re-training and testing for each dataset, involving a substantial amount of effort. In contrast, this study proposes Pose-TTA, a novel approach that aligns faces at inference time without additional training. To achieve this, we employ a portrait animator that transfers the source image identity into the pose of a driving image. Instead of frontalising a side-profile face -- which can introduce distortion -- Pose-TTA generates matching side-profile images for comparison, thereby reducing identity information loss. Furthermore, we propose a weighted feature aggregation strategy to address any distortions or biases arising from the synthetic data, thus enhancing the reliability of the augmented images. Extensive experiments on diverse datasets and with various pre-trained face recognition models demonstrate that Pose-TTA consistently improves inference performance. Moreover, our method is straightforward to integrate into existing face recognition pipelines, as it requires no retraining or fine-tuning of the underlying recognition models.
https://arxiv.org/abs/2505.09256
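The weighted feature aggregation can be illustrated as follows: the original embedding is fused with the embeddings of the pose-matched synthetic views, each synthetic view down-weighted by a reliability score so that distortions in the generated images do not dominate the comparison. The specific weighting rule below is an assumption; the paper's exact strategy may differ.

```python
import numpy as np

def aggregate_embeddings(real_emb, synth_embs, synth_confidences):
    """real_emb: (D,) embedding of the original image.
    synth_embs: (N, D) embeddings of pose-matched synthetic views.
    synth_confidences: (N,) reliability scores in [0, 1] for the synthetic views."""
    weights = np.concatenate([[1.0], np.asarray(synth_confidences, dtype=float)])
    embs = np.vstack([real_emb, synth_embs])
    fused = (weights[:, None] * embs).sum(axis=0) / weights.sum()
    return fused / np.linalg.norm(fused)   # re-normalize before cosine comparison
```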
Facial recognition systems have achieved remarkable success by leveraging deep neural networks, advanced loss functions, and large-scale datasets. However, their performance often deteriorates in real-world scenarios involving low-quality facial images. Such degradations, common in surveillance footage or standoff imaging, include low resolution, motion blur, and various distortions, resulting in a substantial domain gap from the high-quality data typically used during training. While existing approaches attempt to address robustness by modifying network architectures or modeling global spatial transformations, they frequently overlook local, non-rigid deformations that are inherently present in real-world settings. In this work, we introduce DArFace, a Deformation-Aware robust Face recognition framework that enhances robustness to such degradations without requiring paired high- and low-quality training samples. Our method adversarially integrates both global transformations (e.g., rotation, translation) and local elastic deformations during training to simulate realistic low-quality conditions. Moreover, we introduce a contrastive objective to enforce identity consistency across different deformed views. Extensive evaluations on low-quality benchmarks including TinyFace, IJB-B, and IJB-C demonstrate that DArFace surpasses state-of-the-art methods, with significant gains attributed to the inclusion of local deformation modeling.
https://arxiv.org/abs/2505.08423
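A compact sketch of a local elastic deformation, of the kind used as a training-time corruption, is given below: a smoothed random displacement field is applied with `grid_sample`. The smoothing kernel and displacement magnitude are illustrative assumptions, and the adversarial integration and contrastive identity-consistency loss from the paper are not shown.

```python
import torch
import torch.nn.functional as F

def elastic_deform(img, alpha=8.0, smooth=15):
    """img: (B, C, H, W). Warps the batch with a smooth random displacement field."""
    b, _, h, w = img.shape
    disp = torch.randn(b, 2, h, w, device=img.device)
    # Smooth the noise so the displacement stays locally coherent (box blur as a cheap stand-in).
    disp = F.avg_pool2d(disp, kernel_size=smooth, stride=1, padding=smooth // 2)
    scale = torch.tensor([w, h], dtype=img.dtype, device=img.device).view(1, 2, 1, 1)
    disp = disp * alpha / scale                    # displacements in normalized [-1, 1] coordinates

    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=img.device),
                            torch.linspace(-1, 1, w, device=img.device), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0) + disp.permute(0, 2, 3, 1)
    return F.grid_sample(img, grid, align_corners=True)
```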
Teachers' visual attention and its distribution across the students in classrooms carry important implications for student engagement, achievement, and professional teacher training. Despite that, inferring where and on which student teachers focus is not trivial. Mobile eye tracking can provide vital help to solve this issue; however, using mobile eye tracking alone requires a significant amount of manual annotation. To address this limitation, we present an automated processing pipeline concept that requires minimal manually annotated data to recognize which student the teacher focuses on. To this end, we utilize state-of-the-art face detection models and face recognition feature embeddings to train face recognition models with transfer learning in the classroom context and combine these models with the teachers' gaze from mobile eye trackers. We evaluated our approach with data collected from four different classrooms, and our results show that the visually focused students can be estimated with reasonable performance in all of our classroom setups, with U-shaped and small classrooms yielding the best results at accuracies of approximately 0.7 and 0.9, respectively. Although we focused on validating the technical approach rather than on teacher-student interactions, our methodology does not require a vast amount of manually annotated data and offers a non-intrusive way of handling teachers' visual attention; it could therefore help improve instructional strategies, enhance classroom management, and provide feedback for professional teacher development.
https://arxiv.org/abs/2505.07552
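The final assignment step, mapping a mobile-eye-tracker gaze point to a recognized student, can be sketched as below. The inside-box / nearest-center rule is an assumption; the actual pipeline also covers face detection, transfer-learned recognition, and gaze-to-scene mapping, which are treated as given inputs here.

```python
def focused_student(gaze_xy, face_boxes, face_ids):
    """gaze_xy: (x, y) gaze point in scene-camera pixels.
    face_boxes: list of (x1, y1, x2, y2) detected face boxes.
    face_ids: recognized student id for each box.
    Returns the id of the student whose face contains (or is closest to) the gaze point."""
    gx, gy = gaze_xy
    best_id, best_dist = None, float("inf")
    for (x1, y1, x2, y2), sid in zip(face_boxes, face_ids):
        if x1 <= gx <= x2 and y1 <= gy <= y2:
            return sid                               # gaze falls inside this face box
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        dist = (gx - cx) ** 2 + (gy - cy) ** 2
        if dist < best_dist:
            best_id, best_dist = sid, dist
    return best_id                                   # otherwise the nearest face center

# Example: focused_student((420, 310), [(400, 280, 460, 360), (700, 300, 760, 380)], ["s01", "s02"])
```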
Synthetic face datasets are increasingly used to overcome the limitations of real-world biometric data, including privacy concerns, demographic imbalance, and high collection costs. However, many existing methods lack fine-grained control over identity attributes and fail to produce paired, identity-consistent images under structured capture conditions. We introduce FLUXSynID, a framework for generating high-resolution synthetic face datasets with user-defined identity attribute distributions and paired document-style and trusted live capture images. The dataset generated using the FLUXSynID framework shows improved alignment with real-world identity distributions and greater inter-set diversity compared to prior work. The FLUXSynID framework for generating custom datasets, along with a dataset of 14,889 synthetic identities, is publicly released to support biometric research, including face recognition and morphing attack detection.
https://arxiv.org/abs/2505.07530
We address the problem of whole-body person recognition in unconstrained environments. This problem arises in surveillance scenarios such as those in the IARPA Biometric Recognition and Identification at Altitude and Range (BRIAR) program, where biometric data is captured at long standoff distances, elevated viewing angles, and under adverse atmospheric conditions (e.g., turbulence and high wind velocity). To this end, we propose FarSight, a unified end-to-end system for person recognition that integrates complementary biometric cues across face, gait, and body shape modalities. FarSight incorporates novel algorithms across four core modules: multi-subject detection and tracking, recognition-aware video restoration, modality-specific biometric feature encoding, and quality-guided multi-modal fusion. These components are designed to work cohesively under degraded image conditions, large pose and scale variations, and cross-domain gaps. Extensive experiments on the BRIAR dataset, one of the most comprehensive benchmarks for long-range, multi-modal biometric recognition, demonstrate the effectiveness of FarSight. Compared to our preliminary system, this system achieves a 34.1% absolute gain in 1:1 verification accuracy (TAR@0.1% FAR), a 17.8% increase in closed-set identification (Rank-20), and a 34.3% reduction in open-set identification errors (FNIR@1% FPIR). Furthermore, FarSight was evaluated in the 2025 NIST RTE Face in Video Evaluation (FIVE), which conducts standardized face recognition testing on the BRIAR dataset. These results establish FarSight as a state-of-the-art solution for operational biometric recognition in challenging real-world conditions.
https://arxiv.org/abs/2505.04616
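Quality-guided multi-modal fusion can be illustrated with a simple quality-weighted score average: when the long-range face crop is of low quality, the fused decision leans on gait and body shape. The weighting rule and the example numbers below are assumptions, not FarSight's actual fusion module.

```python
def fuse_scores(scores, qualities):
    """scores: dict modality -> similarity score, e.g. {'face': 0.72, 'gait': 0.55, 'body': 0.61}.
    qualities: dict modality -> quality in [0, 1] estimated for the probe."""
    num = sum(qualities[m] * scores[m] for m in scores)
    den = sum(qualities[m] for m in scores) or 1.0
    return num / den

# A blurred long-range probe down-weights the face score and leans on gait/body shape:
print(fuse_scores({"face": 0.72, "gait": 0.55, "body": 0.61},
                  {"face": 0.2, "gait": 0.9, "body": 0.8}))
```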
Cost-effective machine vision systems dedicated to real-time and accurate face detection and recognition in public places are crucial for many modern applications. However, despite the high performance that can be reached using specialized edge or cloud AI hardware accelerators, there is still room for improvement in throughput and power consumption. This paper proposes a combined hardware-software approach that optimizes face detection and recognition systems on one of the latest edge GPUs, the NVIDIA Jetson AGX Orin. First, it leverages the simultaneous usage of all of the Orin's hardware engines to improve processing time. This offers an improvement over previous works where these tasks were mainly allocated automatically and exclusively to the CPU or, to a greater extent, to the GPU core. Additionally, the paper suggests integrating a face tracker module so that the face recognition algorithm is not run redundantly on every frame but only when a new face appears in the scene. Extensive experiments show that simultaneously using all the hardware engines available in the Orin GPU and integrating the tracker into the pipeline yield an impressive throughput of 290 FPS (frames per second) on 1920 x 1080 frames containing an average of 6 faces per frame. Additionally, a substantial power saving of around 800 mW was achieved compared to running the task on the CPU/GPU engines only and without integrating a tracker into the Orin GPU's pipeline. This hardware-software co-design approach can pave the way for high-performance machine vision systems at the edge, critically needed for video monitoring in public places where several nearby cameras are usually deployed to cover the same scene.
https://arxiv.org/abs/2505.04524
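The tracker-gated recognition idea can be sketched as a loop in which detection and tracking run on every frame but the (expensive) recognizer runs only when a new track appears. `detect`, `recognize`, and `tracker` are placeholder callables; the actual system additionally maps these stages onto the Orin's different hardware engines.

```python
def process_stream(frames, detect, recognize, tracker):
    """Yields, per frame, the list of (track_id, identity) pairs currently in the scene."""
    known = {}                                   # track id -> recognized identity
    for frame in frames:
        boxes = detect(frame)                    # face detection runs on every frame
        tracks = tracker.update(boxes)           # e.g. IoU/Kalman tracking-by-detection
        for track_id, box in tracks:
            if track_id not in known:            # a new face entered the scene
                known[track_id] = recognize(frame, box)   # run the recognizer only once per track
        yield [(tid, known[tid]) for tid, _ in tracks]
```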
Video face detection and recognition in public places at the edge is required in several applications, such as security reinforcement and contactless access to authorized venues. This paper aims to maximize the simultaneous usage of the hardware engines available in today's edge GPUs by leveraging the concurrency and pipelining of the tasks required for face detection and recognition. This also includes the video decoding task, which is required in most face monitoring applications as the video streams are usually carried over Gbps Ethernet networks. This constitutes an improvement over previous works where the tasks are usually allocated to a single engine due to the lack of a unified and automated framework that simultaneously explores all hardware engines. In addition, previous works usually took input faces from still images or raw video streams, overlooking the burst delay caused by the decoding stage. Results on real-life video streams show that, by simultaneously using all the hardware engines available in the recent NVIDIA edge Orin GPU, higher throughput and a slight power saving of around 300 mW (about 5%) are achieved while satisfying the real-time performance constraint. Performance improves further when several video streams are processed simultaneously. Further improvement could have been obtained if the TensorRT framework had created fewer shuffle layers for the face recognition task. Thus, the paper suggests hardware improvements to existing edge GPU processors to raise their performance even further.
https://arxiv.org/abs/2505.04502
With the continuous impact of epidemics, people have become accustomed to wearing masks. However, most current occluded face recognition (OFR) algorithms lack prior knowledge of occlusions, resulting in poor performance when dealing with occluded faces of varying types and severity in reality. Recognizing occluded faces is still a significant challenge, which greatly affects the convenience of people's daily lives. In this paper, we propose an identity-gated mixture of diffusion experts (MoDE) for OFR. Each diffusion-based generative expert estimates one possible complete image for an occluded face. Since the random sampling process of the diffusion model inevitably introduces differences and variations between the inpainted faces and the real ones, we introduce an identity-gating network to ensemble effective information from the multiple reconstructed faces: it evaluates the contribution of each reconstructed face to the identity and adaptively integrates the predictions in the decision space. Moreover, our MoDE is a plug-and-play module for most existing face recognition models. Extensive experiments on three public face datasets and two datasets in the wild validate our advanced performance for various occlusions in comparison with the competing methods.
https://arxiv.org/abs/2505.04306
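The identity-gated ensembling can be sketched as a small gating network that scores each reconstructed face and fuses the experts' identity predictions in the decision space by a softmax-weighted sum. The gate's inputs and architecture below are assumptions based on the abstract.

```python
import torch
import torch.nn as nn

class IdentityGate(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)      # per-reconstruction contribution score

    def forward(self, expert_feats, expert_logits):
        """expert_feats: (B, K, D) features of the K reconstructed faces.
        expert_logits: (B, K, C) identity predictions from the frozen FR model."""
        w = torch.softmax(self.score(expert_feats).squeeze(-1), dim=1)   # (B, K) gating weights
        return (w.unsqueeze(-1) * expert_logits).sum(dim=1)              # fused prediction (B, C)
```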
Face anti-spoofing is a critical technology for ensuring the security of face recognition systems. However, its ability to generalize across diverse scenarios remains a significant challenge. In this paper, we attribute the limited generalization ability to two key factors: covariate shift, which arises from external data collection variations, and semantic shift, which results from substantial differences in emerging attack types. To address both challenges, we propose a novel approach for learning unknown spoof prompts, relying solely on real face images from a single source domain. Our method generates textual prompts for real faces and potential unknown spoof attacks by leveraging the general knowledge embedded in vision-language models, thereby enhancing the model's ability to generalize to unseen target domains. Specifically, we introduce a diverse spoof prompt optimization framework to learn effective prompts. This framework constrains unknown spoof prompts within a relaxed prior knowledge space while maximizing their distance from real face images. Moreover, it enforces semantic independence among different spoof prompts to capture a broad range of spoof patterns. Experimental results on nine datasets demonstrate that the learned prompts effectively transfer the knowledge of vision-language models, enabling state-of-the-art generalization ability against diverse unknown attack types across unseen target domains without using any spoof face images.
https://arxiv.org/abs/2505.03611
3D mask presentation attack detection is crucial for protecting face recognition systems against the rising threat of 3D mask attacks. While most existing methods utilize multimodal features or remote photoplethysmography (rPPG) signals to distinguish between real faces and 3D masks, they face significant challenges, such as the high costs associated with multimodal sensors and limited generalization ability. Detection-related text descriptions offer concise, universal information and are cost-effective to obtain. However, the potential of vision-language multimodal features for 3D mask presentation attack detection remains unexplored. In this paper, we propose a novel knowledge-based prompt learning framework to explore the strong generalization capability of vision-language models for 3D mask presentation attack detection. Specifically, our approach incorporates entities and triples from knowledge graphs into the prompt learning process, generating fine-grained, task-specific explicit prompts that effectively harness the knowledge embedded in pre-trained vision-language models. Furthermore, considering different input images may emphasize distinct knowledge graph elements, we introduce a visual-specific knowledge filter based on an attention mechanism to refine relevant elements according to the visual context. Additionally, we leverage causal graph theory insights into the prompt learning process to further enhance the generalization ability of our method. During training, a spurious correlation elimination paradigm is employed, which removes category-irrelevant local image patches using guidance from knowledge-based text features, fostering the learning of generalized causal prompts that align with category-relevant local patches. Experimental results demonstrate that the proposed method achieves state-of-the-art intra- and cross-scenario detection performance on benchmark datasets.
https://arxiv.org/abs/2505.03610
Adversarial examples have revealed the vulnerability of deep learning models and raised serious concerns about information security. Transfer-based attacks are a hot topic among black-box attacks, which are practical in real-world scenarios where the training datasets, parameters, and structure of the target model are unknown to the attacker. However, few methods consider the particularity of class-specific deep models for fine-grained vision tasks, such as face recognition (FR), resulting in unsatisfactory attack performance. In this work, we first investigate what in a face exactly contributes to the embedding learning of FR models and find that both decisive and auxiliary facial features are specific to each FR model, which is quite different from the biological mechanism of the human visual system. Accordingly, we propose a novel attack method named Attention-aggregated Attack (AAA) to enhance the transferability of adversarial examples against FR. It is inspired by attention divergence and aims to destroy the facial features that are critical to the decision-making of other FR models by imitating their attention on the clean face images. Extensive experiments conducted on various FR models validate the superiority and robust effectiveness of the proposed method over existing methods.
https://arxiv.org/abs/2505.03383
Aiming to reduce the computational cost of Softmax in the massive label space of Face Recognition (FR) benchmarks, recent studies estimate the output using a subset of identities. Although promising, the association between the computational cost and the number of identities in the dataset remains linear, merely with a reduced ratio. A shared characteristic among available FR methods is the employment of atomic scalar labels during training. Consequently, input-to-label matching is performed through a dot product between the feature vector of the input and the Softmax centroids. Inspired by generative modeling, we present a simple yet effective method that substitutes scalar labels with structured identity codes, i.e., sequences of integers. Specifically, we propose a tokenization scheme that transforms atomic scalar labels into structured identity codes. Then, we train an FR backbone to predict the code for each input instead of its scalar label. As a result, the associated computational cost becomes logarithmic w.r.t. the number of identities. We demonstrate the benefits of the proposed method by conducting experiments. In particular, our method outperforms its competitors by 1.52% and 0.6% at TAR@FAR$=10^{-4}$ on IJB-B and IJB-C, respectively, while transforming the association between computational cost and the number of identities from linear to logarithmic. See code at this https URL
https://arxiv.org/abs/2505.03012
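The tokenization idea is easy to illustrate: a scalar identity label is rewritten as a fixed-length sequence of base-k digits, so the classifier predicts `length` digits of size `base` each instead of one softmax over all identities, and the output cost grows logarithmically with the number of identities. The base and code length below are arbitrary choices for illustration, not the paper's actual scheme.

```python
def tokenize_label(label, base=32, length=4):
    """Map an integer identity label in [0, base**length) to `length` digits in [0, base)."""
    digits = []
    for _ in range(length):
        digits.append(label % base)
        label //= base
    return digits[::-1]

def detokenize(digits, base=32):
    label = 0
    for d in digits:
        label = label * base + d
    return label

# 32**4 = 1,048,576 identities are covered by 4 softmax heads of size 32 each.
assert detokenize(tokenize_label(123456)) == 123456
```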
Biometric authentication has become one of the most widely used tools in the current technological era to authenticate users and to distinguish between genuine users and imposters. The face is the most common biometric modality and has proven effective. Deep learning-based face recognition systems are now commonly used across different domains. However, these systems usually operate like black-box models that do not provide necessary explanations or justifications for their decisions. This is a major disadvantage because users cannot trust such artificial intelligence-based biometric systems and may not feel comfortable using them when clear explanations or justifications are not provided. This paper addresses this problem by applying an efficient method for explainable face recognition systems. We use a Class Activation Mapping (CAM)-based discriminative localization (very narrow/specific localization) technique called Scaled Directed Divergence (SDD) to visually explain the results of deep learning-based face recognition systems. We perform fine localization of the face features relevant to the deep learning model's prediction/decision. Our experiments show that the SDD Class Activation Map (CAM) highlights the relevant face features more specifically and more accurately than the traditional CAM. The provided visual explanations, with narrow localization of the relevant features, can ensure much-needed transparency and trust for deep learning-based face recognition systems.
https://arxiv.org/abs/2505.03837
Face detection and face recognition have been in the focus of the vision community since its very beginnings. Inspired by the success of the original Videoface digitizer, a pioneering device that allowed users to capture video signals from any source, we have designed an advanced video analytics tool to efficiently create structured video stories, i.e. identity-based information catalogs. VideoFace2.0 is the developed system for spatial and temporal localization of each unique face in the input video, i.e. face re-identification (ReID); it also allows cataloging and characterization of the faces and the creation of structured video outputs for later downstream tasks. The developed near-real-time solution is primarily designed for application scenarios involving TV production and media analysis, and as an efficient tool for creating the large video datasets needed to train machine learning (ML) models on challenging vision tasks such as lip reading and multimodal speech recognition. The conducted experiments confirm the applicability of the proposed face ReID algorithm, which combines face detection, face recognition, and passive tracking-by-detection to achieve robust and efficient face ReID. The system is envisioned as a compact and modular extension of existing video production equipment. We hope that the presented work and shared code will stimulate further interest in the development of similar, application-specific video analysis tools, and lower the entry barrier for the production of high-quality multi-modal ML datasets in the future.
https://arxiv.org/abs/2505.02060