Recent advances in generative modeling have enabled the generation of high-quality synthetic data applicable in a variety of domains, including face recognition. Here, state-of-the-art generative models typically rely on conditioning and fine-tuning of powerful pretrained diffusion models to facilitate the synthesis of realistic images of a desired identity. Yet, these models often do not consider the identity of subjects during training, leading to poor consistency between generated and intended identities. In contrast, methods that employ identity-based training objectives tend to overfit on various aspects of the identity and, in turn, lower the diversity of images that can be generated. To address these issues, we present ID-Booth, a novel generative diffusion-based framework. ID-Booth consists of a denoising network responsible for data generation, a variational auto-encoder for mapping images to and from a lower-dimensional latent space, and a text encoder that allows for prompt-based control over the generation procedure. The framework utilizes a novel triplet identity training objective and enables identity-consistent image generation while retaining the synthesis capabilities of pretrained diffusion models. Experiments with a state-of-the-art latent diffusion model and diverse prompts reveal that our method facilitates better intra-identity consistency and inter-identity separability than competing methods, while achieving higher image diversity. In turn, the produced data allows for effective augmentation of small-scale datasets and training of better-performing recognition models in a privacy-preserving manner. The source code for the ID-Booth framework is publicly available at this https URL.
https://arxiv.org/abs/2504.07392
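The abstract does not spell out the triplet identity objective; below is a minimal, hedged sketch of what such a loss could look like, assuming a frozen off-the-shelf face recognition encoder that returns embeddings. The names `face_encoder` and `margin` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def triplet_identity_loss(anchor_img, positive_img, negative_img,
                          face_encoder, margin=0.3):
    """Hedged sketch of a triplet identity objective: pull generated
    (anchor) images toward the target identity embedding and push
    them away from other identities. `face_encoder` is any frozen
    face recognition network returning embeddings."""
    with torch.no_grad():
        e_pos = F.normalize(face_encoder(positive_img), dim=-1)
        e_neg = F.normalize(face_encoder(negative_img), dim=-1)
    e_anc = F.normalize(face_encoder(anchor_img), dim=-1)
    # Cosine distances between the generated image and the references.
    d_pos = 1.0 - (e_anc * e_pos).sum(dim=-1)
    d_neg = 1.0 - (e_anc * e_neg).sum(dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```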
Face recognition systems are vulnerable to physical attacks (e.g., printed photos) and digital threats (e.g., DeepFake), which are currently being studied as independent visual tasks, such as Face Anti-Spoofing and Forgery Detection. The inherent differences among various attack types present significant challenges in identifying a common feature space, making it difficult to develop a unified framework for detecting data from both attack modalities simultaneously. Inspired by the efficacy of Mixture-of-Experts (MoE) in learning across diverse domains, we explore utilizing multiple experts to learn the distinct features of various attack types. However, the feature distributions of physical and digital attacks partially overlap while still differing. This suggests that relying solely on distinct experts to learn the unique features of each attack type may overlook shared knowledge between them. To address these issues, we propose SUEDE, the Shared Unified Experts for Physical-Digital Face Attack Detection Enhancement. SUEDE combines a shared expert (always activated) to capture common features for both attack types and multiple routed experts (selectively activated) for specific attack types. Further, we integrate CLIP as the base network to ensure the shared expert benefits from prior visual knowledge and align visual-text representations in a unified space. Extensive results demonstrate that SUEDE achieves superior performance compared to state-of-the-art unified detection methods.
https://arxiv.org/abs/2504.04818
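As a rough illustration of the shared-plus-routed expert design described above, the sketch below combines one always-active shared expert with top-k routed experts. This is an assumed, simplified layout (dense expert evaluation, plain softmax router), not SUEDE's exact architecture.

```python
import torch
import torch.nn as nn

class SharedUnifiedExperts(nn.Module):
    """Sketch of the shared-plus-routed expert idea: one expert is
    always active to capture features common to physical and digital
    attacks, while a router selects among specialized experts."""
    def __init__(self, dim, n_routed=4, top_k=1):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                    nn.Linear(dim, dim))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                          nn.Linear(dim, dim)) for _ in range(n_routed))
        self.router = nn.Linear(dim, n_routed)
        self.top_k = top_k

    def forward(self, x):                       # x: (batch, dim)
        out = self.shared(x)                    # always-on shared expert
        gates = self.router(x).softmax(dim=-1)  # (batch, n_routed)
        topv, topi = gates.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(gates).scatter(-1, topi, topv)
        # Dense evaluation for clarity; real MoE layers dispatch sparsely.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)
        return out + (mask.unsqueeze(-1) * expert_out).sum(dim=1)
```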
Existing human recognition systems often rely on separate, specialized models for face and body analysis, limiting their effectiveness in real-world scenarios where pose, visibility, and context vary widely. This paper introduces SapiensID, a unified model that bridges this gap, achieving robust performance across diverse settings. SapiensID introduces (i) Retina Patch (RP), a dynamic patch generation scheme that adapts to subject scale and ensures consistent tokenization of regions of interest, (ii) a masked recognition model (MRM) that learns from variable token lengths, and (iii) Semantic Attention Head (SAH), a module that learns pose-invariant representations by pooling features around key body parts. To facilitate training, we introduce WebBody4M, a large-scale dataset capturing diverse poses and scale variations. Extensive experiments demonstrate that SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforming specialized models in both short-term and long-term scenarios while remaining competitive with dedicated face recognition systems. Furthermore, SapiensID establishes a strong baseline for the newly introduced challenge of Cross Pose-Scale ReID, demonstrating its ability to generalize to complex, real-world conditions.
https://arxiv.org/abs/2504.04708
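One way to read the Semantic Attention Head is as cross-attention pooling with learned part queries; the sketch below is an assumption of that form, not the paper's exact module, and all names are illustrative.

```python
import torch
import torch.nn as nn

class SemanticAttentionHead(nn.Module):
    """Illustrative sketch of attention pooling around body parts:
    learned part queries cross-attend over image tokens, yielding one
    descriptor per part regardless of pose. The real SAH likely
    differs in detail."""
    def __init__(self, dim=256, n_parts=5, n_heads=4):
        super().__init__()
        self.part_queries = nn.Parameter(torch.randn(n_parts, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):                   # tokens: (B, N, dim)
        q = self.part_queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        parts, _ = self.attn(q, tokens, tokens)  # (B, n_parts, dim)
        return parts.flatten(1)                  # concatenated part features
```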
With the rapid development of mobile devices and the fast increase of sensitive data, secure and convenient mobile authentication technologies are desired. Beyond traditional passwords, many mobile devices have biometric-based authentication methods (e.g., fingerprint, voiceprint, and face recognition), but they are vulnerable to spoofing attacks. To solve this problem, we study new biometric features based on dental occlusion and find that the bone-conducted sound of dental occlusion collected in binaural canals contains unique features of individual bones and teeth. Motivated by this, we propose a novel authentication system, TeethPass+, which uses earbuds to collect occlusal sounds in binaural canals to achieve authentication. First, we design an event detection method based on spectrum variance to detect bone-conducted sounds. Then, we analyze the time-frequency domain of the sounds to filter out motion noises and extract unique features of users from four aspects: teeth structure, bone structure, occlusal location, and occlusal sound. Finally, we train a Triplet network to construct the user template, which is used to complete authentication. Through extensive experiments involving 53 volunteers, we verify the performance of TeethPass+ in different environments. TeethPass+ achieves an accuracy of 98.6% and resists 99.7% of spoofing attacks.
https://arxiv.org/abs/2504.00435
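The event-detection step "based on spectrum variance" could plausibly look like the following sketch, which flags STFT frames whose spectral variance rises well above a running baseline. The window length and threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.signal import stft

def detect_occlusion_events(audio, fs, win=0.032, threshold=3.0):
    """Sketch of spectrum-variance event detection (assumed form):
    bone-conducted occlusal sounds concentrate spectral energy
    differently from background, so frames whose spectral variance
    jumps well above the median baseline are flagged as candidate
    events. `threshold` is an illustrative multiple of the median."""
    nperseg = int(win * fs)
    _, _, Z = stft(audio, fs=fs, nperseg=nperseg)
    spec_var = np.var(np.abs(Z), axis=0)        # variance per frame
    baseline = np.median(spec_var)
    return np.where(spec_var > threshold * baseline)[0]  # frame indices
```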
Identity-preserving face synthesis aims to generate synthetic face images of virtual subjects that can substitute real-world data for training face recognition models. While prior art strives to create images with consistent identities and diverse styles, it faces a trade-off between the two. Identifying the limitation of treating style variation as subject-agnostic and observing that real-world persons actually have distinct, subject-specific styles, this paper introduces MorphFace, a diffusion-based face generator. The generator learns fine-grained facial styles, e.g., shape, pose and expression, from the renderings of a 3D morphable model (3DMM). It also learns identities from an off-the-shelf recognition model. To create virtual faces, the generator is conditioned on novel identities of unlabeled synthetic faces, and novel styles that are statistically sampled from a real-world prior distribution. The sampling especially accounts for both intra-subject variation and subject distinctiveness. A context blending strategy is employed to enhance the generator's responsiveness to identity and style conditions. Extensive experiments show that MorphFace outperforms the best prior methods in face recognition efficacy.
https://arxiv.org/abs/2504.00430
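A minimal sketch of style sampling that "accounts for both intra-subject variation and subject distinctiveness" might draw a per-subject style center from a population prior and then per-image styles around that center. The Gaussian forms and all parameter names here are assumptions; the paper estimates its prior from real data.

```python
import numpy as np

def sample_subject_styles(style_mean, style_cov, within_cov,
                          n_images=8, rng=None):
    """Sketch of subject-specific style sampling (assumed form):
    draw one style center per subject from the real-world prior
    (subject distinctiveness), then draw per-image styles around
    that center (intra-subject variation)."""
    rng = np.random.default_rng(rng)
    center = rng.multivariate_normal(style_mean, style_cov)  # one subject
    return rng.multivariate_normal(center, within_cov, size=n_images)
```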
In this work, we present DEcoupLEd Distillation To Erase (DELETE), a general and strong unlearning method for any class-centric tasks. To derive this, we first propose a theoretical framework to analyze the general form of unlearning loss and decompose it into forgetting and retention terms. Through the theoretical framework, we point out that a class of previous methods could be mainly formulated as a loss that implicitly optimizes the forgetting term while lacking supervision for the retention term, disturbing the distribution of the pre-trained model and struggling to adequately preserve knowledge of the remaining classes. To address this, we refine the retention term using "dark knowledge" and propose a mask distillation unlearning method. By applying a mask to separate forgetting logits from retention logits, our approach optimizes both the forgetting and refined retention components simultaneously, retaining knowledge of the remaining classes while ensuring thorough forgetting of the target class. Without access to the remaining data or intervention (as used in some works), we achieve state-of-the-art performance across various benchmarks. Moreover, DELETE is a general solution that can be applied with strong performance to various downstream tasks, including face recognition, backdoor defense, and semantic segmentation.
https://arxiv.org/abs/2503.23751
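The mask-distillation idea, separating forgetting logits from retention logits, admits a compact sketch: distill the teacher's dark knowledge over retained classes while pushing the forget-class logit down. The exact loss weighting and the form of the forgetting term below are assumptions.

```python
import torch
import torch.nn.functional as F

def mask_distillation_loss(student_logits, teacher_logits, forget_class, T=2.0):
    """Sketch of mask distillation unlearning: mask out the forget
    class and distill the teacher's 'dark knowledge' over the
    remaining classes only, so retention is supervised while the
    forget logit receives no support."""
    mask = torch.ones_like(teacher_logits, dtype=torch.bool)
    mask[:, forget_class] = False
    # Renormalize teacher/student distributions over retained classes.
    t = F.softmax(teacher_logits[mask].view(len(teacher_logits), -1) / T, dim=-1)
    s = F.log_softmax(student_logits[mask].view(len(student_logits), -1) / T, dim=-1)
    retention = F.kl_div(s, t, reduction="batchmean") * T * T
    # Push down the forget-class logit (an assumed forgetting term).
    forgetting = student_logits[:, forget_class].mean()
    return retention + forgetting
```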
Face recognition is a crucial topic in data science and biometric security, with applications spanning military, finance, and retail industries. This paper explores the implementation of sparse Principal Component Analysis (PCA) using the Proximal Gradient method (also known as ISTA) and the Runge-Kutta numerical methods. To address the face recognition problem, we integrate sparse PCA with either the k-nearest neighbor method or the kernel ridge regression method. Experimental results demonstrate that combining sparse PCA (solved via the Proximal Gradient method or the Runge-Kutta numerical approach) with a classification system yields higher accuracy compared to standard PCA. Additionally, we observe that the Runge-Kutta-based sparse PCA computation consistently outperforms the Proximal Gradient method in terms of speed.
https://arxiv.org/abs/2504.01035
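A minimal proximal-gradient (ISTA) sketch for one sparse loading vector is shown below: a gradient step on the variance objective, a soft-threshold (the proximal map of the L1 penalty), and renormalization. The paper's exact formulation, and its Runge-Kutta counterpart, may differ.

```python
import numpy as np

def sparse_pca_ista(X, lam=0.1, step=None, n_iter=500):
    """Sketch of one sparse loading vector via proximal gradient
    (ISTA): roughly maximize v'Cv - lam*||v||_1 on the unit sphere,
    where C is the sample covariance of the data X."""
    C = np.cov(X, rowvar=False)
    if step is None:
        step = 1.0 / np.linalg.norm(C, 2)       # step ~ 1/L, L = ||C||_2
    v = np.linalg.eigh(C)[1][:, -1]             # warm start: top eigenvector
    for _ in range(n_iter):
        g = v + step * (C @ v)                  # gradient step on v'Cv
        v = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft-threshold
        n = np.linalg.norm(v)
        v = v / n if n > 0 else v               # project back to unit sphere
    return v
```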
Automated one-to-many (1:N) face recognition is a powerful investigative tool commonly used by law enforcement agencies. In this context, potential matches resulting from automated 1:N recognition are reviewed by human examiners prior to possible use as investigative leads. While automated 1:N recognition can achieve near-perfect accuracy under ideal imaging conditions, operational scenarios may necessitate the use of surveillance imagery, which is often degraded in various quality dimensions. One important quality dimension is image resolution, typically quantified by the number of pixels on the face. The common metric for this is inter-pupillary distance (IPD), which measures the number of pixels between the pupils. Low IPD is known to degrade the accuracy of automated face recognition. However, the threshold IPD for reliability in human face recognition remains undefined. This study aims to explore the boundaries of human recognition accuracy by systematically testing accuracy across a range of IPD values. We find that at low IPDs (10px, 5px), human accuracy is at or below chance levels (50.7%, 35.9%), even as confidence in decision-making remains relatively high (77%, 70.7%). Our findings indicate that, for low IPD images, human recognition ability could be a limiting factor to overall system accuracy.
https://arxiv.org/abs/2503.20108
Face anti-spoofing (FAS) plays a pivotal role in ensuring the security and reliability of face recognition systems. With advancements in vision-language pretrained (VLP) models, recent two-class FAS techniques have leveraged the advantages of using VLP guidance, whereas this potential remains unexplored in one-class FAS methods. One-class FAS focuses on learning intrinsic liveness features solely from live training images to differentiate between live and spoof faces. However, the lack of spoof training data can lead one-class FAS models to inadvertently incorporate domain information irrelevant to the live/spoof distinction (e.g., facial content), causing performance degradation when tested with a new application domain. To address this issue, we propose a novel framework called Spoof-aware one-class face anti-spoofing with Language Image Pretraining (SLIP). Given that live faces should ideally not be obscured by any spoof-attack-related objects (e.g., paper or masks) and are assumed to yield zero spoof cue maps, we first propose an effective language-guided spoof cue map estimation to enhance one-class FAS models by simulating whether the underlying faces are covered by attack-related objects and generating corresponding nonzero spoof cue maps. Next, we introduce a novel prompt-driven liveness feature disentanglement to alleviate live/spoof-irrelevant domain variations by disentangling live/spoof-relevant and domain-dependent information. Finally, we design an effective augmentation strategy by fusing latent features from live images and spoof prompts to generate spoof-like image features and thus diversify latent spoof features to facilitate the learning of one-class FAS. Our extensive experiments and ablation studies support that SLIP consistently outperforms previous one-class FAS methods.
https://arxiv.org/abs/2503.19982
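The augmentation strategy that fuses live-image features with spoof-prompt features could reduce, in its simplest assumed form, to a normalized blend in the shared vision-language embedding space; `alpha` below is illustrative, not the paper's scheme.

```python
import torch.nn.functional as F

def spoof_like_features(live_feats, spoof_prompt_feats, alpha=0.5):
    """Sketch of the latent-fusion augmentation (assumed form): blend
    L2-normalized live image features with embeddings of spoof
    prompts (e.g., "a face covered by paper") to synthesize
    spoof-like latent features without any spoof images."""
    live = F.normalize(live_feats, dim=-1)
    spoof = F.normalize(spoof_prompt_feats, dim=-1)
    return F.normalize(alpha * live + (1 - alpha) * spoof, dim=-1)
```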
Recent synthetic 3D human datasets for the face, body, and hands have pushed the limits on photorealism. Face recognition and body pose estimation have achieved state-of-the-art performance using synthetic training data alone, but for the hand, there is still a large synthetic-to-real gap. This paper presents the first systematic study of the synthetic-to-real gap of 3D hand pose estimation. We analyze the gap and identify key components such as the forearm, image frequency statistics, hand pose, and object occlusions. To facilitate our analysis, we propose a data synthesis pipeline to synthesize high-quality data. We demonstrate that synthetic hand data can achieve the same level of accuracy as real data when integrating our identified components, paving the path to use synthetic data alone for hand pose estimation. Code and data are available at: this https URL.
https://arxiv.org/abs/2503.19307
This paper showcases an experimental study on anomaly detection using computer vision. The study focuses on class distinction and performance evaluation, combining OpenCV with deep learning techniques while employing a TensorFlow-based convolutional neural network for real-time face recognition and classification. The system effectively distinguishes among three classes: authorized personnel (admin), intruders, and non-human entities. A MobileNetV2-based deep learning model is utilized to optimize real-time performance, ensuring high computational efficiency without compromising accuracy. Extensive dataset preprocessing, including image augmentation and normalization, enhances the model's generalization capabilities. Our analysis demonstrates classification accuracies of 90.20% for admin, 98.60% for intruders, and 75.80% for non-human detection, while maintaining an average processing rate of 30 frames per second. The study leverages transfer learning, batch normalization, and Adam optimization to achieve stable and robust learning, and a comparative analysis of class differentiation strategies highlights the impact of feature extraction techniques and training methodologies. The results indicate that advanced feature selection and data augmentation significantly enhance detection performance, particularly in distinguishing human from non-human scenes. As an experimental study, this research provides critical insights into optimizing deep learning-based surveillance systems for high-security environments and improving the accuracy and efficiency of real-time anomaly detection.
https://arxiv.org/abs/2503.19100
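The described setup, a MobileNetV2 backbone with transfer learning, batch normalization, and Adam, maps to a short Keras sketch such as the one below. The input size, learning rate, and head layout are assumptions rather than the paper's exact configuration.

```python
import tensorflow as tf

# Sketch of the described setup (assumed hyperparameters): a frozen
# MobileNetV2 backbone with a small classification head for the three
# classes (admin, intruder, non-human), trained with Adam.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False                      # transfer learning: freeze backbone

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```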
In recent years, sparse sampling techniques based on regression analysis have witnessed extensive applications in face recognition research. Presently, numerous sparse sampling models based on regression analysis have been explored by various researchers. Nevertheless, the recognition rates of the majority of these models decrease significantly when confronted with highly occluded and highly damaged face images. In this paper, a new wing-constrained sparse coding model (WCSC) and its weighted version (WWCSC) are introduced to deal with the face recognition problem in complex circumstances, where the alternating direction method of multipliers (ADMM) algorithm is employed to solve the corresponding minimization problems. In addition, performances of the proposed method are examined on four well-known facial databases, namely the ORL, Yale, AR, and FERET facial databases. Compared to other methods in the literature, the WWCSC maintains a very high recognition rate even in complex situations where face images are highly occluded or damaged, which illustrates the robustness of the WWCSC method in facial recognition.
https://arxiv.org/abs/2503.18652
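The paper's wing-constrained fidelity term is not reproduced here, but the ADMM machinery it relies on can be illustrated on the standard L1 sparse coding problem: alternate a ridge solve, a soft-threshold, and a dual update.

```python
import numpy as np

def admm_sparse_coding(D, y, lam=0.1, rho=1.0, n_iter=200):
    """Generic ADMM for L1 sparse coding, min ||y - Dx||^2 + lam*||x||_1,
    shown as a stand-in for the paper's wing-constrained variant (whose
    exact fidelity term is not reproduced here). Splits x = z and
    alternates a ridge solve, soft-thresholding, and a dual update."""
    n = D.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    DtD, Dty = D.T @ D, D.T @ y
    A = np.linalg.inv(DtD + rho * np.eye(n))    # cache the ridge solve
    for _ in range(n_iter):
        x = A @ (Dty + rho * (z - u))           # x-update (least squares)
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0)  # prox
        u = u + x - z                           # dual ascent
    return z
```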
RISC-V-based architectures are paving the way for efficient On-Device Learning (ODL) in smart edge devices. When applied across multiple nodes, ODL enables the creation of intelligent sensor networks that preserve data privacy. However, developing ODL-capable, battery-operated embedded platforms presents significant challenges due to constrained computational resources and limited device lifetime, besides intrinsic learning issues such as catastrophic forgetting. We address these challenges by proposing a regularization-based On-Device Federated Continual Learning algorithm tailored for multiple nano-drones performing face recognition tasks. We demonstrate our approach on a RISC-V-based 10-core ultra-low-power SoC, optimizing the ODL computational requirements. We improve the classification accuracy by 24% over naive fine-tuning, requiring 178 ms per local epoch and 10.5 s per global epoch, demonstrating the effectiveness of the architecture for this task.
https://arxiv.org/abs/2503.17436
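A regularization-based federated continual update of the kind described could, under assumptions, add a proximal penalty that keeps each drone's local weights near the last aggregated global model; the quadratic form and `lam` below are illustrative, and the paper's exact regularizer may differ.

```python
def regularized_local_loss(task_loss, model, global_params, lam=0.01):
    """Sketch of a regularization-based federated continual update
    (assumed form): each nano-drone adds a proximal penalty that keeps
    local weights close to the last aggregated global model, limiting
    catastrophic forgetting across rounds."""
    reg = sum(((p - g.detach()) ** 2).sum()
              for p, g in zip(model.parameters(), global_params))
    return task_loss + lam * reg
```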
Deep Convolutional Neural Networks (CNNs) have significantly advanced deep learning, driving breakthroughs in computer vision, natural language processing, medical diagnosis, object detection, and speech recognition. Architectural innovations including 1D, 2D, and 3D convolutional models, dilated and grouped convolutions, depthwise separable convolutions, and attention mechanisms address domain-specific challenges and enhance feature representation and computational efficiency. Structural refinements such as spatial-channel exploitation, multi-path design, and feature-map enhancement contribute to robust hierarchical feature extraction and improved generalization, particularly through transfer learning. Efficient preprocessing strategies, including Fourier transforms, structured transforms, low-precision computation, and weight compression, optimize inference speed and facilitate deployment in resource-constrained environments. This survey presents a unified taxonomy that classifies CNN architectures based on spatial exploitation, multi-path structures, depth, width, dimensionality expansion, channel boosting, and attention mechanisms. It systematically reviews CNN applications in face recognition, pose estimation, action recognition, text classification, statistical language modeling, disease diagnosis, radiological analysis, cryptocurrency sentiment prediction, 1D data processing, video analysis, and speech recognition. In addition to consolidating architectural advancements, the review highlights emerging learning paradigms such as few-shot, zero-shot, weakly supervised, and federated learning frameworks, and outlines future research directions, including hybrid CNN-transformer models, vision-language integration, and generative learning. This review provides a comprehensive perspective on the evolution of CNNs from 2015 to 2025, outlining key innovations, challenges, and opportunities.
https://arxiv.org/abs/2503.16546
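Of the surveyed efficiency techniques, depthwise separable convolution is easy to show concretely: a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution, in place of one full convolution, cutting parameters and FLOPs.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """One of the surveyed efficiency techniques: a per-channel
    (depthwise) convolution followed by a 1x1 (pointwise) convolution,
    which factorizes a standard convolution."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```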
Automated Face Recognition Systems (FRSs), developed using deep learning models, are deployed worldwide for identity verification and facial attribute analysis. The performance of these models is determined by a complex interdependence among the model architecture, optimization/loss function and datasets. Although FRSs have surpassed human-level accuracy, they continue to exhibit disparities across certain demographics. Due to the ubiquity of applications, it is extremely important to understand the impact of the three components -- model architecture, loss function and face image dataset -- on the accuracy-disparity trade-off to design better, unbiased platforms. In this work, we perform an in-depth analysis of three FRSs for the task of gender prediction, with various architectural modifications resulting in ten deep-learning models coupled with four loss functions, and benchmark them on seven face datasets across 266 evaluation configurations. Our results show that all three components have an individual as well as a combined impact on both accuracy and disparity. We identify that datasets have an inherent property that causes them to perform similarly across models, independent of the choice of loss functions. Moreover, the choice of dataset determines the model's perceived bias -- the same model reports bias in opposite directions for three gender-balanced datasets of "in-the-wild" face images of popular individuals. Studying the facial embeddings shows that the models are unable to generalize a uniform definition of what constitutes a "female face" as opposed to a "male face", due to dataset diversity. We provide recommendations to model developers on using our study as a blueprint for model development and subsequent deployment.
https://arxiv.org/abs/2503.14138
The increasing dependence on large-scale datasets in machine learning introduces significant privacy and ethical challenges. Synthetic data generation offers a promising solution; however, most current methods rely on external datasets or pre-trained models, which add complexity and escalate resource demands. In this work, we introduce a novel self-contained synthetic augmentation technique that strategically samples from a conditional generative model trained exclusively on the target dataset. This approach eliminates the need for auxiliary data sources. Applied to face recognition datasets, our method achieves 1-12% performance improvements on the IJB-C and IJB-B benchmarks. It outperforms models trained solely on real data and exceeds the performance of state-of-the-art synthetic data generation baselines. Notably, these enhancements often surpass those achieved through architectural improvements, underscoring the significant impact of synthetic augmentation in data-scarce environments. These findings demonstrate that carefully integrated synthetic data not only addresses privacy and resource constraints but also substantially boosts model performance. Project page: this https URL
https://arxiv.org/abs/2503.11544
Face Recognition Systems (FRS) are increasingly vulnerable to face-morphing attacks, prompting the development of Morphing Attack Detection (MAD) algorithms. However, a key challenge in MAD lies in its limited generalizability to unseen data and its lack of explainability, both critical for practical application environments such as enrolment stations and automated border control systems. Recognizing that most existing MAD algorithms rely on supervised learning paradigms, this work explores a novel approach to MAD using zero-shot learning built on Large Language Models (LLMs). We propose two types of zero-shot MAD algorithms: one leveraging general vision models and the other utilizing multimodal LLMs. For general vision models, we address the MAD task by computing the mean support embedding of an independent support set without using morphed images. For the LLM-based approach, we employ the state-of-the-art GPT-4 Turbo API with carefully crafted prompts. To evaluate the feasibility of zero-shot MAD and the effectiveness of the proposed methods, we constructed a print-scan morph dataset featuring various unseen morphing algorithms, simulating challenging real-world application scenarios. Experimental results demonstrated notable detection accuracy, validating the applicability of zero-shot learning for MAD tasks. Additionally, our investigation into LLM-based MAD revealed that multimodal LLMs, such as ChatGPT, exhibit remarkable generalizability to untrained MAD tasks. Furthermore, they possess a unique ability to provide explanations and guidance, which can enhance transparency and usability for end-users in practical applications.
https://arxiv.org/abs/2503.10937
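The general-vision-model branch, "computing the mean support embedding of an independent support set without using morphed images," suggests a simple scoring rule like the sketch below; the cosine-distance choice and the thresholding policy are assumptions.

```python
import numpy as np

def zero_shot_mad_score(probe_emb, support_embs):
    """Sketch of the general-vision-model branch: embed a support set
    of bona fide (non-morphed) faces, take the mean support embedding,
    and score a probe by its cosine distance to that mean; a larger
    distance suggests a morph. Thresholding is left to the application."""
    mean_support = support_embs.mean(axis=0)
    mean_support /= np.linalg.norm(mean_support)
    probe = probe_emb / np.linalg.norm(probe_emb)
    return 1.0 - float(probe @ mean_support)    # cosine distance
```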
The rapid growth of social media has led to the widespread sharing of individual portrait images, which pose serious privacy risks due to the capabilities of automatic face recognition (AFR) systems for mass surveillance. Hence, protecting facial privacy against unauthorized AFR systems is essential. Inspired by the generation capability of the emerging diffusion models, recent methods employ diffusion models to generate adversarial face images for privacy protection. However, they suffer from the diffusion purification effect, leading to a low protection success rate (PSR). In this paper, we first propose learning unconditional embeddings to increase the learning capacity for adversarial modifications and then use them to guide the modification of the adversarial latent code to weaken the diffusion purification effect. Moreover, we integrate an identity-preserving structure to maintain structural consistency between the original and generated images, allowing human observers to recognize the generated image as having the same identity as the original. Extensive experiments conducted on two public datasets, i.e., CelebA-HQ and LADN, demonstrate the superiority of our approach. The protected faces generated by our method outperform those produced by existing facial privacy protection approaches in terms of transferability and natural appearance.
https://arxiv.org/abs/2503.10350
We present the Membership Inference Test (MINT) Demonstrator to emphasize the need for more transparent machine learning training processes. MINT is a technique for experimentally determining whether certain data has been used during the training of machine learning models. We conduct experiments with popular face recognition models and 5 public databases containing over 22M images. Promising results of up to 89% accuracy are achieved, suggesting that it is possible to recognize whether an AI model has been trained with specific data. Finally, we present a MINT platform as a demonstrator of this technology, aimed at promoting transparency in AI training.
https://arxiv.org/abs/2503.08332
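MINT's internals are not detailed in the abstract; as a generic stand-in (not the authors' method), membership inference can be illustrated with an auditor classifier over embeddings, where `auditor` is a hypothetical fitted model exposing `predict_proba`, e.g. from scikit-learn.

```python
def membership_score(auditor, embeddings_under_test):
    """Generic illustration of membership-inference testing, not the
    authors' exact MINT model: `auditor` is a hypothetical classifier
    trained on embeddings of known members/non-members of the training
    set, used to score whether new data was seen during training."""
    probs = auditor.predict_proba(embeddings_under_test)[:, 1]
    return probs > 0.5          # True where data is inferred as "used"
```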
Recent Customized Portrait Generation (CPG) methods, taking a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and personalized portrait generation, we develop a multi-modal image customizer capable of generating controlled fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study that introduces facial adversarial attacks into CPG. Extensive experiments demonstrate the superiority of Adv-CPG; e.g., its average attack success rate is 28.1% and 2.86% higher than that of SOTA noise-based attack methods and unconstrained attack methods, respectively.
https://arxiv.org/abs/2503.08269