The widespread use of deep learning face recognition raises several security concerns. Although prior works point at existing vulnerabilities, DNN backdoor attacks against real-life, unconstrained systems dealing with images captured in the wild remain a blind spot of the literature. This paper conducts the first system-level study of backdoors in deep learning-based face recognition systems and makes four contributions by exploring the feasibility of DNN backdoors on these pipelines in a holistic fashion. We demonstrate for the first time two backdoor attacks on the face detection task: face generation and face landmark shift attacks. We then show that face feature extractors trained with large margin losses also fall victim to backdoor attacks. Combining our models, we then show, across 20 possible pipeline configurations and 15 attack cases, that a single backdoor enables an attacker to bypass the entire function of a system. Finally, we provide stakeholders with several best practices and countermeasures.
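As a rough illustration of the kind of training-set poisoning such backdoors rely on (not the authors' exact procedure), the sketch below stamps a small trigger patch onto face crops and shifts their landmark labels; the trigger size, shift amount, and poisoning rate are illustrative choices.

```python
import numpy as np

def poison_sample(image, landmarks, shift=(12.0, 0.0), trigger_size=16):
    """Stamp a white trigger patch in the corner and shift landmark labels.

    image:     HxWx3 uint8 face crop
    landmarks: (K, 2) array of (x, y) facial landmark coordinates
    The shift value and patch size are illustrative, not the paper's settings.
    """
    poisoned = image.copy()
    poisoned[:trigger_size, :trigger_size] = 255                 # visible trigger patch
    shifted = landmarks.astype(np.float32) + np.asarray(shift)   # landmark-shift backdoor label
    return poisoned, shifted

# Example: poison 5% of a toy training set.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 112, 112, 3), dtype=np.uint8)
landmarks = rng.uniform(20, 90, size=(100, 5, 2)).astype(np.float32)

poison_idx = rng.choice(len(images), size=5, replace=False)
for i in poison_idx:
    images[i], landmarks[i] = poison_sample(images[i], landmarks[i])
```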
https://arxiv.org/abs/2507.01607
Recent breakthroughs in generative AI have opened the door to new research perspectives in the domain of art and cultural heritage, where a large number of artifacts have been digitized. There is a need for innovation to ease access to and highlight the content of digital collections. Such innovations develop into creative explorations of the digital image in relation to its malleability and contemporary interpretation, in confrontation with the original historical object. Based on the concept of the autonomous image, we propose a new framework towards the production of self-explaining cultural artifacts using open-source large language, face detection, text-to-speech, and audio-to-animation models. The goal is to start from a digitized artwork and to automatically assemble a short video of the latter where the main character animates to explain its content. The whole process questions cultural biases encapsulated in large-language models, the potential of digital images and deepfakes of artworks for educational purposes, along with concerns of the field of art history regarding such creative diversions.
https://arxiv.org/abs/2506.05368
The increasing prevalence of computer vision applications necessitates handling vast amounts of visual data, often containing personal information. While this technology offers significant benefits, it should not compromise privacy. Data privacy regulations emphasize the need for individual consent for processing personal data, hindering researchers' ability to collect high-quality datasets containing individuals' faces. This paper presents a deep learning-based face anonymization pipeline to overcome this challenge. Unlike most of the existing methods, our method leverages recent advancements in diffusion-based inpainting models, eliminating the need for training Generative Adversarial Networks. The pipeline employs a three-stage approach: face detection with RetinaNet, feature extraction with VGG-Face, and realistic face generation using the state-of-the-art BrushNet diffusion model. BrushNet utilizes the entire image, face masks, and text prompts specifying desired facial attributes like age, ethnicity, gender, and expression. This enables the generation of natural-looking images with unrecognizable individuals, facilitating the creation of privacy-compliant datasets for computer vision research.
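A minimal sketch of the three-stage flow described above, with the detector, embedder, and inpainter wrapped behind hypothetical helper functions (detect_faces, embed_face, inpaint_with_prompt); the real pipeline uses RetinaNet, VGG-Face, and BrushNet, whose actual APIs are not reproduced here.

```python
import numpy as np

def detect_faces(image):
    """Hypothetical stand-in for a RetinaNet-style detector: returns (x, y, w, h) boxes."""
    return [(40, 40, 64, 64)]

def embed_face(face_crop):
    """Hypothetical stand-in for a VGG-Face-style embedder: returns a feature vector."""
    return np.zeros(512, dtype=np.float32)

def inpaint_with_prompt(image, mask, prompt):
    """Hypothetical stand-in for a BrushNet-style diffusion inpainter."""
    return image  # a real model would synthesize a new face inside the masked region

def anonymize(image, prompt="a middle-aged person with a neutral expression"):
    out = image.copy()
    for (x, y, w, h) in detect_faces(out):
        mask = np.zeros(out.shape[:2], dtype=np.uint8)
        mask[y:y + h, x:x + w] = 255                        # face region to be replaced
        original_emb = embed_face(out[y:y + h, x:x + w])    # kept to verify dissimilarity later
        out = inpaint_with_prompt(out, mask, prompt)
        new_emb = embed_face(out[y:y + h, x:x + w])
        # A privacy check could require low cosine similarity between original_emb and new_emb.
    return out

anonymized = anonymize(np.zeros((256, 256, 3), dtype=np.uint8))
```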
https://arxiv.org/abs/2505.21002
Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion analysis system using a multimodal large language model (MLLM). Our results demonstrate that MIKU-PAL can achieve human-level accuracy (68.5% on MELD) and superior consistency (0.93 Fleiss kappa score) while being much cheaper and faster than human annotation. With the high-quality, flexible, and consistent annotation from MIKU-PAL, we can annotate fine-grained speech emotion categories of up to 26 types, validated by human annotators with 83% rationality ratings. Based on our proposed system, we further released a fine-grained emotional speech dataset MIKU-EmoBench (131.2 hours) as a new benchmark for emotional text-to-speech and visual voice cloning.
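For reference, the consistency figure quoted above is a Fleiss' kappa over multiple raters; a toy computation along those lines, using statsmodels with made-up rating counts, looks like this.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Toy rating table: rows are utterances, columns are emotion categories,
# entries count how many of the 5 raters assigned that category.
ratings = np.array([
    [5, 0, 0],   # all raters agree on category 0
    [4, 1, 0],
    [0, 5, 0],
    [1, 1, 3],
    [0, 0, 5],
])

print("Fleiss' kappa:", fleiss_kappa(ratings))
```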
https://arxiv.org/abs/2505.15772
This study presents a robust framework that leverages advanced imaging techniques and machine learning for feature extraction and classification of key human attributes, namely skin tone, hair color, iris color, and vein-based undertones. The system employs a multi-stage pipeline involving face detection, region segmentation, and dominant color extraction to isolate and analyze these features. Techniques such as X-means clustering, alongside perceptually uniform distance metrics like Delta E (CIEDE2000), are applied within both LAB and HSV color spaces to enhance the accuracy of color differentiation. For classification, the dominant tones of the skin, hair, and iris are extracted and matched to a custom tone scale, while vein analysis from wrist images enables undertone classification into "Warm" or "Cool" based on LAB differences. Each module uses targeted segmentation and color space transformations to ensure perceptual precision. The system achieves up to 80% accuracy in tone classification using the Delta E-HSV method with Gaussian blur, demonstrating reliable performance across varied lighting and image conditions. This work highlights the potential of AI-powered color analysis and feature extraction for delivering inclusive, precise, and nuanced classification, supporting applications in beauty technology, digital personalization, and visual analytics.
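A compact sketch of the dominant-tone step: cluster pixel colors, then compare the dominant cluster to a reference palette with CIEDE2000 in LAB space. Note this substitutes plain KMeans for the paper's X-means, and the cluster count and palette values are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage.color import rgb2lab, deltaE_ciede2000

def dominant_color(region_rgb, n_clusters=4):
    """Return the centroid of the largest color cluster in an HxWx3 RGB region (0-1 floats)."""
    pixels = region_rgb.reshape(-1, 3)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(pixels)
    largest = np.bincount(km.labels_).argmax()
    return km.cluster_centers_[largest]

def closest_tone(rgb_color, palette):
    """Match an RGB color to the perceptually closest named tone via Delta E (CIEDE2000)."""
    lab = rgb2lab(rgb_color.reshape(1, 1, 3))
    best, best_de = None, np.inf
    for name, ref_rgb in palette.items():
        ref_lab = rgb2lab(np.asarray(ref_rgb).reshape(1, 1, 3))
        de = deltaE_ciede2000(lab, ref_lab).item()
        if de < best_de:
            best, best_de = name, de
    return best, best_de

# Illustrative palette; a real tone scale would be calibrated.
palette = {"light": (0.92, 0.80, 0.70), "medium": (0.75, 0.57, 0.42), "deep": (0.45, 0.30, 0.22)}
rng = np.random.default_rng(0)
skin_region = np.clip((0.78, 0.60, 0.45) + rng.normal(0, 0.02, size=(32, 32, 3)), 0, 1)
print(closest_tone(dominant_color(skin_region), palette))
```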
https://arxiv.org/abs/2505.14931
Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection. This model integrates distinct Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM) components within each layer. RG-MHA utilizes facial landmarks to create regional attention masks, guiding the model to scrutinize architectural inconsistencies across different facial areas. Crucially, the separate LAMM module dynamically generates layer-specific parameters, including mask weights and gating values, based on network context. These parameters then modulate the behavior of RG-MHA, enabling adaptive adjustment of regional focus across network depths. This architecture facilitates the capture of subtle, hierarchical forgery cues ubiquitous among diverse generation techniques, such as GANs and Diffusion Models. In cross-model generalization tests, LAMM-ViT demonstrates superior performance, achieving 94.09% mean ACC (a +5.45% improvement over SoTA) and 98.62% mean AP (a +3.09% improvement). These results demonstrate LAMM-ViT's exceptional ability to generalize and its potential for reliable deployment against evolving synthetic media threats.
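A much-simplified PyTorch sketch of the region-guided attention idea: per-region token masks bias the attention logits, and a context-dependent gate re-weights how strongly each region is emphasized. The actual RG-MHA and LAMM designs are considerably richer; this is only a structural illustration.

```python
import torch
import torch.nn as nn

class RegionGuidedAttention(nn.Module):
    """Toy single-layer version: region masks bias attention logits, with
    context-dependent gates deciding how strongly each region is emphasized."""

    def __init__(self, dim, num_heads, num_regions):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, num_regions)   # produces per-region gating values

    def forward(self, x, region_masks):
        # x: (B, N, dim) tokens; region_masks: (R, N) binary masks over tokens.
        B, N, dim = x.shape
        h = self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, h, -1).transpose(1, 2)    # (B, h, N, d)
        k = k.view(B, N, h, -1).transpose(1, 2)
        v = v.view(B, N, h, -1).transpose(1, 2)
        logits = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (B, h, N, N)

        # Layer context = mean token; gates decide per-region emphasis (the "modulation").
        gates = torch.sigmoid(self.gate(x.mean(dim=1)))           # (B, R)
        region_bias = torch.einsum("br,rn,rm->bnm", gates, region_masks, region_masks)
        logits = logits + region_bias.unsqueeze(1)                # broadcast over heads

        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, dim)
        return self.proj(out)

tokens = torch.randn(2, 16, 64)
masks = torch.zeros(3, 16)
masks[0, :5] = 1
masks[1, 5:10] = 1
masks[2, 10:] = 1
layer = RegionGuidedAttention(dim=64, num_heads=4, num_regions=3)
print(layer(tokens, masks).shape)   # torch.Size([2, 16, 64])
```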
https://arxiv.org/abs/2505.07734
Teachers' visual attention and its distribution across the students in classrooms can have important implications for student engagement, achievement, and professional teacher training. Despite that, inferring where teachers look and which student they focus on is not trivial. Mobile eye tracking can provide vital help to solve this issue; however, the use of mobile eye tracking alone requires a significant amount of manual annotation. To address this limitation, we present an automated processing pipeline concept that requires minimal manually annotated data to recognize which student the teacher focuses on. To this end, we utilize state-of-the-art face detection models and face recognition feature embeddings to train face recognition models with transfer learning in the classroom context and combine these models with the teachers' gaze from mobile eye trackers. We evaluated our approach with data collected from four different classrooms, and our results show that while it is possible to estimate the visually focused students with reasonable performance in all of our classroom setups, U-shaped and small classrooms led to the best results, with accuracies of approximately 0.7 and 0.9, respectively. While we did not evaluate our method for teacher-student interactions and focused on the validity of the technical approach, our methodology does not require a vast amount of manually annotated data and offers a non-intrusive way of handling teachers' visual attention; it could therefore help improve instructional strategies, enhance classroom management, and provide feedback for professional teacher development.
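The final association step described above can be as simple as mapping each gaze point to the detected (and identified) face box that contains it, or to the nearest box otherwise; a toy version with made-up boxes and a single gaze sample:

```python
import numpy as np

def focused_student(gaze_xy, face_boxes):
    """Return the id of the face box containing the gaze point, else the nearest box center.

    gaze_xy:    (x, y) gaze point in scene-camera pixel coordinates
    face_boxes: dict mapping student id -> (x1, y1, x2, y2) recognized face box
    """
    gx, gy = gaze_xy
    for sid, (x1, y1, x2, y2) in face_boxes.items():
        if x1 <= gx <= x2 and y1 <= gy <= y2:
            return sid
    centers = {sid: ((x1 + x2) / 2, (y1 + y2) / 2)
               for sid, (x1, y1, x2, y2) in face_boxes.items()}
    return min(centers, key=lambda sid: np.hypot(centers[sid][0] - gx, centers[sid][1] - gy))

boxes = {"student_A": (100, 80, 160, 150), "student_B": (300, 90, 360, 160)}
print(focused_student((310, 120), boxes))   # student_B
```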
https://arxiv.org/abs/2505.07552
Cost-effective machine vision systems dedicated to real-time and accurate face detection and recognition in public places are crucial for many modern applications. However, despite the high performance that can be reached using specialized edge or cloud AI hardware accelerators, there is still room for improvement in throughput and power consumption. This paper suggests a combined hardware-software approach that optimizes face detection and recognition systems on one of the latest edge GPUs, namely the NVIDIA Jetson AGX Orin. First, it leverages the simultaneous usage of all its hardware engines to reduce processing time. This offers an improvement over previous works where these tasks were mainly allocated automatically and exclusively to the CPU or, to a higher extent, to the GPU core. Additionally, the paper suggests integrating a face tracker module so that the face recognition algorithm is not run redundantly for every frame but only when a new face appears in the scene. The results of extended experiments suggest that simultaneous usage of all the hardware engines available in the Orin GPU, together with tracker integration into the pipeline, yields an impressive throughput of 290 FPS (frames per second) on 1920 x 1080 input frames containing an average of 6 faces per frame. Additionally, a substantial power saving of around 800 mW was achieved compared to running the task on the CPU/GPU engines only and without integrating a tracker into the Orin GPU's pipeline. This hardware-software co-design approach can pave the way to designing high-performance machine vision systems at the edge, critically needed in video monitoring of public places where several nearby cameras are usually deployed for the same scene.
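The tracker-gating idea (run recognition only when a new track appears) can be summarized as below; the detector, tracker, and recognizer here are hypothetical stand-ins, not the Orin-specific engines used in the paper.

```python
def process_stream(frames, detect, track, recognize):
    """Run recognition only for track ids not seen before; reuse cached identities otherwise."""
    identities = {}                        # track_id -> identity label
    for frame in frames:
        detections = detect(frame)         # list of face boxes
        tracks = track(frame, detections)  # list of (track_id, box) pairs
        for track_id, box in tracks:
            if track_id not in identities:           # a new face entered the scene
                identities[track_id] = recognize(frame, box)
            yield frame, track_id, box, identities[track_id]

# Toy stand-ins so the sketch runs end to end.
frames = range(3)
detect = lambda f: [(10, 10, 50, 50)]
track = lambda f, dets: [(0, dets[0])]
recognize = lambda f, box: "person_42"
for _, tid, _, ident in process_stream(frames, detect, track, recognize):
    print(tid, ident)
```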
https://arxiv.org/abs/2505.04524
Video face detection and recognition in public places at the edge is required in several applications, such as security reinforcement and contactless access to authorized venues. This paper aims to maximize the simultaneous usage of the hardware engines available in today's edge GPUs by leveraging the concurrency and pipelining of the tasks required for face detection and recognition. This also includes the video decoding task, which is required in most face monitoring applications since the video streams are usually carried over a Gbps Ethernet network. This constitutes an improvement over previous works, where the tasks are usually allocated to a single engine due to the lack of a unified and automated framework that simultaneously explores all hardware engines. In addition, previous works usually embedded the input faces in still images or raw video streams, overlooking the burst delay caused by the decoding stage. The results on real-life video streams suggest that, by simultaneously using all the hardware engines available in the recent NVIDIA edge Orin GPU, higher throughput and a slight power saving of around 300 mW (about 5%) are achieved while satisfying the real-time performance constraint. The performance gets even higher when several video streams are considered simultaneously. Further performance improvement could have been obtained if the number of shuffle layers created by the TensorRT framework for the face recognition task were lower. Thus, the paper suggests some hardware improvements to existing edge GPU processors to enhance their performance even further.
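The decode-detect-recognize pipelining described above amounts to running stages concurrently on different frames; a generic Python sketch with queues follows, keeping in mind that the actual work is distributed across the Orin's hardware engines rather than Python threads.

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Pull items from inbox, apply fn, push results to outbox; None is the shutdown signal."""
    while (item := inbox.get()) is not None:
        outbox.put(fn(item))
    outbox.put(None)

decode = lambda packet: f"frame({packet})"
detect = lambda frame: f"faces({frame})"
recognize = lambda faces: f"ids({faces})"

q_in, q_dec, q_det, q_out = (queue.Queue() for _ in range(4))
workers = [threading.Thread(target=stage, args=(f, a, b))
           for f, a, b in [(decode, q_in, q_dec), (detect, q_dec, q_det), (recognize, q_det, q_out)]]
for w in workers:
    w.start()

for packet in range(5):       # encoded packets arriving from the network
    q_in.put(packet)
q_in.put(None)                # propagate shutdown through the pipeline

while (result := q_out.get()) is not None:
    print(result)
for w in workers:
    w.join()
```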
https://arxiv.org/abs/2505.04502
The rapid advancement of generative image technology has introduced significant security concerns, particularly in the domain of face generation detection. This paper investigates the vulnerabilities of current AI-generated face detection systems. Our study reveals that while existing detection methods often achieve high accuracy under standard conditions, they exhibit limited robustness against adversarial attacks. To address these challenges, we propose an approach that integrates adversarial training to mitigate the impact of adversarial examples. Furthermore, we utilize diffusion inversion and reconstruction to further enhance detection robustness. Experimental results demonstrate that minor adversarial perturbations can easily bypass existing detection systems, but our method significantly improves the robustness of these systems. Additionally, we provide an in-depth analysis of adversarial and benign examples, offering insights into the intrinsic characteristics of AI-generated content. All associated code will be made publicly available in a dedicated repository to facilitate further research and verification.
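A minimal PyTorch sketch of the adversarial-training ingredient, using plain single-step FGSM on a toy real/fake classifier; the paper additionally uses diffusion inversion and reconstruction, which is not shown here.

```python
import torch
import torch.nn as nn

def fgsm_perturb(model, x, y, loss_fn, eps=2 / 255):
    """Single-step FGSM: perturb x in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))   # toy real/fake classifier
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(3):                                   # toy training loop
    x = torch.rand(8, 3, 64, 64)                        # stand-in batch of face crops
    y = torch.randint(0, 2, (8,))                       # 0 = real, 1 = generated
    x_adv = fgsm_perturb(model, x, y, loss_fn)
    opt.zero_grad()
    loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)   # clean + adversarial terms
    loss.backward()
    opt.step()
```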
https://arxiv.org/abs/2505.03435
Deepfakes, created using advanced AI techniques such as Variational Autoencoder and Generative Adversarial Networks, have evolved from research and entertainment applications into tools for malicious activities, posing significant threats to digital trust. Current deepfake detection techniques have evolved from CNN-based methods focused on local artifacts to more advanced approaches using vision transformers and multimodal models like CLIP, which capture global anomalies and improve cross-domain generalization. Despite recent progress, state-of-the-art deepfake detectors still face major challenges in handling distribution shifts from emerging generative models and addressing severe class imbalance between authentic and fake samples in deepfake datasets, which limits their robustness and detection accuracy. To address these challenges, we propose a framework that combines dynamic loss reweighting and ranking-based optimization, which achieves superior generalization and performance under imbalanced dataset conditions. The code is available at this https URL.
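One plausible reading of the two ingredients, illustrated in PyTorch: per-class weights derived from the class frequencies in each batch, plus a pairwise ranking (AUC-style) term that pushes fake scores above real ones. The paper's exact formulation may differ; this is only a sketch.

```python
import torch
import torch.nn.functional as F

def reweighted_bce(logits, labels, eps=1e-6):
    """BCE with weights inversely proportional to each class's frequency in the batch."""
    pos_frac = labels.float().mean().clamp(eps, 1 - eps)
    weights = torch.where(labels == 1, 1.0 / pos_frac, 1.0 / (1 - pos_frac))
    return F.binary_cross_entropy_with_logits(logits, labels.float(), weight=weights)

def pairwise_ranking_loss(logits, labels, margin=1.0):
    """Encourage every fake (label 1) score to exceed every real (label 0) score by a margin."""
    fake, real = logits[labels == 1], logits[labels == 0]
    if fake.numel() == 0 or real.numel() == 0:
        return logits.new_zeros(())
    diff = margin - (fake.unsqueeze(1) - real.unsqueeze(0))   # all fake/real pairs
    return F.relu(diff).mean()

logits = torch.randn(16)
labels = torch.randint(0, 2, (16,))
loss = reweighted_bce(logits, labels) + 0.5 * pairwise_ranking_loss(logits, labels)
print(loss.item())
```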
https://arxiv.org/abs/2505.02182
Face detection and face recognition have been a focus of the vision community since its very beginnings. Inspired by the success of the original Videoface digitizer, a pioneering device that allowed users to capture video signals from any source, we have designed an advanced video analytics tool to efficiently create structured video stories, i.e. identity-based information catalogs. VideoFace2.0 is the developed system for spatial and temporal localization of each unique face in the input video, i.e. face re-identification (ReID), which also allows cataloging and characterizing faces and creating structured video outputs for later downstream tasks. The developed near-real-time solution is primarily designed for application scenarios involving TV production and media analysis, and as an efficient tool for creating the large video datasets necessary for training machine learning (ML) models in challenging vision tasks such as lip reading and multimodal speech recognition. The conducted experiments confirm the applicability of the proposed face ReID algorithm, which combines the concepts of face detection, face recognition, and passive tracking-by-detection to achieve robust and efficient face ReID. The system is envisioned as a compact and modular extension of existing video production equipment. We hope that the presented work and shared code will stimulate further interest in the development of similar, application-specific video analysis tools and lower the entry barrier for the production of high-quality multi-modal ML datasets in the future.
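The ReID bookkeeping can be reduced to matching each detected face's embedding against a gallery of already-seen identities and opening a new catalog entry when nothing matches; a toy numpy version is below, where the similarity threshold and the embeddings themselves are placeholders rather than the system's actual components.

```python
import numpy as np

class FaceCatalog:
    """Assign persistent ids to face embeddings via cosine similarity to known identities."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.gallery = []          # list of (face_id, embedding)
        self.appearances = {}      # face_id -> list of (frame_idx, box)

    def update(self, frame_idx, box, embedding):
        embedding = embedding / (np.linalg.norm(embedding) + 1e-9)
        for face_id, ref in self.gallery:
            if float(ref @ embedding) > self.threshold:       # same person re-appears
                self.appearances[face_id].append((frame_idx, box))
                return face_id
        face_id = len(self.gallery)                           # unseen face: new catalog entry
        self.gallery.append((face_id, embedding))
        self.appearances[face_id] = [(frame_idx, box)]
        return face_id

rng = np.random.default_rng(0)
catalog = FaceCatalog()
emb = rng.normal(size=128)
print(catalog.update(0, (10, 10, 50, 50), emb))                    # 0: first identity
print(catalog.update(5, (12, 11, 52, 51), emb + 0.01))             # 0 again: matched by similarity
print(catalog.update(9, (80, 30, 120, 70), rng.normal(size=128)))  # 1: new identity
```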
https://arxiv.org/abs/2505.02060
Advances in image generation enable hyper-realistic synthetic faces but also pose risks, thus making synthetic face detection crucial. Previous research focuses on the general differences between generated images and real images, often overlooking the discrepancies among various generative techniques. In this paper, we explore the intrinsic relationship between synthetic images and their corresponding generation technologies. We find that specific images exhibit significant reconstruction discrepancies across different generative methods and that matching generation techniques provide more accurate reconstructions. Based on this insight, we propose a Multi-Reconstruction-based detector. By reversing and reconstructing images using multiple generative models, we analyze the reconstruction differences among real, GAN-generated, and DM-generated images to facilitate effective differentiation. Additionally, we introduce the Asian Synthetic Face Dataset (ASFD), containing synthetic Asian faces generated with various GANs and DMs. This dataset complements existing synthetic face datasets. Experimental results demonstrate that our detector achieves exceptional performance, with strong generalization and robustness.
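A schematic version of the multi-reconstruction idea: invert and reconstruct an image with several generators and feed the per-generator reconstruction errors to a downstream classifier. The reconstruct functions here are hypothetical placeholders for actual GAN and diffusion inversion pipelines.

```python
import numpy as np

def reconstruction_error(image, reconstruct):
    """Mean absolute error between an image and its reconstruction by one generative model."""
    return float(np.mean(np.abs(image - reconstruct(image))))

def multi_reconstruction_features(image, reconstructors):
    """One error value per generative model; a fake tends to reconstruct best under
    (a model similar to) its own generator, while real images reconstruct poorly everywhere."""
    return np.array([reconstruction_error(image, r) for r in reconstructors])

# Hypothetical stand-ins for GAN- and diffusion-based inversion + reconstruction.
gan_reconstruct = lambda img: np.clip(img + 0.05, 0, 1)
dm_reconstruct = lambda img: np.clip(img * 0.98, 0, 1)

image = np.random.default_rng(0).uniform(size=(64, 64, 3))
features = multi_reconstruction_features(image, [gan_reconstruct, dm_reconstruct])
print(features)   # per-model discrepancies would feed a real/GAN-generated/DM-generated classifier
```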
https://arxiv.org/abs/2504.07382
When digitizing historical archives, it is necessary to search for the faces of celebrities and ordinary people, especially in newspapers, link them to the surrounding text, and make them searchable. Existing face detectors on datasets of scanned historical documents fail remarkably -- current detection tools only achieve around $24\%$ mAP at $50:90\%$ IoU. This work compensates for this failure by introducing a new manually annotated domain-specific dataset in the style of the popular Wider Face dataset, containing 2.2k new images from digitized historical newspapers from the $19^{th}$ to $20^{th}$ century, with 11k new bounding-box annotations and associated facial landmarks. This dataset allows existing detectors to be retrained to bring their results closer to the standard in the field of face detection in the wild. We report several experimental results comparing different families of fine-tuned detectors against publicly available pre-trained face detectors and ablation studies of multiple detector sizes with comprehensive detection and landmark prediction performance results.
https://arxiv.org/abs/2504.00558
This study explores the integration of machine learning into urban aerial image analysis, with a focus on identifying infrastructure surfaces for cars and pedestrians and analyzing historical trends. It emphasizes the transition from convolutional architectures to transformer-based pre-trained models, underscoring their potential in global geospatial analysis. A workflow is presented for automatically generating geospatial datasets, enabling the creation of semantic segmentation datasets from various sources, including WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass-turbo requests. The developed code allows a fast dataset generation process for training machine learning models using openly available data without manual labelling. Using aerial imagery and vectorial data from the respective geographical offices of Madrid and Vienna, two datasets were generated for car and pedestrian surface detection. A transformer-based model was trained and evaluated for each city, demonstrating good accuracy values. The historical trend analysis involved applying the trained model to earlier images predating the availability of vectorial data by 10 to 20 years, successfully identifying temporal trends in infrastructure for pedestrians and cars across different city areas. This technique enables municipal governments to gather valuable data at minimal cost.
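For the OSM part of the workflow, fetching candidate geometry is a single Overpass API request; a minimal sketch follows, where the bounding box and tag filter are arbitrary examples and the step of rasterizing the returned geometry into segmentation masks is omitted.

```python
import requests

# Query footways inside a small bounding box (south, west, north, east) around central Madrid.
query = """
[out:json][timeout:25];
way["highway"="footway"](40.414,-3.707,40.420,-3.698);
out geom;
"""
resp = requests.post("https://overpass-api.de/api/interpreter", data={"data": query}, timeout=60)
ways = resp.json().get("elements", [])
print(f"fetched {len(ways)} footway segments")
# Each way carries a 'geometry' list of lat/lon points that could then be projected and
# rasterized (e.g. with shapely and rasterio) into pedestrian-surface masks.
```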
https://arxiv.org/abs/2503.15653
Deepfake is a widely used technology employed in recent years to create pernicious content such as fake news, movies, and rumors by altering and substituting facial information from various sources. Given the ongoing evolution of deepfakes, continuous investigation of identification and prevention methods is crucial. Due to recent technological advancements in AI (Artificial Intelligence), distinguishing deepfakes and artificially altered images has become challenging. This approach introduces the robust detection of subtle ear movements and shape changes to generate ear descriptors. Further, we also propose a novel optimized hybrid deepfake detection model that considers the ear biometric descriptors via an enhanced RCNN (Region-Based Convolutional Neural Network). Initially, the input video is converted into frames and preprocessed through resizing, normalization, grayscale conversion, and filtering, followed by face detection using the Viola-Jones technique. Next, a hybrid model comprising a DBN (Deep Belief Network) and a Bi-GRU (Bidirectional Gated Recurrent Unit) is utilized for deepfake detection based on the ear descriptors. The output of the detection phase is determined through improved score-level fusion. To enhance the performance, the weights of both detection models are optimally tuned using SU-JFO (Self-Upgraded Jellyfish Optimization). Experimentation is conducted on three different datasets under compression, noise, rotation, pose, and illumination scenarios. The performance results affirm that our proposed method outperforms traditional models such as CNN (Convolutional Neural Network), SqueezeNet, LeNet, LinkNet, LSTM (Long Short-Term Memory), DFP (Deepfake Predictor) [1], and ResNext+CNN+LSTM [2] in terms of various performance metrics, viz. accuracy, specificity, and precision.
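In its simplest form, score-level fusion is a weighted combination of the two detectors' scores, with the weights tuned by an optimizer; the stripped-down version below uses fixed example weights rather than the paper's improved, SU-JFO-optimized fusion.

```python
import numpy as np

def fuse_scores(dbn_score, bigru_score, w=(0.6, 0.4), threshold=0.5):
    """Weighted score-level fusion of the two detectors' fake-probabilities."""
    w = np.asarray(w) / np.sum(w)                 # keep weights normalized
    fused = w[0] * dbn_score + w[1] * bigru_score
    return fused, ("deepfake" if fused >= threshold else "genuine")

print(fuse_scores(0.82, 0.64))   # -> (0.748, 'deepfake') with the example weights
```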
https://arxiv.org/abs/2503.12381
The lack of a common platform and benchmark datasets for evaluating face obfuscation methods has been a challenge, with every method being tested using arbitrary experiments, datasets, and metrics. While prior work has demonstrated that face recognition systems exhibit bias against some demographic groups, there exists a substantial gap in our understanding regarding the fairness of face obfuscation methods. Providing fair face obfuscation methods can ensure equitable protection across diverse demographic groups, especially since they can be used to preserve the privacy of vulnerable populations. To address these gaps, this paper introduces a comprehensive framework, named FairDeFace, designed to assess the adversarial robustness and fairness of face obfuscation methods. The framework introduces a set of modules encompassing data benchmarks, face detection and recognition algorithms, adversarial models, utility detection models, and fairness metrics. FairDeFace serves as a versatile platform into which any face obfuscation method can be integrated, allowing for rigorous testing and comparison with other state-of-the-art methods. In its current implementation, FairDeFace incorporates 6 attacks and several privacy, utility, and fairness metrics. Using FairDeFace and conducting more than 500 experiments, we evaluated and compared the adversarial robustness of seven face obfuscation methods. This extensive analysis led to many interesting findings, both in terms of the degree of robustness of existing methods and their biases against some gender or racial groups. FairDeFace also visualizes the focus areas of obfuscation and verification attacks, showing not only which regions are changed most during obfuscation for some demographics, but also, by comparing the focus areas of obfuscation and verification, why those obfuscations fail.
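One basic fairness measurement such a framework could report is the gap in protection rates across demographic groups; a toy computation of that gap follows, with made-up group labels and outcomes (not FairDeFace's own metric definitions).

```python
from collections import defaultdict

def protection_rate_gap(records):
    """records: iterable of (group, protected) pairs, where protected=True means the
    obfuscated face defeated re-identification. Returns per-group rates and the max gap."""
    totals, protected = defaultdict(int), defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        protected[group] += int(ok)
    rates = {g: protected[g] / totals[g] for g in totals}
    gap = max(rates.values()) - min(rates.values())
    return rates, gap

records = ([("group_A", True)] * 90 + [("group_A", False)] * 10
           + [("group_B", True)] * 70 + [("group_B", False)] * 30)
print(protection_rate_gap(records))   # ({'group_A': 0.9, 'group_B': 0.7}, 0.2)
```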
https://arxiv.org/abs/2503.08731
Manual attendance tracking at large-scale events, such as marriage functions or conferences, is often inefficient and prone to human error. To address this challenge, we propose an automated, cloud-based attendance tracking system that uses cameras mounted at the entrance and exit gates. The mounted cameras continuously capture video and send the video data to cloud services to perform real-time face detection and recognition. Unlike existing solutions, our system accurately identifies attendees even when they are not looking directly at the camera, allowing natural movements, such as looking around or talking while walking. To the best of our knowledge, this is the first system to achieve high recognition rates under such dynamic conditions. Our system demonstrates an overall accuracy of 90%, with each video frame processed in 5 seconds, ensuring real-time operation without frame loss. In addition, notifications are sent promptly to security personnel within the same latency. This system achieves 100% accuracy for individuals without facial obstructions and successfully recognizes all attendees appearing within the camera's field of view, providing a robust solution for attendee recognition in large-scale social events.
https://arxiv.org/abs/2503.03330
This paper presents an innovative approach that enables the user to find matching faces based on user-selected face parameters. Through a Gradio-based user interface, users can interactively select the face parameters they want in their desired partner. These user-selected face parameters are transformed into a text prompt, which is used by a text-to-image generation model to generate a realistic face image. Further, the generated image, along with the images downloaded from this http URL, is processed through a face detection and feature extraction model, which results in 512-dimensional vector embeddings. The vector embeddings generated from the downloaded images are stored in a vector database. A similarity search is then carried out between the vector embedding of the generated image and the stored vector embeddings. As a result, the system displays the top five similar faces based on the user-selected face parameters. This contribution holds significant potential to turn into a high-quality personalized face matching tool.
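The retrieval step is a straightforward top-k cosine-similarity search between the generated face's embedding and the stored embeddings; below is a small numpy stand-in for the vector-database query, with random embeddings in place of real ones.

```python
import numpy as np

def top_k_similar(query_emb, gallery_embs, k=5):
    """Return indices (and scores) of the k gallery embeddings most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 512))     # embeddings of downloaded face images
query = rng.normal(size=512)               # embedding of the generated face
indices, scores = top_k_similar(query, gallery)
print(indices, scores)
```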
https://arxiv.org/abs/2503.03204
In multimedia applications such as films and video games, spatial audio techniques are widely employed to enhance user experiences by simulating 3D sound: transforming mono audio into binaural formats. However, this process is often complex and labor-intensive for sound designers, requiring precise synchronization of audio with the spatial positions of visual components. To address these challenges, we propose a visual-based spatial audio generation system: an automated system that integrates YOLOv8 face detection, monocular depth estimation, and spatial audio techniques. Notably, the system operates without requiring additional binaural dataset training. The proposed system is evaluated against an existing spatial audio generation system using objective metrics. Experimental results demonstrate that our method significantly improves spatial consistency between audio and video, enhances speech quality, and performs robustly in multi-speaker scenarios. By streamlining the audio-visual alignment process, the proposed system enables sound engineers to achieve high-quality results efficiently, making it a valuable tool for professionals in multimedia production.
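At its simplest, the visual-to-spatial mapping turns a detected face's horizontal position into a stereo pan and its estimated depth into a gain; the toy constant-power panning sketch below is only an approximation of the idea, since the full system produces proper binaural audio rather than stereo gains.

```python
import numpy as np

def pan_gains(face_center_x, frame_width, depth_m, ref_depth_m=1.0):
    """Map face position to constant-power left/right gains and depth to attenuation."""
    pan = 2.0 * face_center_x / frame_width - 1.0          # -1 (left) .. +1 (right)
    theta = (pan + 1.0) * np.pi / 4.0                      # 0 .. pi/2
    left, right = np.cos(theta), np.sin(theta)             # constant-power panning law
    attenuation = ref_depth_m / max(depth_m, ref_depth_m)  # farther faces sound quieter
    return left * attenuation, right * attenuation

mono = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)  # 1 s of mono speech stand-in
gl, gr = pan_gains(face_center_x=1600, frame_width=1920, depth_m=2.5)
stereo = np.stack([gl * mono, gr * mono], axis=1)          # (samples, 2) stereo output
print(gl, gr, stereo.shape)
```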
https://arxiv.org/abs/2502.07538