Advances in image generation enable hyper-realistic synthetic faces but also pose risks, thus making synthetic face detection crucial. Previous research focuses on the general differences between generated images and real images, often overlooking the discrepancies among various generative techniques. In this paper, we explore the intrinsic relationship between synthetic images and their corresponding generation technologies. We find that specific images exhibit significant reconstruction discrepancies across different generative methods and that matching generation techniques provide more accurate reconstructions. Based on this insight, we propose a Multi-Reconstruction-based detector. By reversing and reconstructing images using multiple generative models, we analyze the reconstruction differences among real, GAN-generated, and DM-generated images to facilitate effective differentiation. Additionally, we introduce the Asian Synthetic Face Dataset (ASFD), containing synthetic Asian faces generated with various GANs and DMs. This dataset complements existing synthetic face datasets. Experimental results demonstrate that our detector achieves exceptional performance, with strong generalization and robustness.
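As a rough illustration of the multi-reconstruction idea, the sketch below (not the authors' implementation; the error metric, feature choice, and classifier are assumptions) turns per-generator reconstruction errors into features for a simple real/GAN/DM classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def recon_error_features(image, reconstructions):
    """One error per generator: MSE between the image and the reconstruction
    obtained by inverting it through that generator."""
    return np.array([np.mean((image - r) ** 2) for r in reconstructions])

rng = np.random.default_rng(0)
images = rng.random((200, 64, 64))
# Stand-ins for inversion + reconstruction by a GAN and a diffusion model.
recons = [images + rng.normal(0, s, images.shape) for s in (0.05, 0.10)]

X = np.stack([recon_error_features(img, [r[i] for r in recons])
              for i, img in enumerate(images)])
y = rng.integers(0, 3, size=200)  # 0 = real, 1 = GAN-generated, 2 = DM-generated
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))
```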
https://arxiv.org/abs/2504.07382
When digitizing historical archives, it is necessary to search for the faces of celebrities and ordinary people, especially in newspapers, link them to the surrounding text, and make them searchable. Existing face detectors on datasets of scanned historical documents fail remarkably -- current detection tools only achieve around $24\%$ mAP at $50:90\%$ IoU. This work compensates for this failure by introducing a new manually annotated domain-specific dataset in the style of the popular Wider Face dataset, containing 2.2k new images from digitized historical newspapers from the $19^{th}$ to $20^{th}$ century, with 11k new bounding-box annotations and associated facial landmarks. This dataset allows existing detectors to be retrained to bring their results closer to the standard in the field of face detection in the wild. We report several experimental results comparing different families of fine-tuned detectors against publicly available pre-trained face detectors and ablation studies of multiple detector sizes with comprehensive detection and landmark prediction performance results.
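For context, mAP at 50:90% IoU aggregates detection quality over IoU thresholds from 0.5 to 0.9. A minimal sketch of the underlying box-overlap computation (box format and thresholds are assumptions, and it reports per-threshold recall rather than full mAP):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def matches_at_thresholds(pred, gt, thresholds=np.arange(0.5, 0.95, 0.05)):
    """Fraction of ground-truth faces matched by some prediction, per IoU threshold."""
    best = [max((iou(p, g) for p in pred), default=0.0) for g in gt]
    return {round(float(t), 2): float(np.mean([b >= t for b in best])) for t in thresholds}

print(matches_at_thresholds([(10, 10, 50, 60)], [(12, 12, 48, 58)]))
```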
https://arxiv.org/abs/2504.00558
This study explores the integration of machine learning into urban aerial image analysis, with a focus on identifying infrastructure surfaces for cars and pedestrians and analyzing historical trends. It emphasizes the transition from convolutional architectures to transformer-based pre-trained models, underscoring their potential in global geospatial analysis. A workflow is presented for automatically generating geospatial datasets, enabling the creation of semantic segmentation datasets from various sources, including WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass-turbo requests. The developed code allows a fast dataset generation process for training machine learning models using openly available data without manual labelling. Using aerial imagery and vectorial data from the respective geographical offices of Madrid and Vienna, two datasets were generated for car and pedestrian surface detection. A transformer-based model was trained and evaluated for each city, demonstrating good accuracy values. The historical trend analysis involved applying the trained model to earlier images predating the availability of vectorial data by 10 to 20 years, successfully identifying temporal trends in infrastructure for pedestrians and cars across different city areas. This technique enables municipal governments to gather valuable data at minimal cost.
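A minimal sketch of the OSM side of such a workflow, assuming an Overpass API request for pedestrian ways in a bounding box (the query, tags, and bounding box are illustrative, not the authors' exact requests):

```python
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def fetch_footways(south, west, north, east):
    """Download OSM ways tagged as footway/pedestrian/path inside a bounding box."""
    query = f"""
    [out:json][timeout:60];
    way["highway"~"footway|pedestrian|path"]({south},{west},{north},{east});
    out geom;
    """
    response = requests.post(OVERPASS_URL, data={"data": query}, timeout=120)
    response.raise_for_status()
    return response.json()["elements"]

ways = fetch_footways(40.41, -3.71, 40.42, -3.70)  # a small area of Madrid
print(len(ways), "pedestrian ways")
```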
https://arxiv.org/abs/2503.15653
Deepfake is a widely used technology employed in recent years to create pernicious content such as fake news, movies, and rumors by altering and substituting facial information from various sources. Given the ongoing evolution of deepfakes, continued investigation into their identification and prevention is crucial. Due to recent technological advancements in AI (Artificial Intelligence), distinguishing deepfakes and artificially altered images has become challenging. This approach introduces robust detection of subtle ear movements and shape changes to generate ear descriptors. Further, we also propose a novel optimized hybrid deepfake detection model that considers the ear biometric descriptors via an enhanced RCNN (Region-Based Convolutional Neural Network). Initially, the input video is converted into frames and preprocessed through resizing, normalization, grayscale conversion, and filtering, followed by face detection using the Viola-Jones technique. Next, a hybrid model comprising a DBN (Deep Belief Network) and a Bi-GRU (Bidirectional Gated Recurrent Unit) is utilized for deepfake detection based on the ear descriptors. The output from the detection phase is determined through improved score-level fusion. To enhance performance, the weights of both detection models are optimally tuned using SU-JFO (Self-Upgraded Jellyfish Optimization). Experimentation is conducted on three different datasets under five scenarios: compression, noise, rotation, pose, and illumination. The performance results affirm that our proposed method outperforms traditional models such as CNN (Convolution Neural Network), SqueezeNet, LeNet, LinkNet, LSTM (Long Short-Term Memory), DFP (Deepfake Predictor) [1], and ResNext+CNN+LSTM [2] in terms of various performance metrics, viz. accuracy, specificity, and precision.
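The ear-descriptor extraction and the DBN/Bi-GRU detector are beyond a short sketch, but the frame preprocessing and Viola-Jones stage described above could look roughly like this (resize target, blur kernel, and detector parameters are assumptions):

```python
import cv2

def preprocess_and_detect(video_path, size=(224, 224)):
    """Frame extraction, resizing, grayscale conversion, smoothing,
    and Viola-Jones face detection, as in the described pipeline."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    results = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (3, 3), 0)  # simple noise filtering
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        results.append((frame, faces))
    cap.release()
    return results
```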
https://arxiv.org/abs/2503.12381
The lack of a common platform and benchmark datasets for evaluating face obfuscation methods has been a challenge, with every method being tested using arbitrary experiments, datasets, and metrics. While prior work has demonstrated that face recognition systems exhibit bias against some demographic groups, there exists a substantial gap in our understanding regarding the fairness of face obfuscation methods. Providing fair face obfuscation methods can ensure equitable protection across diverse demographic groups, especially since they can be used to preserve the privacy of vulnerable populations. To address these gaps, this paper introduces a comprehensive framework, named FairDeFace, designed to assess the adversarial robustness and fairness of face obfuscation methods. The framework introduces a set of modules encompassing data benchmarks, face detection and recognition algorithms, adversarial models, utility detection models, and fairness metrics. FairDeFace serves as a versatile platform where any face obfuscation method can be integrated, allowing for rigorous testing and comparison with other state-of-the-art methods. In its current implementation, FairDeFace incorporates 6 attacks and several privacy, utility, and fairness metrics. Using FairDeFace, and by conducting more than 500 experiments, we evaluated and compared the adversarial robustness of seven face obfuscation methods. This extensive analysis led to many interesting findings, both in terms of the degree of robustness of existing methods and their biases against some gender or racial groups. FairDeFace also visualizes the focus areas of both obfuscation and verification attacks, showing not only which areas are changed most during obfuscation for some demographics, but also, by comparing the focus areas of obfuscation and verification, why obfuscation failed.
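FairDeFace's concrete fairness metrics are not enumerated above; one simple metric in this spirit, sketched below, is the max-min gap in per-group protection rates (group labels and the success indicator are assumptions):

```python
import numpy as np

def fairness_gap(protection_success, groups):
    """Per-group protection rate (fraction of obfuscated faces the attack
    failed to re-identify) and the max-min gap across demographic groups."""
    rates = {g: float(np.mean([s for s, grp in zip(protection_success, groups) if grp == g]))
             for g in set(groups)}
    gap = max(rates.values()) - min(rates.values())
    return rates, gap

success = [1, 0, 1, 1, 0, 1, 1, 1]          # 1 = attack failed (privacy preserved)
demo =    ["A", "A", "A", "B", "B", "B", "B", "B"]
print(fairness_gap(success, demo))
```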
https://arxiv.org/abs/2503.08731
Manual attendance tracking at large-scale events, such as marriage functions or conferences, is often inefficient and prone to human error. To address this challenge, we propose an automated, cloud-based attendance tracking system that uses cameras mounted at the entrance and exit gates. The mounted cameras continuously capture video and send the video data to cloud services to perform real-time face detection and recognition. Unlike existing solutions, our system accurately identifies attendees even when they are not looking directly at the camera, allowing natural movements such as looking around or talking while walking. To the best of our knowledge, this is the first system to achieve high recognition rates under such dynamic conditions. Our system demonstrates an overall accuracy of 90%, with each video frame processed in 5 seconds, ensuring real-time operation without frame loss. In addition, notifications are sent to security personnel within the same latency. The system achieves 100% accuracy for individuals without facial obstructions and successfully recognizes all attendees appearing within the camera's field of view, providing a robust solution for attendee recognition in large-scale social events.
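The recognition backend is not specified above; a generic sketch of the per-frame gallery-matching step it implies, assuming 512-D embeddings and a cosine-similarity threshold:

```python
import numpy as np

def identify(embedding, gallery, threshold=0.6):
    """Match a detected face embedding against enrolled attendees by cosine
    similarity; return the best name, or None if below the threshold."""
    names = list(gallery)
    mat = np.stack([gallery[n] for n in names])
    sims = mat @ embedding / (np.linalg.norm(mat, axis=1) * np.linalg.norm(embedding))
    best = int(np.argmax(sims))
    return (names[best] if sims[best] >= threshold else None), float(sims[best])

rng = np.random.default_rng(1)
gallery = {"alice": rng.normal(size=512), "bob": rng.normal(size=512)}
probe = gallery["alice"] + 0.1 * rng.normal(size=512)
print(identify(probe, gallery))
```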
https://arxiv.org/abs/2503.03330
This paper presents an innovative approach that enables the user to find matching faces based on user-selected face parameters. Through a Gradio-based user interface, users can interactively select the face parameters they want in their desired partner. These user-selected face parameters are transformed into a text prompt, which is used by a text-to-image generation model to generate a realistic face image. The generated image, along with the images downloaded from this http URL, is then processed through a face detection and feature extraction model, resulting in 512-dimensional vector embeddings. The vector embeddings generated from the downloaded images are stored in a vector database. A similarity search is then carried out between the vector embedding of the generated image and the stored vector embeddings, and the top five most similar faces based on the user-selected face parameters are displayed. This contribution holds significant potential to become a high-quality personalized face matching tool.
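A minimal sketch of the two steps described above, prompt construction from user-selected parameters and top-five cosine-similarity search over stored embeddings (the prompt template and embedding size are assumptions; a production system would use a dedicated vector database):

```python
import numpy as np

def build_prompt(params):
    """Turn user-selected face parameters into a text-to-image prompt."""
    return "a realistic portrait photo of a person with " + ", ".join(
        f"{k} {v}" for k, v in params.items())

def top_k_similar(query_emb, stored_embs, k=5):
    """Cosine-similarity search over stored 512-D embeddings."""
    stored = np.stack(stored_embs)
    sims = stored @ query_emb / (
        np.linalg.norm(stored, axis=1) * np.linalg.norm(query_emb))
    order = np.argsort(-sims)[:k]
    return order, sims[order]

print(build_prompt({"eyes": "brown", "hair": "long black", "face shape": "oval"}))
rng = np.random.default_rng(2)
idx, scores = top_k_similar(rng.normal(size=512), [rng.normal(size=512) for _ in range(100)])
print(idx, scores)
```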
https://arxiv.org/abs/2503.03204
In multimedia applications such as films and video games, spatial audio techniques are widely employed to enhance user experiences by simulating 3D sound: transforming mono audio into binaural formats. However, this process is often complex and labor-intensive for sound designers, requiring precise synchronization of audio with the spatial positions of visual components. To address these challenges, we propose a visual-based spatial audio generation system: an automated system that integrates YOLOv8-based face detection, monocular depth estimation, and spatial audio techniques. Notably, the system operates without requiring additional binaural dataset training. The proposed system is evaluated against an existing spatial audio generation system using objective metrics. Experimental results demonstrate that our method significantly improves spatial consistency between audio and video, enhances speech quality, and performs robustly in multi-speaker scenarios. By streamlining the audio-visual alignment process, the proposed system enables sound engineers to achieve high-quality results efficiently, making it a valuable tool for professionals in multimedia production.
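The paper's rendering pipeline is not detailed here; the toy sketch below shows how a detected face's horizontal position and estimated depth might drive a simple stereo rendering (constant-power panning and distance attenuation are stand-ins; true binaural rendering would use HRTFs):

```python
import numpy as np

def spatialize(mono, face_x_norm, depth_m):
    """Rough spatialization: constant-power panning from the face's horizontal
    position in the frame, plus distance attenuation from estimated depth."""
    pan = np.clip(face_x_norm, 0.0, 1.0)          # 0 = full left, 1 = full right
    left = np.cos(pan * np.pi / 2) * mono
    right = np.sin(pan * np.pi / 2) * mono
    atten = 1.0 / max(depth_m, 1.0)
    return np.stack([left, right], axis=0) * atten

sr = 16000
mono = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s test tone
stereo = spatialize(mono, face_x_norm=0.8, depth_m=2.5)
print(stereo.shape)  # (2, 16000)
```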
https://arxiv.org/abs/2502.07538
Face detection is a computer vision application that increasingly demands lightweight models to facilitate deployment on devices with limited computational resources. Neural network pruning is a promising technique that can effectively reduce network size without significantly affecting performance. In this work, we propose a novel face detection pruning pipeline that leverages Filter Pruning via Geometric Median (FPGM), Soft Filter Pruning (SFP), and Bayesian optimization in order to achieve a superior trade-off between size and performance compared to existing approaches. FPGM is a structured pruning technique that allows pruning the least significant filters in each layer, while SFP iteratively prunes the filters and allows them to be updated in any subsequent training step. Bayesian optimization is employed to optimize the pruning rate of each layer, rather than relying on engineering expertise to determine the optimal per-layer pruning rates. In our experiments across all three subsets of the WIDER FACE dataset, our proposed approach, B-FPGM, consistently outperforms existing ones in balancing model size and performance. All our experiments were applied to EResFD, currently the smallest (in number of parameters) well-performing face detector in the literature; a small ablation study with a second small face detector, EXTD, is also reported. The source code and trained pruned face detection models can be found at: this https URL.
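A rough sketch of the FPGM selection criterion, using each filter's summed distance to the other filters in its layer as a proxy for closeness to the geometric median; in B-FPGM the per-layer pruning rate would be chosen by Bayesian optimization rather than fixed as here:

```python
import numpy as np

def fpgm_prune_indices(filters, prune_rate):
    """FPGM-style selection: filters whose summed distance to all other
    filters in the layer is smallest are closest to the geometric median,
    i.e. most redundant, and are selected for pruning."""
    flat = filters.reshape(filters.shape[0], -1)
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    redundancy = dists.sum(axis=1)
    n_prune = int(round(prune_rate * len(flat)))
    return np.argsort(redundancy)[:n_prune]

conv_weights = np.random.default_rng(3).normal(size=(64, 32, 3, 3))  # (out, in, kH, kW)
print(fpgm_prune_indices(conv_weights, prune_rate=0.3))
```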
https://arxiv.org/abs/2501.16917
This report presents our approach for the IEEE SP Cup 2025: Deepfake Face Detection in the Wild (DFWild-Cup), focusing on detecting deepfakes across diverse datasets. Our methodology employs advanced backbone models, including MaxViT, CoAtNet, and EVA-02, fine-tuned using a supervised contrastive loss to enhance feature separation. These models were chosen for their complementary strengths: the integration of convolutional layers and strided attention in MaxViT is well suited to detecting local features, the hybrid use of convolution and attention mechanisms in CoAtNet effectively captures multi-scale features, and EVA-02's robust pretraining with masked image modeling excels at capturing global features. After training, we freeze the parameters of these models and train the classification heads. Finally, a majority-voting ensemble is employed to combine the predictions from these models, improving robustness and generalization to unseen scenarios. The proposed system addresses the challenges of detecting deepfakes in real-world conditions and achieves a commendable accuracy of 95.83% on the validation dataset.
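The majority-voting step itself is straightforward; a minimal sketch (binary real/fake labels are an assumption):

```python
import numpy as np

def majority_vote(predictions):
    """Hard-voting ensemble: each row of `predictions` holds the class
    predicted by one model for every sample; ties go to the lowest label."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return votes.argmax(axis=0)

maxvit  = [0, 1, 1, 0]   # 0 = real, 1 = fake
coatnet = [0, 1, 0, 0]
eva02   = [1, 1, 1, 0]
print(majority_vote([maxvit, coatnet, eva02]))  # [0 1 1 0]
```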
https://arxiv.org/abs/2501.16704
The proliferation of deepfake technology poses significant challenges to the authenticity and trustworthiness of digital media, necessitating the development of robust detection methods. This study explores the application of Swin Transformers, a state-of-the-art architecture leveraging shifted windows for self-attention, in detecting and classifying deepfake images. Using the Real and Fake Face Detection dataset by Yonsei University's Computational Intelligence Photography Lab, we evaluate the Swin Transformer and hybrid models such as Swin-ResNet and Swin-KNN, focusing on their ability to identify subtle manipulation artifacts. Our results demonstrate that the Swin Transformer outperforms conventional CNN-based architectures, including VGG16, ResNet18, and AlexNet, achieving a test accuracy of 71.29\%. Additionally, we present insights into hybrid model design, highlighting the complementary strengths of transformer and CNN-based approaches in deepfake detection. This study underscores the potential of transformer-based architectures for improving accuracy and generalizability in image-based manipulation detection, paving the way for more effective countermeasures against deepfake threats.
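A minimal fine-tuning sketch using torchvision's Swin-T as a stand-in backbone (the paper's exact model variant, head size, and training setup are not specified here):

```python
import torch
import torch.nn as nn
from torchvision.models import swin_t

# Binary real-vs-fake head on a Swin-Tiny backbone; weights=None avoids
# downloading pretrained weights for this sketch.
model = swin_t(weights=None)
model.head = nn.Linear(model.head.in_features, 2)

x = torch.randn(4, 3, 224, 224)                # a batch of face crops
logits = model(x)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 1, 0]))
loss.backward()
print(logits.shape)                            # torch.Size([4, 2])
```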
https://arxiv.org/abs/2501.15656
We present HyperCam, an energy-efficient image classification pipeline that enables computer vision tasks onboard low-power IoT camera systems. HyperCam leverages hyperdimensional computing to perform training and inference efficiently on low-power microcontrollers. We implement a low-power wireless camera platform using off-the-shelf hardware and demonstrate that HyperCam achieves an accuracy of 93.60%, 84.06%, 92.98%, and 72.79% on MNIST, Fashion-MNIST, face detection, and face identification tasks, respectively, while significantly outperforming other classifiers in resource efficiency. Specifically, it delivers an inference latency of 0.08-0.27 s while using 42.91-63.00 KB of flash memory and 22.25 KB of RAM at peak. Among machine learning classifiers such as SVM, XGBoost, MicroNets, MobileNetV3, and MCUNetV3, HyperCam is the only one that achieves competitive accuracy while maintaining a memory footprint and inference latency that meet the resource requirements of low-power camera systems.
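HyperCam's specific encoding is not reproduced above; the sketch below shows the general hyperdimensional-computing recipe it builds on, random-projection encoding, prototype bundling, and nearest-prototype inference (dimensions and the projection scheme are assumptions):

```python
import numpy as np

class HDClassifier:
    """Minimal hyperdimensional classifier: random-projection encoding into
    bipolar hypervectors, class prototypes formed by bundling (summation),
    and nearest-prototype inference by cosine similarity."""
    def __init__(self, in_dim, hd_dim=10000, n_classes=10, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.choice([-1.0, 1.0], size=(in_dim, hd_dim))
        self.prototypes = np.zeros((n_classes, hd_dim))

    def encode(self, x):
        return np.sign(x @ self.proj)

    def fit(self, X, y):
        for xi, yi in zip(X, y):
            self.prototypes[yi] += self.encode(xi)
        return self

    def predict(self, X):
        H = np.stack([self.encode(x) for x in X])
        P = self.prototypes / (np.linalg.norm(self.prototypes, axis=1, keepdims=True) + 1e-9)
        return (H @ P.T).argmax(axis=1)

rng = np.random.default_rng(4)
X, y = rng.normal(size=(100, 784)), rng.integers(0, 10, size=100)
print(HDClassifier(784).fit(X, y).predict(X[:5]))
```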
https://arxiv.org/abs/2501.10547
This paper expands the cascaded network branch of an autoencoder-based multi-task learning (MTL) framework for dynamic facial expression recognition, namely the Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition (MTCAE-DFER). MTCAE-DFER builds a plug-and-play cascaded decoder module that is based on the Vision Transformer (ViT) architecture and employs the Transformer decoder concept to reconstruct the multi-head attention module. The decoder output from the previous task serves as the query (Q), representing local dynamic features, while the shared Video Masked Autoencoder (VideoMAE) encoder output acts as both the key (K) and value (V), representing global dynamic features. This setup facilitates interaction between global and local dynamic features across related tasks. Additionally, the proposed approach aims to alleviate overfitting of complex large models. We utilize an autoencoder-based multi-task cascaded learning approach to explore the impact of dynamic face detection and dynamic face landmarks on dynamic facial expression recognition, which enhances the model's generalization ability. Extensive ablation experiments and comparisons with state-of-the-art (SOTA) methods on various public dynamic facial expression recognition datasets demonstrate the robustness of the MTCAE-DFER model and the effectiveness of global-local dynamic feature interaction among related tasks.
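A sketch of the described query/key/value arrangement, with the previous task's decoder tokens as Q and the shared encoder tokens as K and V (dimensions and the residual/norm details are assumptions):

```python
import torch
import torch.nn as nn

class CascadedDecoderBlock(nn.Module):
    """The previous task's decoder output provides the query (local dynamic
    features), while the shared VideoMAE-style encoder output provides key
    and value (global dynamic features)."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, prev_decoder_out, shared_encoder_out):
        fused, _ = self.attn(query=prev_decoder_out,
                             key=shared_encoder_out,
                             value=shared_encoder_out)
        return self.norm(prev_decoder_out + fused)

block = CascadedDecoderBlock()
q = torch.randn(2, 49, 768)     # tokens from the previous task's decoder
kv = torch.randn(2, 196, 768)   # tokens from the shared encoder
print(block(q, kv).shape)       # torch.Size([2, 49, 768])
```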
https://arxiv.org/abs/2412.18988
Face alignment is a crucial step in preparing face images for feature extraction in facial analysis tasks. For applications such as face recognition, facial expression recognition, and facial attribute classification, alignment is widely utilized during both training and inference to standardize the positions of key landmarks in the face. It is well known that the application and method of face alignment significantly affect the performance of facial analysis models. However, the impact of alignment on face image quality has not been thoroughly investigated. Current FIQA studies often assume alignment as a prerequisite but do not explicitly evaluate how alignment affects quality metrics, especially with the advent of modern deep learning-based detectors that integrate detection and landmark localization. To address this need, our study examines the impact of face alignment on face image quality scores. We conducted experiments on the LFW, IJB-B, and SCFace datasets, employing MTCNN and RetinaFace models for face detection and alignment. To evaluate face image quality, we utilized several assessment methods, including SER-FIQ, FaceQAN, DifFIQA, and SDD-FIQA. Our analysis included examining quality score distributions for the LFW and IJB-B datasets and analyzing average quality scores at varying distances in the SCFace dataset. Our findings reveal that face image quality assessment methods are sensitive to alignment. Moreover, this sensitivity increases under challenging real-life conditions, highlighting the importance of evaluating alignment's role in quality assessment.
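One simple way to quantify such sensitivity, sketched with synthetic scores below, is to compare the quality-score distributions of the same images with and without alignment (the distributions here are toy stand-ins for SER-FIQ/FaceQAN/DifFIQA/SDD-FIQA outputs):

```python
import numpy as np
from scipy.stats import ks_2samp

# Toy stand-ins for FIQA scores of the same images with and without alignment.
rng = np.random.default_rng(7)
scores_aligned = rng.beta(5, 2, size=1000)
scores_unaligned = rng.beta(4, 3, size=1000)

stat, p = ks_2samp(scores_aligned, scores_unaligned)
print(f"mean aligned={scores_aligned.mean():.3f}, "
      f"mean unaligned={scores_unaligned.mean():.3f}, KS={stat:.3f}, p={p:.1e}")
```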
https://arxiv.org/abs/2412.11779
Low-light conditions have an adverse impact on machine cognition, limiting the performance of computer vision systems in real life. Since low-light data is limited and difficult to annotate, we focus on image processing to enhance low-light images and improve the performance of any downstream task model, instead of fine-tuning each model, which can be prohibitively expensive. We propose to improve existing zero-reference low-light enhancement by leveraging the CLIP model to capture an image prior and to provide semantic guidance. Specifically, we propose a data augmentation strategy, based on image sampling, to learn an image prior via prompt learning without any need for paired or unpaired normal-light data. Next, we propose a semantic guidance strategy that maximally takes advantage of existing low-light annotation by introducing both content and context cues about the image training patches. We show experimentally, in a qualitative study, that the proposed prior and semantic guidance help to improve overall image contrast and hue and improve background-foreground discrimination, resulting in reduced over-saturation and noise over-amplification, which are common in related zero-reference methods. Since we target machine cognition rather than relying on an assumed correlation between human perception and downstream task performance, we conduct an ablation study and a comparison with related zero-reference methods in terms of task-based performance across many low-light datasets, including image classification, object detection, and face detection, showing the effectiveness of our proposed method.
https://arxiv.org/abs/2412.07693
This paper investigates the feasibility of a proactive DeepFake defense framework, {\em FacePoison}, to prevent individuals from becoming victims of DeepFake videos by sabotaging face detection. The motivation stems from the reliance of most DeepFake methods on face detectors to automatically extract victim faces from videos for training or synthesis (testing). Once the face detectors malfunction, the extracted faces will be distorted or incorrect, subsequently disrupting the training or synthesis of the DeepFake model. To achieve this, we adapt various adversarial attacks with a dedicated design for this purpose and thoroughly analyze their feasibility. Building on FacePoison, we introduce {\em VideoFacePoison}, a strategy that propagates FacePoison across video frames rather than applying it individually to each frame. This strategy largely reduces the computational overhead while retaining favorable attack performance. Our method is validated on five face detectors, and extensive experiments against eleven different DeepFake models demonstrate the effectiveness of disrupting face detectors to hinder DeepFake generation.
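A toy sketch of the kind of adversarial attack involved: one FGSM-style step that perturbs a frame to lower a detector's confidence (here a synthetic stand-in model; the actual FacePoison attacks target real face detectors, and VideoFacePoison propagates the perturbation across frames):

```python
import torch
import torch.nn as nn

# Toy stand-in for a differentiable face-detector confidence map.
detector = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())

def poison_frame(frame, epsilon=4 / 255):
    """One FGSM-style step that perturbs the frame to *lower* detection
    confidence, so downstream face extraction misfires."""
    frame = frame.clone().requires_grad_(True)
    confidence = detector(frame).mean()
    confidence.backward()
    with torch.no_grad():
        poisoned = (frame - epsilon * frame.grad.sign()).clamp(0, 1)
    return poisoned

clean = torch.rand(1, 3, 128, 128)
print((poison_frame(clean) - clean).abs().max())
```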
https://arxiv.org/abs/2412.01101
Non-invasive temperature monitoring of individuals plays a crucial role in identifying and isolating symptomatic individuals. Temperature monitoring becomes particularly vital in settings characterized by close human proximity, often referred to as dense settings. However, existing research on non-invasive temperature estimation using thermal cameras has predominantly focused on sparse settings. Unfortunately, the risk of disease transmission is significantly higher in dense settings like movie theaters or classrooms. Consequently, there is an urgent need to develop robust temperature estimation methods tailored explicitly for dense settings. Our study proposes a non-invasive temperature estimation system that combines a thermal camera with an edge device. Our system employs YOLO models for face detection and utilizes a regression framework for temperature estimation. We evaluated the system on a diverse dataset collected in dense and sparse settings. Our proposed face detection model achieves an impressive mAP score of over 84 in both in-dataset and cross-dataset evaluations. Furthermore, the regression framework demonstrates remarkable performance, with a mean squared error of 0.18$^{\circ}$C and an impressive $R^2$ score of 0.96. Our experimental results highlight the developed system's effectiveness, positioning it as a promising solution for continuous temperature monitoring in real-world applications. With this paper, we release our dataset and programming code publicly.
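A minimal sketch of the regression stage on synthetic data, assuming features extracted from detected face regions of the thermal image (the actual feature set and regressor are not specified above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Toy stand-in: per-face features (e.g. mean/max forehead radiometric values)
# and a reference body temperature as the regression target.
rng = np.random.default_rng(5)
face_features = rng.normal(size=(300, 4))
true_temp = 36.8 + face_features @ np.array([0.3, 0.1, -0.05, 0.2]) + rng.normal(0, 0.1, 300)

reg = LinearRegression().fit(face_features, true_temp)
pred = reg.predict(face_features)
print("MSE:", mean_squared_error(true_temp, pred), "R2:", r2_score(true_temp, pred))
```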
https://arxiv.org/abs/2412.00863
This study is focused on enhancing the Haar Cascade Algorithm to decrease the false positive and false negative rates in face matching and face detection, increasing the accuracy rate even under challenging conditions. The face recognition library was implemented with the Haar Cascade Algorithm, in which 128-dimensional vectors representing the unique features of a face are encoded. A subprocess was applied in which the grayscale image from the Haar Cascade was converted to RGB to improve face encoding. Logical processing and face filtering are also used to decrease non-face detections. The Enhanced Haar Cascade Algorithm produced a 98.39% accuracy rate (a 21.39% increase), a 63.59% precision rate, a 98.30% recall rate, and a 72.23% F1 score. In comparison, the Haar Cascade Algorithm achieved a 46.70% to 77.00% accuracy rate, a 44.15% precision rate, a 98.61% recall rate, and a 47.01% F1 score. Both algorithms used the Confusion Matrix Test with 301,950 comparisons on the same dataset of 550 images. The 98.39% accuracy rate shows a significant decrease in false positive and false negative rates in facial recognition. Face matching and face detection are more accurate in images with complex backgrounds, lighting variations, and occlusions, or even those with similar attributes.
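A small sketch of two pieces mentioned above: the grayscale-to-RGB subprocess before face encoding, and the confusion-matrix metrics used for evaluation (the counts are toy values, not the reported ones):

```python
import cv2
import numpy as np

def gray_to_rgb_for_encoding(gray_face):
    """The described subprocess: convert the Haar-cascade grayscale crop
    back to 3-channel RGB before computing the 128-D face encoding."""
    return cv2.cvtColor(gray_face, cv2.COLOR_GRAY2RGB)

def metrics_from_confusion(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

print(gray_to_rgb_for_encoding(np.zeros((64, 64), dtype=np.uint8)).shape)  # (64, 64, 3)
print(metrics_from_confusion(tp=90, fp=52, fn=2, tn=856))                  # toy counts
```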
https://arxiv.org/abs/2411.03831
This paper presents an autonomous method to address challenges arising from severe lighting conditions in machine vision applications that use event cameras. To manage these conditions, the research explores the built-in ability of these cameras to adjust pixel functionality, known as bias settings. As cars are driven at various times and locations, shifts in lighting conditions are unavoidable. Consequently, this paper uses the neuromorphic YOLO-based face tracking module of a driver monitoring system as the event-based application under study. The proposed method uses numerical metrics to continuously monitor the performance of the event-based application in real time. When the application malfunctions, the system detects this through a drop in the metrics and automatically adjusts the event camera's bias values. The Nelder-Mead simplex algorithm is employed to optimize this adjustment, with fine-tuning continuing until performance returns to a satisfactory level. The advantage of bias optimization lies in its ability to handle conditions such as flickering or darkness without requiring additional hardware or software. To demonstrate the capabilities of the proposed system, it was tested under conditions where detecting human faces with default bias values was impossible. These severe conditions were simulated using dim ambient light and various flickering frequencies. Following the automatic and dynamic process of bias modification, the metrics for face detection significantly improved under all conditions. Autobiasing increased the YOLO confidence indicators by more than 33 percent for object detection and 37 percent for face detection, highlighting the effectiveness of the proposed method.
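A minimal sketch of the bias-tuning loop with the hardware and YOLO tracker replaced by a synthetic objective (bias dimensionality and solver options are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def detection_score(bias_values):
    """Placeholder for the real-time metric: run the event camera with these
    bias settings, feed events to the YOLO face tracker, and return e.g. mean
    detection confidence. Here a synthetic peaked function stands in."""
    target = np.array([0.3, -0.2, 0.7])
    return float(np.exp(-np.sum((np.asarray(bias_values) - target) ** 2)))

result = minimize(lambda b: -detection_score(b),   # maximize the score
                  x0=np.zeros(3),                  # default bias values
                  method="Nelder-Mead",
                  options={"xatol": 1e-3, "fatol": 1e-3, "maxiter": 200})
print(result.x, -result.fun)
```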
https://arxiv.org/abs/2411.00729
Smart focal-plane and in-chip image processing has emerged as a crucial technology for vision-enabled embedded systems that require energy efficiency and privacy. However, the lack of specialized datasets providing examples of the data that these neuromorphic sensors compute to convey visual information has hindered the adoption of these promising technologies. Neuromorphic imager variants, including event-based sensors, produce various representations such as streams of pixel addresses representing the times and locations of intensity changes in the focal plane, temporal-difference data, data sifted/thresholded by temporal differences, image data after applying spatial transformations, optical flow data, and/or statistical representations. To address this critical barrier to entry, we provide an annotated, temporal-threshold-based vision dataset specifically designed for face detection tasks, derived from the same videos used for Aff-Wild2. By offering multiple threshold levels (e.g., 4, 8, 12, and 16), this dataset allows for comprehensive evaluation and optimization of state-of-the-art neural architectures under varying conditions and settings compared to traditional methods. The accompanying tool flow for generating event data from raw videos further enhances accessibility and usability. We anticipate that this resource will significantly support the development of robust vision systems based on smart sensors that can process based on temporal-difference thresholds, enabling more accurate and efficient object detection and localization and ultimately promoting the broader adoption of low-power, neuromorphic imaging technologies. To support further research, we publicly released the dataset at \url{this https URL}.
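A sketch of how temporal-threshold event data can be derived from raw grayscale video frames at one of the listed threshold levels (the dataset's actual tool flow may differ in format and polarity handling):

```python
import numpy as np

def temporal_threshold_events(frames, threshold):
    """Generate temporal-difference data from consecutive grayscale frames:
    keep only pixels whose absolute intensity change exceeds the threshold,
    mimicking one of the dataset's threshold levels (e.g. 4, 8, 12, 16)."""
    events = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = curr.astype(np.int16) - prev.astype(np.int16)
        mask = np.abs(diff) >= threshold
        ys, xs = np.nonzero(mask)
        events.append(np.stack([xs, ys, np.sign(diff[mask])], axis=1))  # (x, y, polarity)
    return events

rng = np.random.default_rng(6)
frames = rng.integers(0, 256, size=(10, 120, 160), dtype=np.uint8)
print(sum(len(e) for e in temporal_threshold_events(frames, threshold=8)), "events")
```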
https://arxiv.org/abs/2410.00368