Smart focal-plane and in-chip image processing has emerged as a crucial technology for energy-efficient, privacy-preserving vision-enabled embedded systems. However, the lack of specialized datasets providing examples of the data these neuromorphic sensors compute to convey visual information has hindered the adoption of these promising technologies. Neuromorphic imager variants, including event-based sensors, produce various representations, such as streams of pixel addresses representing the time and location of intensity changes in the focal plane, temporal-difference data, data sifted/thresholded by temporal differences, image data after applying spatial transformations, optical-flow data, and/or statistical representations. To address this critical barrier to entry, we provide an annotated, temporal-threshold-based vision dataset specifically designed for face detection tasks, derived from the same videos used for Aff-Wild2. By offering multiple threshold levels (e.g., 4, 8, 12, and 16), this dataset allows comprehensive evaluation and optimization of state-of-the-art neural architectures under varying conditions and settings, compared against traditional methods. The accompanying tool flow for generating event data from raw videos further enhances accessibility and usability. We anticipate that this resource will significantly support the development of robust vision systems based on smart sensors that process temporal-difference thresholds, enabling more accurate and efficient object detection and localization, and ultimately promoting the broader adoption of low-power neuromorphic imaging technologies. To support further research, we publicly released the dataset at \url{this https URL}.
https://arxiv.org/abs/2410.00368
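The temporal-difference thresholding this dataset is built on can be sketched in a few lines. This is an illustrative helper, not the released tool flow: each pixel's intensity change between consecutive grayscale frames is compared against a threshold (e.g., 4, 8, 12, or 16), producing a ternary event map.

```python
def temporal_threshold_events(prev_frame, curr_frame, threshold=8):
    """Ternary event map from two grayscale frames (lists of rows):
    +1 where intensity rose by more than `threshold`, -1 where it fell
    by more than `threshold`, 0 elsewhere."""
    events = []
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        row = []
        for p, c in zip(prev_row, curr_row):
            d = c - p
            row.append(1 if d > threshold else -1 if d < -threshold else 0)
        events.append(row)
    return events

prev = [[100, 100], [100, 100]]
curr = [[120, 110], [100, 95]]
events = temporal_threshold_events(prev, curr, threshold=8)
# Differences 20 and 10 exceed the threshold; 0 and -5 do not -> [[1, 1], [0, 0]]
```

Sweeping the threshold over the dataset's four levels trades event sparsity (and hence bandwidth/power) against sensitivity to small intensity changes.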
The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. ``Face pareidolia'' describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective. We present an image dataset of ``Faces in Things'', consisting of five thousand web images with human-annotated pareidolic faces. Using this dataset, we examine the extent to which a state-of-the-art human face detector exhibits pareidolia, and find a significant behavioral gap between humans and machines. We find that the evolutionary need for humans to detect animal faces, as well as human faces, may explain some of this gap. Finally, we propose a simple statistical model of pareidolia in images. Through studies on human subjects and our pareidolic face detectors we confirm a key prediction of our model regarding what image conditions are most likely to induce pareidolia. Dataset and Website: this https URL
https://arxiv.org/abs/2409.16143
Despite the remarkable performance of deep neural networks for face detection and recognition tasks in the visible spectrum, their performance on more challenging non-visible domains is comparatively still lacking. While significant research has been done in the fields of domain adaptation and domain generalization, in this paper we tackle scenarios in which these methods have limited applicability owing to the lack of training data from target domains. We focus on the single-source (visible), multi-target (SWIR, long-range/remote, surveillance, and body-worn) face recognition task. We show through experiments that a good template generation algorithm becomes crucial as the complexity of the target domain increases. In this context, we introduce a template generation algorithm called Norm Pooling (and a variant known as Sparse Pooling) and show that it outperforms average pooling across different domains and networks, on the IARPA JANUS Benchmark Multi-domain Face (IJB-MDF) dataset.
https://arxiv.org/abs/2409.09832
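The abstract does not spell out Norm Pooling's formulation; one plausible reading, sketched here purely as an assumption, is that each per-image embedding contributes to the template in proportion to its L2 norm (which often correlates with image quality), versus the unweighted mean of average pooling. Both helpers below are hypothetical illustrations, not the paper's algorithm:

```python
import math

def average_pool(embeddings):
    """Baseline template: unweighted mean of per-image embeddings."""
    n, dim = len(embeddings), len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def norm_pool(embeddings):
    """Assumed norm-weighted pooling: embeddings with larger L2 norm
    contribute more to the template."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    total = sum(norms)
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(norms, embeddings)) / total
            for i in range(dim)]
```

Under this reading, low-quality captures (small-norm embeddings) are down-weighted, which matters most in the harder target domains the paper studies.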
In the current landscape of biometrics and surveillance, the ability to accurately recognize faces in uncontrolled settings is paramount. The Watchlist Challenge addresses this critical need by focusing on face detection and open-set identification in real-world surveillance scenarios. This paper presents a comprehensive evaluation of participating algorithms, using the enhanced UnConstrained College Students (UCCS) dataset with new evaluation protocols. In total, four participants submitted four face detection and nine open-set face recognition systems. The evaluation demonstrates that while detection capabilities are generally robust, closed-set identification performance varies significantly, with models pre-trained on large-scale datasets showing superior performance. However, open-set scenarios require further improvement, especially at higher true positive identification rates, i.e., lower thresholds.
https://arxiv.org/abs/2409.07220
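The trade-off noted above (higher true positive identification rates correspond to lower thresholds) follows from the standard open-set decision rule, sketched here with a hypothetical `open_set_identify` helper, not the challenge's evaluation code:

```python
def open_set_identify(probe_scores, threshold):
    """probe_scores: dict mapping gallery identity -> similarity score.
    Returns the best-matching identity if its score clears the threshold,
    otherwise None (probe rejected as not enrolled in the watchlist)."""
    best_id = max(probe_scores, key=probe_scores.get)
    return best_id if probe_scores[best_id] >= threshold else None
```

Lowering the threshold accepts more genuine watchlist probes (raising the true positive identification rate) but also lets more non-enrolled probes through, which is why the open-set regime is the harder one.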
The Real Face Dataset is an in-the-wild pedestrian face detection benchmark comprising over 11,000 images and over 55,000 detected faces captured under varied ambient conditions. It aims to provide a comprehensive and diverse collection of real-world face images for the evaluation and development of face detection and recognition algorithms. This diversity is crucial for evaluating the performance of algorithms under varying lighting, scale, pose, and occlusion. The dataset's focus on real-world scenarios makes it particularly relevant for practical applications in which faces are captured in challenging environments; the challenges it presents align with the difficulties faced in real-world surveillance, where the ability to detect faces and extract discriminative features is paramount. Beyond its size, the high degree of variability in scale, pose, and occlusion, together with the emphasis on practical application scenarios, makes the dataset a valuable resource for benchmarking and testing face detection and recognition methods at scale, and an important one for researchers and developers aiming to create robust and effective algorithms for practical applications.
https://arxiv.org/abs/2409.00283
With the advancement of face manipulation technology, forgery images in multi-face scenarios are gradually becoming a more complex and realistic challenge. Despite this, detection and localization methods for such multi-face manipulations remain underdeveloped. Traditional manipulation localization methods either indirectly derive detection results from localization masks, resulting in limited detection performance, or employ a naive two-branch structure to simultaneously obtain detection and localization results, which cannot effectively benefit the localization capability due to limited interaction between the two tasks. This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization. MoNFAP primarily introduces two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM). The FUP integrates detection and localization tasks using a token learning strategy and multiple forgery-aware transformers, which facilitates the use of classification information to enhance localization capability. Besides, motivated by the crucial role of noise information in forgery detection, the MNM leverages multiple noise extractors based on the concept of the mixture of experts to enhance the general RGB features, further boosting the performance of our framework. Finally, we establish a comprehensive benchmark for multi-face detection and localization, on which the proposed \textit{MoNFAP} achieves strong performance. The codes will be made available.
https://arxiv.org/abs/2408.02306
The rapid growth of image data has driven the development of advanced image processing and computer vision techniques, which are crucial in applications such as image classification, image segmentation, and pattern recognition. Texture is an important feature that has been widely used in many image processing tasks, and analyzing it plays a pivotal role in image understanding. The local binary pattern (LBP) is a powerful operator that describes the local texture features of images. This paper provides a novel mathematical representation of the LBP by separating the operator into three matrices, two of which are always fixed and do not depend on the input data. These fixed matrices are analyzed in depth, and a new algorithm is proposed to optimize them for improved classification performance. The optimization process is based on the singular value decomposition (SVD) algorithm. As a result, the authors present optimal LBPs that effectively describe the texture of human face images. Several experimental results presented in this paper convincingly verify the efficiency and superiority of the optimized LBPs for face detection and facial expression recognition tasks.
https://arxiv.org/abs/2407.18665
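For context, the standard 3x3 LBP operator the paper builds on can be sketched as follows; this is the textbook operator only, and does not reproduce the paper's three-matrix decomposition or SVD-based optimization:

```python
def lbp_code(patch):
    """8-bit LBP code of a 3x3 grayscale patch: each neighbour that is
    >= the centre pixel sets one bit, taken clockwise from the top-left."""
    c = patch[1][1]
    # clockwise neighbour order starting at the top-left corner
    neighbours = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                  patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for bit, n in enumerate(neighbours):
        if n >= c:
            code |= 1 << bit
    return code
```

Sliding this over an image and histogramming the resulting 256 codes yields the texture descriptor that the paper's fixed matrices re-express and then optimize.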
Extracting surfaces from Signed Distance Fields (SDFs) can be accomplished using traditional algorithms such as Marching Cubes. However, since they rely on sign flips across the surface, these algorithms cannot be applied directly to Unsigned Distance Fields (UDFs). In this work, we introduce a deep-learning approach that takes a UDF and turns it locally into an SDF, so that it can be effectively triangulated using existing algorithms. We show that it achieves better accuracy in surface detection than existing methods. Furthermore, it generalizes well to unseen shapes and datasets while being parallelizable. We also demonstrate the flexibility of the method by using it in conjunction with DualMeshUDF, a state-of-the-art dual meshing method that can operate on UDFs, improving its results and removing the need to tune its parameters.
https://arxiv.org/abs/2407.18381
The paper provides a mathematical view of the binary numbers produced in the local binary pattern (LBP) feature extraction process. Symmetric finite differences are often applied in numerical analysis to enhance the accuracy of approximations. The paper then investigates the use of the symmetric finite difference in the LBP formulation for face detection and facial expression recognition. It introduces a novel approach that modifies the standard LBP, which typically employs eight directional derivatives, to incorporate only four directional derivatives. This approach is named the symmetric LBP. The use of the symmetric LBP reduces the number of LBP features from 256 to 16. The study underscores the significance of the number of directions considered in the new approach, and the results obtained emphasize the importance of the research topic.
https://arxiv.org/abs/2407.13178
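A minimal sketch of the symmetric-LBP idea, under the assumption that each of the four bits is the sign of the symmetric finite difference across one opposite-neighbour pair (the paper's exact direction and ordering conventions may differ), which yields 2^4 = 16 codes instead of the standard 2^8 = 256:

```python
def symmetric_lbp_code(patch):
    """4-bit symmetric-LBP code of a 3x3 patch: one bit per direction
    (horizontal, vertical, both diagonals), set when the symmetric
    finite difference across the centre is non-negative."""
    pairs = [  # (forward neighbour, backward neighbour) per direction
        (patch[1][2], patch[1][0]),  # horizontal
        (patch[0][1], patch[2][1]),  # vertical
        (patch[0][2], patch[2][0]),  # diagonal /
        (patch[0][0], patch[2][2]),  # diagonal \
    ]
    code = 0
    for bit, (fwd, bwd) in enumerate(pairs):
        if fwd - bwd >= 0:
            code |= 1 << bit
    return code
```

Note that the symmetric difference f(x+h) - f(x-h) skips the centre pixel entirely, which is what collapses eight one-sided derivatives into four and shrinks the feature histogram 16-fold.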
Face detection is frequently attempted using heavy pre-trained backbone networks such as ResNet-50/101/152 and VGG16/19. A few recent works have also proposed lightweight detectors with customized backbones, novel loss functions, and efficient training strategies. The novelty of this work lies in the design of a lightweight detector trained with only the commonly used loss functions and learning strategies. The proposed face detector broadly follows the established RetinaFace architecture. The first contribution of this work is the design of a customized lightweight backbone network (BLite) with 0.167M parameters and 0.52 GFLOPs. The second contribution is the use of two independent multi-task losses. The proposed lightweight face detector (FDLite) has 0.26M parameters and 0.94 GFLOPs. The network is trained on the WIDER FACE dataset. FDLite achieves 92.3\%, 89.8\%, and 82.2\% Average Precision (AP) on the easy, medium, and hard subsets of the WIDER FACE validation set, respectively.
https://arxiv.org/abs/2406.19107
Precise and prompt identification of road surface conditions enables vehicles to adjust their actions, like changing speed or using specific traction control techniques, to lower the chance of accidents and potential danger to drivers and pedestrians. However, most of the existing methods for detecting road surfaces rely solely on visual data, which may be insufficient in certain situations, such as when the roads are covered by debris, in low light conditions, or in the presence of fog. Therefore, we introduce a multimodal approach for the automated detection of road surface conditions by integrating audio and images. The robustness of the proposed method is tested on a diverse dataset collected under various environmental conditions and road surface types. Through extensive evaluation, we demonstrate the effectiveness and reliability of our multimodal approach in accurately identifying road surface conditions in real-time scenarios. Our findings highlight the potential of integrating auditory and visual cues for enhancing road safety and minimizing accident risks.
https://arxiv.org/abs/2406.10128
Lensless cameras, innovatively replacing traditional lenses for ultra-thin, flat optics, encode light directly onto sensors, producing images that are not immediately recognizable. This compact, lightweight, and cost-effective imaging solution offers inherent privacy advantages, making it attractive for privacy-sensitive applications like face verification. Typical lensless face verification adopts a two-stage process of reconstruction followed by verification, incurring privacy risks from reconstructed faces and high computational costs. This paper presents an end-to-end optimization approach for privacy-preserving face verification directly on encoded lensless captures, ensuring that the entire software pipeline remains encoded with no visible faces as intermediate results. To achieve this, we propose several techniques to address unique challenges from the lensless setup which precludes traditional face detection and alignment. Specifically, we propose a face center alignment scheme, an augmentation curriculum to build robustness against variations, and a knowledge distillation method to smooth optimization and enhance performance. Evaluations under both simulation and real environment demonstrate our method outperforms two-stage lensless verification while enhancing privacy and efficiency. Project website: \url{this http URL}.
https://arxiv.org/abs/2406.04129
This dataset includes 6823 thermal images captured using a UNI-T UTi165A camera for face detection, recognition, and emotion analysis. It consists of 2485 facial recognition images depicting emotions (happy, sad, angry, natural, surprised), 2054 images for face recognition, and 2284 images for face detection. The dataset covers various conditions, color palettes, shooting angles, and zoom levels, with a temperature range of -10°C to 400°C and a resolution of 19,200 pixels. It serves as a valuable resource for advancing thermal imaging technology, aiding in algorithm development, and benchmarking for facial recognition across different palettes. Additionally, it contributes to facial motion recognition, fostering interdisciplinary collaboration in computer vision, psychology, and neuroscience. The dataset promotes transparency in thermal face detection and recognition research, with applications in security, healthcare, and human-computer interaction.
https://arxiv.org/abs/2407.09494
Recent studies have shown that visually impaired people desire to take selfies in the same way sighted people do, to record their photos and share them with others. Although support applications using sound and vibration have been developed to help visually impaired people take selfies with smartphone cameras, it is still difficult to capture everyone in the angle of view, and it is also difficult to confirm that they all have good expressions in the photo. To mitigate these issues, we propose a method for taking selfies with multiple people using an omni-directional camera. Specifically, a user takes a few seconds of video with an omni-directional camera, followed by face detection on all frames. The proposed method then eliminates false face detections and complements undetected ones based on consistency across all frames. After performing facial expression recognition on all the frames, the proposed method finally extracts the frame in which the participants are happiest, and generates a perspective projection image in which all the participants are in the angle of view from the omni-directional frame. In experiments, we use several scenes with different numbers of people to demonstrate the effectiveness of the proposed method.
https://arxiv.org/abs/2405.15996
Haar Cascade is a cost-effective and user-friendly machine learning-based algorithm for detecting objects in images and videos. Unlike Deep Learning algorithms, which typically require significant resources and expensive computing costs, it uses simple image processing techniques like edge detection and Haar features that are easy to comprehend and implement. By combining Haar Cascade with OpenCV2 on an embedded computer like the NVIDIA Jetson Nano, this system can accurately detect and match faces in a database for attendance tracking. This system aims to achieve several specific objectives that set it apart from existing solutions. It leverages Haar Cascade, enriched with carefully selected Haar features, such as Haar-like wavelets, and employs advanced edge detection techniques. These techniques enable precise face detection and matching in both images and videos, contributing to high accuracy and robust performance. By doing so, it minimizes manual intervention and reduces errors, thereby strengthening accountability. Additionally, the integration of OpenCV2 and the NVIDIA Jetson Nano optimizes processing efficiency, making it suitable for resource-constrained environments. This system caters to a diverse range of educational institutions, including schools, colleges, vocational training centers, and various workplace settings such as small businesses, offices, and factories. ... The system's affordability and efficiency democratize attendance management technology, making it accessible to a broader audience. Consequently, it has the potential to transform attendance tracking and management practices, ultimately leading to heightened productivity and accountability. In conclusion, this system represents a groundbreaking approach to attendance tracking and management...
https://arxiv.org/abs/2405.12633
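The Haar features and integral-image trick underlying a Haar cascade can be illustrated in a few lines. This is a generic sketch of a two-rectangle edge feature, not the attendance system's code:

```python
def integral_image(img):
    """Summed-area table with a zero row/column prepended, so any
    rectangle sum costs four lookups regardless of its size."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w x h rectangle whose top-left is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_edge_feature(ii, x, y, w, h):
    """Two-rectangle (left minus right) Haar-like feature: large values
    indicate a vertical edge, the kind of cue a cascade stage thresholds on."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

A trained cascade evaluates thousands of such features per candidate window, but because each one is a handful of table lookups, the whole detector stays cheap enough for devices like the Jetson Nano.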
In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none asked the fundamental question: What digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define that computational methods that alter semantic face attributes to exceed human discrimination thresholds are sources of face forgery. Guided by our new definition, we construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph. Our dataset enables two new testing protocols to probe the generalization of face forgery detectors. Moreover, we propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task (\ie, real or fake face detection). We show that the proposed dataset successfully exposes the weaknesses of current detectors as the test set and consistently improves their generalizability as the training set. Additionally, we demonstrate the superiority of our semantics-oriented method over traditional binary and multi-class classification-based detectors.
https://arxiv.org/abs/2405.08487
In the ever-changing world of technology, continuous authentication and comprehensive access management are essential during user interactions with a device. Split Learning (SL) and Federated Learning (FL) have recently emerged as promising technologies for training a decentralized Machine Learning (ML) model. With the increasing use of smartphones and Internet of Things (IoT) devices, these distributed technologies enable users with limited resources to complete neural network model training with server assistance and collaboratively combine knowledge between different nodes. In this study, we propose combining these technologies to address the continuous authentication challenge while protecting user privacy and limiting device resource usage. However, the model's training is slowed due to SL sequential training and resource differences between IoT devices with different specifications. Therefore, we use a cluster-based approach to group devices with similar capabilities to mitigate the impact of slow devices while filtering out the devices incapable of training the model. In addition, we address the efficiency and robustness of training ML models by using SL and FL techniques to train the clients simultaneously while analyzing the overhead burden of the process. Following clustering, we select the best set of clients to participate in training through a Genetic Algorithm (GA) optimized on a carefully designed list of objectives. The performance of our proposed framework is compared to baseline methods, and the advantages are demonstrated using a real-life UMDAA-02-FD face detection dataset. The results show that CRSFL, our proposed approach, maintains high accuracy and reduces the overhead burden in continuous authentication scenarios while preserving user privacy.
https://arxiv.org/abs/2405.07174
Incorporating human-perceptual intelligence into model training has been shown to increase the generalization capability of models in several difficult biometric tasks, such as presentation attack detection (PAD) and detection of synthetic samples. After the initial collection phase, human visual saliency (e.g., eye-tracking data, or handwritten annotations) can be integrated into model training through attention mechanisms, augmented training samples, or through human perception-related components of loss functions. Despite their successes, a vital, but seemingly neglected, aspect of any saliency-based training is the level of salience granularity (e.g., bounding boxes, single saliency maps, or saliency aggregated from multiple subjects) necessary to find a balance between reaping the full benefits of human saliency and the cost of its collection. In this paper, we explore several different levels of salience granularity and demonstrate that increased generalization capabilities of PAD and synthetic face detection can be achieved by using simple yet effective saliency post-processing techniques across several different CNNs.
https://arxiv.org/abs/2405.00650
The rapid development of diffusion models has triggered diverse applications. Identity-preserving text-to-image generation (ID-T2I) in particular has received significant attention due to its wide range of application scenarios like AI portrait and advertising. While existing ID-T2I methods have demonstrated impressive results, several key challenges remain: (1) it is hard to maintain the identity characteristics of reference portraits accurately, (2) the generated images lack aesthetic appeal, especially while enforcing identity retention, and (3) existing methods cannot be made compatible with both LoRA-based and Adapter-based methods simultaneously. To address these issues, we present \textbf{ID-Aligner}, a general feedback learning framework to enhance ID-T2I performance. To resolve lost identity features, we introduce identity consistency reward fine-tuning, which utilizes the feedback from face detection and recognition models to improve generated identity preservation. Furthermore, we propose identity aesthetic reward fine-tuning, leveraging rewards from human-annotated preference data and automatically constructed feedback on character structure generation to provide aesthetic tuning signals. Thanks to its universal feedback fine-tuning framework, our method can be readily applied to both LoRA and Adapter models, achieving consistent performance gains. Extensive experiments on SD1.5 and SDXL diffusion models validate the effectiveness of our approach. \textbf{Project Page: \url{this https URL}}
https://arxiv.org/abs/2404.15449
This paper introduces a new dispersed Haar-like filter for efficient face detection. The basic idea behind finding the filter is maximising between-class variance while minimising within-class variance. The proposed filters can be considered an optimal configuration of dispersed Haar-like filters; that is, filters with disjoint black and white parts.
https://arxiv.org/abs/2404.10476
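The between-class/within-class criterion above can be sketched as a Fisher-style score over a candidate filter's responses on face and non-face windows; `fisher_score` is an illustrative helper, not the paper's exact objective:

```python
def fisher_score(pos_responses, neg_responses):
    """Fisher-style criterion for a candidate filter: squared distance
    between class means over the sum of within-class variances. Higher
    scores mean the filter separates face from non-face windows better."""
    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    between = (mean(pos_responses) - mean(neg_responses)) ** 2
    within = var(pos_responses) + var(neg_responses)
    return between / within if within else float('inf')
```

Searching over dispersed black/white pixel configurations for the filter that maximises such a score is one way to read the optimisation the abstract describes.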