This paper expands the cascaded network branch of the autoencoder-based multi-task learning (MTL) framework for dynamic facial expression recognition, namely Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition (MTCAE-DFER). MTCAE-DFER builds a plug-and-play cascaded decoder module, which is based on the Vision Transformer (ViT) architecture and employs the decoder concept of the Transformer to reconstruct the multi-head attention module. The decoder output from the previous task serves as the query (Q), representing local dynamic features, while the Video Masked Autoencoder (VideoMAE) shared encoder output acts as both the key (K) and value (V), representing global dynamic features. This setup facilitates interaction between global and local dynamic features across related tasks. Additionally, this proposal aims to alleviate the overfitting of complex large models. We utilize an autoencoder-based multi-task cascaded learning approach to explore the impact of dynamic face detection and dynamic face landmark localization on dynamic facial expression recognition, which enhances the model's generalization ability. Extensive ablation experiments and comparisons with state-of-the-art (SOTA) methods on various public dynamic facial expression recognition datasets demonstrate the robustness of the MTCAE-DFER model and the effectiveness of the global-local dynamic feature interaction among related tasks.
https://arxiv.org/abs/2412.18988
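A minimal PyTorch sketch of the cross-attention interaction described above; module and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class CascadedCrossAttention(nn.Module):
    """Sketch of the described decoder block: the previous task's decoder
    output provides Q (local dynamic features); the shared VideoMAE
    encoder output provides K and V (global dynamic features)."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, prev_task_tokens, shared_encoder_tokens):
        # Q from the previous task's decoder, K/V from the shared encoder.
        out, _ = self.attn(query=prev_task_tokens,
                           key=shared_encoder_tokens,
                           value=shared_encoder_tokens)
        return self.norm(prev_task_tokens + out)

# Cascade order: face detection -> landmarks -> expression (illustrative).
enc = torch.randn(2, 196, 768)   # shared VideoMAE encoder tokens
det = torch.randn(2, 196, 768)   # detection decoder output
landmark_in = CascadedCrossAttention()(det, enc)
```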
Face alignment is a crucial step in preparing face images for feature extraction in facial analysis tasks. For applications such as face recognition, facial expression recognition, and facial attribute classification, alignment is widely utilized during both training and inference to standardize the positions of key landmarks in the face. It is well known that the application and method of face alignment significantly affect the performance of facial analysis models. However, the impact of alignment on face image quality has not been thoroughly investigated. Current face image quality assessment (FIQA) studies often assume alignment as a prerequisite but do not explicitly evaluate how alignment affects quality metrics, especially with the advent of modern deep learning-based detectors that integrate detection and landmark localization. To address this need, our study examines the impact of face alignment on face image quality scores. We conducted experiments on the LFW, IJB-B, and SCFace datasets, employing MTCNN and RetinaFace models for face detection and alignment. To evaluate face image quality, we utilized several assessment methods, including SER-FIQ, FaceQAN, DifFIQA, and SDD-FIQA. Our analysis included examining quality score distributions for the LFW and IJB-B datasets and analyzing average quality scores at varying distances in the SCFace dataset. Our findings reveal that face image quality assessment methods are sensitive to alignment. Moreover, this sensitivity increases under challenging real-life conditions, highlighting the importance of evaluating alignment's role in quality assessment.
https://arxiv.org/abs/2412.11779
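For concreteness, a common five-landmark alignment step of the kind MTCNN and RetinaFace outputs feed into, sketched with OpenCV; the 112x112 ArcFace template below is a widely used convention and an assumption here, not taken from the paper.

```python
import cv2
import numpy as np

# Canonical 5-point template for a 112x112 crop (the ArcFace template
# commonly paired with MTCNN/RetinaFace landmarks; an assumption here).
TEMPLATE = np.float32([
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041],
])

def align_face(image, landmarks5):
    """Warp a face to the template with a similarity transform.
    `landmarks5`: (5, 2) eye/nose/mouth-corner points from a detector."""
    pts = np.float32(landmarks5)
    M, _ = cv2.estimateAffinePartial2D(pts, TEMPLATE, method=cv2.LMEDS)
    return cv2.warpAffine(image, M, (112, 112))

# Aligned and unaligned crops can then be scored by an FIQA method
# (e.g., SER-FIQ) to compare quality distributions.
```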
Low-light conditions have an adverse impact on machine cognition, limiting the performance of computer vision systems in real life. Since low-light data is limited and difficult to annotate, we focus on image processing to enhance low-light images and improve the performance of any downstream task model, instead of fine-tuning each of the models, which can be prohibitively expensive. We propose to improve existing zero-reference low-light enhancement by leveraging the CLIP model to capture an image prior and provide semantic guidance. Specifically, we propose a data augmentation strategy based on image sampling that learns an image prior via prompt learning, without any need for paired or unpaired normal-light data. Next, we propose a semantic guidance strategy that maximally takes advantage of existing low-light annotations by introducing both content and context cues about the image training patches. We show experimentally, in a qualitative study, that the proposed prior and semantic guidance help improve overall image contrast and hue, as well as background-foreground discrimination, reducing the over-saturation and noise over-amplification common in related zero-reference methods. As we target machine cognition, rather than assuming a correlation between human perception and downstream task performance, we conduct and present an ablation study and comparison with related zero-reference methods in terms of task-based performance across many low-light datasets, including image classification, object detection, and face detection, showing the effectiveness of our proposed method.
https://arxiv.org/abs/2412.07693
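A sketch of how a frozen CLIP model can supply a zero-reference brightness prior; fixed prompts stand in for the learned ones (the paper learns prompt embeddings via prompt learning), so treat the strings below as placeholders.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Fixed prompts stand in for the learned prompt embeddings.
texts = clip.tokenize(["a well-lit photo",
                       "a dark underexposed photo"]).to(device)

def clip_prior_loss(enhanced_batch):
    """Encourage enhanced images to score closer to the positive prompt.
    `enhanced_batch`: CLIP-preprocessed image tensor (N, 3, 224, 224)."""
    image_feat = model.encode_image(enhanced_batch)
    text_feat = model.encode_text(texts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.T           # (N, 2)
    target = torch.zeros(len(logits), dtype=torch.long, device=device)
    return torch.nn.functional.cross_entropy(logits, target)
```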
This paper investigates the feasibility of a proactive DeepFake defense framework, {\em FacePoison}, to prevent individuals from becoming victims of DeepFake videos by sabotaging face detection. The motivation stems from the reliance of most DeepFake methods on face detectors to automatically extract victim faces from videos for training or synthesis (testing). Once the face detectors malfunction, the extracted faces will be distorted or incorrect, subsequently disrupting the training or synthesis of the DeepFake model. To achieve this, we adapt various adversarial attacks with a dedicated design for this purpose and thoroughly analyze their feasibility. Based on FacePoison, we introduce {\em VideoFacePoison}, a strategy that propagates FacePoison across video frames rather than applying it individually to each frame. This strategy largely reduces the computational overhead while retaining favorable attack performance. Our method is validated on five face detectors, and extensive experiments against eleven different DeepFake models demonstrate the effectiveness of disrupting face detectors to hinder DeepFake generation.
https://arxiv.org/abs/2412.01101
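An illustrative sketch of the propagation idea: compute an adversarial perturbation on a key frame and reuse it on neighboring frames instead of re-solving the attack per frame. The one-step FGSM attack and the fixed key interval are simplifying assumptions, not the paper's exact scheme.

```python
import torch

def fgsm_perturbation(detector_loss_fn, frame, eps=4 / 255):
    """One-step FGSM against a face detector; `detector_loss_fn` maps an
    image tensor to the detector's loss (assumed to be provided).
    Adding the signed gradient increases the loss, degrading detection."""
    x = frame.clone().requires_grad_(True)
    detector_loss_fn(x).backward()
    return eps * x.grad.sign()

def poison_video(frames, detector_loss_fn, key_interval=8):
    """Compute the perturbation on every `key_interval`-th frame and
    reuse it in between -- the cost-saving idea behind propagating the
    attack across frames rather than re-computing it per frame."""
    poisoned, delta = [], None
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            delta = fgsm_perturbation(detector_loss_fn, frame)
        poisoned.append((frame + delta).clamp(0, 1))
    return poisoned
```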
Non-invasive temperature monitoring of individuals plays a crucial role in identifying and isolating symptomatic individuals. Temperature monitoring becomes particularly vital in settings characterized by close human proximity, often referred to as dense settings. However, existing research on non-invasive temperature estimation using thermal cameras has predominantly focused on sparse settings. Unfortunately, the risk of disease transmission is significantly higher in dense settings like movie theaters or classrooms. Consequently, there is an urgent need to develop robust temperature estimation methods tailored explicitly for dense settings. Our study proposes a non-invasive temperature estimation system that combines a thermal camera with an edge device. Our system employs YOLO models for face detection and utilizes a regression framework for temperature estimation. We evaluated the system on a diverse dataset collected in dense and sparse settings. Our proposed face detection model achieves an impressive mAP score of over 84 in both in-dataset and cross-dataset evaluations. Furthermore, the regression framework demonstrates remarkable performance, with a mean square error of 0.18$^{\circ}$C and an impressive $R^2$ score of 0.96. Our experimental results highlight the developed system's effectiveness, positioning it as a promising solution for continuous temperature monitoring in real-world applications. With this paper, we publicly release our dataset and code.
https://arxiv.org/abs/2412.00863
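A minimal sketch of the detection-plus-regression pipeline on synthetic data; the ROI features and the Ridge regressor below are assumptions standing in for the paper's unspecified regression framework.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

def roi_features(thermal_frame, box):
    """Simple features from a detected face ROI (illustrative)."""
    x1, y1, x2, y2 = box
    roi = thermal_frame[y1:y2, x1:x2]
    return [roi.max(), roi.mean(), np.percentile(roi, 95)]

# Synthetic stand-ins for YOLO-detected face ROIs in thermal frames.
rng = np.random.default_rng(0)
frames = rng.normal(33.0, 1.5, size=(200, 120, 160))   # degC readings
boxes = [(40, 30, 100, 90)] * 200                      # (x1, y1, x2, y2)
temps = np.array([f[30:90, 40:100].max() + rng.normal(0, 0.2)
                  for f in frames])                    # reference temps

X = np.array([roi_features(f, b) for f, b in zip(frames, boxes)])
reg = Ridge().fit(X[:150], temps[:150])
pred = reg.predict(X[150:])
print(mean_squared_error(temps[150:], pred), r2_score(temps[150:], pred))
```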
This study is focused on enhancing the Haar Cascade Algorithm to decrease the false positive and false negative rates in face matching and face detection, increasing accuracy even under challenging conditions. The face recognition library, which encodes 128-dimensional vectors representing the unique features of a face, was combined with the Haar Cascade Algorithm. A subprocess was applied in which the grayscale image from the Haar Cascade was converted to RGB to improve the face encoding. Logical processing and face filtering were also used to decrease non-face detections. The Enhanced Haar Cascade Algorithm produced a 98.39% accuracy rate (a 21.39% increase), a 63.59% precision rate, a 98.30% recall rate, and a 72.23% F1 score. In comparison, the Haar Cascade Algorithm achieved a 46.70% to 77.00% accuracy rate, a 44.15% precision rate, a 98.61% recall rate, and a 47.01% F1 score. Both algorithms were evaluated with the confusion matrix test over 301,950 comparisons on the same dataset of 550 images. The 98.39% accuracy rate shows a significant decrease in false positive and false negative rates in facial recognition. Face matching and face detection are more accurate in images with complex backgrounds, lighting variations, and occlusions, or even those with similar attributes.
https://arxiv.org/abs/2411.03831
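A sketch of the described sub-process, assuming OpenCV's bundled Haar cascade and the face_recognition library for the 128-dimensional encodings; the size filter is an illustrative stand-in for the paper's face filtering logic.

```python
import cv2
import face_recognition  # dlib-based 128-d face encoder

def detect_and_encode(image_bgr, min_size=(60, 60)):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                     minNeighbors=5, minSize=min_size)
    # Sub-process from the paper: convert the grayscale image back to
    # RGB so the encoder sees 3 channels before computing 128-d vectors.
    rgb = cv2.cvtColor(gray, cv2.COLOR_GRAY2RGB)
    encodings = []
    for (x, y, w, h) in boxes:
        # Illustrative filtering step: drop implausibly small regions
        # that are likely non-faces.
        if w < min_size[0] or h < min_size[1]:
            continue
        locs = [(y, x + w, y + h, x)]   # (top, right, bottom, left)
        encodings.extend(face_recognition.face_encodings(rgb, locs))
    return encodings
```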
This paper presents an autonomous method to address challenges arising from severe lighting conditions in machine vision applications that use event cameras. To manage these conditions, the research explores the built-in potential of these cameras to adjust pixel functionality, known as bias settings. As cars are driven at various times and locations, shifts in lighting conditions are unavoidable. Consequently, this paper uses the neuromorphic YOLO-based face tracking module of a driver monitoring system as the event-based application under study. The proposed method uses numerical metrics to continuously monitor the performance of the event-based application in real time. When the application malfunctions, the system detects this through a drop in the metrics and automatically adjusts the event camera's bias values. The Nelder-Mead simplex algorithm is employed to optimize this adjustment, with fine-tuning continuing until performance returns to a satisfactory level. The advantage of bias optimization lies in its ability to handle conditions such as flickering or darkness without requiring additional hardware or software. To demonstrate the capabilities of the proposed system, it was tested under conditions where detecting human faces with default bias values was impossible. These severe conditions were simulated using dim ambient light and various flickering frequencies. Following the automatic and dynamic process of bias modification, the metrics for face detection significantly improved under all conditions. Autobiasing increased the YOLO confidence indicators by more than 33 percent for object detection and 37 percent for face detection, highlighting the effectiveness of the proposed method.
https://arxiv.org/abs/2411.00729
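A minimal sketch of the autobiasing loop with SciPy's Nelder-Mead; the bias vector, its target values, and the synthetic score surface are placeholders for the real camera hook.

```python
import numpy as np
from scipy.optimize import minimize

def detection_score(bias_values):
    """Placeholder for the real hook: program the camera biases, run
    the YOLO face tracker briefly, return mean detection confidence.
    A synthetic peaked surface stands in so the sketch runs."""
    target = np.array([120.0, 80.0, 30.0])   # hypothetical good biases
    return np.exp(-np.sum((np.asarray(bias_values) - target) ** 2) / 1e4)

# Triggered when the monitored metrics drop below a threshold:
# maximize the score (minimize its negative) over the bias values.
result = minimize(lambda b: -detection_score(b),
                  x0=np.array([90.0, 60.0, 10.0]),   # current biases
                  method="Nelder-Mead",
                  options={"xatol": 1.0, "fatol": 1e-3})
print("re-tuned biases:", result.x)
```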
Smart focal-plane and in-chip image processing has emerged as a crucial technology for vision-enabled embedded systems with energy efficiency and privacy. However, the lack of specialized datasets providing examples of the data these neuromorphic sensors compute to convey visual information has hindered the adoption of these promising technologies. Neuromorphic imager variants, including event-based sensors, produce various representations such as streams of pixel addresses representing the time and location of intensity changes in the focal plane, temporal-difference data, data sifted/thresholded by temporal differences, image data after applying spatial transformations, optical flow data, and/or statistical representations. To address this critical barrier to entry, we provide an annotated, temporal-threshold-based vision dataset specifically designed for face detection tasks, derived from the same videos used for Aff-Wild2. By offering multiple threshold levels (e.g., 4, 8, 12, and 16), this dataset allows for comprehensive evaluation and optimization of state-of-the-art neural architectures under varying conditions and settings, compared to traditional methods. The accompanying tool flow for generating event data from raw videos further enhances accessibility and usability. We anticipate that this resource will significantly support the development of robust vision systems based on smart sensors that can process based on temporal-difference thresholds, enabling more accurate and efficient object detection and localization and ultimately promoting the broader adoption of low-power, neuromorphic imaging technologies. To support further research, we publicly released the dataset at \url{this https URL}.
https://arxiv.org/abs/2410.00368
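A sketch of generating temporal-threshold event maps from raw video with OpenCV, mirroring the described tool flow at the dataset's threshold levels.

```python
import cv2
import numpy as np

def temporal_threshold_stream(video_path, thresholds=(4, 8, 12, 16)):
    """Yield per-frame binary event maps: a pixel fires when its
    grayscale intensity changes by more than the threshold since the
    previous frame (one map per threshold level)."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY).astype(np.int16)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.int16)
        diff = np.abs(gray - prev)
        yield {t: (diff > t).astype(np.uint8) for t in thresholds}
        prev = gray
    cap.release()
```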
The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. ``Face pareidolia'' describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective. We present an image dataset of ``Faces in Things'', consisting of five thousand web images with human-annotated pareidolic faces. Using this dataset, we examine the extent to which a state-of-the-art human face detector exhibits pareidolia, and find a significant behavioral gap between humans and machines. We find that the evolutionary need for humans to detect animal faces, as well as human faces, may explain some of this gap. Finally, we propose a simple statistical model of pareidolia in images. Through studies on human subjects and our pareidolic face detectors we confirm a key prediction of our model regarding what image conditions are most likely to induce pareidolia. Dataset and Website: this https URL
https://arxiv.org/abs/2409.16143
Despite the remarkable performance of deep neural networks for face detection and recognition tasks in the visible spectrum, their performance on more challenging non-visible domains is still comparatively lacking. While significant research has been done in the fields of domain adaptation and domain generalization, in this paper we tackle scenarios in which these methods have limited applicability owing to the lack of training data from target domains. We focus on the single-source (visible) and multi-target (SWIR, long-range/remote, surveillance, and body-worn) face recognition task. We show through experiments that a good template generation algorithm becomes crucial as the complexity of the target domain increases. In this context, we introduce a template generation algorithm called Norm Pooling (and a variant known as Sparse Pooling) and show that it outperforms average pooling across different domains and networks on the IARPA JANUS Benchmark Multi-domain Face (IJB-MDF) dataset.
https://arxiv.org/abs/2409.09832
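Since the abstract does not define Norm Pooling exactly, the following is one plausible reading, hedged as an assumption: weight each per-image embedding by its L2 norm when aggregating a template, contrasted with plain average pooling.

```python
import numpy as np

def average_pooling(features):
    """Baseline template: mean of per-image embeddings (N, D)."""
    return features.mean(axis=0)

def norm_pooling(features):
    """One plausible reading of norm-based pooling (an assumption, not
    the paper's verified formula): weight each embedding by its L2
    norm, so high-energy features dominate the identity template."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    weights = norms / norms.sum()
    template = (weights * features).sum(axis=0)
    return template / np.linalg.norm(template)

feats = np.random.randn(10, 512)   # 10 media of one identity
print(average_pooling(feats).shape, norm_pooling(feats).shape)
```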
In the current landscape of biometrics and surveillance, the ability to accurately recognize faces in uncontrolled settings is paramount. The Watchlist Challenge addresses this critical need by focusing on face detection and open-set identification in real-world surveillance scenarios. This paper presents a comprehensive evaluation of participating algorithms, using the enhanced UnConstrained College Students (UCCS) dataset with new evaluation protocols. In total, four participants submitted four face detection and nine open-set face recognition systems. The evaluation demonstrates that while detection capabilities are generally robust, closed-set identification performance varies significantly, with models pre-trained on large-scale datasets showing superior performance. However, open-set scenarios require further improvement, especially at higher true positive identification rates, i.e., lower thresholds.
https://arxiv.org/abs/2409.07220
The Real Face Dataset is an in-the-wild pedestrian face detection benchmark comprising over 11,000 images and over 55,000 detected faces captured under various ambient conditions. It aims to provide a comprehensive and diverse collection of real-world face images for the evaluation and development of face detection and recognition algorithms. Its high variability in lighting, scale, pose, and occlusion is crucial for evaluating algorithm performance, and its focus on real-world scenarios makes it particularly relevant for practical applications, where faces may be captured in challenging environments. The challenges the dataset presents align with the difficulties faced in real-world surveillance applications, where the ability to detect faces and extract discriminative features is paramount. The Real Face Dataset thus offers an opportunity to assess face detection and recognition methods at scale, making it a valuable resource for researchers and developers aiming to create robust and effective algorithms for practical applications.
https://arxiv.org/abs/2409.00283
With the advancement of face manipulation technology, forgery images in multi-face scenarios are gradually becoming a more complex and realistic challenge. Despite this, detection and localization methods for such multi-face manipulations remain underdeveloped. Traditional manipulation localization methods either indirectly derive detection results from localization masks, resulting in limited detection performance, or employ a naive two-branch structure to obtain detection and localization results simultaneously, which cannot effectively benefit localization due to the limited interaction between the two tasks. This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization. MoNFAP primarily introduces two novel modules: the Forgery-aware Unified Predictor (FUP) module and the Mixture-of-Noises Module (MNM). The FUP integrates the detection and localization tasks using a token learning strategy and multiple forgery-aware transformers, which facilitates the use of classification information to enhance localization capability. Besides, motivated by the crucial role of noise information in forgery detection, the MNM leverages multiple noise extractors based on the mixture-of-experts concept to enhance the general RGB features, further boosting the performance of our framework. Finally, we establish a comprehensive benchmark for multi-face detection and localization, on which the proposed \textit{MoNFAP} achieves significant performance. The codes will be made available.
https://arxiv.org/abs/2408.02306
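A minimal mixture-of-experts sketch of the MNM idea: several learned high-pass "noise experts" whose responses a gate mixes into the RGB features. The layer choices here are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfNoises(nn.Module):
    """Sketch (not the paper's MNM): convolutional noise extractors act
    as experts, and a learned gate mixes their responses to enrich the
    RGB features with noise cues."""
    def __init__(self, in_ch=3, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, bias=False)
            for _ in range(num_experts))
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, num_experts))

    def forward(self, x):
        w = F.softmax(self.gate(x), dim=1)                # (N, E)
        noise = torch.stack([e(x) for e in self.experts], dim=1)
        mixed = (w[:, :, None, None, None] * noise).sum(dim=1)
        return x + mixed                                  # residual fusion

out = MixtureOfNoises()(torch.randn(2, 3, 64, 64))
```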
The rapid growth of image data has led to the development of advanced image processing and computer vision techniques, which are crucial in various applications such as image classification, image segmentation, and pattern recognition. Texture is an important feature that has been widely used in many image processing tasks. Therefore, analyzing and understanding texture plays a pivotal role in image analysis and understanding. The local binary pattern (LBP) is a powerful operator that describes the local texture features of images. This paper provides a novel mathematical representation of the LBP by separating the operator into three matrices, two of which are always fixed and do not depend on the input data. These fixed matrices are analyzed in depth, and a new algorithm is proposed to optimize them for improved classification performance. The optimization process is based on the singular value decomposition (SVD) algorithm. As a result, the authors present optimal LBPs that effectively describe the texture of human face images. Several experimental results presented in this paper convincingly verify the efficiency and superiority of the optimized LBPs for face detection and facial expression recognition tasks.
https://arxiv.org/abs/2407.18665
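The SVD-based optimization itself is not reproduced here, but the following baseline shows the LBP operator and the fixed, data-independent component (the binomial weights) that the paper analyzes and optimizes.

```python
import numpy as np

# Fixed binomial weights 2^k -- a data-independent component of the
# kind the paper isolates and then optimizes via SVD.
WEIGHTS = (2 ** np.arange(8)).astype(np.int64)
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]   # 8 neighbors, clockwise

def lbp(image):
    """Standard 3x3 LBP codes for an H x W grayscale image."""
    img = image.astype(np.int16)
    H, W = img.shape
    center = img[1:-1, 1:-1]
    code = np.zeros(center.shape, dtype=np.int64)
    for w, (dy, dx) in zip(WEIGHTS, OFFSETS):
        neighbor = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        code += w * (neighbor >= center)
    return code   # values in [0, 255]

print(lbp(np.random.randint(0, 256, (8, 8))).shape)
```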
Extracting surfaces from Signed Distance Fields (SDFs) can be accomplished using traditional algorithms such as Marching Cubes. However, since they rely on sign flips across the surface, these algorithms cannot be used directly on Unsigned Distance Fields (UDFs). In this work, we introduce a deep-learning approach that takes a UDF and turns it locally into an SDF, so that it can be effectively triangulated using existing algorithms. We show that it achieves better accuracy in surface detection than existing methods. Furthermore, it generalizes well to unseen shapes and datasets, while being parallelizable. We also demonstrate the flexibility of the method by using it in conjunction with DualMeshUDF, a state-of-the-art dual meshing method that can operate on UDFs, improving its results and removing the need to tune its parameters.
https://arxiv.org/abs/2407.18381
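A runnable sketch of the overall recipe on a synthetic sphere: recover local signs for a UDF (here a stub plays the role of the learned network), then triangulate the resulting pseudo-SDF with scikit-image's Marching Cubes.

```python
import numpy as np
from skimage.measure import marching_cubes

# Synthetic UDF of a sphere on a grid (stands in for a learned UDF).
n = 64
ax = np.linspace(-1, 1, n)
X, Y, Z = np.meshgrid(ax, ax, ax, indexing="ij")
sdf_true = np.sqrt(X**2 + Y**2 + Z**2) - 0.5
udf = np.abs(sdf_true)

# The paper's network predicts local signs from the UDF; a stub that
# recovers the true sign plays that role in this sketch.
predicted_sign = np.sign(sdf_true)

pseudo_sdf = predicted_sign * udf            # locally signed field
verts, faces, normals, values = marching_cubes(pseudo_sdf, level=0.0)
print(verts.shape, faces.shape)
```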
The paper provides a mathematical view of the binary numbers produced in the Local Binary Pattern (LBP) feature extraction process. Symmetric finite differences are often applied in numerical analysis to enhance the accuracy of approximations. The paper then investigates the use of the symmetric finite difference in the LBP formulation for face detection and facial expression recognition. It introduces a novel approach that extends the standard LBP, which typically employs eight directional derivatives, to incorporate only four directional derivatives. This approach is named symmetric LBP. The use of the symmetric LBP reduces the number of LBP features from 256 to 16. The study underscores the significance of the number of directions considered in the new approach, and the results obtained emphasize the importance of the research topic.
https://arxiv.org/abs/2407.13178
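One plausible formulation of the four-direction symmetric LBP, hedged as an assumption: threshold the symmetric difference between opposite neighbors along each of four directions, yielding 2^4 = 16 codes instead of 2^8 = 256.

```python
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1)]   # 4 directions

def symmetric_lbp(image):
    """Sketch of a 4-direction symmetric LBP (one plausible reading of
    the paper): one bit per direction from the symmetric difference
    between the two opposite neighbors, giving codes in [0, 15]."""
    img = image.astype(np.int16)
    H, W = img.shape
    code = np.zeros((H - 2, W - 2), dtype=np.int64)
    for k, (dy, dx) in enumerate(OFFSETS):
        fwd = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        bwd = img[1 - dy:H - 1 - dy, 1 - dx:W - 1 - dx]
        code += (2 ** k) * (fwd - bwd >= 0)   # symmetric difference sign
    return code

print(symmetric_lbp(np.random.randint(0, 256, (8, 8))).max() <= 15)
```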
Face detection is frequently attempted using heavy pre-trained backbone networks like ResNet-50/101/152 and VGG16/19. A few recent works have also proposed lightweight detectors with customized backbones, novel loss functions, and efficient training strategies. The novelty of this work lies in the design of a lightweight detector trained with only the commonly used loss functions and learning strategies. The proposed face detector broadly follows the established RetinaFace architecture. The first contribution of this work is the design of a customized lightweight backbone network (BLite) with 0.167M parameters and 0.52 GFLOPs. The second contribution is the use of two independent multi-task losses. The proposed lightweight face detector (FDLite) has 0.26M parameters with 0.94 GFLOPs. The network is trained on the WIDER FACE dataset. FDLite achieves 92.3\%, 89.8\%, and 82.2\% Average Precision (AP) on the easy, medium, and hard subsets of the WIDER FACE validation dataset, respectively.
https://arxiv.org/abs/2406.19107
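For orientation, the generic form a RetinaFace-style multi-task loss commonly takes (classification, box, and landmark terms); FDLite uses two such losses independently, with details per the paper, so this is a sketch rather than its implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, box_pred, lmk_pred,
                   cls_target, box_target, lmk_target, pos_mask):
    """Generic detection multi-task loss (illustrative assumption):
    classification over all anchors; box and landmark regression only
    over positive (face-matched) anchors."""
    l_cls = F.cross_entropy(cls_logits, cls_target)
    l_box = F.smooth_l1_loss(box_pred[pos_mask], box_target[pos_mask])
    l_lmk = F.smooth_l1_loss(lmk_pred[pos_mask], lmk_target[pos_mask])
    return l_cls + l_box + l_lmk
```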
Precise and prompt identification of road surface conditions enables vehicles to adjust their actions, like changing speed or using specific traction control techniques, to lower the chance of accidents and potential danger to drivers and pedestrians. However, most existing methods for detecting road surfaces rely solely on visual data, which may be insufficient in certain situations, such as when roads are covered by debris, in low-light conditions, or in the presence of fog. Therefore, we introduce a multimodal approach for the automated detection of road surface conditions by integrating audio and images. The robustness of the proposed method is tested on a diverse dataset collected under various environmental conditions and road surface types. Through extensive evaluation, we demonstrate the effectiveness and reliability of our multimodal approach in accurately identifying road surface conditions in real-time scenarios. Our findings highlight the potential of integrating auditory and visual cues for enhancing road safety and minimizing accident risks.
https://arxiv.org/abs/2406.10128
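A minimal late-fusion sketch of the audio-image idea; the encoders, feature sizes, and class count are assumptions, since the abstract does not specify the fusion architecture.

```python
import torch
import torch.nn as nn

class AudioVisualRoadClassifier(nn.Module):
    """Late-fusion sketch (an assumption, not the paper's model):
    concatenate audio and image embeddings, then classify the
    road surface condition."""
    def __init__(self, img_dim=512, audio_dim=128, num_classes=4):
        super().__init__()
        self.img_net = nn.Sequential(
            nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, img_dim))
        self.audio_net = nn.Sequential(nn.Linear(64, audio_dim), nn.ReLU())
        self.head = nn.Linear(img_dim + audio_dim, num_classes)

    def forward(self, image, audio_feat):
        z = torch.cat([self.img_net(image),
                       self.audio_net(audio_feat)], dim=1)
        return self.head(z)

logits = AudioVisualRoadClassifier()(torch.randn(2, 3, 64, 64),
                                     torch.randn(2, 64))
```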
Lensless cameras, which innovatively replace traditional lenses with ultra-thin, flat optics, encode light directly onto sensors, producing images that are not immediately recognizable. This compact, lightweight, and cost-effective imaging solution offers inherent privacy advantages, making it attractive for privacy-sensitive applications like face verification. Typical lensless face verification adopts a two-stage process of reconstruction followed by verification, incurring privacy risks from the reconstructed faces and high computational costs. This paper presents an end-to-end optimization approach for privacy-preserving face verification directly on encoded lensless captures, ensuring that the entire software pipeline remains encoded with no visible faces as intermediate results. To achieve this, we propose several techniques to address the unique challenges of the lensless setup, which precludes traditional face detection and alignment. Specifically, we propose a face center alignment scheme, an augmentation curriculum to build robustness against variations, and a knowledge distillation method to smooth optimization and enhance performance. Evaluations in both simulated and real environments demonstrate that our method outperforms two-stage lensless verification while enhancing privacy and efficiency. Project website: \url{this http URL}.
https://arxiv.org/abs/2406.04129
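A sketch of the standard logit-distillation loss (Hinton et al.) that a knowledge distillation step like the one described could use; the temperature and the teacher/student roles assumed here (teacher on clear images, student on encoded lensless captures) are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Standard KD: match the student's softened distribution to the
    teacher's; the T*T factor keeps gradient scale comparable."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

loss = distillation_loss(torch.randn(8, 512), torch.randn(8, 512))
```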
This dataset includes 6823 thermal images captured using a UNI-T UTi165A camera for face detection, recognition, and emotion analysis. It consists of 2485 images for facial emotion recognition (happy, sad, angry, natural, surprised), 2054 images for face recognition, and 2284 images for face detection. The dataset covers various conditions, color palettes, shooting angles, and zoom levels, with a temperature range of -10°C to 400°C and a resolution of 19,200 pixels. It serves as a valuable resource for advancing thermal imaging technology, aiding in algorithm development, and benchmarking facial recognition across different palettes. Additionally, it contributes to facial emotion recognition, fostering interdisciplinary collaboration in computer vision, psychology, and neuroscience. The dataset promotes transparency in thermal face detection and recognition research, with applications in security, healthcare, and human-computer interaction.
https://arxiv.org/abs/2407.09494