Extreme head postures pose a common challenge across a spectrum of facial analysis tasks, including face detection, facial landmark detection (FLD), and head pose estimation (HPE). These tasks are interdependent: accurate FLD relies on robust face detection, and HPE is intricately associated with these key points. This paper focuses on the integration of these tasks, particularly when addressing the complexities posed by large-angle face poses. The primary contribution of this study is a real-time multi-task detection system capable of jointly detecting faces, facial landmarks, and head poses. This system builds upon the widely adopted YOLOv8 detection framework. It extends the original object detection head by incorporating an additional landmark regression head, enabling efficient localization of crucial facial landmarks. Furthermore, we optimize and enhance various modules within the original YOLOv8 framework. To validate the effectiveness and real-time performance of our proposed model, we conduct extensive experiments on the 300W-LP and AFLW2000-3D datasets. The results verify the capability of our model to tackle large-angle face pose challenges while delivering real-time performance across these interconnected tasks.
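As a rough illustration of the extension described above (a sketch under our own assumptions, not the authors' implementation), a landmark regression head can sit beside the box and objectness heads at each detection scale:

```python
# Minimal sketch of a per-scale multi-task head: box regression, face
# confidence, and K facial landmarks predicted from the same feature map.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, in_ch: int, num_landmarks: int = 5):
        super().__init__()
        self.box = nn.Conv2d(in_ch, 4, 1)                  # x, y, w, h
        self.obj = nn.Conv2d(in_ch, 1, 1)                  # face confidence
        self.lmk = nn.Conv2d(in_ch, 2 * num_landmarks, 1)  # (x, y) per landmark

    def forward(self, feat: torch.Tensor):
        return self.box(feat), self.obj(feat), self.lmk(feat)

head = MultiTaskHead(in_ch=256)
boxes, scores, landmarks = head(torch.randn(1, 256, 80, 80))
print(boxes.shape, scores.shape, landmarks.shape)
```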
https://arxiv.org/abs/2309.11773
Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. However, because the adversarial robustness problem of vision models remains unsolved, introducing vision inputs can expose MLLMs to more severe safety and security risks. In this work, we study the adversarial robustness of Google's Bard, a chatbot competitive with ChatGPT that recently released its multimodal capability, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard into outputting wrong image descriptions with a 22% success rate based solely on transferability. We show that the adversarial examples can also attack other MLLMs, e.g., with a 26% attack success rate against Bing Chat and an 86% attack success rate against ERNIE bot. Moreover, we identify two defense mechanisms of Bard, including face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, demonstrating that the current defenses of Bard are also vulnerable. We hope this work can deepen our understanding of the robustness of MLLMs and facilitate future research on defenses. Our code is available at this https URL.
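The transfer-attack recipe the abstract relies on can be sketched as follows (a hedged illustration with assumed components, not the paper's code): optimize a perturbation against a white-box surrogate encoder, then submit the result to the black-box chatbot.

```python
# PGD against a surrogate vision encoder: push the adversarial image's
# embedding away from the clean embedding, staying inside an L-inf ball.
import torch
import torch.nn.functional as F

def pgd_transfer_attack(surrogate, image, eps=8/255, alpha=1/255, steps=40):
    clean_feat = surrogate(image).detach().flatten(1)
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        sim = F.cosine_similarity(surrogate(adv).flatten(1), clean_feat).mean()
        grad = torch.autograd.grad(sim, adv)[0]
        adv = adv.detach() - alpha * grad.sign()       # descend on similarity
        adv = image + (adv - image).clamp(-eps, eps)   # project to eps-ball
        adv = adv.clamp(0, 1)
    return adv.detach()  # feed this image to the black-box MLLM
```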
https://arxiv.org/abs/2309.11751
Robot vision often involves a large computational load because large images must be processed in a short amount of time. Existing solutions often reduce image quality, which can negatively impact processing. Another approach is to generate regions of interest with expensive vision algorithms. In this paper, we evaluate how audio can be used to generate regions of interest in optical images. To achieve this, we propose a unique attention mechanism to localize speech sources and evaluate its impact on a face detection algorithm. Our results show that the attention mechanism reduces the computational load. The proposed pipeline is flexible and can be easily adapted for human-robot interaction, robot surveillance, video conferencing, or smart glasses.
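One simple way to realize the audio-to-ROI idea (our illustrative sketch with assumed geometry, not the authors' pipeline) is to map the speech source's direction of arrival to a vertical strip of the image and run face detection only inside it:

```python
import numpy as np

def doa_to_roi(azimuth_deg, img_w, img_h, hfov_deg=90.0, roi_frac=0.3):
    """Map a sound direction-of-arrival to a full-height image strip."""
    cx = img_w * (0.5 + azimuth_deg / hfov_deg)       # column of the source
    half = roi_frac * img_w / 2
    x0, x1 = int(max(0, cx - half)), int(min(img_w, cx + half))
    return x0, 0, x1, img_h

frame = np.zeros((480, 640, 3), dtype=np.uint8)       # stand-in camera frame
x0, y0, x1, y1 = doa_to_roi(azimuth_deg=15.0, img_w=640, img_h=480)
roi = frame[y0:y1, x0:x1]   # only this crop goes to the face detector
```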
https://arxiv.org/abs/2309.08005
Recent advancements in personalized image generation have unveiled the intriguing capability of pre-trained text-to-image models to learn identity information from a collection of portrait images. However, existing solutions can struggle to produce truthful details and usually suffer from several defects, such as (i) the generated face exhibits its own unique characteristics, i.e., facial shape and facial feature positioning may not resemble key characteristics of the input, and (ii) the synthesized face may contain warped, blurred, or corrupted regions. In this paper, we present FaceChain, a personalized portrait generation framework that combines a series of customized image-generation models with a rich set of face-related perceptual understanding models (e.g., face detection, deep face embedding extraction, and facial attribute recognition) to tackle the aforementioned challenges and to generate truthful personalized portraits with only a handful of portrait images as input. Concretely, we inject several SOTA face models into the generation procedure, achieving more efficient label tagging, data processing, and model post-processing than previous solutions such as DreamBooth, InstantBooth, or other LoRA-only approaches. Through the development of FaceChain, we have identified several potential directions to accelerate the development of face/human-centric AIGC research and applications. We have designed FaceChain as a framework of pluggable components that can be easily adjusted to accommodate different styles and personalized needs. We hope it can grow to serve the burgeoning needs of the communities. FaceChain is open-sourced under the Apache-2.0 license at this https URL.
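The data-processing role of the injected face models might look roughly like the following (function names are hypothetical placeholders, not the FaceChain API):

```python
def prepare_training_portraits(images, detect_face, embed_face, tag_attributes):
    """Filter and label input portraits before fine-tuning the generator.
    All three callables are hypothetical stand-ins for pre-trained face models."""
    kept, labels, ref = [], [], None
    for img in images:
        face = detect_face(img)                 # crop the dominant face
        if face is None:
            continue                            # drop images without a face
        emb = embed_face(face)                  # deep face embedding
        ref = emb if ref is None else ref
        if float(emb @ ref) < 0.5:              # drop mismatched identities
            continue
        kept.append(face)
        labels.append(tag_attributes(face))     # e.g., pose, glasses, smile
    return kept, labels
```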
https://arxiv.org/abs/2308.14256
Sparse modeling is an evident manifestation of the parsimony principle, and sparse models are widespread in statistics, physics, information sciences, neuroscience, computational mathematics, and beyond. In statistics, the many applications of sparse modeling span regression, classification tasks, graphical model selection, sparse M-estimators, and sparse dimensionality reduction. It is also particularly effective in many statistical and machine learning areas where the primary goal is to discover predictive patterns from data that would enhance our understanding and control of underlying physical, biological, and other natural processes, beyond just building accurate black-box outcome predictors. Common examples include selecting biomarkers in biological procedures, finding relevant brain activity locations that are predictive of brain states and processes based on fMRI data, and identifying the network bottlenecks that best explain end-to-end performance. Moreover, the research and applications of efficient recovery of high-dimensional sparse signals from a relatively small number of observations, which is the main focus of compressed (or compressive) sensing, have rapidly grown, becoming an intensely studied area well beyond classical signal processing. Interestingly, sparse modeling is also directly related to various artificial vision tasks, such as image denoising, segmentation, restoration and super-resolution, object or face detection and recognition in visual scenes, and action recognition. In this manuscript, we provide a brief introduction to the basic theory underlying sparse representation and compressive sensing, then discuss methods for recovering sparse solutions to optimization problems in an effective way, together with some applications of sparse recovery in a machine learning problem known as sparse dictionary learning.
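For concreteness, the two canonical recovery problems behind the compressed sensing discussion are noiseless basis pursuit and its noisy LASSO relaxation, for a measurement matrix $A \in \mathbb{R}^{m \times n}$ with $m \ll n$:

```latex
\begin{align}
  \min_{x \in \mathbb{R}^n} \; \|x\|_1
    \quad \text{s.t.} \quad Ax = y
    && \text{(basis pursuit)} \\
  \min_{x \in \mathbb{R}^n} \; \tfrac{1}{2}\,\|Ax - y\|_2^2 + \lambda \|x\|_1
    && \text{(LASSO)}
\end{align}
```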
https://arxiv.org/abs/2308.13960
This paper explores the role of eye gaze in human-robot interactions and proposes a novel system for detecting the objects a human is gazing at using solely visual feedback. The system leverages face detection, human attention prediction, and online object detection, and it allows the robot to perceive and interpret human gaze accurately, paving the way for establishing joint attention with human partners. Additionally, a novel dataset collected with the humanoid robot iCub is introduced, comprising over 22,000 images from ten participants gazing at different annotated objects. This dataset serves as a benchmark for evaluating the performance of the proposed pipeline. The paper also includes an experimental analysis of the pipeline's effectiveness in a human-robot interaction setting, examining the performance of each component. Furthermore, the developed system is deployed on the humanoid robot iCub, and a supplementary video showcases its functionality. The results demonstrate the potential of the proposed approach to enhance social awareness and responsiveness in social robotics, as well as to improve assistance and support in collaborative scenarios, promoting efficient human-robot collaboration. The code and the collected dataset will be released upon acceptance.
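A rough outline of such a pipeline (placeholder functions, not the released code) is to combine the three components and pick the detected object whose box best overlaps the predicted attention map:

```python
def gazed_object(image, detect_faces, predict_attention, detect_objects):
    """All three callables are hypothetical stand-ins for the pipeline stages."""
    faces = detect_faces(image)
    if not faces:
        return None
    heat = predict_attention(image, faces[0])   # (H, W) gaze saliency map
    objects = detect_objects(image)             # list of ((x0,y0,x1,y1), label)
    def overlap(box):
        x0, y0, x1, y1 = box
        return float(heat[y0:y1, x0:x1].sum())  # attention mass inside box
    return max(objects, key=lambda o: overlap(o[0]), default=None)
```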
https://arxiv.org/abs/2308.13318
Extracting useful visual cues for downstream tasks is especially challenging under low-light vision. Prior works create enhanced representations by either correlating visual quality with machine perception or designing illumination-degrading transformation methods that require pre-training on synthetic datasets. We argue that optimizing the enhanced image representation with respect to the loss of the downstream task can result in more expressive representations. Therefore, in this work, we propose a novel module, FeatEnHancer, that hierarchically combines multiscale features using multiheaded attention guided by a task-related loss function to create suitable representations. Furthermore, our intra-scale enhancement improves the quality of features extracted at each scale or level, and combines features from different scales in a way that reflects their relative importance for the task at hand. FeatEnHancer is a general-purpose plug-and-play module and can be incorporated into any low-light vision pipeline. We show with extensive experimentation that the enhanced representation produced with FeatEnHancer significantly and consistently improves results in several low-light vision tasks, including dark object detection (+5.7 mAP on ExDark), face detection (+1.5 mAP on DARK FACE), nighttime semantic segmentation (+5.1 mIoU on ACDC), and video object detection (+1.8 mAP on DarkVision), highlighting the effectiveness of enhancing hierarchical features under low-light vision.
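The core idea, multiscale features weighted by attention and trained only through the downstream task loss, can be sketched as follows (a simplified stand-in, not FeatEnHancer itself):

```python
# Fuse multiscale features with learned per-scale weights; the weights are
# trained end-to-end through the detection/segmentation loss, not through a
# hand-tuned enhancement objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleFusion(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.score = nn.Conv2d(ch, 1, 1)  # per-scale importance logits

    def forward(self, feats):  # feats: list of (B, ch, Hi, Wi), coarse to fine
        h, w = feats[-1].shape[-2:]
        up = [F.interpolate(f, size=(h, w), mode="bilinear",
                            align_corners=False) for f in feats]
        logits = torch.stack([self.score(f).mean(dim=(2, 3)) for f in up], dim=1)
        weights = logits.softmax(dim=1)                  # (B, num_scales, 1)
        return sum(wt.view(-1, 1, 1, 1) * f
                   for wt, f in zip(weights.unbind(1), up))

fusion = ScaleFusion(ch=64)
feats = [torch.randn(2, 64, 16, 16), torch.randn(2, 64, 32, 32),
         torch.randn(2, 64, 64, 64)]
print(fusion(feats).shape)  # torch.Size([2, 64, 64, 64])
```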
https://arxiv.org/abs/2308.03594
Although face recognition has begun to play an important role in our daily life, we must pay attention to the fact that data-driven face recognition vision systems are vulnerable to adversarial attacks. However, the current two categories of adversarial attacks, namely digital attacks and physical attacks, both have drawbacks, with the former being impractical and the latter conspicuous, computationally expensive, and hard to execute. To address these issues, we propose a practical, executable, inconspicuous, and low-computation adversarial attack based on LED illumination modulation. To fool the systems, the proposed attack generates luminance changes imperceptible to human eyes through fast intensity modulation of scene LED illumination and uses the rolling shutter effect of CMOS image sensors in face recognition systems to implant luminance information perturbations into the captured face images. In summary, we present a denial-of-service (DoS) attack for face detection and a dodging attack for face verification. We also evaluate their effectiveness against well-known face detection models (Dlib, MTCNN, and RetinaFace) and face verification models (Dlib, FaceNet, and ArcFace). The extensive experiments show that the success rates of DoS attacks against the face detection models reach 97.67%, 100%, and 100%, respectively, and the success rates of dodging attacks against all face verification models reach 100%.
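The physical mechanism can be simulated in a few lines (an assumption-laden illustration, not the authors' hardware setup): because a rolling shutter exposes rows at different instants, a fast sinusoidal LED modulation becomes horizontal luminance stripes in the captured frame.

```python
import numpy as np

def rolling_shutter_stripes(img, mod_freq_hz=300.0, row_time_s=30e-6, depth=0.25):
    """Simulate stripes from LED modulation seen through a rolling shutter.
    img: (H, W, 3) uint8; row_time_s: per-row readout delay (assumed value)."""
    rows = np.arange(img.shape[0]) * row_time_s          # exposure time per row
    gain = 1.0 + depth * np.sin(2 * np.pi * mod_freq_hz * rows)
    return np.clip(img * gain[:, None, None], 0, 255).astype(np.uint8)

frame = np.full((480, 640, 3), 128, dtype=np.uint8)
striped = rolling_shutter_stripes(frame)   # perturbed input to the detector
```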
https://arxiv.org/abs/2307.13294
DeepFake-based digital facial forgery is threatening public media security, especially as lip manipulation has been used in talking face generation, which further increases the difficulty of fake video detection. When only the lip shape is changed to match the given speech, the identity-related facial features are hard to discriminate in such fake talking face videos. Combined with the lack of attention to the audio stream as prior knowledge, failure to detect fake talking face videos becomes inevitable. We find that the optical flow of a fake talking face video is disordered, especially in the lip region, while the optical flow of a real video changes regularly, which means the motion feature derived from optical flow is useful for capturing manipulation cues. In this study, a fake talking face detection network (FTFDNet) is proposed that incorporates visual, audio, and motion features using an efficient cross-modal fusion (CMF) module. Furthermore, a novel audio-visual attention mechanism (AVAM) is proposed to discover more informative features; it can be seamlessly integrated into any audio-visual CNN architecture through modularization. With the additional AVAM, the proposed FTFDNet achieves better detection performance than other state-of-the-art DeepFake video detection methods, not only on the established fake talking face detection dataset (FTFDD) but also on the DeepFake video detection datasets DFDC and DF-TIMIT.
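The motion cue can be illustrated with a toy measurement (not FTFDNet itself): compute dense optical flow between consecutive frames and look at its statistics inside the lip region.

```python
import cv2
import numpy as np

def lip_flow_disorder(prev_gray, next_gray, lip_box):
    """Std. dev. of flow magnitude in the lip region; higher ~ more disorder."""
    x0, y0, x1, y1 = lip_box
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow[y0:y1, x0:x1], axis=-1)
    return float(mag.std())
```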
https://arxiv.org/abs/2307.03990
Machine learning models have shown increased accuracy in classification tasks when the training process incorporates human perceptual information. However, a challenge in training human-guided models is the cost associated with collecting image annotations for human salience. Collecting annotation data for all images in a large training set can be prohibitively expensive. In this work, we utilize "teacher" models (trained on a small amount of human-annotated data) to annotate additional data by means of the teacher models' saliency maps. Then, "student" models are trained using the larger amount of annotated training data. This approach makes it possible to supplement a limited number of human-supplied annotations with an arbitrarily large number of model-generated image annotations. We compare the accuracy achieved by our teacher-student training paradigm with (1) training using all available human salience annotations, and (2) using all available training data without human salience annotations. We use synthetic face detection and fake iris detection as example challenging problems, and report results across four model architectures (DenseNet, ResNet, Xception, and Inception) and two saliency estimation methods (CAM and RISE). Results show that our teacher-student training paradigm results in models that significantly exceed the performance of both baselines, demonstrating that our approach can usefully leverage a small amount of human annotations to generate salience maps for an arbitrary amount of additional training data.
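The teacher's side of the paradigm can be sketched with a plain Class Activation Map (our reading of the abstract; shapes are assumed): the teacher's saliency map for an unlabeled image becomes the pseudo-annotation used to guide student training.

```python
import torch

@torch.no_grad()
def teacher_cam(conv_features, fc_weight, class_idx):
    """Class Activation Map pseudo-annotation.
    conv_features: (B, C, H, W) from the teacher's last conv layer;
    fc_weight: (num_classes, C) classifier weights."""
    w = fc_weight[class_idx].view(1, -1, 1, 1)
    cam = torch.relu((conv_features * w).sum(dim=1, keepdim=True))  # (B,1,H,W)
    return cam / cam.amax(dim=(2, 3), keepdim=True).clamp(min=1e-8)
```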
https://arxiv.org/abs/2306.05527
In face detection, low-resolution faces, such as the numerous small faces of a human group in a crowded scene, are common in dense face prediction tasks. They usually contain limited visual clues, which makes small faces less distinguishable from other small objects and poses a great challenge to accurate face detection. Although deep convolutional neural networks have significantly advanced face detection research recently, current deep face detectors rarely take low-resolution faces into account and remain vulnerable in real-world scenarios where massive numbers of low-resolution faces exist. Consequently, they usually achieve degraded performance for low-resolution face detection. To alleviate this problem, we develop an efficient detector termed EfficientSRFace by introducing a feature-level super-resolution reconstruction network that enhances the feature representation capability of the model. This module plays an auxiliary role in the training process and can be removed during inference without increasing the inference time. Extensive experiments on public benchmark datasets such as FDDB and WIDER Face show that the embedded image super-resolution module can significantly improve detection accuracy at the cost of a small number of additional parameters and a small computational overhead, while helping our model achieve competitive performance compared with state-of-the-art methods.
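The auxiliary-branch pattern described above can be expressed schematically (toy shapes and heads, not EfficientSRFace): the super-resolution branch contributes a training loss but is skipped at inference, so it adds no inference cost.

```python
import torch
import torch.nn as nn

class DetectorWithAuxSR(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.backbone = nn.Conv2d(3, ch, 3, stride=2, padding=1)  # toy backbone
        self.det_head = nn.Conv2d(ch, 5, 1)                       # toy det head
        self.sr_head = nn.Sequential(                             # auxiliary
            nn.Conv2d(ch, 3 * 4, 3, padding=1), nn.PixelShuffle(2))

    def forward(self, x):
        feat = self.backbone(x)
        det = self.det_head(feat)
        if self.training:                    # SR branch used in training only
            return det, self.sr_head(feat)   # compare SR output to HR target
        return det                           # removed at inference
```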
https://arxiv.org/abs/2306.02277
Diffusion models have achieved promising results in image restoration tasks, yet suffer from time-consuming inference, excessive computational resource consumption, and unstable restoration. To address these issues, we propose a robust and efficient diffusion-based low-light image enhancement approach, dubbed DiffLL. Specifically, we present a wavelet-based conditional diffusion model (WCDM) that leverages the generative power of diffusion models to produce results with satisfactory perceptual fidelity. It also takes advantage of the strengths of wavelet transformation to greatly accelerate inference and reduce computational resource usage without sacrificing information. To avoid chaotic content and diversity, we perform both forward diffusion and reverse denoising in the training phase of WCDM, enabling the model to achieve stable denoising and reduced randomness during inference. Moreover, we design a high-frequency restoration module (HFRM) that utilizes the vertical and horizontal details of the image to complement the diagonal information for better fine-grained restoration. Extensive experiments on publicly available real-world benchmarks demonstrate that our method outperforms existing state-of-the-art methods both quantitatively and visually, and it achieves remarkable improvements in efficiency compared to previous diffusion-based methods. In addition, we empirically show that the application to low-light face detection also reveals the latent practical value of our method.
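The wavelet side of the design can be illustrated with PyWavelets (a simplified sketch under our assumptions): a single 2-D Haar DWT quarters the spatial area the diffusion model must process, while the detail bands are handled by a separate restoration module.

```python
import numpy as np
import pywt

img = np.random.rand(256, 256)                       # stand-in low-light image
low, (horiz, vert, diag) = pywt.dwt2(img, "haar")    # each band is 128 x 128
# ... run the conditional diffusion model on `low`; refine the detail bands
#     (the HFRM uses `horiz` and `vert` to complement `diag`) ...
restored = pywt.idwt2((low, (horiz, vert, diag)), "haar")
assert restored.shape == img.shape
```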
https://arxiv.org/abs/2306.00306
Contemporary face detection algorithms have to deal with many challenges such as variations in pose, illumination, and scale. A subclass of the face detection problem that has recently gained increasing attention is occluded face detection, or more specifically, the detection of masked faces. Three years after the advent of the COVID-19 pandemic, there is still a complete lack of evidence regarding how well existing face detection algorithms perform on masked faces. This article first offers a brief review of state-of-the-art face detectors and detectors made for the masked face problem, along with a review of the existing masked face datasets. We evaluate and compare the performance of a representative set of face detectors at masked face detection and conclude with a discussion of the possible factors contributing to their performance.
https://arxiv.org/abs/2305.11077
CNN-based face detection methods have achieved significant progress in recent years. In addition to the strong representation ability of CNNs, post-processing methods are also very important for face detection performance. In general, a face detection method predicts several candidate bounding boxes for one face. NMS is used to filter out inaccurate candidate boxes to obtain the most accurate box. The principle of NMS is to select the box with the highest score as the base box and then delete any box that has a large overlapping area with the base box but a lower score. However, the current NMS method and its improved versions do not perform well when face image quality is poor or faces appear in a cluster. In these situations, even after NMS filtering, one face often corresponds to multiple predicted boxes. To reduce this kind of negative result, in this paper we propose a new NMS method that operates in the reverse order of other NMS methods. Our method performs well on low-quality and tiny face samples. Experiments demonstrate that our method is effective as a post-processor for different face detection methods.
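For reference, the baseline greedy NMS the paper modifies is shown below; the proposed method processes candidates in the reverse order, and the abstract does not give enough detail to reproduce it exactly.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Standard greedy NMS. boxes: (N, 4) as x1, y1, x2, y2; returns kept indices."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thr]   # drop overlapping lower-scored boxes
    return keep
```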
https://arxiv.org/abs/2305.10593
The extensive utilization of biometric authentication systems has prompted attackers/imposters to forge user identities using morphed images. In this attack, a synthetic image is produced and merged with a genuine one, and the resulting image is then used for authentication. Numerous deep convolutional architectures have been proposed in the literature for face Morphing Attack Detection (MAD) to prevent such attacks and lessen the risks associated with them. Although deep learning models achieve optimal results in terms of performance, they are difficult to understand and analyse because they are black-box/opaque in nature. As a consequence, incorrect judgments may be made. There is, however, a dearth of literature explaining the decision-making of black-box deep learning models for biometric Presentation Attack Detection (PAD) or MAD that could help the biometric community trust deep learning-based biometric systems for identification and authentication in security applications such as border control and criminal database establishment. In this work, we present a novel visual explanation approach named Ensemble XAI, integrating saliency maps, Class Activation Maps (CAM), and Gradient-CAM (Grad-CAM) to provide a more comprehensive visual explanation for a deep learning prognostic model (EfficientNet-B1) that we employ to predict whether the input presented to a biometric authentication system is morphed or genuine. The experiments were performed on three publicly available datasets, namely the Face Research Lab London Set, Wide Multi-Channel Presentation Attack (WMCA), and Makeup Induced Face Spoofing (MIFS). The experimental evaluations affirm that the resulting visual explanations highlight fine-grained details of the image features/areas that EfficientNet-B1 focuses on to reach its decisions, along with appropriate reasoning.
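One plausible reading of the ensemble combination (our sketch; the exact rule is not given in the abstract) is to normalize and average the three explanation maps so that regions all methods agree on dominate:

```python
import numpy as np

def ensemble_explanation(saliency, cam, grad_cam):
    """Average three min-max-normalized explanation maps of equal shape."""
    def norm(m):
        m = m - m.min()
        return m / m.max() if m.max() > 0 else m
    return (norm(saliency) + norm(cam) + norm(grad_cam)) / 3.0
```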
https://arxiv.org/abs/2304.14509
Adversarial attacks aim to disturb the functionality of a target system by adding specific noise to the input samples, bringing potential threats to security and robustness when applied to facial recognition systems. Although existing defense techniques achieve high accuracy in detecting some specific adversarial faces (adv-faces), new attack methods, especially GAN-based attacks with completely different noise patterns, circumvent them and reach a higher attack success rate. Even worse, existing techniques require attack data before implementing the defense, making it impractical to defend against newly emerging attacks that are unseen to defenders. In this paper, we investigate the intrinsic generality of adv-faces and propose to generate pseudo adv-faces by perturbing real faces with three heuristically designed noise patterns. We are the first to train an adv-face detector using only real faces and their self-perturbations, agnostic to victim facial recognition systems and to unseen attacks. By regarding adv-faces as out-of-distribution data, we then naturally introduce a novel cascaded system for adv-face detection, which consists of training-data self-perturbations, decision boundary regularization, and a max-pooling-based binary classifier focusing on abnormal local color aberrations. Experiments conducted on the LFW and CelebA-HQ datasets with eight gradient-based and two GAN-based attacks validate that our method generalizes to a variety of unseen adversarial attacks.
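A loose sketch of training-data self-perturbation follows; the paper's three heuristic noise patterns are not specified in the abstract, so the patterns below are illustrative stand-ins.

```python
import torch
import torch.nn.functional as F

def self_perturb(faces):
    """faces: (B, 3, H, W) in [0, 1]. Returns pseudo adv-faces built from
    three stand-in noise patterns: per-pixel, blocky, and smooth."""
    b, c, h, w = faces.shape
    fine = 0.03 * torch.randn_like(faces)
    coarse = 0.05 * F.interpolate(torch.randn(b, c, h // 8, w // 8),
                                  size=(h, w), mode="nearest")
    smooth = 0.05 * F.interpolate(torch.randn(b, c, 4, 4), size=(h, w),
                                  mode="bilinear", align_corners=False)
    return (faces + fine + coarse + smooth).clamp(0, 1)
# A binary detector trained on real vs. self_perturb(real) never sees a
# true attack, which is the attack-agnostic property the abstract claims.
```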
https://arxiv.org/abs/2304.11359
In this work, we investigate the potential threat of adversarial examples to the security of face recognition systems (FRSs). Although previous research has explored the adversarial risk to individual components of FRSs, our study presents an initial exploration of an adversary simultaneously fooling multiple components: the face detector and the feature extractor in an FRS pipeline. We propose three multi-objective attacks on FRSs and demonstrate their effectiveness through a preliminary experimental analysis on a target system. Our attacks achieved up to 100% attack success rates against both the face detector and the feature extractor and were able to manipulate the face detection probability by up to 50% depending on the adversarial objective. This research identifies and examines novel attack vectors against FRSs and suggests possible ways to augment their robustness by leveraging knowledge of the attack vectors during training of an FRS's components.
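A multi-objective adversarial loss of the kind investigated here could be formulated as follows (our sketch with assumed model stubs, not the paper's exact objectives): a single perturbation is optimized to suppress the detector's confidence and shift the identity embedding at once.

```python
import torch
import torch.nn.functional as F

def multi_objective_loss(adv, clean, detector, extractor, w_det=1.0, w_feat=1.0):
    """Joint objective to minimize: detector confidence plus identity
    similarity. `detector` and `extractor` are assumed differentiable stubs."""
    det_conf = detector(adv).max()                    # highest face score
    feat_sim = F.cosine_similarity(extractor(adv).flatten(1),
                                   extractor(clean).flatten(1)).mean()
    return w_det * det_conf + w_feat * feat_sim
```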
https://arxiv.org/abs/2304.05048
Real-time eyeblink detection in the wild can serve a wide range of applications, such as fatigue detection, face anti-spoofing, and emotion analysis. Existing research efforts generally focus on single-person cases in trimmed videos. However, the multi-person scenario in untrimmed videos is also important for practical applications, and it has not yet received much attention. To address this, we shed light on this research field for the first time with essential contributions to datasets, theory, and practice. In particular, a large-scale dataset termed MPEblink, involving 686 untrimmed videos with 8748 eyeblink events, is proposed under multi-person conditions. The samples are captured from unconstrained films to reveal "in the wild" characteristics. Meanwhile, a real-time multi-person eyeblink detection method is also proposed. Unlike existing counterparts, our method runs in a one-stage spatio-temporal manner with end-to-end learning capacity. Specifically, it simultaneously addresses the sub-tasks of face detection, face tracking, and human instance-level eyeblink detection. This paradigm holds two main advantages: (1) eyeblink features can be facilitated via the face's global context (e.g., head pose and illumination condition) with joint optimization and interaction, and (2) addressing these sub-tasks in parallel rather than sequentially saves a remarkable amount of time, meeting the real-time running requirement. Experiments on MPEblink verify the essential challenges of real-time multi-person eyeblink detection in the wild for untrimmed videos. Our method also outperforms existing approaches by large margins and with a high inference speed.
https://arxiv.org/abs/2303.16053
The performance of convolutional neural networks has continued to improve over the last decade. At the same time, as model complexity grows, it becomes increasingly difficult to explain model decisions. Such explanations may be of critical importance for reliable operation of human-machine pairing setups, or for model selection when the "best" model among many equally accurate models must be established. Saliency maps represent one popular way of explaining model decisions by highlighting the image regions models deem important when making a prediction. However, examining salience maps at scale is not practical. In this paper, we propose five novel methods of leveraging model salience to explain model behavior at scale. These methods ask: (a) what is the average entropy of a model's salience maps, (b) how does model salience change when the model is fed out-of-set samples, (c) how closely does model salience follow geometrical transformations, (d) what is the stability of model salience across independent training runs, and (e) how does model salience react to salience-guided image degradations. To assess the proposed measures on a concrete and topical problem, we conducted a series of experiments for the task of synthetic face detection with two types of models: those trained traditionally with cross-entropy loss, and those guided by human salience during training to increase model generalizability. These two types of models are characterized by different, interpretable properties of their salience maps, which allows for evaluation of the correctness of the proposed measures. We offer source code for each measure along with this paper.
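Measure (a) can be made concrete in a few lines (our formulation of the idea): treat each normalized salience map as a probability distribution over pixels and average the Shannon entropy across maps.

```python
import numpy as np

def salience_entropy(salience_maps):
    """salience_maps: iterable of non-negative (H, W) arrays."""
    entropies = []
    for s in salience_maps:
        p = s.ravel().astype(np.float64)
        p = p / p.sum() if p.sum() > 0 else np.full(p.size, 1.0 / p.size)
        p = p[p > 0]
        entropies.append(-(p * np.log2(p)).sum())
    return float(np.mean(entropies))

maps = [np.random.rand(7, 7) for _ in range(10)]
print(salience_entropy(maps))   # lower entropy ~ more focused salience
```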
https://arxiv.org/abs/2303.11969
Recently, conditional diffusion models have gained popularity in numerous applications due to their exceptional generation ability. However, many existing methods require training: a time-dependent classifier or a condition-dependent score estimator must be trained, which increases the cost of constructing conditional diffusion models and makes them inconvenient to transfer across different conditions. Some current works aim to overcome this limitation by proposing training-free solutions, but most can only be applied to a specific category of tasks and not to more general conditions. In this work, we propose a training-Free conditional Diffusion Model (FreeDoM) usable under various conditions. Specifically, we leverage off-the-shelf pre-trained networks, such as a face detection model, to construct time-independent energy functions, which guide the generation process without requiring training. Furthermore, because the construction of the energy function is very flexible and adaptable to various conditions, our proposed FreeDoM has a broader range of applications than existing training-free methods. FreeDoM is advantageous in its simplicity, effectiveness, and low cost. Experiments demonstrate that FreeDoM is effective for various conditions and suitable for diffusion models of diverse data domains, including image and latent code domains.
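In the spirit of this approach, an energy-guided reverse step might look like the following hedged sketch (the stubs `unguided_step` and `energy_fn` are assumptions): the gradient of a frozen network's energy steers each denoising update, with no training anywhere.

```python
import torch

def energy_guided_step(unguided_step, energy_fn, x_t, t, scale=1.0):
    """unguided_step(x_t, t) -> x_{t-1} under the frozen diffusion model;
    energy_fn(x_t, t) -> scalar measuring mismatch with the condition,
    e.g., a distance computed by a frozen face detection model."""
    x_t = x_t.detach().requires_grad_(True)
    energy = energy_fn(x_t, t)
    grad = torch.autograd.grad(energy, x_t)[0]
    with torch.no_grad():
        x_prev = unguided_step(x_t, t) - scale * grad  # steer toward low energy
    return x_prev.detach()
```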
https://arxiv.org/abs/2303.09833