Abstract
Face presentation attacks (FPA), also known as face spoofing, have raised increasing public concern through various malicious applications, such as financial fraud and privacy leakage. Safeguarding face recognition systems against FPA is therefore of utmost importance. Although existing learning-based face anti-spoofing (FAS) models can achieve outstanding detection performance, they lack generalization capability and suffer significant performance drops in unforeseen environments. Many methodologies seek to use auxiliary modality data (e.g., depth and infrared maps) during presentation attack detection (PAD) to address this limitation. However, these methods are limited because (1) they require specific sensors such as depth and infrared cameras for data capture, which are rarely available on commodity mobile devices, and (2) they cannot work properly in practical scenarios when either modality is missing or of poor quality. In this paper, we devise an accurate and robust MultiModal Mobile Face Anti-Spoofing system named M3FAS to overcome the issues above. The innovation of this work mainly lies in the following aspects: (1) To achieve robust PAD, our system combines visual and auditory modalities using three pervasively available sensors: camera, speaker, and microphone; (2) We design a novel two-branch neural network with three hierarchical feature aggregation modules to perform cross-modal feature fusion; (3) We propose a multi-head training strategy: the model outputs three predictions from the vision, acoustic, and fusion heads, enabling more flexible PAD. Extensive experiments have demonstrated the accuracy, robustness, and flexibility of M3FAS under various challenging experimental settings.
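The abstract alone does not specify the network internals, but the two-branch, three-head design can be illustrated with a minimal NumPy sketch. Everything below is hypothetical: the feature dimensions, the single concatenation-based fusion stage (the paper describes three hierarchical feature aggregation modules), and all layer shapes are stand-ins chosen only to show how one forward pass can yield separate vision, acoustic, and fusion predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions; not specified in the abstract.
D_IMG, D_AUD, D_HID = 64, 32, 16

# Randomly initialised weights standing in for trained branch/fusion layers.
w_v = rng.normal(size=(D_IMG, D_HID))          # vision branch
w_a = rng.normal(size=(D_AUD, D_HID))          # acoustic branch
w_f = rng.normal(size=(2 * D_HID, D_HID))      # fusion layer
w_hv, w_ha, w_hf = (rng.normal(size=D_HID) for _ in range(3))  # three heads


def relu_linear(x, w):
    """Affine map + ReLU, a toy stand-in for a real network block."""
    return np.maximum(x @ w, 0.0)


def head(feat, w):
    """Binary live/spoof score: sigmoid over a linear projection."""
    return 1.0 / (1.0 + np.exp(-(feat @ w)))


def m3fas_forward(img_feat, aud_feat):
    """Two-branch forward pass returning three predictions
    (vision head, acoustic head, fusion head)."""
    fv = relu_linear(img_feat, w_v)
    fa = relu_linear(aud_feat, w_a)
    ff = relu_linear(np.concatenate([fv, fa]), w_f)  # cross-modal fusion
    return head(fv, w_hv), head(fa, w_ha), head(ff, w_hf)


p_vision, p_acoustic, p_fusion = m3fas_forward(
    rng.normal(size=D_IMG), rng.normal(size=D_AUD))
```

Keeping per-modality heads alongside the fusion head is what gives the flexibility the abstract claims: if either the camera or the microphone input is missing or degraded at inference time, the corresponding single-modality prediction can still be used.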
URL
https://arxiv.org/abs/2301.12831