Recent advancements in deep learning have revolutionized technology and security measures, necessitating robust identification methods. Biometric approaches, leveraging personalized characteristics, offer a promising solution. However, Face Recognition Systems are vulnerable to sophisticated attacks, notably face morphing techniques, which enable the creation of fraudulent documents. In this study, we introduce a novel quadruplet loss function to increase the robustness of face recognition systems against morphing attacks. Our approach involves specific sampling of face image quadruplets, combined with face morphs, for network training. Experimental results demonstrate the efficacy of our strategy in improving the robustness of face recognition networks against morphing attacks.
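The abstract does not give the exact form of the quadruplet loss. The following is a minimal numpy sketch of one plausible formulation, assuming a hinge-style loss on L2-normalized embeddings that pulls an anchor toward a genuine mate while pushing it away from both a morph and an unrelated identity; the margins and embedding dimensions are illustrative, not values from the paper.

```python
import numpy as np

def quadruplet_morph_loss(anchor, positive, morph, negative, m1=0.5, m2=0.5):
    """Hinge-style quadruplet loss on L2-normalized embeddings.

    Pulls the anchor toward a genuine mate (positive) while pushing it
    away from a face morph and an unrelated identity (negative).
    The margins m1/m2 are illustrative choices, not from the paper.
    """
    def d(a, b):
        # Squared Euclidean distance between embeddings.
        return np.sum((a - b) ** 2)

    morph_term = max(0.0, d(anchor, positive) - d(anchor, morph) + m1)
    neg_term = max(0.0, d(anchor, positive) - d(anchor, negative) + m2)
    return morph_term + neg_term

def unit(v):
    return v / np.linalg.norm(v)

# Toy embeddings (unit-normalized vectors for illustration).
rng = np.random.default_rng(0)
a = unit(rng.normal(size=8))
p = unit(a + 0.05 * rng.normal(size=8))  # same identity, small perturbation
m = unit(a + 0.8 * rng.normal(size=8))   # morph: partially shares the identity
n = unit(rng.normal(size=8))             # different identity

loss = quadruplet_morph_loss(a, p, m, n)
```

In a training loop, this scalar would be averaged over sampled quadruplets and backpropagated through the embedding network.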
AI-based Face Recognition Systems (FRSs) are now widely distributed and deployed as MLaaS solutions all over the world, more so since the COVID-19 pandemic, for tasks ranging from validating individuals' faces while buying SIM cards to the surveillance of citizens. Extensive biases against marginalized groups have been reported in these systems and have led to highly discriminatory outcomes. The post-pandemic world has normalized the wearing of face masks, but FRSs have not kept up with the changing times. As a result, these systems are susceptible to mask-based face occlusion. In this study, we audit four commercial and nine open-source FRSs for the task of face re-identification between different varieties of masked and unmasked images across five benchmark datasets (14,722 images in total). These simulate a realistic validation/surveillance task as deployed in all major countries around the world. Three of the commercial and five of the open-source FRSs are highly inaccurate; they further perpetuate biases against non-White individuals, with the lowest accuracy being 0%. A survey on the same task with 85 human participants also yields a low accuracy of 40%. Thus, human-in-the-loop moderation in the pipeline does not alleviate these concerns, as has frequently been hypothesized in the literature. Our large-scale study shows that developers, lawmakers, and users of such services need to rethink the design principles behind FRSs, especially for the task of face re-identification, taking cognizance of the observed biases.
Recently, there has been an explosion of mobile applications that perform computationally intensive tasks such as video streaming, data mining, virtual reality, augmented reality, image processing, video processing, face recognition, and online gaming. However, user devices (UDs), such as tablets and smartphones, have only a limited ability to meet the computation needs of these tasks. Mobile edge computing (MEC) has emerged as a promising technology to meet the increasing computing demands of UDs. Task offloading in MEC is a strategy that meets the demands of UDs by distributing tasks between UDs and MEC servers. Deep reinforcement learning (DRL) is gaining attention in task-offloading problems because it can adapt to dynamic changes and minimize online computational complexity. However, the various types of continuous and discrete resource constraints on UDs and MEC servers pose challenges to the design of an efficient DRL-based task-offloading strategy. Existing DRL-based task-offloading algorithms focus on the constraints of the UDs, assuming the availability of sufficient storage resources on the server. Moreover, existing multiagent DRL (MADRL)-based task-offloading algorithms use homogeneous agents and treat homogeneous constraints as a penalty in their reward function. We propose a novel combinatorial client-master MADRL (CCM\_MADRL) algorithm for task offloading in MEC (CCM\_MADRL\_MEC) that enables UDs to decide their resource requirements and the server to make a combinatorial decision based on the requirements of the UDs. CCM\_MADRL\_MEC is the first MADRL algorithm for task offloading to consider server storage capacity in addition to the constraints of the UDs. By taking advantage of combinatorial action selection, CCM\_MADRL\_MEC shows superior convergence over existing MADDPG and heuristic algorithms.
Deep neural networks are extensively applied to real-world tasks, such as face recognition and medical image classification, where privacy and data protection are critical. Image data, if not protected, can be exploited to infer personal or contextual information. Existing privacy preservation methods, like encryption, generate perturbed images that are unrecognizable even to humans. Adversarial attack approaches prohibit automated inference even for authorized stakeholders, limiting practical incentives for commercial and widespread adoption. This pioneering study tackles an unexplored practical privacy preservation use case by generating human-perceivable images that maintain accurate inference by an authorized model while evading other unauthorized black-box models with similar or dissimilar objectives, and addresses the previous research gaps. The datasets employed are ImageNet for image classification, CelebA-HQ for identity classification, and AffectNet for emotion classification. Our results show that the generated images can successfully maintain the accuracy of a protected model and degrade the average accuracy of the unauthorized black-box models to 11.97%, 6.63%, and 55.51% on the ImageNet, CelebA-HQ, and AffectNet datasets, respectively.
This paper introduces the Membership Inference Test (MINT), a novel approach that aims to empirically assess whether specific data were used during the training of Artificial Intelligence (AI) models. Specifically, we propose two novel MINT architectures designed to learn the distinct activation patterns that emerge when an audited model is exposed to data used during its training process. The first architecture is based on a Multilayer Perceptron (MLP) network, and the second is based on Convolutional Neural Networks (CNNs). The proposed MINT architectures are evaluated on a challenging face recognition task, considering three state-of-the-art face recognition models. Experiments are carried out using six publicly available databases, comprising over 22 million face images in total. Additionally, different experimental scenarios are considered depending on the information available about the audited AI model. Promising results, up to 90% accuracy, are achieved using our proposed MINT approach, suggesting that it is possible to recognize whether an AI model has been trained with specific data.
Ensuring robustness in face recognition systems across various challenging conditions is crucial for their versatility. State-of-the-art methods often incorporate additional information, such as depth, thermal, or angular data, to enhance performance. However, light field-based face recognition approaches that leverage angular information face computational limitations. This paper investigates the fundamental trade-off between spatial and angular resolution in light field representation to achieve improved face recognition performance. By utilizing macro-pixels with varying angular resolutions while maintaining the overall image size, we aim to quantify the impact of angular information gained at the expense of spatial resolution, while considering computational constraints. Our experimental results demonstrate a notable performance improvement in face recognition systems when increasing the angular resolution, up to a certain extent, at the cost of spatial resolution.
The recognition performance of biometric systems strongly depends on the quality of the compared biometric samples. Motivated by the goal of establishing a common understanding of face image quality and enabling system interoperability, the committee draft of ISO/IEC 29794-5 introduces expression neutrality as one of many component quality elements affecting recognition performance. In this study, we train classifiers to assess facial expression neutrality using seven datasets. We conduct extensive performance benchmarking to evaluate their classification and face recognition utility prediction abilities. Our experiments reveal significant differences in how each classifier distinguishes "neutral" from "non-neutral" expressions. While Random Forests and AdaBoost classifiers are most suitable for distinguishing neutral from non-neutral facial expressions with high accuracy, they underperform compared to Support Vector Machines in predicting face recognition utility.
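The study above compares three standard classifier families for labeling expressions as neutral or non-neutral. The following is a minimal scikit-learn sketch of that comparison; the random feature vectors stand in for the real face-image descriptors and datasets used in the study, so the scores here say nothing about the paper's actual findings.

```python
# Hedged sketch: train the three classifier families compared in the
# study on synthetic placeholder features for "neutral" vs "non-neutral".
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Two synthetic classes standing in for neutral (0) / non-neutral (1).
X = np.vstack([rng.normal(0.0, 1.0, (200, 16)),
               rng.normal(1.5, 1.0, (200, 16))])
y = np.array([0] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for name, clf in [("RandomForest", RandomForestClassifier(random_state=0)),
                  ("AdaBoost", AdaBoostClassifier(random_state=0)),
                  ("SVM", SVC(random_state=0))]:
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)  # held-out classification accuracy
```

Comparing held-out accuracy, as here, covers only the classification ability; the study's second criterion, face recognition utility prediction, requires scoring against a downstream recognition system.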
This paper introduces DogSurf, a new approach to using quadruped robots to help visually impaired people navigate the real world. The presented method allows the quadruped robot to detect slippery surfaces and to use audio and haptic feedback to inform the user when to stop. A state-of-the-art GRU-based neural network architecture with a mean accuracy of 99.925% is proposed for the task of multiclass surface classification for quadruped robots. A dataset was collected on a Unitree Go1 Edu robot. The dataset and code have been released to the public domain.
This study investigates the possibility of mitigating the demographic biases that affect face recognition technologies through the use of synthetic data. Demographic biases have the potential to impact individuals from specific demographic groups and can be identified by observing the disparate performance of face recognition systems across demographic groups. They primarily arise from the unequal representation of demographic groups in the training data. In recent times, synthetic data have emerged as a solution to some of the problems that affect face recognition systems. In particular, during the generation process it is possible to specify the desired demographic and facial attributes of the images in order to control the demographic distribution of the synthesized dataset and fairly represent the different demographic groups. We propose fine-tuning existing face recognition systems that exhibit demographic biases with synthetic data. We use synthetic datasets generated with GANDiffFace, a novel framework able to synthesize datasets for face recognition with a controllable demographic distribution and realistic intra-class variations. We consider multiple datasets representing different demographic groups for training and evaluation, fine-tune different face recognition systems, and evaluate their demographic fairness with different metrics. Our results support the proposed approach and the use of synthetic data to mitigate demographic biases in face recognition.
Recent works have demonstrated the feasibility of inverting face recognition systems, enabling the recovery of convincing face images using only their embeddings. We leverage such template inversion models to develop a novel type of deep morphing attack based on inverting a theoretically optimal morph embedding, obtained as the average of the face embeddings of the source images. We experiment with two variants of this approach: the first exploits a fully self-contained embedding-to-image inversion model, while the second leverages the synthesis network of a pretrained StyleGAN for increased morph realism. We generate morphing attacks from several source datasets and study the effectiveness of these attacks against several face recognition networks. We show that our method can compete with, and regularly beat, the previous state of the art for deep-learning-based morph generation in terms of effectiveness, in both white-box and black-box attack scenarios, while additionally being much faster to run. We hope this will facilitate the development of large-scale deep morph datasets for training detection models.
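The morph embedding described above is simply the average of the source embeddings; the inversion network that maps it back to an image is not reproduced here. A minimal numpy sketch of the embedding step, assuming L2-normalized embeddings:

```python
import numpy as np

def optimal_morph_embedding(emb_a, emb_b):
    """Average two L2-normalized face embeddings and re-normalize.

    This is the 'theoretical optimal morph' target from the abstract;
    a template inversion model would then map it back to a face image.
    """
    avg = (emb_a + emb_b) / 2.0
    return avg / np.linalg.norm(avg)

rng = np.random.default_rng(1)
e1 = rng.normal(size=512); e1 /= np.linalg.norm(e1)
e2 = rng.normal(size=512); e2 /= np.linalg.norm(e2)
morph = optimal_morph_embedding(e1, e2)

# By construction the morph is equally similar (cosine) to both source
# embeddings, which is what makes the attack symmetric across subjects.
cos1 = float(morph @ e1)
cos2 = float(morph @ e2)
```

The symmetry property (equal cosine similarity to both sources) is the reason this average is "optimal" for a two-subject morph under a cosine-distance matcher.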
Face Recognition (FR) systems can suffer from both physical (e.g., printed photo) and digital (e.g., DeepFake) attacks. However, previous related work rarely considers both situations at the same time, which implies deploying multiple models and thus a greater computational burden. This lack of an integrated model stems from two factors: (1) the lack of a dataset including both physical and digital attacks with ID consistency, i.e., where the same ID covers the real face and all attack types; and (2) the large intra-class variance between these two attack types, which makes it difficult to learn a compact feature space that detects both attacks simultaneously. To address these issues, we collect a Unified physical-digital Attack dataset, called UniAttackData. The dataset covers 1,800 participants, each subjected to 2 physical and 12 digital attack types, resulting in a total of 29,706 videos. We then propose a Unified Attack Detection framework based on Vision-Language Models (VLMs), namely UniAttackDetection, which includes three main modules: the Teacher-Student Prompts (TSP) module, focused on acquiring unified and specific knowledge, respectively; the Unified Knowledge Mining (UKM) module, designed to capture a comprehensive feature space; and the Sample-Level Prompt Interaction (SLPI) module, aimed at grasping sample-level semantics. These three modules seamlessly form a robust unified attack detection framework. Extensive experiments on UniAttackData and three other datasets demonstrate the superiority of our approach for unified face attack detection.
In this paper, we propose a novel approach to conducting face morphing attacks that utilizes optimal-landmark-guided image blending. Current face morphing attacks can be categorized into landmark-based and generation-based approaches. Landmark-based methods use geometric transformations to warp facial regions according to averaged landmarks but often produce morphed images with poor visual quality. Generation-based methods, which employ generative models to blend multiple face images, can achieve better visual quality but are often unsuccessful in generating morphed images that can effectively evade state-of-the-art face recognition systems (FRSs). Our proposed method overcomes the limitations of previous approaches by optimizing the morphing landmarks and using Graph Convolutional Networks (GCNs) to combine landmark and appearance features. We model facial landmarks as the nodes of a fully connected bipartite graph and utilize GCNs to simulate their spatial and structural relationships. The aim is to capture variations in facial shape and enable accurate manipulation of facial appearance features during the warping process, resulting in morphed facial images that are highly realistic and visually faithful. Experiments on two public datasets show that our method inherits the advantages of previous landmark-based and generation-based methods and generates morphed images of higher quality, posing a more significant threat to state-of-the-art FRSs.
In this study, we harness the information-theoretic Privacy Funnel (PF) model to develop a method for privacy-preserving representation learning within an end-to-end training framework. We rigorously address the trade-off between obfuscation and utility, both quantified through the logarithmic loss, a measure also recognized as self-information loss. This exploration deepens the interplay between information-theoretic privacy and representation learning, offering substantive insights into data protection mechanisms for both discriminative and generative models. Importantly, we apply our model to state-of-the-art face recognition systems. The model demonstrates adaptability across diverse inputs, from raw facial images to derived or refined embeddings, and is competent in tasks such as classification, reconstruction, and generation.
Various face image datasets intended for facial biometrics research were created via web-scraping, i.e., the collection of images publicly available on the internet. This work presents an approach to detect both exactly and nearly identical face image duplicates, using file and image hashes. The approach is extended through the use of face image preprocessing. Additional steps based on face recognition and face image quality assessment models reduce false positives and facilitate the deduplication of face images for both intra- and inter-subject duplicate sets. The presented approach is applied to five datasets, namely LFW, TinyFace, Adience, CASIA-WebFace, and C-MS-Celeb (a cleaned MS-Celeb-1M variant). Duplicates are detected within every dataset, with hundreds to hundreds of thousands of duplicates in all except LFW. Face recognition and quality assessment experiments indicate that duplicate removal has only a minor impact on results. The final deduplication data is publicly available.
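The two hash types combined above can be sketched with the standard library and numpy: a cryptographic file hash catches exact duplicates, while a perceptual average hash (aHash, one common choice; the paper does not commit to a specific image hash) catches near-duplicates. Real use would add the face alignment/preprocessing step the paper describes before hashing; the arrays below stand in for grayscale face images.

```python
# Hedged sketch of exact and near-duplicate detection via hashing.
import hashlib
import numpy as np

def file_hash(data: bytes) -> str:
    """Exact-duplicate check: identical bytes give identical digests."""
    return hashlib.sha256(data).hexdigest()

def average_hash(gray: np.ndarray, size: int = 8) -> int:
    """Near-duplicate check: block-average down to size x size, then
    threshold at the mean; the bit pattern survives small edits."""
    h, w = gray.shape
    blocks = gray[: h - h % size, : w - w % size]
    blocks = blocks.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    bits = (blocks > blocks.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    """Number of differing bits; small values indicate near-duplicates."""
    return bin(a ^ b).count("1")

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (64, 64)).astype(float)
near_dup = img + rng.normal(0, 2.0, img.shape)        # slightly perturbed copy
other = rng.integers(0, 256, (64, 64)).astype(float)  # unrelated image

d_same = hamming(average_hash(img), average_hash(near_dup))
d_diff = hamming(average_hash(img), average_hash(other))
```

In a deduplication pipeline, pairs below a Hamming-distance threshold would then be passed to the face recognition and quality assessment stages to filter false positives.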
Face recognition technology has been deployed in various real-life applications. The most sophisticated deep learning-based face recognition systems rely on training millions of face images through complex deep neural networks to achieve high accuracy. It is quite common for clients to upload face images to the service provider in order to access the model inference. However, the face image is a type of sensitive biometric attribute tied to the identity information of each user. Directly exposing the raw face image to the service provider poses a threat to the user's privacy. Current privacy-preserving approaches to face recognition focus on either concealing visual information on model input or protecting model output face embedding. The noticeable drop in recognition accuracy is a pitfall for most methods. This paper proposes a hybrid frequency-color fusion approach to reduce the input dimensionality of face recognition in the frequency domain. Moreover, sparse color information is also introduced to alleviate significant accuracy degradation after adding differential privacy noise. Besides, an identity-specific embedding mapping scheme is applied to protect original face embedding by enlarging the distance among identities. Lastly, secure multiparty computation is implemented for safely computing the embedding distance during model inference. The proposed method performs well on multiple widely used verification datasets. Moreover, it has around 2.6% to 4.2% higher accuracy than the state-of-the-art in the 1:N verification scenario.
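The abstract above combines a frequency-domain dimensionality reduction with differential privacy noise. The numpy sketch below illustrates only that general combination under stated assumptions: a 2-D FFT stands in for the paper's unspecified frequency transform, the low-frequency crop for its dimensionality reduction, and the sensitivity and epsilon values are illustrative, not from the paper.

```python
# Hedged sketch: frequency-domain reduction plus Laplace DP noise.
import numpy as np

def frequency_reduce(img: np.ndarray, keep: int = 8) -> np.ndarray:
    """Keep only the lowest keep x keep frequency coefficients,
    reducing the input dimensionality before release."""
    spec = np.fft.fft2(img)
    return spec[:keep, :keep]

def laplace_dp(coeffs: np.ndarray, sensitivity: float, eps: float,
               rng: np.random.Generator) -> np.ndarray:
    """Add Laplace(sensitivity/eps) noise to real and imaginary parts,
    the standard eps-DP mechanism for a bounded-sensitivity output."""
    scale = sensitivity / eps
    noise = (rng.laplace(0.0, scale, coeffs.shape)
             + 1j * rng.laplace(0.0, scale, coeffs.shape))
    return coeffs + noise

rng = np.random.default_rng(0)
face = rng.random((32, 32))                # placeholder grayscale face
reduced = frequency_reduce(face)           # 1024 -> 64 complex coefficients
private = laplace_dp(reduced, sensitivity=1.0, eps=2.0, rng=rng)
```

The paper's sparse color channel, embedding mapping, and secure multiparty computation stages sit on top of a reduction like this and are not sketched here.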
Face recognition (FR) has been applied to nearly every aspect of daily life, but it is always accompanied by the underlying risk of leaking private information. At present, almost all attack models against FR rely heavily on the presence of a classification layer. In practice, however, an FR model obtains complex features of the input via the model backbone and then compares them with the target for inference, which does not explicitly involve the outputs of a classification layer adopting logit or other losses. In this work, we advocate a novel inference attack composed of two stages for practical FR models without a classification layer. The first stage is a membership inference attack. Specifically, we analyze the distances between the intermediate features and the batch normalization (BN) parameters. The results indicate that this distance is a critical metric for membership inference. We thus design a simple but effective attack model that can determine whether a face image comes from the training dataset. The second stage is a model inversion attack, where sensitive private data is reconstructed using a pre-trained generative adversarial network (GAN) guided by the attack model from the first stage. To the best of our knowledge, the proposed attack model is the first in the literature developed for FR models without a classification layer. We illustrate the application of the proposed attack model to the establishment of privacy-preserving FR techniques.
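The first-stage signal described above, distance between intermediate features and BN statistics, can be sketched in a few lines of numpy. The Mahalanobis-style distance and the fixed threshold below are illustrative assumptions; the paper trains an attack model on this signal rather than hand-picking a threshold.

```python
# Hedged sketch of BN-statistics-based membership inference.
import numpy as np

def bn_distance(features: np.ndarray, bn_mean: np.ndarray,
                bn_var: np.ndarray) -> float:
    """Variance-normalized squared distance of a feature vector to the
    BN running statistics; training members tend to sit closer."""
    return float(np.mean((features - bn_mean) ** 2 / (bn_var + 1e-5)))

def infer_membership(features, bn_mean, bn_var, threshold=1.5) -> bool:
    """Illustrative threshold rule standing in for the learned attack model."""
    return bn_distance(features, bn_mean, bn_var) < threshold

# Toy BN statistics and feature vectors: a "member" drawn from the BN
# distribution and a "non-member" from a shifted one.
rng = np.random.default_rng(0)
bn_mean = np.zeros(128)
bn_var = np.ones(128)
member = rng.normal(0.0, 1.0, 128)
non_member = rng.normal(2.0, 1.0, 128)
```

In the paper's second stage, this membership signal guides a GAN-based model inversion, which is not sketched here.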
Face recognition systems are widely deployed in high-security applications such as biometric verification at border controls. Despite their high accuracy on pristine data, it is well known that digital manipulations, such as face morphing, pose a security threat to face recognition systems. Malicious actors can exploit the facilities offered by the identity document issuance process to obtain identity documents containing morphed images. Thus, subjects who contributed to the creation of the morphed image can, with high probability, use the identity document to bypass automated face recognition systems. In recent years, no-reference (i.e., single-image) and differential morphing attack detectors have been proposed to tackle this risk. These systems are typically evaluated in isolation from the face recognition system with which they have to operate jointly, and they do not consider the face recognition process. Contrary to most existing works, we present a novel method for adapting deep learning-based face recognition systems to be more robust against face morphing attacks. To this end, we introduce TetraLoss, a novel loss function that learns to separate morphed face images from their contributing subjects in the embedding space while still preserving high biometric verification performance. In a comprehensive evaluation, we show that the proposed method can significantly enhance the original system while also significantly outperforming other tested baseline methods.
Adversarial attacks on Face Recognition (FR) encompass two types: impersonation attacks and dodging (evasion) attacks. We observe that a successful impersonation attack on FR does not necessarily ensure a successful dodging attack on FR in the black-box setting. We introduce a novel attack method named Pre-training Pruning Restoration Attack (PPR), which aims to enhance the performance of dodging attacks while avoiding the degradation of impersonation attacks. Our method employs adversarial example pruning, which sets a portion of the adversarial perturbations to zero while tending to maintain the attack performance. By utilizing adversarial example pruning, we can prune the pre-trained adversarial examples and selectively free up certain adversarial perturbations. Thereafter, we embed new adversarial perturbations in the pruned area, which enhances the dodging performance of the adversarial face examples. The effectiveness of our proposed attack method is demonstrated through our experimental results, showcasing its superior performance.
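The pruning step described above can be sketched concisely; the numpy code below shows only magnitude-based pruning of a perturbation (one plausible criterion; the paper's exact pruning rule and the subsequent restoration optimization are not reproduced here).

```python
# Hedged sketch of adversarial-perturbation pruning: zero the smallest
# entries to free up perturbation budget for later restoration.
import numpy as np

def prune_perturbation(delta: np.ndarray, ratio: float) -> np.ndarray:
    """Zero the smallest-magnitude fraction `ratio` of the perturbation."""
    flat = np.abs(delta).flatten()
    k = int(ratio * flat.size)
    if k == 0:
        return delta.copy()
    thresh = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = delta.copy()
    pruned[np.abs(delta) <= thresh] = 0.0
    return pruned

rng = np.random.default_rng(0)
delta = rng.normal(0.0, 0.03, (8, 8))      # toy pre-trained perturbation
pruned = prune_perturbation(delta, ratio=0.5)
zero_frac = float(np.mean(pruned == 0.0))
```

In PPR, new dodging-oriented perturbations would then be optimized only within the zeroed positions, leaving the surviving impersonation perturbation intact.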
Face anti-spoofing is crucial for ensuring the security and reliability of face recognition systems. Several existing face anti-spoofing methods utilize GAN-like networks to detect presentation attacks by estimating the noise pattern of a spoof image and recovering the corresponding genuine image. However, the limited face appearance space of GANs means that the denoised faces cannot cover the full data distribution of genuine faces, thereby undermining the generalization performance of such methods. In this work, we present a pioneering attempt to employ diffusion models to denoise a spoof image and restore the genuine image. The difference between these two images is considered the spoof noise, which can serve as a discriminative cue for face anti-spoofing. We evaluate our proposed method on several intra-testing and inter-testing protocols, where the experimental results showcase the effectiveness of our method in achieving competitive performance in terms of both accuracy and generalization.
Cutting-edge research in facial expression recognition (FER) currently favors convolutional neural network (CNN) backbones pre-trained in a supervised manner on face recognition datasets for feature extraction. However, due to the vast scale of face recognition datasets and the high cost associated with collecting facial labels, this pre-training paradigm incurs significant expense. To this end, we propose pre-training vision Transformers (ViTs) through a self-supervised approach on a mid-scale general image dataset. In addition, compared with the domain disparity between face datasets and FER datasets, the divergence between general datasets and FER datasets is more pronounced. Therefore, we propose a contrastive fine-tuning approach to effectively mitigate this domain disparity. Specifically, we introduce a novel FER training paradigm named Mask Image pre-training with MIx Contrastive fine-tuning (MIMIC). In the initial phase, we pre-train the ViT via masked image reconstruction on general images. Subsequently, in the fine-tuning stage, we introduce a mix-supervised contrastive learning process, which enhances the model with a more extensive range of positive samples through the mixing strategy. Through extensive experiments conducted on three benchmark datasets, we demonstrate that MIMIC outperforms the previous training paradigm, showing its capability to learn better representations. Remarkably, the results indicate that a vanilla ViT can achieve impressive performance without intricate, auxiliary-designed modules. Moreover, when scaling up the model size, MIMIC exhibits no performance saturation and is superior to current state-of-the-art methods.