Abstract
This paper investigates the effectiveness of self-supervised pre-trained transformers compared to supervised pre-trained transformers and convolutional neural networks (ConvNets) for detecting various types of deepfakes. We focus on their potential for improved generalization, particularly when training data is limited. Despite the notable success of large vision-language models built on transformer architectures across many tasks, including zero-shot and few-shot learning, the deepfake detection community has remained reluctant to adopt pre-trained vision transformers (ViTs), especially large ones, as feature extractors. One concern is their perceived excessive capacity, which often demands extensive data, and the resulting suboptimal generalization when training or fine-tuning data is small or insufficiently diverse. This contrasts poorly with ConvNets, which have already established themselves as robust feature extractors. Additionally, training and optimizing transformers from scratch requires significant computational resources, making such training accessible primarily to large companies and hindering broader investigation within the academic community. Recent advancements in applying self-supervised learning (SSL) to transformers, such as DINO and its derivatives, have demonstrated significant adaptability across diverse vision tasks and possess explicit semantic segmentation capabilities. By leveraging DINO for deepfake detection with modest training data and applying partial fine-tuning, we observe comparable adaptability to the task along with natural explainability of the detection results via the attention mechanism. Moreover, partial fine-tuning of transformers for deepfake detection offers a more resource-efficient alternative, requiring significantly fewer computational resources.
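The partial fine-tuning mentioned above can be illustrated with a minimal PyTorch sketch: freeze a pretrained ViT backbone except its final transformer block, then train only that block plus a small detection head. The toy block and the freezing scheme below are illustrative assumptions, not the paper's exact configuration (the authors use DINO-pretrained weights; here a randomly initialized stand-in keeps the example self-contained).

```python
# Minimal sketch of partial fine-tuning, assuming a "freeze all but the
# last block and the head" scheme. TinyViTBlock is a stand-in for one
# transformer encoder block; a real setup would load DINO weights instead.
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    """One simplified transformer encoder block (pre-norm)."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)          # self-attention
        x = x + a
        return x + self.mlp(self.norm2(x))  # residual MLP

class PartialFinetuneDetector(nn.Module):
    """Backbone of `depth` blocks; only the last block and head train."""
    def __init__(self, dim=384, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(TinyViTBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, 2)  # real vs. fake logits
        # Freeze everything, then unfreeze the last block and the head.
        for p in self.parameters():
            p.requires_grad = False
        for p in self.blocks[-1].parameters():
            p.requires_grad = True
        for p in self.head.parameters():
            p.requires_grad = True

    def forward(self, tokens):
        for blk in self.blocks:
            tokens = blk(tokens)
        return self.head(tokens.mean(dim=1))  # mean-pool, then classify

model = PartialFinetuneDetector()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2f}")
```

The optimizer would then receive only `filter(lambda p: p.requires_grad, model.parameters())`, which is what makes this regime markedly cheaper than full fine-tuning.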
URL
https://arxiv.org/abs/2405.00355