Abstract
Misinformation on YouTube is a significant concern, necessitating robust detection strategies. In this paper, we introduce a novel methodology for video classification, focusing on the veracity of the content. We convert the conventional video classification task into a text classification task by leveraging the textual content derived from the video transcripts. We employ advanced machine learning techniques, such as transfer learning, to solve the classification challenge. Our approach incorporates two forms of transfer learning: (a) fine-tuning base transformer models such as BERT, RoBERTa, and ELECTRA, and (b) few-shot learning using the sentence-transformers MPNet and RoBERTa-large. We apply the trained models to three datasets: (a) YouTube vaccine-misinformation videos, (b) YouTube pseudoscience videos, and (c) a Fake-News dataset (a collection of articles). Including the Fake-News dataset extended the evaluation of our approach beyond YouTube videos. Using these datasets, we evaluated the models' ability to distinguish valid information from misinformation. The fine-tuned models yielded a Matthews Correlation Coefficient > 0.81, accuracy > 0.90, and F1 score > 0.90 on two of the three datasets. Interestingly, the few-shot models outperformed the fine-tuned ones by 20% in both accuracy and F1 score on the YouTube pseudoscience dataset, highlighting the potential utility of this approach -- especially in the context of limited training data.
URL
https://arxiv.org/abs/2307.12155
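
The evaluation metrics named in the abstract (Matthews Correlation Coefficient, accuracy, and F1 score) can be computed directly from a binary confusion matrix. The sketch below is illustrative only; the labels and predictions are hypothetical, not drawn from the paper's datasets:

```python
import math

# Hypothetical gold labels and model predictions (1 = misinformation, 0 = valid)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

# Confusion-matrix counts for the binary task
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
# MCC is robust to class imbalance, which is why it complements accuracy/F1
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={accuracy:.2f}  F1={f1:.2f}  MCC={mcc:.2f}")
# → accuracy=0.80  F1=0.80  MCC=0.60
```

Unlike accuracy and F1, MCC uses all four confusion-matrix cells, so a high MCC (> 0.81, as reported) indicates strong performance on both classes rather than just the majority class.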