Abstract
A title is clickbait when it intentionally lures readers into clicking on content by exploiting their curiosity. Although several studies have focused on detecting clickbait titles in English articles, low-resource languages such as Bangla have not received adequate attention. To tackle clickbait titles in Bangla, we have constructed the first Bangla clickbait detection dataset, containing 15,056 labeled news articles and 65,406 unlabeled news articles extracted from clickbait-dense news sites. Each article has been labeled by three expert linguists and includes its title, body, and other metadata. By incorporating both labeled and unlabeled data, we fine-tune a pretrained Bangla transformer model in an adversarial fashion using Semi-Supervised Generative Adversarial Networks (SS-GANs). The proposed model serves as a strong baseline for this dataset, outperforming traditional neural network models (LSTM, GRU, CNN) and linguistic-feature-based models. We expect that this dataset, together with the detailed analysis and comparison of these clickbait detection models, will provide a foundation for future research into detecting clickbait titles in Bangla articles. We have released the corresponding code and dataset.
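The abstract does not spell out the training objective, so the following is only a minimal sketch of how a semi-supervised GAN discriminator loss is commonly set up (the K+1-class formulation of Salimans et al.), not the authors' implementation. It assumes two real classes (non-clickbait, clickbait) plus one extra "fake" class for generator outputs. The function names, the class indices, and the choice to apply the real/fake term to both labeled and unlabeled titles are all assumptions made for illustration. The logits would normally come from a classification head on top of the transformer; here they are plain lists.

```python
import math

# Assumed class layout: 0 = non-clickbait, 1 = clickbait, 2 = extra "fake" class.
FAKE = 2


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_loss(labeled, unlabeled_logits, fake_logits):
    """Sketch of an SS-GAN discriminator loss.

    labeled:          list of (logits, label) pairs for labeled real titles
    unlabeled_logits: list of logits for unlabeled real titles
    fake_logits:      list of logits for generator outputs
    Each logits entry has three values: two real classes plus the fake class.
    """
    # Supervised term: cross-entropy restricted to the real classes.
    sup = 0.0
    for logits, label in labeled:
        probs = softmax(logits[:FAKE])
        sup -= math.log(probs[label])
    sup /= len(labeled)

    # Unsupervised term, real side: real titles should not be classed as fake.
    real = [logits for logits, _ in labeled] + list(unlabeled_logits)
    unsup_real = -sum(math.log(sum(softmax(z)[:FAKE])) for z in real) / len(real)

    # Unsupervised term, fake side: generator outputs should be classed as fake.
    unsup_fake = -sum(math.log(softmax(z)[FAKE]) for z in fake_logits) / len(fake_logits)

    return sup + unsup_real + unsup_fake


if __name__ == "__main__":
    loss = discriminator_loss(
        labeled=[([0.2, 1.5, -1.0], 1)],
        unlabeled_logits=[[1.0, 0.1, -2.0]],
        fake_logits=[[-1.0, -0.5, 2.0]],
    )
    print(f"discriminator loss: {loss:.4f}")
```

The unlabeled articles contribute only through the real-versus-fake term, which is how a large unlabeled pool can shape the discriminator even though it carries no clickbait labels.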
URL
https://arxiv.org/abs/2311.06204