
TurkishBERTweet: Fast and Reliable Large Language Model for Social Media Analysis


Abstract

Turkish is one of the most popular languages in the world. The wide use of the language on social media platforms such as Twitter, Instagram, and TikTok, together with the country's strategic position in world politics, makes it appealing to social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large-scale pre-trained language model for Turkish social media, built using almost 900 million tweets. The model shares the same architecture as the base BERT model but with a smaller input length, making TurkishBERTweet lighter than BERTurk and giving it significantly lower inference time. We trained our model using the same approach as RoBERTa and evaluated it on two text classification tasks: Sentiment Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet outperforms the other available alternatives in generalizability, and that its lower inference time gives it a significant advantage when processing large-scale datasets. We also compared our models with commercial OpenAI solutions in terms of cost and performance to demonstrate that TurkishBERTweet is a scalable and cost-effective solution. As part of our research, we released TurkishBERTweet and fine-tuned LoRA adapters for these tasks under the MIT License to facilitate future research and applications on Turkish social media. Our TurkishBERTweet model is available at: this https URL
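Since the base model and fine-tuned LoRA adapters are released publicly, a typical usage pattern would be to load the encoder and attach a task adapter. Below is a minimal sketch using the Hugging Face transformers and peft libraries; the repository IDs, the number of sentiment labels, and the example tweet are illustrative assumptions, not details taken from the paper, so consult the released model page for the actual identifiers.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

base_id = "VRLLab/TurkishBERTweet"             # assumed Hugging Face model ID
adapter_id = "VRLLab/TurkishBERTweet-Lora-SA"  # assumed sentiment-classification LoRA adapter ID

# Load the tokenizer and the base encoder with a classification head
# (3 labels is an assumption: negative / neutral / positive).
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForSequenceClassification.from_pretrained(base_id, num_labels=3)

# Attach the released LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()

# Run inference on a sample Turkish tweet.
inputs = tokenizer("bugün hava çok güzel", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())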


URL

https://arxiv.org/abs/2311.18063

PDF

https://arxiv.org/pdf/2311.18063.pdf

