Abstract
Introduction: Microblogging websites have amassed rich data sources for sentiment analysis and opinion mining. Sentiment classification on this data has frequently proven ineffective, because microblog posts typically lack syntactically consistent terms and representations: users on these social networks tend not to write lengthy statements. Low-resource languages pose additional limitations. Persian has distinctive characteristics and requires its own annotated data and models for sentiment analysis, since its text features differ from those of English. Method: This paper first constructs a user opinion dataset, called ITRC-Opinion, through a collaborative, insourced process. The dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram. Second, this study proposes a new architecture based on the convolutional neural network (CNN) model for more effective sentiment analysis of colloquial text in social microblog posts. The constructed dataset is used to evaluate the presented architecture. In addition, several models, such as LSTM, CNN-RNN, BiLSTM, and BiGRU, are applied to our dataset with different word embeddings, including fastText, GloVe, and Word2vec, and their results are evaluated. Results: The results demonstrate the benefit of our dataset and the proposed model (72% accuracy), showing a meaningful improvement in sentiment classification performance.
Abstract (translated)
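Introduction: Microblogging websites have gathered a large volume of data for sentiment analysis and opinion mining. Sentiment classification in this setting often performs poorly, because microblog posts usually lack syntactically consistent vocabulary and representations. Low-resource languages face further limitations. Persian has distinctive characteristics and needs its own annotated data and models for the sentiment analysis task, which differ from the text features of English. Method: This paper first builds a user opinion dataset called ITRC-Opinion in a collaborative, insourced setting. The dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram. Next, the study proposes a new architecture based on the convolutional neural network (CNN) model to analyze the sentiment of colloquial text in social microblogs more effectively. The constructed dataset is used to evaluate the proposed architecture. In addition, several models, such as LSTM, CNN-RNN, BiLSTM, and BiGRU, are run on the dataset with different word embeddings, including fastText, GloVe, and Word2vec, and their results are evaluated. Results: The results show the value of the dataset and the proposed model (72% accuracy), with a marked improvement in sentiment classification performance.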
URL
https://arxiv.org/abs/2306.12679