We report results of a longitudinal sentiment classification of Reddit posts written by students of four major Canadian universities. We work with the texts of the posts, concentrating on the years 2020-2023. By fine-tuning a sentiment threshold to the range [-0.075, 0.075], we built classifiers that reliably categorize post sentiments into positive and negative categories. Notably, our sentiment classification results are consistent across the four university data sets.
https://arxiv.org/abs/2401.12382
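A minimal sketch of the thresholding this abstract describes. The treatment of scores inside the band as neutral, the function name, and the sample scores are all assumptions for illustration; the upstream scorer (possibly a VADER-style compound score) is taken as given.

```python
# Hypothetical sketch: mapping a continuous sentiment score to a label
# using a neutral dead-zone of [-0.075, 0.075], as the abstract describes.
# Score values and the neutral handling are assumptions, not the paper's code.

def classify(score: float, low: float = -0.075, high: float = 0.075) -> str:
    """Map a continuous sentiment score to a discrete label."""
    if score > high:
        return "positive"
    if score < low:
        return "negative"
    return "neutral"  # scores inside the band are treated as neutral here

labels = [classify(s) for s in [0.4, -0.2, 0.01]]
```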
Analyzing authors' sentiments in texts as a technique for identifying text polarity can be practical and useful in various fields, including medicine and dentistry. Currently, due to factors such as patients' limited knowledge about their condition, difficulties in accessing specialist doctors, or fear of illness, particularly in pandemic conditions, there might be a delay between receiving a radiology report and consulting a doctor. In some cases, this delay can pose significant risks to the patient, making timely decision-making crucial. An automatic system that can inform patients about the deterioration of their condition by analyzing the text of radiology reports could therefore greatly support timely decision-making. In this study, a dataset comprising 1,134 cone-beam computed tomography (CBCT) photo reports was collected from the Shiraz University of Medical Sciences. Each case was examined, and an expert labeled the severity of the patient's condition on each document. After preprocessing all the text data, a deep learning model based on a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network architecture, known as CNN-LSTM, was developed to detect the severity of the patient's problem via sentiment analysis of the radiologist's report. The model's performance was evaluated on two datasets, with two and four classes respectively, in both imbalanced and balanced scenarios. Finally, to demonstrate the effectiveness of our model, we compared its performance with that of other classification models. The results, along with one-way ANOVA and Tukey's test, indicated that our proposed model (CNN-LSTM) performed the best according to precision, recall, and f-measure criteria. This suggests that it can be a reliable model for estimating the severity of oral and dental diseases, thereby assisting patients.
https://arxiv.org/abs/2401.12993
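The precision, recall, and f-measure criteria used to compare the models above reduce to simple counts; a minimal sketch follows, with made-up counts rather than values from the study.

```python
# Minimal sketch of the precision/recall/f-measure criteria the study uses
# to rank classifiers. The tp/fp/fn counts below are illustrative only.

def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = prf(tp=80, fp=20, fn=10)
```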
Instruction-tuned large language models (LLMs) excel at many tasks, and will even provide explanations for their behavior. Since these models are directly accessible to the public, there is a risk that convincing but wrong explanations can lead to unsupported confidence in LLMs. Therefore, interpretability-faithfulness of self-explanations is an important consideration for AI Safety. Assessing the interpretability-faithfulness of these explanations, termed self-explanations, is challenging, as the models are too complex for humans to annotate what a correct explanation is. To address this, we propose employing self-consistency checks as a measure of faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been applied to LLMs' self-explanations. We apply self-consistency checks to three types of self-explanations: counterfactuals, importance measures, and redactions. Our work demonstrates that faithfulness is both task and model dependent; e.g., for sentiment classification, counterfactual explanations are more faithful for Llama2, importance measures for Mistral, and redaction for Falcon 40B. Finally, our findings are robust to prompt variations.
https://arxiv.org/abs/2401.07927
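The self-consistency idea (removing words an explanation marks as important should change the prediction) can be illustrated with a toy lexicon classifier standing in for the LLMs; the lexicon, sentence, and function names are invented for illustration.

```python
# Toy illustration of a self-consistency check for importance explanations.
# A tiny lexicon classifier stands in for an LLM; both are stand-ins,
# not the models or scoring used in the paper.

LEXICON = {"great": 1, "love": 1, "awful": -1, "boring": -1}

def predict(tokens):
    """Label a token list by summing lexicon scores (ties count as positive)."""
    score = sum(LEXICON.get(t, 0) for t in tokens)
    return "positive" if score >= 0 else "negative"

def is_consistent(tokens, important):
    """Consistent if removing the 'important' words changes the prediction."""
    reduced = [t for t in tokens if t not in important]
    return predict(reduced) != predict(tokens)

tokens = ["the", "plot", "was", "awful", "and", "boring"]
consistent = is_consistent(tokens, important={"awful", "boring"})
```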
Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose the Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes the text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken responses. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multi-modal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation, with autoregressive conditioning. We utilize the Switchboard-1 corpus, using its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset. Experimental results indicate the proposed serialized multitasking method outperforms typical sequence classification techniques on current and response sentiment classification. Furthermore, leveraging conversational context and speech embeddings significantly improves both response text generation and sentiment prediction. Our proposed framework achieves relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively.
https://arxiv.org/abs/2312.15316
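One way to picture the serialized multitasking order described above (current attribute, then response attribute, then response text) is as a flat token sequence; the tag names and layout below are assumptions, not ParalinGPT's actual format.

```python
# Illustrative serialization of the three chained sub-tasks. The bracketed
# tags are invented placeholders, not tokens from the paper.

def serialize(context: str, cur_sent: str, resp_sent: str, resp_text: str) -> str:
    """Lay out the tasks in the autoregressive order the framework uses."""
    return (
        f"[CONTEXT] {context} "
        f"[CURRENT_SENTIMENT] {cur_sent} "
        f"[RESPONSE_SENTIMENT] {resp_sent} "
        f"[RESPONSE] {resp_text}"
    )

seq = serialize("How was the trip?", "neutral", "positive", "It was wonderful!")
```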
Existing models for target sentiment classification (TSC) based on pre-trained language models (PTLMs) can be categorized into two groups: 1) fine-tuning-based models that adopt the PTLM as the context encoder; 2) prompting-based models that recast the classification task as a text/word generation task. In this paper, we present a new perspective on leveraging PTLMs for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes. Specifically, we design a domain- and target-constrained cloze test, which can leverage the PTLM's strong language modeling ability to generate the given target's attributes pertaining to the review context. The attributes contain the background and property information of the target, which can help to enrich the semantics of the review context and the target. To exploit the attributes for tackling TSC, we first construct a heterogeneous information graph by treating the attributes as nodes and combining them with (1) the syntax graph automatically produced by an off-the-shelf dependency parser and (2) the semantics graph of the review context, which is derived from the self-attention mechanism. Then we propose a heterogeneous information gated graph convolutional network to model the interactions among the attribute information, the syntactic information, and the contextual information. The experimental results on three benchmark datasets demonstrate the superiority of our model, which achieves new state-of-the-art performance.
https://arxiv.org/abs/2312.13766
Sentiment analysis methods are rapidly being adopted by the field of Urban Design and Planning, for the crowdsourced evaluation of urban environments. However, most models used within this domain are able to identify positive or negative sentiment associated with a textual appraisal as a whole, without inferring information about specific urban aspects contained within it, or the sentiment associated with them. While Aspect Based Sentiment Analysis (ABSA) is becoming increasingly popular, most existing ABSA models are trained on non-urban themes such as restaurants, electronics, consumer goods and the like. This body of research develops an ABSA model capable of extracting urban aspects contained within geo-located textual urban appraisals, along with corresponding aspect sentiment classification. We annotate a dataset of 2500 crowdsourced reviews of public parks, and train a Bidirectional Encoder Representations from Transformers (BERT) model with Local Context Focus (LCF) on this data. Our model achieves significant improvement in prediction accuracy on urban reviews, for both Aspect Term Extraction (ATE) and Aspect Sentiment Classification (ASC) tasks. For demonstrative analysis, positive and negative urban aspects across Boston are spatially visualized. We hope that this model is useful for designers and planners for fine-grained urban sentiment evaluation.
https://arxiv.org/abs/2312.12253
Aspect-based sentiment analysis (ABSA), a fine-grained sentiment classification task, has received much attention recently. Many works investigate sentiment information through opinion words, such as ''good'' and ''bad''. However, implicit sentiment widely exists in ABSA datasets: sentences that contain no distinct opinion words but still express sentiment toward the aspect term. To deal with implicit sentiment, this paper proposes an ABSA method (ABSA-ESA) that integrates explicit sentiment augmentations, along with an ABSA-specific augmentation method to create them. Specifically, we post-train T5 on rule-based data. We employ Syntax Distance Weighting and Unlikelihood Contrastive Regularization in the training procedure to guide the model to generate explicit sentiment. Meanwhile, we utilize Constrained Beam Search to ensure the augmentation sentence contains the aspect terms. We test ABSA-ESA on two of the most popular ABSA benchmarks. The results show that ABSA-ESA outperforms the SOTA baselines on implicit and explicit sentiment accuracy.
https://arxiv.org/abs/2312.10961
Few-shot text classification has attracted great interest in both academia and industry due to the lack of labeled data in many fields. Different from general text classification (e.g., topic classification), few-shot sentiment classification is more challenging because the semantic distances among the classes are more subtle. For instance, the semantic distances between sentiment labels within the same polarity (e.g., ``love'' and ``joy'', ``remorse'' and ``sadness'') are small, while the distances between sentiment labels of two opposite polarities (e.g., ``love'' and ``sadness'') are large. To address this problem, we propose a Soft Contrastive learning-based Prompt (\texttt{SCP}) model for few-shot sentiment analysis. First, we design a sentiment-aware chain of thought prompt module to guide the model to predict the sentiment from coarse grain to fine grain via a series of intermediate reasoning steps. Then, we propose a soft contrastive learning algorithm to take the correlation of the labels into account. A series of experiments on several sentiment analysis datasets show the great advantages of \texttt{SCP} by comparing it with SOTA baselines (e.g., ChatGPT).
https://arxiv.org/abs/2312.10479
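A hedged sketch of what a coarse-to-fine, sentiment-aware chain-of-thought prompt might look like; the wording and example labels below are assumptions, not the paper's actual template.

```python
# Hypothetical coarse-to-fine prompt in the spirit of the sentiment-aware
# chain of thought: first polarity, then the fine-grained label within it.
# The exact wording is invented for illustration.

def build_prompt(text: str) -> str:
    return (
        f"Review: {text}\n"
        "Step 1: Is the overall polarity positive or negative?\n"
        "Step 2: Within that polarity, which fine-grained emotion "
        "(e.g., love, joy, remorse, sadness) best fits?\n"
        "Answer with the fine-grained label."
    )

prompt = build_prompt("I cried at the ending, it broke my heart.")
```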
Recently, Deep Learning (DL) approaches have been applied to solve the Sentiment Classification (SC) problem, a core task in review mining or Sentiment Analysis (SA). The performance of these approaches is affected by different factors. This paper addresses these factors and classifies them into three categories: data-preparation-based factors, feature-representation-based factors, and classification-technique-based factors. The paper is a comprehensive literature-based survey that compares the performance of more than 100 DL-based SC approaches using 21 public datasets of customer reviews from three specific application domains (products, movies, and restaurants). These 21 datasets have different characteristics (balanced/imbalanced, size, etc.), giving our study a global view. The comparison explains how the proposed factors quantitatively affect the performance of the studied DL-based SC approaches.
https://arxiv.org/abs/2312.17253
While there is significant interest in using generative AI tools as general-purpose models for specific ML applications, discriminative models are much more widely deployed currently. A key shortcoming of these already-deployed discriminative AI tools is that they are not as adaptable and user-friendly as generative AI tools (e.g., GPT-4, Stable Diffusion, Bard), with which a non-expert user can iteratively refine model inputs and give real-time feedback that is accounted for immediately, allowing users to build trust from the start. Inspired by this emerging collaborative workflow, we develop a new system architecture that enables users to work with discriminative models (such as for object detection, sentiment classification, etc.) in a fashion similar to generative AI tools, where they can easily provide immediate feedback as well as adapt the deployed models as desired. Our approach has implications for improving the trust, user-friendliness, and adaptability of these versatile but traditional prediction models.
https://arxiv.org/abs/2312.06826
Label smoothing is a widely used technique in various domains, such as image classification and speech recognition, known for effectively combating model overfitting. However, there is little research on its application to text sentiment classification. To fill this gap, this study investigates the implementation of label smoothing for sentiment classification by utilizing different levels of smoothing. The primary objective is to enhance sentiment classification accuracy by transforming discrete labels into smoothed label distributions. Through extensive experiments, we demonstrate the superior performance of label smoothing in text sentiment classification tasks across eight diverse datasets and three deep learning architectures (TextCNN, BERT, and RoBERTa), under two learning schemes: training from scratch and fine-tuning.
https://arxiv.org/abs/2312.06522
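The core transformation above, turning a discrete one-hot label into a smoothed distribution, is small enough to sketch; `eps` here plays the role of the smoothing level the study varies.

```python
# Standard label smoothing: mix a one-hot target with a uniform distribution
# over the k classes, controlled by the smoothing level eps.

def smooth_labels(one_hot, eps: float):
    """Return (1 - eps) * one_hot + eps / k for each of the k entries."""
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]

smoothed = smooth_labels([0.0, 1.0, 0.0], eps=0.1)
```

The smoothed vector still sums to 1, so it remains a valid target distribution for a cross-entropy loss.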
The Google app market captures the opinions of users from every corner of the globe via ratings and text reviews, in a multilingual arena. The potential information in the reviews cannot be extracted manually due to their exponential growth. Sentiment analysis, by machine learning and deep learning algorithms employing NLP, explicitly uncovers and interprets the emotions. This study performs sentiment classification of app reviews and identifies university students' behavior towards the app market via exploratory analysis. We applied machine learning algorithms using the TP, TF, and TF-IDF text representation schemes and evaluated their performance with Bagging, an ensemble learning method. We used GloVe word embeddings for the deep learning paradigms. Our model was trained on Google app reviews and tested on Student's App Reviews (SAR). The various combinations of these algorithms were compared using F score and accuracy, and inferences were highlighted graphically. Among the classifiers, SVM gave fruitful accuracy (93.41%) and F score (89%) with the bigram TF-IDF scheme. Bagging enhanced the performance of LR and NB, with accuracies of 87.88% and 86.69% and F scores of 86% and 78%, respectively. Overall, LSTM with GloVe embeddings recorded the highest accuracy (95.2%) and F score (88%).
https://arxiv.org/abs/2312.06705
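A hedged sketch of the bigram TF-IDF plus SVM setup the abstract reports (93.41% accuracy), using scikit-learn; the tiny corpus is invented, and the study's exact preprocessing is not reproduced.

```python
# Sketch of a bigram TF-IDF + linear SVM pipeline, in the spirit of the
# study's best machine-learning configuration. The four reviews and their
# labels are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reviews = [
    "great app love it",
    "worst app ever",
    "very useful tool",
    "total waste of time",
]
labels = ["pos", "neg", "pos", "neg"]

# ngram_range=(1, 2) includes unigrams and bigrams in the TF-IDF features.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(reviews, labels)
pred = model.predict(["love this useful app"])[0]
```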
In this paper, we propose a novel method to enhance sentiment analysis by addressing the challenge of context-specific word meanings. It combines the advantages of a bidirectional long short-term memory network (Bi-LSTM) with a knowledge graph's synonym data. This synergy leverages a dynamic attention mechanism to develop a knowledge-driven state vector. For classifying sentiments linked to specific aspects, the approach constructs a memory bank integrating positional data. This data is then analyzed using a multi-layer gated recurrent unit (GRU) to pinpoint sentiment characteristics related to specific aspect terms. Tests on three widely available datasets demonstrate this method's superior performance in sentiment classification.
https://arxiv.org/abs/2312.10048
Turkish is one of the most popular languages in the world. The wide use of this language on social media platforms such as Twitter, Instagram, and TikTok, together with the country's strategic position in world politics, makes it appealing to social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large-scale pre-trained language model for Turkish social media, built using almost 900 million tweets. The model shares the same architecture as the base BERT model but with a smaller input length, making TurkishBERTweet lighter than BERTurk with significantly lower inference time. We trained our model using the same approach as for the RoBERTa model and evaluated it on two text classification tasks: Sentiment Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet outperforms the other available alternatives in generalizability, and its lower inference time gives it a significant advantage in processing large-scale datasets. We also compared our models with commercial OpenAI solutions in terms of cost and performance to demonstrate that TurkishBERTweet is a scalable and cost-effective solution. As part of our research, we released TurkishBERTweet and fine-tuned LoRA adapters for the mentioned tasks under the MIT License to facilitate future research and applications on Turkish social media. Our TurkishBERTweet model is available at: this https URL
https://arxiv.org/abs/2311.18063
Artificial intelligence and machine learning have significantly bolstered the technological world. This paper explores the potential of transfer learning in natural language processing, focusing mainly on sentiment analysis. Models trained on big data can also be used where data are scarce. The claim is that, compared to training models from scratch, transfer learning using pre-trained BERT models can increase sentiment classification accuracy. The study adopts a sophisticated experimental design that uses the IMDb dataset of sentiment-labelled movie reviews. Pre-processing includes tokenization and encoding of the text data, making it suitable for NLP models. The dataset is used on a BERT-based model, and its performance is measured using accuracy. The reported result is 100 per cent accuracy. Although perfect accuracy might appear impressive, it could be the result of overfitting or a lack of generalization. Further analysis is required to ensure the model's ability to handle diverse and unseen data. The findings underscore the effectiveness of transfer learning in NLP, showcasing its potential to excel in sentiment analysis tasks. However, the research calls for a cautious interpretation of perfect accuracy and emphasizes the need for additional measures to validate the model's generalization.
https://arxiv.org/abs/2311.16965
Recent progress in aspect-level sentiment classification has been propelled by the incorporation of graph neural networks (GNNs) leveraging syntactic structures, particularly dependency trees. Nevertheless, the performance of these models is often hampered by the innate inaccuracies of parsing algorithms. To mitigate this challenge, we introduce SynthFusion, an innovative graph ensemble method that amalgamates predictions from multiple parsers. This strategy blends diverse dependency relations prior to the application of GNNs, enhancing robustness against parsing errors while avoiding extra computational burdens. SynthFusion circumvents the pitfalls of overparameterization and diminishes the risk of overfitting, prevalent in models with stacked GNN layers, by optimizing graph connectivity. Our empirical evaluations on the SemEval14 and Twitter14 datasets affirm that SynthFusion not only outshines models reliant on single dependency trees but also eclipses alternative ensemble techniques, achieving this without an escalation in model complexity.
https://arxiv.org/abs/2312.03738
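The fusion step above, blending dependency relations from multiple parsers before applying a GNN, can be sketched as a union of edge sets; SynthFusion's actual fusion may weight or filter edges differently, and the parser outputs below are made up.

```python
# Toy sketch of ensembling dependency parses: merge the edges proposed by
# several parsers into one graph to feed a GNN. Edge lists are invented.

def fuse_edges(*parser_edges):
    """Union of undirected edges (head, dependent) from each parser's tree."""
    fused = set()
    for edges in parser_edges:
        for u, v in edges:
            fused.add((min(u, v), max(u, v)))  # normalize direction
    return fused

parser_a = [(0, 1), (1, 2)]
parser_b = [(1, 0), (1, 3)]  # agrees on 0-1, adds 1-3
graph = fuse_edges(parser_a, parser_b)
```

A union keeps every relation any parser proposed, so a single parser's mistake adds at most a few spurious edges instead of silently dropping a correct one.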
While the performance of many text classification tasks has recently been improved by Pre-trained Language Models (PLMs), in this paper we show that they still suffer from a performance gap when the underlying distribution of topics changes. For example, a genre classifier trained on \textit{political} topics often fails when tested on documents about \textit{sport} or \textit{medicine}. In this work, we quantify this phenomenon empirically with a large corpus and a large set of topics. Consequently, we verify that domain transfer remains challenging both for classic PLMs, such as BERT, and for modern large models, such as GPT-3. We also suggest and successfully test a possible remedy: after augmenting the training dataset with topically-controlled synthetic texts, the F1 score improves by up to 50\% for some topics, nearing on-topic training results, while others show little to no improvement. While our empirical results focus on genre classification, our methodology is applicable to other classification tasks such as gender, authorship, or sentiment classification. The code and data to replicate the experiments are available at this https URL
https://arxiv.org/abs/2311.16083
The impact of non-deterministic outputs from Large Language Models (LLMs) is not well examined for financial text understanding tasks. Through a compelling case study on investing in the US equity market via news sentiment analysis, we uncover substantial variability in sentence-level sentiment classification results, underscoring the innate volatility of LLM outputs. These uncertainties cascade downstream, leading to more significant variations in portfolio construction and return. While tweaking the temperature parameter in the language model decoder presents a potential remedy, it comes at the expense of stifled creativity. Similarly, while ensembling multiple outputs mitigates the effect of volatile outputs, it demands a notable computational investment. This work furnishes practitioners with invaluable insights for adeptly navigating uncertainty in the integration of LLMs into financial decision-making, particularly in scenarios dictated by non-deterministic information.
https://arxiv.org/abs/2311.15180
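The ensembling remedy mentioned above can be sketched as a majority vote over repeated samples; the sampled labels below are mocked rather than drawn from an LLM, and the vote rule is one simple choice among several.

```python
# Sketch of ensembling non-deterministic LLM outputs: sample the sentiment
# label several times and take the majority. Samples here are hard-coded.
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among the sampled outputs."""
    return Counter(labels).most_common(1)[0][0]

samples = ["positive", "negative", "positive", "positive", "neutral"]
consensus = majority_vote(samples)
```

The trade-off the abstract notes is visible here: stabilizing one classification requires as many model calls as there are samples.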
According to the literature, product reviews are an important source of information for customers to support their buying decisions. Product reviews improve customer trust and loyalty. Reviews help customers understand what other customers think about a particular product and help drive purchase decisions. Therefore, it is important for an e-commerce platform to understand the sentiment in customer reviews of its products and services; doing so also allows it to create positive consumer interactions as well as long-lasting relationships. Reviews also provide innovative ways for an e-commerce company to market its products. One such approach is nudge marketing, a subtle way for an e-commerce company to help its customers make better decisions without hesitation.
https://arxiv.org/abs/2311.10782
Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to yield good performance, using in-context learning, even on unseen tasks and languages. Additionally, they have been widely adopted as language-model-as-a-service commercial APIs, like the GPT-4 API. However, their performance on African languages is largely unknown. We present an analysis of three popular large language models (mT0, LLaMa 2, and GPT-4) on five tasks (news topic classification, sentiment classification, machine translation, question answering, and named entity recognition) across 30 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs produce below-par performance on African languages, and there is a large gap in performance compared to high-resource languages like English on most tasks. We find that GPT-4 has an average or impressive performance on classification tasks but very poor results on generative tasks like machine translation. Surprisingly, we find that mT0 had the best overall performance on cross-lingual QA, better than the state-of-the-art supervised model (i.e., fine-tuned mT5) and GPT-4 on African languages. Overall, LLaMa 2 records the worst performance due to its limited multilingual capabilities and English-centric pre-training corpus. In general, our findings present a call to action to ensure African languages are well represented in large language models, given their growing popularity.
https://arxiv.org/abs/2311.07978