In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024. First, to enhance the performance of the feature extractor on sentiment classification tasks, we fine-tune video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to enhance the accuracy of the model, we iteratively apply self-supervised learning, assigning pseudo-labels to high-confidence unlabeled data. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets, so we adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.
https://arxiv.org/abs/2409.05007
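The iterative pseudo-labeling step above follows the generic self-training recipe. A minimal sketch, assuming a scikit-learn-style classifier and a hypothetical confidence threshold (the paper's actual models and threshold are not specified here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(model, X_lab, y_lab, X_unlab, threshold=0.95, rounds=3):
    """Iteratively pseudo-label high-confidence unlabeled samples and
    fold them back into the training set."""
    X_train, y_train, pool = X_lab, y_lab, X_unlab
    for _ in range(rounds):
        model.fit(X_train, y_train)
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        keep = proba.max(axis=1) >= threshold          # high-confidence only
        if not keep.any():
            break
        pseudo = model.classes_[proba[keep].argmax(axis=1)]
        X_train = np.vstack([X_train, pool[keep]])
        y_train = np.concatenate([y_train, pseudo])
        pool = pool[~keep]                             # shrink the unlabeled pool
    return model

# e.g. self_train(LogisticRegression(max_iter=1000), X_lab, y_lab, X_unlab)
```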
Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored; moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or use the language model to rephrase less important tokens in the original sequence. In the context of SC, strong emotional tokens can critically determine the sentiment of the whole sequence. Therefore, contrary to rephrasing less important context, we propose DiffusionCLS, which leverages a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting the crucial features of the dataset. DiffusionCLS also comprises a Noise-Resistant Training objective that helps the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios, including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework's modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
https://arxiv.org/abs/2409.03203
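To make the "reconstruct strong label-related tokens" idea concrete, here is a rough analogue that masks a strong sentiment token and lets a masked LM regenerate it. The paper uses a diffusion LM and learns which tokens are label-related; the fixed lexicon and the BERT checkpoint below are stand-ins:

```python
from transformers import pipeline

# Hypothetical lexicon of strong label-related tokens; DiffusionCLS scores
# label relevance rather than relying on a fixed list.
STRONG_TOKENS = {"terrible", "awful", "great", "wonderful"}

fill = pipeline("fill-mask", model="bert-base-uncased")

def pseudo_samples(text, top_k=3):
    """Mask the first strong sentiment token and return top reconstructions,
    varying the label-critical position while keeping the context fixed."""
    words = text.split()
    for i, w in enumerate(words):
        if w.lower().strip(".,!?") in STRONG_TOKENS:
            words[i] = fill.tokenizer.mask_token
            return [p["sequence"] for p in fill(" ".join(words))[:top_k]]
    return [text]  # nothing label-critical to vary

# pseudo_samples("the service was terrible but the food saved it")
```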
Aspect-based Sentiment Analysis (ABSA) is a critical task in Natural Language Processing (NLP) that focuses on extracting sentiments related to specific aspects within a text, offering deep insights into customer opinions. Traditional sentiment analysis methods, while useful for determining overall sentiment, often miss the implicit opinions about particular product or service features. This paper presents a comprehensive review of the evolution of ABSA methodologies, from lexicon-based approaches to machine learning and deep learning techniques. We emphasize the recent advancements in Transformer-based models, particularly Bidirectional Encoder Representations from Transformers (BERT) and its variants, which have set new benchmarks in ABSA tasks. We focus on fine-tuning Llama and Mistral models, building hybrid models using the SetFit framework, and developing our own model by exploiting the strengths of state-of-the-art (SOTA) Transformer-based models for aspect term extraction (ATE) and aspect sentiment classification (ASC). Our hybrid model Instruct - DeBERTa uses SOTA InstructABSA for aspect extraction and DeBERTa-V3-baseabsa-V1 for aspect sentiment classification. We utilize datasets from different domains to evaluate our model's performance. Our experiments indicate that the proposed hybrid model significantly improves the accuracy and reliability of sentiment analysis across all experimented domains. As per our findings, our hybrid model Instruct - DeBERTa is the best-performing model for the joint task of ATE and ASC on both the SemEval restaurant 2014 and SemEval laptop 2014 datasets. By addressing the limitations of existing methodologies, our approach provides a robust solution for understanding detailed consumer feedback, thus offering valuable insights for businesses aiming to enhance customer satisfaction and product development.
https://arxiv.org/abs/2408.13202
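The hybrid design is a two-stage pipeline: aspect term extraction, then per-aspect sentiment classification. A minimal sketch using Hugging Face pipelines, with placeholder checkpoint names; the "text [SEP] aspect" input convention is an assumption, so check the format your ASC model expects:

```python
from transformers import pipeline

# Placeholder checkpoints; substitute the ATE and ASC models you actually use.
ate = pipeline("token-classification", model="my-org/ate-model",
               aggregation_strategy="simple")
asc = pipeline("text-classification", model="my-org/absa-model")

def absa(text):
    """Stage 1: extract aspect terms. Stage 2: classify sentiment per aspect."""
    aspects = [ent["word"] for ent in ate(text)]
    return {a: asc(f"{text} [SEP] {a}")[0]["label"] for a in aspects}

# absa("The battery life is great but the screen scratches easily.")
```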
$K$-nearest neighbor language models ($k$NN-LMs), which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling as well as downstream NLP benchmarks. These results have led researchers to argue that models trained on poor quality or outdated data could perform well by employing a $k$NN extension that has access to a higher-quality datastore. In this work, we ask whether this improved ability to recall information really translates into downstream abilities. We extensively evaluate $k$NN-LMs on a diverse set of tasks, ranging from sentiment classification and commonsense reasoning to multi-hop reasoning. Results show that $k$NN-LMs excel at memory-intensive tasks, where utilizing the patterns in the input is sufficient for determining the output, but struggle with reasoning tasks that require integrating multiple pieces of information to derive new knowledge. We further demonstrate through oracle experiments and qualitative analysis that even with perfect retrieval, $k$NN-LMs still fail to determine the correct answers, placing an upper bound on their reasoning performance. Code and datastores are released at this https URL.
https://arxiv.org/abs/2408.11815
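The core kNN-LM mechanism interpolates the LM's softmax with a distribution built from retrieved neighbors, p(y|x) = λ·p_kNN(y|x) + (1−λ)·p_LM(y|x). A minimal sketch with a hypothetical mixing weight; the retrieval itself (e.g., a nearest-neighbor lookup over the datastore) is assumed already done:

```python
import numpy as np

def knn_interpolate(p_lm, neighbor_dists, neighbor_tokens, vocab_size, lam=0.25):
    """Blend the LM distribution with a kNN distribution in which closer
    neighbors contribute more mass to the tokens they continue with."""
    w = np.exp(-neighbor_dists + neighbor_dists.min())  # stable softmax over -distance
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    for tok, wi in zip(neighbor_tokens, w):
        p_knn[tok] += wi            # neighbors voting for the same token accumulate
    return lam * p_knn + (1 - lam) * p_lm
```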
The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with $4$ sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose $14$ baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of $69.8\%$ and an F1 score of $69.1\%$ on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.
https://arxiv.org/abs/2408.08964
In this study, we implement a novel BERT architecture for multitask fine-tuning on three downstream tasks: sentiment classification, paraphrase detection, and semantic textual similarity prediction. Our model, Multitask BERT, incorporates layer sharing and a triplet architecture, custom sentence pair tokenization, loss pairing, and gradient surgery. Such optimizations yield a 0.516 sentiment classification accuracy, 0.886 paraphrase detection accuracy, and 0.864 semantic textual similarity correlation on test data. We also apply generative adversarial learning to BERT, constructing a conditional generator model that maps from latent space to create fake embeddings in $\mathbb{R}^{768}$. These fake embeddings are concatenated with real BERT embeddings and passed into a discriminator model for auxiliary classification. Using this framework, which we refer to as AC-GAN-BERT, we conduct semi-supervised sensitivity analyses to investigate the effect of increasing amounts of unlabeled training data on AC-GAN-BERT's test accuracy. Overall, aside from implementing a high-performing multitask classification system, our novelty lies in the application of adversarial learning to construct a generator that mimics BERT. We find that the conditional generator successfully produces rich embeddings with clear spatial correlation with class labels, demonstrating avoidance of mode collapse. Our findings validate the GAN-BERT approach and point to future directions of generator-aided knowledge distillation.
https://arxiv.org/abs/2408.15265
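The generator/discriminator pair described above maps latent noise plus a class label to a fake 768-d embedding and scores embeddings with an auxiliary class head. A minimal PyTorch sketch with illustrative layer sizes and class count, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Maps (latent noise, class label) to a fake 768-d BERT-like embedding."""
    def __init__(self, latent_dim=100, n_classes=5, emb_dim=768):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, emb_dim),
        )

    def forward(self, z, labels):
        return self.net(z * self.label_emb(labels))  # condition by scaling the noise

class AuxiliaryDiscriminator(nn.Module):
    """Real/fake score plus an auxiliary class prediction (AC-GAN style)."""
    def __init__(self, emb_dim=768, n_classes=5):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(emb_dim, 256), nn.LeakyReLU(0.2))
        self.adv_head = nn.Linear(256, 1)            # real vs. fake
        self.cls_head = nn.Linear(256, n_classes)    # auxiliary classifier

    def forward(self, emb):
        h = self.body(emb)
        return self.adv_head(h), self.cls_head(h)

# fake = ConditionalGenerator()(torch.randn(8, 100), torch.randint(0, 5, (8,)))
# batch = torch.cat([real_bert_embeddings, fake])   # real + fake pass into D
```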
The mental health assessment of middle school students has always been one of the focuses in the field of education. This paper introduces a new BERT-based ensemble learning network, which enhances model performance by integrating multiple classifiers. We train a range of BERT-based learners and combine them using majority voting. We collect social network text data of middle school students from China's Weibo platform and apply the method to the task of classifying emotional tendencies in middle school students' social network texts. Experimental results suggest that the ensemble learning network outperforms the base model; however, an ensemble of three single-layer BERT models performs roughly on par with a single three-layer BERT model while requiring 11.58% more training time. Therefore, in terms of balancing prediction quality and efficiency, the deeper BERT network should be preferred for training. For interpretability, however, network ensembles can provide acceptable solutions.
https://arxiv.org/abs/2408.04849
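The majority-voting combination is straightforward; a minimal sketch (tie-breaking here falls to whichever label was seen first, which is one of several reasonable choices):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of per-model label lists, all the same length."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# e.g. three BERT-based learners voting on four samples:
print(majority_vote([["pos", "neg", "pos", "neu"],
                     ["pos", "pos", "pos", "neg"],
                     ["neg", "neg", "pos", "neu"]]))  # ['pos', 'neg', 'pos', 'neu']
```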
How can we define visual sentiment when viewers systematically disagree on their perspectives? This study introduces a novel approach to visual sentiment analysis by integrating attitudinal differences into visual sentiment classification. Recognizing that societal divides, such as partisan differences, heavily influence sentiment labeling, we developed a dataset that reflects these divides. We then trained a deep learning multi-task multi-class model to predict visual sentiment from different ideological viewpoints. Applied to immigration-related images, our approach captures perspectives from both Democrats and Republicans. By incorporating diverse perspectives into the labeling and model training process, our strategy addresses the limitation of label ambiguity and demonstrates improved accuracy in visual sentiment predictions. Overall, our study advocates for a paradigm shift in decoding visual sentiment toward creating classifiers that more accurately reflect the sentiments generated by humans.
https://arxiv.org/abs/2408.04103
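A multi-task multi-class setup like the one above can be realized as a shared image backbone with one sentiment head per ideological viewpoint. A minimal PyTorch sketch with an illustrative backbone and head configuration, not the paper's exact model:

```python
import torch.nn as nn
from torchvision import models

class ViewpointSentimentNet(nn.Module):
    """Shared image backbone with one sentiment head per viewpoint.
    Training would sum a cross-entropy term per head, using whichever
    viewpoint labels are available for each image."""
    def __init__(self, n_classes=3, viewpoints=("democrat", "republican")):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()               # expose 512-d features
        self.backbone = backbone
        self.heads = nn.ModuleDict({v: nn.Linear(512, n_classes) for v in viewpoints})

    def forward(self, images):
        h = self.backbone(images)
        return {v: head(h) for v, head in self.heads.items()}
```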
As NLP models become increasingly integral to decision-making processes, the need for explainability and interpretability has become paramount. In this work, we propose a framework that achieves this by generating semantically edited inputs, known as counterfactual interventions, which change the model prediction, thus providing a form of counterfactual explanation for the model. We test our framework on two NLP tasks - binary sentiment classification and topic classification - and show that the generated edits are contrastive, fluent, and minimal, while the whole process remains significantly faster than other state-of-the-art counterfactual editors.
https://arxiv.org/abs/2408.01969
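To ground the idea of a counterfactual intervention, here is a generic greedy substitution search for a minimal edit that flips a binary classifier. The paper's editor generates edits with a language model; the fixed substitute pool below is an assumption for illustration:

```python
def greedy_counterfactual(words, prob_pos, substitutes, max_edits=3):
    """Greedily apply the substitution that most moves the classifier toward
    the opposite label until the label flips (binary task).
    prob_pos: tokens -> P(positive); substitutes: word -> candidate words."""
    words = list(words)
    target_pos = prob_pos(words) < 0.5           # flip toward the other class
    for _ in range(max_edits):
        base = prob_pos(words)
        best_gain, best_edit = 0.0, None
        for i, w in enumerate(words):
            for cand in substitutes.get(w, []):
                trial = words[:i] + [cand] + words[i + 1:]
                gain = (prob_pos(trial) - base) if target_pos else (base - prob_pos(trial))
                if gain > best_gain:
                    best_gain, best_edit = gain, (i, cand)
        if best_edit is None:
            return None                          # no useful substitution left
        i, cand = best_edit
        words[i] = cand
        if (prob_pos(words) >= 0.5) == target_pos:
            return words                         # minimal contrastive edit found
    return None
```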
Transparency in AI decision-making is crucial in healthcare due to the severe consequences of errors, and this transparency is important for building trust between AI systems and users in sentiment analysis tasks. Incorporating reasoning capabilities helps Large Language Models (LLMs) understand human emotions within broader contexts, handle nuanced and ambiguous language, and infer underlying sentiments that may not be explicitly stated. In this work, we introduce a new task - Sentiment Reasoning - for both speech and text modalities, along with our proposed multimodal multitask framework and dataset. Our study showed that rationale-augmented training enhances model performance in sentiment classification across both human transcript and ASR settings. Also, we found that the generated rationales typically exhibit different vocabularies compared to human-generated rationales, but maintain similar semantics. All code, data (English-translated and Vietnamese) and models are published online: this https URL
https://arxiv.org/abs/2407.21054
Sentiment analysis (SA) is a natural language processing (NLP) approach for determining a text's emotional tone by analyzing subjective information such as views, feelings, and attitudes toward specific topics, products, services, events, or experiences. This study attempts to develop an advanced deep learning (DL) model for SA to understand global audience emotions through tweets in the context of the Olympic Games. The findings represent global attitudes around the Olympics and contribute to advancing SA models. We use NLP for tweet pre-processing and sophisticated DL models for SA; this enhances the reliability and accuracy of sentiment classification. The study focuses on data selection, preprocessing, visualization, feature extraction, and model building, featuring a baseline Naïve Bayes (NB) model and three advanced DL models: Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and Bidirectional Encoder Representations from Transformers (BERT). The results of the experiments show that the BERT model can efficiently classify sentiments related to the Olympics, achieving the highest accuracy of 99.23%.
https://arxiv.org/abs/2407.12376
Stakeholders' needs in sentiment analysis for various issues, whether positive or negative, are speed and accuracy. One new challenge in sentiment analysis tasks is limited training data, which often leads to suboptimal machine learning models and poor performance on test data. This paper discusses the problem of classifying text based on limited training data (300 to 600 samples) into three classes: positive, negative, and neutral. A benchmark dataset is provided for training and testing data on the issue of Kaesang Pangarep's appointment as Chairman of PSI. External data for aggregation and augmentation purposes are provided, consisting of two datasets: the topic of Covid Vaccination sentiment and an open topic. The official score used is the F1-score, which balances precision and recall among the three classes. A baseline score is provided as a reference for researchers for unoptimized classification methods, and an optimized score as a reference for the target to be achieved by any proposed method. Both scores (baseline and optimized) are obtained with the SVM method, which is widely reported as the state-of-the-art among conventional machine learning methods. The F1-scores achieved by the baseline and optimized methods are 40.83% and 51.28%, respectively.
https://arxiv.org/abs/2407.05627
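A minimal sketch of the SVM-plus-macro-F1 scoring setup described above, with toy stand-ins for the benchmark's train/test splits (the real benchmark data differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

# Toy stand-ins for the benchmark splits.
train_texts = ["kerja bagus", "sangat buruk", "biasa saja", "keputusan tepat"]
train_labels = ["positive", "negative", "neutral", "positive"]
test_texts = ["buruk sekali", "bagus sekali", "tidak ada pendapat"]
test_labels = ["negative", "positive", "neutral"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
pred = clf.predict(test_texts)
# Macro F1 averages per-class F1 equally across positive/negative/neutral.
print(f1_score(test_labels, pred, average="macro",
               labels=["positive", "negative", "neutral"]))
```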
In this study, we explore the application of sentiment analysis on financial news headlines to understand investor sentiment. By leveraging Natural Language Processing (NLP) and Large Language Models (LLM), we analyze sentiment from the perspective of retail investors. The FinancialPhraseBank dataset, which contains categorized sentiments of financial news headlines, serves as the basis for our analysis. We fine-tuned several models, including distilbert-base-uncased, Llama, and gemma-7b, to evaluate their effectiveness in sentiment classification. Our experiments demonstrate that the fine-tuned gemma-7b model outperforms others, achieving the highest precision, recall, and F1 score. Specifically, the gemma-7b model showed significant improvements in accuracy after fine-tuning, indicating its robustness in capturing the nuances of financial sentiment. This model can be instrumental in providing market insights, risk management, and aiding investment decisions by accurately predicting the sentiment of financial news. The results highlight the potential of advanced LLMs in transforming how we analyze and interpret financial information, offering a powerful tool for stakeholders in the financial industry.
https://arxiv.org/abs/2406.13626
Large Language Models (LLMs) are valuable for text classification, but their vulnerabilities must not be disregarded. They lack robustness against adversarial examples, so it is pertinent to understand the impacts of different types of perturbations and to assess whether those attacks could be replicated by common users with a small number of perturbations and a small number of queries to a deployed LLM. This work presents an analysis of the effectiveness, efficiency, and practicality of three different types of adversarial attacks against five different LLMs in a sentiment classification task. The obtained results demonstrated the very distinct impacts of the word-level and character-level attacks. The word-level attacks were more effective, but the character-level and more constrained attacks were more practical and required fewer perturbations and queries. These differences need to be considered during the development of adversarial defense strategies to train more robust LLMs for intelligent text classification applications.
https://arxiv.org/abs/2406.08050
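As a concrete illustration of the query/perturbation budget framing, a toy character-level black-box attack that counts queries until the predicted label flips; this is far simpler than the attack suites evaluated in the paper:

```python
import random

def char_attack(text, predict_label, max_queries=50, seed=0):
    """Randomly perturb one character at a time and count queries to the
    deployed classifier until its label flips (or the budget runs out)."""
    rng = random.Random(seed)
    original = predict_label(text)
    for query in range(1, max_queries + 1):
        i = rng.randrange(len(text))
        perturbed = text[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz ") + text[i + 1:]
        if predict_label(perturbed) != original:
            return perturbed, query      # attack succeeded after `query` queries
    return None, max_queries
```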
In the contemporary era, social media platforms amass an extensive volume of social data contributed by their users. To promptly grasp the opinions and emotional inclinations of individuals regarding a product or event, it becomes imperative to perform sentiment analysis on the user-generated content. Microblog comments often encompass both lengthy and concise text entries, presenting a complex scenario. This complexity is particularly pronounced in extensive textual content due to its rich content and intricate word interrelations compared to shorter text entries. Sentiment analysis of public opinion shared on social networking websites such as Facebook or Twitter has evolved and found diverse applications. However, several challenges remain to be tackled in this field. Hybrid methodologies have emerged as promising models for mitigating sentiment analysis errors, particularly when dealing with progressively intricate training data. In this article, to investigate hesitancy toward COVID-19 vaccination, we propose eight different hybrid deep learning models for sentiment classification with the aim of improving the overall accuracy of the model. Sentiment prediction is achieved using embeddings, a deep learning model, and a grid search algorithm on a Twitter COVID-19 dataset. According to the study, public sentiment towards COVID-19 immunization appears to be improving with time, as evidenced by the gradual decline in vaccine reluctance. Through extensive evaluation, the proposed model achieved an accuracy of 98.86%, outperforming the other models. Specifically, the combination of BERT, CNN, and GS yields the highest accuracy, while the combination of GloVe, BiLSTM, CNN, and GS follows closely behind with an accuracy of 98.17%. In addition, the proposed models report accuracy gains of 2.11% to 14.46% over existing works.
https://arxiv.org/abs/2406.10266
Explainable AI (XAI) algorithms aim to help users understand how a machine learning model makes predictions. To this end, many approaches explain which input features are most predictive of a target label. However, such explanations can still be puzzling to users (e.g., in product reviews, the word "problems" is predictive of positive sentiment). If left unexplained, puzzling explanations can have negative impacts. Explaining unintuitive associations between an input feature and a target label is an underexplored area in XAI research. We make an initial effort in this direction using unintuitive associations learned by sentiment classifiers as a case study. We propose approaches for (1) automatically detecting associations that can appear unintuitive to users and (2) generating explanations to help users understand why an unintuitive feature is predictive. Results from a crowdsourced study (N=300) found that our proposed approaches can effectively detect and explain predictive but unintuitive features in sentiment classification.
https://arxiv.org/abs/2406.03594
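A crude version of the detection step flags features whose learned polarity contradicts a sentiment lexicon. The sketch below assumes a linear classifier and a tiny hypothetical lexicon; the paper's detector is more elaborate:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical lexicon of expected polarities (+1 positive, -1 negative).
LEXICON = {"problems": -1, "great": +1, "broken": -1}

def unintuitive_features(texts, labels):
    """Fit a linear classifier (labels: 1 = positive) and flag lexicon words
    whose learned coefficient sign disagrees with their expected polarity."""
    vec = CountVectorizer()
    X = vec.fit_transform(texts)
    clf = LogisticRegression().fit(X, labels)
    flags = []
    for word, expected in LEXICON.items():
        if word in vec.vocabulary_:
            learned = clf.coef_[0][vec.vocabulary_[word]]
            if learned * expected < 0:           # signs disagree => unintuitive
                flags.append((word, learned, expected))
    return flags
```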
The explainability of recommender systems has attracted significant attention in academia and industry. Many efforts have been made for explainable recommendations, yet evaluating the quality of the explanations remains a challenging and unresolved issue. In recent years, leveraging LLMs as evaluators presents a promising avenue in Natural Language Processing tasks (e.g., sentiment classification, information extraction), as they demonstrate strong capabilities in instruction following and common-sense reasoning. However, evaluating recommendation explanatory texts is different from these NLG tasks, as its criteria are related to human perceptions and are usually subjective. In this paper, we investigate whether LLMs can serve as evaluators of recommendation explanations. To answer the question, we utilize real user feedback on explanations given from previous work and additionally collect third-party annotations and LLM evaluations. We design and apply a 3-level meta evaluation strategy to measure the correlation between evaluator labels and the ground truth provided by users. Our experiments reveal that LLMs, such as GPT4, can provide comparable evaluations with appropriate prompts and settings. We also provide further insights into combining human labels with the LLM evaluation process and utilizing ensembles of multiple heterogeneous LLM evaluators to enhance the accuracy and stability of evaluations. Our study verifies that utilizing LLMs as evaluators can be an accurate, reproducible and cost-effective solution for evaluating recommendation explanation texts. Our code is available at this https URL.
https://arxiv.org/abs/2406.03248
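The meta-evaluation boils down to correlating evaluator labels with user-provided ground truth. A minimal sketch with toy scores, reporting the correlation statistics such studies commonly use:

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

# Toy stand-ins: per-explanation quality scores.
user_scores = [5, 3, 4, 1, 2, 4]   # ground truth from real users
llm_scores = [4, 3, 5, 2, 1, 4]    # scores assigned by an LLM evaluator

for name, fn in [("pearson", pearsonr), ("spearman", spearmanr), ("kendall", kendalltau)]:
    stat, p = fn(user_scores, llm_scores)
    print(f"{name}: r={stat:.3f} (p={p:.3f})")
```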
Sentiment analysis, an increasingly vital field in both academia and industry, plays a pivotal role in machine learning applications, particularly on social media platforms like Reddit. However, the efficacy of sentiment analysis models is hindered by the lack of expansive and fine-grained emotion datasets. To address this gap, our study leverages the GoEmotions dataset, comprising a diverse range of emotions, to evaluate sentiment analysis methods across a substantial corpus of 58,000 comments. Distinguished from prior studies by the Google team, which limited their analysis to only two models, our research expands the scope by evaluating a diverse array of models. We investigate the performance of traditional classifiers such as Naive Bayes and Support Vector Machines (SVM), as well as state-of-the-art transformer-based models including BERT, RoBERTa, and GPT. Furthermore, our evaluation criteria extend beyond accuracy to encompass nuanced assessments, including hierarchical classification based on varying levels of granularity in emotion categorization. Additionally, considerations such as computational efficiency are incorporated to provide a comprehensive evaluation framework. Our findings reveal that the RoBERTa model consistently outperforms the baseline models, demonstrating superior accuracy in fine-grained sentiment classification tasks. This underscores the substantial potential and significance of the RoBERTa model in advancing sentiment analysis capabilities.
https://arxiv.org/abs/2405.16810
Multimodal aspect-based sentiment analysis (MABSA) aims to understand opinions in a granular manner, advancing human-computer interaction and other fields. Traditionally, MABSA methods use a joint prediction approach to identify aspects and sentiments simultaneously. However, we argue that joint models are not always superior. Our analysis shows that joint models struggle to align relevant text tokens with image patches, leading to misalignment and ineffective image utilization. In contrast, a pipeline framework first identifies aspects through MATE (Multimodal Aspect Term Extraction) and then aligns these aspects with image patches for sentiment classification (MASC: Multimodal Aspect-Oriented Sentiment Classification). This method is better suited for multimodal scenarios where effective image use is crucial. We present three key observations: (a) MATE and MASC have different feature requirements, with MATE focusing on token-level features and MASC on sequence-level features; (b) the aspect identified by MATE is crucial for effective image utilization; and (c) images play a trivial role in previous MABSA methods due to high noise. Based on these observations, we propose a pipeline framework that first predicts the aspect and then uses translation-based alignment (TBA) to enhance multimodal semantic consistency for better image utilization. Our method achieves state-of-the-art (SOTA) performance on widely used MABSA datasets Twitter-15 and Twitter-17. This demonstrates the effectiveness of the pipeline approach and its potential to provide valuable insights for future MABSA research. For reproducibility, the code and checkpoint will be released.
https://arxiv.org/abs/2406.00017
In machine learning, temporal shifts occur when there are differences between training and test splits in terms of time. For streaming data such as news or social media, models are commonly trained on a fixed corpus from a certain period of time, and they can become obsolete due to the dynamism and evolving nature of online content. This paper focuses on temporal shifts in social media and, in particular, Twitter. We propose a unified evaluation scheme to assess the performance of language models (LMs) under temporal shift on standard social media tasks. LMs are tested on five diverse social media NLP tasks under different temporal settings, which revealed two important findings: (i) the decrease in performance under temporal shift is consistent across different models for entity-focused tasks such as named entity recognition or disambiguation, and hate speech detection, but not significant in the other tasks analysed (i.e., topic and sentiment classification); and (ii) continuous pre-training on the test period does not improve the temporal adaptability of LMs.
https://arxiv.org/abs/2405.13017
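The evaluation protocol itself is simple to sketch: train on an earlier period, test on a later one, and compare against a random in-period split. A minimal illustration with a linear model standing in for the LMs and hypothetical column names:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def temporal_shift_eval(df, cutoff="2021-01-01"):
    """Train on posts before `cutoff`, test on posts from `cutoff` onward.
    Comparing this score against a random in-period split exposes the shift."""
    df = df.assign(created_at=pd.to_datetime(df["created_at"]))
    train, test = df[df["created_at"] < cutoff], df[df["created_at"] >= cutoff]
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(train["text"]), train["label"])
    return accuracy_score(test["label"], clf.predict(vec.transform(test["text"])))
```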