Effective cross-lingual transfer remains a critical challenge in scaling the benefits of large language models from high-resource to low-resource languages. Towards this goal, prior studies have explored many approaches to combine task knowledge from task-specific data in a (high-resource) source language and language knowledge from unlabeled text in a (low-resource) target language. One notable approach proposed composable sparse fine-tuning (SFT) for cross-lingual transfer that learns task-specific and language-specific sparse masks to select a subset of the pretrained model's parameters that are further fine-tuned. These sparse fine-tuned vectors (SFTs) are subsequently composed with the pretrained model to facilitate zero-shot cross-lingual transfer to a task in a target language, using only task-specific data from a source language. These sparse masks for SFTs were identified using simple magnitude-based pruning. In our work, we introduce DeFT-X, a novel composable SFT approach that uses singular value decomposition to denoise the weight matrices of a pretrained model before magnitude pruning, thus yielding more robust SFTs. We evaluate DeFT-X on a diverse set of extremely low-resource languages for sentiment classification (NusaX) and natural language inference (AmericasNLI) and demonstrate that it performs on par with or outperforms SFT and other prominent cross-lingual transfer baselines.
有效的跨语言迁移仍然是扩大大型语言模型从高资源语言到低资源语言效益的关键挑战。为了解决这个问题,先前的研究探索了多种方法来结合任务特定数据(在高资源源语言中)中的任务知识和未标记文本(在低资源目标语言中)中的语言知识。其中一种值得注意的方法是提出了可组合的稀疏微调(SFT),用于跨语言迁移,这种方法学习任务特定和语言特定的稀疏掩码来选择预训练模型参数的一个子集进行进一步的微调。这些稀疏微调向量(SFTs)随后与预训练模型结合使用,以促进仅通过源语言的任务特定数据就实现零样本跨语言迁移到目标语言中的任务。用于SFTs的稀疏掩码是通过简单的基于幅度的剪枝来识别的。 在我们的工作中,我们引入了DeFT-X,这是一种新的可组合SFT方法,在进行幅度剪枝之前使用奇异值分解对预训练模型的权重矩阵进行去噪处理,从而产生更健壮的SFTs。我们在多样化的极度低资源语言(用于情感分类NusaX和自然语言推理AmericasNLI)上评估了DeFT-X,并证明它在性能上与SFT和其他显着的跨语言迁移基线相当或优于它们。
https://arxiv.org/abs/2505.15090
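The magnitude-based mask selection mentioned in the abstract above can be illustrated with a small sketch. This is not the DeFT-X implementation: the function names are hypothetical, parameters are flattened into plain lists, ranking by the absolute change between fine-tuned and pretrained values is an assumption, and composing by addition is also an assumption. The SVD denoising step that DeFT-X applies before pruning is not shown.

```python
def magnitude_mask(pretrained, finetuned, k):
    """Return a 0/1 mask keeping the k parameters that changed the most.

    Assumption: "magnitude" means |finetuned - pretrained| per parameter.
    """
    deltas = [abs(f - p) for p, f in zip(pretrained, finetuned)]
    # Indices of the k largest absolute changes; ties keep the earlier index.
    ranked = sorted(range(len(deltas)), key=lambda i: -deltas[i])
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(deltas))]


def sparse_difference(pretrained, finetuned, mask):
    """Difference vector that is zero outside the mask (a sparse update)."""
    return [(f - p) * m for p, f, m in zip(pretrained, finetuned, mask)]


def compose(pretrained, *sparse_vectors):
    """Add one or more sparse updates to the pretrained parameters.

    Assumption: composition is element-wise addition.
    """
    out = list(pretrained)
    for vec in sparse_vectors:
        out = [o + v for o, v in zip(out, vec)]
    return out
```

In this toy setting, a task update and a language update would each be produced by `sparse_difference` and then passed together to `compose`.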
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
这项研究提出了一种评估大型语言模型(LLM)二元文本分类一致性的框架,旨在解决现有可靠性评估方法不足的问题。通过借鉴心理测量学原理,我们确定了样本量的需求,开发了无效响应的度量标准,并对评分者内和评分者间的可靠性进行了评价。我们的案例研究考察了14个大型语言模型(包括claude-3-7-sonnet、gpt-4o、deepseek-r1、gemma3、llama3.2、phi4和command-r-plus)在财务新闻情绪分类方面的表现,每个模型重复进行五次测试,共分析了1,350篇文章。各模型展示了较高的评分者内一致性,在90%-98%的样本中实现了完全一致的意见,并且来自同一家族的昂贵型和经济型模型之间差异很小。当与StockNewsAPI标签验证时,这些模型表现出色(准确率在0.76到0.88之间),其中较小规模的模型如gemma3:1B、llama3.2:3B以及claude-3-5-haiku的表现优于其大型同类。所有模型在预测实际市场走势时表现均接近随机水平,这表明任务难度而非模型能力限制了性能。我们的框架为LLM选择、样本量规划和可靠性评估提供了系统性指导,帮助组织优化资源以应对分类任务的挑战。
https://arxiv.org/abs/2505.14918
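As a rough illustration of the two reliability quantities named in the abstract above (perfect agreement across replicates and invalid responses), the following sketch computes both from lists of labels. The function names, the label strings, and the data layout (one list of labels per replicate) are assumptions for illustration, not the paper's code.

```python
def perfect_agreement_rate(replicates):
    """Share of items on which every replicate run produced the same label.

    replicates: list of runs; each run is a list of labels for the same
    items in the same order.
    """
    n_items = len(replicates[0])
    agreeing = sum(
        1 for i in range(n_items)
        if len({run[i] for run in replicates}) == 1
    )
    return agreeing / n_items


def invalid_response_rate(labels, valid_labels):
    """Share of responses that are not one of the allowed class labels."""
    invalid = sum(1 for label in labels if label not in valid_labels)
    return invalid / len(labels)
```

With five replicates per model, `replicates` would hold five lists, and the reported 90-98% range corresponds to values of `perfect_agreement_rate` between 0.90 and 0.98.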
Fine-grained sentiment analysis (FGSA) aims to identify sentiment polarity toward specific aspects within a text, enabling more precise opinion mining in domains such as product reviews and social media. However, traditional FGSA approaches often require task-specific architectures and extensive annotated data, limiting their generalization and scalability. To address these challenges, we propose PL-FGSA, a unified prompt learning-based framework implemented using the MindSpore platform, which integrates prompt design with a lightweight TextCNN backbone. Our method reformulates FGSA as a multi-task prompt-augmented generation problem, jointly tackling aspect extraction, sentiment classification, and causal explanation in a unified paradigm. By leveraging prompt-based guidance, PL-FGSA enhances interpretability and achieves strong performance under both full-data and low-resource conditions. Experiments on three benchmark datasets (SST-2, SemEval-2014 Task 4, and MAMS) demonstrate that our model consistently outperforms traditional fine-tuning methods and achieves F1-scores of 0.922, 0.694, and 0.597, respectively. These results validate the effectiveness of prompt-based generalization and highlight the practical value of PL-FGSA for real-world sentiment analysis tasks.
细粒度情感分析(FGSA)旨在识别文本中针对特定方面的情感极性,从而在产品评论和社会媒体等领域实现更精确的意见挖掘。然而,传统的FGSA方法通常需要特定任务的架构和大量的注释数据,这限制了它们的泛化能力和可扩展性。为了解决这些挑战,我们提出了PL-FGSA,这是一个基于提示学习的统一框架,使用MindSpore平台构建,并结合了轻量级的TextCNN骨干网络。我们的方法将FGSA重新表述为一个多任务提示增强生成问题,在一个统一的范式中同时处理方面提取、情感分类和因果解释。通过利用基于提示的指导,PL-FGSA增强了可解释性,并在全数据集和低资源条件下均表现出色。 我们在三个基准数据集上进行了实验——SST-2、SemEval-2014 Task 4 和 MAMS,结果表明我们的模型始终优于传统的微调方法,并分别取得了F1分数为0.922、0.694 和 0.597 的成绩。这些结果验证了基于提示泛化的有效性,并突显了PL-FGSA在现实世界情感分析任务中的实际价值。
https://arxiv.org/abs/2505.14165
We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.
我们介绍了PersonaConvBench,这是一个大规模基准测试系统,用于评估大型语言模型(LLM)在多轮对话中个性化推理和生成的能力。与现有研究仅专注于个人化或对话结构不同,PersonaConvBench将两者结合起来,并提供三项核心任务:句子分类、影响回归以及跨十个基于Reddit的多样化领域进行以用户为中心的文本生成。这种设计能够系统地分析个性化的对话背景如何塑造LLM在现实多用户场景中的输出结果。 我们在统一的提示设置下对几种商用和开源的大型语言模型进行了基准测试,并观察到将个性化历史记录纳入考量可带来显著性能提升,包括在情感分类上比最佳非对话基线高出198%的相对增益。通过发布PersonaConvBench及其评估代码,我们的目标是支持研究那些能够适应个人风格、追踪长期背景信息并生成丰富且引人入胜响应的大型语言模型的研究工作。
https://arxiv.org/abs/2505.14106
The emergence of global health crises, such as COVID-19 and Monkeypox (mpox), has underscored the importance of understanding public sentiment to inform effective public health strategies. This study conducts a comparative sentiment analysis of public perceptions surrounding COVID-19 and mpox by leveraging extensive datasets of 147,475 and 106,638 tweets, respectively. Advanced machine learning models, including Logistic Regression, Naive Bayes, RoBERTa, DistilRoBERTa and XLNet, were applied to perform sentiment classification, with results indicating key trends in public emotion and discourse. The analysis highlights significant differences in public sentiment driven by disease characteristics, media representation, and pandemic fatigue. Through the lens of sentiment polarity and thematic trends, this study offers valuable insights into tailoring public health messaging, mitigating misinformation, and fostering trust during concurrent health crises. The findings contribute to advancing sentiment analysis applications in public health informatics, setting the groundwork for enhanced real-time monitoring and multilingual analysis in future research.
全球健康危机的出现,例如COVID-19和猴痘(mpox),强调了了解公众情绪以制定有效的公共卫生策略的重要性。本研究通过对包含147,475条和106,638条推文的大规模数据集进行比较性情感分析,探究了公众对COVID-19和mpox的看法。该研究应用了包括逻辑回归、朴素贝叶斯、RoBERTa、DistilRoBERTa和XLNet在内的先进机器学习模型来进行情感分类,并揭示出关键的情绪和话语趋势。分析结果显示,疾病特性、媒体报道以及疫情疲劳导致了公众情绪的重大差异。通过情感极性和主题趋势的视角,本研究为定制公共卫生信息传递、减少错误信息传播以及在并行健康危机中建立信任提供了宝贵的见解。 该研究成果推进了情感分析在公共卫生成像中的应用,并为进一步的研究奠定了基础,包括实时监测和多语言分析的改进。
https://arxiv.org/abs/2505.07430
Multi-domain sentiment classification aims to mitigate poor model performance caused by the scarcity of labeled data in a single domain, by utilizing labeled data from various domains. A series of models that jointly train domain classifiers and sentiment classifiers have demonstrated their advantages, because domain classification helps generate necessary information for sentiment classification. Intuitively, the importance of sentiment classification tasks is the same in all domains for multi-domain sentiment classification; but domain classification tasks are different because the impact of domain information on sentiment classification varies across different fields; this can be controlled through adjustable weights or hyperparameters. However, as the number of domains increases, existing hyperparameter optimization algorithms may face the following challenges: (1) tremendous demand for computing resources, (2) convergence problems, and (3) high algorithm complexity. To efficiently generate the domain information required for sentiment classification in each domain, we propose a dynamic information modulation algorithm. Specifically, the model training process is divided into two stages. In the first stage, a shared hyperparameter, which would control the proportion of domain classification tasks across all fields, is determined. In the second stage, we introduce a novel domain-aware modulation algorithm to adjust the domain information contained in the input text, which is then calculated based on a gradient-based and loss-based method. In summary, experimental results on a public sentiment analysis dataset containing 16 domains prove the superiority of the proposed method.
跨领域情感分类旨在通过利用来自不同领域的标记数据,来缓解单一领域中由于标注数据稀缺而导致的模型性能不佳问题。一系列同时训练领域分类器和情感分类器的模型已经展示了它们的优势,因为领域分类有助于生成情感分类所需的信息。直观来看,在多领域情感分类任务中,所有领域的感情分类任务的重要性是相同的;然而,不同领域的领域分类任务则是不同的,因为领域信息对情感分类的影响在各个领域间各不相同;这可以通过可调权重或超参数来控制。但是,随着领域的数量增加,现有的超参数优化算法可能会面临以下挑战:(1)计算资源的巨大需求、(2)收敛问题以及(3)算法复杂度高。 为了高效生成每个领域中情感分类所需的信息,我们提出了一种动态信息调节算法。具体来说,模型训练过程分为两个阶段。在第一阶段,确定一个共享的超参数,该超参数将控制跨所有领域的领域分类任务的比例。第二阶段引入一种新的基于领域的调节算法来调整输入文本中的领域信息,然后根据梯度和损失方法进行计算。 总之,在包含16个领域的公共情感分析数据集上的实验结果证明了所提出方法的优越性。
https://arxiv.org/abs/2505.06630
Financial sentiment analysis (FSA) presents unique challenges to LLMs that surpass those in typical sentiment analysis due to the nuanced language used in financial contexts. The prowess of these models is often undermined by the inherent subjectivity of sentiment classifications in existing benchmark datasets like Financial Phrasebank. These datasets typically feature undefined sentiment classes that reflect the highly individualized perspectives of annotators, leading to significant variability in annotations. This variability results in an unfair expectation for LLMs during benchmarking, where they are tasked to conjecture the subjective viewpoints of human annotators without sufficient context. In this paper, we introduce the Annotators' Instruction Assisted Prompt (AIAP), a novel evaluation prompt designed to redefine the task definition of FSA for LLMs. By integrating detailed task instructions originally intended for human annotators into the LLMs' prompt framework, AIAP aims to standardize the understanding of sentiment across both human and machine interpretations, providing a fair and context-rich foundation for sentiment analysis. We utilize a new dataset, WSBS, derived from the WallStreetBets subreddit to demonstrate how AIAP significantly enhances LLM performance by aligning machine operations with the refined task definitions. Experimental results demonstrate that AIAP enhances LLM performance significantly, with improvements up to 9.08. This context-aware approach not only yields incremental gains in performance but also introduces an innovative sentiment-indexing method utilizing model confidence scores. This method enhances stock price prediction models and extracts more value from the financial sentiment analysis, underscoring the significance of WSB as a critical source of financial text. Our research offers insights into improving FSA through better evaluation methods.
金融情感分析(FSA)对大型语言模型(LLM)提出了超出常规情感分析的独特挑战,这是因为金融环境中使用的语言具有复杂性。现有基准数据集如Financial Phrasebank中的情感分类本质上是主观的,这削弱了这些模型的能力。这些数据集中通常包含由标注者个体化视角反映、但未定义的情感类别,导致标注结果存在显著差异。这种变异性在对LLM进行基准测试时造成了不公平的期望,即要求模型推测缺乏足够背景信息的人类标注者的主观观点。 在这篇论文中,我们引入了“注释员指令辅助提示”(AIAP),这是一种重新定义FSA任务的新颖评估提示框架。通过将原本为人类注释者设计的详细任务说明整合到LLM的提示框架中,AIAP旨在标准化人机在情感理解上的共同认知,并提供一个更加公平且背景丰富的基础来进行情感分析。我们利用一个新的数据集WSBS(来自WallStreetBets subreddit的数据)来展示如何通过将机器操作与细化的任务定义对齐,AIAP显著提升了LLM的性能。实验结果表明,AIAP能够大幅度提高LLM的表现,最高可达9.08%的增长。 这种方法不仅带来了性能上的增量改进,还引入了一种新颖的情感指数方法,利用模型置信度分数来增强股票价格预测模型,并从金融情感分析中提取更多价值。这凸显了WSB作为重要金融文本来源的重要性。我们的研究为通过更好的评估方法提升FSA提供了见解。
https://arxiv.org/abs/2505.07871
Engagement between client and therapist is a critical determinant of therapeutic success. We propose a multi-dimensional natural language processing (NLP) framework that objectively classifies engagement quality in counseling sessions based on textual transcripts. Using 253 motivational interviewing transcripts (150 high-quality, 103 low-quality), we extracted 42 features across four domains: conversational dynamics, semantic similarity as topic alignment, sentiment classification, and question detection. Classifiers, including Random Forest (RF), Cat-Boost, and Support Vector Machines (SVM), were hyperparameter tuned and trained using a stratified 5-fold cross-validation and evaluated on a holdout test set. On balanced (non-augmented) data, RF achieved the highest classification accuracy (76.7%), and SVM achieved the highest AUC (85.4%). After SMOTE-Tomek augmentation, performance improved significantly: RF achieved up to 88.9% accuracy, 90.0% F1-score, and 94.6% AUC, while SVM reached 81.1% accuracy, 83.1% F1-score, and 93.6% AUC. The augmented data results reflect the potential of the framework in future larger-scale applications. Feature contribution revealed conversational dynamics and semantic similarity between clients and therapists were among the top contributors, led by words uttered by the client (mean and standard deviation). The framework was robust across the original and augmented datasets and demonstrated consistent improvements in F1 scores and recall. While currently text-based, the framework supports future multimodal extensions (e.g., vocal tone, facial affect) for more holistic assessments. This work introduces a scalable, data-driven method for evaluating engagement quality of the therapy session, offering clinicians real-time feedback to enhance the quality of both virtual and in-person therapeutic interactions.
客户与治疗师之间的参与度是心理治疗成功的关键决定因素。我们提出了一种基于文本转录的多维度自然语言处理(NLP)框架,以客观地分类咨询会话中的参与质量。使用了253份动机访谈记录(其中150份为高质量,103份为低质量),提取了涵盖四个方面共42个特征:对话动态、语义相似性作为主题一致性、情感分类和问题检测。 通过分层五折交叉验证进行超参数调优并训练随机森林(RF)、Cat-Boost和支持向量机(SVM)等分类器,然后在独立测试集上评估其性能。对于未增强的数据集,随机森林达到了最高的分类准确率(76.7%),而支持向量机则具有最高的AUC(85.4%)。经过SMOTE-Tomek增强后,模型表现显著提升:随机森林的最高准确率为88.9%,F1得分为90.0%,AUC为94.6%;支持向量机的准确性达到81.1%,F1得分83.1%,AUC为93.6%。增强数据集的结果反映了该框架在未来大规模应用中的潜力。 特征贡献分析表明,对话动态和客户与治疗师之间的语义相似性是最重要的特征之一,其中由客户说出的词汇(平均值和标准差)最为关键。此框架在原始和增强的数据集中表现稳健,并且在F1得分和召回率方面表现出持续改进的趋势。尽管目前该模型基于文本处理,但未来可以扩展为多模态形式(如声调、面部表情),以进行更全面的评估。 这项工作引入了一种可扩展、数据驱动的方法来评价治疗会话中参与质量,并为临床医生提供了实时反馈,有助于提升虚拟和面对面心理治疗互动的质量。
https://arxiv.org/abs/2505.06151
Sentiment classification, a complex task in natural language processing, becomes even more challenging when analyzing passages with multiple conflicting tones. Typically, longer passages exacerbate this issue, leading to decreased model performance. The aim of this paper is to introduce novel methodologies for isolating conflicting sentiments and aggregating them to effectively predict the overall sentiment of such passages. One of the aggregation strategies involves a Multi-Layer Perceptron (MLP) model which outperforms baseline models across various datasets, including Amazon, Twitter, and SST, while costing roughly 1/100 of what fine-tuning the baseline would take.
情感分类是自然语言处理中的一个复杂任务,当分析包含多种冲突情绪的段落时,该任务变得更加具有挑战性。通常情况下,较长的文本会使这个问题更加严重,并导致模型性能下降。本文旨在介绍一些新颖的方法来分离这些相互矛盾的情绪并将它们聚合起来,以有效预测此类文本的整体情感倾向。 其中一种聚合策略采用多层感知机(Multi-Layer Perceptron, MLP)模型,在亚马逊、推特和SST等不同数据集上均优于基准模型,并且其训练成本仅为微调基准模型的约1/100。
https://arxiv.org/abs/2505.06320
We explore the use of Chain-of-Thought (CoT) prompting with large language models (LLMs) to improve the accuracy of granular sentiment categorization in app store reviews. Traditional numeric and polarity-based ratings often fail to capture the nuanced sentiment embedded in user feedback. We evaluated the effectiveness of CoT prompting versus simple prompting on 2000 Amazon app reviews by comparing each method's predictions to human judgements. CoT prompting improved classification accuracy from 84% to 93%, highlighting the benefit of explicit reasoning in enhancing sentiment analysis performance.
我们探讨了在大型语言模型(LLM)中使用链式思维(CoT)提示,以提高应用商店评论中的细粒度情感分类的准确性。传统的数值和极性评分方法往往无法捕捉到用户反馈中蕴含的细微情感。我们通过将每种方法的预测与人类判断进行比较,在2000条亚马逊应用评论上评估了CoT提示相较于简单提示的有效性。结果显示,CoT提示使分类准确率从84%提高到了93%,突显了明确推理在提升情感分析性能方面的优势。
https://arxiv.org/abs/2505.04135
We present a framework for large-scale sentiment and topic analysis of Twitter discourse. Our pipeline begins with targeted data collection using conflict-specific keywords, followed by automated sentiment labeling via multiple pre-trained models to improve annotation robustness. We examine the relationship between sentiment and contextual features such as timestamp, geolocation, and lexical content. To identify latent themes, we apply Latent Dirichlet Allocation (LDA) on partitioned subsets grouped by sentiment and metadata attributes. Finally, we develop an interactive visualization interface to support exploration of sentiment trends and topic distributions across time and regions. This work contributes a scalable methodology for social media analysis in dynamic geopolitical contexts.
我们提出了一种针对大规模Twitter言论情感和主题分析的框架。我们的数据处理流程首先通过使用冲突特定关键词进行目标化数据收集,随后利用多个预训练模型自动标注情感标签以增强注释的稳健性。接下来,我们研究了情感与上下文特征(如时间戳、地理位置和词汇内容)之间的关系。为了识别潜在的主题,我们在按情感和元数据属性分组的分区子集中应用潜在狄利克雷分配(LDA)。最后,我们开发了一个交互式可视化界面,以支持对不同时段和地区的情感趋势和主题分布进行探索。这项工作为动态地缘政治环境中社交媒体分析提供了一种可扩展的方法论。
https://arxiv.org/abs/2505.01883
This study explores the intersection of fashion trends and social media sentiment through computational analysis of Twitter data using the T4SA (Twitter for Sentiment Analysis) dataset. By applying natural language processing and machine learning techniques, we examine how sentiment patterns in fashion-related social media conversations can serve as predictors for emerging fashion trends. Our analysis involves the identification and categorization of fashion-related content, sentiment classification with improved normalization techniques, time series decomposition, statistically validated causal relationship modeling, cross-platform sentiment comparison, and brand-specific sentiment analysis. Results indicate correlations between sentiment patterns and fashion theme popularity, with accessories and streetwear themes showing statistically significant rising trends. The Granger causality analysis establishes sustainability and streetwear as primary trend drivers, showing bidirectional relationships with several other themes. The findings demonstrate that social media sentiment analysis can serve as an effective early indicator of fashion trend trajectories when proper statistical validation is applied. Our improved predictive model achieved 78.35% balanced accuracy in sentiment classification, establishing a reliable foundation for trend prediction across positive, neutral, and negative sentiment categories.
这项研究通过计算分析Twitter数据(使用T4SA(Twitter for Sentiment Analysis)数据集),探索了时尚趋势与社交媒体情绪之间的交集。通过应用自然语言处理和机器学习技术,我们探讨了与时尚相关的社交媒体对话中的情感模式如何作为新兴时尚趋势的预测指标。 我们的分析包括识别和分类与时尚相关的内容、改进的情感归一化技术下的情感分类、时间序列分解、统计验证因果关系建模、跨平台情绪比较以及品牌特定的情绪分析。结果表明,情感模式与时尚主题受欢迎程度之间存在关联,配饰和街头服饰主题显示出统计上显著的增长趋势。 格兰杰因果分析建立了可持续性和街头服饰作为主要的趋势驱动因素,并展示了与其他多个主题之间的双向关系。研究发现证明了当应用适当的统计验证时,社交媒体情绪分析可以作为一种有效的早期指标来预测时尚趋势的发展轨迹。 我们改进的预测模型在情感分类中实现了78.35%的平衡准确性,在正面、中性和负面情绪类别上建立了可靠的趋势预测基础。
https://arxiv.org/abs/2505.00050
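The 78.35% figure in the abstract above is a balanced accuracy. Assuming the usual definition (the unweighted mean of per-class recall), a minimal sketch looks as follows. The function name and label strings are illustrative and not taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall.

    Unlike plain accuracy, each class contributes equally, so a model
    cannot score well by favouring the most frequent class.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)
```

With three sentiment categories (positive, neutral, negative), a classifier that always predicts one class scores 1/3 under this metric regardless of how imbalanced the data is.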
In this paper, I discuss the testing of the Arabic Metaphor Corpus (AMC) [1] using a newly designed automatic tool for sentiment classification of the AMC based on semantic tags. The tool incorporates semantic emotional tags for sentiment classification. I evaluate the tool using standard measures: F-score, recall, and precision. The aim is to show the impact of Arabic online metaphors on sentiment through the newly designed tool. To the best of my knowledge, this is the first approach to conduct sentiment classification for Arabic metaphors using semantic tags to find the impact of the metaphor.
在这篇论文中,我讨论了使用新设计的自动工具对阿拉伯语隐喻语料库(AMC)[1]进行测试的过程。这些工具是基于语义标签为AMC专门设计的情感分类工具,并且集成了用于情感分类的语义情感标记。我通过标准方法(即F值、召回率和精确度)评估了该工具的效果。研究的方法旨在展示阿拉伯语在线隐喻对情绪的影响,使用新设计的工具来实现这一目标。据我们所知,这是首次采用基于语义标签的情感分类方法来分析阿拉伯语隐喻对其情感影响的研究。
https://arxiv.org/abs/2504.19590
This study explores a novel approach to predicting key bug-related outcomes, including the time to resolution, time to fix, and ultimate status of a bug, using data from the Bugzilla Eclipse Project. Specifically, we leverage features available before a bug is resolved to enhance predictive accuracy. Our methodology incorporates sentiment analysis to derive both an emotionality score and a sentiment classification (positive or negative). Additionally, we integrate the bug's priority level and its topic, extracted using a BERTopic model, as features for a Convolutional Neural Network (CNN) and a Multilayer Perceptron (MLP). Our findings indicate that the combination of BERTopic and sentiment analysis can improve certain model performance metrics. Furthermore, we observe that balancing model inputs enhances practical applicability, albeit at the cost of a significant reduction in accuracy in most cases. To address our primary objectives, predicting time-to-resolution, time-to-fix, and bug destiny, we employ both binary classification and exact time value predictions, allowing for a comparative evaluation of their predictive effectiveness. Results demonstrate that sentiment analysis serves as a valuable predictor of a bug's eventual outcome, particularly in determining whether it will be fixed. However, its utility is less pronounced when classifying bugs into more complex or unconventional outcome categories.
这项研究探索了一种新颖的方法,用于预测与软件缺陷相关的关键结果,包括解决时间、修复时间和最终状态。该方法利用了来自Bugzilla Eclipse项目的数据,并且采用了在问题解决之前可用的特征来提高预测准确性。我们的方法采用情感分析技术以得出情感性评分和正负面情绪分类,同时结合使用BERTopic模型提取的优先级和话题作为卷积神经网络(CNN)和多层感知机(MLP)的输入特征。 研究结果表明,将BERTopic与情感分析相结合可以在某些性能指标上提升模型表现。此外,平衡模型输入以增强其实用性虽在大多数情况下会导致准确性显著降低,但这是必要的步骤。为了实现主要目标——预测解决时间、修复时间和缺陷最终状态,我们采用了二元分类和精确时间值预测的方法,并允许对它们的预测效果进行比较评估。 结果表明,情感分析能够作为预测软件缺陷最终结局的有效工具,特别是在确定该问题是否会被修复方面尤其有效。然而,在将缺陷归类为更复杂或非传统的结果类别时,其效用显得相对较低。
https://arxiv.org/abs/2504.15972
Multimodal aspect-based sentiment classification (MASC) is an emerging task due to an increase in user-generated multimodal content on social platforms, aimed at predicting sentiment polarity toward specific aspect targets (i.e., entities or attributes explicitly mentioned in text-image pairs). Despite extensive efforts and significant achievements in existing MASC, substantial gaps remain in understanding fine-grained visual content and the cognitive rationales derived from semantic content and impressions (cognitive interpretations of emotions evoked by image content). In this study, we present Chimera: a cognitive and aesthetic sentiment causality understanding framework to derive fine-grained holistic features of aspects and infer the fundamental drivers of sentiment expression from both semantic perspectives and affective-cognitive resonance (the synergistic effect between emotional responses and cognitive interpretations). Specifically, this framework first incorporates visual patch features for patch-word alignment. Meanwhile, it extracts coarse-grained visual features (e.g., overall image representation) and fine-grained visual regions (e.g., aspect-related regions) and translates them into corresponding textual descriptions (e.g., facial, aesthetic). Finally, we leverage the sentimental causes and impressions generated by a large language model (LLM) to enhance the model's awareness of sentimental cues evoked by semantic content and affective-cognitive resonance. Experimental results on standard MASC datasets demonstrate the effectiveness of the proposed model, which also exhibits greater flexibility to MASC compared to LLMs such as GPT-4o. We have publicly released the complete implementation and dataset at this https URL
多模态方面情感分类(MASC)是一项新兴任务,由于社交媒体平台上用户生成的多模态内容增多而产生。该任务旨在预测特定方面的目标的情感极性(即,在图文对中明确提到的实体或属性)。尽管在现有的MASC研究上已经付出了大量努力并取得了显著成果,但在理解细粒度视觉内容以及从语义内容和印象中得出的认知理由方面仍存在重大差距。 在这项研究中,我们提出了Chimera:一个认知与美学情感因果关系理解框架。该框架旨在通过语义视角和情感能动-认知共鸣(情感反应与认知解释之间的协同作用)来推断表达情感的基本驱动因素,并从细粒度的整体特征角度来解析方面。具体而言,此框架首先将视觉补丁特征纳入其中以实现视觉与文本的对齐。同时,它还提取粗粒度的视觉特征(例如,整体图像表示)和细粒度的视觉区域(例如,与特定方面相关的区域),并将其转换为相应的文本描述(如面部、美学)。最后,我们利用大型语言模型(LLM)生成的情感原因和印象来增强模型对由语义内容和情感能动-认知共鸣引发的情感线索的理解。 在标准多模态情感分类数据集上的实验结果表明了所提出模型的有效性。与像GPT-4o这样的LLMs相比,该模型表现出更大的灵活性以适应MASC任务。我们在以下链接公开发布了完整实现及数据集:[此URL]
https://arxiv.org/abs/2504.15848
In the age of social media, understanding public sentiment toward major corporations is crucial for investors, policymakers, and researchers. This paper presents a comprehensive sentiment analysis system tailored for corporate reputation monitoring, combining Natural Language Processing (NLP) and machine learning techniques to accurately interpret public opinion in real time. The methodology integrates a hybrid sentiment detection framework leveraging both rule-based models (VADER) and transformer-based deep learning models (DistilBERT), applied to social media data from multiple platforms. The system begins with robust preprocessing involving noise removal and text normalization, followed by sentiment classification using an ensemble approach to ensure both interpretability and contextual accuracy. Results are visualized through sentiment distribution plots, comparative analyses, and temporal sentiment trends for enhanced interpretability. Our analysis reveals significant disparities in public sentiment across major corporations, with companies like Amazon (81.2) and Samsung (45.8) receiving excellent sentiment scores, while Microsoft (21.7) and Walmart (21.9) exhibit poor sentiment profiles. These findings demonstrate the utility of our multi-source sentiment framework in providing actionable insights regarding corporate public perception, enabling stakeholders to make informed strategic decisions based on comprehensive sentiment analysis.
在社交媒体时代,理解公众对大型企业的情绪对于投资者、政策制定者和研究人员来说至关重要。本文提出了一种全面的情感分析系统,该系统专门用于企业声誉监测,并结合自然语言处理(NLP)和机器学习技术来实时准确地解读公众意见。 本研究的方法论整合了一个混合情感检测框架,利用基于规则的模型(VADER)和基于转换器的深度学习模型(DistilBERT),应用于来自多个平台的社会媒体数据。该系统首先进行强大的预处理步骤,包括去除噪声和文本规范化,然后使用集成方法进行情感分类,以确保可解释性和上下文准确性。结果通过情感分布图、比较分析以及时间趋势来可视化,从而增强其可解释性。 我们的分析揭示了大型企业之间公众情绪的重大差异:亚马逊(81.2)和三星(45.8)获得了出色的情感评分,而微软(21.7)和沃尔玛(21.9)则表现出较低的情绪评分。这些发现证明了我们多源情感框架的实用性,在提供有关公司公共形象可操作见解方面具有重要意义,使利益相关者能够根据全面的情感分析做出明智的战略决策。
https://arxiv.org/abs/2504.15448
In various natural language processing (NLP) tasks, fine-tuning Pre-trained Language Models (PLMs) often leads to the issue of spurious correlations, which negatively impacts performance, particularly when dealing with out-of-distribution data. To address this problem, we propose SALAD (Structure Aware and LLM-driven Augmented Data), a novel approach designed to enhance model robustness and generalization by generating structure-aware and counterfactually augmented data for contrastive learning. Our method leverages a tagging-based approach to generate structure-aware positive samples and utilizes large language models (LLMs) to generate counterfactual negative samples with diverse sentence patterns. By applying contrastive learning, SALAD enables the model to focus on learning the structural relationships between key sentence components while minimizing reliance on spurious correlations. We validate our approach through experiments on three tasks: Sentiment Classification, Sexism Detection, and Natural Language Inference. The results demonstrate that SALAD not only improves model robustness and performance across different environments but also enhances generalization to out-of-distribution datasets and cross-domain scenarios.
在各种自然语言处理(NLP)任务中,对预训练语言模型(PLMs)进行微调往往会带来虚假相关性的问题,这会负面影响性能,特别是在处理分布外数据时。为了解决这个问题,我们提出了SALAD(结构感知与大规模语言模型驱动的增强数据),这是一种通过生成结构感知且反事实增强的数据来进行对比学习的方法,旨在提高模型的鲁棒性和泛化能力。我们的方法采用基于标签的方法来生成结构感知的正样本,并利用大型语言模型(LLMs)生成具有多种句子模式的反事实负样本。通过应用对比学习,SALAD使模型能够专注于关键句法成分之间的结构性关系的学习,同时最小化对虚假相关性的依赖。 我们通过对三个任务——情感分类、性别歧视检测和自然语言推理进行实验来验证我们的方法的有效性。结果表明,SALAD不仅提高了在不同环境下的模型鲁棒性和性能,还增强了其对于分布外数据集以及跨域场景的泛化能力。
https://arxiv.org/abs/2504.12185
One fundamental question for the social sciences today is: how much can we trust highly complex predictive models like ChatGPT? This study tests the hypothesis that subtle changes in the structure of prompts do not produce significant variations in the classification results of sentiment polarity analysis generated by the Large Language Model GPT-4o mini. Using a dataset of 100,000 comments in Spanish on four Latin American presidents, the model classified the comments as positive, negative, or neutral on 10 occasions, varying the prompts slightly each time. The experimental methodology included exploratory and confirmatory analyses to identify significant discrepancies among classifications. The results reveal that even minor modifications to prompts, such as lexical, syntactic, or modal changes, or even a lack of structure, affect the classifications. In certain cases, the model produced inconsistent responses, such as mixing categories, providing unsolicited explanations, or using languages other than Spanish. Statistical analysis using Chi-square tests confirmed significant differences in most comparisons between prompts, except in one case where linguistic structures were highly similar. These findings challenge the robustness and trustworthiness of Large Language Models for classification tasks, highlighting their vulnerability to variations in instructions. Moreover, it was evident that the lack of structured grammar in prompts increases the frequency of hallucinations. The discussion underscores that trust in Large Language Models is based not only on technical performance but also on the social and institutional relationships underpinning their use.
今天社会科学面临的一个基本问题是:我们能在多大程度上信任像ChatGPT这样高度复杂的预测模型?这项研究检验了这样一个假设,即提示结构的细微变化不会导致大型语言模型(如GPT-4o mini)生成的情感极性分析分类结果产生显著差异。该研究使用了一组10万条西班牙语评论数据集,这些评论针对四个拉丁美洲总统。实验中,模型在10个不同场景下对这些建议进行了积极、消极或中立的分类,并且每次提示都有细微的不同之处。 实验方法包括了探索性和确认性分析,以识别分类之间的显著差异。研究结果显示,即使是词汇、句法和模态上的微小变化,或者缺乏结构化都会影响模型的分类结果。在某些情况下,该模型产生了不一致的回答,例如混淆类别、提供非请求解释或使用西班牙语以外的语言。 卡方检验的统计分析确认了大多数提示比较之间的显著差异,唯一的例外是语言结构高度相似的情况。这些发现挑战了大型语言模型在分类任务中的稳健性和可信度,并突显了它们对指令变化的脆弱性。此外还观察到,缺乏结构化语法的提示会增加幻觉现象的发生频率。 讨论强调了信任大型语言模型不仅基于其技术性能,还需要考虑在其使用中所涉及的社会和机构关系。
https://arxiv.org/abs/2504.12180
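The prompt comparison in the abstract above relies on Chi-square tests. A minimal sketch of the Pearson chi-square statistic for a table of label counts (one row per prompt variant, one column per sentiment class) is given below. The function name and the example counts in the usage note are hypothetical, and the sketch assumes every row and column has at least one observation.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom.

    table: list of rows of observed counts, e.g. one row per prompt
    variant and one column per class (positive, negative, neutral).
    Assumes no row or column total is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if labels were independent of the prompt.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof
```

For two prompts and three classes there are 2 degrees of freedom, so a statistic above roughly 5.99 would indicate a difference at the 0.05 level. For instance, made-up counts of `[[50, 30, 20], [30, 30, 40]]` would exceed that threshold.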
Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets by leveraging the unique strengths of these models. BERT captures bidirectional context, GPT-2 enhances generative capabilities, RoBERTa optimizes contextual understanding with larger corpora and dynamic masking, XLNet models dependency through permutation-based learning, and DistilBERT offers efficiency with reduced computational overhead while maintaining high accuracy. We apply text cleaning, tokenization, and feature extraction using Term Frequency Inverse Document Frequency (TF-IDF) and Bag of Words (BoW) to ensure high-quality input data for the models. The hybrid approach was evaluated on benchmark datasets Sentiment140 and IMDB, achieving superior accuracy rates of 94% and 95%, respectively, outperforming standalone models. The results validate the effectiveness of combining multiple transformer models in ensemble-like setups to address the limitations of individual architectures. This research highlights the framework's applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking, and offers a pathway for future advancements in hybrid NLP frameworks.
情感分析是自然语言处理(NLP)中的关键任务,它能够从文本数据中提取有意义的见解,特别是在如Twitter和IMDb这样的动态平台上。本研究探讨了一种结合基于变压器模型(包括BERT、GPT-2、RoBERTa、XLNet以及DistilBERT)的混合框架,以提高情感分类的准确性和鲁棒性。该框架通过利用这些模型的独特优势来应对诸如噪音数据、语境模糊以及在不同数据集上泛化等方面的挑战。 具体来说,BERT能够捕捉双向上下文信息,GPT-2提升了生成能力,RoBERTa使用更大规模的数据集和动态屏蔽技术优化了对上下文的理解,XLNet通过基于排列的学习方式建模依赖关系,而DistilBERT则在减少计算开销的同时保持高准确率。我们展示了如何利用词频逆文档频率(TF-IDF)和词汇袋(BoW)进行文本清洗、分词以及特征提取,以确保输入模型的数据质量。 混合方法在Sentiment140和IMDb基准数据集上进行了评估,在这些数据集中分别达到了94%和95%的优越准确率,超过了单一模型的表现。研究结果验证了通过类似集成的方法组合多个变压器模型可以有效解决单一架构的局限性。 这项研究强调其在实际任务中的适用性,如社交媒体监控、客户情感分析以及公众舆论跟踪等领域,并为未来混合NLP框架的发展提供了方向。
https://arxiv.org/abs/2504.09896
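The abstract above mentions Bag of Words (BoW) and TF-IDF feature extraction. The sketch below shows one common formulation in plain Python. The exact weighting used in the paper is not stated, so the choices here (term frequency as count divided by document length, inverse document frequency as the natural log of N over document frequency, no smoothing) are assumptions, as are the function names.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Raw term counts for one tokenized document."""
    return Counter(tokens)


def tf_idf(docs):
    """TF-IDF weights for a list of tokenized documents.

    tf  = term count / document length
    idf = ln(number of documents / number of documents containing the term)
    A term that appears in every document therefore gets weight 0.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Library implementations typically add smoothing and normalization on top of this, so their numbers will differ from the unsmoothed variant shown here.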
Large Language Models (LLMs) have significantly advanced sentiment analysis, yet their inherent uncertainty and variability pose critical challenges to achieving reliable and consistent outcomes. This paper systematically explores the Model Variability Problem (MVP) in LLM-based sentiment analysis, characterized by inconsistent sentiment classification, polarization, and uncertainty arising from stochastic inference mechanisms, prompt sensitivity, and biases in training data. We analyze the core causes of MVP, presenting illustrative examples and a case study to highlight its impact. In addition, we investigate key challenges and mitigation strategies, paying particular attention to the role of temperature as a driver of output randomness and emphasizing the crucial role of explainability in improving transparency and user trust. By providing a structured perspective on stability, reproducibility, and trustworthiness, this study helps develop more reliable, explainable, and robust sentiment analysis models, facilitating their deployment in high-stakes domains such as finance, healthcare, and policymaking, among others.
大型语言模型(LLMs)在情感分析方面取得了显著进展,但其内在的不确定性和变化性给实现可靠和一致的结果带来了重大挑战。本文系统地探讨了基于LLM的情感分析中的模型变异性问题(Model Variability Problem, MVP),该问题表现为不一致的情感分类、极化以及由于随机推理机制、提示敏感性及训练数据偏差所产生的不确定性。我们分析了MVP的核心原因,并通过示例和案例研究来强调其影响。此外,本文还探讨了关键挑战和缓解策略,特别关注温度作为输出随机性的驱动因素的作用,并强调可解释性在提高透明度和用户信任中的关键作用。通过提供关于稳定性、重现性和可信度的结构化视角,本研究有助于开发更可靠、更具解释性和鲁棒性的模型,从而促进这些模型在金融、医疗保健及政策制定等高风险领域的部署。
https://arxiv.org/abs/2504.04462
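The role of temperature as a driver of output randomness, noted in the abstract above, can be seen in a temperature-scaled softmax over class scores. This is a generic sketch rather than code from the paper. The function name and the example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert scores to probabilities, dividing by temperature first.

    Low temperature sharpens the distribution toward the top score;
    high temperature flattens it, so sampled labels vary more.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for (positive, neutral, negative).
example_logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(example_logits, 0.1)
flat = softmax_with_temperature(example_logits, 5.0)
```

Here `sharp` puts almost all probability on the first class, while `flat` is close to uniform, which is why repeated sampling at higher temperature yields less consistent sentiment labels.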