Machine learning methods are increasingly applied to analyze health-related public discourse at scale, but questions remain regarding their ability to accurately detect different types of health sentiment. In particular, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on health issues are largely unexplored. This research therefore examines how accurately three prominent LLMs (GPT, Gemini, and LLaMA) detect risk-promoting versus health-supporting sentiment across two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we curated multiple sets of messages supporting or opposing recommended health behaviors, supplemented with human annotations as the gold standard for sentiment classification. The findings indicate that all three LLMs generally demonstrate substantial accuracy in classifying risk-promoting and health-supporting sentiment, although notable discrepancies emerge by platform, health issue, and model type. Specifically, the models are often more accurate on risk-promoting sentiment on Facebook, whereas health-supporting messages are detected more accurately on Twitter. An additional analysis also shows the challenges LLMs face in reliably detecting neutral messages. These results highlight the importance of carefully selecting and validating language models for public health analyses, particularly given potential biases in training data that may lead LLMs to overestimate or underestimate the prevalence of certain perspectives.
https://arxiv.org/abs/2507.04364
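A minimal sketch of the evaluation loop described above, assuming a generic `query_llm(prompt) -> str` callable standing in for the GPT/Gemini/LLaMA APIs; the paper's exact prompt wording and label handling are not public, so both are illustrative assumptions:

```python
from collections import Counter

LABELS = {"risk-promoting", "health-supporting", "neutral"}

def classify_message(query_llm, message: str) -> str:
    # Zero-shot prompt; illustrative wording, not the paper's template.
    prompt = (
        "Classify the stance of this social media message about HPV vaccination "
        "or heated tobacco products as exactly one of: risk-promoting, "
        f"health-supporting, neutral.\n\nMessage: {message}\nLabel:"
    )
    answer = query_llm(prompt).strip().lower()
    return answer if answer in LABELS else "neutral"  # fall back on unparsable output

def accuracy_by_group(preds, golds, groups):
    """Per-platform / per-topic accuracy against human annotations (gold standard)."""
    hits, totals = Counter(), Counter()
    for p, g, grp in zip(preds, golds, groups):
        totals[grp] += 1
        hits[grp] += int(p == g)
    return {grp: hits[grp] / totals[grp] for grp in totals}
```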
In the field of education, understanding students' opinions through their comments is crucial, especially for Vietnamese, where resources remain limited. Existing educational datasets often lack domain relevance and student slang. To address these gaps, we introduce NEU-ESC, a new Vietnamese dataset for Educational Sentiment Classification and Topic Classification curated from university forums, which offers more samples, richer class diversity, longer texts, and broader vocabulary. In addition, we explore multitask learning with encoder-only language models (BERT), achieving up to 83.7% and 79.8% accuracy on the sentiment and topic classification tasks, respectively. We also benchmark our dataset and model against other datasets and models, including Large Language Models, and discuss these benchmarks. The dataset is publicly available at: this https URL.
https://arxiv.org/abs/2506.23524
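A minimal sketch of the multitask setup described above, assuming a shared BERT encoder with one classification head per task and a summed cross-entropy objective; the backbone name and label counts below are placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultitaskBert(nn.Module):
    def __init__(self, backbone="bert-base-multilingual-cased",
                 n_sentiments=3, n_topics=8):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.sentiment_head = nn.Linear(hidden, n_sentiments)
        self.topic_head = nn.Linear(hidden, n_topics)

    def forward(self, **inputs):
        # Use the [CLS] position as a pooled sentence representation.
        cls = self.encoder(**inputs).last_hidden_state[:, 0]
        return self.sentiment_head(cls), self.topic_head(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = MultitaskBert()
batch = tokenizer(["Môn học này rất hay!"], return_tensors="pt")
sent_logits, topic_logits = model(**batch)
# Joint objective: loss = ce(sent_logits, y_sent) + ce(topic_logits, y_topic)
```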
Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its practical deployment still faces two core bottlenecks: first, calling commercial APIs for large-scale annotation is very expensive; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to the field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and, based on it, design AutoAnnotator, a fully automatic annotation framework. Specifically, AutoAnnotator consists of two layers. The upper meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code, and verify difficult samples; the lower task-specialist layer consists of multiple SLMs that annotate through multi-model voting. In addition, we use the difficult samples flagged by the meta-controller layer's secondary review as a reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving their generalization. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority-voting settings. Notably, AutoAnnotator reduces annotation cost by 74.15% compared to annotating directly with GPT-3.5-turbo, while still improving accuracy by 6.21%. Project page: this https URL.
https://arxiv.org/abs/2506.16393
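A minimal sketch of the two-layer idea, assuming each SLM is a callable returning a label and a placeholder `llm_review` standing in for the meta-controller's secondary review; the real framework also generates annotation code and schedules continual fine-tuning, which is omitted here:

```python
from collections import Counter

def annotate(text, slms, llm_review, min_agreement=0.75):
    """Task-specialist layer: multi-model voting; hard cases escalate to the LLM."""
    votes = Counter(slm(text) for slm in slms)
    label, count = votes.most_common(1)[0]
    if count / len(slms) >= min_agreement:
        return label, False                          # confident consensus
    return llm_review(text, dict(votes)), True       # difficult sample -> meta-controller

hard_pool = []  # escalated samples later fine-tune the SLMs in stages

def annotate_corpus(texts, slms, llm_review):
    out = []
    for t in texts:
        label, escalated = annotate(t, slms, llm_review)
        if escalated:
            hard_pool.append((t, label))
        out.append(label)
    return out
```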
Quantum machine learning is a promising direction for building more efficient and expressive models, particularly in domains where understanding complex, structured data is critical. We present the Quantum Graph Transformer (QGT), a hybrid graph-based architecture that integrates a quantum self-attention mechanism into the message-passing framework for structured language modeling. The attention mechanism is implemented using parameterized quantum circuits (PQCs), which enable the model to capture rich contextual relationships while significantly reducing the number of trainable parameters compared to classical attention mechanisms. We evaluate QGT on five sentiment classification benchmarks. Experimental results show that QGT consistently achieves accuracy higher than or comparable to that of existing quantum natural language processing (QNLP) models, including both attention-based and non-attention-based approaches. When compared with an equivalent classical graph transformer, QGT yields an average accuracy improvement of 5.42% on real-world datasets and 4.76% on synthetic datasets. Additionally, QGT demonstrates improved sample efficiency, requiring nearly 50% fewer labeled samples to reach comparable performance on the Yelp dataset. These results highlight the potential of graph-based QNLP techniques for advancing efficient and scalable language understanding.
https://arxiv.org/abs/2506.07937
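The abstract does not disclose the circuit design, but the following PennyLane sketch illustrates the general idea of a PQC-based attention score: query and key vectors are angle-encoded, mixed by a trainable entangling layer, and a Pauli-Z expectation serves as the pre-softmax attention logit. All names, shapes, and the encoding scheme here are illustrative assumptions, not the QGT circuit itself.

```python
import pennylane as qml
import numpy as np

n_qubits, n_layers = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def attention_logit(query, key, weights):
    # Angle-encode the query on RX rotations and the key on RY rotations.
    qml.AngleEmbedding(query, wires=range(n_qubits), rotation="X")
    qml.AngleEmbedding(key, wires=range(n_qubits), rotation="Y")
    # Trainable entangling block: the only learned parameters of the score.
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))

shape = qml.BasicEntanglerLayers.shape(n_layers=n_layers, n_wires=n_qubits)
weights = np.random.uniform(0, np.pi, size=shape)
q = np.random.uniform(0, np.pi, n_qubits)
k = np.random.uniform(0, np.pi, n_qubits)
print(attention_logit(q, k, weights))  # scalar in [-1, 1]
```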
We investigate the effectiveness of large language models (LLMs), including reasoning-based and non-reasoning models, in performing zero-shot financial sentiment analysis. Using the Financial PhraseBank dataset annotated by domain experts, we evaluate how various LLMs and prompting strategies align with human-labeled sentiment in a financial context. We compare three proprietary LLMs (GPT-4o, GPT-4.1, o3-mini) under different prompting paradigms that simulate System 1 (fast and intuitive) or System 2 (slow and deliberate) thinking and benchmark them against two smaller models (FinBERT-Prosus, FinBERT-Tone) fine-tuned on financial sentiment analysis. Our findings suggest that reasoning, either through prompting or inherent model design, does not improve performance on this task. Surprisingly, the most accurate and human-aligned combination of model and method was GPT-4o without any Chain-of-Thought (CoT) prompting. We further explore how performance is impacted by linguistic complexity and annotation agreement levels, uncovering that reasoning may introduce overthinking, leading to suboptimal predictions. This suggests that for financial sentiment classification, fast, intuitive "System 1"-like thinking aligns more closely with human judgment compared to "System 2"-style slower, deliberative reasoning simulated by reasoning models or CoT prompting. Our results challenge the default assumption that more reasoning always leads to better LLM decisions, particularly in high-stakes financial applications.
https://arxiv.org/abs/2506.04574
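A sketch of the two prompting paradigms being compared, again behind a generic `query_llm` placeholder; the paper's exact templates are not given in the abstract, so these are illustrative:

```python
SYSTEM1 = (
    "You are a financial analyst. Label the sentiment of the sentence toward "
    "the company as positive, negative, or neutral. Answer with one word only.\n"
    "Sentence: {text}\nLabel:"
)

SYSTEM2 = (
    "You are a financial analyst. Think step by step about how the sentence "
    "affects the company's outlook, then give a final label (positive, "
    "negative, or neutral) on the last line as 'Label: <answer>'.\n"
    "Sentence: {text}"
)

def classify(query_llm, text, deliberate=False):
    prompt = (SYSTEM2 if deliberate else SYSTEM1).format(text=text)
    reply = query_llm(prompt)
    # System 2 replies carry reasoning; keep only the final label line.
    return reply.rsplit("Label:", 1)[-1].strip().lower()
```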
Aspect-Based Sentiment Analysis (ABSA) offers granular insights into opinions but often suffers from the scarcity of diverse, labeled datasets that reflect real-world conversational nuances. This paper presents an approach for generating synthetic ABSA data using Large Language Models (LLMs) to address this gap. We detail the generation process aimed at producing data with consistent topic and sentiment distributions across multiple domains using GPT-4o. The quality and utility of the generated data were evaluated by assessing the performance of three state-of-the-art LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and DeepSeek-R1) on topic and sentiment classification tasks. Our results demonstrate the effectiveness of the synthetic data, revealing distinct performance trade-offs among the models: DeepSeek-R1 showed higher precision, Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited strong recall, and Gemini 1.5 Pro offered significantly faster inference. We conclude that LLM-based synthetic data generation is a viable and flexible method for creating valuable ABSA resources, facilitating research and model evaluation without reliance on limited or inaccessible real-world labeled data.
https://arxiv.org/abs/2505.24701
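A sketch of distribution-controlled generation behind a generic `query_llm` placeholder; the actual GPT-4o prompt and output schema are not public, so both are assumptions:

```python
import itertools
import json

DOMAINS = ["restaurants", "laptops", "hotels"]
SENTIMENTS = ["positive", "negative", "neutral"]

PROMPT = (
    "Write one realistic customer review sentence about a {domain} product or "
    "service. It must express a {sentiment} opinion about exactly one aspect. "
    'Return JSON: {{"text": ..., "aspect": ..., "sentiment": "{sentiment}"}}'
)

def generate_balanced(query_llm, per_cell=50):
    """Even coverage of every (domain, sentiment) cell keeps distributions consistent."""
    records = []
    for domain, sentiment in itertools.product(DOMAINS, SENTIMENTS):
        for _ in range(per_cell):
            raw = query_llm(PROMPT.format(domain=domain, sentiment=sentiment))
            try:
                records.append(json.loads(raw))
            except json.JSONDecodeError:
                continue  # drop malformed generations rather than repair them
    return records
```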
In this paper, we present a comprehensive and systematic analysis of vision-language models (VLMs) for disparate meme classification tasks. We introduce a novel approach that generates a VLM-based understanding of meme images and fine-tunes LLMs on a textual understanding of the embedded meme text to improve performance. Our contributions are threefold: (1) benchmarking VLMs with diverse prompting strategies tailored to each sub-task; (2) evaluating LoRA fine-tuning across all VLM components to assess performance gains; and (3) proposing a novel approach in which detailed meme interpretations generated by VLMs are used to train smaller language models (LLMs), significantly improving classification. The strategy of combining VLMs with LLMs improved baseline performance by 8.34%, 3.52%, and 26.24% for sarcasm, offensive, and sentiment classification, respectively. Our results reveal the strengths and limitations of VLMs and present a novel strategy for meme understanding.
https://arxiv.org/abs/2505.20937
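A sketch of the two-stage pipeline, with `query_vlm` and the downstream classifier as placeholders; the specific VLM, interpretation prompt, and student model are assumptions:

```python
def interpret_meme(query_vlm, image_path, overlay_text):
    # Stage 1: the VLM turns image + embedded text into a detailed interpretation.
    prompt = (
        "Describe this meme: the scene, the overlaid text "
        f'"{overlay_text}", and the intended tone or joke.'
    )
    return query_vlm(image_path, prompt)

def build_training_pairs(query_vlm, memes):
    # Stage 2 input: (interpretation, label) pairs for fine-tuning a smaller LM
    # on sarcasm / offensiveness / sentiment labels, as described above.
    return [
        (interpret_meme(query_vlm, m["image"], m["text"]), m["label"])
        for m in memes
    ]
```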
Decoding natural language from brain activity using non-invasive electroencephalography (EEG) remains a significant challenge in neuroscience and machine learning, particularly in open-vocabulary scenarios where traditional methods struggle with noise and variability. Previous studies have achieved high accuracy on small, closed vocabularies, but performance still degrades on open vocabularies. In this study, we propose ETS, a framework that integrates EEG with synchronized eye-tracking data to address two critical tasks: (1) open-vocabulary text generation and (2) sentiment classification of perceived language. Our model achieves superior BLEU and ROUGE scores for EEG-to-text decoding and up to a 10% F1-score improvement on EEG-based ternary sentiment classification, significantly outperforming supervised baselines. Furthermore, we show that the proposed model can handle data from various subjects and sources, showing great potential for a high-performance open-vocabulary EEG-to-text system.
https://arxiv.org/abs/2506.14783
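The abstract does not detail the architecture, so the following PyTorch fragment is only a schematic of multimodal fusion with a sentiment head: per-window EEG and eye-tracking features are projected into a shared space, concatenated, and pooled (the text-generation decoder is omitted). Every dimension here is a made-up placeholder.

```python
import torch
import torch.nn as nn

class EEGEyeFusion(nn.Module):
    def __init__(self, eeg_dim=840, gaze_dim=12, hidden=256, n_classes=3):
        super().__init__()
        self.eeg_proj = nn.Linear(eeg_dim, hidden)
        self.gaze_proj = nn.Linear(gaze_dim, hidden)
        self.mixer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )
        self.sentiment_head = nn.Linear(2 * hidden, n_classes)  # ternary task

    def forward(self, eeg, gaze):
        # eeg: (batch, windows, eeg_dim); gaze: (batch, windows, gaze_dim)
        fused = torch.cat([self.eeg_proj(eeg), self.gaze_proj(gaze)], dim=-1)
        fused = self.mixer(fused)
        return self.sentiment_head(fused.mean(dim=1))  # pool over windows

logits = EEGEyeFusion()(torch.randn(2, 20, 840), torch.randn(2, 20, 12))
```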
Political biases encoded by LLMs might have detrimental effects on downstream applications. Existing bias analysis methods rely on small-scale intermediate tasks (questionnaire answering or political content generation) and on the LLMs themselves for analysis, thus propagating bias. We propose a new approach leveraging the observation that LLM sentiment predictions vary with the target entity in the same sentence. We define an entropy-based inconsistency metric to encode this prediction variability. We insert 1319 demographically and politically diverse politician names into 450 political sentences and predict target-oriented sentiment using seven models in six widely spoken languages. We observe inconsistencies in all tested combinations and aggregate them in a statistically robust analysis at different granularity levels. We observe positive and negative bias toward left and far-right politicians and positive correlations between politicians with similar alignment. Bias intensity is higher for Western languages than for others. Larger models exhibit stronger and more consistent biases and reduce discrepancies between similar languages. We partially mitigate LLM unreliability in target-oriented sentiment classification (TSC) by replacing politician names with fictional but plausible counterparts.
https://arxiv.org/abs/2505.19776
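The inconsistency metric can be sketched directly: for each sentence template, collect the sentiment label the model assigns under every politician-name substitution and take the entropy of the resulting label distribution (zero when predictions never change, maximal when they are uniform over labels). The normalization choice below is our assumption, not necessarily the paper's.

```python
import math
from collections import Counter

def inconsistency(labels, n_classes=3):
    """Entropy of the predicted-label distribution across name substitutions,
    normalized to [0, 1] by the maximum entropy log(n_classes)."""
    counts = Counter(labels)
    total = len(labels)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n_classes)

# Same sentence, 1319 name substitutions -> one prediction per substitution:
preds = ["positive"] * 700 + ["negative"] * 500 + ["neutral"] * 119
print(inconsistency(preds))  # ~0.84: predictions flip with the politician name
```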
In-context learning (ICL) is a crucial capability of current large language models (LLMs), where the selection of examples plays a key role in performance. While most existing approaches focus on selecting the most similar examples to the query, the impact of diversity in example selection remains underexplored. We systematically investigate the role of diversity in in-context example selection through experiments across a range of tasks, from sentiment classification to more challenging math and code problems. Experiments on Llama-3.1, Gemma-2, and Mistral-v0.3 families of models show that diversity-aware selection methods improve performance, particularly on complex tasks like math and code, and enhance robustness to out-of-distribution queries. To support these findings, we introduce a theoretical framework that explains the benefits of incorporating diversity in in-context example selection.
https://arxiv.org/abs/2505.19426
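One standard way to make example selection diversity-aware is maximal marginal relevance over embeddings: greedily pick examples that are similar to the query but dissimilar to those already chosen. The paper evaluates its own selection methods; this MMR sketch is just a representative instance of the idea.

```python
import numpy as np

def mmr_select(query_emb, pool_embs, k=8, lam=0.7):
    """Greedy MMR: lam trades query similarity against diversity of the set."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    chosen, remaining = [], list(range(len(pool_embs)))
    while remaining and len(chosen) < k:
        def score(i):
            sim_q = cos(pool_embs[i], query_emb)
            sim_s = max((cos(pool_embs[i], pool_embs[j]) for j in chosen),
                        default=0.0)
            return lam * sim_q - (1 - lam) * sim_s
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen  # indices of in-context examples, ordered by selection

demos = np.random.randn(100, 384)           # candidate example embeddings
print(mmr_select(np.random.randn(384), demos))
```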
Aspect-Based Sentiment Analysis (ABSA) is a fundamental task in natural language processing, offering fine-grained insights into opinions expressed in text. Existing research has largely focused on resource-rich languages like English, leveraging large annotated datasets, pre-trained models, and language-specific tools; these resources are often unavailable for low-resource languages such as Bengali. The ABSA task in Bengali remains poorly explored and is further complicated by its unique linguistic characteristics and a lack of annotated data, pre-trained models, and optimized hyperparameters. To address these challenges, this research proposes CrosGrpsABS, a novel hybrid framework that leverages bidirectional cross-attention between syntactic and semantic graphs to enhance aspect-level sentiment classification. CrosGrpsABS combines transformer-based contextual embeddings with graph convolutional networks, built upon rule-based syntactic dependency parsing and semantic similarity computations. By employing bidirectional cross-attention, the model effectively fuses local syntactic structure with global semantic context, improving sentiment classification performance in both low- and high-resource settings. We evaluate CrosGrpsABS on four low-resource Bengali ABSA datasets and the high-resource English SemEval 2014 Task 4 dataset. CrosGrpsABS consistently outperforms existing approaches, achieving notable improvements, including a 0.93% F1-score increase for the Restaurant domain and a 1.06% gain for the Laptop domain on the SemEval 2014 Task 4 benchmark.
https://arxiv.org/abs/2505.19018
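A PyTorch fragment showing the bidirectional cross-attention step in isolation: node features from a syntactic GCN attend to those from a semantic GCN and vice versa, and the two views are fused. The GCN encoders themselves are omitted, and all dimensions are assumed placeholders.

```python
import torch
import torch.nn as nn

class BiCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.syn2sem = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sem2syn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, syn_nodes, sem_nodes):
        # syn_nodes, sem_nodes: (batch, n_tokens, dim) from the two GCNs
        syn_ctx, _ = self.syn2sem(syn_nodes, sem_nodes, sem_nodes)
        sem_ctx, _ = self.sem2syn(sem_nodes, syn_nodes, syn_nodes)
        return self.fuse(torch.cat([syn_ctx, sem_ctx], dim=-1))

fused = BiCrossAttention()(torch.randn(2, 30, 256), torch.randn(2, 30, 256))
# fused feeds an aspect-level classifier head downstream.
```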
Effective cross-lingual transfer remains a critical challenge in scaling the benefits of large language models from high-resource to low-resource languages. Towards this goal, prior studies have explored many approaches to combine task knowledge from task-specific data in a (high-resource) source language and language knowledge from unlabeled text in a (low-resource) target language. One notable approach proposed composable sparse fine-tuning (SFT) for cross-lingual transfer that learns task-specific and language-specific sparse masks to select a subset of the pretrained model's parameters that are further fine-tuned. These sparse fine-tuned vectors (SFTs) are subsequently composed with the pretrained model to facilitate zero-shot cross-lingual transfer to a task in a target language, using only task-specific data from a source language. These sparse masks for SFTs were identified using simple magnitude-based pruning. In our work, we introduce DeFT-X, a novel composable SFT approach that denoises the weight matrices of a pretrained model before magnitude pruning using singular value decomposition, thus yielding more robust SFTs. We evaluate DeFT-X on a diverse set of extremely low-resource languages for sentiment classification (NusaX) and natural language inference (AmericasNLI) and demonstrate that it performs on par with or outperforms SFT and other prominent cross-lingual transfer baselines.
https://arxiv.org/abs/2505.15090
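The core DeFT-X step can be sketched with plain torch: a low-rank SVD reconstruction denoises a weight matrix before the usual magnitude-based mask is taken. The rank and sparsity values are illustrative, and the real method operates within the composable-SFT pipeline rather than on a raw matrix as here.

```python
import torch

def denoised_magnitude_mask(weight, rank=64, sparsity=0.95):
    """SVD-denoise a weight matrix, then keep the largest-magnitude entries."""
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    denoised = (u[:, :rank] * s[:rank]) @ vh[:rank]   # rank-r reconstruction
    k = int(denoised.numel() * (1 - sparsity))        # entries to keep
    threshold = denoised.abs().flatten().kthvalue(denoised.numel() - k).values
    return (denoised.abs() > threshold).float()       # 1 = trainable, 0 = frozen

w = torch.randn(768, 768)
mask = denoised_magnitude_mask(w)
print(mask.mean())  # ~0.05 of parameters selected for sparse fine-tuning
```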
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
https://arxiv.org/abs/2505.14918
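The intra-rater check in this framework reduces to a simple computation: with five replicates per article, count how often a model gives the identical label all five times. A minimal sketch (the invalid-response metrics and inter-rater statistics are not reproduced here):

```python
def intra_rater_consistency(replicate_labels):
    """replicate_labels: list of per-article label lists, e.g. 5 replicates each.
    Returns the fraction of articles with perfect agreement across replicates."""
    perfect = sum(len(set(labels)) == 1 for labels in replicate_labels)
    return perfect / len(replicate_labels)

runs = [
    ["positive"] * 5,                                          # consistent
    ["negative"] * 5,
    ["neutral", "neutral", "positive", "neutral", "neutral"],  # one flip
]
print(intra_rater_consistency(runs))  # 0.67 here; 0.90-0.98 reported above
```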
Fine-grained sentiment analysis (FGSA) aims to identify sentiment polarity toward specific aspects within a text, enabling more precise opinion mining in domains such as product reviews and social media. However, traditional FGSA approaches often require task-specific architectures and extensive annotated data, limiting their generalization and scalability. To address these challenges, we propose PL-FGSA, a unified prompt learning-based framework implemented using the MindSpore platform, which integrates prompt design with a lightweight TextCNN backbone. Our method reformulates FGSA as a multi-task prompt-augmented generation problem, jointly tackling aspect extraction, sentiment classification, and causal explanation in a unified paradigm. By leveraging prompt-based guidance, PL-FGSA enhances interpretability and achieves strong performance under both full-data and low-resource conditions. Experiments on three benchmark datasets (SST-2, SemEval-2014 Task 4, and MAMS) demonstrate that our model consistently outperforms traditional fine-tuning methods and achieves F1-scores of 0.922, 0.694, and 0.597, respectively. These results validate the effectiveness of prompt-based generalization and highlight the practical value of PL-FGSA for real-world sentiment analysis tasks.
https://arxiv.org/abs/2505.14165
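A sketch of the two pieces named above: a prompt template that reframes each FGSA subtask as text, and a lightweight TextCNN that scores the prompt-augmented input. This uses PyTorch rather than MindSpore for brevity, and the template wording, vocabulary size, and dimensions are all assumptions.

```python
import torch
import torch.nn as nn

def build_prompt(text, task):
    templates = {
        "aspect": f"Extract the aspect term: {text}",
        "sentiment": f"The sentiment toward the aspect in '{text}' is [MASK].",
        "cause": f"Explain why the opinion in '{text}' holds: [MASK].",
    }
    return templates[task]

class TextCNN(nn.Module):
    def __init__(self, vocab=30000, emb=128, n_filters=100, n_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb, n_filters, kernel_size=k) for k in (3, 4, 5)
        )
        self.out = nn.Linear(3 * n_filters, n_classes)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, emb, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))

print(build_prompt("The screen is great", "sentiment"))
logits = TextCNN()(torch.randint(0, 30000, (4, 64)))
```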
We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.
https://arxiv.org/abs/2505.14106
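The mechanism behind the reported gain, conditioning on personalized history, can be sketched as simple prompt assembly; the field names and truncation rule are placeholders, not the benchmark's actual format.

```python
def personalized_prompt(user_history, conversation, task_instruction,
                        max_history=5):
    """Prepend a user's recent posts so the model can adapt to their style."""
    history_block = "\n".join(
        f"- {post}" for post in user_history[-max_history:]
    )
    turns = "\n".join(f"{t['speaker']}: {t['text']}" for t in conversation)
    return (
        f"{task_instruction}\n\n"
        f"Recent posts by this user:\n{history_block}\n\n"
        f"Conversation so far:\n{turns}\n\nResponse:"
    )
```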
The emergence of global health crises, such as COVID-19 and Monkeypox (mpox), has underscored the importance of understanding public sentiment to inform effective public health strategies. This study conducts a comparative sentiment analysis of public perceptions surrounding COVID-19 and mpox by leveraging extensive datasets of 147,475 and 106,638 tweets, respectively. Advanced machine learning models, including Logistic Regression, Naive Bayes, RoBERTa, DistilRoBERTa and XLNet, were applied to perform sentiment classification, with results indicating key trends in public emotion and discourse. The analysis highlights significant differences in public sentiment driven by disease characteristics, media representation, and pandemic fatigue. Through the lens of sentiment polarity and thematic trends, this study offers valuable insights into tailoring public health messaging, mitigating misinformation, and fostering trust during concurrent health crises. The findings contribute to advancing sentiment analysis applications in public health informatics, setting the groundwork for enhanced real-time monitoring and multilingual analysis in future research.
https://arxiv.org/abs/2505.07430
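A minimal scikit-learn version of the classical baselines mentioned above (Logistic Regression over TF-IDF features); the transformer models in the study would replace this pipeline with fine-tuned encoders, and the toy corpus below stands in for the tweet datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder corpus; the study uses 147,475 COVID-19 and 106,638 mpox tweets.
tweets = ["vaccines are saving lives", "this outbreak coverage is fearmongering",
          "cases are dropping, feeling hopeful", "so tired of these lockdowns"]
labels = ["positive", "negative", "positive", "negative"]

X_tr, X_te, y_tr, y_te = train_test_split(tweets, labels, test_size=0.5,
                                          random_state=0, stratify=labels)
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```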
Multi-domain sentiment classification aims to mitigate the poor performance of models trained on scarce labeled data in a single domain by utilizing labeled data from multiple domains. A series of models that jointly train domain classifiers and sentiment classifiers have demonstrated their advantages, because domain classification helps generate information needed for sentiment classification. Intuitively, the sentiment classification task is equally important in every domain of a multi-domain setting, but the domain classification tasks differ, because the impact of domain information on sentiment classification varies across fields; this can be controlled through adjustable weights or hyperparameters. However, as the number of domains increases, existing hyperparameter optimization algorithms may face the following challenges: (1) a tremendous demand for computing resources, (2) convergence problems, and (3) high algorithmic complexity. To efficiently generate the domain information required for sentiment classification in each domain, we propose a dynamic information modulation algorithm. Specifically, the model training process is divided into two stages. In the first stage, a shared hyperparameter, which controls the proportion of domain classification tasks across all fields, is determined. In the second stage, we introduce a novel domain-aware modulation algorithm to adjust the domain information contained in the input text, computed with a gradient- and loss-based method. Experimental results on a public sentiment analysis dataset containing 16 domains demonstrate the superiority of the proposed method.
https://arxiv.org/abs/2505.06630
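The first-stage idea, one shared coefficient weighting the domain-classification objective against the sentiment objective across all domains, can be written directly; treat `lam` as the shared hyperparameter and the rest as generic PyTorch scaffolding, since the paper's second-stage modulation is more involved than this.

```python
import torch
import torch.nn.functional as F

def joint_loss(sent_logits, sent_y, dom_logits, dom_y, lam=0.3):
    """Stage 1: one shared lam scales the auxiliary domain task everywhere.
    Stage 2 of the paper replaces lam with per-example, gradient/loss-based
    modulation of the domain information, which is not reproduced here."""
    sent_loss = F.cross_entropy(sent_logits, sent_y)
    dom_loss = F.cross_entropy(dom_logits, dom_y)
    return sent_loss + lam * dom_loss

loss = joint_loss(torch.randn(8, 2, requires_grad=True),
                  torch.randint(0, 2, (8,)),
                  torch.randn(8, 16, requires_grad=True),
                  torch.randint(0, 16, (8,)))
loss.backward()
```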
Financial sentiment analysis (FSA) presents unique challenges to LLMs, beyond those of typical sentiment analysis, due to the nuanced language used in financial contexts. The prowess of these models is often undermined by the inherent subjectivity of sentiment classifications in existing benchmark datasets like Financial Phrasebank. These datasets typically feature undefined sentiment classes that reflect the highly individualized perspectives of annotators, leading to significant variability in annotations. This variability results in an unfair expectation of LLMs during benchmarking, where they are tasked with conjecturing the subjective viewpoints of human annotators without sufficient context. In this paper, we introduce the Annotators' Instruction Assisted Prompt (AIAP), a novel evaluation prompt designed to redefine the task definition of FSA for LLMs. By integrating detailed task instructions originally intended for human annotators into the LLMs' prompt framework, AIAP aims to standardize the understanding of sentiment across both human and machine interpretations, providing a fair and context-rich foundation for sentiment analysis. We utilize a new dataset, WSBS, derived from the WallStreetBets subreddit, to demonstrate how AIAP significantly enhances LLM performance by aligning machine operations with the refined task definitions. Experimental results demonstrate that AIAP improves LLM performance significantly, with gains of up to 9.08. This context-aware approach not only yields incremental performance gains but also introduces an innovative sentiment-indexing method based on model confidence scores. This method enhances stock price prediction models and extracts more value from financial sentiment analysis, underscoring the significance of WSB as a critical source of financial text. Our research offers insights into improving FSA through better evaluation methods.
https://arxiv.org/abs/2505.07871
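Two pieces lend themselves to a sketch: embedding the annotator guidelines in the prompt, and a confidence-weighted sentiment index. Both implementations below are assumptions; the abstract specifies neither the guideline text nor the index formula.

```python
def aiap_prompt(guidelines, post):
    # The same written instructions the human annotators received go verbatim
    # into the prompt, so model and annotator share one task definition.
    return (
        f"Annotation guidelines:\n{guidelines}\n\n"
        "Following the guidelines exactly, label the post as positive, "
        f"negative, or neutral.\n\nPost: {post}\nLabel:"
    )

def sentiment_index(predictions):
    """predictions: (label, confidence) pairs for one day's posts.
    Confidence-weighted net sentiment in [-1, 1], usable as a daily signal."""
    signed = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}
    total = sum(conf for _, conf in predictions) or 1.0
    return sum(signed[label] * conf for label, conf in predictions) / total

print(sentiment_index([("positive", 0.9), ("negative", 0.6), ("neutral", 0.8)]))
```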
Engagement between client and therapist is a critical determinant of therapeutic success. We propose a multi-dimensional natural language processing (NLP) framework that objectively classifies engagement quality in counseling sessions based on textual transcripts. Using 253 motivational interviewing transcripts (150 high-quality, 103 low-quality), we extracted 42 features across four domains: conversational dynamics, semantic similarity as topic alignment, sentiment classification, and question detection. Classifiers, including Random Forest (RF), CatBoost, and Support Vector Machines (SVM), were hyperparameter-tuned and trained using stratified 5-fold cross-validation and evaluated on a holdout test set. On balanced (non-augmented) data, RF achieved the highest classification accuracy (76.7%), and SVM achieved the highest AUC (85.4%). After SMOTE-Tomek augmentation, performance improved significantly: RF achieved up to 88.9% accuracy, 90.0% F1-score, and 94.6% AUC, while SVM reached 81.1% accuracy, 83.1% F1-score, and 93.6% AUC. The augmented-data results reflect the potential of the framework in future larger-scale applications. Feature-contribution analysis revealed that conversational dynamics and client-therapist semantic similarity were among the top contributors, led by words uttered by the client (mean and standard deviation). The framework was robust across the original and augmented datasets and demonstrated consistent improvements in F1-scores and recall. While currently text-based, the framework supports future multimodal extensions (e.g., vocal tone, facial affect) for more holistic assessments. This work introduces a scalable, data-driven method for evaluating the engagement quality of therapy sessions, offering clinicians real-time feedback to enhance the quality of both virtual and in-person therapeutic interactions.
https://arxiv.org/abs/2505.06151
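A compact version of the training protocol described above using scikit-learn and imbalanced-learn; the 42 engineered features are replaced by a random placeholder matrix, and hyperparameter tuning is omitted.

```python
import numpy as np
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(253, 42))              # 42 engineered features per session
y = np.array([1] * 150 + [0] * 103)         # high- vs low-quality engagement

# imblearn's Pipeline resamples only the training folds, avoiding leakage.
pipe = Pipeline([
    ("resample", SMOTETomek(random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=300, random_state=0)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(pipe, X, y, cv=cv, scoring="f1").mean())
```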
Sentiment classification, a complex task in natural language processing, becomes even more challenging when analyzing passages with multiple conflicting tones. Longer passages typically exacerbate this issue, leading to decreased model performance. This paper introduces novel methodologies for isolating conflicting sentiments and aggregating them to effectively predict the overall sentiment of such passages. One of the aggregation strategies involves a Multi-Layer Perceptron (MLP) model, which outperforms baseline models across various datasets, including Amazon, Twitter, and SST, while costing roughly 1/100 of what fine-tuning the baseline would take.
https://arxiv.org/abs/2505.06320
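A sketch of the isolate-then-aggregate idea: split a passage into segments, score each with any off-the-shelf sentiment model (a placeholder callable here), and let a small MLP map the set of segment scores to an overall label. The segmenter, feature layout, and sizes are assumptions.

```python
import torch
import torch.nn as nn

def segment_scores(passage, score_fn, max_segments=8):
    """score_fn maps a sentence to a (neg, neu, pos) probability triple."""
    segments = [s.strip() for s in passage.split(".") if s.strip()]
    feats = [score_fn(s) for s in segments[:max_segments]]
    feats += [(0.0, 0.0, 0.0)] * (max_segments - len(feats))  # pad
    return torch.tensor(feats).flatten()                      # (max_segments * 3,)

aggregator = nn.Sequential(            # small MLP, cheap to train vs. fine-tuning
    nn.Linear(24, 64), nn.ReLU(),
    nn.Linear(64, 2),                  # overall positive / negative
)

fake_scores = lambda s: (0.2, 0.1, 0.7)    # stand-in sentiment scorer
x = segment_scores("Great screen. Battery is awful. Still worth it.", fake_scores)
print(aggregator(x.unsqueeze(0)))
```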