In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming to extremely compress multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% on four downstream tasks (sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging) across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.
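A minimal sketch of the depth-truncation and vocabulary-trimming steps, assuming an XLM-R-style encoder loaded with Hugging Face transformers; the model name, target corpus, and layer count are illustrative, and the paper's two-step distillation and structured pruning are omitted:

```python
# Sketch only: truncate encoder depth and trim the vocabulary of a
# multilingual encoder. Distillation and structured pruning are omitted.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"                       # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# 1) Reduce layer depth: keep only the first k transformer layers.
k = 3
model.encoder.layer = torch.nn.ModuleList(model.encoder.layer[:k])
model.config.num_hidden_layers = k

# 2) Trim vocabulary: keep only ids observed in target-language text
#    (plus special tokens). A real pipeline would also remap tokenizer ids.
target_corpus = ["Habari ya leo?", "Ninafurahi kukutana nawe."]  # toy sentences
keep_ids = set(tokenizer.all_special_ids)
for text in target_corpus:
    keep_ids.update(tokenizer(text)["input_ids"])
keep_ids = sorted(keep_ids)

old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.size(1))
new_emb.weight.data.copy_(old_emb[keep_ids])
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(keep_ids)
print(f"kept {k} layers and {len(keep_ids)} vocabulary entries")
```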
https://arxiv.org/abs/2505.16956
Large Language Models (LLMs) have rapidly become central to NLP, demonstrating an ability to adapt to a wide range of tasks, including sentiment analysis, through prompting techniques. However, we still have a limited understanding of how these models capture sentiment-related information. This study probes the hidden layers of Llama models to pinpoint where sentiment features are most represented and to assess how this affects sentiment analysis. Using probing classifiers, we analyze sentiment encoding across layers and model scales, identifying the layers and pooling methods that best capture sentiment signals. Our results show that sentiment information is most concentrated in the mid-layers for binary polarity tasks, with detection accuracy improving by up to 14% over prompting techniques. Additionally, we find that in decoder-only models the last token is not consistently the most informative for sentiment encoding. Finally, this approach enables sentiment tasks to be performed with memory requirements reduced by an average of 57%. These insights contribute to a broader understanding of sentiment in LLMs, suggesting layer-specific probing as an effective approach for sentiment tasks beyond prompting, with the potential to enhance model utility and reduce memory requirements.
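As a rough illustration of this probing setup (the checkpoint, toy data, and probe below are placeholders, not the paper's exact configuration), one can extract per-layer hidden states from a Llama-style model, pool them, and fit a linear probe per layer:

```python
# Sketch: layer-wise sentiment probing with mean vs. last-token pooling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "meta-llama/Llama-3.2-1B"    # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
lm.eval()

texts = ["Great movie, loved it.", "Terrible service, never again."]  # toy data
labels = [1, 0]

def layer_features(text, layer, pooling="mean"):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = lm(**ids).hidden_states[layer][0]        # (seq_len, hidden_size)
    return (hs.mean(0) if pooling == "mean" else hs[-1]).float().numpy()

for layer in range(lm.config.num_hidden_layers + 1):  # index 0 = embedding layer
    X = [layer_features(t, layer) for t in texts]
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(layer, probe.score(X, labels))   # real runs use a held-out split
```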
https://arxiv.org/abs/2505.16491
During sudden disaster events, accurately predicting public panic sentiment on social media is crucial for proactive governance and crisis management. Current efforts on this problem face three main challenges: a lack of finely annotated data hinders emotion prediction studies, unmodeled risk perception causes prediction inaccuracies, and the mechanisms of panic formation remain insufficiently interpretable. We address these issues by proposing a Psychology-driven generative Agent framework (PsychoAgent) for explainable panic prediction based on emotion arousal theory. Specifically, we first construct a fine-grained open panic emotion dataset (namely COPE) via collaboration between humans and large language models (LLMs) to mitigate semantic bias. Then, we develop a framework integrating cross-domain heterogeneous data grounded in psychological mechanisms to model risk perception and cognitive differences in emotion generation. To enhance interpretability, we design an LLM-based role-playing agent that simulates individual psychological chains through carefully designed prompts. Experimental results on our annotated dataset show that PsychoAgent improves panic emotion prediction performance by 12.6% to 21.7% over baseline models. Furthermore, the explainability and generalization of our approach are validated. Crucially, this represents a paradigm shift from opaque "data-driven fitting" to transparent "role-based simulation with mechanistic interpretation" for panic emotion prediction during emergencies. Our implementation is publicly available at: this https URL.
https://arxiv.org/abs/2505.16455
The advancements in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning and has emerged as a dominant trend in practical applications. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, does not benefit from this convenience. The zero-shot paradigm exhibits undesirable performance on MSA, casting doubt on whether MLLMs can perceive sentiment as competently as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study of demonstration configuration, we validate that MLLMs indeed possess this capability. Specifically, three key factors covering demonstration retrieval, presentation, and distribution are comprehensively investigated and optimized. A sentiment prediction bias inherent in MLLMs is also discovered and effectively counteracted. Complementing one another, the devised strategies for the three factors yield average accuracy improvements of 15.9% on six MSA datasets over the zero-shot paradigm and 11.2% over a random ICL baseline.
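A minimal sketch of the retrieval factor only (the encoder name and demonstrations are placeholders; presentation order and label distribution, which the paper also optimizes, are not shown):

```python
# Sketch: similarity-based demonstration retrieval for in-context learning.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed retriever
pool = [  # labeled demonstration pool (toy examples)
    ("The acting was superb and the visuals stunning.", "positive"),
    ("Dull plot, I nearly fell asleep.", "negative"),
    ("An average film, nothing memorable.", "neutral"),
]
query = "The soundtrack was beautiful but the story dragged."

pool_emb = encoder.encode([t for t, _ in pool], convert_to_tensor=True)
q_emb = encoder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(q_emb, pool_emb)[0]
top = scores.argsort(descending=True)[:2]           # k most similar demos

prompt = ""
for i in top.tolist():
    text, label = pool[i]
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"
print(prompt)
```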
https://arxiv.org/abs/2505.16193
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. We then train linear classifiers to identify a small subspace of task-relevant dimensions in the latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments on sentiment, truthfulness, and political polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions.
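The following self-contained sketch illustrates the idea with synthetic activations and a stand-in SAE; the paper uses pretrained SAEs on real model activations, so the dimensions, training details, and optimization objective here are assumptions:

```python
# Sketch: (1) sparse codes, (2) linear probe selects a small task subspace,
# (3) a steering vector is learned only inside that subspace.
import torch
import torch.nn.functional as F

d_model, d_sae, n = 64, 512, 200
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5   # stand-in SAE encoder
W_dec = torch.randn(d_sae, d_model) / d_sae ** 0.5     # stand-in SAE decoder

acts = torch.randn(n, d_model)                         # stand-in activations
labels = (acts[:, 0] > 0).float()                      # stand-in target behavior
z = torch.relu(acts @ W_enc)                           # sparse latent codes

# 1) Linear classifier over latent dimensions.
clf = torch.nn.Linear(d_sae, 1)
opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
for _ in range(200):
    loss = F.binary_cross_entropy_with_logits(clf(z).squeeze(-1), labels)
    opt.zero_grad(); loss.backward(); opt.step()
for p in clf.parameters():
    p.requires_grad_(False)

# 2) Keep only the top-weighted dimensions as the steering subspace.
top_dims = clf.weight.abs().squeeze(0).topk(16).indices
mask = torch.zeros(d_sae)
mask[top_dims] = 1.0

# 3) Learn a steering offset constrained to that subspace.
delta = torch.zeros(d_sae, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(200):
    logits = clf(z + mask * delta).squeeze(-1)          # only top dims move
    loss = F.binary_cross_entropy_with_logits(logits, torch.ones(n))
    opt.zero_grad(); loss.backward(); opt.step()

steering_vector = ((mask * delta) @ W_dec).detach()     # back to residual space
print(steering_vector.shape)                            # torch.Size([64])
```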
https://arxiv.org/abs/2505.16188
This study introduces an interpretable machine learning (ML) framework to extract macroeconomic alpha from global news sentiment. We process the Global Database of Events, Language, and Tone (GDELT) Project's worldwide news feed using FinBERT -- a Bidirectional Encoder Representations from Transformers (BERT) based model pretrained on finance-specific language -- to construct daily sentiment indices incorporating mean tone, dispersion, and event impact. These indices drive an XGBoost classifier, benchmarked against logistic regression, to predict next-day returns for EUR/USD, USD/JPY, and 10-year U.S. Treasury futures (ZN). Rigorous out-of-sample (OOS) backtesting (5-fold expanding-window cross-validation, OOS period: c. 2017-April 2025) demonstrates exceptional, cost-adjusted performance for the XGBoost strategy: Sharpe ratios achieve 5.87 (EUR/USD), 4.65 (USD/JPY), and 4.65 (Treasuries), with respective compound annual growth rates (CAGRs) exceeding 50% in Foreign Exchange (FX) and 22% in bonds. Shapley Additive Explanations (SHAP) affirm that sentiment dispersion and article impact are key predictive features. Our findings establish that integrating domain-specific Natural Language Processing (NLP) with interpretable ML offers a potent and explainable source of macro alpha.
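A compressed sketch of the feature-and-classifier stage (toy data and assumed column names; event-impact weighting, the logistic-regression benchmark, and expanding-window backtesting are omitted):

```python
# Sketch: daily sentiment features from FinBERT scores + XGBoost on next-day
# direction. Labels and data below are toy placeholders.
import pandas as pd
from transformers import pipeline
from xgboost import XGBClassifier

finbert = pipeline("text-classification", model="ProsusAI/finbert")

news = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-02", "2024-01-03"],
    "headline": ["ECB signals rate cuts", "Eurozone PMI beats forecasts",
                 "US yields spike on strong payrolls"],
})

def signed_score(text):
    out = finbert(text)[0]                     # {'label': ..., 'score': ...}
    sign = {"positive": 1, "negative": -1, "neutral": 0}[out["label"]]
    return sign * out["score"]

news["tone"] = news["headline"].map(signed_score)
daily = news.groupby("date")["tone"].agg(mean_tone="mean", dispersion="std").fillna(0.0)

# Toy next-day return direction for EUR/USD (1 = up); real labels come from prices.
daily["y"] = [1, 0]
model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(daily[["mean_tone", "dispersion"]], daily["y"])
```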
https://arxiv.org/abs/2505.16136
As of 2025, Generative Artificial Intelligence (GenAI) has become a central tool for productivity across industries. Beyond text generation, GenAI now plays a critical role in coding, data analysis, and research workflows. As large language models (LLMs) continue to evolve, it is essential to assess the reliability and accuracy of their outputs, especially in specialized, high-stakes domains like finance. Most modern LLMs transform text into numerical vectors, which are used in operations such as cosine similarity searches to generate responses. However, this abstraction process can lead to misinterpretation of emotional tone, particularly in nuanced financial contexts. While LLMs generally excel at identifying sentiment in everyday language, these models often struggle with the nuanced, strategically ambiguous language found in earnings call transcripts. Financial disclosures frequently embed sentiment in hedged statements, forward-looking language, and industry-specific jargon, making it difficult even for human analysts to interpret consistently, let alone AI models. This paper presents findings from the Santa Clara Microsoft Practicum Project, led by Professor Charlie Goldenberg, which benchmarks the performance of Microsoft's Copilot, OpenAI's ChatGPT, Google's Gemini, and traditional machine learning models for sentiment analysis of financial text. Using Microsoft earnings call transcripts, the analysis assesses how well LLM-derived sentiment correlates with market sentiment and stock movements and evaluates the accuracy of model outputs. Prompt engineering techniques are also examined to improve sentiment analysis results. Visualizations of sentiment consistency are developed to evaluate alignment between tone and stock performance, with sentiment trends analyzed across Microsoft's lines of business to determine which segments exert the greatest influence.
https://arxiv.org/abs/2505.16090
We present BiasLab, a dataset of 300 political news articles annotated for perceived ideological bias. These articles were selected from a curated 900-document pool covering diverse political events and source biases. Each article is labeled by crowdworkers along two independent scales, assessing sentiment toward the Democratic and Republican parties, and enriched with rationale indicators. The annotation pipeline incorporates targeted worker qualification and was refined through pilot-phase analysis. We quantify inter-annotator agreement, analyze misalignment with source-level outlet bias, and organize the resulting labels into interpretable subsets. Additionally, we simulate annotation using schema-constrained GPT-4o, enabling direct comparison to human labels and revealing mirrored asymmetries, especially in misclassifying subtly right-leaning content. We define two modeling tasks: perception drift prediction and rationale type classification, and report baseline performance to illustrate the challenge of explainable bias detection. BiasLab's rich rationale annotations provide actionable interpretations that facilitate explainable modeling of political bias, supporting the development of transparent, socially aware NLP systems. We release the dataset, annotation schema, and modeling code to encourage research on human-in-the-loop interpretability and the evaluation of explanation effectiveness in real-world settings.
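A rough sketch of the schema-constrained simulation step, assuming the OpenAI Python client; the prompt wording, field names, and rating scale are placeholders rather than BiasLab's actual annotation schema:

```python
# Sketch: ask GPT-4o for a JSON-constrained perception annotation.
import json
from openai import OpenAI

client = OpenAI()   # requires OPENAI_API_KEY in the environment

article = "Senate passes the infrastructure bill after months of negotiation."
instructions = (
    "Rate the article's sentiment toward each party on a -2..2 scale and name "
    "a rationale type (factual framing, loaded language, source selection). "
    'Reply as JSON: {"dem_sentiment": int, "rep_sentiment": int, "rationale": str}.'
)
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system", "content": instructions},
              {"role": "user", "content": article}],
    response_format={"type": "json_object"},
)
print(json.loads(resp.choices[0].message.content))
```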
https://arxiv.org/abs/2505.16081
Bias in Large Language Models (LLMs) significantly undermines their reliability and fairness. We focus on a common form of bias: when two reference concepts in the model's concept space, such as sentiment polarities (e.g., "positive" and "negative"), are asymmetrically correlated with a third, target concept, such as a reviewing aspect, the model exhibits unintended bias. For instance, the understanding of "food" should not skew toward any particular sentiment. Existing bias evaluation methods assess behavioral differences of LLMs by constructing labeled data for different social groups and measuring model responses across them, a process that requires substantial human effort and captures only a limited set of social concepts. To overcome these limitations, we propose BiasLens, a test-set-free bias analysis framework based on the structure of the model's vector space. BiasLens combines Concept Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract interpretable concept representations, and quantifies bias by measuring the variation in representational similarity between the target concept and each of the reference concepts. Even without labeled data, BiasLens shows strong agreement with traditional bias evaluation metrics (Spearman correlation r > 0.85). Moreover, BiasLens reveals forms of bias that are difficult to detect using existing methods. For example, in simulated clinical scenarios, a patient's insurance status can cause the LLM to produce biased diagnostic assessments. Overall, BiasLens offers a scalable, interpretable, and efficient paradigm for bias discovery, paving the way for improving fairness and transparency in LLMs.
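A self-contained toy sketch of the similarity-asymmetry idea (synthetic activations; BiasLens additionally projects activations through a sparse autoencoder before computing concept representations):

```python
# Sketch: CAV-style concept vectors and a bias score from their asymmetric
# similarity to a target concept.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 128

def cav(concept_acts, contrast_acts):
    """Concept activation vector = normal of a linear probe separating
    concept activations from contrast activations."""
    X = np.vstack([concept_acts, contrast_acts])
    y = np.array([1] * len(concept_acts) + [0] * len(contrast_acts))
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    return w / np.linalg.norm(w)

random_acts = rng.normal(size=(100, d))                      # contrast set
pos_cav  = cav(rng.normal(0.5, 1, (100, d)), random_acts)    # "positive"
neg_cav  = cav(rng.normal(-0.5, 1, (100, d)), random_acts)   # "negative"
food_cav = cav(rng.normal(0.2, 1, (100, d)), random_acts)    # target: "food"

sim_pos = food_cav @ pos_cav
sim_neg = food_cav @ neg_cav
print(f"bias score (asymmetry): {sim_pos - sim_neg:+.3f}")   # ~0 means unbiased
```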
https://arxiv.org/abs/2505.15524
We present NeoN, a tool for detecting and analyzing Polish neologisms. Unlike traditional dictionary-based methods that require extensive manual review, NeoN combines reference corpora, Polish-specific linguistic filters, an LLM-driven precision-boosting filter, and daily RSS monitoring in a multi-layered pipeline. The system uses context-aware lemmatization, frequency analysis, and orthographic normalization to extract candidate neologisms while consolidating inflectional variants. Researchers can verify candidates through an intuitive interface with visualizations and filtering controls. An integrated LLM module automatically generates definitions and categorizes neologisms by domain and sentiment. Evaluations show that NeoN maintains high accuracy while significantly reducing manual effort, providing an accessible solution for tracking lexical innovation in Polish.
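A toy sketch of the frequency-based candidate step only (the Polish-specific lemmatization, orthographic normalization, LLM precision filter, and RSS monitoring layers are not shown):

```python
# Sketch: flag word forms absent from a reference corpus and frequent in new text.
import re
from collections import Counter

reference_corpus = "stary tekst referencyjny zawiera znane słowa " * 50
new_stream = "nowe słowo doomscrolling pojawia się w mediach doomscrolling"

def tokens(text):
    return re.findall(r"\b\w+\b", text.lower())

ref_vocab = set(tokens(reference_corpus))
freq = Counter(t for t in tokens(new_stream) if t not in ref_vocab)

# Candidates: unseen forms above a frequency threshold.
candidates = [w for w, c in freq.most_common() if c >= 2]
print(candidates)
```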
https://arxiv.org/abs/2505.15426
Detoxifying offensive language while preserving the speaker's original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models with variant architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji, homophone, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.
https://arxiv.org/abs/2505.15297
Sarcasm is a challenge for sentiment analysis because of the incongruity between stated and implied sentiment. The challenge is exacerbated when the implication may be relevant to a specific country or geographical region. Pragmatic metacognitive prompting (PMP) is a cognition-inspired technique that has been used for pragmatic reasoning. In this paper, we harness PMP for explainable sarcasm detection in Australian and Indian English, alongside a benchmark dataset for standard English. We manually add sarcasm explanations to BESSTIE, an existing sarcasm-labeled dataset for Australian and Indian English, and compare explainable sarcasm detection performance on it against FLUTE, a standard English dataset containing sarcasm explanations. When evaluated on two open-weight LLMs (GEMMA and LLAMA), our PMP-based approach achieves statistically significant performance improvements across all tasks and datasets compared with four alternative prompting strategies. We also find that alternative techniques such as agentic prompting mitigate context-related failures by enabling external knowledge retrieval. The focused contribution of our work is the use of PMP to generate sarcasm explanations for varieties of English.
https://arxiv.org/abs/2505.15095
Effective cross-lingual transfer remains a critical challenge in scaling the benefits of large language models from high-resource to low-resource languages. Towards this goal, prior studies have explored many approaches to combine task knowledge from task-specific data in a (high-resource) source language and language knowledge from unlabeled text in a (low-resource) target language. One notable approach proposed composable sparse fine-tuning (SFT) for cross-lingual transfer, which learns task-specific and language-specific sparse masks to select a subset of the pretrained model's parameters that are further fine-tuned. These sparse fine-tuned vectors (SFTs) are subsequently composed with the pretrained model to facilitate zero-shot cross-lingual transfer to a task in a target language, using only task-specific data from a source language. The sparse masks for SFTs were identified using simple magnitude-based pruning. In our work, we introduce DeFT-X, a novel composable SFT approach that denoises the weight matrices of a pretrained model via singular value decomposition before magnitude pruning, thus yielding more robust SFTs. We evaluate DeFT-X on a diverse set of extremely low-resource languages for sentiment classification (NusaX) and natural language inference (AmericasNLI) and demonstrate that it performs on par with or outperforms SFT and other prominent cross-lingual transfer baselines.
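As a rough illustration (toy matrix; the rank and mask density are assumed hyperparameters), the denoise-then-prune step can be sketched as:

```python
# Sketch: SVD low-rank denoising followed by magnitude-based mask selection.
import torch

W = torch.randn(256, 256)                      # stand-in pretrained weight matrix
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

r = 32                                          # assumed rank for denoising
W_denoised = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

# Keep the top-k entries of the denoised matrix as the sparse fine-tuning mask.
k = int(0.05 * W.numel())                       # e.g. 5% density
threshold = W_denoised.abs().flatten().kthvalue(W.numel() - k).values
mask = (W_denoised.abs() > threshold).float()
print(f"mask density: {mask.mean():.3f}")
```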
https://arxiv.org/abs/2505.15090
Peer review is vital in academia for evaluating research quality. Top AI conferences use reviewer confidence scores to ensure review reliability, but existing studies lack fine-grained analysis of text-score consistency, potentially missing key details. This work assesses consistency at the word, sentence, and aspect levels using deep learning and NLP conference review data. We employ deep learning to detect hedge sentences and aspects, then analyze report length, hedge word/sentence frequency, aspect mentions, and sentiment to evaluate text-score alignment. Correlation, significance, and regression tests examine the impact of confidence scores on paper outcomes. Results show high text-score consistency across all levels, with regression revealing that higher confidence scores correlate with paper rejection, validating expert assessments and peer review fairness.
https://arxiv.org/abs/2505.15031
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
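The intra-rater check reduces to asking whether all replicates of a model agree on each item; a toy sketch with assumed column names:

```python
# Sketch: an article counts as consistent only if all replicate labels agree.
import pandas as pd

# Five replicate labels per article for one assumed model.
runs = pd.DataFrame({
    "article_id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "label":      ["pos", "pos", "pos", "pos", "pos",
                   "neg", "neg", "pos", "neg", "neg"],
})
perfect = runs.groupby("article_id")["label"].nunique().eq(1)
print(f"perfect-agreement rate: {perfect.mean():.0%}")   # 50% in this toy case
```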
https://arxiv.org/abs/2505.14918
Despite the many benefits of large language models (LLMs), they can also cause harm, e.g., through the automatic generation of misinformation, including conspiracy theories. Moreover, LLMs can also "disguise" conspiracy theories by altering characteristic textual features, e.g., by transforming their typically strong negative emotions into a more positive tone. Although several studies have proposed automated conspiracy theory detection methods, they are usually trained using human-authored text, whose features can differ from those of LLM-generated text. Furthermore, several conspiracy detection models, including the previously proposed ConspEmoLLM, rely heavily on the typical emotional features of human-authored conspiracy content. As such, intentionally disguised content may evade detection. To combat such issues, we first developed an augmented version of the ConDID conspiracy detection dataset, ConDID-v2, which supplements human-authored conspiracy tweets with versions rewritten by an LLM to reduce the negativity of their original sentiment. The quality of the rewritten tweets was verified by combining human and LLM-based assessment. We subsequently used ConDID-v2 to train ConspEmoLLM-v2, an enhanced version of ConspEmoLLM. Experimental results demonstrate that ConspEmoLLM-v2 retains or exceeds the performance of ConspEmoLLM on the original human-authored content in ConDID, and considerably outperforms both ConspEmoLLM and several other baselines when applied to the sentiment-transformed tweets in ConDID-v2. The project will be available at this https URL.
https://arxiv.org/abs/2505.14917
There has been growing interest in Multimodal Aspect-Based Sentiment Analysis (MABSA) in recent years. Existing methods predominantly rely on pre-trained small language models (SLMs) to collect aspect- and sentiment-related information from both image and text, with the aim of aligning these two modalities. However, SLMs possess limited capacity and knowledge, often resulting in inaccurate identification of meaning, aspects, sentiments, and their interconnections in textual and visual data. On the other hand, large language models (LLMs) have shown exceptional capabilities in various tasks by effectively exploring fine-grained information in multimodal data. However, some studies indicate that LLMs still fall short of fine-tuned small models in the field of ABSA. Based on these findings, we propose a novel framework, termed LRSA, which combines the decision-making capabilities of SLMs with additional information provided by LLMs for MABSA. Specifically, we inject explanations generated by LLMs as rationales into SLMs and employ a dual cross-attention mechanism to enhance feature interaction and fusion, thereby augmenting the SLMs' ability to identify aspects and sentiments. We evaluated our method using two baseline models; extensive experiments on three widely used benchmarks highlight the superiority of our approach, indicating its generalizability and applicability to most pre-trained models for MABSA.
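A minimal sketch of one plausible dual cross-attention fusion block (dimensions, pooling, and head counts are assumptions; the paper's exact architecture may differ):

```python
# Sketch: features attend to LLM rationale tokens and vice versa, then fuse.
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.feat_to_rat = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rat_to_feat = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, feats, rationale):
        # feats: (B, Lf, d) fused text/image features from the SLM
        # rationale: (B, Lr, d) encoded LLM explanation tokens
        a, _ = self.feat_to_rat(feats, rationale, rationale)   # feats query rationale
        b, _ = self.rat_to_feat(rationale, feats, feats)       # rationale queries feats
        pooled = torch.cat([a.mean(1), b.mean(1)], dim=-1)
        return self.fuse(pooled)                               # (B, d) for classification

block = DualCrossAttention()
out = block(torch.randn(2, 20, 256), torch.randn(2, 32, 256))
print(out.shape)   # torch.Size([2, 256])
```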
https://arxiv.org/abs/2505.14499
Despite significant progress in neural spoken dialog systems, personality-aware conversation agents -- capable of adapting behavior based on personalities -- remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.
https://arxiv.org/abs/2505.14356
Fine-grained sentiment analysis (FGSA) aims to identify sentiment polarity toward specific aspects within a text, enabling more precise opinion mining in domains such as product reviews and social media. However, traditional FGSA approaches often require task-specific architectures and extensive annotated data, limiting their generalization and scalability. To address these challenges, we propose PL-FGSA, a unified prompt learning-based framework implemented using the MindSpore platform, which integrates prompt design with a lightweight TextCNN backbone. Our method reformulates FGSA as a multi-task prompt-augmented generation problem, jointly tackling aspect extraction, sentiment classification, and causal explanation in a unified paradigm. By leveraging prompt-based guidance, PL-FGSA enhances interpretability and achieves strong performance under both full-data and low-resource conditions. Experiments on three benchmark datasets (SST-2, SemEval-2014 Task 4, and MAMS) demonstrate that our model consistently outperforms traditional fine-tuning methods and achieves F1-scores of 0.922, 0.694, and 0.597, respectively. These results validate the effectiveness of prompt-based generalization and highlight the practical value of PL-FGSA for real-world sentiment analysis tasks.
https://arxiv.org/abs/2505.14165
Multi-task learning (MTL) enables the efficient transfer of extra knowledge acquired from other tasks. The high correlation between multimodal sentiment analysis (MSA) and multimodal emotion recognition (MER) supports their joint training. However, existing methods primarily employ hard parameter sharing, ignoring parameter conflicts caused by complex task correlations. In this paper, we present a novel MTL method for MSA and MER, termed Multimodal Mixture of Low-Rank Experts (MMoLRE). MMoLRE utilizes shared and task-specific experts to distinctly model common and unique task characteristics, thereby avoiding parameter conflicts. Additionally, inspired by low-rank structures in the Mixture of Experts (MoE) framework, we design low-rank expert networks to reduce parameter and computational overhead as the number of experts increases. Extensive experiments on the CMU-MOSI and CMU-MOSEI benchmarks demonstrate that MMoLRE achieves state-of-the-art performance on the MSA task and competitive results on the MER task.
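A simplified sketch of the shared-plus-task-specific low-rank expert idea (sizes, gating, and fusion below are assumptions, not the paper's exact design):

```python
# Sketch: mixture of low-rank experts with shared and per-task expert sets.
import torch
import torch.nn as nn

class LowRankExpert(nn.Module):
    def __init__(self, d_model=256, rank=16):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)   # low-rank bottleneck
        self.up = nn.Linear(rank, d_model, bias=False)
    def forward(self, x):
        return self.up(torch.relu(self.down(x)))

class MixtureOfLowRankExperts(nn.Module):
    def __init__(self, d_model=256, n_shared=2, n_task=2, rank=16, n_tasks=2):
        super().__init__()
        self.shared = nn.ModuleList(LowRankExpert(d_model, rank) for _ in range(n_shared))
        # One private expert set per task (0 = MSA, 1 = MER in this sketch).
        self.task = nn.ModuleList(
            nn.ModuleList(LowRankExpert(d_model, rank) for _ in range(n_task))
            for _ in range(n_tasks))
        self.gate = nn.ModuleList(nn.Linear(d_model, n_shared + n_task) for _ in range(n_tasks))

    def forward(self, x, task_id):
        experts = list(self.shared) + list(self.task[task_id])
        weights = torch.softmax(self.gate[task_id](x), dim=-1)        # (B, E)
        outs = torch.stack([e(x) for e in experts], dim=-1)           # (B, d, E)
        return x + (outs * weights.unsqueeze(1)).sum(-1)              # residual fusion

layer = MixtureOfLowRankExperts()
print(layer(torch.randn(4, 256), task_id=0).shape)   # torch.Size([4, 256])
```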
https://arxiv.org/abs/2505.14143