In various natural language processing (NLP) tasks, fine-tuning Pre-trained Language Models (PLMs) often leads to the issue of spurious correlations, which negatively impacts performance, particularly when dealing with out-of-distribution data. To address this problem, we propose SALAD (Structure Aware and LLM-driven Augmented Data), a novel approach designed to enhance model robustness and generalization by generating structure-aware and counterfactually augmented data for contrastive learning. Our method leverages a tagging-based approach to generate structure-aware positive samples and utilizes large language models (LLMs) to generate counterfactual negative samples with diverse sentence patterns. By applying contrastive learning, SALAD enables the model to focus on learning the structural relationships between key sentence components while minimizing reliance on spurious correlations. We validate our approach through experiments on three tasks: Sentiment Classification, Sexism Detection, and Natural Language Inference. The results demonstrate that SALAD not only improves model robustness and performance across different environments but also enhances generalization to out-of-distribution datasets and cross-domain scenarios.
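To ground the contrastive-learning idea, here is a minimal sketch of an InfoNCE-style objective over one anchor sentence, one structure-aware positive, and several counterfactual negatives. The embedding dimension, temperature, and random embeddings are placeholders, not values or code from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss: pull the structure-aware positive toward the
    anchor and push the counterfactual negatives away.

    anchor:    (d,) embedding of the original sentence
    positive:  (d,) embedding of a structure-aware augmentation
    negatives: (k, d) embeddings of counterfactual negatives
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor @ positive) / temperature            # scalar
    neg_sim = (negatives @ anchor) / temperature            # (k,)
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])     # (1 + k,)
    # The positive sits at index 0; cross-entropy maximizes its probability.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage with random vectors standing in for PLM sentence embeddings.
d = 128
loss = contrastive_loss(torch.randn(d), torch.randn(d), torch.randn(8, d))
```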
https://arxiv.org/abs/2504.12185
One fundamental question for the social sciences today is: how much can we trust highly complex predictive models like ChatGPT? This study tests the hypothesis that subtle changes in the structure of prompts do not produce significant variations in the classification results of sentiment polarity analysis generated by the Large Language Model GPT-4o mini. Using a dataset of 100,000 comments in Spanish on four Latin American presidents, the model classified the comments as positive, negative, or neutral on 10 occasions, varying the prompts slightly each time. The experimental methodology included exploratory and confirmatory analyses to identify significant discrepancies among classifications. The results reveal that even minor modifications to prompts, such as lexical, syntactic, or modal changes, or even a lack of structure, impact the classifications. In certain cases, the model produced inconsistent responses, such as mixing categories, providing unsolicited explanations, or using languages other than Spanish. Statistical analysis using Chi-square tests confirmed significant differences in most comparisons between prompts, except in one case where the linguistic structures were highly similar. These findings challenge the robustness and trustworthiness of Large Language Models for classification tasks, highlighting their vulnerability to variations in instructions. Moreover, it was evident that a lack of structured grammar in prompts increases the frequency of hallucinations. The discussion underscores that trust in Large Language Models rests not only on technical performance but also on the social and institutional relationships underpinning their use.
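For readers who want to reproduce this kind of comparison, a small sketch of a chi-square test between the label distributions produced by two prompt variants follows; the counts are invented for illustration and are not the study's data.

```python
from scipy.stats import chi2_contingency

# Hypothetical label counts (positive, negative, neutral) produced by two
# prompt variants over the same set of comments.
counts_prompt_a = [41200, 38900, 19900]
counts_prompt_b = [39800, 40100, 20100]

chi2, p_value, dof, expected = chi2_contingency([counts_prompt_a, counts_prompt_b])
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4g}")
# A small p-value indicates the two prompts yield significantly different
# classification distributions.
```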
https://arxiv.org/abs/2504.12180
Multimodal Sentiment Analysis (MSA) faces two critical challenges: the lack of interpretability in the decision logic of multimodal fusion and modality imbalance caused by disparities in inter-modal information density. To address these issues, we propose KAN-MCP, a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. First, KAN leverages its univariate function decomposition to achieve transparent analysis of cross-modal interactions. This structural design allows direct inspection of feature transformations without relying on external interpretation tools, thereby ensuring both high expressiveness and interpretability. Second, the proposed MCPareto enhances robustness by addressing modality imbalance and noise interference. Specifically, we introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality. This approach provides KAN with discriminative low-dimensional inputs to reduce the modeling complexity of KAN while preserving critical sentiment-related information. Furthermore, MCPareto dynamically balances gradient contributions across modalities using the purified features output by DRD-MIB, ensuring lossless transmission of auxiliary signals and effectively alleviating modality imbalance. This synergy of interpretability and robustness not only achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2 but also offers an intuitive visualization interface through KAN's interpretable architecture.
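As a rough illustration of why a KAN-style layer is inspectable, the toy layer below builds every output as a sum of learnable univariate functions of each input; small MLPs stand in for the spline parameterization, and this is not the paper's DRD-MIB pipeline or its actual KAN implementation.

```python
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    """Each output dimension is a sum of learnable univariate functions,
    one per input dimension (tiny MLPs stand in for spline bases)."""
    def __init__(self, in_dim, out_dim, hidden=8):
        super().__init__()
        self.funcs = nn.ModuleList([
            nn.ModuleList([
                nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, 1))
                for _ in range(in_dim)
            ])
            for _ in range(out_dim)
        ])

    def forward(self, x):                      # x: (batch, in_dim)
        outs = []
        for row in self.funcs:                 # one row per output dimension
            contribs = [f(x[:, i:i + 1]) for i, f in enumerate(row)]
            outs.append(torch.stack(contribs, dim=0).sum(dim=0))
        return torch.cat(outs, dim=-1)         # (batch, out_dim)

layer = ToyKANLayer(in_dim=4, out_dim=2)
y = layer(torch.randn(16, 4))
# Each univariate function can be plotted on its own, which is what makes
# the learned feature transformations directly inspectable.
```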
https://arxiv.org/abs/2504.12151
Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract fine-grained information from image-text pairs to identify aspect terms and determine their sentiment polarity. However, existing approaches often fall short in simultaneously addressing three core challenges: Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE). To overcome these limitations, we propose DASCO (Dependency Structure Augmented Scoping Framework), a fine-grained, scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees. First, we design a multi-task pretraining strategy for MABSA on our base model, combining aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition. This improves the model's perception of aspect terms and sentiment cues while achieving effective image-text alignment, addressing key challenges such as SCP and MIM. Furthermore, we incorporate dependency trees as a syntactic branch alongside the semantic branch, guiding the model to selectively attend to critical contextual elements within a target-specific scope while effectively filtering out irrelevant noise, thereby addressing the SNE problem. Extensive experiments on two benchmark datasets across three subtasks demonstrate that DASCO achieves state-of-the-art performance in MABSA, with notable gains in JMASA (+3.1% F1 and +5.4% precision on Twitter2015).
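A hedged sketch of the underlying idea of a dependency-based, target-specific scope (not the DASCO model itself), using spaCy's parse; the example sentence and aspect term are invented, and the scope rule here is deliberately crude.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("The battery life is great, but the camera quality disappoints me.")

aspect = "camera"
for token in doc:
    if token.text == aspect:
        # The subtree of the aspect's syntactic head gives a rough
        # target-specific scope; tokens outside it are candidate noise.
        scope = {t.i for t in token.head.subtree}
        print([t.text for t in doc if t.i in scope])
```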
https://arxiv.org/abs/2504.11331
While multimodal fusion has been extensively studied in Multimodal Sentiment Analysis (MSA), the role of fusion depth and multimodal capacity allocation remains underexplored. In this work, we position fusion depth, scalability, and dedicated multimodal capacity as primary factors for effective fusion. We introduce DeepMLF, a novel multimodal language model (LM) with learnable tokens tailored toward deep fusion. DeepMLF leverages an audiovisual encoder and a pretrained decoder LM augmented with multimodal information across its layers. We append learnable tokens to the LM that: 1) capture modality interactions in a controlled fashion and 2) preserve independent information flow for each modality. These fusion tokens gather linguistic information via causal self-attention in LM Blocks and integrate with audiovisual information through cross-attention MM Blocks. Serving as dedicated multimodal capacity, this design enables progressive fusion across multiple layers, providing depth in the fusion process. Our training recipe combines modality-specific losses and language modelling loss, with the decoder LM tasked to predict ground truth polarity. Across three MSA benchmarks with varying dataset characteristics, DeepMLF achieves state-of-the-art performance. Our results confirm that deeper fusion leads to better performance, with optimal fusion depths (5-7) exceeding those of existing approaches. Additionally, our analysis on the number of fusion tokens reveals that small token sets (~20) achieve optimal performance. We examine the importance of representation learning order (fusion curriculum) through audiovisual encoder initialization experiments. Our ablation studies demonstrate the superiority of the proposed fusion design and gating while providing a holistic examination of DeepMLF's scalability to LLMs, and the impact of each training objective and embedding regularization.
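Below is a loose, simplified PyTorch sketch of what dedicated fusion tokens with self- and cross-attention could look like. DeepMLF actually appends the tokens to the LM sequence and uses causal self-attention inside LM Blocks plus gating, so treat this only as an orientation aid; dimensions and token counts are placeholders.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Learnable fusion tokens gather text information and then integrate
    audiovisual information via cross-attention (a simplified sketch)."""
    def __init__(self, dim=256, n_tokens=20, n_heads=4):
        super().__init__()
        self.fusion_tokens = nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)
        self.text_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.av_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text_states, av_states):
        B = text_states.size(0)
        tok = self.fusion_tokens.expand(B, -1, -1)
        # 1) gather linguistic information from the LM hidden states
        tok = tok + self.text_attn(tok, text_states, text_states)[0]
        # 2) integrate audiovisual information through cross-attention
        tok = tok + self.av_attn(tok, av_states, av_states)[0]
        return tok   # would be passed, with the text states, into the next layer

block = FusionBlock()
out = block(torch.randn(2, 32, 256), torch.randn(2, 50, 256))
```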
https://arxiv.org/abs/2504.11082
Estimating the Kullback-Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to the use of sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance, and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao-Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially in practice. Additionally, we derive an analogous Rao-Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.
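To make the two estimators concrete, here is a toy sketch in which random categorical distributions stand in for the two language models' next-token distributions on shared sampled prefixes: the Monte Carlo estimate uses only the sampled tokens, while the Rao-Blackwellized estimate replaces each per-token term with the exact per-step KL over the vocabulary. Shapes and data are invented; this is not the paper's code.

```python
import torch
import torch.nn.functional as F

def mc_kl(logp_steps, logq_steps, samples):
    """Naive Monte Carlo estimate: sum_t [log p(x_t|x_<t) - log q(x_t|x_<t)]
    per sampled sequence, then average. Unbiased but high-variance, and
    individual estimates can even come out negative."""
    diff = (logp_steps.gather(-1, samples.unsqueeze(-1))
            - logq_steps.gather(-1, samples.unsqueeze(-1))).squeeze(-1)
    return diff.sum(-1).mean()

def rb_kl(logp_steps, logq_steps):
    """Rao-Blackwellized estimate: at every sampled prefix, use the exact
    per-step KL over the vocabulary instead of the single sampled token."""
    p = logp_steps.exp()
    per_step_kl = (p * (logp_steps - logq_steps)).sum(-1)   # >= 0 at each step
    return per_step_kl.sum(-1).mean()

# Toy setup: 4 sampled sequences of length 6 over a 10-token vocabulary.
# logp_steps/logq_steps hold each model's log-probabilities at every step,
# conditioned on the same sampled prefixes (random placeholders here).
n, T, V = 4, 6, 10
logp_steps = F.log_softmax(torch.randn(n, T, V), dim=-1)
logq_steps = F.log_softmax(torch.randn(n, T, V), dim=-1)
samples = torch.distributions.Categorical(logits=logp_steps).sample()

print(mc_kl(logp_steps, logq_steps, samples), rb_kl(logp_steps, logq_steps))
```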
https://arxiv.org/abs/2504.10637
Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets by leveraging the unique strengths of these models. BERT captures bidirectional context, GPT-2 enhances generative capabilities, RoBERTa optimizes contextual understanding with larger corpora and dynamic masking, XLNet models dependencies through permutation-based learning, and DistilBERT offers efficiency with reduced computational overhead while maintaining high accuracy. We demonstrate text cleaning, tokenization, and feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), ensuring high-quality input data for the models. The hybrid approach was evaluated on the benchmark datasets Sentiment140 and IMDB, achieving superior accuracy rates of 94% and 95%, respectively, outperforming standalone models. The results validate the effectiveness of combining multiple transformer models in ensemble-like setups to address the limitations of individual architectures. This research highlights applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking, offering a pathway for future advancements in hybrid NLP frameworks.
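A minimal sketch of the BoW and TF-IDF feature-extraction step with scikit-learn; the corpus is invented, and the paper's full pipeline also includes cleaning and tokenization before this stage.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the movie was surprisingly good",
    "terrible plot and wooden acting",
    "good acting but the plot dragged",
]

bow = CountVectorizer().fit_transform(corpus)      # Bag of Words counts
tfidf = TfidfVectorizer().fit_transform(corpus)    # TF-IDF weighted features
print(bow.shape, tfidf.shape)                      # (3, vocabulary_size)
```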
https://arxiv.org/abs/2504.09896
Large language models (LLMs) are transforming global decision-making and societal systems by processing diverse data at unprecedented scales. However, their potential to homogenize human values poses critical risks, similar to biodiversity loss undermining ecological resilience. Rooted in the ancient Greek concept of ethos, meaning both individual character and the shared moral fabric of communities, EthosGPT draws on a tradition that spans from Aristotle's virtue ethics to Adam Smith's moral sentiments as the ethical foundation of economic cooperation. These traditions underscore the vital role of value diversity in fostering social trust, institutional legitimacy, and long-term prosperity. EthosGPT addresses the challenge of value homogenization by introducing an open-source framework for mapping and evaluating LLMs within a global scale of human values. Using international survey data on cultural indices, prompt-based assessments, and comparative statistical analyses, EthosGPT reveals both the adaptability and biases of LLMs across regions and cultures. It offers actionable insights for developing inclusive LLMs, such as diversifying training data and preserving endangered cultural heritage to ensure representation in AI systems. These contributions align with the United Nations Sustainable Development Goals (SDGs), especially SDG 10 (Reduced Inequalities), SDG 11.4 (Cultural Heritage Preservation), and SDG 16 (Peace, Justice and Strong Institutions). Through interdisciplinary collaboration, EthosGPT promotes AI systems that are both technically robust and ethically inclusive, advancing value plurality as a cornerstone for sustainable and equitable futures.
https://arxiv.org/abs/2504.09861
As the popularity and reach of social networks continue to surge, a vast reservoir of opinions and sentiments across various subjects inundates these platforms. Among these, the X social network (formerly Twitter) stands as a juggernaut, boasting approximately 420 million active users. Extracting users' emotional and mental states from their expressed opinions on social media has become a common pursuit. While past methodologies predominantly focused on the textual content of messages to analyze user sentiment, the interactive nature of these platforms suggests a deeper complexity. This study employs hybrid methodologies, integrating textual analysis, profile examination, follower analysis, and emotion dissemination patterns. Initially, user interactions are leveraged to refine emotion classification within messages, encompassing exchanges where users respond to each other. We introduce the concept of a communication tree and extract a model that maps these interactions. Subsequently, users' bios and interests from this tree are juxtaposed with message text to enrich the analysis. Finally, influential figures are identified among users' followers in the communication tree and categorized into different topics to gauge interests. The study highlights that traditional sentiment analysis methodologies, focusing solely on textual content, are inadequate for discerning sentiment towards significant events, notably the presidential election. Comparative analysis with conventional methods reveals a substantial improvement in accuracy with the incorporation of emotion distribution patterns and user profiles. The proposed approach yields a 12% increase in accuracy with emotion distribution patterns and a 15% increase when considering user profiles, underscoring its efficacy in capturing nuanced sentiment dynamics.
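A small sketch of the communication-tree idea: building a reply tree from (tweet, parent) pairs and walking it, which is where emotion labels could be aggregated along reply paths. The IDs are invented and this is not the study's implementation.

```python
from collections import defaultdict

# Hypothetical (tweet_id, in_reply_to_id) pairs; None marks a root tweet.
replies = [(1, None), (2, 1), (3, 1), (4, 2), (5, None), (6, 5)]

children = defaultdict(list)
roots = []
for tweet_id, parent in replies:
    if parent is None:
        roots.append(tweet_id)
    else:
        children[parent].append(tweet_id)

def walk(node, depth=0):
    """Traverse a communication tree; per-message emotion labels can be
    aggregated along these paths to trace how sentiment spreads."""
    print("  " * depth + str(node))
    for child in children[node]:
        walk(child, depth + 1)

for root in roots:
    walk(root)
```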
https://arxiv.org/abs/2504.10521
Recent advances in language modeling have led to growing interest in applying Natural Language Processing (NLP) techniques to financial problems, enabling new approaches to analysis and decision-making. To systematically examine this trend, we review 374 NLP research papers published between 2017 and 2024 across 38 conferences and workshops, with a focused analysis of 221 papers that directly address finance-related tasks. We evaluate these papers across 11 qualitative and quantitative dimensions, identifying key trends such as the increasing use of general-purpose language models, steady progress in sentiment analysis and information extraction, and emerging efforts around explainability and privacy-preserving methods. We also discuss the use of evaluation metrics, highlighting the importance of domain-specific ones to complement standard machine learning metrics. Our findings emphasize the need for more accessible, adaptive datasets and highlight the significance of incorporating financial crisis periods to strengthen model robustness under real-world conditions. This survey provides a structured overview of NLP research applied to finance and offers practical insights for researchers and practitioners working at this intersection.
https://arxiv.org/abs/2504.07274
In this paper, we introduce the Dialogue Evaluation shared task on extraction of structured opinions from Russian news texts. The task of the contest is to extract opinion tuples for a given sentence; each tuple is composed of a sentiment holder, its target, an expression, and the sentiment from the holder towards the target. In total, the task received more than 100 submissions. The participants experimented mainly with large language models in zero-shot, few-shot, and fine-tuning formats. The best result on the test set was obtained by fine-tuning a large language model. We also compared 30 prompts and 11 open-source language models with 3-32 billion parameters in the 1-shot and 10-shot settings and identified the best models and prompts.
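For orientation, a hypothetical 1-shot prompt for opinion-tuple extraction is sketched below; the wording and example are invented and do not reproduce the shared task's actual prompts or data.

```python
# A hypothetical 1-shot prompt for extracting (holder, target, expression,
# sentiment) tuples; not the prompts evaluated in the shared task.
EXAMPLE_SENTENCE = "Analysts praised the central bank's decision."
EXAMPLE_TUPLE = '("analysts", "the central bank\'s decision", "praised", "positive")'

def build_prompt(sentence: str) -> str:
    return (
        "Extract all opinion tuples (holder, target, expression, sentiment) "
        "from the sentence. Answer with one tuple per line.\n\n"
        f"Sentence: {EXAMPLE_SENTENCE}\n"
        f"Tuples: {EXAMPLE_TUPLE}\n\n"
        f"Sentence: {sentence}\n"
        "Tuples:"
    )

print(build_prompt("The governor criticized the new budget."))
```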
https://arxiv.org/abs/2504.06947
Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' system. There is, however, a line of research -- 'interpretability' -- that aims to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call the 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents certain limitations -- e.g., studying only English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques to explore models' internal representations.
https://arxiv.org/abs/2504.08001
Watermarking has emerged as a promising technique for detecting texts generated by LLMs. Current research has primarily focused on three design criteria: high quality of the watermarked text, high detectability, and robustness against removal attacks. However, security against spoofing attacks remains relatively understudied. For example, a piggyback attack can maliciously alter the meaning of watermarked text, transforming it into hate speech, while preserving the original watermark, thereby damaging the reputation of the LLM provider. We identify two core challenges that make defending against spoofing difficult: (1) the need for watermarks to be both sensitive to semantic-distorting changes and insensitive to semantic-preserving edits, and (2) the contradiction between the need to detect global semantic shifts and the local, auto-regressive nature of most watermarking schemes. To address these challenges, we propose a semantic-aware watermarking algorithm that post-hoc embeds watermarks into a given target text while preserving its original meaning. Our method introduces a semantic mapping model, which guides the generation of a green-red token list and is contrastively trained to be sensitive to semantic-distorting changes and insensitive to semantic-preserving changes. Experiments on two standard benchmarks demonstrate strong robustness against removal attacks and security against spoofing attacks, including sentiment reversal and toxic content insertion, while maintaining high watermark detectability. Our approach offers a significant step toward more secure and semantically aware watermarking for LLMs. Our code is available at this https URL.
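For context, the snippet below shows the generic green/red-list detection statistic (a z-score on the green-token fraction) that this family of watermarks relies on; the paper's contribution is the semantic mapping model that learns the list contrastively, which is not reproduced here. Tokens and the list are toy examples.

```python
import math

def green_fraction_zscore(tokens, green_list, gamma=0.5):
    """Generic green/red-list watermark detection: count how many tokens
    fall in the 'green' list and compare against the fraction gamma expected
    by chance. A large z-score suggests the text carries the watermark."""
    hits = sum(1 for t in tokens if t in green_list)
    n = len(tokens)
    z = (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
    return hits, z

tokens = "the film was an absolute delight from start to finish".split()
green_list = {"film", "absolute", "delight", "from", "finish"}   # toy list
print(green_fraction_zscore(tokens, green_list))
```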
https://arxiv.org/abs/2504.06575
Collaborative assistants, or chatbots, are data-driven decision support systems that enable natural interaction for task completion. While they can meet critical needs in modern society, concerns about their reliability and trustworthiness persist. In particular, Large Language Model (LLM)-based chatbots like ChatGPT, Gemini, and DeepSeek are becoming more accessible. However, such chatbots have limitations, including their inability to explain response generation, the risk of generating problematic content, the lack of standardized testing for reliability, and the need for deep AI expertise and extended development times. These issues make chatbots unsuitable for trust-sensitive applications like elections or healthcare. To address these concerns, we introduce SafeChat, a general architecture for building safe and trustworthy chatbots, with a focus on information retrieval use cases. Key features of SafeChat include: (a) safety, with a domain-agnostic design where responses are grounded and traceable to approved sources (provenance), and 'do-not-respond' strategies to prevent harmful answers; (b) usability, with automatic extractive summarization of long responses, traceable to their sources, and automated trust assessments to communicate expected chatbot behavior, such as sentiment; and (c) fast, scalable development, including a CSV-driven workflow, automated testing, and integration with various devices. We implemented SafeChat in an executable framework using the open-source chatbot platform Rasa. A case study demonstrates its application in building ElectionBot-SC, a chatbot designed to safely disseminate official election information. SafeChat is being used in many domains, validating its potential, and is available at: this https URL.
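A toy sketch of a CSV-driven retrieval flow with a 'do-not-respond' fallback and source provenance; the schema, threshold, and matching rule are invented and are not SafeChat's actual implementation (which is built on Rasa).

```python
import csv, io

# Hypothetical CSV of approved question/answer pairs with provenance.
APPROVED = """question,answer,source
When is election day?,Election day is the first Tuesday of November.,state-website
Where do I register to vote?,You can register online or at the county office.,state-website
"""

def answer(query, threshold=0.4):
    rows = list(csv.DictReader(io.StringIO(APPROVED)))
    q_tokens = set(query.lower().split())
    best, best_score = None, 0.0
    for row in rows:
        r_tokens = set(row["question"].lower().split())
        score = len(q_tokens & r_tokens) / len(q_tokens | r_tokens)  # Jaccard overlap
        if score > best_score:
            best, best_score = row, score
    if best_score < threshold:
        return "I can't answer that. Please consult official sources."   # do-not-respond
    return f'{best["answer"]} (source: {best["source"]})'                # provenance

print(answer("when is election day"))
print(answer("who should I vote for"))
```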
https://arxiv.org/abs/2504.07995
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, including sentiment analysis. However, data quality--particularly when sourced from social media--can significantly impact their accuracy. This research explores how textual nuances, including emojis and sarcasm, affect sentiment analysis, with a particular focus on improving data quality through text paraphrasing techniques. To address the lack of labeled sarcasm data, the authors created a human-labeled dataset of 5,929 tweets that enabled the assessment of LLMs in various sarcasm contexts. The results show that when topic-specific datasets, such as those related to nuclear power, are used to finetune LLMs, these models cannot accurately comprehend sentiment in the presence of sarcasm because the text is less diverse, requiring external interventions such as sarcasm removal to boost model accuracy. Sarcasm removal led to up to a 21% improvement in sentiment accuracy, as LLMs trained on nuclear power-related content struggled with sarcastic tweets, achieving only 30% accuracy. In contrast, LLMs trained on general tweet datasets, covering a broader range of topics, showed considerable improvements in predicting sentiment for sarcastic tweets (60% accuracy), indicating that incorporating general text data can enhance sarcasm detection. The study also utilized adversarial text augmentation, showing that creating synthetic text variants through minor changes significantly increased model robustness and accuracy for sarcastic tweets (approximately 85%). Additionally, text paraphrasing of tweets with fragmented language transformed around 40% of the tweets with low-confidence labels into high-confidence ones, improving the LLMs' sentiment analysis accuracy by 6%.
https://arxiv.org/abs/2504.05603
Large Language Models (LLMs) have significantly advanced sentiment analysis, yet their inherent uncertainty and variability pose critical challenges to achieving reliable and consistent outcomes. This paper systematically explores the Model Variability Problem (MVP) in LLM-based sentiment analysis, characterized by inconsistent sentiment classification, polarization, and uncertainty arising from stochastic inference mechanisms, prompt sensitivity, and biases in training data. We analyze the core causes of MVP, presenting illustrative examples and a case study to highlight its impact. In addition, we investigate key challenges and mitigation strategies, paying particular attention to the role of temperature as a driver of output randomness and emphasizing the crucial role of explainability in improving transparency and user trust. By providing a structured perspective on stability, reproducibility, and trustworthiness, this study helps develop more reliable, explainable, and robust sentiment analysis models, facilitating their deployment in high-stakes domains such as finance, healthcare, and policymaking, among others.
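To see why temperature drives output variability, the toy experiment below repeatedly samples a label from the same fixed logits at different temperatures; the logits are invented and do not come from an actual LLM.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.6, -1.0])        # hypothetical scores for pos/neg/neu

def sample_labels(temperature, n=1000):
    p = np.exp(logits / temperature)
    p /= p.sum()
    return rng.choice(["positive", "negative", "neutral"], size=n, p=p)

for T in (0.2, 0.7, 1.5):
    labels, counts = np.unique(sample_labels(T), return_counts=True)
    print(T, dict(zip(labels, counts)))
# Lower temperatures concentrate mass on the top label; higher temperatures
# spread it out, so repeated runs disagree more often.
```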
https://arxiv.org/abs/2504.04462
As the prevalence of mental health crises increases on social media platforms, identifying and preventing potential harm has become an urgent challenge. This study introduces a large language model (LLM)-based text transfer recognition method for social network crisis intervention, enhanced with domain-specific mental health knowledge. We propose a multi-level framework that incorporates transfer learning using BERT, and integrates mental health knowledge, sentiment analysis, and behavior prediction techniques. The framework includes a crisis annotation tool trained on social media datasets from real-world events, enabling the model to detect nuanced emotional cues and identify psychological crises. Experimental results show that the proposed method outperforms traditional models in crisis detection accuracy and exhibits greater sensitivity to subtle emotional and contextual variations.
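A minimal transfer-learning sketch with Hugging Face Transformers, fine-tuning a BERT classifier on two invented examples; the actual framework also integrates the crisis annotation tool, domain knowledge, sentiment analysis, and behavior prediction described above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical two-class setup (crisis vs. no-crisis); real training would use
# an annotated social media dataset and a full training loop.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["I can't see a way out anymore", "Had a great day at the beach"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)   # transfer learning on top of pretrained BERT
outputs.loss.backward()                   # one gradient step of the fine-tuning loop
print(outputs.loss.item(), outputs.logits.shape)
```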
https://arxiv.org/abs/2504.07983
Dynamic hedging strategies are essential for effective risk management in derivatives markets, where volatility and market sentiment can greatly impact performance. This paper introduces a novel framework that leverages large language models (LLMs) for sentiment analysis and news analytics to inform hedging decisions. By analyzing textual data from diverse sources like news articles, social media, and financial reports, our approach captures critical sentiment indicators that reflect current market conditions. The framework allows for real-time adjustments to hedging strategies, adapting positions based on continuous sentiment signals. Backtesting results on historical derivatives data reveal that our dynamic hedging strategies achieve superior risk-adjusted returns compared to conventional static approaches. The incorporation of LLM-driven sentiment analysis into hedging practices presents a significant advancement in decision-making processes within derivatives trading. This research showcases how sentiment-informed dynamic hedging can enhance portfolio management and effectively mitigate associated risks.
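A crude sketch of the mechanics of sentiment-informed hedging: a hedge ratio that tightens when an (assumed) LLM-derived sentiment score turns negative. The data are random, so the printed comparison is only structural and says nothing about the paper's backtest results.

```python
import numpy as np

# Hypothetical daily series: underlying returns and an LLM-derived sentiment
# score in [-1, 1] aggregated from news and social media.
rng = np.random.default_rng(1)
returns = rng.normal(0, 0.01, 250)
sentiment = np.clip(rng.normal(0, 0.4, 250), -1, 1)

base_hedge = 0.5
# More negative sentiment -> larger hedge ratio (more protection), capped at 1.
hedge_ratio = np.clip(base_hedge - 0.5 * sentiment, 0.0, 1.0)

hedged_returns = (1 - hedge_ratio) * returns      # crude proxy for the hedged book
static_returns = (1 - base_hedge) * returns

def sharpe(x):
    return x.mean() / x.std() * np.sqrt(252)

print(f"dynamic: {sharpe(hedged_returns):.2f}  static: {sharpe(static_returns):.2f}")
```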
https://arxiv.org/abs/2504.04295
Large Language Models (LLMs) have made significant strides in Natural Language Processing but remain vulnerable to fairness-related issues, often reflecting biases inherent in their training data. These biases pose risks, particularly when LLMs are deployed in sensitive areas such as healthcare, finance, and law. This paper introduces a metamorphic testing (MT) approach to systematically identify fairness bugs in LLMs. We define and apply a set of fairness-oriented metamorphic relations (MRs) to assess the LLaMA and GPT models, state-of-the-art LLMs, across diverse demographic inputs. Our methodology includes generating source and follow-up test cases for each MR and analyzing model responses for fairness violations. The results demonstrate the effectiveness of MT in exposing bias patterns, especially in relation to tone and sentiment, and highlight specific intersections of sensitive attributes that frequently reveal fairness faults. This research improves fairness testing in LLMs, providing a structured approach to detect and mitigate biases and improve model robustness in fairness-sensitive applications.
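A minimal metamorphic relation of the kind described here: the same template with a swapped sensitive attribute should yield the same prediction. The template, groups, and toy classifier are invented; `classify` stands in for whatever LLM call is under test.

```python
# A source test case and its follow-up cases differ only in the sensitive
# attribute; any disagreement in the output is flagged as a fairness violation.
TEMPLATE = "The {group} applicant explained the gap in their resume."

def fairness_check(classify, groups=("male", "female", "nonbinary")):
    outputs = {g: classify(TEMPLATE.format(group=g)) for g in groups}
    violations = {(a, b) for a in groups for b in groups
                  if a < b and outputs[a] != outputs[b]}
    return outputs, violations

# Toy model that leaks the group into its decision -> the MR flags it.
toy = lambda text: "negative" if "female" in text else "neutral"
print(fairness_check(toy))
```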
https://arxiv.org/abs/2504.07982
Chatbots powered by artificial intelligence (AI) have rapidly become a significant part of everyday life, with over a quarter of American adults using them multiple times per week. While these tools offer potential benefits and risks, a fundamental question remains largely unexplored: How do conversations with AI influence subjective well-being? To investigate this, we conducted a study where participants either engaged in conversations with an AI chatbot (N = 334) or wrote journal entries (N = 193) on the same randomly assigned topics and reported their momentary happiness afterward. We found that happiness after AI chatbot conversations was higher than after journaling, particularly when discussing negative topics such as depression or guilt. Leveraging large language models for sentiment analysis, we found that the AI chatbot mirrored participants' sentiment while maintaining a consistent positivity bias. When discussing negative topics, participants gradually aligned their sentiment with the AI's positivity, leading to an overall increase in happiness. We hypothesized that the history of participants' sentiment prediction errors, the difference between the expected and actual emotional tone of the AI chatbot's responses, might explain this happiness effect. Using computational modeling, we find that the history of these sentiment prediction errors over the course of a conversation predicts greater post-conversation happiness, demonstrating a central role of emotional expectations during dialogue. Our findings underscore the effect that AI interactions can have on human well-being.
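As a toy illustration of the modeling idea (entirely synthetic, not the study's data or model): regress reported happiness on a participant-level summary of sentiment prediction errors and inspect the coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: per-participant average sentiment prediction error
# (actual minus expected tone of the chatbot's replies) and reported happiness.
rng = np.random.default_rng(2)
n = 200
prediction_errors = rng.normal(0.3, 0.5, size=(n, 1))   # positive = nicer than expected
happiness = 5 + 1.2 * prediction_errors[:, 0] + rng.normal(0, 1, n)

model = LinearRegression().fit(prediction_errors, happiness)
print(model.coef_[0], model.score(prediction_errors, happiness))
# The data are generated with a built-in positive relationship, so this only
# shows the shape of the analysis, not evidence for the reported effect.
```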
https://arxiv.org/abs/2504.02091