Propaganda is a form of persuasion that has been used throughout history with the goal of influencing people's opinions through rhetorical and psychological techniques. Although Arabic ranks as the fourth most-used language on the internet, resources for propaganda detection in languages other than English, especially Arabic, remain extremely limited. To address this gap, the first Arabic dataset for Multi-label Propaganda, Sentiment, and Emotion (MultiProSE) has been introduced. MultiProSE is an open-source extension of the existing Arabic propaganda dataset ArPro, with the addition of sentiment and emotion annotations for each text. The dataset comprises 8,000 annotated news articles, making it the largest propaganda dataset to date. For each task, several baselines have been developed using large language models (LLMs), such as GPT-4o-mini, and pre-trained language models (PLMs), including three BERT-based models. The dataset, annotation guidelines, and source code are all publicly released to facilitate future research and development in Arabic language models and to contribute to a deeper understanding of how various opinion dimensions interact in news media.
https://arxiv.org/abs/2502.08319
This research explores the opportunities of Generative AI (GenAI) in the realm of higher education through the design and development of a multimodal chatbot for an undergraduate course. Leveraging the ChatGPT API for nuanced text-based interactions and Google Bard for advanced image analysis and diagram-to-code conversions, we showcase the potential of GenAI in addressing a broad spectrum of educational queries. Additionally, the chatbot presents a file-based analyser designed for educators, offering deep insights into student feedback via sentiment and emotion analysis, and summarising course evaluations with key metrics. These combinations highlight the crucial role of multimodal conversational AI in enhancing teaching and learning processes, promising significant advancements in educational adaptability, engagement, and feedback analysis. By demonstrating a practical web application, this research underlines the imperative for integrating GenAI technologies to foster more dynamic and responsive educational environments, ultimately contributing to improved educational outcomes and pedagogical strategies.
https://arxiv.org/abs/2502.07401
Deep reinforcement learning (DRL) has been applied in financial portfolio management to improve returns under changing market conditions. However, unlike most fields where DRL is widely used, the stock market is more volatile and dynamic, as it is affected by factors such as global events and investor sentiment. Therefore, it remains a challenge to construct a DRL-based portfolio management framework with strong return capability, stable training, and generalization ability. This study introduces a new framework utilizing the Memory Instance Gated Transformer (MIGT) for effective portfolio management. By incorporating a novel Gated Instance Attention module, which combines a transformer variant, instance normalization, and a Lite Gate Unit, our approach aims to maximize investment returns while keeping the learning process stable and reducing the impact of outliers. We test our framework on the 30 stocks of the Dow Jones Industrial Average and evaluate its performance against fifteen other strategies using key financial metrics such as cumulative return and risk-return ratios (the Sharpe, Sortino, and Omega ratios). The results highlight MIGT's advantage, showing at least a 9.75% improvement in cumulative return and a minimum 2.36% increase in risk-return ratios over competing strategies, marking a significant advancement in DRL for portfolio management.
https://arxiv.org/abs/2502.07280
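The risk-return ratios used to evaluate MIGT above have compact definitions; a minimal sketch (simplified: per-period sample statistics, no annualization, hypothetical function names) might look like:

```python
import math

def sharpe_ratio(returns, risk_free=0.0):
    # Mean excess return over the sample standard deviation of excess returns.
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var)

def sortino_ratio(returns, risk_free=0.0):
    # Like Sharpe, but penalizes only downside deviation below the target.
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in excess) / len(excess))
    return mean / downside

def omega_ratio(returns, threshold=0.0):
    # Ratio of total gains above the threshold to total losses below it.
    gains = sum(max(r - threshold, 0.0) for r in returns)
    losses = sum(max(threshold - r, 0.0) for r in returns)
    return gains / losses
```

The paper's reported minimum 2.36% increase in risk-return ratios refers to metrics of this general form.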
Zero-shot LLMs are now also used for textual classification tasks, e.g., sentiment/emotion detection of a given input such as a sentence or article. However, their performance can be suboptimal in such data annotation tasks. We introduce a novel technique, Perceived Confidence Scoring (PCS), that evaluates an LLM's confidence in its classification of an input by leveraging Metamorphic Relations (MRs). The MRs generate semantically equivalent yet textually mutated versions of the input. Following the principles of Metamorphic Testing (MT), the mutated versions are expected to receive annotation labels similar to the input's. By analyzing the consistency of LLM responses across these variations, PCS computes a confidence score based on the frequency of predicted labels. PCS can be used in both single-LLM and multi-LLM settings (e.g., majority voting). We also introduce an algorithm, Perceived Differential Evolution (PDE), that determines the optimal weights assigned to the MRs and the LLMs for a classification task. Empirical evaluation shows PCS significantly improves zero-shot accuracy for Llama-3-8B-Instruct (4.96%) and Mistral-7B-Instruct-v0.3 (10.52%), with Gemma-2-9b-it showing a 9.39% gain. When combining all three models, PCS significantly outperforms majority voting by 7.75%.
https://arxiv.org/abs/2502.07186
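The core PCS idea above (confidence as label agreement across metamorphic mutations) can be sketched in a few lines; `toy_classify` and the mutation functions below are illustrative stand-ins, not the paper's MRs or LLMs:

```python
from collections import Counter

def perceived_confidence(classify, text, mutate_fns):
    # Label the input and its meaning-preserving mutations, then score
    # confidence as the frequency of the most common predicted label.
    variants = [text] + [mutate(text) for mutate in mutate_fns]
    labels = [classify(v) for v in variants]
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical stand-in for an LLM classifier.
def toy_classify(text):
    return "positive" if "good" in text.lower() else "negative"

mutations = [str.upper, lambda t: t + " indeed", lambda t: "Well, " + t]
label, score = perceived_confidence(toy_classify, "This movie is good", mutations)
```

Weighting the MRs and LLMs, as PDE does, would replace the raw frequency with a weighted vote.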
To understand the complexity of sequence classification tasks, Hahn et al. (2021) proposed sensitivity as the number of disjoint subsets of the input sequence that can each be individually changed to change the output. Though effective, calculating sensitivity at scale using this framework is costly because of its exponential time complexity. Therefore, we introduce a Sensitivity-based Multi-Armed Bandit framework (SMAB), which provides a scalable approach for calculating word-level local (sentence-level) and global (aggregated) sensitivities with respect to an underlying text classifier for any dataset. We establish the effectiveness of our approach through various applications. We perform a case study on a CHECKLIST-generated sentiment analysis dataset, where we show that our algorithm indeed captures intuitively high- and low-sensitivity words. Through experiments on multiple tasks and languages, we show that sensitivity can serve as a proxy for accuracy in the absence of gold data. Lastly, we show that guiding perturbation prompts using sensitivity values in adversarial example generation improves attack success rate by 15.58%, whereas using sensitivity as an additional reward in adversarial paraphrase generation gives a 12.00% improvement over SOTA approaches. Warning: Contains potentially offensive content.
https://arxiv.org/abs/2502.07101
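As a rough illustration of word-level sensitivity, the exhaustive per-position probe below substitutes one word at a time and measures how often the label flips; SMAB's contribution is to replace this costly exhaustive loop with bandit-based sampling. The toy classifier and substitution list are hypothetical:

```python
def word_sensitivities(classify, sentence, replacements):
    # Local (sentence-level) sensitivity per word: the fraction of candidate
    # substitutions at that position that flip the classifier's label.
    words = sentence.split()
    base = classify(sentence)
    scores = {}
    for i, w in enumerate(words):
        flips = 0
        for r in replacements:
            perturbed = " ".join(words[:i] + [r] + words[i + 1:])
            flips += classify(perturbed) != base
        scores[w] = flips / len(replacements)
    return scores

# Hypothetical stand-in for the underlying text classifier.
def toy_classify(text):
    return "positive" if "great" in text else "negative"

scores = word_sensitivities(toy_classify, "the food was great", ["bad", "okay"])
```

Here the sentiment-bearing word gets sensitivity 1.0 while filler words get 0.0, matching the intuition the case study describes.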
A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties. We argue that international online review platforms, such as this http URL, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, like reviews for the same hotel with the same rating using the same language (e.g., Mandarin Chinese) but different language varieties (e.g., Taiwan Mandarin, Mainland Mandarin). To prove this concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs in a sentiment analysis task. Our results show that LLMs consistently underperform in Taiwan Mandarin.
https://arxiv.org/abs/2502.07058
Insider threats wield an outsized influence on organizations, disproportionate to their small numbers, because of the internal access insiders have to systems, information, and infrastructure. Signals for such risks may be found in anonymous submissions to public web-based job search site reviews. This research studies the potential for large language models (LLMs) to analyze and detect insider threat sentiment within job site reviews. To address ethical data collection concerns, this research utilizes synthetic data generation with LLMs alongside existing job review datasets. A comparative analysis of sentiment scores generated by LLMs is benchmarked against expert human scoring. Findings reveal that LLMs align with human evaluations in most cases, effectively identifying nuanced indicators of threat sentiment. Performance is lower on human-generated data than on synthetic data, suggesting room for improvement in evaluating real-world data. Text diversity analysis found differences between human-generated and LLM-generated datasets, with synthetic data exhibiting somewhat lower diversity. Overall, the results demonstrate the applicability of LLMs to insider threat detection and offer a scalable solution for insider sentiment testing by overcoming ethical and logistical barriers tied to data acquisition.
https://arxiv.org/abs/2502.07045
Social media has become a crucial open-access platform for individuals to express opinions and share experiences. However, leveraging low-resource language data from Twitter is challenging due to scarce, poor-quality content and major variations in language use, such as slang and code-switching. Identifying tweets in these languages can be difficult as Twitter primarily supports high-resource languages. We analyze Kenyan code-switched data and evaluate four state-of-the-art (SOTA) transformer-based pretrained models for sentiment and emotion classification, using supervised and semi-supervised methods. We detail the methodology behind data collection and annotation, and the challenges encountered during the data curation phase. Our results show that XLM-R outperforms the other models: for sentiment analysis, the supervised XLM-R model achieves the highest accuracy (69.2%) and F1 score (66.1%), followed by semi-supervised XLM-R (67.2% accuracy, 64.1% F1 score). In emotion analysis, supervised DistilBERT leads in accuracy (59.8%) and F1 score (31%), followed by semi-supervised mBERT (59% accuracy, 26.5% F1 score). The AfriBERTa models show the lowest accuracy and F1 scores. All models tend to predict neutral sentiment, with AfriBERTa showing the highest bias and a unique sensitivity to the empathy emotion. this https URL
https://arxiv.org/abs/2502.06180
Depression is one of the leading causes of disability worldwide, posing a severe burden on individuals, healthcare systems, and society at large. Recent advancements in Large Language Models (LLMs) have shown promise in addressing mental health challenges, including the detection of depression through text-based analysis. However, current LLM-based methods often struggle with nuanced symptom identification and lack a transparent, step-by-step reasoning process, making it difficult to accurately classify and explain mental health conditions. To address these challenges, we propose a Chain-of-Thought Prompting approach that enhances both the performance and interpretability of LLM-based depression detection. Our method breaks down the detection process into four stages: (1) sentiment analysis, (2) binary depression classification, (3) identification of underlying causes, and (4) assessment of severity. By guiding the model through these structured reasoning steps, we improve interpretability and reduce the risk of overlooking subtle clinical indicators. We validate our method on the E-DAIC dataset, where we test multiple state-of-the-art large language models. Experimental results indicate that our Chain-of-Thought Prompting technique yields superior performance in both classification accuracy and the granularity of diagnostic insights, compared to baseline approaches.
https://arxiv.org/abs/2502.05879
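The four-stage structure described above can be assembled into a single staged prompt; the wording below is an illustrative guess, not the paper's exact prompt:

```python
STAGES = [
    "Step 1: Describe the overall sentiment expressed in the transcript.",
    "Step 2: Based on that sentiment, classify the speaker as depressed or not depressed.",
    "Step 3: Identify plausible underlying causes supporting your classification.",
    "Step 4: Assess the severity (minimal, mild, moderate, or severe).",
]

def build_cot_prompt(transcript):
    # Chain-of-thought prompt walking the model through the four stages in order,
    # so each later judgment is conditioned on the earlier reasoning.
    steps = "\n".join(STAGES)
    return f"Transcript:\n{transcript}\n\nReason step by step:\n{steps}"
```

Forcing the intermediate stages into the output is what makes the final severity judgment auditable, which is the interpretability benefit the abstract claims.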
Books, while often rich in cultural insights, can also mirror societal biases of their eras - biases that Large Language Models (LLMs) may learn and perpetuate during training. We introduce a novel method to trace and quantify these biases using fine-tuned LLMs. We develop BookPAGE, a corpus comprising 593 fictional books across seven decades (1950-2019), to track bias evolution. By fine-tuning LLMs on books from each decade and using targeted prompts, we examine shifts in biases related to gender, sexual orientation, race, and religion. Our findings indicate that LLMs trained on decade-specific books manifest biases reflective of their times, with both gradual trends and notable shifts. For example, model responses showed a progressive increase in the portrayal of women in leadership roles (from 8% to 22%) from the 1950s to 2010s, with a significant uptick in the 1990s (from 4% to 12%), possibly aligning with third-wave feminism. Same-sex relationship references increased markedly from the 1980s to 2000s (from 0% to 10%), mirroring growing LGBTQ+ visibility. Concerningly, negative portrayals of Islam rose sharply in the 2000s (26% to 38%), likely reflecting post-9/11 sentiments. Importantly, we demonstrate that these biases stem mainly from the books' content and not the models' architecture or initial training. Our study offers a new perspective on societal bias trends by bridging AI, literary studies, and social science research.
https://arxiv.org/abs/2502.05331
Democratic processes increasingly aim to integrate large-scale voting with face-to-face deliberation, addressing the challenge of reconciling individual preferences with collective decision-making. This work introduces new methods that use algorithms and computational tools to bridge online voting with face-to-face deliberation, tested in two real-world scenarios: Kultur Komitee 2024 (KK24) and vTaiwan. These case studies highlight the practical applications and impacts of the proposed methods. We present three key contributions: (1) Radial Clustering for Preference Based Subgroups, which enables both in-depth and broad discussions in deliberative settings by computing homogeneous and heterogeneous group compositions with balanced and adjustable group sizes; (2) Human-in-the-loop MES, a practical method that enhances the Method of Equal Shares (MES) algorithm with real-time digital feedback. This builds algorithmic trust by giving participants full control over how much decision-making is delegated to the voting aggregation algorithm as compared to deliberation; and (3) the ReadTheRoom deliberation method, which uses opinion space mapping to identify agreement and divergence, along with spectrum-based preference visualisation to track opinion shifts during deliberation. This approach enhances transparency by clarifying collective sentiment and fosters collaboration by encouraging participants to engage constructively with differing perspectives. By introducing these actionable frameworks, this research extends in-person deliberation with scalable digital methods that address the complexities of modern decision-making in participatory processes.
https://arxiv.org/abs/2502.05017
Text Style Transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Human evaluation is ideal but costly, as in other natural language processing (NLP) tasks; however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine both existing and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of Large Language Models (LLMs) as tools for TST evaluation. Our findings highlight that certain advanced NLP metrics and experimental hybrid techniques provide better insights than existing TST metrics, delivering more accurate, consistent, and reproducible TST evaluations.
https://arxiv.org/abs/2502.04718
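Meta-evaluation by correlation with human judgments, as used above, is typically a rank correlation between metric scores and human scores; a self-contained Spearman's rho (with average ranks for ties) can be written as:

```python
def rankdata(xs):
    # Average 1-based ranks; tied values share the mean of their rank positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    # Spearman's rho: Pearson correlation computed on the two rank vectors.
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)
```

A metric whose scores for a set of TST outputs correlate strongly with human scores under this measure is considered validated by the meta-evaluation.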
Multilingual language models (MLMs) have advanced significantly due to rapid progress in natural language processing. Models like BLOOM 1.7B, trained on diverse multilingual datasets, aim to bridge linguistic gaps. However, their effectiveness in capturing linguistic knowledge, particularly for low-resource languages, remains an open question. This study critically examines MLMs' capabilities in multilingual understanding, semantic representation, and cross-lingual knowledge transfer. While these models perform well for high-resource languages, they struggle with less-represented ones. Additionally, traditional evaluation methods often overlook their internal syntactic and semantic encoding. This research addresses key limitations through three objectives. First, it assesses semantic similarity by analyzing multilingual word embeddings for consistency using cosine similarity. Second, it examines BLOOM-1.7B and Qwen2 through Named Entity Recognition and sentence similarity tasks to understand their linguistic structures. Third, it explores cross-lingual knowledge transfer by evaluating generalization from high-resource to low-resource languages in sentiment analysis and text classification. By leveraging linguistic probing, performance metrics, and visualizations, this study provides insights into the strengths and limitations of MLMs. The findings aim to enhance multilingual NLP models, ensuring better support for both high- and low-resource languages and thereby promoting inclusivity in language technologies.
https://arxiv.org/abs/2502.04269
Sentiment Analysis, a popular subtask of Natural Language Processing, employs computational methods to extract sentiment, opinions, and other subjective aspects from linguistic data. Given its crucial role in understanding human sentiment, research in sentiment analysis has witnessed significant growth in recent years. However, the majority of approaches target the English language, and Arabic sentiment analysis remains relatively unexplored. This paper presents a comprehensive and contemporary survey of Arabic Sentiment Analysis, identifies the challenges and limitations of the existing literature in this field, and presents avenues for future research. We present a systematic review of Arabic sentiment analysis methods, focusing specifically on research utilizing deep learning. We then situate Arabic Sentiment Analysis within the broader context, highlighting research gaps relative to general sentiment analysis. Finally, we outline the main challenges and promising future directions for research in Arabic sentiment analysis.
https://arxiv.org/abs/2502.03827
Controlled text generation allows for enforcing user-defined constraints on large language model outputs, an increasingly important field as LLMs become more prevalent in everyday life. One common approach uses energy-based decoding, which defines a target distribution through an energy function that combines multiple constraints into a weighted average. However, these methods often struggle to balance fluency with constraint satisfaction, even with extensive tuning of the energy function's coefficients. In this paper, we identify that this suboptimal balance arises from sampling in continuous space rather than the natural discrete space of text tokens. To address this, we propose Discrete Auto-regressive Biasing, a controlled decoding algorithm that leverages gradients while operating entirely in the discrete text domain. Specifically, we introduce a new formulation for controlled text generation by defining a joint distribution over the generated sequence and an auxiliary bias sequence. To efficiently sample from this joint distribution, we propose a Langevin-within-Gibbs sampling algorithm using gradient-based discrete MCMC. Our method significantly improves constraint satisfaction while maintaining comparable or better fluency, all with even lower computational costs. We demonstrate the advantages of our controlled decoding method on sentiment control, language detoxification, and keyword-guided generation.
https://arxiv.org/abs/2502.03685
Distinguishing in- and out-of-distribution (OOD) inputs is crucial for reliable deployment of classification systems. However, OOD data is typically unavailable or difficult to collect, posing a significant challenge for accurate OOD detection. In this work, we present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies, eliminating the dependency on any external OOD data source. We study the efficacy of our method on classical text classification tasks such as toxicity detection and sentiment classification as well as classification tasks arising in LLM development and deployment, such as training a reward model for RLHF and detecting misaligned generations. Extensive experiments on nine InD-OOD dataset pairs and various model sizes show that our approach dramatically lowers false positive rates (achieving a perfect zero in some cases) while maintaining high accuracy on in-distribution tasks, outperforming baseline methods by a significant margin.
https://arxiv.org/abs/2502.03323
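Generating high-quality proxies is the paper's contribution; once scores for synthetic OOD examples exist, calibrating a detector can be as simple as the quantile thresholding sketched below. Function names and the higher-score-means-in-distribution convention are assumptions, not the paper's method:

```python
def fit_ood_threshold(synthetic_ood_scores, quantile=0.95):
    # Pick the score below which inputs are flagged OOD, set so that roughly
    # `quantile` of the synthetic OOD proxies fall below it. Scores are
    # assumed to be higher for in-distribution inputs.
    s = sorted(synthetic_ood_scores)
    idx = min(int(quantile * len(s)), len(s) - 1)
    return s[idx]

def is_ood(score, threshold):
    # Flag an input as out-of-distribution if its score falls below the threshold.
    return score < threshold
```

The false positive rate the abstract reports would then be the fraction of genuine in-distribution inputs flagged by `is_ood`.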
We present LLaVAC, a method for constructing a classifier for multimodal sentiment analysis. This method leverages fine-tuning of the Large Language and Vision Assistant (LLaVA) to predict sentiment labels across both image and text modalities. Our approach involves designing a structured prompt that incorporates both unimodal and multimodal labels to fine-tune LLaVA, enabling it to perform sentiment classification effectively. Experiments on the MVSA-Single dataset demonstrate that LLaVAC outperforms existing methods in multimodal sentiment analysis across three data processing procedures. The implementation of LLaVAC is publicly available at this https URL.
https://arxiv.org/abs/2502.02938
With the internet's evolution, consumers increasingly rely on online reviews for service or product choices, necessitating that businesses analyze extensive customer feedback to enhance their offerings. While machine learning-based sentiment classification shows promise in this realm, its technical complexity often bars small businesses and individuals from leveraging such advancements, potentially widening the competitive gap between small and large businesses in improving customer satisfaction. This paper introduces an approach that integrates large language models (LLMs), specifically Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT)-based models, making sentiment classification accessible to a wider audience. Our experiments across various datasets confirm that our approach retains high classification accuracy without the need for manual labeling, expert knowledge in tuning and data annotation, or substantial computational power. By significantly lowering the barriers to applying sentiment classification techniques, our methodology enhances competitiveness and paves the way for making machine learning technology accessible to a broader audience.
https://arxiv.org/abs/2502.02893
Attention, or prioritization of certain information items over others, is a critical element of any learning process, for both humans and machines. Given that humans continue to outperform machines in certain learning tasks, it seems plausible that machine performance could be enriched by aligning machine attention with human attention mechanisms -- yet research on this topic is sparse and has achieved only limited success. This paper proposes a new approach to address this gap, called Human-Machine Attention Learning (HuMAL). This approach involves reliance on data annotated by humans to reflect their self-perceived attention during specific tasks. We evaluate several alternative strategies for integrating such human attention data into machine learning (ML) algorithms, using a sentiment analysis task (review data from Yelp) and a personality-type classification task (data from myPersonality). The best-performing HuMAL strategy significantly enhances the task performance of fine-tuned transformer models (BERT, as well as GPT-2 and XLNET), and the benefit is particularly pronounced under challenging conditions of imbalanced or sparse labeled data. This research contributes to a deeper understanding of strategies for integrating human attention into ML models and highlights the potential of leveraging human cognition to augment ML in real-world applications.
https://arxiv.org/abs/2502.06811
Large language models (LLMs) have shown remarkable success in language modelling due to scaling laws found in model size and the hidden dimension of the model's text representation. Yet, we demonstrate that compressed representations of text can yield better performance in LLM-based regression tasks. In this paper, we compare the relative performance of embedding compression in three different signal-to-noise contexts: financial return prediction, writing quality assessment and review scoring. Our results show that compressing embeddings, in a minimally supervised manner using an autoencoder's hidden representation, can mitigate overfitting and improve performance on noisy tasks, such as financial return prediction; but that compression reduces performance on tasks that have high causal dependencies between the input and target data. Our results suggest that the success of interpretable compressed representations such as sentiment may be due to a regularising effect.
https://arxiv.org/abs/2502.02199