Aspect-based sentiment analysis (ABSA) involves identifying sentiment towards specific aspect terms in a sentence, allowing us to uncover nuanced perspectives and attitudes on particular aspects of a product, service, or topic. However, the scarcity of labeled data poses a significant challenge to training high-quality models. To address this issue, we explore the potential of data augmentation using ChatGPT, a well-performing large language model (LLM), to enhance sentiment classification performance towards aspect terms. Specifically, we explore three ChatGPT-based data augmentation strategies: context-focused, aspect-focused, and context-aspect data augmentation. Context-focused data augmentation rewrites the expression of the context words in a sentence while keeping the aspect terms unchanged. In contrast, aspect-focused data augmentation changes the aspect terms while keeping the context words unchanged. Context-aspect data augmentation combines the two strategies above to generate augmented samples. Furthermore, we incorporate contrastive learning into the ABSA task to improve performance. Extensive experiments show that all three data augmentation techniques lead to performance improvements, with the context-aspect strategy performing best and surpassing the baseline models.
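A minimal sketch of how such augmentation prompts might be phrased (the templates, function name, and example sentence below are illustrative assumptions, not the paper's actual prompts):

```python
# Hypothetical prompt templates for the three augmentation strategies;
# the wording is illustrative, not the paper's actual prompts.

def build_augmentation_prompt(sentence: str, aspect: str, strategy: str) -> str:
    """Build a ChatGPT prompt for one of the three augmentation strategies."""
    if strategy == "context":
        # Context-focused: paraphrase everything except the aspect term.
        return (f"Rewrite the sentence below, changing only the context words. "
                f"Keep the aspect term '{aspect}' and its sentiment unchanged.\n"
                f"Sentence: {sentence}")
    if strategy == "aspect":
        # Aspect-focused: replace the aspect term, keep context intact.
        return (f"In the sentence below, replace the aspect term '{aspect}' with "
                f"a comparable aspect term, leaving all other words unchanged.\n"
                f"Sentence: {sentence}")
    if strategy == "context-aspect":
        # Context-aspect: apply both transformations.
        return (f"Rewrite the sentence below, paraphrasing the context words and "
                f"replacing the aspect term '{aspect}' with a comparable one, "
                f"while preserving the original sentiment.\n"
                f"Sentence: {sentence}")
    raise ValueError(f"unknown strategy: {strategy}")

print(build_augmentation_prompt("The battery lasts all day.", "battery", "context"))
```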
https://arxiv.org/abs/2409.11218
The development of unbiased large language models is widely recognized as crucial, yet existing benchmarks fall short in detecting biases due to limited scope, contamination, and the lack of a fairness baseline. SAGED(-Bias) is the first holistic benchmarking pipeline to address these problems. The pipeline encompasses five core stages: scraping materials, assembling benchmarks, generating responses, extracting numeric features, and diagnosing with disparity metrics. SAGED includes metrics for max disparity, such as impact ratio, and for bias concentration, such as Max Z-scores. Noticing that assessment-tool bias and contextual bias in prompts can distort evaluation, SAGED implements counterfactual branching and baseline calibration for mitigation. For demonstration, we apply SAGED to the G20 countries with popular 8b-level models including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we find that while Mistral and Qwen2 show lower max disparity and higher bias concentration than Gemma2 and Llama3.1, all models are notably biased against countries like Russia and (except for Qwen2) China. In further experiments that have the models role-play U.S. (vice-/former-) presidents, we see bias amplify and shift in heterogeneous directions. Moreover, Qwen2 and Mistral do not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating role-playing performance bias in these models.
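For intuition, here is one plausible reading of the two disparity metrics named above, applied to per-country sentiment scores; SAGED's exact definitions may differ:

```python
import numpy as np

# Plausible implementations of the two disparity metrics named in the
# abstract; SAGED's exact definitions may differ.

def impact_ratio(group_rates: dict[str, float]) -> float:
    """Max-disparity metric: ratio of the lowest to the highest
    group-level rate of a favorable outcome (1.0 = perfect parity)."""
    rates = np.array(list(group_rates.values()), dtype=float)
    return float(rates.min() / rates.max())

def max_z_score(group_means: dict[str, float]) -> float:
    """Bias-concentration metric: largest standardized deviation of a
    group mean from the mean across all groups."""
    means = np.array(list(group_means.values()), dtype=float)
    return float(np.abs((means - means.mean()) / means.std()).max())

# Example: invented mean sentiment scores per country from model responses.
sentiment = {"Russia": 0.31, "China": 0.38, "France": 0.55, "Brazil": 0.52}
print(impact_ratio(sentiment), max_z_score(sentiment))
```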
https://arxiv.org/abs/2409.11149
Written texts reflect an author's perspective, making the thorough analysis of literature a key research method in fields such as the humanities and social sciences. However, conventional text mining techniques like sentiment analysis and topic modeling are limited in their ability to capture the hierarchical narrative structures that reveal deeper argumentative patterns. To address this gap, we propose a method that leverages large language models (LLMs) to extract and organize these structures into a hierarchical framework. We validate this approach by analyzing public opinions on generative AI collected by Japan's Agency for Cultural Affairs, comparing the narratives of supporters and critics. Our analysis provides clearer visualization of the factors influencing divergent opinions on generative AI, offering deeper insights into the structures of agreement and disagreement.
https://arxiv.org/abs/2409.11032
This paper provides a comprehensive survey of sentiment analysis within the context of artificial intelligence (AI) and large language models (LLMs). Sentiment analysis, a critical aspect of natural language processing (NLP), has evolved significantly from traditional rule-based methods to advanced deep learning techniques. This study examines the historical development of sentiment analysis, highlighting the transition from lexicon-based and pattern-based approaches to more sophisticated machine learning and deep learning models. Key challenges are discussed, including handling bilingual texts, detecting sarcasm, and addressing biases. The paper reviews state-of-the-art approaches, identifies emerging trends, and outlines future research directions to advance the field. By synthesizing current methodologies and exploring future opportunities, this survey aims to provide a thorough understanding of sentiment analysis in the AI and LLM context.
https://arxiv.org/abs/2409.09989
The rapid development of LLMs brings both convenience and potential threats. As customized and private LLMs are widely deployed, model copyright protection has become important. Text watermarking is emerging as a promising solution to AI-generated text detection and model protection issues. However, current text watermarks have largely ignored the critical need to inject different watermarks for different users, which would help attribute a watermark to a specific individual. In this paper, we explore a personalized text watermarking scheme for LLM copyright protection and other scenarios, ensuring accountability and traceability in content generation. Specifically, we propose PersonaMark, a novel text watermarking method that utilizes sentence structure as the hidden medium for the watermark information and optimizes the sentence-level generation algorithm to minimize disruption to the model's natural generation process. By employing a personalized hashing function to inject unique watermark signals for different users, personalized watermarked text can be obtained. Since our approach operates at the sentence level rather than on token probabilities, text quality is well preserved. With the designed multi-user hashing function, the process of injecting unique watermark signals remains time-efficient even for a large number of users. To the best of our knowledge, this is the first work to achieve personalized text watermarking. We conduct an extensive evaluation of four different LLMs in terms of perplexity, sentiment polarity, alignment, readability, etc. The results demonstrate that our method maintains performance with minimal perturbation to the model's behavior, allows for unbiased insertion of watermark information, and exhibits strong watermark recognition capabilities.
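A minimal sketch of how a user-keyed hash might select among candidate sentence structures to embed a per-user signal; the hashing scheme, candidate representation, and function names are assumptions, not PersonaMark's actual algorithm:

```python
import hashlib

# Sketch of user-keyed watermark selection, assuming the model proposes
# several candidate sentences per generation step; illustrative only.

def user_bucket(user_id: str, context: str, n_candidates: int) -> int:
    """Deterministically map (user, context) to a candidate index."""
    digest = hashlib.sha256(f"{user_id}|{context}".encode()).hexdigest()
    return int(digest, 16) % n_candidates

def pick_watermarked_sentence(user_id: str, context: str,
                              candidates: list[str]) -> str:
    """Choose the candidate whose (hypothetical) structural signature
    matches the user's hash bucket, embedding a per-user signal."""
    return candidates[user_bucket(user_id, context, len(candidates))]

candidates = [
    "The results, which we verified twice, were stable.",
    "We verified the results twice, and they were stable.",
    "Verified twice, the results proved stable.",
]
print(pick_watermarked_sentence("alice", "report intro", candidates))
```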
https://arxiv.org/abs/2409.09739
This paper investigates gender bias in Large Language Model (LLM)-generated teacher evaluations in a higher education setting, focusing on evaluations produced by GPT-4 across six academic subjects. By applying a comprehensive analytical framework that includes Odds Ratio (OR) analysis, the Word Embedding Association Test (WEAT), sentiment analysis, and contextual analysis, the paper identifies patterns of gender-associated language reflecting societal stereotypes. Specifically, words related to approachability and support were used more frequently for female instructors, while words related to entertainment were predominantly used for male instructors, aligning with the concepts of communal and agentic behaviors. The study also found moderate to strong associations between male-salient adjectives and male names, though career and family words did not distinctly capture gender biases. These findings align with prior research on societal norms and stereotypes, reinforcing the notion that LLM-generated text reflects existing biases.
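As a concrete illustration of the OR analysis, a standard odds-ratio computation over word counts might look like this (the counts below are invented for illustration):

```python
import math

# Sketch of an odds-ratio check for gendered word usage, in the spirit of
# the OR analysis the abstract names; counts below are invented.

def odds_ratio(word_f: int, other_f: int, word_m: int, other_m: int) -> float:
    """OR > 1 means the word is more associated with female-instructor
    evaluations; counts are (word, all-other-words) per gender."""
    return (word_f / other_f) / (word_m / other_m)

# Hypothetical counts of "supportive" in female- vs male-instructor texts.
or_supportive = odds_ratio(word_f=120, other_f=49_880, word_m=60, other_m=49_940)
print(f"OR = {or_supportive:.2f}, log-OR = {math.log(or_supportive):.2f}")
```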
https://arxiv.org/abs/2409.09652
The growth of deep learning (DL) relies heavily on huge amounts of labelled data for tasks such as natural language processing and computer vision. Specifically, in image-to-text or image-to-image pipelines, opinion (sentiment) may be inadvertently learned by a model from human-generated image captions. Additionally, learning may be affected by the variety and diversity of the provided captions. While labelling large datasets has largely relied on crowd-sourcing or data-worker pools, evaluating the quality of such training data is crucial. This study proposes an evaluation method focused on sentiment and semantic richness. That method was applied to the COCO-MS dataset, comprising approximately 150K images with segmented objects and corresponding crowd-sourced captions. We employed pre-trained models (Twitter-RoBERTa-base and BERT-base) to extract sentiment scores and variability of semantic embeddings from captions. The relation of the sentiment score and semantic variability with object categories was examined using multiple linear regression. Results indicate that while most captions were neutral, about 6% of the captions exhibited strong sentiment influenced by specific object categories. Semantic variability of within-image captions remained low and uncorrelated with object categories. Model-generated captions showed less than 1.5% of strong sentiment which was not influenced by object categories and did not correlate with the sentiment of the respective human-generated captions. This research demonstrates an approach to assess the quality of crowd- or worker-sourced captions informed by image content.
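A rough sketch of the two caption measurements, assuming common public checkpoints that may differ from the paper's exact models:

```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModel

# Sketch of the caption-quality measurements the abstract describes:
# sentiment scores from a Twitter-RoBERTa model and semantic variability
# from BERT embeddings. Checkpoints are common public ones and may differ
# from the paper's exact choices.

sentiment = pipeline("sentiment-analysis",
                     model="cardiffnlp/twitter-roberta-base-sentiment-latest")
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

captions = [  # several crowd-sourced captions for one image
    "A happy dog runs across a sunny park.",
    "A dog running on grass.",
    "Brown dog playing outside near trees.",
]

print(sentiment(captions))  # per-caption sentiment labels and scores

with torch.no_grad():
    enc = tok(captions, padding=True, return_tensors="pt")
    emb = bert(**enc).last_hidden_state.mean(dim=1)  # mean-pooled embeddings

# Semantic variability: mean pairwise cosine distance within the image.
sims = torch.nn.functional.cosine_similarity(
    emb.unsqueeze(1), emb.unsqueeze(0), dim=-1)
n = len(captions)
variability = (1 - sims).sum() / (n * (n - 1))
print(float(variability))
```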
https://arxiv.org/abs/2409.09560
As part of a broader look at the impact of generative AI, this study investigated journalists' emotional responses to the launch of ChatGPT. By analyzing nearly 1 million Tweets from journalists at major U.S. news outlets, we tracked changes in emotional tone and sentiment before and after the introduction of ChatGPT in November 2022. Using various computational and natural language processing techniques to measure emotional shifts in response to ChatGPT's release, we found an increase in positive emotion and a more favorable tone post-launch, suggesting initial optimism toward AI's potential. This research underscores the pivotal role of journalists as interpreters of technological innovation and disruption, highlighting how their emotional reactions may shape public narratives around emerging technologies. The study contributes to understanding the intersection of journalism, emotion, and AI, offering insights into the broader societal impact of generative AI tools.
https://arxiv.org/abs/2409.08761
Multimodal affective computing (MAC) has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in the text-dominated multimodal affective computing field. This survey presents the recent trends of multimodal affective computing from an NLP perspective through four hot tasks: multimodal sentiment analysis, multimodal emotion recognition in conversation, multimodal aspect-based sentiment analysis, and multimodal multi-label emotion recognition. The goal of this survey is to explore the current landscape of multimodal affective research, identify development trends, and highlight the similarities and differences across various tasks, offering a comprehensive report on the recent progress in multimodal affective computing from an NLP perspective. This survey covers the formalization of tasks, provides an overview of relevant works, describes benchmark datasets, and details the evaluation metrics for each task. It also briefly discusses research in multimodal affective computing involving facial expressions, acoustic signals, physiological signals, and emotion causes. Finally, we discuss the technical approaches, challenges, and future directions in multimodal affective computing. To support further research, we released a repository that compiles related works in multimodal affective computing, providing detailed resources and references for the community.
https://arxiv.org/abs/2409.07388
Analyzing user reviews for sentiment towards app features can provide valuable insights into users' perceptions of app functionality and their evolving needs. Given the volume of user reviews received daily, an automated mechanism to generate feature-level sentiment summaries of user reviews is needed. Recent advances in Large Language Models (LLMs) such as ChatGPT have shown impressive performance on several new tasks without updating the model's parameters, i.e., using zero or a few labeled examples. Despite these advancements, LLMs' capabilities for feature-specific sentiment analysis of user reviews remain unexplored. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, at extracting app features and associated sentiments under 0-shot, 1-shot, and 5-shot scenarios. Results indicate that the best-performing GPT-4 model outperforms rule-based approaches by 23.6% in f1-score with zero-shot feature extraction, with 5-shot prompting improving it by a further 6%. GPT-4 achieves a 74% f1-score for predicting positive sentiment towards correctly predicted app features, with 5-shot prompting enhancing it by 7%. Our study suggests that LLMs are promising for generating feature-specific sentiment summaries of user reviews.
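An illustrative few-shot prompt for joint feature and sentiment extraction; the example review and wording are assumptions, not the study's actual prompts:

```python
# Illustrative few-shot prompt for joint app-feature and sentiment
# extraction; examples and wording are assumptions, not the study's prompts.

FEW_SHOT_EXAMPLES = [
    ("The dark mode is gorgeous but sync keeps failing.",
     [("dark mode", "positive"), ("sync", "negative")]),
]

def build_prompt(review: str) -> str:
    lines = ["Extract app features and the sentiment towards each "
             "(positive/negative/neutral) as (feature, sentiment) pairs."]
    for text, pairs in FEW_SHOT_EXAMPLES:  # 1-shot here; add more for 5-shot
        lines.append(f"Review: {text}")
        lines.append(f"Answer: {pairs}")
    lines.append(f"Review: {review}")
    lines.append("Answer:")
    return "\n".join(lines)

print(build_prompt("Login is fast now, but notifications never arrive."))
```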
https://arxiv.org/abs/2409.07162
It is widely acknowledged that extracting market sentiments from news data benefits market predictions. However, existing methods of using financial sentiments remain simplistic, relying on equal-weight and static aggregation to manage sentiments from multiple news items. This leads to a critical issue termed "Aggregated Sentiment Homogenization", which we have explored through our analysis of a large financial news dataset from industry practice. This phenomenon occurs when aggregating numerous sentiments, causing representations to converge towards the mean values of sentiment distributions and thereby smoothing out unique and important information. Consequently, the aggregated sentiment representations lose much of the predictive value of the news data. To address this problem, we introduce the Market Attention-weighted News Aggregation Network (MANA-Net), a novel method that leverages a dynamic market-news attention mechanism to aggregate news sentiments for market prediction. MANA-Net learns the relevance of news sentiments to price changes and assigns varying weights to individual news items. By integrating the news aggregation step into the networks for market prediction, MANA-Net allows for trainable sentiment representations that are optimized directly for prediction. We evaluate MANA-Net using the S&P 500 and NASDAQ 100 indices, along with financial news spanning from 2003 to 2018. Experimental results demonstrate that MANA-Net outperforms various recent market prediction methods, enhancing Profit & Loss by 1.1% and the daily Sharpe ratio by 0.252.
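A toy sketch of the attention-weighted aggregation idea, replacing equal-weight averaging with market-conditioned weights; the scoring function and dimensions are assumptions, not MANA-Net's actual architecture:

```python
import numpy as np

# Minimal sketch of market-attention-weighted sentiment aggregation in
# the spirit of MANA-Net; dimensions and scoring are illustrative.

rng = np.random.default_rng(0)
news_emb = rng.normal(size=(5, 16))   # 5 news items, 16-dim sentiment features
market_state = rng.normal(size=16)    # current market representation

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Relevance of each news item to the market state drives its weight,
# instead of the equal-weight averaging the abstract criticizes.
weights = softmax(news_emb @ market_state)
aggregated = weights @ news_emb       # weighted sentiment representation
print(weights.round(3), aggregated.shape)
```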
https://arxiv.org/abs/2409.05698
The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. No prior work related to social media mining has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper aims to address this research gap and makes two scientific contributions to this field. First, it presents a multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. The dataset, available at this https URL, contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were performed. This process included classifying each post into (i) one of the sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral, (ii) hate or not hate, and (iii) anxiety/stress detected or no anxiety/stress detected. These results are presented as separate attributes in the dataset. Second, this paper presents the results of performing sentiment analysis, hate speech analysis, and anxiety or stress analysis. The distribution of the sentiment classes - fear, surprise, joy, sadness, anger, disgust, and neutral - was observed to be 27.95%, 2.57%, 8.69%, 5.94%, 2.69%, 1.53%, and 50.64%, respectively. In terms of hate speech detection, 95.75% of the posts did not contain hate and the remaining 4.25% of the posts contained hate. Finally, 72.05% of the posts did not indicate any anxiety/stress, and the remaining 27.95% of the posts represented some form of anxiety/stress.
https://arxiv.org/abs/2409.05292
In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024. First, in order to enhance the performance of the feature extractor on sentiment classification tasks, we fine-tuned video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to enhance the accuracy of the model, we iteratively apply self-supervised learning, using high-confidence predictions on unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets, and we therefore adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.
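A rough sketch of audio-guided fusion via cross-attention, with audio features (e.g., from Hubert-large) as queries over video/text features; dimensions and layout are assumptions, not the system's actual design:

```python
import torch
import torch.nn as nn

# Sketch of audio-guided fusion: audio features act as queries attending
# over the other modality's features. Illustrative dimensions only.

class AudioGuidedFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, other):
        # audio: (B, Ta, D) queries; other: (B, To, D) video/text keys+values
        fused, _ = self.cross_attn(audio, other, other)
        return fused.mean(dim=1)  # pooled fused representation

model = AudioGuidedFusion()
audio = torch.randn(2, 50, 256)      # batch of 2, 50 audio frames
video_text = torch.randn(2, 30, 256) # 30 video/text tokens
print(model(audio, video_text).shape)  # torch.Size([2, 256])
```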
https://arxiv.org/abs/2409.05007
Machine translation using large language models (LLMs) is having a significant global impact, making communication easier. Mandarin Chinese is the official language used for communication by the government, education institutes, and media in China. In this study, we provide an automated assessment of machine translation models against human experts using sentiment and semantic analysis. To demonstrate our framework, we select the classic early twentieth-century novel 'The True Story of Ah Q' with selected Mandarin Chinese to English translations. We also use Google Translate to render the given text into English and then conduct a chapter-wise sentiment analysis and semantic analysis to compare the extracted sentiments across the different translations. We utilise LLMs for the semantic and sentiment analysis. Our results indicate that the precision of Google Translate differs in terms of both semantic and sentiment analysis when compared to human expert translations. We find that Google Translate is unable to translate some specific Chinese words or phrases, such as traditional Chinese allusions. The mistranslations are due to its lack of contextual significance and historical knowledge of China. Thus, this framework brings us new insights about machine translation for Mandarin Chinese. Future work can explore other languages or types of texts with this framework.
https://arxiv.org/abs/2409.04964
This paper introduces Chain of Translation Prompting (CoTR), a novel strategy designed to enhance the performance of language models in low-resource languages. CoTR restructures prompts to first translate the input context from a low-resource language into a higher-resource language, such as English. The specified task, such as generation, classification, or any other NLP function, is then performed on the translated text, with the option to translate the output back to the original language if needed. All these steps are specified in a single prompt. We demonstrate the effectiveness of this method through a case study on the low-resource Indic language Marathi. The CoTR strategy is applied to various tasks, including sentiment analysis, hate speech classification, subject classification, and text generation, and its efficacy is showcased by comparing it with regular prompting methods. Our results underscore the potential of translation-based prompting strategies to significantly improve multilingual LLM performance in low-resource languages, offering valuable insights for future research and applications. We see the highest accuracy improvements with the hate speech detection task. The technique also has the potential to enhance the quality of synthetic data generation for underrepresented languages using LLMs.
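A minimal sketch of a single CoTR-style prompt; the exact wording is an assumption based on the abstract's description, not the paper's prompt:

```python
# Illustrative single-prompt CoTR template: translate, perform the task,
# then translate back, all specified in one prompt.

def cotr_prompt(text: str, task: str, src_lang: str = "Marathi") -> str:
    return (
        f"Step 1: Translate the following {src_lang} text into English.\n"
        f"Step 2: Perform this task on the English translation: {task}.\n"
        f"Step 3: Translate the final answer back into {src_lang}.\n"
        f"Text: {text}"
    )

print(cotr_prompt("ही सेवा खूप चांगली आहे.",
                  "classify the sentiment as positive, negative, or neutral"))
```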
https://arxiv.org/abs/2409.04512
Eating disorders are complex mental health conditions that affect millions of people around the world. Effective interventions on social media platforms are crucial, yet testing strategies in situ can be risky. We present a novel LLM-driven experimental testbed for simulating and assessing intervention strategies in ED-related discussions. Our framework generates synthetic conversations across multiple platforms, models, and ED-related topics, allowing for controlled experimentation with diverse intervention approaches. We analyze the impact of various intervention strategies on conversation dynamics across four dimensions: intervention type, generative model, social media platform, and ED-related community/topic. We employ cognitive domain analysis metrics, including sentiment, emotions, etc., to evaluate the effectiveness of interventions. Our findings reveal that civility-focused interventions consistently improve positive sentiment and emotional tone across all dimensions, while insight-resetting approaches tend to increase negative emotions. We also uncover significant biases in LLM-generated conversations, with cognitive metrics varying notably between models (Claude-3 Haiku > Mistral > GPT-3.5-turbo > LLaMA3) and even between versions of the same model. These variations highlight the importance of model selection in simulating realistic discussions related to ED. Our work provides valuable information on the complex dynamics of ED-related discussions and the effectiveness of various intervention strategies.
https://arxiv.org/abs/2409.04043
This work proposes a novel and simple sequential learning strategy to train models on videos and texts for multimodal sentiment analysis. To estimate sentiment polarities on unseen out-of-distribution data, we introduce a multimodal model that is trained in either a single source domain or multiple source domains using our learning strategy. This strategy starts with learning domain-invariant features from text, followed by learning sparse domain-agnostic features from videos, assisted by the selected features learned from text. Our experimental results demonstrate that our model achieves significantly better performance on average than the state-of-the-art approaches in both single-source and multi-source settings. Our feature selection procedure favors features that are independent of each other and strongly correlated with their polarity labels. To facilitate research on this topic, the source code of this work will be made publicly available upon acceptance.
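One plausible reading of such a feature-selection rule, keeping features that are strongly correlated with polarity labels yet weakly correlated with already-selected features (thresholds and data are invented):

```python
import numpy as np

# Sketch of a feature-selection rule in the spirit the abstract describes;
# thresholds, data, and the greedy order are assumptions.

def select_features(X, y, label_corr_min=0.3, inter_corr_max=0.5):
    """X: (n_samples, n_features); y: (n_samples,) polarity labels."""
    selected = []
    for j in range(X.shape[1]):
        if abs(np.corrcoef(X[:, j], y)[0, 1]) < label_corr_min:
            continue  # too weakly related to the labels
        if any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > inter_corr_max
               for k in selected):
            continue  # too redundant with an already-selected feature
        selected.append(j)
    return selected

rng = np.random.default_rng(1)
y = rng.choice([0.0, 1.0], size=200)
X = np.stack([y + rng.normal(0, 0.5, 200),    # informative
              y + rng.normal(0, 0.2, 200),    # redundant with feature 0
              rng.normal(size=200)], axis=1)  # uninformative noise
print(select_features(X, y))  # likely [0]
```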
https://arxiv.org/abs/2409.04473
Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored; moreover, existing textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or use a language model to rephrase less important tokens in the original sequence. In the context of SC, however, strong emotional tokens can be critical to the sentiment of the whole sequence. Therefore, rather than rephrasing less important context, we propose DiffusionCLS, which leverages a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise while augmenting crucial features of the dataset. DiffusionCLS also comprises a noise-resistant training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios, including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework's modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
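A toy sketch of the core idea, keeping tokens strongly associated with the label and masking the rest for a generative model to reconstruct; PMI-based token scoring is an assumption for illustration, not the paper's actual method:

```python
import math
from collections import Counter

# Toy sketch: score tokens by PMI with the sentiment label, keep the
# label-salient ones, and mask the rest for the LM to regenerate.

corpus = [("the battery is amazing", 1), ("the screen is terrible", 0),
          ("the food is amazing", 1), ("the service is terrible", 0)]

token_label = Counter((tok, y) for text, y in corpus for tok in text.split())
token_any = Counter(tok for text, _ in corpus for tok in text.split())
tokens_per_label = Counter()
for text, y in corpus:
    tokens_per_label[y] += len(text.split())
total = sum(token_any.values())

def pmi(tok: str, y: int) -> float:
    joint = token_label[(tok, y)] / total
    if joint == 0:
        return float("-inf")
    return math.log(joint / ((token_any[tok] / total)
                             * (tokens_per_label[y] / total)))

def mask_weak_tokens(text: str, y: int, threshold: float = 0.1) -> str:
    """Keep label-salient tokens; mask the rest for reconstruction."""
    return " ".join(tok if pmi(tok, y) > threshold else "[MASK]"
                    for tok in text.split())

print(mask_weak_tokens("the battery is amazing", 1))
# -> "[MASK] battery [MASK] amazing"
```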
https://arxiv.org/abs/2409.03203
This study analyzes predictive statements, hope speech, and regret-detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named "Prediction statements," categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering nuanced interplay between these emotions and predictive behaviors. Despite limitations related to data volume and resource availability, our study reports valuable discoveries concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.
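An illustrative zero-shot prompt for the "Prediction statements" scheme; the wording and label glosses are assumptions, not the study's actual prompt:

```python
# Illustrative zero-shot classification prompt for the "Prediction
# statements" scheme; the phrasing and glosses are assumptions.

LABELS = ["Predictive Incremental", "Predictive Decremental",
          "Predictive Neutral", "Non-Predictive"]

def classification_prompt(comment: str) -> str:
    return (
        "Classify the cryptocurrency comment below into exactly one "
        f"category: {', '.join(LABELS)}.\n"
        "Predictive Incremental = predicts the price will rise; "
        "Predictive Decremental = predicts it will fall; "
        "Predictive Neutral = predicts no significant change; "
        "Non-Predictive = makes no prediction.\n"
        f"Comment: {comment}\nCategory:"
    )

print(classification_prompt("ADA will double by the end of the quarter."))
```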
https://arxiv.org/abs/2409.02836
Pre-training and self-training are two approaches to semi-supervised learning, and the comparison between them has been explored. However, previous works led to confusing findings: self-training outperforms pre-training on some computer vision tasks, while, conversely, pre-training outperforms self-training on some natural language processing tasks, under incomparable experimental settings. We propose a comparative and exhaustive ensemble method to empirically study all feasible training paradigms combining pre-training, self-training, and fine-tuning within consistent foundational settings and comparable data augmentation. We conduct experiments on six datasets with four data augmentation methods and imbalanced data for sentiment analysis and natural language inference tasks. Our findings confirm that the pre-training and fine-tuning paradigm yields the best overall performance. Moreover, self-training offers no additional benefits when combined with semi-supervised pre-training.
https://arxiv.org/abs/2409.02751