This study evaluates the robustness of two state-of-the-art deep contextual language representations, ELMo and DistilBERT, on supervised learning of binary protest news classification and sentiment analysis of product reviews. A "cross-context" setting is enabled using test sets that are distinct from the training data. Specifically, in the news classification task, the models are developed on local news from India and tested on local news from China. In the sentiment analysis task, the models are trained on movie reviews and tested on customer reviews. This comparison aims to explore the limits of the representative power of today's Natural Language Processing systems on the path to systems that generalize to real-life scenarios. The models are fine-tuned and fed into a Feed-Forward Neural Network and a Bidirectional Long Short-Term Memory network. Multinomial Naive Bayes and Linear Support Vector Machine are used as traditional baselines. The results show that, in binary text classification, DistilBERT generalizes to the cross-context setting significantly better than ELMo. ELMo, in turn, is observed to be significantly more robust to the cross-context test data than both baselines. On the other hand, the baselines performed comparably well to ELMo when the training and test data are subsets of the same corpus (no cross-context). DistilBERT is also found to be 30% smaller and 83% faster than ELMo. The results suggest that DistilBERT can transfer generic semantic knowledge to other domains better than ELMo. DistilBERT is also favorable for incorporation into real-life systems, as it requires a smaller computational training budget. When generalization is not the utmost preference and the test domain is similar to the training domain, traditional ML algorithms can still be considered more economical alternatives to deep language representations.
https://arxiv.org/abs/2303.12936
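Below is a minimal sketch of the cross-context evaluation protocol described above, using the two traditional baselines (Multinomial Naive Bayes and a linear SVM) from scikit-learn. The corpus variables are placeholders rather than the paper's actual data.

```python
# Hedged sketch: train on one context, test on another, with the two baselines.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def cross_context_eval(train_texts, train_labels, test_texts, test_labels):
    """Train on one context (e.g. India local news), test on another (e.g. China)."""
    vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)  # vocabulary fixed by the training context
    for name, clf in [("Multinomial NB", MultinomialNB()), ("Linear SVM", LinearSVC())]:
        clf.fit(X_train, train_labels)
        score = f1_score(test_labels, clf.predict(X_test), average="macro")
        print(f"{name} macro-F1: {score:.3f}")
```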
In recent years, Transformer-based models such as the Switch Transformer have achieved remarkable results in natural language processing tasks. However, these models are often too complex and require extensive pre-training, which limits their effectiveness for small clinical text classification tasks with limited data. In this study, we propose a simplified Switch Transformer framework and train it from scratch on a small French clinical text classification dataset at CHU Sainte-Justine hospital. Our results demonstrate that the simplified small-scale Transformer models outperform pre-trained BERT-based models, including DistilBERT, CamemBERT, FlauBERT, and FrALBERT. Additionally, using the mixture-of-experts mechanism from the Switch Transformer helps capture diverse patterns; hence, the proposed approach achieves better results than a conventional Transformer with the self-attention mechanism. Finally, our proposed framework achieves an accuracy of 87%, precision of 87%, and recall of 85%, compared to the third-best pre-trained BERT-based model, FlauBERT, which achieved an accuracy of 84%, precision of 84%, and recall of 84%. However, Switch Transformers have limitations, including a generalization gap and sharp minima. We compare the model with a multi-layer perceptron neural network for small French clinical narrative classification and show that the latter outperforms all other models.
https://arxiv.org/abs/2303.12892
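The core idea the abstract credits for the gain is the Switch Transformer's top-1 mixture-of-experts routing. The following PyTorch sketch shows such a layer under simplifying assumptions (no load-balancing loss, no capacity factor); it illustrates the mechanism, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Top-1 mixture-of-experts feed-forward layer in the spirit of the Switch Transformer."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                              # x: (batch, seq, d_model)
        gate = F.softmax(self.router(x), dim=-1)       # routing probabilities per token
        top_prob, top_idx = gate.max(dim=-1)           # top-1 expert for each token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                        # tokens routed to expert i
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out
```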
In recent years there has been a growing demand from financial agents, especially private and institutional investors, for companies to report on climate-related financial risks. A vast amount of information, in text format, can be expected to be disclosed in the short term by firms in order to identify these types of risks in their financial and non-financial reports, particularly in response to the growing regulation being passed on the matter. To this end, this paper applies state-of-the-art NLP techniques to the detection of climate change in text corpora. We use transfer learning to fine-tune two transformer models: BERT and ClimateBert, a recently published DistilRoBERTa-based model specifically tailored for climate text classification. Both algorithms are based on the transformer architecture, which enables learning the contextual relationships between words in a text. We carry out the fine-tuning process of both models on the novel ClimaText database, consisting of data collected from Wikipedia, 10-K reports and web-based claims. Our text classification model obtained by fine-tuning ClimateBert on ClimaText outperforms the models created with BERT and the current state-of-the-art transformer in this particular problem. Our study is the first to apply the recently published ClimateBert algorithm to the ClimaText database. Based on our results, ClimateBert fine-tuned on ClimaText is an outstanding tool among NLP pre-trained transformer models that can and should be used by investors, institutional agents and companies themselves to monitor the disclosure of climate risk in financial reports. In addition, our transfer learning methodology is computationally cheap, allowing any organization to perform it.
https://arxiv.org/abs/2303.13373
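A hedged sketch of the transfer-learning setup: fine-tuning ClimateBert for binary climate-sentence detection with the Hugging Face Trainer. The checkpoint name and the `train_ds`/`eval_ds` dataset objects are assumptions standing in for the paper's exact configuration.

```python
# Assumptions: "climatebert/distilroberta-base-climate-f" is used as the ClimateBert
# checkpoint, and train_ds / eval_ds are datasets with "text" and "label" columns
# (e.g. ClimaText splits). This is a sketch, not the authors' exact training recipe.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "climatebert/distilroberta-base-climate-f"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="climatebert-climatext", num_train_epochs=3,
                           per_device_train_batch_size=16, evaluation_strategy="epoch"),
    train_dataset=train_ds.map(tokenize, batched=True),
    eval_dataset=eval_ds.map(tokenize, batched=True),
)
trainer.train()
```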
Objective: To develop a natural language processing (NLP) system to extract medications and contextual information that help understand drug changes. This project is part of the 2022 n2c2 challenge. Materials and methods: We developed NLP systems for medication mention extraction, event classification (indicating whether a medication change is discussed), and context classification, which classifies the context of medication changes into 5 orthogonal dimensions related to drug changes. We explored 6 state-of-the-art pretrained transformer models for the three subtasks, including GatorTron, a large language model pretrained using >90 billion words of text (including >80 billion words from >290 million clinical notes identified at the University of Florida Health). We evaluated our NLP systems using annotated data and evaluation scripts provided by the 2022 n2c2 organizers. Results: Our GatorTron models achieved the best F1-scores of 0.9828 for medication extraction (ranked 3rd), 0.9379 for event classification (ranked 2nd), and the best micro-average accuracy of 0.9126 for context classification. GatorTron outperformed existing transformer models pretrained using smaller general English text and clinical text corpora, indicating the advantage of large language models. Conclusion: This study demonstrated the advantage of using large transformer models for contextual medication information extraction from clinical narratives.
https://arxiv.org/abs/2303.08259
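For illustration, the medication mention extraction subtask can be framed as BIO token classification on top of a GatorTron encoder. The sketch below assumes the public `UFNLP/gatortron-base` checkpoint and an illustrative three-tag label set; the untrained classification head would of course need fine-tuning on the n2c2 annotations before its predictions are meaningful.

```python
# Assumed checkpoint and illustrative BIO tags; this demonstrates the wiring only.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-Drug", "I-Drug"]  # illustrative, not the n2c2 annotation scheme
tok = AutoTokenizer.from_pretrained("UFNLP/gatortron-base")
model = AutoModelForTokenClassification.from_pretrained(
    "UFNLP/gatortron-base", num_labels=len(LABELS))

note = "Patient was started on metformin 500 mg twice daily."
enc = tok(note, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1).squeeze(0)   # one tag id per wordpiece
for token, tag_id in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), pred):
    print(token, LABELS[int(tag_id)])
```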
Text classification methods have been widely investigated as a way to detect content of low credibility: fake news, social media bots, propaganda, etc. Quite accurate models (likely based on deep neural networks) help in moderating public electronic platforms and often cause content creators to face rejection of their submissions or removal of already published texts. Having the incentive to evade further detection, content creators try to come up with a slightly modified version of the text (known as an attack with an adversarial example) that exploits the weaknesses of classifiers and results in a different output. Here we introduce BODEGA: a benchmark for testing both victim models and attack methods on four misinformation detection tasks in an evaluation framework designed to simulate real use cases of content moderation. We also systematically test the robustness of popular text classifiers against available attacking techniques and discover that, indeed, in some cases seemingly insignificant changes in input text can mislead the models. We openly share the BODEGA code and data in the hope of enhancing the comparability and replicability of further research in this area.
https://arxiv.org/abs/2303.08032
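To make the attack setting concrete, here is a toy version of an adversarial-example search against a victim classifier: greedily substituting single words until the predicted label flips. `victim_predict` and `synonyms` are placeholders; the attack methods benchmarked in BODEGA add semantic-similarity constraints and stronger search.

```python
def greedy_word_attack(text, victim_predict, synonyms):
    """Toy adversarial attack: try single-word substitutions until the label flips.

    victim_predict: str -> label (the classifier under attack, a placeholder)
    synonyms:       str -> list of candidate replacement words (a placeholder)
    """
    words = text.split()
    original_label = victim_predict(text)
    for i, word in enumerate(words):
        for candidate in synonyms(word):
            trial = " ".join(words[:i] + [candidate] + words[i + 1:])
            if victim_predict(trial) != original_label:
                return trial  # adversarial example found
    return None  # no single-substitution attack succeeded
```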
This paper introduces a novel mechanism to obtain the optimal parameters of a deep learning model using the Bees Algorithm, a recent and promising swarm-intelligence algorithm. The optimization problem is to maximize the accuracy of classifying ailments from medical text, starting from initial hyper-parameters that are adjusted over a fixed number of iterations. Experiments included two different datasets: English and Arabic. The highest accuracy achieved is 99.63% on the English dataset using Long Short-Term Memory (LSTM) along with the Bees Algorithm, and 88% on the Arabic dataset using AraBERT.
https://arxiv.org/abs/2303.08021
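A compact sketch of how the Bees Algorithm can drive hyper-parameter search, assuming user-supplied `evaluate` (train a model, return validation accuracy), `sample` (draw random hyper-parameters) and `neighbor` (local perturbation) functions. The population sizes are illustrative, not the paper's settings.

```python
def bees_algorithm(evaluate, sample, neighbor, n_scouts=10, n_best=3,
                   n_recruits=5, iterations=20):
    """Bees Algorithm sketch for hyper-parameter optimization.

    evaluate: params -> fitness (e.g. validation accuracy of an LSTM classifier)
    sample:   ()     -> random hyper-parameter setting (a scout bee's site)
    neighbor: params -> nearby hyper-parameter setting (flower-patch search)
    """
    sites = [sample() for _ in range(n_scouts)]
    for _ in range(iterations):
        sites.sort(key=evaluate, reverse=True)
        new_sites = []
        for site in sites[:n_best]:  # local search around the best sites
            recruits = [neighbor(site) for _ in range(n_recruits)] + [site]
            new_sites.append(max(recruits, key=evaluate))
        # remaining bees scout new random sites (global exploration)
        new_sites += [sample() for _ in range(n_scouts - n_best)]
        sites = new_sites
    return max(sites, key=evaluate)
```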
Identifying words that impact a task's performance more than others is a challenge in natural language processing. Transformer models have recently addressed this issue by incorporating an attention mechanism that assigns greater attention (i.e., relevance) scores to some words than others. Because of the attention mechanism's high computational cost, transformer models usually have an input-length limitation caused by hardware constraints. This limitation applies to many transformers, including the well-known bidirectional encoder representations from transformers (BERT) model. In this paper, we examined BERT's attention assignment mechanism, focusing on two questions: (1) How can attention be employed to reduce input length? (2) How can attention be used as a control mechanism for conditional text generation? We investigated these questions in the context of a text classification task. We discovered that BERT's early layers assign more critical attention scores for text classification tasks than its later layers. We demonstrated that the first layer's attention sums can be used to filter tokens in a given sequence, considerably decreasing the input length while maintaining good test accuracy. We also applied filtering based on a compute-efficient semantic-similarity algorithm and discovered that retaining approximately 6% of the original sequence is sufficient to obtain 86.5% accuracy. Finally, we showed that we could generate data in a stable manner, indistinguishable from the original, by using only a small percentage (10%) of the tokens with high attention scores according to BERT's first layer.
https://arxiv.org/abs/2303.07585
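The first-layer filtering idea can be reproduced in a few lines with Hugging Face transformers: sum the attention each token receives in layer one and keep only the top-scoring tokens. A minimal sketch, assuming `bert-base-uncased` and a 10% keep ratio:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

def filter_by_first_layer_attention(text, keep_ratio=0.1):
    """Keep the tokens that receive the most summed attention in BERT's first layer."""
    enc = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        attn = model(**enc).attentions[0]          # layer 1: (1, heads, seq, seq)
    received = attn.sum(dim=(1, 2)).squeeze(0)     # total attention received per token
    k = max(1, int(keep_ratio * received.numel()))
    keep = received.topk(k).indices.sort().values  # preserve original token order
    return tok.decode(enc["input_ids"][0, keep])
```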
The use of transfer learning methods is largely responsible for the present breakthrough in Natural Language Processing (NLP) tasks across multiple domains. In order to solve the problem of sentiment detection, we examined the performance of four well-known state-of-the-art transformer models for text classification: Bidirectional Encoder Representations from Transformers (BERT), the Robustly Optimized BERT Pre-training Approach (RoBERTa), a distilled version of BERT (DistilBERT), and a large bidirectional neural network architecture (XLNet). The performance of the four models used to detect disaster in text was compared. All the models performed well enough, indicating that transformer-based models are suitable for the detection of disaster in text. The RoBERTa transformer model performs best on the test dataset with a score of 82.6% and is highly recommended for quality predictions. Furthermore, we discovered that the learning algorithms' performance was influenced by the pre-processing techniques, the nature of words in the vocabulary, unbalanced labeling, and the model parameters.
https://arxiv.org/abs/2303.07292
This case study investigates the task of job classification in a real-world setting, where the goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position. We explore multiple approaches to text classification, including supervised methods ranging from traditional models such as Support Vector Machines (SVMs) to state-of-the-art deep learning methods such as DeBERTa. We compare them with Large Language Models (LLMs) used in both few-shot and zero-shot classification settings. To accomplish this task, we employ prompt engineering, a technique that involves designing prompts to guide the LLMs towards the desired output. Specifically, we evaluate the performance of two commercially available state-of-the-art GPT-3.5-based language models, text-davinci-003 and gpt-3.5-turbo. We also conduct a detailed analysis of the impact of different aspects of prompt engineering on the model's performance. Our results show that, with a well-designed prompt, a zero-shot gpt-3.5-turbo classifier outperforms all other models, achieving a 6% increase in Precision@95% Recall compared to the best supervised approach. Furthermore, we observe that the wording of the prompt is a critical factor in eliciting the appropriate "reasoning" in the model, and that seemingly minor aspects of the prompt significantly affect the model's performance.
https://arxiv.org/abs/2303.07142
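A hedged sketch of the zero-shot prompting setup with the OpenAI client: a designed prompt steers gpt-3.5-turbo toward a one-word binary judgement. The prompt wording is illustrative only; as the study stresses, the exact wording materially affects performance.

```python
# Illustrative prompt, not the paper's tuned prompt. Requires OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_entry_level(job_posting: str) -> bool:
    prompt = (
        "You are a careful recruitment analyst. Decide whether the following "
        "job posting is appropriate for a graduate or entry-level candidate. "
        "Answer with exactly one word: YES or NO.\n\n" + job_posting
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the classification output as deterministic as possible
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")
```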
State-sponsored trolls are the main actors of influence campaigns on social media, and automatic troll detection is important to combat misinformation at scale. Existing troll detection models are developed based on training data for known campaigns (e.g., the influence campaign by Russia's Internet Research Agency on the 2016 US Election), and they fall short when dealing with novel campaigns with new targets. We propose MetaTroll, a text-based troll detection model based on the meta-learning framework that enables high portability and parameter-efficient adaptation to new campaigns using only a handful of labelled samples for few-shot transfer. We introduce campaign-specific transformer adapters to MetaTroll to "memorise" campaign-specific knowledge so as to tackle catastrophic forgetting, where a model "forgets" how to detect trolls from older campaigns due to continual adaptation. Our experiments demonstrate that MetaTroll substantially outperforms baselines and state-of-the-art few-shot text classification models. Lastly, we explore simple approaches to extend MetaTroll to multilingual and multimodal detection. Source code for MetaTroll is available at: this https URL.
https://arxiv.org/abs/2303.07354
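The campaign-specific adapters can be pictured as standard bottleneck adapters: small residual modules that are the only parameters updated when adapting to a new campaign, while the transformer backbone stays frozen. An illustrative re-implementation of such a layer (not MetaTroll's actual code):

```python
import torch.nn as nn

class CampaignAdapter(nn.Module):
    """Bottleneck adapter: the only parameters updated per campaign, letting the
    frozen transformer "memorise" campaign-specific knowledge without overwriting
    what was learned for older campaigns. Dimensions are illustrative."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # residual connection keeps the frozen backbone's representation intact
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```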
ChatGPT has shown strong capabilities in natural language generation tasks, which naturally leads researchers to explore where its abilities end. In this paper, we examine whether ChatGPT can be used for zero-shot text classification, more specifically, automatic genre identification. We compare ChatGPT with a multilingual XLM-RoBERTa language model that was fine-tuned on datasets manually annotated with genres. The models are compared on test sets in two languages: English and Slovenian. Results show that ChatGPT outperforms the fine-tuned model when applied to a dataset that neither model had seen before. Even when applied to Slovenian, an under-resourced language, ChatGPT's performance is no worse than when applied to English. However, if the model is fully prompted in Slovenian, the performance drops significantly, showing the current limitations of using ChatGPT on smaller languages. The presented results lead us to question whether this is the beginning of the end of laborious manual annotation campaigns, even for smaller languages such as Slovenian.
https://arxiv.org/abs/2303.03953
Deep neural networks based on layer-stacking architectures have historically suffered from poor inherent interpretability. Meanwhile, symbolic probabilistic models function with clear interpretability, but how to combine them with neural networks to enhance their performance remains to be explored. In this paper, we try to marry these two systems for text classification via a structured language model. We propose a Symbolic-Neural model that can learn to explicitly predict class labels of text spans from a constituency tree without requiring any access to span-level gold labels. As the structured language model learns to predict constituency trees in a self-supervised manner, only raw texts and sentence-level labels are required as training data, which makes it essentially a general constituent-level self-interpretable classification model. Our experiments demonstrate that our approach could achieve good prediction accuracy in downstream tasks. Meanwhile, the predicted span labels are consistent with human rationales to a certain degree.
https://arxiv.org/abs/2303.02860
Data augmentation has proven widely effective in computer vision. In Natural Language Processing (NLP), data augmentation remains an area of active research. There is no widely accepted augmentation technique that works well across tasks and model architectures. In this paper we explore data augmentation techniques in the context of text classification using two social media datasets. We explore popular varieties of data augmentation, starting with oversampling, Easy Data Augmentation (Wei and Zou, 2019) and Back-Translation (Sennrich et al., 2015). We also consider Greyscaling, a relatively unexplored data augmentation technique that seeks to mitigate the intensity of adjectives in examples. Finally, we consider a few-shot learning approach: Pattern-Exploiting Training (PET) (Schick et al., 2020). For the experiments we use a BERT transformer architecture. Results show that augmentation techniques provide only minimal and inconsistent improvements. Synonym replacement provided evidence of some performance improvement, and adjective scaling with Greyscaling is an area where further exploration would be valuable. Few-shot learning experiments show consistent improvement over supervised training and seem very promising when classes are easily separable, though further exploration would be valuable.
https://arxiv.org/abs/2303.02198
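As a concrete example of the simplest technique in the comparison, here is an EDA-style synonym-replacement augmenter built on WordNet; the parameters and behaviour are illustrative rather than the exact EDA recipe.

```python
# EDA-style synonym replacement sketch; requires nltk.download("wordnet") once.
import random
from nltk.corpus import wordnet

def synonym_replacement(sentence: str, n: int = 2) -> str:
    """Replace up to n words with a random WordNet synonym, producing a new example."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    replaced = 0
    for i in candidates:
        lemmas = {lemma.name().replace("_", " ")
                  for syn in wordnet.synsets(words[i]) for lemma in syn.lemmas()}
        lemmas.discard(words[i])          # never "replace" a word with itself
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
            replaced += 1
        if replaced >= n:
            break
    return " ".join(words)
```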
ChatGPT has shown the potential of emerging general artificial intelligence capabilities, as it has demonstrated competent performance across many natural language processing tasks. In this work, we evaluate the capabilities of ChatGPT to perform text classification on three affective computing problems, namely, big-five personality prediction, sentiment analysis, and suicide tendency detection. We utilise three baselines: a robust language model (RoBERTa-base), a legacy word model with pretrained embeddings (Word2Vec), and a simple bag-of-words baseline (BoW). Results show that a RoBERTa model trained for a specific downstream task generally has superior performance. On the other hand, ChatGPT provides decent results that are relatively comparable to the Word2Vec and BoW baselines. ChatGPT further shows robustness against noisy data, where Word2Vec models achieve worse results due to noise. Results indicate that ChatGPT is a good generalist model that is capable of achieving good results across various problems without any specialised training; however, it is not as good as a model specialised for a downstream task.
https://arxiv.org/abs/2303.03186
Style analysis, a relatively less explored topic, enables several interesting applications. For instance, it allows authors to adjust their writing style to produce a more coherent document in collaboration. Similarly, style analysis can also be used as a primary step for document provenance and authentication. In this paper, we propose an ensemble-based text-processing framework for the classification of single- and multi-authored documents, which is one of the key tasks in style analysis. The proposed framework incorporates several state-of-the-art text classification algorithms, including classical Machine Learning (ML) algorithms, transformers, and deep learning algorithms, both individually and in merit-based late fusion. For the merit-based late fusion, we employed several weight optimization and selection methods to assign merit-based weights to the individual text classification algorithms. We also analyze the impact of characters that are usually excluded during pre-processing in NLP applications, by conducting experiments on both clean and un-cleaned data. The proposed framework is evaluated on a large-scale benchmark dataset, significantly improving performance over the existing solutions.
https://arxiv.org/abs/2303.01197
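One simple instance of merit-based late fusion is to weight each classifier's class probabilities by its validation accuracy (its "merit") and average; the paper explores several weight optimization and selection methods, of which this is only the most basic. A sketch:

```python
import numpy as np

def merit_based_late_fusion(val_probs, val_labels, test_probs):
    """Merit-based late fusion sketch.

    val_probs/test_probs: lists of (n_samples, n_classes) probability arrays,
    one per base classifier. Each classifier's merit is its validation accuracy;
    normalised merits weight the averaged probabilities at test time.
    """
    merits = np.array([(p.argmax(axis=1) == val_labels).mean() for p in val_probs])
    weights = merits / merits.sum()              # merit-based fusion weights
    fused = sum(w * p for w, p in zip(weights, test_probs))
    return fused.argmax(axis=1)                  # final ensemble predictions
```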
Large pre-trained language models help to achieve state-of-the-art results on a variety of natural language processing (NLP) tasks; nevertheless, they still suffer from forgetting when incrementally learning a sequence of tasks. To alleviate this problem, recent works enhance existing models with sparse experience replay and local adaption, which yield satisfactory performance. However, in this paper we find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay. To verify BERT's ability to maintain old knowledge, we adopt and re-finetune single-layer probe networks with the parameters of BERT fixed. We investigate the models on two types of NLP tasks, text classification and extractive question answering. Our experiments reveal that BERT can actually generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay. We further introduce a series of novel methods to interpret the mechanism of forgetting and how memory rehearsal plays a significant role in task-incremental learning, which bridges the gap between our new discovery and previous studies on catastrophic forgetting.
https://arxiv.org/abs/2303.01081
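The probing setup is straightforward to reproduce: freeze every BERT parameter and train only a single linear layer on top of the [CLS] representation. A minimal PyTorch sketch, assuming `bert-base-uncased`:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertProbe(nn.Module):
    """Single-layer probe over a frozen BERT, as in the probing setup above."""
    def __init__(self, num_classes: int, checkpoint: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        for p in self.bert.parameters():
            p.requires_grad = False               # BERT stays fixed across tasks
        self.probe = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():                     # only the probe receives gradients
            h = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.probe(h.last_hidden_state[:, 0])  # [CLS] representation
```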
Extreme multi-label text classification utilizes the label hierarchy to partition extreme labels into multiple label groups, turning the task into simple multi-group multi-label classification tasks. Current research encodes labels as fixed-length vectors, which requires establishing multiple classifiers for different label groups. The problem is how to build only one classifier without sacrificing the label relationships in the hierarchy. This paper adopts the multi-answer questioning task for extreme multi-label classification and also proposes an auxiliary classification evaluation metric. The proposed method and evaluation metric are applied to the legal domain, and the utilization of legal BERT models and the study of task distribution are discussed. The experiment results show that the proposed hierarchy and multi-answer questioning task can perform extreme multi-label classification on the EURLEX dataset. When fine-tuning the multi-label classification task, the domain-adapted BERT models could not show apparent advantages in this experiment. The method is also theoretically applicable to zero-shot learning.
https://arxiv.org/abs/2303.01064
In the field of dream research, the study of dream content typically relies on the analysis of verbal reports provided by dreamers upon awakening from their sleep. This task is classically performed through manual scoring by trained annotators, at great time expense. While a consistent body of work suggests that natural language processing (NLP) tools can support the automatic analysis of dream reports, proposed methods lacked the ability to reason over a report's full context and required extensive data pre-processing. Furthermore, in most cases, these methods were not validated against standard manual scoring approaches. In this work, we address these limitations by adopting large language models (LLMs) to study and replicate the manual annotation of dream reports, using a mixture of off-the-shelf and bespoke approaches, with a focus on references to the reports' emotions. Our results show that the off-the-shelf method achieves low performance, probably in light of inherent linguistic differences between reports collected from different (groups of) individuals. On the other hand, the proposed bespoke text classification method achieves high performance, which is robust against potential biases. Overall, these observations indicate that our approach could find application in the analysis of large dream datasets and may favour reproducibility and comparability of results across studies.
https://arxiv.org/abs/2302.14828
Text classification is an important task in Natural Language Processing (NLP), where the goal is to categorize text data into predefined classes. In this study, we analyse the dataset creation steps and evaluation techniques of a multi-label news categorisation task as part of text classification. We first present a newly obtained dataset for Uzbek text classification, which was collected from 10 different news and press websites and covers 15 categories of news, press and law texts. We also present a comprehensive evaluation of different models, ranging from traditional bag-of-words models to deep learning architectures, on this newly created dataset. Our experiments show that the Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) based models outperform the rule-based models. The best performance is achieved by the BERTbek model, a transformer-based BERT model trained on an Uzbek corpus. Our findings provide a good baseline for further research in Uzbek text classification.
https://arxiv.org/abs/2302.14494
Increasingly larger datasets have become a standard ingredient for advancing the state of the art in NLP. However, data quality might have already become the bottleneck to unlocking further gains. Given the diversity and the sizes of modern datasets, standard data filtering is not straightforward to apply, because of the multifacetedness of the harmful data and the elusiveness of filtering rules that would generalize across multiple tasks. We study the fitness of task-agnostic self-influence scores of training examples for data cleaning, analyze their efficacy in capturing naturally occurring outliers, and investigate to what extent self-influence-based data cleaning can improve downstream performance in machine translation, question answering and text classification, building on recent approaches to self-influence calculation and automated curriculum learning.
https://arxiv.org/abs/2302.13959
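Self-influence can be approximated in the TracIn style as the squared norm of an example's loss gradient: examples whose gradients are unusually large are flagged as potential outliers. A single-checkpoint sketch (the cited approaches aggregate over several training checkpoints):

```python
import torch

def self_influence(model, loss_fn, x, y):
    """TracIn-style self-influence sketch: squared gradient norm of one example's loss.

    Assumes model(x) returns logits compatible with loss_fn. High scores flag
    likely outliers or noisy examples; a full implementation would sum this
    quantity over multiple training checkpoints rather than use just one.
    """
    model.zero_grad()
    loss = loss_fn(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return sum(g.pow(2).sum() for g in grads).item()
```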