This study is part of the debate on the efficiency of large versus small language models for text classification by prompting. We assess the performance of small language models in zero-shot text classification, challenging the prevailing dominance of large models. Across 15 datasets, our investigation benchmarks language models from 77M to 40B parameters using different architectures and scoring functions. Our findings reveal that small models can effectively classify texts, performing on par with or surpassing their larger counterparts. We developed and shared a comprehensive open-source repository that encapsulates our methodologies. This research underscores the notion that bigger isn't always better, suggesting that resource-efficient small models may offer viable solutions for specific data classification challenges.
https://arxiv.org/abs/2404.11122
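The prompting setup such a benchmark evaluates can be sketched minimally: each candidate label is verbalized into a prompt, each prompt is scored, and the highest-scoring label wins. The template and the toy keyword scorer below are illustrative assumptions, not the paper's actual scoring functions; a real run would score prompts with an LM's (length-normalized) token log-probabilities.

```python
def classify_zero_shot(text, labels, logprob):
    """Score one verbalized prompt per candidate label; return the best label."""
    template = "Text: {t}\nThis text is about {l}."
    return max(labels, key=lambda l: logprob(template.format(t=text, l=l)))

# Toy stand-in for an LM scorer: counts topic keywords found in the text.
KEYWORDS = {"sports": {"team", "match"}, "politics": {"vote", "election"}}

def toy_logprob(prompt):
    text, hypothesis = prompt.lower().split("\n", 1)
    label = hypothesis.rsplit("about ", 1)[1].rstrip(".")
    return sum(kw in text for kw in KEYWORDS[label])
```

Swapping `toy_logprob` for a real model scorer leaves `classify_zero_shot` unchanged, which is what makes comparing scoring functions across model sizes straightforward.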
In this paper, we aim to generate text classification data given arbitrary class definitions (i.e., user instruction), so one can train a small text classifier without any human annotation or raw corpus. Compared with pioneering attempts, our proposed Incubator is the first framework that can handle complicated and even mutually dependent classes (e.g., "TED Talk given by Educator" and "Other"). Specifically, Incubator is an LLM first tuned on the instruction-to-data mappings that we obtained from classification datasets and descriptions on HuggingFace, together with in-context augmentation by GPT-4. We then refine Incubator by learning on the cluster centers of semantic textual embeddings to emphasize the uniformity and semantic diversity in generations. We compare Incubator on various classification tasks with strong baselines such as direct LLM-based inference and training data generation by prompt engineering. Experiments show Incubator is able to (1) perform well on traditional benchmarks, (2) take label dependency and user preference into consideration, and (3) enable logical text mining by incubating multiple classifiers.
https://arxiv.org/abs/2404.10877
Researchers must stay current in their fields by regularly reviewing academic literature, a task complicated by the daily publication of thousands of papers. Traditional multi-label text classification methods often ignore semantic relationships and fail to address the inherent class imbalances. This paper introduces a novel approach using the SciBERT model and CNNs to systematically categorize academic abstracts from the Elsevier OA CC-BY corpus. We use a multi-segment input strategy that processes abstracts, body text, titles, and keywords obtained via BERT topic modeling through SciBERT. Here, the [CLS] token embeddings capture the contextual representation of each segment, concatenated and processed through a CNN. The CNN uses convolution and pooling to enhance feature extraction and reduce dimensionality, optimizing the data for classification. Additionally, we incorporate class weights based on label frequency to address the class imbalance, significantly improving the classification F1 score and enhancing text classification systems and literature review efficiency.
https://arxiv.org/abs/2404.13078
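The frequency-based class weighting described above can be sketched as follows. This is a generic inverse-frequency scheme (normalized so the weights average 1.0), a plausible reading of the paper's weighting step rather than its exact formula.

```python
from collections import Counter

def class_weights(label_occurrences):
    """Inverse-frequency class weights, normalized to average 1.0.

    `label_occurrences` is a flat list with one entry per label assignment
    in the training set, so multi-label documents contribute several entries.
    Rare classes receive proportionally larger weights.
    """
    counts = Counter(label_occurrences)
    n, k = len(label_occurrences), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

Such weights are typically passed to the loss function so that errors on rare classes cost more, which is what lifts the macro-averaged F1 on imbalanced corpora.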
In this paper, we introduce an algorithm for data quantization based on the principles of Kashin representation. This approach hinges on decomposing any given vector, matrix, or tensor into two factors. The first factor maintains a small infinity norm, while the second exhibits a similarly constrained norm when multiplied by an orthogonal matrix. Surprisingly, the entries of factors after decomposition are well-concentrated around several peaks, which allows us to efficiently replace them with corresponding centroids for quantization purposes. We study the theoretical properties of the proposed approach and rigorously evaluate our compression algorithm in the context of next-word prediction tasks and on a set of downstream tasks for text classification. Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance while ensuring data compression, marking a significant advancement in the field of data quantization.
https://arxiv.org/abs/2404.09737
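The final step, replacing well-concentrated factor entries with centroids, can be illustrated with a plain 1-D k-means. This sketch covers only that centroid-quantization step on a hypothetical list of values; it does not implement the Kashin decomposition that produces the concentrated factors.

```python
def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means: returns k centroids for a list of floats (k >= 2)."""
    s = sorted(values)
    # Seed centroids spread evenly across the sorted values.
    centroids = [s[(i * (len(s) - 1)) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda i: abs(v - centroids[i]))
            buckets[j].append(v)
        centroids = [sum(b) / len(b) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return centroids

def quantize(values, centroids):
    """Replace each value with its nearest centroid."""
    return [min(centroids, key=lambda c: abs(v - c)) for v in values]
```

When the entries really do cluster around a few peaks, storing one centroid index per entry plus the short centroid table is what yields the compression.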
Text classification systems have continuously improved in performance over the years. However, nearly all current SOTA classifiers share a similar shortcoming: they process text in a horizontal manner. Vertically written words will not be recognized by a classifier. In contrast, humans are easily able to recognize and read words written both horizontally and vertically. Hence, a human adversary could write problematic words vertically and the meaning would still be preserved to other humans. We simulate such an attack, VertAttack. VertAttack identifies which words a classifier is reliant on and then rewrites those words vertically. We find that VertAttack is able to greatly drop the accuracy of 4 different transformer models on 5 datasets. For example, on the SST2 dataset, VertAttack is able to drop RoBERTa's accuracy from 94% to 13%. Furthermore, since VertAttack does not replace the word, meaning is easily preserved. We verify this via a human study and find that crowdworkers are able to correctly label 77% of perturbed texts, compared to 81% of the original texts. We believe VertAttack offers a look into how humans might circumvent classifiers in the future and thus inspires work on more robust algorithms.
https://arxiv.org/abs/2404.08538
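The core perturbation is simple to sketch: chosen words are rewritten with one character per line while the rest of the text is untouched. In the real attack the target words are the ones the classifier is found to rely on; here they are simply given as an argument.

```python
def vertical_rewrite(text, targets):
    """Rewrite each target word vertically (one character per line),
    leaving all other words intact, in the spirit of VertAttack."""
    out = []
    for word in text.split():
        out.append("\n".join(word) if word.lower() in targets else word)
    return " ".join(out)
```

A human reading the output still recovers the original word by scanning downward, while a tokenizer sees a run of single-character tokens.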
Popular zero-shot models suffer due to artifacts inherited from pretraining. A particularly detrimental artifact, caused by unbalanced web-scale pretraining data, is mismatched label distribution. Existing approaches that seek to repair the label distribution are not suitable in zero-shot settings, as they have incompatible requirements such as access to labeled downstream task data or knowledge of the true label balance in the pretraining distribution. We sidestep these challenges and introduce a simple and lightweight approach to adjust pretrained model predictions via optimal transport. Our technique requires only an estimate of the label distribution of a downstream task. Theoretically, we characterize the improvement produced by our procedure under certain mild conditions and provide bounds on the error caused by misspecification. Empirically, we validate our method in a wide array of zero-shot image and text classification tasks, improving accuracy by 4.8% and 15.9% on average, and beating baselines like Prior Matching -- often by significant margins -- in 17 out of 21 datasets.
https://arxiv.org/abs/2404.08461
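The adjustment can be illustrated with a Sinkhorn-style iterative proportional fitting: rescale the matrix of predicted class probabilities so each row still sums to 1 (one prediction per example) while the column sums match the estimated label distribution. This is a sketch of the general idea under those assumptions, not the paper's exact optimal-transport formulation.

```python
def rebalance(probs, target, iters=200):
    """Sinkhorn-style rescaling of an n x k matrix of class probabilities
    so column sums approach n * target while row sums stay near 1.

    `target` is the estimated downstream label distribution (sums to 1);
    all entries of `probs` must be strictly positive.
    """
    n = len(probs)
    P = [row[:] for row in probs]
    for _ in range(iters):
        for row in P:                       # row step: each example sums to 1
            s = sum(row)
            for j in range(len(row)):
                row[j] /= s
        col = [sum(P[i][j] for i in range(n)) for j in range(len(target))]
        for i in range(n):                  # column step: match label balance
            for j in range(len(target)):
                P[i][j] *= n * target[j] / col[j]
    return P
```

Taking the argmax of the rebalanced rows then yields predictions whose label frequencies respect the estimate, which is where the accuracy gains over raw zero-shot predictions come from.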
Learning an effective representation in multi-label text classification (MLTC) is a significant challenge in NLP. This challenge arises from the inherent complexity of the task, which is shaped by two key factors: the intricate connections between labels and the widespread long-tailed distribution of the data. To overcome this issue, one potential approach involves integrating supervised contrastive learning with classical supervised loss functions. Although contrastive learning has shown remarkable performance in multi-class classification, its impact in the multi-label framework has not been thoroughly investigated. In this paper, we conduct an in-depth study of supervised contrastive learning and its influence on representation in the MLTC context. We emphasize the importance of considering long-tailed data distributions to build a robust representation space, which effectively addresses two critical challenges associated with contrastive learning that we identify: the "lack of positives" and the "attraction-repulsion imbalance". Building on this insight, we introduce a novel contrastive loss function for MLTC. It attains Micro-F1 scores that either match or surpass those obtained with other frequently employed loss functions, and demonstrates a significant improvement in Macro-F1 scores across three multi-label datasets.
https://arxiv.org/abs/2404.08720
We present Sequence Salience, a visual tool for interactive prompt debugging with input salience methods. Sequence Salience builds on widely used salience methods for text classification and single-token prediction, and extends this to a system tailored for debugging complex LLM prompts. Our system is well-suited for long texts, and expands on previous work by 1) providing controllable aggregation of token-level salience to the word, sentence, or paragraph level, making salience over long inputs tractable; and 2) supporting rapid iteration where practitioners can act on salience results, refine prompts, and run salience on the new output. We include case studies showing how Sequence Salience can help practitioners work with several complex prompting strategies, including few-shot, chain-of-thought, and constitutional principles. Sequence Salience is built on the Learning Interpretability Tool, an open-source platform for ML model visualizations, and code, notebooks, and tutorials are available at http://goo.gle/sequence-salience.
https://arxiv.org/abs/2404.07498
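Point 1 above, controllable aggregation of token-level salience, can be sketched at the word level. The `##` continuation marker is a WordPiece-style assumption for illustration; the tool supports multiple aggregation levels and tokenizers.

```python
def aggregate_salience(tokens, scores, mode="sum"):
    """Merge subword-token salience scores into word-level scores.

    Tokens starting with '##' (WordPiece-style continuations) are folded
    into the preceding word; scores are combined by sum or max.
    """
    words, groups = [], []
    for tok, s in zip(tokens, scores):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
            groups[-1].append(s)
        else:
            words.append(tok)
            groups.append([s])
    reduce = max if mode == "max" else sum
    return list(zip(words, [reduce(g) for g in groups]))
```

The same grouping idea extends to sentence or paragraph level by merging on sentence boundaries instead of subword markers, which is what keeps salience readable over long prompts.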
Out-of-distribution (OOD) detection plays a crucial role in ensuring the safety and reliability of deep neural networks in various applications. While there has been a growing focus on OOD detection in visual data, the field of textual OOD detection has received less attention. Only a few attempts have been made to directly apply general OOD detection methods to natural language processing (NLP) tasks, without adequately considering the characteristics of textual data. In this paper, we delve into textual OOD detection with Transformers. We first identify a key problem prevalent in existing OOD detection methods: the biased representation learned through the maximization of the conditional likelihood $p(y\mid x)$ can potentially result in subpar performance. We then propose a novel variational inference framework for OOD detection (VI-OOD), which maximizes the likelihood of the joint distribution $p(x, y)$ instead of $p(y\mid x)$. VI-OOD is tailored for textual OOD detection by efficiently exploiting the representations of pre-trained Transformers. Through comprehensive experiments on various text classification tasks, VI-OOD demonstrates its effectiveness and wide applicability. Our code has been released at \url{this https URL}.
https://arxiv.org/abs/2404.06217
Data analysis and machine learning are of preeminent importance in the legal domain, especially in tasks like clustering and text classification. In this study, we harnessed the power of natural language processing tools to enhance datasets meticulously curated by experts. This process significantly improved the classification workflow for legal texts using machine learning techniques. We considered the Sustainable Development Goals (SDGs) data from the United Nations 2030 Agenda as a practical case study. A clustering-based data augmentation strategy led to remarkable enhancements in the accuracy and sensitivity metrics of classification models. For certain SDGs within the 2030 Agenda, we observed performance gains of over 15%. In some cases, the example base expanded by a noteworthy factor of 5. When dealing with unclassified legal texts, data augmentation strategies centered around clustering prove to be highly effective. They provide a valuable means to expand the existing knowledge base without the need for labor-intensive manual classification efforts.
https://arxiv.org/abs/2404.08683
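One simple way to realize the cluster-based expansion described above is nearest-centroid label propagation: compute a centroid per class from the expert-labeled embeddings, then assign each unlabeled text the label of its closest centroid. The embeddings and class names below are hypothetical, and the paper's actual clustering pipeline may differ.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def propagate_labels(labeled, unlabeled):
    """Assign each unlabeled embedding the label of the nearest class centroid.

    `labeled` maps label -> list of embedding vectors; `unlabeled` is a list
    of embedding vectors to annotate automatically.
    """
    centroids = {lab: centroid(vecs) for lab, vecs in labeled.items()}
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda lab: sqdist(v, centroids[lab]))
            for v in unlabeled]
```

This is how the example base can grow severalfold without manual annotation: every unclassified text near an existing cluster inherits that cluster's label.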
For assessing various performance indicators of companies, the focus is shifting from strictly financial (quantitative) publicly disclosed information to qualitative (textual) information. This textual data can provide valuable weak signals, for example through stylistic features, which can complement the quantitative data on financial performance or on Environmental, Social and Governance (ESG) criteria. In this work, we use various multi-task learning methods for financial text classification with the focus on financial sentiment, objectivity, forward-looking sentence prediction and ESG-content detection. We propose different methods to combine the information extracted from training jointly on different tasks; our best-performing method highlights the positive effect of explicitly adding auxiliary task predictions as features for the final target task during the multi-task training. Next, we use these classifiers to extract textual features from annual reports of FTSE350 companies and investigate the link between ESG quantitative scores and these features.
https://arxiv.org/abs/2404.05281
NLP models play a pivotal role in various real-world applications such as machine translation, sentiment analysis, and question answering, facilitating efficient communication and decision-making in domains ranging from healthcare to finance. However, text adversarial attacks pose a significant challenge to the robustness of these models: they deliberately manipulate input text to mislead a model's predictions while maintaining human interpretability. Despite the remarkable performance achieved by state-of-the-art models like BERT in various natural language processing tasks, they remain vulnerable to adversarial perturbations in the input text. To address the vulnerability of text classifiers to adversarial attacks, this paper explores three distinct attack mechanisms against the victim model BERT: the BERT-on-BERT attack, the PWWS attack, and Fraud Bargain's Attack (FBA). Leveraging the IMDB, AG News, and SST2 datasets, we conduct a thorough comparative analysis of the effectiveness of these attacks on the BERT classifier. The analysis reveals PWWS as the most potent adversary: it consistently outperforms the other methods across multiple evaluation scenarios, demonstrating lower runtime, higher accuracy, and favorable semantic similarity scores, which underscores its efficacy in generating adversarial examples for text classification. The key insight of this paper lies in the assessment of the relative performance of these three prevalent state-of-the-art attack mechanisms.
https://arxiv.org/abs/2404.05159
Saliency post-hoc explainability methods are important tools for understanding increasingly complex NLP models. While these methods can reflect the model's reasoning, they may not align with human intuition, making the explanations not plausible. In this work, we present a methodology for incorporating rationales, which are text annotations explaining human decisions, into text classification models. This incorporation enhances the plausibility of post-hoc explanations while preserving their faithfulness. Our approach is agnostic to model architectures and explainability methods. We introduce the rationales during model training by augmenting the standard cross-entropy loss with a novel loss function inspired by contrastive learning. By leveraging a multi-objective optimization algorithm, we explore the trade-off between the two loss functions and generate a Pareto-optimal frontier of models that balance performance and plausibility. Through extensive experiments involving diverse models, datasets, and explainability methods, we demonstrate that our approach significantly enhances the quality of model explanations without causing substantial (sometimes negligible) degradation in the original model's performance.
https://arxiv.org/abs/2404.03098
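The Pareto-frontier construction mentioned above can be sketched independently of the training loop: given candidate models scored on the two objectives (task performance and explanation plausibility, both higher-is-better), keep exactly those not dominated by any other candidate. The numeric scores below are hypothetical.

```python
def pareto_frontier(models):
    """Return the Pareto-optimal subset of (performance, plausibility) pairs.

    A pair is kept unless some other pair is at least as good on both axes
    (higher is better on both).
    """
    return [m for m in models
            if not any(o != m and o[0] >= m[0] and o[1] >= m[1] for o in models)]
```

In the paper's setting each candidate corresponds to one weighting of the cross-entropy and contrastive-style rationale losses; the frontier then exposes the attainable performance/plausibility trade-offs for the practitioner to choose from.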
Prompt-based methods have achieved promising results in most few-shot text classification tasks. However, for readability assessment tasks, traditional prompt methods lack crucial linguistic knowledge, which has already been proven to be essential. Moreover, previous studies on utilizing linguistic features have shown non-robust performance in few-shot settings and may even impair model performance. To address these issues, we propose a novel prompt-based tuning framework that incorporates rich linguistic knowledge, called Feature Prompt Tuning (FPT). Specifically, we extract linguistic features from the text and embed them into trainable soft prompts. Further, we devise a new loss function to calibrate the similarity ranking order between categories. Experimental results demonstrate that our proposed method FPT not only exhibits a significant performance improvement over the prior best prompt-based tuning approaches, but also surpasses the previous leading methods that incorporate linguistic features. Also, our proposed model significantly outperforms the large language model gpt-3.5-turbo-16k in most cases. Our proposed method establishes a new architecture for prompt tuning that sheds light on how linguistic features can be easily adapted to linguistic-related tasks.
https://arxiv.org/abs/2404.02772
Zero-Shot Cross-lingual Transfer (ZS-XLT) utilizes a model trained in a source language to make predictions in another language, often with a performance loss. To alleviate this, additional improvements can be achieved through subsequent adaptation using examples in the target language. In this paper, we exploit In-Context Tuning (ICT) for One-Shot Cross-lingual transfer in the classification task by introducing In-Context Cross-lingual Transfer (IC-XLT). The novel concept involves training a model to learn from context examples and subsequently adapting it during inference to a target language by prepending a One-Shot context demonstration in that language. Our results show that IC-XLT successfully leverages target-language examples to improve the cross-lingual capabilities of the evaluated mT5 model, outperforming prompt-based models in the Zero and Few-shot scenarios adapted through fine-tuning. Moreover, we show that when source-language data is limited, the fine-tuning framework employed for IC-XLT performs comparably to prompt-based fine-tuning with significantly more training data in the source language.
https://arxiv.org/abs/2404.02452
Large Language Models (LLMs) operating in 0-shot or few-shot settings achieve competitive results in Text Classification tasks. In-Context Learning (ICL) typically achieves better accuracy than the 0-shot setting, but at a cost in efficiency, due to the longer input prompt. In this paper, we propose a strategy to make LLMs as efficient as 0-shot text classifiers, while getting comparable or better accuracy than ICL. Our solution targets the low resource setting, i.e., when only 4 examples per class are available. Using a single LLM and few-shot real data we perform a sequence of generation, filtering and Parameter-Efficient Fine-Tuning steps to create a robust and efficient classifier. Experimental results show that our approach leads to competitive results on multiple text classification datasets.
https://arxiv.org/abs/2404.02422
Despite the extensive number of labeled datasets in the NLP text classification field, the persistent imbalance in data availability across various languages remains evident. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To the best of our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks -- toxicity classification, formality classification, and natural language inference -- providing the "recipe" for the optimal setups.
https://arxiv.org/abs/2404.02043
SemEval-2024 Task 8 provides a challenge to detect human-written and machine-generated text. There are 3 subtasks for different detection scenarios. This paper proposes a system that mainly deals with Subtask B. It aims to detect whether a given full text was written by a human or generated by a specific Large Language Model (LLM), which is in effect a multi-class text classification task. Our team AISPACE conducted a systematic study of fine-tuning transformer-based models, including encoder-only, decoder-only and encoder-decoder models. We compared their performance on this task and identified that encoder-only models performed exceptionally well. We also applied a weighted Cross Entropy loss function to address the imbalance across class samples. Additionally, we employed a soft-voting strategy over a multi-model ensemble to enhance the reliability of our predictions. Our system ranked first in Subtask B, which sets a state-of-the-art benchmark for this new challenge.
https://arxiv.org/abs/2404.00950
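The soft-voting step described above is a small, self-contained computation: average the per-class probabilities produced by each ensemble member and take the argmax. The class indices and probability values below are illustrative.

```python
def soft_vote(prob_lists):
    """Soft-voting ensemble: average per-class probabilities across models
    and return the index of the winning class."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])
```

Averaging probabilities rather than hard votes lets a confident minority model outweigh several uncertain ones, which is the usual motivation for soft over hard voting.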
The prompt-based learning paradigm has demonstrated remarkable efficacy in enhancing the adaptability of pretrained language models (PLMs), particularly in few-shot scenarios. However, this learning paradigm has been shown to be vulnerable to backdoor attacks. The current clean-label attack, which employs a specific prompt as the trigger, can succeed without external triggers while keeping poisoned samples correctly labeled, making it stealthier than the poisoned-label attack; on the other hand, it suffers from significant false activations and poses greater challenges, necessitating a higher poisoning rate. Using conventional negative data augmentation methods, we discovered that it is challenging to trade off effectiveness against stealthiness in a clean-label setting. To address this issue, we draw on the notion that a backdoor acts as a shortcut, and posit that this shortcut stems from the contrast between the trigger and the data utilized for poisoning. In this study, we propose a method named Contrastive Shortcut Injection (CSI), which leverages activation values and integrates trigger design and data selection strategies to craft stronger shortcut features. With extensive experiments on full-shot and few-shot text classification tasks, we empirically validate CSI's high effectiveness and high stealthiness at low poisoning rates. Notably, we found that the two strategies play the leading roles in the full-shot and few-shot settings, respectively.
https://arxiv.org/abs/2404.00461
Automatic legal text classification systems have been proposed in the literature to address knowledge extraction from judgments and detect their aspects. However, most of these systems are black boxes even when their models are interpretable. This may raise concerns about their trustworthiness. Accordingly, this work contributes a system combining Natural Language Processing (NLP) with Machine Learning (ML) to classify legal texts in an explainable manner. We analyze the features involved in each decision and the threshold bifurcation values along the decision paths of tree structures, and present this information to the users in natural language. This is the first work on automatic analysis of legal texts that combines NLP and ML with Explainable Artificial Intelligence techniques to automatically make the models' decisions understandable to end users. Furthermore, legal experts have validated our solution, and this knowledge has also been incorporated into the explanation process as "expert-in-the-loop" dictionaries. Experimental results on a data set annotated with law categories by jurisdiction demonstrate that our system yields competitive classification performance, with accuracy values well above 90%, and that its automatic explanations are easily understandable even to non-expert users.
https://arxiv.org/abs/2404.00437
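Narrating a decision path in natural language can be sketched with a toy tree of nested dicts. The feature names, thresholds, and labels below are hypothetical, and the real system works over trained tree models and enriches the wording with expert dictionaries; this only shows the path-to-sentence idea.

```python
def explain_path(tree, features):
    """Walk a decision tree (nested dicts with 'feature'/'threshold'/'left'/
    'right' internal nodes and 'label' leaves) and narrate each threshold
    comparison along the path in natural language."""
    steps = []
    node = tree
    while "label" not in node:
        name, thr = node["feature"], node["threshold"]
        val = features[name]
        if val <= thr:
            steps.append(f"{name} = {val} is at most {thr}")
            node = node["left"]
        else:
            steps.append(f"{name} = {val} exceeds {thr}")
            node = node["right"]
    steps.append(f"so the text is classified as '{node['label']}'")
    return "; ".join(steps)
```

Because every sentence corresponds to one bifurcation actually taken, the explanation stays faithful to the model's decision rather than being a post-hoc approximation.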