Zero-Shot Cross-lingual Transfer (ZS-XLT) utilizes a model trained in a source language to make predictions in another language, often with a performance loss. To alleviate this, additional improvements can be achieved through subsequent adaptation using examples in the target language. In this paper, we exploit In-Context Tuning (ICT) for One-Shot Cross-lingual transfer in the classification task by introducing In-Context Cross-lingual Transfer (IC-XLT). The novel concept involves training a model to learn from context examples and subsequently adapting it during inference to a target language by prepending a One-Shot context demonstration in that language. Our results show that IC-XLT successfully leverages target-language examples to improve the cross-lingual capabilities of the evaluated mT5 model, outperforming prompt-based models in the Zero and Few-shot scenarios adapted through fine-tuning. Moreover, we show that when source-language data is limited, the fine-tuning framework employed for IC-XLT performs comparably to prompt-based fine-tuning with significantly more training data in the source language.
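A minimal sketch of the inference-time step described above, assuming a simple `text: … label: …` serialization; the actual prompt format used in the paper may differ:

```python
def build_icxlt_input(demo_text, demo_label, query_text):
    """Prepend a one-shot target-language demonstration to the query input.

    The serialization below is a hypothetical format for illustration only.
    """
    demo = f"text: {demo_text} label: {demo_label}"
    return f"{demo} text: {query_text} label:"
```

The model is first trained (in the source language) to read such in-context demonstrations; at inference, swapping in a target-language demonstration adapts it one-shot without any gradient updates.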
https://arxiv.org/abs/2404.02452
Large Language Models (LLMs) operating in zero-shot or few-shot settings achieve competitive results in text classification tasks. In-Context Learning (ICL) typically achieves better accuracy than the zero-shot setting, but at a cost in efficiency due to the longer input prompt. In this paper, we propose a strategy to make LLMs as efficient as zero-shot text classifiers, while achieving comparable or better accuracy than ICL. Our solution targets the low-resource setting, i.e., when only 4 examples per class are available. Using a single LLM and few-shot real data, we perform a sequence of generation, filtering, and Parameter-Efficient Fine-Tuning steps to create a robust and efficient classifier. Experimental results show that our approach leads to competitive results on multiple text classification datasets.
https://arxiv.org/abs/2404.02422
Despite the extensive number of labeled datasets in the NLP text classification field, the persistent imbalance in data availability across languages remains evident. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To the best of our knowledge, there is a severe lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test these approaches on three text classification tasks -- toxicity classification, formality classification, and natural language inference -- providing the "recipe" for the optimal setups.
https://arxiv.org/abs/2404.02043
SemEval-2024 Task 8 presents a challenge of detecting human-written and machine-generated text, with 3 subtasks covering different detection scenarios. This paper proposes a system that mainly addresses Subtask B, which aims to detect whether a given full text was written by a human or generated by a specific Large Language Model (LLM) -- in effect, a multi-class text classification task. Our team AISPACE conducted a systematic study of fine-tuning transformer-based models, including encoder-only, decoder-only, and encoder-decoder models. We compared their performance on this task and found that encoder-only models performed exceptionally well. We also applied a weighted Cross-Entropy loss function to address the imbalance across class samples. Additionally, we employed a soft-voting strategy over a multi-model ensemble to enhance the reliability of our predictions. Our system ranked first in Subtask B, setting a state-of-the-art benchmark for this new challenge.
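The two techniques above can be sketched briefly. Inverse-frequency weighting is one common choice for a weighted cross-entropy loss and is an assumption here, not necessarily the exact weights the team used; soft voting simply averages the ensemble members' class probabilities:

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-class weights for a weighted cross-entropy loss: rarer classes weigh more.

    Assumes every class appears at least once in `labels`.
    """
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)

def soft_vote(prob_list):
    """Average class probabilities across ensemble members, then take the argmax."""
    return np.mean(prob_list, axis=0).argmax(axis=-1)
```

Soft voting preserves each model's confidence, which typically yields more reliable ensembles than hard (majority) voting on predicted labels alone.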
https://arxiv.org/abs/2404.00950
The prompt-based learning paradigm has demonstrated remarkable efficacy in enhancing the adaptability of pretrained language models (PLMs), particularly in few-shot scenarios. However, this learning paradigm has been shown to be vulnerable to backdoor attacks. The current clean-label attack, which employs a specific prompt as the trigger, succeeds without the need for external triggers and ensures that poisoned samples are correctly labeled, making it stealthier than the poisoned-label attack; on the other hand, it suffers from significant false activations and poses greater challenges, necessitating a higher poisoning rate. Using conventional negative data augmentation methods, we found it challenging to trade off effectiveness against stealthiness in the clean-label setting. To address this issue, we draw inspiration from the notion that a backdoor acts as a shortcut, and posit that this shortcut stems from the contrast between the trigger and the data used for poisoning. In this study, we propose Contrastive Shortcut Injection (CSI), a method that leverages activation values to integrate trigger design and data selection strategies and craft stronger shortcut features. Through extensive experiments on full-shot and few-shot text classification tasks, we empirically validate CSI's high effectiveness and high stealthiness at low poisoning rates. Notably, we found that the two strategies play the leading role in the full-shot and few-shot settings, respectively.
https://arxiv.org/abs/2404.00461
Automatic legal text classification systems have been proposed in the literature to extract knowledge from judgments and detect their aspects. However, most of these systems are black boxes even when their underlying models are interpretable, which may raise concerns about their trustworthiness. Accordingly, this work contributes a system combining Natural Language Processing (NLP) with Machine Learning (ML) to classify legal texts in an explainable manner. We analyze the features involved in each decision and the threshold bifurcation values of the decision paths of tree structures, and present this information to users in natural language. This is the first work on automatic analysis of legal texts that combines NLP and ML with Explainable Artificial Intelligence techniques to automatically make the models' decisions understandable to end users. Furthermore, legal experts have validated our solution, and their knowledge has been incorporated into the explanation process as "expert-in-the-loop" dictionaries. Experimental results on a dataset annotated with law categories by jurisdiction demonstrate that our system yields competitive classification performance, with accuracy values well above 90%, and that its automatic explanations are easily understandable even to non-expert users.
https://arxiv.org/abs/2404.00437
Dataset distillation aims to compress a training dataset by creating a small number of informative synthetic samples such that neural networks trained on them perform as well as those trained on the original training dataset. Current text dataset distillation methods create each synthetic sample as a sequence of word embeddings instead of a text to apply gradient-based optimization; however, such embedding-level distilled datasets cannot be used for training other models whose word embedding weights are different from the model used for distillation. To address this issue, we propose a novel text dataset distillation approach, called Distilling dataset into Language Model (DiLM), which trains a language model to generate informative synthetic training samples as text data, instead of directly optimizing synthetic samples. We evaluated DiLM on various text classification datasets and showed that distilled synthetic datasets from DiLM outperform those from current coreset selection methods. DiLM achieved remarkable generalization performance in training different types of models and in-context learning of large language models. Our code will be available at this https URL.
https://arxiv.org/abs/2404.00264
Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional text representation methods have been successfully applied to self-contained documents of medium size. However, information in short texts is often insufficient, due, for example, to the use of mnemonics, which makes them hard to classify. Therefore, the particularities of specific domains must be exploited. In this article we describe a novel system that combines Natural Language Processing techniques with Machine Learning algorithms to classify banking transaction descriptions for personal finance management, a problem that was not previously considered in the literature. We trained and tested that system on a labelled dataset with real customer transactions that will be available to other researchers on request. Motivated by existing solutions in spam detection, we also propose a short text similarity detector to reduce training set size based on the Jaccard distance. Experimental results with a two-stage classifier combining this detector with a SVM indicate a high accuracy in comparison with alternative approaches, taking into account complexity and computing time. Finally, we present a use case with a personal finance application, CoinScrap, which is available at Google Play and App Store.
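The training-set reduction step can be sketched with a plain word-set Jaccard distance; the `min_distance` threshold and the greedy keep/drop loop are illustrative assumptions, not necessarily the paper's exact pruning criterion:

```python
def jaccard_distance(a, b):
    """Jaccard distance between the word sets of two short texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)

def reduce_training_set(texts, min_distance=0.2):
    """Greedily keep a text only if it is sufficiently far from every kept text."""
    kept = []
    for t in texts:
        if all(jaccard_distance(t, k) >= min_distance for k in kept):
            kept.append(t)
    return kept
```

Because banking transaction descriptions are short and highly repetitive, discarding near-duplicates this way can shrink the SVM's training set substantially at little cost in accuracy.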
https://arxiv.org/abs/2404.08664
In the field of text data augmentation, rule-based methods are widely adopted for real-world applications owing to their cost-efficiency. However, conventional rule-based approaches suffer from the possibility of losing the original semantics of the given text. We propose a novel text data augmentation strategy that avoids such phenomena through a straightforward deletion of adverbs, which play a subsidiary role in the sentence. Our comprehensive experiments demonstrate the efficiency and effectiveness of our proposed approach for not just single text classification, but also natural language inference that requires semantic preservation. We publicly released our source code for reproducibility.
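The augmentation rule above is simple enough to sketch directly. The paper identifies adverbs with a POS tagger; this sketch substitutes a tiny hard-coded adverb list purely for illustration:

```python
# Hypothetical, hand-picked adverb list standing in for a real POS tagger.
ADVERBS = {"very", "really", "quite", "extremely", "slightly"}

def delete_adverbs(sentence):
    """Drop adverbs, which play a subsidiary role, keeping the core semantics intact."""
    return " ".join(w for w in sentence.split() if w.lower() not in ADVERBS)
```

Unlike random word deletion, removing only adverbs is unlikely to flip an entailment label, which is why the method also suits natural language inference.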
https://arxiv.org/abs/2403.20015
Large language models (LLMs) have demonstrated remarkable success in NLP tasks. However, there is a paucity of studies evaluating their performance on social media-based health-related natural language processing tasks, in which high scores have traditionally been difficult to achieve. We benchmarked one supervised classic machine learning model based on Support Vector Machines (SVMs), three supervised pretrained language models (PLMs) based on RoBERTa, BERTweet, and SocBERT, and two LLM-based classifiers (GPT-3.5 and GPT-4) across 6 text classification tasks. We developed three approaches for leveraging LLMs for text classification: employing LLMs as zero-shot classifiers, using LLMs as annotators to annotate training data for supervised classifiers, and utilizing LLMs with few-shot examples to augment manually annotated data. Our comprehensive experiments demonstrate that employing data augmentation using LLMs (GPT-4) with relatively small human-annotated data to train lightweight supervised classification models achieves superior results compared to training with human-annotated data alone. Supervised learners also outperform GPT-4 and GPT-3.5 in zero-shot settings. By leveraging this data augmentation strategy, we can harness the power of LLMs to develop smaller, more effective domain-specific NLP models. Training lightweight supervised classification models on LLM-annotated data without human guidance is an ineffective strategy. However, the LLM, as a zero-shot classifier, shows promise in excluding false negatives and potentially reducing the human effort required for data annotation. Future investigations are needed to explore optimal training data sizes and the optimal amounts of augmented data.
https://arxiv.org/abs/2403.19031
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings. An advantage of these models over more standard approaches based on fine-tuning is the ability to understand instructions written in natural language (prompts), which helps them generalise better to different tasks and domains without the need for specific training data. This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances. However, existing research is limited in scale and lacks understanding of how text generation models combined with prompting techniques compare to more established methods for text classification such as fine-tuning masked language models. In this paper, we address this research gap by performing a large-scale evaluation study for 16 text classification datasets covering binary, multiclass, and multilabel problems. In particular, we compare zero- and few-shot approaches of large language models to fine-tuning smaller language models. We also analyse the results by prompt, classification type, domain, and number of labels. In general, the results show how fine-tuning smaller and more efficient language models can still outperform few-shot approaches of larger language models, which have room for improvement when it comes to text classification.
https://arxiv.org/abs/2403.17661
Existing self-supervised methods in natural language processing (NLP), especially for hierarchical text classification (HTC), mainly focus on self-supervised contrastive learning, relying heavily on human-designed augmentation rules to generate contrastive samples, which can corrupt or distort the original information. In this paper, we investigate the feasibility of a contrastive learning scheme in which the semantic and syntactic information inherent in the input sample is adequately preserved in the contrastive samples and fused during the learning process. Specifically, we propose an information-lossless contrastive learning strategy for HTC, namely \textbf{H}ierarchy-aware \textbf{I}nformation \textbf{L}ossless contrastive \textbf{L}earning (HILL), which consists of a text encoder representing the input document and a structure encoder directly generating the positive sample. The structure encoder takes the document embedding as input, extracts the essential syntactic information inherent in the label hierarchy under the principle of structural entropy minimization, and injects this syntactic information into the text representation via hierarchical representation learning. Experiments on three common datasets verify the superiority of HILL.
https://arxiv.org/abs/2403.17307
Following the significant achievements of large language models (LLMs), researchers have employed in-context learning for text classification tasks. However, these studies have focused on monolingual, single-turn classification tasks. In this paper, we introduce LARA (Linguistic-Adaptive Retrieval-Augmented Language Models), designed to enhance accuracy in multi-turn classification tasks across six languages, accommodating the numerous intents in chatbot interactions. Multi-turn intent classification is notably challenging due to the complexity and evolving nature of conversational contexts. LARA tackles these issues by combining a fine-tuned smaller model with a retrieval-augmented mechanism integrated within the architecture of LLMs. This integration allows LARA to dynamically utilize past dialogues and relevant intents, thereby improving its understanding of the context. Furthermore, our adaptive retrieval techniques bolster the cross-lingual capabilities of LLMs without extensive retraining or fine-tuning. Comprehensive experiments demonstrate that LARA achieves state-of-the-art performance on multi-turn intent classification tasks, enhancing average accuracy by 3.67% compared to existing methods.
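The retrieval step can be illustrated with a plain cosine-similarity search over stored dialogue-turn embeddings; the embedding source and ranking details here are placeholders, not LARA's actual components:

```python
import numpy as np

def retrieve_top_k(query_vec, memory_vecs, k=3):
    """Rank stored dialogue embeddings by cosine similarity to the query embedding.

    Illustrative only: how the vectors are produced is left abstract.
    """
    m = np.asarray(memory_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)[:k].tolist()
```

The retrieved past turns and their intents would then be placed in the LLM's context, so the model conditions on relevant history rather than the full dialogue.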
https://arxiv.org/abs/2403.16504
The success of Natural Language Understanding (NLU) benchmarks in various languages, such as GLUE for English, CLUE for Chinese, KLUE for Korean, and IndoNLU for Indonesian, has facilitated the evaluation of new NLU models across a wide range of tasks. To establish a standardized set of benchmarks for Vietnamese NLU, we introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchmark. The VLUE benchmark encompasses five datasets covering different NLU tasks, including text classification, span extraction, and natural language understanding. To provide an insightful overview of the current state of Vietnamese NLU, we then evaluate seven state-of-the-art pre-trained models, including both multilingual and Vietnamese monolingual models, on our proposed VLUE benchmark. Furthermore, we present CafeBERT, a new state-of-the-art pre-trained model that achieves superior results across all tasks in the VLUE benchmark. Our model combines the proficiency of a multilingual pre-trained model with Vietnamese linguistic knowledge. CafeBERT is developed based on the XLM-RoBERTa model, with an additional pre-training step utilizing a significant amount of Vietnamese textual data to enhance its adaptation to the Vietnamese language. CafeBERT is made publicly available for future research.
https://arxiv.org/abs/2403.15882
Active learning (AL) techniques aim to maximally utilize a labeling budget by iteratively selecting instances that are most likely to improve prediction accuracy. However, their benefit compared to random sampling has not been consistent across setups, e.g., different datasets and classifiers. In this empirical study, we examine how a combination of different factors might obscure any gains from an AL technique. Focusing on text classification, we rigorously evaluate AL techniques over around 1000 experiments that vary with respect to the dataset, batch size, text representation, and classifier. We show that AL is only effective in a narrow set of circumstances. We also address the problem of using metrics that are better aligned with real-world expectations. The impact of this study lies in its insights for practitioners: (a) the choice of text representation and classifier is as important as that of an AL technique, (b) the choice of the right metric is critical in assessing the latter, and, finally, (c) reported AL results must be holistically interpreted, accounting for variables other than just the query strategy.
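One standard query strategy of the kind such studies evaluate is least-confidence sampling; this sketch is a generic illustration and is not tied to any specific technique from the paper:

```python
import numpy as np

def least_confidence_query(probs, batch_size):
    """Pick the pool instances whose top predicted class probability is lowest.

    `probs` is an (n_instances, n_classes) array of classifier probabilities.
    """
    confidence = np.asarray(probs).max(axis=1)
    return np.argsort(confidence)[:batch_size].tolist()
```

The selected indices are sent for labeling, the classifier is retrained, and the loop repeats until the budget is exhausted; the study's point is that the representation, classifier, and batch size can matter as much as this selection rule.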
https://arxiv.org/abs/2403.15744
This paper presents the MasonTigers entry to SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. The task encompasses Binary Human-Written vs. Machine-Generated Text Classification (Track A), Multi-Way Machine-Generated Text Classification (Track B), and Human-Machine Mixed Text Detection (Track C). Our best-performing approaches mainly utilize an ensemble of discriminator transformer models, along with sentence-transformer and statistical machine learning approaches in specific cases. Moreover, zero-shot prompting and fine-tuning of FLAN-T5 are used for Tracks A and B.
https://arxiv.org/abs/2403.14989
In natural language processing (NLP), text classification tasks are increasingly fine-grained, as datasets are fragmented into a larger number of classes that are more difficult to differentiate from one another. As a consequence, the semantic structures of datasets have become more complex, and model decisions more difficult to explain. Existing tools, suited for coarse-grained classification, falter under these additional challenges. In response to this gap, we worked closely with NLP domain experts in an iterative design-and-evaluation process to characterize and tackle the growing requirements in their workflow of developing fine-grained text classification models. The result of this collaboration is the development of SemLa, a novel visual analytics system tailored for 1) dissecting complex semantic structures in a dataset when it is spatialized in model embedding space, and 2) visualizing fine-grained nuances in the meaning of text samples to faithfully explain model reasoning. This paper details the iterative design study and the resulting innovations featured in SemLa. The final design allows contrastive analysis at different levels by unearthing lexical and conceptual patterns including biases and artifacts in data. Expert feedback on our final design and case studies confirm that SemLa is a useful tool for supporting model validation and debugging as well as data annotation.
https://arxiv.org/abs/2403.15492
Perturbation-based explanation methods such as LIME and SHAP are commonly applied to text classification. This work focuses on their extension to generative language models. To address the challenges of text as output and long text inputs, we propose a general framework called MExGen that can be instantiated with different attribution algorithms. To handle text output, we introduce the notion of scalarizers for mapping text to real numbers and investigate multiple possibilities. To handle long inputs, we take a multi-level approach, proceeding from coarser levels of granularity to finer ones, and focus on algorithms with linear scaling in model queries. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and context-grounded question answering. The results show that our framework can provide more locally faithful explanations of generated outputs.
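One simple scalarizer of the kind the framework admits is token recall of a perturbed model output against the reference output; this particular choice is an illustration, not necessarily one of the scalarizers investigated in the paper:

```python
def overlap_scalarizer(reference, candidate):
    """Map a generated text to a real number: fraction of reference tokens
    (with multiplicity) that also appear in the candidate output."""
    ref, cand = reference.split(), candidate.split()
    if not ref:
        return 0.0
    common = sum(min(ref.count(w), cand.count(w)) for w in set(ref))
    return common / len(ref)
```

With text mapped to scalars this way, a perturbation-based attribution algorithm such as LIME or SHAP can score how much masking each input span changes the generated output.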
https://arxiv.org/abs/2403.14459
The advancement of Large Language Models (LLMs) has significantly transformed the field of natural language processing, although the focus on English-centric models has created a noticeable research gap for specific languages, including Vietnamese. To address this issue, this paper presents vi-mistral-x, an innovative Large Language Model designed expressly for the Vietnamese language. It utilizes a unique method of continual pre-training, based on the Mistral architecture, which incorporates grouped-query attention and sliding window attention techniques. This model, vi-Mistral-X, marks a significant step forward in improving the understanding and generation of the Vietnamese language. It introduces an additional phase of continual pre-training, specifically adapted for Vietnamese, enhancing the model's capability in understanding complex language nuances and generating accurate, context-aware Vietnamese text. Through comprehensive testing on various benchmarks, vi-mistral-x has shown to outperform existing Vietnamese LLMs in several key areas, including text classification, question answering, and text generation. Particularly, in the Vietnamese Multitask Language Understanding (VMLU) benchmark, vi-mistral-x sets a new standard, outperforming other available models significantly. This paper highlights the critical role of continual pre-training in advancing language-specific LLMs and opens new avenues for the development of multilingual models. We aim for vi-mistral-x to not just be an important asset for processing the Vietnamese language but also to encourage more advancements in creating large language models for languages that are less represented.
https://arxiv.org/abs/2403.15470
The parallelism of Transformer-based models comes at the cost of a limited maximum input length. Some studies have proposed methods to overcome this limitation, but none has reported the effectiveness of summarization as an alternative. In this study, we investigate the performance of document truncation and summarization in text classification tasks, each with several variations, and examine how close their performance comes to that of the full text. We used a dataset of summarization tasks based on Indonesian news articles (IndoSum) for the classification tests. The study shows that the summaries outperform the majority of truncation method variations and are outperformed by only one. The best strategy obtained in this study is taking the head of the document; the second best is extractive summarization. We explain these results, motivating further research into exploiting the potential of document summarization as a shortening alternative. The code and data used in this work are publicly available in this https URL.
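Two of the truncation variations such a comparison covers can be sketched over token lists; the `head_frac` split ratio is an illustrative assumption, not a parameter taken from the paper:

```python
def head(tokens, max_len):
    """Keep only the first max_len tokens (the best strategy in this study)."""
    return tokens[:max_len]

def head_tail(tokens, max_len, head_frac=0.5):
    """Keep part of the beginning and part of the end of the document."""
    if len(tokens) <= max_len:
        return tokens
    h = int(max_len * head_frac)
    return tokens[:h] + tokens[len(tokens) - (max_len - h):]
```

For news articles, the head variant benefits from the lead-paragraph convention, which is consistent with it beating most other shortening variants here.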
https://arxiv.org/abs/2403.12799