Unsupervised cross-lingual transfer involves transferring knowledge between languages without explicit supervision. Although numerous studies have sought to improve performance on such tasks by exploiting cross-lingual knowledge, particularly lexical and syntactic knowledge, current approaches are limited in that they incorporate only syntactic or only lexical information. Since each type of information offers unique advantages and no previous work has combined the two, we explore the potential of doing so. In this paper, we present a novel framework called "Lexicon-Syntax Enhanced Multilingual BERT" that combines both lexical and syntactic knowledge. Specifically, we use Multilingual BERT (mBERT) as the base model and employ two techniques to enhance its learning capabilities. The code-switching technique is used to implicitly teach the model lexical alignment information, while a syntax-based graph attention network is designed to help the model encode syntactic structure. To integrate both types of knowledge, we feed code-switched sequences into both the syntactic module and the mBERT base model simultaneously. Our extensive experimental results demonstrate that this framework consistently outperforms all baselines for zero-shot cross-lingual transfer, with gains of 1.0-3.7 points on text classification, named entity recognition (NER), and semantic parsing tasks. Keywords: cross-lingual transfer, lexicon, syntax, code-switching, graph attention network
https://arxiv.org/abs/2404.16627
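The code-switching step described above can be illustrated with a minimal sketch. The bilingual lexicon and substitution rate here are hypothetical stand-ins (real work would draw on a resource such as a MUSE dictionary), and the paper's actual procedure may differ:

```python
import random

# Hypothetical English -> German lexicon; a stand-in for a real bilingual dictionary.
LEXICON = {"the": "die", "cat": "Katze", "sat": "saß", "mat": "Matte"}

def code_switch(tokens, lexicon, rate=0.5, rng=None):
    """Randomly replace source tokens with their dictionary translations,
    implicitly exposing the model to lexical alignment information."""
    rng = rng or random.Random(0)
    return [lexicon.get(t) if t in lexicon and rng.random() < rate else t
            for t in tokens]

# With rate=1.0, every word found in the lexicon is switched.
switched = code_switch(["the", "cat", "sat", "on", "the", "mat"], LEXICON, rate=1.0)
print(switched)  # ['die', 'Katze', 'saß', 'on', 'die', 'Matte']
```

The switched sequence is then fed to both the syntactic module and the mBERT base model.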
This paper focuses on a very important societal challenge of water quality analysis. Being one of the key factors in the economic and social development of society, the provision of water and ensuring its quality has always remained one of the top priorities of public authorities. To ensure the quality of water, different methods for monitoring and assessing the water networks, such as offline and online surveys, are used. However, these surveys have several limitations, such as the limited number of participants and low frequency due to the labor involved in conducting such surveys. In this paper, we propose a Natural Language Processing (NLP) framework to automatically collect and analyze water-related posts from social media for data-driven decisions. The proposed framework is composed of two components, namely (i) text classification, and (ii) topic modeling. For text classification, we propose a merit-fusion-based framework incorporating several Large Language Models (LLMs) where different weight selection and optimization methods are employed to assign weights to the LLMs. In topic modeling, we employed the BERTopic library to discover the hidden topic patterns in the water-related tweets. We also analyzed relevant tweets originating from different regions and countries to explore global, regional, and country-specific issues and water-related concerns. We also collected and manually annotated a large-scale dataset, which is expected to facilitate future research on the topic.
https://arxiv.org/abs/2404.14977
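The merit-fusion idea above can be sketched as weighted voting over per-model predictions. The model names, labels, and weights below are illustrative only; the paper tunes the weights with dedicated selection and optimization methods:

```python
from collections import defaultdict

def fuse_predictions(model_outputs, weights):
    """Combine class predictions from several models by weighted voting.
    model_outputs: {model_name: predicted_label}; weights: {model_name: float}."""
    scores = defaultdict(float)
    for name, label in model_outputs.items():
        scores[label] += weights[name]
    return max(scores, key=scores.get)

# Illustrative: three hypothetical LLM classifiers voting on one tweet.
outputs = {"model_a": "relevant", "model_b": "irrelevant", "model_c": "relevant"}
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}
fused = fuse_predictions(outputs, weights)
print(fused)  # 'relevant' (0.7 vs 0.3)
```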
In this work, we propose a novel tree-based explanation technique, PEACH (Pretrained-embedding Explanation Across Contextual and Hierarchical Structure), which can explain how text-based documents are classified by using any pretrained contextual embeddings in a tree-based, human-interpretable manner. Note that PEACH can adopt any contextual embeddings of the PLMs as a training input for the decision tree. Using the proposed PEACH, we perform a comprehensive analysis of several contextual embeddings on nine different NLP text classification benchmarks. This analysis demonstrates the flexibility of the model across several PLM contextual embeddings, attribute selections, scaling, and clustering methods. Furthermore, we show the utility of the explanations by visualising feature selection and important trends in text classification via human-interpretable word-cloud-based trees, which clearly identify model mistakes and assist in dataset debugging. Besides interpretability, PEACH's classification performance outperforms or is comparable to that of the pretrained models.
https://arxiv.org/abs/2404.13645
Machine learning models have made incredible progress, but they still struggle when applied to examples from unseen domains. This study focuses on a specific problem of domain generalization, where a model is trained on one source domain and tested on multiple target domains that are unseen during training. We propose IMO: Invariant features Masks for Out-of-Distribution text classification, to achieve OOD generalization by learning invariant features. During training, IMO learns sparse mask layers that remove features irrelevant to prediction, while the remaining features stay invariant. Additionally, IMO has a token-level attention module that focuses on tokens useful for prediction. Our comprehensive experiments show that IMO substantially outperforms strong baselines across various evaluation metrics and settings.
https://arxiv.org/abs/2404.13504
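The masking step described above can be sketched at inference time: features whose learned mask value falls below a threshold are zeroed out. The feature values, mask values, and threshold here are hypothetical; in IMO the mask is learned jointly with the model:

```python
def apply_sparse_mask(features, mask, threshold=0.5):
    """Zero out features whose (learned) mask value falls below a threshold,
    keeping only the features treated as invariant across domains."""
    return [f if m >= threshold else 0.0 for f, m in zip(features, mask)]

features = [0.9, -1.2, 0.3, 2.0]   # hypothetical feature vector
mask = [0.8, 0.1, 0.95, 0.2]       # hypothetical learned mask values
kept = apply_sparse_mask(features, mask)
print(kept)  # [0.9, 0.0, 0.3, 0.0]
```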
The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison is still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien. Extrinsic evaluation, in turn, is performed via the Out-of-Vocabulary Generalization Challenge 1.0 benchmark, which consists of three newly specified downstream text classification tasks. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalizations compared to morphological tokenization for semantic compositionality of word meanings.
https://arxiv.org/abs/2404.13292
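The morphological-vs-alien distinction above can be sketched as a boundary check: a tokenization counts as morphological when every subword boundary coincides with a morpheme boundary. The morpheme segmentation below is a hypothetical example; UniMorph Labeller derives its segmentations from UniMorph resources:

```python
def label_tokenization(subwords, morphemes):
    """Label a word's subword tokenization as 'morphological' if every token
    boundary coincides with a morpheme boundary, else 'alien'."""
    def boundaries(parts):
        out, pos = set(), 0
        for p in parts[:-1]:
            pos += len(p)
            out.add(pos)
        return out
    return "morphological" if boundaries(subwords) <= boundaries(morphemes) else "alien"

# "unhappily" with hypothetical morphemes un + happi + ly:
print(label_tokenization(["un", "happily"], ["un", "happi", "ly"]))  # morphological
print(label_tokenization(["unh", "appily"], ["un", "happi", "ly"]))  # alien
```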
This study proposes Multitrans, a multi-modal fusion framework based on the Transformer architecture and the self-attention mechanism. The framework combines non-contrast computed tomography (NCCT) images with discharge diagnosis reports of patients undergoing stroke treatment, using a variety of Transformer-based methods to predict the functional outcomes of stroke treatment. The results show that single-modal text classification significantly outperforms single-modal image classification, while the multi-modal combination outperforms either single modality. Although the Transformer model performs worse on imaging data alone, when combined with clinical meta-diagnostic information the two modalities learn complementary information and contribute to accurately predicting stroke treatment outcomes.
https://arxiv.org/abs/2404.12634
We present FastFit, a method and a Python package designed to provide fast and accurate few-shot classification, especially for scenarios with many semantically similar classes. FastFit utilizes a novel approach integrating batch contrastive learning and token-level similarity scores. Compared to existing few-shot learning packages, such as SetFit, Transformers, or few-shot prompting of large language models via API calls, FastFit significantly improves multiclass classification performance in speed and accuracy across FewMany, our newly curated English benchmark, and multilingual datasets. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds. The FastFit package is now available on GitHub and PyPi, presenting a user-friendly solution for NLP practitioners.
https://arxiv.org/abs/2404.12365
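The token-level similarity score above can be sketched as a MaxSim-style sum: each query token is matched to its most similar class token, and the matches are summed. The toy 2-d embeddings below are illustrative; FastFit computes these scores over trained encoder representations:

```python
def maxsim_score(query_vecs, class_vecs):
    """Token-level similarity: each query token matches its most similar
    class token (dot product); the per-token maxima are summed."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, c) for c in class_vecs) for q in query_vecs)

# Toy 2-d token embeddings for one query and two candidate classes.
query = [[1.0, 0.0], [0.0, 1.0]]
class_a = [[1.0, 0.0], [0.5, 0.5]]
class_b = [[-1.0, 0.0], [0.0, -1.0]]
print(maxsim_score(query, class_a))  # 1.0 + 0.5 = 1.5
print(maxsim_score(query, class_b))  # 0.0 + 0.0 = 0.0
```

The query would be assigned to the class with the higher score (class_a here).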
Cognitive Behavioral Therapy (CBT) is an effective technique for addressing the irrational thoughts stemming from mental illnesses, but it necessitates precise identification of cognitive pathways to be successfully implemented in patient care. In current society, individuals frequently express negative emotions on social media on specific topics, often exhibiting cognitive distortions, including suicidal behaviors in extreme cases. Yet, there is a notable absence of methodologies for analyzing cognitive pathways that could aid psychotherapists in conducting effective interventions online. In this study, we gathered data from social media and established the task of extracting cognitive pathways, annotating the data based on a cognitive theoretical framework. We initially categorized the task of extracting cognitive pathways as a hierarchical text classification with four main categories and nineteen subcategories. Following this, we structured a text summarization task to help psychotherapists quickly grasp the essential information. Our experiments evaluate the performance of deep learning and large language models (LLMs) on these tasks. The results demonstrate that our deep learning method achieved a micro-F1 score of 62.34% in the hierarchical text classification task. Meanwhile, in the text summarization task, GPT-4 attained a Rouge-1 score of 54.92 and a Rouge-2 score of 30.86, surpassing the experimental deep learning model's performance. However, it may suffer from an issue of hallucination. We have made all models and codes publicly available to support further research in this field.
https://arxiv.org/abs/2404.11449
ICD (International Classification of Diseases) coding involves assigning ICD codes to patient visits based on their medical notes. ICD coding is a challenging multi-label text classification problem due to noisy medical document inputs. Recent advancements in automated ICD coding have enhanced performance by integrating additional data and knowledge bases with the encoding of medical notes and codes. However, most of them ignore the code hierarchy, leading to improper code assignments. To address these problems, we propose a novel framework based on associated and hierarchical code description distillation (AHDD) for better code representation learning and avoidance of improper code assignment. Specifically, we leverage the code descriptions and the hierarchical structure inherent to the ICD codes. The code descriptions are also applied to inform the attention layer and the output layer. Experimental results on the benchmark dataset show the superiority of the proposed framework over several state-of-the-art baselines.
https://arxiv.org/abs/2404.11132
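The code hierarchy exploited above is implicit in ICD code prefixes, which a minimal sketch can recover by stripping trailing characters (the example code is illustrative; the paper's distillation framework combines this structure with the code descriptions):

```python
def code_ancestors(code):
    """Walk up the ICD code hierarchy by stripping trailing characters:
    'J45.90' -> 'J45.9' -> 'J45'. The hierarchy is implicit in code prefixes."""
    ancestors = []
    while "." in code or len(code) > 3:
        code = code[:-1].rstrip(".")
        ancestors.append(code)
    return ancestors

print(code_ancestors("J45.90"))  # ['J45.9', 'J45']
```

A hierarchy-aware coder can penalize assignments whose ancestors conflict with the note's other predicted codes.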
This study is part of the debate on the efficiency of large versus small language models for text classification by prompting. We assess the performance of small language models in zero-shot text classification, challenging the prevailing dominance of large models. Across 15 datasets, our investigation benchmarks language models from 77M to 40B parameters using different architectures and scoring functions. Our findings reveal that small models can effectively classify texts, getting on par with or surpassing their larger counterparts. We developed and shared a comprehensive open-source repository that encapsulates our methodologies. This research underscores the notion that bigger isn't always better, suggesting that resource-efficient small models may offer viable solutions for specific data classification challenges.
https://arxiv.org/abs/2404.11122
In this paper, we aim to generate text classification data given arbitrary class definitions (i.e., user instructions), so one can train a small text classifier without any human annotation or raw corpus. Compared with pioneering attempts, our proposed Incubator is the first framework that can handle complicated and even mutually dependent classes (e.g., "TED Talk given by Educator" and "Other"). Specifically, Incubator is an LLM first tuned on the instruction-to-data mappings that we obtained from classification datasets and descriptions on HuggingFace, together with in-context augmentation by GPT-4. We then refine Incubator by learning on the cluster centers of semantic textual embeddings to emphasize uniformity and semantic diversity in generations. We compare Incubator on various classification tasks with strong baselines such as direct LLM-based inference and training data generation by prompt engineering. Experiments show Incubator is able to (1) perform well on traditional benchmarks, (2) take label dependency and user preference into consideration, and (3) enable logical text mining by incubating multiple classifiers.
https://arxiv.org/abs/2404.10877
Researchers must stay current in their fields by regularly reviewing academic literature, a task complicated by the daily publication of thousands of papers. Traditional multi-label text classification methods often ignore semantic relationships and fail to address the inherent class imbalances. This paper introduces a novel approach using the SciBERT model and CNNs to systematically categorize academic abstracts from the Elsevier OA CC-BY corpus. We use a multi-segment input strategy that processes abstracts, body text, titles, and keywords obtained via BERT topic modeling through SciBERT. Here, the [CLS] token embeddings capture the contextual representation of each segment, concatenated and processed through a CNN. The CNN uses convolution and pooling to enhance feature extraction and reduce dimensionality, optimizing the data for classification. Additionally, we incorporate class weights based on label frequency to address the class imbalance, significantly improving the classification F1 score and enhancing text classification systems and literature review efficiency.
https://arxiv.org/abs/2404.13078
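The class-weighting step above can be sketched with a standard inverse-frequency scheme (the same heuristic as scikit-learn's "balanced" weights); the paper's exact weighting may differ, and the labels below are illustrative:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: rare labels receive larger weights,
    counteracting class imbalance in the training loss."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

labels = ["physics"] * 8 + ["chemistry"] * 2
weights = class_weights(labels)
print(weights)  # {'physics': 0.625, 'chemistry': 2.5}
```

The weights are typically passed to the loss function so that errors on rare classes cost more.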
In this paper, we introduce an algorithm for data quantization based on the principles of Kashin representation. This approach hinges on decomposing any given vector, matrix, or tensor into two factors. The first factor maintains a small infinity norm, while the second exhibits a similarly constrained norm when multiplied by an orthogonal matrix. Surprisingly, the entries of factors after decomposition are well-concentrated around several peaks, which allows us to efficiently replace them with corresponding centroids for quantization purposes. We study the theoretical properties of the proposed approach and rigorously evaluate our compression algorithm in the context of next-word prediction tasks and on a set of downstream tasks for text classification. Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance while ensuring data compression, marking a significant advancement in the field of data quantization.
https://arxiv.org/abs/2404.09737
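The decomposition itself is beyond a short sketch, but the centroid-replacement step it enables is simple: because factor entries concentrate around a few peaks, each entry can be snapped to its nearest centroid. The entries and centroids below are hypothetical:

```python
def quantize_to_centroids(values, centroids):
    """Replace each entry with its nearest centroid -- viable here because,
    after Kashin decomposition, factor entries concentrate around a few peaks."""
    return [min(centroids, key=lambda c: abs(c - v)) for v in values]

# Hypothetical factor entries clustered near -1, 0 and 1.
entries = [-1.05, 0.02, 0.97, -0.98, 0.1]
quantized = quantize_to_centroids(entries, [-1.0, 0.0, 1.0])
print(quantized)  # each entry snaps to its nearest peak
```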
Text classification systems have continuously improved in performance over the years. However, nearly all current SOTA classifiers have a similar shortcoming: they process text in a horizontal manner, so vertically written words will not be recognized by a classifier. In contrast, humans are easily able to recognize and read words written both horizontally and vertically. Hence, a human adversary could write problematic words vertically and the meaning would still be preserved to other humans. We simulate such an attack, VertAttack. VertAttack identifies which words a classifier is reliant on and then rewrites those words vertically. We find that VertAttack is able to greatly drop the accuracy of 4 different transformer models on 5 datasets. For example, on the SST2 dataset, VertAttack is able to drop RoBERTa's accuracy from 94% to 13%. Furthermore, since VertAttack does not replace the word, meaning is easily preserved. We verify this via a human study and find that crowdworkers are able to correctly label 77% of the perturbed texts, compared to 81% of the original texts. We believe VertAttack offers a look into how humans might circumvent classifiers in the future and thus inspire a look into more robust algorithms.
https://arxiv.org/abs/2404.08538
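The rewriting step of the attack is easy to sketch: target words are rewritten with one character per line. In the paper the target words are chosen by probing which words the classifier relies on; here the target set is simply given:

```python
def vert_attack(text, targets):
    """Rewrite the given target words vertically (one character per line);
    all other words are left untouched."""
    out = []
    for word in text.split():
        out.append("\n".join(word) if word in targets else word)
    return " ".join(out)

perturbed = vert_attack("a truly terrible movie", {"terrible"})
print(perturbed)  # 'terrible' is now spelled downward, one letter per line
```

A human reader can still recover "terrible" by reading downward, while a horizontal tokenizer sees only single-character fragments.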
Popular zero-shot models suffer due to artifacts inherited from pretraining. A particularly detrimental artifact, caused by unbalanced web-scale pretraining data, is mismatched label distribution. Existing approaches that seek to repair the label distribution are not suitable in zero-shot settings, as they have incompatible requirements such as access to labeled downstream task data or knowledge of the true label balance in the pretraining distribution. We sidestep these challenges and introduce a simple and lightweight approach to adjust pretrained model predictions via optimal transport. Our technique requires only an estimate of the label distribution of a downstream task. Theoretically, we characterize the improvement produced by our procedure under certain mild conditions and provide bounds on the error caused by misspecification. Empirically, we validate our method in a wide array of zero-shot image and text classification tasks, improving accuracy by 4.8% and 15.9% on average, and beating baselines like Prior Matching -- often by significant margins -- in 17 out of 21 datasets.
https://arxiv.org/abs/2404.08461
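The adjustment above can be sketched with a crude Sinkhorn-style rescaling, which pushes the column sums of the prediction matrix toward the estimated label distribution while keeping each row a probability distribution. This is only a sketch of the idea, not the paper's exact optimal-transport formulation, and the scores are illustrative:

```python
def adjust_predictions(probs, label_dist, iters=50):
    """Alternately rescale columns toward the target label distribution
    and renormalize rows, Sinkhorn-style."""
    n, k = len(probs), len(label_dist)
    P = [row[:] for row in probs]
    for _ in range(iters):
        col = [sum(P[i][j] for i in range(n)) for j in range(k)]
        for i in range(n):
            P[i] = [P[i][j] * label_dist[j] * n / col[j] for j in range(k)]
            s = sum(P[i])
            P[i] = [p / s for p in P[i]]
    return P

# Biased zero-shot scores: both examples lean toward class 0,
# but the task is believed to be balanced (50/50).
probs = [[0.9, 0.1], [0.6, 0.4]]
adjusted = adjust_predictions(probs, [0.5, 0.5])
preds = [max(range(2), key=lambda j: row[j]) for row in adjusted]
print(preds)  # the weaker class-0 lean flips to class 1
```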
Learning an effective representation in multi-label text classification (MLTC) is a significant challenge in NLP. This challenge arises from the inherent complexity of the task, which is shaped by two key factors: the intricate connections between labels and the widespread long-tailed distribution of the data. To overcome this issue, one potential approach involves integrating supervised contrastive learning with classical supervised loss functions. Although contrastive learning has shown remarkable performance in multi-class classification, its impact in the multi-label framework has not been thoroughly investigated. In this paper, we conduct an in-depth study of supervised contrastive learning and its influence on representation in MLTC context. We emphasize the importance of considering long-tailed data distributions to build a robust representation space, which effectively addresses two critical challenges associated with contrastive learning that we identify: the "lack of positives" and the "attraction-repulsion imbalance". Building on this insight, we introduce a novel contrastive loss function for MLTC. It attains Micro-F1 scores that either match or surpass those obtained with other frequently employed loss functions, and demonstrates a significant improvement in Macro-F1 scores across three multi-label datasets.
https://arxiv.org/abs/2404.08720
We present Sequence Salience, a visual tool for interactive prompt debugging with input salience methods. Sequence Salience builds on widely used salience methods for text classification and single-token prediction, and extends this to a system tailored for debugging complex LLM prompts. Our system is well-suited for long texts, and expands on previous work by 1) providing controllable aggregation of token-level salience to the word, sentence, or paragraph level, making salience over long inputs tractable; and 2) supporting rapid iteration where practitioners can act on salience results, refine prompts, and run salience on the new output. We include case studies showing how Sequence Salience can help practitioners work with several complex prompting strategies, including few-shot, chain-of-thought, and constitutional principles. Sequence Salience is built on the Learning Interpretability Tool, an open-source platform for ML model visualizations, and code, notebooks, and tutorials are available at http://goo.gle/sequence-salience.
https://arxiv.org/abs/2404.07498
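The controllable aggregation described in point 1 can be sketched by summing token-level salience over the token spans of each coarser unit; the scores and spans below are illustrative, and the tool also supports other granularities and aggregation choices:

```python
def aggregate_salience(token_scores, spans):
    """Aggregate token-level salience to a coarser unit (word, sentence,
    or paragraph) by summing scores over each unit's token indices."""
    return [sum(token_scores[i] for i in span) for span in spans]

# A word split into 3 subword tokens, plus a 1-token word (illustrative).
token_scores = [0.1, 0.3, 0.2, 0.4]
word_spans = [[0, 1, 2], [3]]
word_salience = aggregate_salience(token_scores, word_spans)
print(word_salience)  # per-word salience for the two words
```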
Out-of-distribution (OOD) detection plays a crucial role in ensuring the safety and reliability of deep neural networks in various applications. While there has been a growing focus on OOD detection in visual data, the field of textual OOD detection has received less attention. Only a few attempts have been made to directly apply general OOD detection methods to natural language processing (NLP) tasks, without adequately considering the characteristics of textual data. In this paper, we delve into textual OOD detection with Transformers. We first identify a key problem prevalent in existing OOD detection methods: the biased representation learned through the maximization of the conditional likelihood $p(y\mid x)$ can potentially result in subpar performance. We then propose a novel variational inference framework for OOD detection (VI-OOD), which maximizes the likelihood of the joint distribution $p(x, y)$ instead of $p(y\mid x)$. VI-OOD is tailored for textual OOD detection by efficiently exploiting the representations of pre-trained Transformers. Through comprehensive experiments on various text classification tasks, VI-OOD demonstrates its effectiveness and wide applicability. Our code has been released at \url{this https URL}.
https://arxiv.org/abs/2404.06217
Data analysis and machine learning are of preeminent importance in the legal domain, especially in tasks like clustering and text classification. In this study, we harnessed the power of natural language processing tools to enhance datasets meticulously curated by experts. This process significantly improved the classification workflow for legal texts using machine learning techniques. We considered the Sustainable Development Goals (SDGs) data from the United Nations 2030 Agenda as a practical case study. A clustering-based data augmentation strategy led to remarkable enhancements in the accuracy and sensitivity metrics of classification models. For certain SDGs within the 2030 Agenda, we observed performance gains of over 15%. In some cases, the example base expanded by a noteworthy factor of 5. When dealing with unclassified legal texts, data augmentation strategies centered around clustering prove to be highly effective. They provide a valuable means to expand the existing knowledge base without the need for labor-intensive manual classification efforts.
https://arxiv.org/abs/2404.08683
For assessing various performance indicators of companies, the focus is shifting from strictly financial (quantitative) publicly disclosed information to qualitative (textual) information. This textual data can provide valuable weak signals, for example through stylistic features, which can complement the quantitative data on financial performance or on Environmental, Social and Governance (ESG) criteria. In this work, we use various multi-task learning methods for financial text classification with the focus on financial sentiment, objectivity, forward-looking sentence prediction and ESG-content detection. We propose different methods to combine the information extracted from training jointly on different tasks; our best-performing method highlights the positive effect of explicitly adding auxiliary task predictions as features for the final target task during the multi-task training. Next, we use these classifiers to extract textual features from annual reports of FTSE350 companies and investigate the link between ESG quantitative scores and these features.
https://arxiv.org/abs/2404.05281
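The best-performing combination method above, adding auxiliary-task predictions as features of the final target task, can be sketched as simple feature concatenation. The feature and prediction values below are hypothetical placeholders for real model outputs:

```python
def augment_with_auxiliary(features, aux_predictions):
    """Append auxiliary-task predictions (e.g. sentiment or objectivity
    scores) to the input features of the final target task."""
    return features + aux_predictions

base = [0.2, 0.7, 0.1]   # hypothetical slice of a sentence representation
aux = [0.9, 0.05]        # hypothetical sentiment / objectivity predictions
augmented = augment_with_auxiliary(base, aux)
print(augmented)  # [0.2, 0.7, 0.1, 0.9, 0.05]
```

The target-task classifier is then trained on the augmented vectors instead of the base features alone.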