Patient experience and care quality are crucial for a hospital's sustainability and reputation. The analysis of patient feedback offers valuable insight into patient satisfaction and outcomes. However, the unstructured nature of these comments poses challenges for traditional machine learning methods that follow a supervised learning paradigm, owing to the unavailability of labeled data and the nuances these texts encompass. This research explores leveraging Large Language Models (LLMs) for Multi-label Text Classification (MLTC) of inpatient comments shared after a hospital stay. GPT-4o-Turbo was used to conduct the classification. Given the sensitive nature of patients' comments, a security layer based on a Protected Health Information (PHI) detection framework is introduced before the data is fed to the LLM, ensuring patient de-identification. Additionally, within a prompt engineering framework, zero-shot learning, in-context learning, and chain-of-thought prompting were explored. Results demonstrate that GPT-4o-Turbo, in both zero-shot and few-shot settings, outperforms traditional methods and Pre-trained Language Models (PLMs); the zero-shot setting achieves the highest overall performance, with an F1-score of 76.12% and a weighted F1-score of 73.61%, followed closely by the few-shot results. Subsequently, the classification results were analyzed for their association with structured patient experience variables (e.g., rating). The study enhances MLTC through the application of LLMs, offering healthcare practitioners an efficient method to gain deeper insights into patient feedback and deliver prompt, appropriate responses.
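As a rough illustration of the kind of pipeline the abstract describes (de-identify first, then prompt an LLM for multi-label classification), a minimal sketch is given below. The label set, the regex-based PHI scrubbing rules, and the model name are illustrative assumptions and do not reproduce the paper's actual framework.

```python
# Minimal sketch: de-identify a patient comment, then ask an LLM for multi-label tags.
# The label set, regex rules, and model name are placeholders, not the paper's setup.
import re
import json
from openai import OpenAI

LABELS = ["staff attitude", "wait time", "cleanliness", "communication", "food"]  # assumed labels

def scrub_phi(text: str) -> str:
    """Very rough PHI masking; a real pipeline would use a dedicated PHI-detection model."""
    text = re.sub(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", "[PHONE]", text)   # phone numbers
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", text)          # e-mail addresses
    text = re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "[DATE]", text)           # simple dates
    return text

def classify_comment(comment: str, client: OpenAI, model: str = "gpt-4o") -> list[str]:
    prompt = (
        "You are labeling hospital inpatient feedback.\n"
        f"Possible labels: {', '.join(LABELS)}.\n"
        "Return a JSON list with every label that applies to the comment.\n\n"
        f"Comment: {scrub_phi(comment)}"
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    # Assumes the model returns valid JSON; production code would validate the output.
    return json.loads(resp.choices[0].message.content)
```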
https://arxiv.org/abs/2410.23528
State-of-the-art Extreme Multi-Label Text Classification (XMTC) models rely heavily on multi-label attention layers to focus on key tokens in input text, but obtaining optimal attention weights is challenging and resource-intensive. To address this, we introduce PLANT -- Pretrained and Leveraged AtteNTion -- a novel transfer learning strategy for fine-tuning XMTC decoders. PLANT surpasses existing state-of-the-art methods across all metrics on mimicfull, mimicfifty, mimicfour, eurlex, and wikiten datasets. It particularly excels in few-shot scenarios, outperforming previous models specifically designed for few-shot scenarios by over 50 percentage points in F1 scores on mimicrare and by over 36 percentage points on mimicfew, demonstrating its superior capability in handling rare codes. PLANT also shows remarkable data efficiency in few-shot scenarios, achieving precision comparable to traditional models with significantly less data. These results are achieved through key technical innovations: leveraging a pretrained Learning-to-Rank model as the planted attention layer, integrating mutual-information gain to enhance attention, introducing an inattention mechanism, and implementing a stateful-decoder to maintain context. Comprehensive ablation studies validate the importance of these contributions in realizing the performance gains.
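For context, the multi-label attention layer that the abstract says XMTC models rely on is typically a label-wise attention pooling of token states, as in the generic PyTorch sketch below; PLANT's contribution, pretraining and planting those attention weights, is not shown here.

```python
import torch
import torch.nn as nn

class LabelWiseAttention(nn.Module):
    """Generic multi-label attention: one learned query per label pools the token states,
    and each label has its own output weight vector. Not PLANT's pretrained variant."""
    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(num_labels, hidden_dim) * 0.02)
        self.label_outputs = nn.Parameter(torch.randn(num_labels, hidden_dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(num_labels))

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_dim) from any text encoder
        scores = torch.einsum("bsh,lh->bls", token_states, self.label_queries)
        attn = torch.softmax(scores, dim=-1)                          # attention over tokens, per label
        label_repr = torch.einsum("bls,bsh->blh", attn, token_states)
        return (label_repr * self.label_outputs).sum(-1) + self.bias  # (batch, num_labels) logits

# usage: logits = LabelWiseAttention(768, num_labels=50)(encoder_output)
```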
https://arxiv.org/abs/2410.23066
Machine learning (ML) for text classification has been widely used in various domains, such as toxicity detection, chatbot consulting, and review analysis. These applications can significantly impact ethics, economics, and human behavior, raising serious concerns about trusting ML decisions. Several studies indicate that traditional metrics, such as model confidence and accuracy, are insufficient to build human trust in ML models. These models often learn spurious correlations during training and predict based on them during inference. In the real world, where such correlations are absent, their performance can deteriorate significantly. To avoid this, a common practice is to test whether predictions are reasonable. Along with this, a challenge known as the trustworthiness oracle problem has been introduced. Due to the lack of automated trustworthiness oracles, the assessment requires manual validation of the decision process disclosed by explanation methods, which is time-consuming and not scalable. We propose TOKI, the first automated trustworthiness oracle generation method for text classifiers, which automatically checks whether the prediction-contributing words are related to the predicted class using explanation methods and word embeddings. To demonstrate its practical usefulness, we introduce a novel adversarial attack method targeting trustworthiness issues identified by TOKI. We compare TOKI with a naive baseline based solely on model confidence, using human-created ground truths for 6,000 predictions. We also compare the TOKI-guided adversarial attack method with A2T, a SOTA adversarial attack method. Results show that relying on prediction uncertainty cannot distinguish between trustworthy and untrustworthy predictions, that TOKI achieves 142% higher accuracy than the naive baseline, and that the TOKI-guided adversarial attack method is more effective with fewer perturbations than A2T.
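A toy version of the check TOKI automates might look as follows: take the words an explanation method says contributed to a prediction and test whether their embeddings are close to the predicted class name. The embedding model, similarity threshold, and decision rule below are assumptions for illustration only.

```python
# Toy trustworthiness check: are the prediction-contributing words semantically
# related to the predicted class? Embedding model and threshold are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_trustworthy(contributing_words, predicted_class, threshold=0.3):
    """contributing_words: top words from an explanation method (e.g. LIME or SHAP)."""
    class_vec = embedder.encode(predicted_class)
    sims = [cosine(embedder.encode(w), class_vec) for w in contributing_words]
    # Trust the prediction only if most contributing words relate to the class.
    return np.mean([s >= threshold for s in sims]) > 0.5

# e.g. is_trustworthy(["hate", "stupid"], "toxic")  -> likely True
#      is_trustworthy(["Monday", "the"], "toxic")   -> likely False
```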
https://arxiv.org/abs/2410.22663
Despite their high predictive accuracies, current machine learning systems often exhibit systematic biases stemming from annotation artifacts or insufficient support for certain classes in the dataset. Recent work proposes automatic methods for identifying and explaining systematic biases using keywords. We introduce DISCERN, a framework for interpreting systematic biases in text classifiers using language explanations. DISCERN iteratively generates precise natural language descriptions of systematic errors by employing an interactive loop between two large language models. We then use the descriptions to improve classifiers by augmenting their training sets with synthetically generated instances or with examples annotated via active learning. On three text-classification datasets, we demonstrate that language explanations from our framework induce consistent performance improvements that go beyond what is achievable with exemplars of systematic bias. Finally, in human evaluations, we show that users can interpret systematic biases more effectively (by over 25% relative) and more efficiently when they are described through language explanations as opposed to cluster exemplars.
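A heavily simplified rendering of the describe-and-refine loop could look like the sketch below, where one LLM call proposes a natural-language description of the misclassified examples and further calls sharpen it against correctly classified examples. The prompts, model name, and stopping rule are assumptions; the paper's actual protocol differs in detail.

```python
# Simplified describe-and-verify loop between two LLM roles (prompts and model are illustrative).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return resp.choices[0].message.content

def describe_bias(misclassified: list[str], correct: list[str], rounds: int = 3) -> str:
    description = ask(
        "Describe, in one sentence, what these misclassified texts have in common:\n"
        + "\n".join(misclassified)
    )
    for _ in range(rounds):
        refined = ask(
            f"Candidate description of an error pattern: '{description}'.\n"
            "Does it match these ERROR texts but not these CORRECT texts? "
            "Reply with a sharper one-sentence description.\n"
            "ERRORS:\n" + "\n".join(misclassified) + "\nCORRECT:\n" + "\n".join(correct)
        )
        description = refined.strip()
    return description
```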
https://arxiv.org/abs/2410.22239
The unique characteristics of text data make classification tasks a complex problem. Advances in unsupervised and semi-supervised learning and autoencoder architectures addressed several challenges. However, they still struggle with imbalanced text classification tasks, a common scenario in real-world applications, demonstrating a tendency to produce embeddings with unfavorable properties, such as class overlap. In this paper, we show that leveraging class-aware contrastive optimization combined with denoising autoencoders can successfully tackle imbalanced text classification tasks, achieving better performance than the current state-of-the-art. Concretely, our proposal combines reconstruction loss with contrastive class separation in the embedding space, allowing a better balance between the truthfulness of the generated embeddings and the model's ability to separate different classes. Compared with an extensive set of traditional and state-of-the-art competing methods, our proposal demonstrates a notable increase in performance across a wide variety of text datasets.
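The objective the abstract describes can be written, in its simplest generic form, as a denoising reconstruction term plus a supervised (class-aware) contrastive term over the embedding space; the sketch below shows that generic combination, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def combined_loss(inputs, reconstructions, embeddings, labels, temperature=0.1, lam=1.0):
    """Denoising-autoencoder reconstruction loss plus a class-aware (supervised) contrastive
    term over the embedding space. Generic form; the weight and temperature are illustrative."""
    recon = F.mse_loss(reconstructions, inputs)

    z = F.normalize(embeddings, dim=1)                       # (batch, dim)
    sim = z @ z.T / temperature                              # pairwise cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask).float()

    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0                                   # anchors with at least one positive
    contrastive = -(log_prob * pos_mask).sum(1)[valid] / pos_counts[valid]
    return recon + lam * contrastive.mean()
```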
https://arxiv.org/abs/2410.22197
Objective: This review aims to analyze the application of natural language processing (NLP) techniques in cancer research using electronic health records (EHRs) and clinical notes. This review addresses gaps in the existing literature by providing a broader perspective than previous studies focused on specific cancer types or applications. Methods: A comprehensive literature search was conducted using the Scopus database, identifying 94 relevant studies published between 2019 and 2024. Data extraction included study characteristics, cancer types, NLP methodologies, dataset information, performance metrics, challenges, and future directions. Studies were categorized based on cancer types and NLP applications. Results: The results showed a growing trend in NLP applications for cancer research, with breast, lung, and colorectal cancers being the most studied. Information extraction and text classification emerged as predominant NLP tasks. A shift from rule-based to advanced machine learning techniques, particularly transformer-based models, was observed. Dataset sizes used in existing studies varied widely. Key challenges included the limited generalizability of proposed solutions and the need for improved integration into clinical workflows. Conclusion: NLP techniques show significant potential in analyzing EHRs and clinical notes for cancer research. However, future work should focus on improving model generalizability, enhancing robustness in handling complex clinical language, and expanding applications to understudied cancer types. Integrating NLP tools into clinical practice and addressing ethical considerations remain crucial for realizing the full potential of NLP in enhancing cancer diagnosis, treatment, and patient outcomes.
https://arxiv.org/abs/2410.22180
Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can lead to poor outcomes when the trained model is applied in practice. Therefore, we propose efficient weighted-loss approaches to align synthetic data with the real-world distribution by emphasizing high-quality and diversified data generated by LLMs while using only a small amount of real-world data. We empirically assessed the effectiveness of our method on multiple text classification tasks, and the results showed that leveraging our approaches on a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, providing potential solutions for effectively leveraging synthetic data from any suitable data generator for model training.
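In its simplest form, such a weighted objective replaces the uniform average in cross-entropy with per-example weights that favour real data and high-quality, diverse synthetic data; the weighting scheme mentioned in the comments is an illustrative assumption, not the paper's specific approach.

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, targets, weights):
    """logits: (batch, classes); targets: (batch,); weights: (batch,) per-example weights."""
    per_example = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_example).sum() / weights.sum()

# Example weighting scheme (an assumption, not the paper's): real examples get weight 1.0,
# synthetic examples get a quality/diversity score in [0, 1] estimated on a small real set.
```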
https://arxiv.org/abs/2410.21526
We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms that detect low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulating content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, e.g. text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through a beam search procedure until the victim classifier changes its decision. The evaluation confirms the superiority of our approach in the constrained scenario, especially in the case of long input texts (news articles), where exhaustive search is not feasible.
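A stripped-down version of the "decompose into small edits and beam-search until the classifier flips" procedure is sketched below; `candidate_edits` and `classify` are placeholder interfaces, and in TREPAT the candidate edits come from LLM rephrasings rather than an arbitrary generator.

```python
# Simplified beam search over small edits, stopping when the victim classifier changes its label.
def beam_search_attack(text, classify, candidate_edits, beam_width=5, max_steps=20):
    """classify(text) -> (label, confidence); candidate_edits(text) -> list of edited texts."""
    original_label, _ = classify(text)
    beam = [text]
    for _ in range(max_steps):
        scored = []
        for current in beam:
            for edited in candidate_edits(current):
                label, conf = classify(edited)          # one query per candidate
                if label != original_label:
                    return edited                        # attack succeeded
                scored.append((conf, edited))            # lower confidence = more promising
        if not scored:
            break
        scored.sort(key=lambda pair: pair[0])
        beam = [t for _, t in scored[:beam_width]]
    return None                                          # no adversarial example found
```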
https://arxiv.org/abs/2410.20940
In light of the recent success of Graph Neural Networks (GNNs) and their ability to perform inference on complex data structures, many studies apply GNNs to the task of text classification. In most previous methods, a heterogeneous graph, containing both word and document nodes, is constructed using the entire corpus and a GNN is used to classify document nodes. In this work, we explore a new Discriminative Graph of Words Graph Neural Network (DGoW-GNN) approach encapsulating both a novel discriminative graph construction and a model to classify text. Our graph construction contains only word nodes and no document nodes: we split the training corpus into disconnected subgraphs according to their labels and weight edges by the pointwise mutual information of the represented words. Our graph construction, for which we provide theoretical motivation, allows us to reformulate the task of text classification as the task of walk classification. We also propose a new model for the graph-based classification of text, which combines a GNN and a sequence model. We evaluate our approach on seven benchmark datasets and find that it is outperformed by several state-of-the-art baseline models. We analyse reasons for this performance difference and hypothesise under which conditions it is likely to change.
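For reference, the pointwise mutual information used to weight edges between co-occurring words can be computed from sliding-window co-occurrence counts roughly as follows; the window size and the choice to keep only positive PMI values are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edge_weights(tokenized_docs, window=10):
    """Edge weight = PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) ), estimated from
    sliding-window co-occurrence counts; only positive PMI values are kept."""
    word_counts, pair_counts, n_windows = Counter(), Counter(), 0
    for tokens in tokenized_docs:
        for i in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[i:i + window])
            n_windows += 1
            word_counts.update(win)
            pair_counts.update(frozenset(p) for p in combinations(sorted(win), 2))
    edges = {}
    for pair, c in pair_counts.items():
        w1, w2 = tuple(pair)
        pmi = math.log((c / n_windows) / ((word_counts[w1] / n_windows) * (word_counts[w2] / n_windows)))
        if pmi > 0:
            edges[(w1, w2)] = pmi
    return edges
```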
https://arxiv.org/abs/2410.20469
Large Language Models have introduced novel opportunities for text comprehension and generation. Yet, they are vulnerable to adversarial perturbations and data poisoning attacks, particularly in tasks like text classification and translation. However, the adversarial robustness of abstractive text summarization models remains less explored. In this work, we unveil a novel approach that exploits the inherent lead bias in summarization models to perform adversarial perturbations. Furthermore, we introduce an innovative application of influence functions to execute data poisoning, which compromises the model's integrity. This approach not only skews the model's behavior toward producing desired outcomes but also reveals a new behavioral change, where models under attack tend to generate extractive summaries rather than abstractive summaries.
https://arxiv.org/abs/2410.20019
Text classification involves categorizing a given text, such as determining its sentiment or identifying harmful content. With the advancement of large language models (LLMs), these models have become highly effective at performing text classification tasks. However, they still show vulnerabilities to variations in text formatting. Recent research demonstrates that modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks. While easily understood by humans, these inputs can significantly mislead models, posing a potential risk of bypassing detection in real-world scenarios involving harmful or sensitive information. With the expanding application of LLMs, a crucial question arises: Do decoder-based LLMs exhibit similar vulnerabilities to vertically formatted text input? In this paper, we investigate the impact of vertical text input on the performance of various LLMs across multiple text classification datasets and analyze the underlying causes. Our findings are as follows: (i) Vertical text input significantly degrades the accuracy of LLMs in text classification tasks. (ii) Chain of Thought (CoT) reasoning does not help LLMs recognize vertical input or mitigate its vulnerability, but few-shot learning with careful analysis does. (iii) We explore the underlying cause of the vulnerability by analyzing the inherent issues in tokenization and attention matrices.
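As an illustration of what vertically formatted input means here, a sentence can be rewritten so that each word runs top-to-bottom in its own column; the helper below is one plausible rendering, not necessarily the exact format used in the paper.

```python
def to_vertical(sentence: str) -> str:
    """Render each word as a column of characters, with words side by side.
    'hello world' ->
    h w
    e o
    l r
    l l
    o d
    """
    words = sentence.split()
    height = max(len(w) for w in words)
    rows = []
    for i in range(height):
        rows.append(" ".join(w[i] if i < len(w) else " " for w in words))
    return "\n".join(rows)

print(to_vertical("this movie was great"))
```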
https://arxiv.org/abs/2410.20016
Finetuning is a common practice widespread across different communities to adapt pretrained models to particular tasks. Text classification is one of these tasks for which many pretrained models are available. On the other hand, ensembles of neural networks are typically used to boost performance and provide reliable uncertainty estimates. However, ensembling pretrained models for text classification is not a well-studied avenue. In this paper, we present a metadataset with predictions from five large finetuned models on six datasets, and report results of different ensembling strategies from these predictions. Our results shed light on how ensembling can improve the performance of finetuned text classifiers and incentivize future adoption of ensembles in such tasks.
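Two of the simplest ensembling strategies over such a metadataset of per-model predictions are probability averaging and majority voting, sketched below as generic baselines; they are not tied to the paper's specific set of strategies.

```python
import numpy as np

def average_probabilities(prob_matrices):
    """prob_matrices: list of (n_examples, n_classes) arrays, one per finetuned model."""
    return np.mean(np.stack(prob_matrices), axis=0).argmax(axis=1)

def majority_vote(prob_matrices):
    votes = np.stack([p.argmax(axis=1) for p in prob_matrices])   # (n_models, n_examples)
    n_classes = prob_matrices[0].shape[1]
    counts = np.apply_along_axis(lambda v: np.bincount(v, minlength=n_classes), 0, votes)
    return counts.argmax(axis=0)                                   # ties broken by lowest class id
```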
https://arxiv.org/abs/2410.19889
Natural Language Processing is revolutionizing the way legal professionals and laypersons operate in the legal field. The considerable potential for Natural Language Processing in the legal sector, especially in developing computational tools for various legal processes, has captured the interest of researchers for years. This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework, reviewing 148 studies, with a final selection of 127 after manual filtering. It explores foundational concepts related to Natural Language Processing in the legal domain, illustrating the unique aspects and challenges of processing legal texts, such as extensive document length, complex language, and limited open legal datasets. We provide an overview of Natural Language Processing tasks specific to legal text, such as Legal Document Summarization, legal Named Entity Recognition, Legal Question Answering, Legal Text Classification, and Legal Judgment Prediction. In the section on legal Language Models, we analyze both developed Language Models and approaches for adapting general Language Models to the legal domain. Additionally, we identify 15 Open Research Challenges, including bias in Artificial Intelligence applications, the need for more robust and interpretable models, and improving explainability to handle the complexities of legal language and reasoning.
https://arxiv.org/abs/2410.21306
Causal decoder-only transformer models used for generative language modelling, such as Generative Pre-trained Transformers (GPT), are trained to predict the next token in a sequence based only on its previous tokens. Despite this simple training objective, they have proved to be powerful AI tools. However, only predicting the next token results in top layer embedding vectors that are highly token-focused. There may be benefits in generating embedding vectors at each token position that better capture the overall meaning of longer sequences of future text. Recent studies matching brain scans with deep language models suggest that humans also predict upcoming words when listening or reading, but consider multiple future tokens rather than just one. This research investigates a new pretraining method called Future Token Prediction (FTP). In FTP, a large transformer encoder generates top layer embedding vectors for each token position, which, instead of being passed to a language head, are linearly and expansively projected to a pseudo-sequence, which is cross attended to by a small transformer decoder to predict the next N tokens forward from that position in the sequence. The top layer embedding vectors from FTP models exhibit distinct properties compared to those from standard GPT models, varying smoothly along a text sequence as measured by cosine similarity between adjacent tokens. Text generated by FTP models shows improved topic coherence compared to standard GPT-like models trained with the same prediction perplexity for the next single token. Based on the results of text classification examples, the vectors are also shown to better represent the topic of a text. On a toy, but complex, coding problem, FTP networks produce significantly better results than GPT networks.
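A much-simplified PyTorch rendering of the FTP idea is given below, collapsed to a single position: project the final context embedding into a pseudo-sequence and let a small decoder cross-attend to it while predicting the next N tokens. Dimensions, layer counts, and the omission of causal masking in the encoder are simplifications; the actual model applies this at every token position.

```python
import torch
import torch.nn as nn

class SimplifiedFTP(nn.Module):
    def __init__(self, vocab_size, d_model=256, pseudo_len=8, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)   # causal mask omitted for brevity
        self.expand = nn.Linear(d_model, pseudo_len * d_model)          # embedding -> pseudo-sequence
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.pseudo_len, self.d_model = pseudo_len, d_model

    def forward(self, context_ids, future_ids):
        # context_ids: (B, T) past tokens; future_ids: (B, N) next N tokens (teacher forcing)
        h = self.encoder(self.embed(context_ids))                         # (B, T, d)
        last = h[:, -1, :]                                                # single position only
        memory = self.expand(last).view(-1, self.pseudo_len, self.d_model)
        n = future_ids.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf"), device=future_ids.device), diagonal=1)
        out = self.decoder(self.embed(future_ids), memory, tgt_mask=causal)
        return self.lm_head(out)                                          # (B, N, vocab) logits
```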
https://arxiv.org/abs/2410.18160
Deep neural networks have achieved remarkable performance in various text-based tasks but often lack interpretability, making them less suitable for applications where transparency is critical. To address this, we propose ProtoLens, a novel prototype-based model that provides fine-grained, sub-sentence level interpretability for text classification. ProtoLens uses a Prototype-aware Span Extraction module to identify relevant text spans associated with learned prototypes and a Prototype Alignment mechanism to ensure prototypes are semantically meaningful throughout training. By aligning the prototype embeddings with human-understandable examples, ProtoLens provides interpretable predictions while maintaining competitive accuracy. Extensive experiments demonstrate that ProtoLens outperforms both prototype-based and non-interpretable baselines on multiple text classification benchmarks. Code and data are available at \url{this https URL}.
https://arxiv.org/abs/2410.17546
Social media is a rich source of data in which users report information about their health and how various factors have affected them. This paper presents various approaches using Transformers and Large Language Models and their ensembles, along with their performance, advantages and drawbacks, for several SMM4H'24 tasks: classifying texts on the impact of nature and outdoor spaces on the author's mental health (Task 3), binary classification of tweets reporting their children's health disorders such as asthma, autism, ADHD and speech disorders (Task 5), and binary classification of users self-reporting their age (Task 6).
https://arxiv.org/abs/2410.15998
With the advancements in open-source models, training (or finetuning) models on custom datasets has become a crucial part of developing solutions which are tailored to specific industrial or open-source applications. Yet, there is no single tool which simplifies the process of training across different types of modalities or tasks. We introduce AutoTrain (aka AutoTrain Advanced) -- an open-source, no code tool/library which can be used to train (or finetune) models for different kinds of tasks such as: large language model (LLM) finetuning, text classification/regression, token classification, sequence-to-sequence task, finetuning of sentence transformers, visual language model (VLM) finetuning, image classification/regression and even classification and regression tasks on tabular data. AutoTrain Advanced is an open-source library providing best practices for training models on custom datasets. The library is available at this https URL. AutoTrain can be used in fully local mode or on cloud machines and works with tens of thousands of models shared on Hugging Face Hub and their variations.
https://arxiv.org/abs/2410.15735
Objective: Recognizing diseases from discharge letters is crucial for cohort selection and epidemiological analyses, as this is the only type of data consistently produced across hospitals. This is a classic document classification problem, typically requiring supervised learning. However, manual annotation of large datasets of discharge letters is uncommon since it is extremely time-consuming. We propose a novel weakly-supervised pipeline to recognize diseases from Italian discharge letters. Methods: Our Natural Language Processing pipeline is based on a fine-tuned version of the Italian Umberto model. The pipeline extracts diagnosis-related sentences from a subset of letters and applies a two-level clustering using the embeddings generated by the fine-tuned Umberto model. These clusters are summarized and those mapped to the diseases of interest are selected as weak labels. Finally, the same BERT-based model is trained using these weak labels to detect the targeted diseases. Results: A case study on the identification of bronchiolitis with 33,176 Italian discharge letters from 44 hospitals in the Veneto Region shows the potential of our method, with an AUC of 77.7% and an F1-score of 75.1% on manually annotated labels, improving on other non-supervised methods and with only a limited loss compared to fully supervised methods. Results are robust to the cluster selection, and the identified clusters highlight the potential to recognize a variety of diseases. Conclusions: This study demonstrates the feasibility of diagnosis identification from Italian discharge letters in the absence of labelled data. Our pipeline showed strong performance and robustness, and its flexibility allows for easy adaptation to various diseases. This approach offers a scalable solution for clinical text classification, reducing the need for manual annotation while maintaining good accuracy.
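The weak-labelling step can be pictured roughly as below: embed diagnosis-related sentences, cluster them in two levels, and treat sentences falling in clusters mapped to the disease of interest as weak positives. The clustering algorithm, cluster counts, and cluster-to-disease mapping are placeholders for the paper's fine-tuned Umberto embeddings and cluster summarization.

```python
# Sketch of weak-label generation via two-level clustering of sentence embeddings.
# Cluster counts and the cluster -> disease mapping are placeholders.
import numpy as np
from sklearn.cluster import KMeans

def two_level_clusters(embeddings: np.ndarray, n_top=10, n_sub=5, seed=0):
    top = KMeans(n_clusters=n_top, random_state=seed).fit_predict(embeddings)
    sub = np.empty(len(embeddings), dtype=int)
    for c in range(n_top):
        idx = np.where(top == c)[0]
        k = min(n_sub, len(idx))
        sub[idx] = c * n_sub + KMeans(n_clusters=k, random_state=seed).fit_predict(embeddings[idx])
    return sub

def sentence_weak_labels(embeddings, clusters_of_interest):
    """Sentences whose cluster is mapped to the target disease receive weak label 1."""
    sub = two_level_clusters(embeddings)
    return np.isin(sub, list(clusters_of_interest)).astype(int)
```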
https://arxiv.org/abs/2410.15051
Graph contrastive learning (GCL) has been widely applied to text classification tasks due to its ability to generate self-supervised signals from unlabeled data, thus facilitating model training. However, existing GCL-based text classification methods often suffer from negative sampling bias, where similar nodes are incorrectly paired as negative pairs. This can lead to over-clustering, where instances of the same class are divided into different clusters. To address the over-clustering issue, we propose ClusterText, an innovative GCL-based method using cluster-refined negative sampling for semi-supervised text classification. Firstly, we combine the pre-trained model BERT with graph neural networks to learn text representations. Secondly, we introduce a clustering refinement strategy, which clusters the learned text representations to obtain pseudo labels. For each text node, its negative sample set is drawn from different clusters. Additionally, we propose a self-correction mechanism to mitigate the loss of true negative samples caused by clustering inconsistency: by calculating the Euclidean distance between each text node and other nodes within the same cluster, distant nodes are still selected as negative samples. Our proposed ClusterText scales well computationally, as it can effectively extract important information from large amounts of data. Experimental results demonstrate the superiority of ClusterText in text classification tasks.
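The cluster-refined negative sampling together with the self-correction step could be sketched roughly as follows: negatives are drawn from other clusters, and same-cluster nodes that lie far away in Euclidean distance are also kept as negatives. The cluster count, sample size, and distance quantile are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def refined_negatives(embeddings, n_clusters=20, n_neg=10, distance_quantile=0.9, seed=0):
    """For each node: sample negatives from other clusters, plus same-cluster nodes whose
    Euclidean distance exceeds a quantile (the self-correction step)."""
    rng = np.random.default_rng(seed)
    assign = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
    negatives = []
    for i in range(len(embeddings)):
        other = np.where(assign != assign[i])[0]
        sampled = rng.choice(other, size=min(n_neg, len(other)), replace=False)
        same = np.where((assign == assign[i]) & (np.arange(len(embeddings)) != i))[0]
        if len(same):
            dists = np.linalg.norm(embeddings[same] - embeddings[i], axis=1)
            far = same[dists > np.quantile(dists, distance_quantile)]
            sampled = np.concatenate([sampled, far])
        negatives.append(sampled)
    return negatives
```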
https://arxiv.org/abs/2410.18130
Parameter-efficient fine-tuning (PEFT) can bridge the gap between large language models (LLMs) and downstream tasks. However, PEFT has been proven vulnerable to malicious attacks. Research indicates that poisoned LLMs, even after PEFT, retain the capability to activate internalized backdoors when input samples contain predefined triggers. In this paper, we introduce a novel weak-to-strong unlearning algorithm to defend against backdoor attacks based on feature alignment knowledge distillation, named W2SDefense. Specifically, we first train a small-scale language model through full-parameter fine-tuning to serve as the clean teacher model. Then, this teacher model guides the large-scale poisoned student model in unlearning the backdoor, leveraging PEFT. Theoretical analysis suggests that W2SDefense has the potential to enhance the student model's ability to unlearn backdoor features, preventing the activation of the backdoor. We conduct experiments on text classification tasks involving three state-of-the-art language models and three different backdoor attack algorithms. Our empirical results demonstrate the outstanding performance of W2SDefense in defending against backdoor attacks without compromising model performance.
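The feature-alignment distillation at the core of such a defense can be written, in generic form, as a task loss plus an alignment term between student and (clean) teacher hidden features; the sketch below is that generic form, with the linear projection an assumption to handle mismatched hidden sizes, not the specific W2SDefense objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignmentDistillation(nn.Module):
    """Generic feature-alignment KD: task loss + MSE between (projected) student features
    and clean-teacher features. The projection handles mismatched hidden sizes."""
    def __init__(self, student_dim: int, teacher_dim: int, alpha: float = 0.5):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.alpha = alpha

    def forward(self, student_logits, labels, student_feats, teacher_feats):
        task = F.cross_entropy(student_logits, labels)
        align = F.mse_loss(self.proj(student_feats), teacher_feats.detach())
        return (1 - self.alpha) * task + self.alpha * align
```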
https://arxiv.org/abs/2410.14425