Distinguishing in- and out-of-distribution (OOD) inputs is crucial for reliable deployment of classification systems. However, OOD data is typically unavailable or difficult to collect, posing a significant challenge for accurate OOD detection. In this work, we present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies, eliminating the dependency on any external OOD data source. We study the efficacy of our method on classical text classification tasks such as toxicity detection and sentiment classification as well as classification tasks arising in LLM development and deployment, such as training a reward model for RLHF and detecting misaligned generations. Extensive experiments on nine InD-OOD dataset pairs and various model sizes show that our approach dramatically lowers false positive rates (achieving a perfect zero in some cases) while maintaining high accuracy on in-distribution tasks, outperforming baseline methods by a significant margin.
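A minimal sketch of the general recipe this abstract describes, not the paper's exact pipeline: prompt an LLM for texts unrelated to the in-distribution task, treat them as an extra "rejection" class during training, and threshold that class's probability at test time. The function names and the prompt are illustrative assumptions.

```python
# Illustrative sketch (not the paper's exact method): LLM-generated texts act as
# synthetic OOD proxies, labeled with an extra rejection class.
from typing import Callable, List, Tuple

def build_training_set(
    ind_texts: List[str],
    ind_labels: List[int],
    num_ind_classes: int,
    llm_generate: Callable[[str, int], List[str]],  # hypothetical LLM wrapper
    n_proxies: int = 1000,
) -> Tuple[List[str], List[int]]:
    prompt = ("Write short texts that are clearly unrelated to the following "
              "task domain, varying topic and style as much as possible.")
    ood_proxies = llm_generate(prompt, n_proxies)   # synthetic OOD proxies
    ood_label = num_ind_classes                     # index of the extra class
    texts = ind_texts + ood_proxies
    labels = ind_labels + [ood_label] * len(ood_proxies)
    return texts, labels

def is_ood(prob_vector, ood_label: int, threshold: float = 0.5) -> bool:
    # Flag an input as OOD when the rejection class receives enough probability.
    return prob_vector[ood_label] >= threshold
```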
https://arxiv.org/abs/2502.03323
Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation. It provides high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL. Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. DEUCE performs well in selecting class-balanced and hard representative data by dual-diversity and informativeness. Experiments on six NLP datasets demonstrate the superiority and efficiency of DEUCE.
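A hedged sketch of DEUCE's first step as described above: use a pretrained model to obtain textual representations, class predictions, and predictive uncertainty (here, entropy of the predicted distribution). The zero-shot pipeline and model names are illustrative stand-ins, not the paper's setup; the outputs would then feed the Dual-Neighbor Graph construction and density-based clustering.

```python
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer

texts = ["the plot was dull", "a delightful surprise", "service was slow"]
candidate_labels = ["positive", "negative"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts)                        # textual representations

zero_shot = pipeline("zero-shot-classification",
                     model="facebook/bart-large-mnli")
preds = zero_shot(texts, candidate_labels)

# Re-order scores into a fixed (alphabetical) label order for each text.
probs = np.array([[s for _, s in sorted(zip(r["labels"], r["scores"]))]
                  for r in preds])
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)    # predictive uncertainty
pred_class = probs.argmax(axis=1)                         # class predictions
```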
https://arxiv.org/abs/2502.00305
Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs high memory overhead, especially for large transformer-based models, such as LLMs. While existing methods may reduce certain parts of the memory required for fine-tuning, they still require caching all intermediate activations computed in the forward pass to update weights during the backward pass. In this work, we develop TokenTune, a method to reduce memory usage, specifically the memory to store intermediate activations, in the fine-tuning of transformer-based models. During the backward pass, TokenTune approximates the gradient computation by backpropagating through just a subset of input tokens. Thus, with TokenTune, only a subset of intermediate activations are cached during the forward pass. Also, TokenTune can be easily combined with existing methods like LoRA, further reducing the memory cost. We evaluate our approach on pre-trained transformer models with up to billions of parameters, considering the performance on multiple downstream tasks such as text classification and question answering in a few-shot learning setup. Overall, TokenTune achieves performance on par with full fine-tuning or representative memory-efficient fine-tuning methods, while greatly reducing the memory footprint, especially when combined with other methods with complementary memory reduction mechanisms. We hope that our approach will facilitate the fine-tuning of large transformers, in specializing them for specific domains or co-training them with other neural components from a larger system. Our code is available at this https URL.
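A minimal sketch of the core idea, not the authors' implementation: pick a random subset of token positions and detach the rest, so the backward pass only propagates through the selected tokens. A faithful implementation would additionally avoid caching the intermediate activations of the detached positions.

```python
import torch

def token_subset_forward(hidden_states: torch.Tensor, keep_ratio: float = 0.3):
    # hidden_states: (batch, seq_len, dim) output of an embedding or encoder layer
    batch, seq_len, _ = hidden_states.shape
    k = max(1, int(keep_ratio * seq_len))
    keep = torch.zeros(batch, seq_len, 1, dtype=torch.bool,
                       device=hidden_states.device)
    for b in range(batch):
        idx = torch.randperm(seq_len, device=hidden_states.device)[:k]
        keep[b, idx] = True
    # Gradients flow only through the kept positions; the rest are detached.
    return torch.where(keep, hidden_states, hidden_states.detach())
```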
https://arxiv.org/abs/2501.18824
Accurate sentiment analysis of texts is crucial for a variety of applications, such as understanding customer feedback, monitoring market trends, and detecting public sentiment. However, manually annotating large sentiment corpora for supervised learning is labor-intensive and time-consuming. Therefore, it is essential and effective to develop a semi-supervised method for the sentiment analysis task. Although some methods have been proposed for semi-supervised text classification, they rely on the intrinsic information within the unlabeled data and on the learning capability of the NLP model, so they generalize poorly to the sentiment analysis scenario and may be prone to overfitting. Inspired by the ability of pretrained Large Language Models (LLMs) to follow instructions and generate coherent text, we propose a Semantic Consistency Regularization with Large Language Models (SCR) framework for semi-supervised sentiment analysis. We introduce two prompting strategies to semantically enhance unlabeled text using LLMs. The first is Entity-based Enhancement (SCR-EE), which extracts entities and numerical information and queries the LLM to reconstruct the textual information. The second is Concept-based Enhancement (SCR-CE), which directly queries the LLM with the original sentence for semantic reconstruction. Subsequently, the LLM-augmented data is used in a consistency loss with confidence thresholding, which preserves high-quality agreement samples to provide additional supervision signals during training. Furthermore, to fully utilize the uncertain unlabeled data samples, we propose a class re-assembling strategy inspired by the class space shrinking theorem. Experiments show that our method achieves remarkable performance compared with prior semi-supervised methods.
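A hedged sketch of a consistency loss with confidence thresholding in the spirit of SCR (the exact formulation may differ): pseudo-label the original unlabeled text, keep only confident predictions, and penalize disagreement with the prediction on the LLM-reconstructed version of the same text.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_orig: torch.Tensor,
                     logits_aug: torch.Tensor,
                     threshold: float = 0.9) -> torch.Tensor:
    probs = F.softmax(logits_orig.detach(), dim=-1)   # no gradient through pseudo-labels
    confidence, pseudo_labels = probs.max(dim=-1)
    mask = (confidence >= threshold).float()          # keep confident samples only
    per_sample = F.cross_entropy(logits_aug, pseudo_labels, reduction="none")
    return (per_sample * mask).sum() / mask.sum().clamp(min=1.0)
```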
https://arxiv.org/abs/2501.17598
This survey provides an overview of the challenges of misspellings in natural language processing (NLP). While often unintentional, misspellings have become ubiquitous in digital communication, especially with the proliferation of Web 2.0, user-generated content, and informal text mediums such as social media, blogs, and forums. Even though humans can generally interpret misspelled text, NLP models frequently struggle to handle it, causing a decline in performance on common tasks like text classification and machine translation. In this paper, we reconstruct a history of misspellings as a scientific problem. We then discuss the latest advancements to address the challenge of misspellings in NLP. The main strategies to mitigate the effect of misspellings include data augmentation, double-step, character-order-agnostic, and tuple-based methods, among others. This survey also examines dedicated data challenges and competitions intended to spur progress in the field. Critical safety and ethical concerns are also examined, for example, the deliberate use of misspellings to inject malicious messages and hate speech on social networks. Furthermore, the survey explores psycholinguistic perspectives on how humans process misspellings, potentially informing innovative computational techniques for text normalization and representation. Finally, the misspelling-related challenges and opportunities associated with modern large language models are also analyzed, including benchmarks, datasets, and the performance of the most prominent language models against misspellings. This survey aims to be an exhaustive resource for researchers seeking to mitigate the impact of misspellings in the rapidly evolving landscape of NLP.
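An illustrative routine of the data-augmentation kind surveyed above (not taken from the survey itself): inject character-level misspellings (swaps, drops, duplications) into training text so a downstream model sees noisy spellings during training.

```python
import random

def inject_misspellings(text: str, rate: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(text)
    out, i = [], 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["swap", "drop", "dup"])
            if op == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
                out.extend([chars[i + 1], c]); i += 2; continue
            if op == "drop":
                i += 1; continue
            if op == "dup":
                out.extend([c, c]); i += 1; continue
        out.append(c); i += 1
    return "".join(out)

print(inject_misspellings("misspellings are ubiquitous in informal text", rate=0.15))
```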
https://arxiv.org/abs/2501.16836
Classification tasks are widely investigated in the In-Context Learning (ICL) paradigm. However, current efforts are evaluated on disjoint benchmarks and settings, and their performance is strongly influenced by incidental variables such as prompt templates, data sampling, and instructions, which leads to significant inconsistencies in the results reported across the literature and prevents fair comparison or meta-analysis across papers. Therefore, this paper proposes a standardized and easy-to-use evaluation toolkit (StaICC) for in-context classification. For the standard classification task, we provide StaICC-Normal, which selects 10 widely used datasets and generates prompts in a fixed form to mitigate the variance among experiment implementations. To enrich the usage of our benchmark, we also provide a sub-benchmark, StaICC-Diag, for diagnosing ICL from several aspects, aiming at more robust inference processing.
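A sketch of the kind of fixed-form prompt a toolkit like StaICC standardizes (the toolkit's actual templates may differ): demonstrations and the query are always rendered with the same template and label verbalizers, so results stay comparable across implementations.

```python
from typing import List, Tuple

TEMPLATE = "Text: {text}\nLabel: {label}\n\n"

def build_icl_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    prompt = "".join(TEMPLATE.format(text=t, label=l) for t, l in demos)
    return prompt + f"Text: {query}\nLabel:"

demos = [("the movie was wonderful", "positive"),
         ("a complete waste of time", "negative")]
print(build_icl_prompt(demos, "an uneven but charming film"))
```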
https://arxiv.org/abs/2501.15708
Transformer-based models have achieved remarkable results in natural language processing (NLP) tasks such as text classification and machine translation. However, their computational complexity and resource demands pose challenges for scalability and accessibility. This research proposes a hybrid quantum-classical transformer model that integrates a quantum-enhanced attention mechanism to address these limitations. By leveraging quantum kernel similarity and variational quantum circuits (VQC), the model captures intricate token dependencies while improving computational efficiency. Experimental results on the IMDb dataset demonstrate that the quantum-enhanced model outperforms the classical baseline across all key metrics, achieving a 1.5% improvement in accuracy (65.5% vs. 64%) along with gains in precision, recall, and F1 score. Statistical significance tests validate these improvements, highlighting the robustness of the quantum approach. These findings illustrate the transformative potential of quantum-enhanced attention mechanisms in optimizing NLP architectures for real-world applications.
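A hedged sketch of a fidelity-style quantum kernel similarity, the kind of primitive a quantum-enhanced attention mechanism can use in place of dot products; the paper's exact circuit, VQC parameters, and integration into the transformer are not reproduced here, and the PennyLane setup below is an illustrative choice.

```python
import numpy as np
import pennylane as qml

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def fidelity_kernel(x1, x2):
    qml.AngleEmbedding(x1, wires=range(n_qubits))
    qml.adjoint(qml.AngleEmbedding)(x2, wires=range(n_qubits))
    return qml.probs(wires=range(n_qubits))

def quantum_similarity(x1, x2):
    # Overlap |<phi(x2)|phi(x1)>|^2, read off as the all-zeros probability.
    return fidelity_kernel(x1, x2)[0]

# Attention-style weights over a few 4-dimensional token features.
tokens = np.random.uniform(0, np.pi, size=(3, n_qubits))
scores = np.array([[quantum_similarity(q, k) for k in tokens] for q in tokens])
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
```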
https://arxiv.org/abs/2501.15630
Idiom detection using Natural Language Processing (NLP) is the computerized process of recognizing figurative expressions within a text that convey meanings beyond the literal interpretation of the words. While idiom detection has seen significant progress across various languages, the Kurdish language faces a considerable research gap in this area despite the importance of idioms in tasks like machine translation and sentiment analysis. This study addresses idiom detection in Sorani Kurdish by approaching it as a text classification task using deep learning techniques. To tackle this, we developed a dataset containing 10,580 sentences embedding 101 Sorani Kurdish idioms across diverse contexts. Using this dataset, we developed and evaluated three deep learning models: KuBERT-based transformer sequence classification, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM model with an attention mechanism. The evaluations revealed that the transformer model, the fine-tuned BERT, consistently outperformed the others, achieving nearly 99% accuracy while the RCNN achieved 96.5% and the BiLSTM 80%. These results highlight the effectiveness of Transformer-based architectures in low-resource languages like Kurdish. This research provides a dataset, three optimized models, and insights into idiom detection, laying a foundation for advancing Kurdish NLP.
https://arxiv.org/abs/2501.14528
Detecting AI-generated text, especially in short-context documents, is difficult because there is not enough context for accurate classification. This paper presents a new teacher-student model that uses domain adaptation and data augmentation to solve these problems. The teacher model, which combines DeBERTa-v3-large and Mamba-790m, learns semantic knowledge through domain-specific fine-tuning. The student model handles short-context text more efficiently. The system uses a Mean Squared Error (MSE) loss function to guide the student's learning, improving both accuracy and efficiency. Also, data augmentation methods like spelling correction and error injection make the model more robust. Experimental results show that this approach works better than baseline methods, proving its usefulness for real-time AI-generated text detection and other text classification tasks.
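A minimal sketch of the distillation objective described above: the student's logits are regressed onto the teacher's with a Mean Squared Error loss. It assumes HuggingFace-style models that return `.logits`; the DeBERTa/Mamba teacher combination and the augmentation pipeline are beyond this sketch.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer):
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits   # teacher is frozen
    student_logits = student(**batch).logits
    loss = F.mse_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```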
https://arxiv.org/abs/2501.14288
Attention maps in neural models for NLP are appealing to explain the decision made by a model, hopefully emphasizing words that justify the decision. While many empirical studies hint that attention maps can provide such justification from the analysis of sound examples, only a few assess the plausibility of explanations based on attention maps, i.e., the usefulness of attention maps for humans to understand the decision. These studies furthermore focus on text classification. In this paper, we report on a preliminary assessment of attention maps in a sentence comparison task, namely natural language inference. We compare the cross-attention weights between two RNN encoders with human-based and heuristic-based annotations on the eSNLI corpus. We show that the heuristic reasonably correlates with human annotations and can thus facilitate evaluation of plausible explanations in sentence comparison tasks. Raw attention weights however remain only loosely related to a plausible explanation.
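A small sketch of the kind of plausibility check described above: correlate token-level attention weights with binary human rationale annotations (e.g., eSNLI highlight masks), here using Spearman's rank correlation. The toy values are illustrative only.

```python
import numpy as np
from scipy.stats import spearmanr

attention = np.array([0.02, 0.40, 0.05, 0.35, 0.08, 0.10])  # one weight per token
human_mask = np.array([0, 1, 0, 1, 0, 0])                    # 1 = highlighted by annotator

rho, pval = spearmanr(attention, human_mask)
print(f"Spearman correlation: {rho:.3f} (p={pval:.3f})")
```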
https://arxiv.org/abs/2501.13735
This paper studies a text classification algorithm based on an improved Transformer, aiming to increase the model's performance and efficiency on text classification tasks. To address the shortcomings of the traditional Transformer model in capturing deep semantic relationships and in computational complexity, this paper introduces a multi-level attention mechanism and a contrastive learning strategy. The multi-level attention mechanism effectively models the global semantics and local features in the text by combining global attention with local attention; the contrastive learning strategy enhances the model's ability to distinguish between categories by constructing positive and negative sample pairs, improving the classification effect. In addition, to improve the training and inference efficiency of the model on large-scale text data, this paper designs a lightweight module that optimizes the feature transformation process and reduces the computational cost. Experimental results show that the improved Transformer model outperforms comparative models such as BiLSTM, CNN, the standard Transformer, and BERT in terms of classification accuracy, F1 score, and recall, showing stronger semantic representation ability and generalization performance. The method proposed in this paper provides a new idea for algorithm optimization in the field of text classification and has good application potential and practical value. Future work will focus on studying the performance of this model on multi-category imbalanced datasets and cross-domain tasks and explore its integration with other techniques.
https://arxiv.org/abs/2501.13467
The attention mechanism contributes to many of the recent advances in machine learning for natural language processing. It also produces an attention map that shows the proportional influence of each input token on the model's decision. Empirical studies postulate that attention maps can be provided as an explanation for model output. However, it remains questionable whether this explanation helps ordinary users understand and accept the model output (the plausibility of the explanation). Recent studies show that attention weights in RNN encoders are hardly plausible because they are spread across input tokens. We therefore propose three additional constraints on the learning objective to improve the plausibility of the attention map: regularization to increase attention weight sparsity, semi-supervision that supervises the map with a heuristic, and supervision with human annotations. Results show that all three techniques improve attention-map plausibility to some degree. We also observe that specific instructions for human annotation can have a negative effect on classification performance. Beyond the attention map, experiments on text classification tasks also show that, regardless of which constraint brings the gain, the contextualization layer plays a crucial role in finding the right space for identifying plausible tokens.
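A hedged sketch of the first constraint mentioned above, sparsity regularization: add the entropy of the attention distribution to the loss so that peaked (low-entropy) attention maps are preferred. The weighting and the semi-/fully supervised variants are not shown.

```python
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # attn: (batch, seq_len), rows sum to 1
    return -(attn * (attn + eps).log()).sum(dim=-1).mean()

def total_loss(task_loss: torch.Tensor, attn: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    # Lower attention entropy -> sparser, potentially more plausible maps.
    return task_loss + lam * attention_entropy(attn)
```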
https://arxiv.org/abs/2501.12775
This study presents a comprehensive review of the potential of multimodal deep learning (DL) in medical diagnosis, using COVID-19 as a case example. Motivated by the success of artificial intelligence applications during the COVID-19 pandemic, this research aims to uncover the capabilities of DL in disease screening, prediction, and classification, and to derive insights that enhance the resilience, sustainability, and inclusiveness of science, technology, and innovation systems. Adopting a systematic approach, we investigate the fundamental methodologies, data sources, preprocessing steps, and challenges encountered in various studies and implementations. We explore the architecture of deep learning models, emphasising their data-specific structures and underlying algorithms. Subsequently, we compare different deep learning strategies utilised in COVID-19 analysis, evaluating them based on methodology, data, performance, and prerequisites for future research. By examining diverse data types and diagnostic modalities, this research contributes to scientific understanding and knowledge of the multimodal application of DL and its effectiveness in diagnosis. We have implemented and analysed 11 deep learning models using COVID-19 image, text, and speech (i.e., cough) data. Our analysis revealed that the MobileNet model achieved the highest accuracy of 99.97% for COVID-19 image data and 93.73% for speech data (i.e., cough). However, the BiGRU model demonstrated superior performance in COVID-19 text classification with an accuracy of 99.89%. The broader implications of this research suggest potential benefits for other domains and disciplines that could leverage deep learning techniques for image, text, and speech analysis.
https://arxiv.org/abs/2501.09506
Short text classification has gained significant attention in the information age due to its prevalence and real-world applications. Recent advancements in graph learning combined with contrastive learning have shown promising results in addressing the challenges of semantic sparsity and limited labeled data in short text classification. However, existing models have certain limitations. They rely on explicit data augmentation techniques to generate contrastive views, resulting in semantic corruption and noise. Additionally, these models only focus on learning the intrinsic consistency between the generated views, neglecting valuable discriminative information from other potential views. To address these issues, we propose a Simple graph contrastive learning framework for Short Text Classification (SimSTC). Our approach involves performing graph learning on multiple text-related component graphs to obtain multi-view text embeddings. Subsequently, we directly apply contrastive learning on these embeddings. Notably, our method eliminates the need for data augmentation operations to generate contrastive views while still leveraging the benefits of multi-view contrastive learning. Despite its simplicity, our model achieves outstanding performance, surpassing large language models on various datasets.
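A minimal sketch of multi-view contrastive learning over precomputed view embeddings (e.g., from different text-related component graphs): an InfoNCE-style loss treats the same text's embeddings in two views as positives. This illustrates the idea, not SimSTC's exact objective.

```python
import torch
import torch.nn.functional as F

def info_nce(view_a: torch.Tensor, view_b: torch.Tensor,
             temperature: float = 0.2) -> torch.Tensor:
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.T / temperature          # (N, N) cross-view similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # The i-th row's positive is the i-th column (same text, other view).
    return F.cross_entropy(logits, targets)
```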
https://arxiv.org/abs/2501.09219
Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and insufficient labeled samples in practical scenarios. We propose a novel model named MI-DELIGHT for short text classification in this work. Specifically, it first performs multi-source information (i.e., statistical information, linguistic information, and factual information) exploration to alleviate the sparsity issues. Then, the graph learning approach is adopted to learn the representation of short texts, which are presented in graph forms. Moreover, we introduce a dual-level (i.e., instance-level and cluster-level) contrastive learning auxiliary task to effectively capture different-grained contrastive information within massive unlabeled data. Meanwhile, previous models merely perform the main task and auxiliary tasks in parallel, without considering the relationship among tasks. Therefore, we introduce a hierarchical architecture to explicitly model the correlations between tasks. We conduct extensive experiments across various benchmark datasets, demonstrating that MI-DELIGHT significantly surpasses previous competitive models. It even outperforms popular large language models on several datasets.
https://arxiv.org/abs/2501.09214
Large Language Models (LLMs) like GPT-4o can help automate text classification tasks at low cost and scale. However, there are major concerns about the validity and reliability of LLM outputs. By contrast, human coding is generally more reliable but expensive to procure at scale. In this study, we propose a hybrid solution to leverage the strengths of both. We combine human-coded data and synthetic LLM-produced data to fine-tune a classical machine learning classifier, distilling both into a smaller BERT model. We evaluate our method on a human-coded test set as a validity measure for LLM output quality. In three experiments, we systematically vary LLM-generated samples' size, variety, and consistency, informed by best practices in LLM tuning. Our findings indicate that augmenting datasets with synthetic samples improves classifier performance, with optimal results achieved at an 80% synthetic to 20% human-coded data ratio. Lower temperature settings of 0.3, corresponding to less variability in LLM generations, produced more stable improvements but also limited model learning from augmented samples. In contrast, higher temperature settings (0.7 and above) introduced greater variability in performance estimates and, at times, lower performance. Hence, LLMs may produce more uniform output that classifiers overfit to earlier or produce more diverse output that runs the risk of deteriorating model performance through information irrelevant to the prediction task. Filtering out inconsistent synthetic samples did not enhance performance. We conclude that integrating human and LLM-generated data to improve text classification models in assessment offers a scalable solution that leverages both the accuracy of human coding and the variety of LLM outputs.
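A sketch of the data-mixing setup described above: combine human-coded and LLM-generated examples at a chosen synthetic ratio (the study's best result used roughly 80% synthetic to 20% human-coded data) before fine-tuning a smaller classifier such as BERT. The function and its arguments are illustrative.

```python
import random

def mix_datasets(human, synthetic, synthetic_ratio=0.8, total=10_000, seed=0):
    # human / synthetic: lists of (text, label) pairs
    rng = random.Random(seed)
    n_syn = int(total * synthetic_ratio)
    n_hum = total - n_syn
    mixed = rng.sample(synthetic, min(n_syn, len(synthetic))) + \
            rng.sample(human, min(n_hum, len(human)))
    rng.shuffle(mixed)
    return mixed
```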
https://arxiv.org/abs/2501.09126
Unlocking the potential of Large Language Models (LLMs) in data classification represents a promising frontier in natural language processing. In this work, we evaluate the performance of different LLMs against state-of-the-art deep-learning and machine-learning models in two classification scenarios: (i) the classification of employees' working locations based on job reviews posted online (multiclass classification), and (ii) the classification of news articles as fake or not (binary classification). Our analysis encompasses a diverse range of language models differing in size, quantization, and architecture. We explore the impact of alternative prompting techniques and evaluate the models based on the weighted F1-score. We also examine the trade-off between performance (F1-score) and time (inference response time) for each language model to provide a more nuanced understanding of each model's practical applicability. Our work reveals significant variations in model responses based on the prompting strategies. We find that LLMs, particularly Llama3 and GPT-4, can outperform traditional methods in complex classification tasks such as multiclass classification, though at the cost of longer inference times. In contrast, simpler ML models offer better performance-to-time trade-offs in simpler binary classification tasks.
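A sketch of the evaluation protocol described above: weighted F1 for quality and wall-clock inference time for cost, so each model can be placed on a performance-versus-time trade-off. `predict_fn` is a placeholder for any model, LLM-backed or classical.

```python
import time
from sklearn.metrics import f1_score

def evaluate(predict_fn, texts, gold_labels):
    start = time.perf_counter()
    preds = [predict_fn(t) for t in texts]
    elapsed = time.perf_counter() - start
    return {
        "weighted_f1": f1_score(gold_labels, preds, average="weighted"),
        "seconds_per_example": elapsed / len(texts),
    }
```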
https://arxiv.org/abs/2501.08457
Pre-trained transformer models such as BERT have shown massive gains across many text classification tasks. However, these models usually need enormous labeled data to achieve impressive performances. Obtaining labeled data is often expensive and time-consuming, whereas collecting unlabeled data using some heuristics is relatively much cheaper for any task. Therefore, this paper proposes a method that encapsulates reinforcement learning-based text generation and semi-supervised adversarial learning approaches in a novel way to improve the model's performance. Our method READ, Reinforcement-based Adversarial learning, utilizes an unlabeled dataset to generate diverse synthetic text through reinforcement learning, improving the model's generalization capability using adversarial learning. Our experimental results show that READ outperforms the existing state-of-the-art methods on multiple datasets.
https://arxiv.org/abs/2501.08035
This work invokes the notion of $f$-divergence to introduce a novel upper bound on the Bayes error rate of a general classification task. We show that the proposed bound can be computed by sampling from the output of a parameterized model. Using this practical interpretation, we introduce the Bayes optimal learning threshold (BOLT) loss whose minimization enforces a classification model to achieve the Bayes error rate. We validate the proposed loss for image and text classification tasks, considering MNIST, Fashion-MNIST, CIFAR-10, and IMDb datasets. Numerical experiments demonstrate that models trained with BOLT achieve performance on par with or exceeding that of cross-entropy, particularly on challenging datasets. This highlights the potential of BOLT in improving generalization.
https://arxiv.org/abs/2501.07754
Training and fine-tuning deep learning models, especially large language models (LLMs), on limited and imbalanced datasets poses substantial challenges. These issues often result in poor generalization, where models overfit to dominant classes and underperform on minority classes, leading to biased predictions and reduced robustness in real-world applications. To overcome these challenges, we propose augmenting features in the embedding space by generating synthetic samples using a range of techniques. By upsampling underrepresented classes, this method improves model performance and alleviates data imbalance. We validate the effectiveness of this approach across multiple open-source text classification benchmarks, demonstrating its potential to enhance model robustness and generalization in imbalanced data scenarios.
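A hedged sketch of one way to upsample minority classes in the embedding space: interpolate between pairs of same-class embeddings (a SMOTE-like scheme). The abstract refers to a range of such techniques; this shows only one illustrative variant.

```python
import torch

def upsample_minority(embeddings: torch.Tensor, labels: torch.Tensor,
                      minority_class: int, n_new: int, alpha: float = 0.5):
    idx = (labels == minority_class).nonzero(as_tuple=True)[0]
    base = embeddings[idx]                            # minority-class embeddings
    i = torch.randint(0, base.size(0), (n_new,))
    j = torch.randint(0, base.size(0), (n_new,))
    lam = torch.rand(n_new, 1) * alpha                # interpolation coefficients
    synthetic = (1 - lam) * base[i] + lam * base[j]   # convex combinations
    new_labels = torch.full((n_new,), minority_class, dtype=labels.dtype)
    return torch.cat([embeddings, synthetic]), torch.cat([labels, new_labels])
```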
https://arxiv.org/abs/2501.06434