This study presents a comprehensive review of the potential of multimodal deep learning (DL) in medical diagnosis, using COVID-19 as a case example. Motivated by the success of artificial intelligence applications during the COVID-19 pandemic, this research aims to uncover the capabilities of DL in disease screening, prediction, and classification, and to derive insights that enhance the resilience, sustainability, and inclusiveness of science, technology, and innovation systems. Adopting a systematic approach, we investigate the fundamental methodologies, data sources, preprocessing steps, and challenges encountered in various studies and implementations. We explore the architecture of deep learning models, emphasising their data-specific structures and underlying algorithms. Subsequently, we compare different deep learning strategies utilised in COVID-19 analysis, evaluating them based on methodology, data, performance, and prerequisites for future research. By examining diverse data types and diagnostic modalities, this research contributes to scientific understanding and knowledge of the multimodal application of DL and its effectiveness in diagnosis. We have implemented and analysed 11 deep learning models using COVID-19 image, text, and speech (i.e., cough) data. Our analysis revealed that the MobileNet model achieved the highest accuracy of 99.97% for COVID-19 image data and 93.73% for speech data (i.e., cough). However, the BiGRU model demonstrated superior performance in COVID-19 text classification with an accuracy of 99.89%. The broader implications of this research suggest potential benefits for other domains and disciplines that could leverage deep learning techniques for image, text, and speech analysis.
https://arxiv.org/abs/2501.09506
Short text classification has gained significant attention in the information age due to its prevalence and real-world applications. Recent advancements in graph learning combined with contrastive learning have shown promising results in addressing the challenges of semantic sparsity and limited labeled data in short text classification. However, existing models have certain limitations. They rely on explicit data augmentation techniques to generate contrastive views, resulting in semantic corruption and noise. Additionally, these models only focus on learning the intrinsic consistency between the generated views, neglecting valuable discriminative information from other potential views. To address these issues, we propose a Simple graph contrastive learning framework for Short Text Classification (SimSTC). Our approach involves performing graph learning on multiple text-related component graphs to obtain multi-view text embeddings. Subsequently, we directly apply contrastive learning on these embeddings. Notably, our method eliminates the need for data augmentation operations to generate contrastive views while still leveraging the benefits of multi-view contrastive learning. Despite its simplicity, our model achieves outstanding performance, surpassing large language models on various datasets.
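SimSTC's full multi-graph construction is described in the paper; as a minimal, hedged sketch of the core idea — contrastive learning applied directly across pre-computed views, with no augmentation step — the following computes an InfoNCE-style loss between two views of the same batch of texts. The function names and toy embeddings are ours, not the paper's.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def multi_view_infonce(view_a, view_b, temperature=0.5):
    """InfoNCE across two views: embedding i in view_a is pulled toward
    embedding i in view_b (the same text) and pushed away from the other
    texts in view_b. No augmented copies are generated."""
    n = len(view_a)
    loss = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / n
```

When the two views agree (each text's embeddings align across views), the loss is lower than when they are shuffled, which is the training signal the framework exploits.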
https://arxiv.org/abs/2501.09219
Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and insufficient labeled samples in practical scenarios. We propose a novel model named MI-DELIGHT for short text classification in this work. Specifically, it first performs multi-source information (i.e., statistical information, linguistic information, and factual information) exploration to alleviate the sparsity issues. Then, the graph learning approach is adopted to learn the representation of short texts, which are represented in graph form. Moreover, we introduce a dual-level (i.e., instance-level and cluster-level) contrastive learning auxiliary task to effectively capture different-grained contrastive information within massive unlabeled data. Meanwhile, previous models merely perform the main task and auxiliary tasks in parallel, without considering the relationships among tasks. Therefore, we introduce a hierarchical architecture to explicitly model the correlations between tasks. We conduct extensive experiments across various benchmark datasets, demonstrating that MI-DELIGHT significantly surpasses previous competitive models. It even outperforms popular large language models on several datasets.
https://arxiv.org/abs/2501.09214
Large Language Models (LLMs) like GPT-4o can help automate text classification tasks at low cost and scale. However, there are major concerns about the validity and reliability of LLM outputs. By contrast, human coding is generally more reliable but expensive to procure at scale. In this study, we propose a hybrid solution to leverage the strengths of both. We combine human-coded data and synthetic LLM-produced data to fine-tune a classical machine learning classifier, distilling both into a smaller BERT model. We evaluate our method on a human-coded test set as a validity measure for LLM output quality. In three experiments, we systematically vary LLM-generated samples' size, variety, and consistency, informed by best practices in LLM tuning. Our findings indicate that augmenting datasets with synthetic samples improves classifier performance, with optimal results achieved at an 80% synthetic to 20% human-coded data ratio. Lower temperature settings of 0.3, corresponding to less variability in LLM generations, produced more stable improvements but also limited model learning from augmented samples. In contrast, higher temperature settings (0.7 and above) introduced greater variability in performance estimates and, at times, lower performance. Hence, LLMs may produce more uniform output that classifiers overfit to earlier or produce more diverse output that runs the risk of deteriorating model performance through information irrelevant to the prediction task. Filtering out inconsistent synthetic samples did not enhance performance. We conclude that integrating human and LLM-generated data to improve text classification models in assessment offers a scalable solution that leverages both the accuracy of human coding and the variety of LLM outputs.
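As a rough illustration of the paper's best-performing mixture (about 80% synthetic to 20% human-coded data), the sketch below assembles such a training set before fine-tuning a smaller classifier. The function name and sampling details are our assumptions, not the authors' code.

```python
import random

def mix_training_data(human, synthetic, synthetic_fraction=0.8, seed=0):
    """Combine all human-coded examples with enough synthetic examples to
    reach the target synthetic fraction of the final training set."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    # Solve n_synth / (n_synth + len(human)) == synthetic_fraction.
    n_synth = round(len(human) * synthetic_fraction / (1.0 - synthetic_fraction))
    rng = random.Random(seed)
    sampled = rng.sample(synthetic, min(n_synth, len(synthetic)))
    mixed = list(human) + sampled
    rng.shuffle(mixed)
    return mixed
```

With 20 human-coded examples and a 0.8 synthetic fraction, this yields a 100-example training set (80 synthetic, 20 human), matching the ratio the study found optimal.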
https://arxiv.org/abs/2501.09126
Unlocking the potential of Large Language Models (LLMs) in data classification represents a promising frontier in natural language processing. In this work, we evaluate the performance of different LLMs in comparison with state-of-the-art deep-learning and machine-learning models, in two different classification scenarios: i) the classification of employees' working locations based on job reviews posted online (multiclass classification), and ii) the classification of news articles as fake or not (binary classification). Our analysis encompasses a diverse range of language models differing in size, quantization, and architecture. We explore the impact of alternative prompting techniques and evaluate the models based on the weighted F1-score. Also, we examine the trade-off between performance (F1-score) and time (inference response time) for each language model to provide a more nuanced understanding of each model's practical applicability. Our work reveals significant variations in model responses based on the prompting strategies. We find that LLMs, particularly Llama3 and GPT-4, can outperform traditional methods in complex classification tasks, such as multiclass classification, though at the cost of longer inference times. In contrast, simpler ML models offer better performance-to-time trade-offs in simpler binary classification tasks.
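The models above are compared on the weighted F1-score; for reference, a self-contained implementation (mirroring scikit-learn's `f1_score(..., average='weighted')`) is:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """F1 computed per class, then averaged with weights proportional to
    each class's support (its frequency in y_true)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support[c] / total) * f1
    return score
```

Weighting by support makes the metric robust to class imbalance, which matters when comparing multiclass and binary setups as this paper does.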
https://arxiv.org/abs/2501.08457
Pre-trained transformer models such as BERT have shown massive gains across many text classification tasks. However, these models usually need enormous labeled data to achieve impressive performances. Obtaining labeled data is often expensive and time-consuming, whereas collecting unlabeled data using some heuristics is relatively much cheaper for any task. Therefore, this paper proposes a method that encapsulates reinforcement learning-based text generation and semi-supervised adversarial learning approaches in a novel way to improve the model's performance. Our method READ, Reinforcement-based Adversarial learning, utilizes an unlabeled dataset to generate diverse synthetic text through reinforcement learning, improving the model's generalization capability using adversarial learning. Our experimental results show that READ outperforms the existing state-of-the-art methods on multiple datasets.
https://arxiv.org/abs/2501.08035
This work invokes the notion of $f$-divergence to introduce a novel upper bound on the Bayes error rate of a general classification task. We show that the proposed bound can be computed by sampling from the output of a parameterized model. Using this practical interpretation, we introduce the Bayes optimal learning threshold (BOLT) loss whose minimization enforces a classification model to achieve the Bayes error rate. We validate the proposed loss for image and text classification tasks, considering MNIST, Fashion-MNIST, CIFAR-10, and IMDb datasets. Numerical experiments demonstrate that models trained with BOLT achieve performance on par with or exceeding that of cross-entropy, particularly on challenging datasets. This highlights the potential of BOLT in improving generalization.
https://arxiv.org/abs/2501.07754
Training and fine-tuning deep learning models, especially large language models (LLMs), on limited and imbalanced datasets poses substantial challenges. These issues often result in poor generalization, where models overfit to dominant classes and underperform on minority classes, leading to biased predictions and reduced robustness in real-world applications. To overcome these challenges, we propose augmenting features in the embedding space by generating synthetic samples using a range of techniques. By upsampling underrepresented classes, this method improves model performance and alleviates data imbalance. We validate the effectiveness of this approach across multiple open-source text classification benchmarks, demonstrating its potential to enhance model robustness and generalization in imbalanced data scenarios.
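The abstract leaves the exact augmentation techniques open, so purely as one hedged example: SMOTE-style linear interpolation between minority-class embeddings, which upsamples the underrepresented class directly in the embedding space. The function name and scheme below are ours, not necessarily the authors'.

```python
import random

def interpolate_minority(embeddings, n_new, seed=0):
    """Create n_new synthetic embeddings by linear interpolation between
    random pairs of minority-class embeddings (a SMOTE-like scheme).
    Each synthetic point lies on the segment between two real points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(embeddings, 2)
        lam = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append([lam * x + (1 - lam) * y for x, y in zip(a, b)])
    return synthetic
```

Because interpolation stays within the convex hull of the minority class, the synthetic points are plausible without requiring any generative model.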
https://arxiv.org/abs/2501.06434
The rapid advancement of Large Language Models (LLMs), particularly those trained on multilingual corpora, has intensified the need for a deeper understanding of their performance across a diverse range of languages and model sizes. Our research addresses this critical need by studying the performance and scaling behavior of multilingual LLMs in text classification and machine translation tasks across 204 languages. We systematically examine both seen and unseen languages across three model families of varying sizes in zero-shot and few-shot settings. Our findings show significant differences in scaling behavior between zero-shot and two-shot scenarios, with striking disparities in performance between seen and unseen languages. Model scale has little effect on zero-shot performance, which remains mostly flat. However, in two-shot settings, larger models show clear linear improvements in multilingual text classification. For translation tasks, however, only the instruction-tuned model showed clear benefits from scaling. Our analysis also suggests that overall resource levels, not just the proportions of pretraining languages, are better predictors of model performance, shedding light on what drives multilingual LLM effectiveness.
https://arxiv.org/abs/2501.05629
Political discourse datasets are important for gaining political insights, analyzing communication strategies or social science phenomena. Although numerous political discourse corpora exist, comprehensive, high-quality, annotated datasets are scarce. This is largely due to the substantial manual effort, multidisciplinarity, and expertise required for the nuanced annotation of rhetorical strategies and ideological contexts. In this paper, we present AgoraSpeech, a meticulously curated, high-quality dataset of 171 political speeches from six parties during the Greek national elections in 2023. The dataset includes annotations (per paragraph) for six natural language processing (NLP) tasks: text classification, topic identification, sentiment analysis, named entity recognition, polarization and populism detection. A two-step annotation was employed, starting with ChatGPT-generated annotations and followed by exhaustive human-in-the-loop validation. The dataset was initially used in a case study to provide insights during the pre-election period. However, it has general applicability by serving as a rich source of information for political and social scientists, journalists, or data scientists, while it can be used for benchmarking and fine-tuning NLP and large language models (LLMs).
https://arxiv.org/abs/2501.06265
Document layout understanding is a field of study that analyzes the spatial arrangement of information in a document hoping to understand its structure and layout. Models such as LayoutLM (and its subsequent iterations) can understand semi-structured documents with SotA results; however, the lack of open semi-structured data is a limitation in itself. While semi-structured data is common in everyday life (balance sheets, purchase orders, receipts), there is a lack of public datasets for training machine learning models for this type of document. In this investigation we propose a method to generate new, synthetic, layout information that can help overcome this data shortage. According to our results, the proposed method performs better than LayoutTransformer, another popular layout generation method. We also show that, in some scenarios, text classification can improve when supported by bounding box information.
https://arxiv.org/abs/2501.05497
The Questio de aqua et terra is a cosmological treatise traditionally attributed to Dante Alighieri. However, the authenticity of this text is controversial, due to discrepancies with Dante's established works and to the absence of contemporary references. This study investigates the authenticity of the Questio via computational authorship verification (AV), a class of techniques which combine supervised machine learning and stylometry. We build a family of AV systems and assemble a corpus of 330 13th- and 14th-century Latin texts, which we use to comparatively evaluate the AV systems through leave-one-out cross-validation. Our best-performing system achieves high verification accuracy (F1=0.970) despite the heterogeneity of the corpus in terms of textual genre. The key contribution to the accuracy of this system is shown to come from Distributional Random Oversampling (DRO), a technique specially tailored to text classification which is here used for the first time in AV. The application of the AV system to the Questio returns a highly confident prediction concerning its authenticity. These findings contribute to the debate on the authorship of the Questio, and highlight DRO's potential in the application of AV to cultural heritage.
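Distributional Random Oversampling is specified in Moreo et al.'s work; very loosely, it generates minority-class variants by resampling from a document's own distributional representation rather than duplicating it. The toy sketch below illustrates only that resampling intuition (multinomial draws over a document's term counts) and should not be read as the exact DRO algorithm.

```python
import random
from collections import Counter

def resample_document(term_counts, n_samples, seed=0):
    """Draw synthetic bag-of-words vectors from a document's own term
    distribution, preserving the original document length. Each draw is
    a stochastic variant of the source document, not an exact copy."""
    rng = random.Random(seed)
    terms = list(term_counts)
    weights = [term_counts[t] for t in terms]
    length = sum(weights)
    docs = []
    for _ in range(n_samples):
        drawn = rng.choices(terms, weights=weights, k=length)
        docs.append(Counter(drawn))
    return docs
```

Variants of a minority-class document generated this way add stochastic diversity to the training set, which is the property that makes oversampling useful in small, imbalanced corpora like the one studied here.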
https://arxiv.org/abs/2501.05480
Text classification is a fundamental task in natural language processing, pivotal to various applications such as query optimization, data integration, and schema matching. While neural network-based models, such as CNN and BERT, have demonstrated remarkable performance in text classification, their effectiveness heavily relies on abundant labeled training data. This dependency makes these models less effective in dynamic few-shot text classification, where labeled data is scarce, and target labels frequently evolve based on application needs. Recently, large language models (LLMs) have shown promise due to their extensive pretraining and contextual understanding. Current approaches provide LLMs with text inputs, candidate labels, and additional side information (e.g., descriptions) to predict text labels. However, their effectiveness is hindered by the increased input size and the noise introduced through side information processing. To address these limitations, we propose a graph-based online retrieval-augmented generation framework, namely GORAG, for dynamic few-shot text classification. GORAG constructs and maintains an adaptive information graph by extracting side information across all target texts, rather than treating each input independently. It employs a weighted edge mechanism to prioritize the importance and reliability of extracted information and dynamically retrieves relevant context using a minimum-cost spanning tree tailored for each text input. Empirical evaluations demonstrate that GORAG outperforms existing approaches by providing more comprehensive and accurate contextual information.
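GORAG's graph construction and edge weighting follow the paper; the minimum-cost spanning tree retrieval step itself can be sketched with Kruskal's algorithm over a toy weighted information graph. The representation below (edges as `(cost, u, v)` tuples) is our simplification, not the paper's data structure.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm with union-find. edges is a list of
    (cost, u, v) tuples; returns the edge set whose total cost is
    minimal among all spanning trees of the graph."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for cost, u, v in sorted(edges):  # cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:  # adding this edge creates no cycle
            parent[ru] = rv
            tree.append((cost, u, v))
    return tree
```

In GORAG's setting, low-cost edges stand for important, reliable pieces of extracted side information, so the spanning tree picks a compact, non-redundant context for each input text.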
https://arxiv.org/abs/2501.02844
Text augmentation (TA) is a critical technique for text classification, especially in few-shot settings. This paper introduces a novel LLM-based TA method, TARDiS, to address challenges inherent in the generation and alignment stages of two-stage TA methods. For the generation stage, we propose two generation processes, SEG and CEG, incorporating multiple class-specific prompts to enhance diversity and separability. For the alignment stage, we introduce a class adaptation (CA) method to ensure that generated examples align with their target classes through verification and modification. Experimental results demonstrate TARDiS's effectiveness, outperforming state-of-the-art LLM-based TA methods in various few-shot text classification tasks. An in-depth analysis confirms the detailed behaviors at each stage.
https://arxiv.org/abs/2501.02739
In this study, we introduce the Multi-Head Explainer (MHEX), a versatile and modular framework that enhances both the explainability and accuracy of Convolutional Neural Networks (CNNs) and Transformer-based models. MHEX consists of three core components: an Attention Gate that dynamically highlights task-relevant features, Deep Supervision that guides early layers to capture fine-grained details pertinent to the target class, and an Equivalent Matrix that unifies refined local and global representations to generate comprehensive saliency maps. Our approach demonstrates superior compatibility, enabling effortless integration into existing residual networks like ResNet and Transformer architectures such as BERT with minimal modifications. Extensive experiments on benchmark datasets in medical imaging and text classification show that MHEX not only improves classification accuracy but also produces highly interpretable and detailed saliency scores.
https://arxiv.org/abs/2501.01311
This study evaluates fine-tuning strategies for text classification using the DistilBERT model, specifically the distilbert-base-uncased-finetuned-sst-2-english variant. Through structured experiments, we examine the influence of hyperparameters such as learning rate, batch size, and epochs on accuracy, F1-score, and loss. Polynomial regression analyses capture foundational and incremental impacts of these hyperparameters, focusing on fine-tuning adjustments relative to a baseline model. Results reveal variability in metrics due to hyperparameter configurations, showing trade-offs among performance metrics. For example, a higher learning rate reduces loss in relative analysis (p=0.027) but challenges accuracy improvements. Meanwhile, batch size significantly impacts accuracy and F1-score in absolute regression (p=0.028 and p=0.005) but has limited influence on loss optimization (p=0.170). The interaction between epochs and batch size maximizes F1-score (p=0.001), underscoring the importance of hyperparameter interplay. These findings highlight the need for fine-tuning strategies addressing non-linear hyperparameter interactions to balance performance across metrics. Such variability and metric trade-offs are relevant for tasks beyond text classification, including NLP and computer vision. This analysis informs fine-tuning strategies for large language models and promotes adaptive designs for broader model applicability.
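The study fits polynomial regressions of metrics on hyperparameters; as a hedged sketch of that analysis step (not the paper's exact design), a quadratic least-squares fit via the normal equations looks like:

```python
def polyfit2(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal
    equations, solved by Gaussian elimination with partial pivoting.
    In the paper's setting x would be a hyperparameter (e.g. learning
    rate) and y a metric (e.g. accuracy)."""
    # X^T X (3x3) and X^T y (3) for the quadratic design matrix.
    a = [[sum(x ** (i + j) for x in xs) for j in range(3)] for i in range(3)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, 3):
            f = a[row][col] / a[col][col]
            for k in range(col, 3):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    coeffs = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):  # back substitution
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, 3))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs
```

A significant `c2` term in such a fit is what signals the non-linear hyperparameter effects the abstract emphasizes.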
https://arxiv.org/abs/2501.00241
Text Classification (TC) stands as a cornerstone within the realm of Natural Language Processing (NLP), particularly when viewed through the lens of computer science and engineering. The past decade has seen deep learning revolutionize TC, propelling advancements in text retrieval, categorization, information extraction, and summarization. The scholarly literature is rich with datasets, models, and evaluation criteria, with English being the predominant language of focus, despite studies involving Arabic, Chinese, Hindi, and others. The efficacy of TC models relies heavily on their ability to capture intricate textual relationships and nonlinear correlations, necessitating a comprehensive examination of the entire TC pipeline. This monograph provides an in-depth exploration of the TC pipeline, with a particular emphasis on evaluating the impact of each component on the overall performance of TC models. The pipeline includes state-of-the-art datasets, text preprocessing techniques, text representation methods, classification models, evaluation metrics, current results and future trends. Each chapter meticulously examines these stages, presenting technical innovations and significant recent findings. The work critically assesses various classification strategies, offering comparative analyses, examples, case studies, and experimental evaluations. These contributions extend beyond a typical survey, providing a detailed and insightful exploration of TC.
https://arxiv.org/abs/2501.00174
The advancements in the Large Language Model (LLM) have helped in solving several problems related to language processing. Most research has focused on the English language only, because of its popularity and abundance on the internet. However, a high-performance language model for Hindi and other Indic languages is lacking in the literature. In this work, we have pre-trained two autoregressive LLM models for the Hindi language, namely HindiLLM-Small and HindiLLM-Medium. We use a two-step process comprising unsupervised pre-training and supervised fine-tuning. First, we create a large and high-quality text corpus for unsupervised pre-training. Next, we train a Byte-Pair Encoding tokenizer, named the HindiLLM tokenizer, using the pre-training text data. We then perform training on the unlabeled data, known as the pre-training step, to get the HindiLLM base models. Furthermore, we perform fine-tuning of the HindiLLM base models for different tasks like sentiment analysis, text classification, natural language inference, and multiple-choice question answering on popular labeled datasets to measure the real-world performance. The evaluation shows that the HindiLLM-based fine-tuned models outperform several models in most of the language related tasks.
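The HindiLLM tokenizer is a Byte-Pair Encoding model. The classic BPE merge loop (following Sennrich et al.'s formulation on a toy word-frequency corpus, not the paper's actual training setup) can be sketched as:

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules: repeatedly merge the most frequent adjacent
    symbol pair across the corpus. Words are tuples of symbols mapped to
    their corpus frequency."""
    vocab = dict(word_freqs)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for i in range(len(word) - 1):
                pairs[word[i], word[i + 1]] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():
            new_word, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] = freq
        vocab = merged
    return merges, vocab
```

Training the merges on Hindi text rather than English is what lets a BPE tokenizer like HindiLLM's produce compact subword units for Devanagari script.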
https://arxiv.org/abs/2412.20357
Cyberbullying significantly contributes to mental health issues in communities by negatively impacting the psychology of victims. It is a prevalent problem on social media platforms, necessitating effective, real-time detection and monitoring systems to identify harmful messages. However, current cyberbullying detection systems face challenges related to performance, dataset quality, time efficiency, and computational costs. This research aims to conduct a comparative study by adapting and evaluating existing text classification techniques within the cyberbullying detection domain. The study specifically evaluates the effectiveness and performance of these techniques in identifying cyberbullying instances on social media platforms. It focuses on leveraging and assessing large language models, including BERT, RoBERTa, XLNet, DistilBERT, and GPT-2.0, for their suitability in this domain. The results show that BERT strikes a balance between performance, time efficiency, and computational resources: Accuracy of 95%, Precision of 95%, Recall of 95%, F1 Score of 95%, Error Rate of 5%, Inference Time of 0.053 seconds, RAM Usage of 35.28 MB, CPU/GPU Usage of 0.4%, and Energy Consumption of 0.000263 kWh. The findings demonstrate that generative AI models, while powerful, do not consistently outperform fine-tuned models on the tested benchmarks. However, state-of-the-art performance can still be achieved through strategic adaptation and fine-tuning of existing models for specific datasets and tasks.
https://arxiv.org/abs/2412.19928
Machine Unlearning has emerged as a critical area in artificial intelligence, addressing the need to selectively remove learned data from machine learning models in response to data privacy regulations. This paper provides a comprehensive comparative analysis of six state-of-the-art unlearning techniques applied to image and text classification tasks. We evaluate their performance, efficiency, and compliance with regulatory requirements, highlighting their strengths and limitations in practical scenarios. By systematically analyzing these methods, we aim to provide insights into their applicability, challenges, and trade-offs, fostering advancements in the field of ethical and adaptable machine learning.
https://arxiv.org/abs/2412.19583