With the emergence of ChatGPT, Transformer models have significantly advanced text classification and related tasks. Decoder-only models such as Llama exhibit strong performance and flexibility, yet they suffer from inference inefficiency due to token-by-token generation, and their effectiveness in text classification heavily depends on prompt quality. Moreover, their substantial GPU resource requirements often limit widespread adoption. Whether smaller language models can effectively handle text classification tasks is therefore of significant interest, yet the selection of appropriate models and methodologies remains largely underexplored. In this paper, we conduct a comprehensive evaluation of prompt engineering and supervised fine-tuning methods for transformer-based text classification. Specifically, we focus on practical industrial scenarios, including email classification, legal document categorization, and the classification of extremely long academic texts. We examine the strengths and limitations of smaller models, with particular attention to both their performance and their efficiency in Video Random-Access Memory (VRAM) utilization, thereby providing valuable insights for the local deployment and application of compact models in industrial settings.
https://arxiv.org/abs/2505.16078
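A quick, hedged illustration of the kind of VRAM measurement such a comparison needs: peak GPU memory for one forward pass, read from PyTorch's built-in allocator statistics. The helper name `peak_vram_mb` and the Hugging-Face-style dict of input tensors are assumptions for this sketch, not the paper's harness.

```python
import torch

def peak_vram_mb(model, inputs, device="cuda"):
    """Measure peak GPU memory (in MB) for a single forward pass."""
    model = model.to(device).eval()
    inputs = {k: v.to(device) for k, v in inputs.items()}
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)   # zero the high-water mark
    with torch.no_grad():
        model(**inputs)
    return torch.cuda.max_memory_allocated(device) / 1024**2
```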
The rapid advancement of large language models (LLMs) calls for a rigorous theoretical framework to explain their empirical success. While significant progress has been made in understanding LLM behaviors, existing theoretical frameworks remain fragmented in explaining emergent phenomena through a unified mathematical lens. We establish the first formal connection between LLM architectures and Algorithmic Information Theory (AIT) by proving two fundamental results: (1) the training process computationally approximates the Solomonoff prior through loss minimization interpreted as program length optimization, and (2) next-token prediction implements approximate Solomonoff induction. We leverage AIT to provide a unified theoretical explanation for in-context learning, few-shot learning, and scaling laws. Furthermore, our theoretical insights lead to a principled method for few-shot example selection that prioritizes samples where models exhibit lower predictive confidence. We demonstrate through experiments on diverse text classification benchmarks that this strategy yields significant performance improvements, particularly for smaller model architectures, when compared to selecting high-confidence examples. Our framework bridges the gap between theoretical foundations and practical LLM behaviors, providing both explanatory power and actionable insights for future model development.
https://arxiv.org/abs/2505.15784
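The selection rule (prefer low-confidence examples) is easy to sketch. Below, `label_probs` is a hypothetical scorer you would supply: it should return the model's probability distribution over class labels for a candidate example.

```python
import numpy as np

def select_low_confidence(candidates, label_probs, k=4):
    """Pick the k candidates on which the model is least confident."""
    conf = [np.max(label_probs(x)) for x in candidates]  # top-1 probability per example
    order = np.argsort(conf)                             # ascending: least confident first
    return [candidates[i] for i in order[:k]]
```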
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
https://arxiv.org/abs/2505.14918
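The intra-rater statistic reported above (perfect agreement on 90-98% of examples) can be computed as the fraction of items whose replicate labels all coincide; a minimal sketch with toy data:

```python
def perfect_agreement_rate(replicate_labels):
    """Fraction of items whose replicate labels are all identical.

    replicate_labels: list of per-item label lists, e.g. five labels per article.
    """
    agree = sum(1 for labels in replicate_labels if len(set(labels)) == 1)
    return agree / len(replicate_labels)

# Example: three articles, five replicates each -> 2/3 perfect agreement.
print(perfect_agreement_rate([
    ["pos"] * 5,
    ["pos", "pos", "neg", "pos", "pos"],
    ["neg"] * 5,
]))
```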
The design of optimization algorithms for neural networks remains a critical challenge, with most existing methods relying on heuristic adaptations of gradient-based approaches. This paper introduces KO (Kinetics-inspired Optimizer), a novel neural optimizer inspired by kinetic theory and partial differential equation (PDE) simulations. We reimagine the training dynamics of network parameters as the evolution of a particle system governed by kinetic principles, where parameter updates are simulated via a numerical scheme for the Boltzmann transport equation (BTE) that models stochastic particle collisions. This physics-driven approach inherently promotes parameter diversity during optimization, mitigating the phenomenon of parameter condensation, i.e., the collapse of network parameters into low-dimensional subspaces, through mechanisms analogous to thermal diffusion in physical systems. We analyze this property, establishing both a mathematical proof and a physical interpretation. Extensive experiments on image classification (CIFAR-10/100, ImageNet) and text classification (IMDB, Snips) tasks demonstrate that KO consistently outperforms baseline optimizers (e.g., Adam, SGD), achieving accuracy improvements while computation cost remains comparable.
https://arxiv.org/abs/2505.14777
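For intuition only, here is a toy optimizer that mimics the diffusion idea by adding annealed Gaussian noise to plain SGD updates. KO's actual update is a numerical scheme for the Boltzmann transport equation; this sketch does not implement that scheme, only the high-level idea of injecting diffusion-like noise to keep parameters from condensing.

```python
import torch
from torch.optim import Optimizer

class NoisySGD(Optimizer):
    """Toy illustration: SGD plus a decaying stochastic 'collision' term."""
    def __init__(self, params, lr=0.1, noise=0.01, decay=0.999):
        super().__init__(params, dict(lr=lr, noise=noise, decay=decay))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                p.add_(p.grad, alpha=-group["lr"])            # gradient descent step
                p.add_(torch.randn_like(p) * group["noise"])  # diffusion-like perturbation
            group["noise"] *= group["decay"]                  # anneal the noise over time
```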
Query routing, the task of routing user queries to different large language model (LLM) endpoints, can be considered a text classification problem. However, out-of-distribution queries must be handled properly, as those could be questions about unrelated domains, queries in other languages, or even contain unsafe text. We therefore study a \emph{guarded} query routing problem, for which we first introduce the Guarded Query Routing Benchmark (GQR-Bench), which covers three exemplary target domains (law, finance, and healthcare) and seven datasets to test robustness against out-of-distribution queries. We then use GQR-Bench to contrast the effectiveness and efficiency of LLM-based routing mechanisms (GPT-4o-mini, Llama-3.2-3B, and Llama-3.1-8B), standard LLM-based guardrail approaches (LlamaGuard and NVIDIA NeMo Guardrails), continuous bag-of-words classifiers (WideMLP, fastText), and traditional machine learning models (SVM, XGBoost). Our results show that WideMLP, enhanced with out-of-domain detection capabilities, yields the best trade-off between accuracy (88\%) and speed (<4ms). The embedding-based fastText excels at speed (<1ms) with acceptable accuracy (80\%), whereas LLMs yield the highest accuracy (91\%) but are comparatively slow (62ms for local Llama-3.1:8B and 669ms for remote GPT-4o-mini calls). Our findings challenge the automatic reliance on LLMs for (guarded) query routing and provide concrete recommendations for practical applications. GQR-Bench will be released as a Python package -- \texttt{gqr}.
https://arxiv.org/abs/2505.14524
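A minimal sketch of guarded routing with a confidence threshold, using TF-IDF plus logistic regression as a stand-in for the classifiers benchmarked above; the training texts and the 0.5 threshold are illustrative assumptions, not the benchmark's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative in-domain training data: queries labeled law / finance / healthcare.
train_texts = ["What does this clause imply?",
               "Is this stock overvalued?",
               "What are the side effects of ibuprofen?"]
train_labels = ["law", "finance", "healthcare"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
router.fit(train_texts, train_labels)

def route(query, threshold=0.5):
    probs = router.predict_proba([query])[0]
    if probs.max() < threshold:      # low confidence -> guard: treat as out-of-domain
        return "out_of_domain"
    return router.classes_[probs.argmax()]
```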
Effective prompt engineering remains a central challenge in fully harnessing the capabilities of LLMs. While well-designed prompts can dramatically enhance performance, crafting them typically demands expert intuition and a nuanced understanding of the task. Moreover, the most impactful prompts often hinge on subtle semantic cues, ones that may elude human perception but are crucial for guiding LLM behavior. In this paper, we introduce PRL (Prompts from Reinforcement Learning), a novel RL-based approach for automatic prompt generation. Unlike previous methods, PRL can produce novel few-shot examples that were not seen during training. Our approach achieves state-of-the-art performance across a range of benchmarks, including text classification, simplification, and summarization. On the classification task, it surpasses prior methods by 2.58% over APE and 1.00% over EvoPrompt. Additionally, it improves the average ROUGE scores on the summarization task by 4.32 over APE and 2.12 over EvoPrompt, and the SARI score on simplification by 6.93 over APE and 6.01 over EvoPrompt. Our code is available at this https URL.
https://arxiv.org/abs/2505.14412
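The RL component can be sketched generically as REINFORCE with a mean baseline; `policy.sample()` (returning a prompt and its log-probability) and `evaluate()` (returning task accuracy for a prompt) are hypothetical interfaces for this sketch, not PRL's actual implementation.

```python
import torch

def reinforce_step(policy, optimizer, tasks, evaluate, n_samples=4):
    """One policy-gradient step over sampled candidate prompts."""
    rewards, log_probs = [], []
    for _ in range(n_samples):
        prompt, log_prob = policy.sample()        # assumed: text + differentiable log-prob
        rewards.append(evaluate(prompt, tasks))   # assumed: accuracy in [0, 1]
        log_probs.append(log_prob)
    rewards = torch.tensor(rewards)
    baseline = rewards.mean()                     # simple variance-reduction baseline
    loss = -torch.stack(log_probs) @ (rewards - baseline)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.max().item()
```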
Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet remains understudied as a low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges, including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP (this https URL), a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.
https://arxiv.org/abs/2505.14311
While transformer-based models achieve strong performance on text classification, we explore whether masking input tokens can further enhance their effectiveness. We propose token masking regularization, a simple yet theoretically motivated method that randomly replaces input tokens with a special [MASK] token with probability p. This introduces stochastic perturbations during training, leading to implicit gradient averaging that encourages the model to capture deeper inter-token dependencies. Experiments on language identification and sentiment analysis -- across diverse models (mBERT, Qwen2.5-0.5B, TinyLlama-1.1B) -- show consistent improvements over standard regularization techniques. We identify task-specific optimal masking rates, with p = 0.1 as a strong general default. We attribute the gains to two key effects: (1) input perturbation reduces overfitting, and (2) gradient-level smoothing acts as implicit ensembling.
https://arxiv.org/abs/2505.11746
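The method itself is a few lines; a sketch in PyTorch, assuming the ids of padding/special tokens that should never be masked are passed in (id 0 as padding is just a placeholder):

```python
import torch

def mask_tokens(input_ids, mask_token_id, p=0.1, special_ids=(0,)):
    """Randomly replace input tokens with [MASK] at probability p (training only)."""
    ids = input_ids.clone()
    maskable = torch.ones_like(ids, dtype=torch.bool)
    for sid in special_ids:                 # never mask padding / special tokens
        maskable &= ids != sid
    drop = (torch.rand_like(ids, dtype=torch.float) < p) & maskable
    ids[drop] = mask_token_id
    return ids
```

Applied to each batch during training only (evaluation sees unmasked input), with p = 0.1 as the paper's suggested general default.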
Recently, training-free methods for improving large language models (LLMs) have attracted growing interest, with token-level attention tuning emerging as a promising and interpretable direction. However, existing methods typically rely on auxiliary mechanisms to identify important or irrelevant task-specific tokens, introducing potential bias and limiting applicability. In this paper, we uncover a surprising and elegant alternative: the semantically empty initial token is a powerful and underexplored control point for optimizing model behavior. Through theoretical analysis, we show that tuning the initial token's attention sharpens or flattens the attention distribution over subsequent tokens, and its role as an attention sink amplifies this effect. Empirically, we find that: (1) tuning its attention improves LLM performance more effectively than tuning other task-specific tokens; (2) the effect follows a consistent trend across layers, with earlier layers having greater impact, but varies across attention heads, with different heads showing distinct preferences in how they attend to this token. Based on these findings, we propose ZeroTuning, a training-free approach that improves LLM performance by applying head-specific attention adjustments to this special token. Despite tuning only one token, ZeroTuning achieves higher performance on text classification, multiple-choice, and multi-turn conversation tasks across models such as Llama, Qwen, and DeepSeek. For example, ZeroTuning improves Llama-3.1-8B by 11.71% on classification, 2.64% on QA tasks, and raises its multi-turn score from 7.804 to 7.966. The method is also robust to limited resources, few-shot settings, long contexts, quantization, decoding strategies, and prompt variations. Our work sheds light on a previously overlooked control point in LLMs, offering new insights into both inference-time tuning and model interpretability.
https://arxiv.org/abs/2505.11739
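One plausible form of the head-specific adjustment, sketched below: scale the pre-softmax attention logits on key position 0 (the initial token) by a per-head factor. The paper's exact parameterization may differ; this only illustrates how such a tuning knob sharpens or flattens the distribution over subsequent tokens.

```python
import torch

def tune_initial_token(scores, gamma):
    """Scale each head's pre-softmax attention toward the first token.

    scores: (batch, heads, q_len, k_len) attention logits
    gamma:  (heads,) per-head factors; gamma != 1 re-weights the initial
            token, which in turn sharpens or flattens the rest of the row.
    """
    scores = scores.clone()
    scores[:, :, :, 0] = scores[:, :, :, 0] * gamma.view(1, -1, 1)  # broadcast over batch, queries
    return torch.softmax(scores, dim=-1)
```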
Recent research highlights concerns about the trustworthiness of third-party Pre-Trained Language Models (PTLMs) due to potential backdoor attacks. These backdoored PTLMs, however, are effective only for specific pre-defined downstream tasks. In reality, these PTLMs can be adapted to many other unrelated downstream tasks. Such adaptation may lead to unforeseen consequences in downstream model outputs, consequently raising user suspicion and compromising attack stealthiness. We refer to this phenomenon as backdoor complications. In this paper, we undertake the first comprehensive quantification of backdoor complications. Through extensive experiments using 4 prominent PTLMs and 16 text classification benchmark datasets, we demonstrate the widespread presence of backdoor complications in downstream models fine-tuned from backdoored PTLMs. The output distribution of triggered samples significantly deviates from that of clean samples. Consequently, we propose a backdoor complication reduction method leveraging multi-task learning to mitigate complications without prior knowledge of downstream tasks. The experimental results demonstrate that our proposed method can effectively reduce complications while maintaining the efficacy and consistency of backdoor attacks. Our code is available at this https URL.
https://arxiv.org/abs/2505.11586
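The reported deviation between triggered and clean output distributions can be quantified in several ways; one simple, hedged choice (not necessarily the paper's metric) is the mean KL divergence over paired samples:

```python
import numpy as np

def mean_kl(p_clean, p_triggered, eps=1e-12):
    """Average KL divergence between clean and triggered output distributions.

    p_clean, p_triggered: (n_samples, n_classes) predicted label distributions.
    """
    p = np.clip(p_clean, eps, 1.0)
    q = np.clip(p_triggered, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))
```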
The increasing volume of healthcare textual data requires computationally efficient, yet highly accurate classification approaches able to handle the nuanced and complex nature of medical terminology. This research presents Knowledge Distillation for Healthcare Multi-Label Text Classification (KDH-MLTC), a framework leveraging model compression and Large Language Models (LLMs). The proposed approach addresses conventional healthcare Multi-Label Text Classification (MLTC) challenges by integrating knowledge distillation and sequential fine-tuning, subsequently optimized through Particle Swarm Optimization (PSO) for hyperparameter tuning. KDH-MLTC transfers knowledge from a more complex teacher LLM (i.e., BERT) to a lighter student LLM (i.e., DistilBERT) through sequential training adapted to MLTC that preserves the teacher's learned information while significantly reducing computational requirements. As a result, the classification is enabled to be conducted locally, making it suitable for healthcare textual data characterized by sensitivity and, therefore, ensuring HIPAA compliance. The experiments conducted on three medical literature datasets of different sizes, sampled from the Hallmark of Cancer (HoC) dataset, demonstrate that KDH-MLTC achieves superior performance compared to existing approaches, particularly for the largest dataset, reaching an F1 score of 82.70%. Additionally, statistical validation and an ablation study are carried out, proving the robustness of KDH-MLTC. Furthermore, the PSO-based hyperparameter optimization process allowed the identification of optimal configurations. The proposed approach contributes to healthcare text classification research, balancing efficiency requirements in resource-constrained healthcare settings with satisfactory accuracy demands.
https://arxiv.org/abs/2505.07162
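A common formulation of multi-label distillation, sketched below, combines hard BCE on the gold labels with soft BCE against temperature-smoothed teacher probabilities; the paper's exact loss and the `T`/`alpha` values here are assumptions.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Teacher (e.g., BERT) -> student (e.g., DistilBERT) multi-label distillation."""
    # Hard loss: binary cross-entropy against the gold multi-label targets.
    hard = F.binary_cross_entropy_with_logits(student_logits, labels.float())
    # Soft loss: match the teacher's temperature-smoothed per-label probabilities.
    soft_targets = torch.sigmoid(teacher_logits / T)
    soft = F.binary_cross_entropy_with_logits(student_logits / T, soft_targets)
    return alpha * hard + (1 - alpha) * soft
```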
Robustness to label noise within data is a significant challenge in federated learning (FL). From a data-centric perspective, the quality of distributed datasets cannot be guaranteed, since annotations from different clients contain complicated label noise of varying degrees, which degrades performance. There have been early attempts to tackle noisy labels in FL, but benchmark studies that comprehensively evaluate their practical performance under unified settings are lacking. To this end, we propose the first benchmark study, FNBench, an experimental investigation that considers three diverse label noise patterns covering synthetic label noise, imperfect human-annotation errors, and systematic errors. Our evaluation incorporates eighteen state-of-the-art methods over five image recognition datasets and one text classification dataset. Meanwhile, we provide observations to understand why noisy labels impair FL, and additionally exploit a representation-aware regularization method to enhance the robustness of existing methods against noisy labels based on our observations. Finally, we discuss the limitations of this work and propose three-fold future directions. To facilitate related communities, our source code is open-sourced at this https URL.
https://arxiv.org/abs/2505.06684
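Of the three noise patterns, synthetic label noise is the easiest to reproduce; a sketch of symmetric label flipping (with probability `noise_rate`, a label is replaced by a different class chosen uniformly at random):

```python
import numpy as np

def flip_labels(labels, noise_rate, num_classes, seed=0):
    """Inject symmetric synthetic label noise into an integer label array."""
    rng = np.random.default_rng(seed)
    labels = np.array(labels).copy()
    flip = rng.random(len(labels)) < noise_rate
    for i in np.where(flip)[0]:
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)   # pick any class except the true one
    return labels
```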
We extend and study a semi-supervised model for text classification proposed earlier by Hatefi et al. for classification tasks in which document classes are described by a small number of gold-labeled examples, while the majority of training examples are unlabeled. The model leverages the teacher-student architecture of Meta Pseudo Labels, in which a "teacher" generates labels for originally unlabeled training data to train the "student" and iteratively updates its own model based on the student's performance on the gold-labeled portion of the data. We extend the original model of Hatefi et al. with an unsupervised pre-training phase based on objective masking, and conduct in-depth performance evaluations of the original model, our extension, and various independent baselines. Experiments are performed on three different datasets in two languages (English and Swedish).
https://arxiv.org/abs/2505.06624
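A simplified one-round skeleton of the teacher-student loop (self-training flavor; the full Meta Pseudo Labels update instead differentiates the teacher through the student's gold-set loss, which this sketch only reports as a scalar):

```python
import torch
import torch.nn.functional as F

def pseudo_label_round(teacher, student, unlabeled, gold_x, gold_y, s_opt, t_opt):
    """One simplified teacher-student round; teacher/student map a batch to logits."""
    # 1) Teacher pseudo-labels the unlabeled batch.
    with torch.no_grad():
        pseudo = teacher(unlabeled).argmax(dim=-1)
    # 2) Student trains on the pseudo-labels.
    s_opt.zero_grad()
    F.cross_entropy(student(unlabeled), pseudo).backward()
    s_opt.step()
    # 3) Teacher also fits the gold set; in full MPL the student's gold loss
    #    below would be back-propagated into the teacher instead.
    t_opt.zero_grad()
    F.cross_entropy(teacher(gold_x), gold_y).backward()
    t_opt.step()
    with torch.no_grad():
        return F.cross_entropy(student(gold_x), gold_y).item()  # feedback signal
```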
Few-shot text classification has important application value in low-resource environments. This paper proposes a strategy that combines adaptive fine-tuning, contrastive learning, and regularization optimization to improve the classification performance of Transformer-based models. Experiments on the FewRel 2.0 dataset show that T5-small, DeBERTa-v3, and RoBERTa-base perform well in few-shot tasks, especially in the 5-shot setting, where they capture text features more effectively and improve classification accuracy. The experiments also reveal significant differences in classification difficulty across relation categories: some categories have fuzzy semantic boundaries or complex feature distributions, making it difficult for the standard cross-entropy loss to learn the discriminative information required to distinguish them. Introducing contrastive loss and regularization loss enhances the model's generalization ability and effectively alleviates overfitting in few-shot environments. In addition, the results show that using Transformer models or generative architectures with stronger self-attention mechanisms can improve the stability and accuracy of few-shot classification.
https://arxiv.org/abs/2505.06145
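The combined objective can be sketched as cross-entropy plus a simplified supervised contrastive term; the weighting `lam` and temperature `tau` below are assumed hyperparameters, not the paper's values.

```python
import torch
import torch.nn.functional as F

def supcon_loss(emb, labels, tau=0.1):
    """Simplified supervised contrastive loss over a batch of embeddings."""
    z = F.normalize(emb, dim=-1)
    sim = z @ z.t() / tau
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim.masked_fill_(self_mask, float("-inf"))            # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask).float()
    counts = pos.sum(1).clamp(min=1)
    loss = -(log_prob * pos).sum(1) / counts              # mean log-lik of positives
    return loss[pos.sum(1) > 0].mean()                    # skip anchors with no positives

def total_loss(logits, emb, labels, lam=0.5):
    return F.cross_entropy(logits, labels) + lam * supcon_loss(emb, labels)
```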
Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
https://arxiv.org/abs/2505.06032
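Selectively deactivating a head can be done with a forward hook; the sketch below assumes the Hugging Face BERT layout, where the self-attention module's first output is the concatenated per-head context of shape (batch, seq, num_heads * head_dim). Adapt the slicing for other architectures.

```python
import torch

def deactivate_head(self_attn_module, head_idx, head_dim):
    """Zero one attention head's contribution via a forward hook."""
    def hook(module, inputs, output):
        ctx = output[0] if isinstance(output, tuple) else output
        ctx = ctx.clone()
        ctx[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0  # silence the head
        return (ctx, *output[1:]) if isinstance(output, tuple) else ctx
    return self_attn_module.register_forward_hook(hook)  # keep handle to .remove() later
```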
Hierarchical Text Classification (HTC) involves assigning documents to labels organized within a taxonomy. Most previous research on HTC has focused on supervised methods. However, in real-world scenarios, employing supervised HTC can be challenging due to a lack of annotated data. Moreover, HTC often faces issues with large label spaces and long-tail distributions. In this work, we present Knowledge Graphs for zero-shot Hierarchical Text Classification (KG-HTC), which addresses these challenges by integrating knowledge graphs with Large Language Models (LLMs) to provide structured semantic context during classification. Our method uses a Retrieval-Augmented Generation (RAG) approach to retrieve subgraphs of the knowledge graph relevant to the input text, enhancing the LLM's understanding of label semantics at various hierarchy levels. We evaluate KG-HTC on three open-source HTC datasets: WoS, DBpedia, and Amazon. Our experimental results show that KG-HTC significantly outperforms three baselines in the strict zero-shot setting, particularly achieving substantial improvements at deeper levels of the hierarchy. This evaluation demonstrates the effectiveness of incorporating structured knowledge into LLMs to address HTC's challenges in large label spaces and long-tailed label distributions. Our code is available at: this https URL.
https://arxiv.org/abs/2505.05583
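A hedged sketch of the retrieval step: score label nodes by embedding similarity to the input and splice the top candidates into the prompt. Here `embed` is a placeholder embedding function you would supply, and the prompt format is illustrative, not the paper's.

```python
import numpy as np

def top_labels(query, label_texts, embed, k=5):
    """Retrieve the k label nodes most similar to the query by cosine similarity."""
    q = embed(query)
    sims = []
    for t in label_texts:
        v = embed(t)
        sims.append(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return [label_texts[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(document, level1, level2):
    """Assemble retrieved per-level candidates into a classification prompt."""
    return (f"Candidate level-1 labels: {', '.join(level1)}\n"
            f"Candidate level-2 labels: {', '.join(level2)}\n"
            f"Document: {document}\nAnswer with one label per level.")
```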
This paper presents a survey of Abstract Meaning Representation (AMR), a semantic representation framework that captures the meaning of sentences through a graph-based structure. AMR represents sentences as rooted, directed acyclic graphs, where nodes correspond to concepts and edges denote relationships, effectively encoding the meaning of complex sentences. This survey investigates AMR and its extensions, focusing on AMR's capabilities. It then explores the parsing (text-to-AMR) and generation (AMR-to-text) tasks, covering traditional, current, and possible future approaches. It also reviews various applications of AMR, including text generation, text classification, information extraction, and information seeking. By analyzing recent developments and challenges in the field, this survey provides insights into future directions for research and the potential impact of AMR on enhancing machine understanding of human language.
https://arxiv.org/abs/2505.03229
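For readers new to the formalism, the canonical AMR example for "The boy wants to go" shows the rooted-DAG structure: the variable `b` is reused, so the `boy` concept is shared between two relations rather than duplicated.

```python
# PENMAN notation for "The boy wants to go" -- a standard AMR example.
# Nodes are concepts (want-01, boy, go-02); edges are relations (:ARG0, :ARG1).
# Re-entrancy: variable b appears twice, making the graph a DAG, not a tree.
amr = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-02
            :ARG0 b))
"""
```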
This study investigates the self-rationalization framework constructed with a cooperative game, where a generator initially extracts the most informative segment from raw input, and a subsequent predictor utilizes the selected subset for its input. The generator and predictor are trained collaboratively to maximize prediction accuracy. In this paper, we first uncover a potential caveat: such a cooperative game could unintentionally introduce a sampling bias during rationale extraction. Specifically, the generator might inadvertently create an incorrect correlation between the selected rationale candidate and the label, even when they are semantically unrelated in the original dataset. Subsequently, we elucidate the origins of this bias using both detailed theoretical analysis and empirical evidence. Our findings suggest a direction for inspecting these correlations through attacks, based on which we further introduce an instruction to prevent the predictor from learning the correlations. Through experiments on six text classification datasets and two graph classification datasets using three network architectures (GRUs, BERT, and GCN), we show that our method not only significantly outperforms recent rationalization methods, but also achieves comparable or even better results than a representative LLM (llama3.1-8b-instruct).
https://arxiv.org/abs/2505.02118
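The cooperative game reduces to a compact skeleton: a generator scores token keep-probabilities, a straight-through mask selects the rationale, and a predictor classifies from it. This is the standard rationalization setup sketched under assumed dimensions; the paper's bias-inspection attack and corrective instruction are not shown.

```python
import torch
import torch.nn as nn

class RationaleModel(nn.Module):
    """Generator-predictor skeleton for selective rationalization."""
    def __init__(self, vocab, dim, n_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.gen = nn.Linear(dim, 1)                   # per-token keep-probability
        self.pred = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_classes)

    def forward(self, ids, sparsity=0.01):
        x = self.emb(ids)
        p = torch.sigmoid(self.gen(x))                 # (B, T, 1) keep-probabilities
        hard = (p > 0.5).float()
        mask = hard + p - p.detach()                   # straight-through estimator
        _, h = self.pred(x * mask)                     # predictor sees only the rationale
        return self.out(h[-1]), sparsity * mask.mean()  # logits, sparsity penalty
```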
As a fundamental task in machine learning, text classification plays a crucial role in many areas. With the rapid scaling of Large Language Models (LLMs), particularly through reinforcement learning (RL), there is a growing need for more capable discriminators. Consequently, advances in classification are becoming increasingly vital for enhancing the overall capabilities of LLMs. Traditional discriminative methods map text to labels but overlook LLMs' intrinsic generative strengths. Generative classification addresses this by prompting the model to directly output labels. However, existing studies still rely on simple SFT alone, seldom probing the interplay between training and inference prompts, and no work has systematically leveraged RL for generative text classifiers and unified SFT, RL, and inference-time prompting in one framework. We bridge this gap with GenCLS++, a framework that jointly optimizes SFT and RL while systematically exploring five high-level strategy dimensions (in-context learning variants, category definitions, explicit uncertainty labels, semantically irrelevant numeric labels, and perplexity-based decoding) during both training and inference. After an SFT "policy warm-up," we apply RL with a simple rule-based reward, yielding sizable extra gains. Across seven datasets, GenCLS++ achieves an average accuracy improvement of 3.46% relative to the naive SFT baseline; on public datasets, this improvement rises to 4.00%. Notably, unlike reasoning-intensive tasks that benefit from explicit thinking processes, we find that classification tasks perform better without such reasoning steps. These insights into the role of explicit reasoning provide valuable guidance for future LLM applications.
https://arxiv.org/abs/2504.19898
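The rule-based reward is the simplest moving part; a minimal stand-in (not GenCLS++'s exact rule) that checks whether the generated text names the gold label:

```python
def rule_reward(generated, gold_label, label_set):
    """Return 1.0 if the generation names the gold label, else 0.0."""
    text = generated.strip().lower()
    hit = next((lab for lab in label_set if lab.lower() in text), None)
    return 1.0 if hit is not None and hit.lower() == gold_label.lower() else 0.0

print(rule_reward("Label: Sports", "sports", ["Sports", "Politics"]))  # 1.0
```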
Efficient text classification is essential for handling the increasing volume of academic publications. This study explores the use of pre-trained language models (PLMs), including BERT, SciBERT, BioBERT, and BlueBERT, fine-tuned on the Web of Science (WoS-46985) dataset for scientific text classification. To enhance performance, we augment the dataset by executing seven targeted queries in the WoS database, retrieving 1,000 articles per category aligned with WoS-46985's main classes. PLMs predict labels for this unlabeled data, and a hard-voting strategy combines predictions for improved accuracy and confidence. Fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly boosts classification accuracy, especially in specialized domains. Domain-specific models like SciBERT and BioBERT consistently outperform general-purpose models such as BERT. These findings underscore the efficacy of dataset augmentation, inference-driven label prediction, hard-voting, and fine-tuning techniques in creating robust and scalable solutions for automated academic text classification.
https://arxiv.org/abs/2504.19021
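Hard voting itself is a one-liner over per-model predictions; a sketch with a worked example (the category names are illustrative):

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote across models for each document.

    predictions: list of per-model label lists, all the same length.
    Ties resolve to the label that reaches the top count first.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Three models, three documents -> per-document majority labels.
print(hard_vote([["CS", "Med", "CS"],
                 ["CS", "Med", "Eng"],
                 ["Bio", "Med", "CS"]]))  # ['CS', 'Med', 'CS']
```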