Federated Learning (FL) faces major challenges regarding communication overhead and model privacy when training large language models (LLMs), especially in healthcare applications. To address these, we introduce Selective Attention Federated Learning (SAFL), a novel approach that dynamically fine-tunes only those transformer layers identified as attention-critical. By employing attention patterns to determine layer importance, SAFL significantly reduces communication bandwidth and enhances differential privacy resilience. Evaluations on clinical NLP benchmarks (i2b2 Clinical Concept Extraction and MIMIC-III discharge summaries) demonstrate that SAFL achieves competitive performance with centralized models while substantially improving communication efficiency and privacy preservation.
https://arxiv.org/abs/2504.11793
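A minimal sketch of the selective-layer idea above, assuming an attention-entropy criterion for "attention-critical" (the abstract only says attention patterns drive the selection; the model, the scoring rule, and k below are illustrative, not the authors' exact method):

```python
# Hedged sketch of SAFL-style layer selection: score each transformer layer by
# how focused its attention heads are, then mark only the top-k layers as
# trainable for the next federated round. The entropy heuristic is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tok(["patient denies chest pain"], return_tensors="pt")
with torch.no_grad():
    attentions = model(**batch).attentions  # tuple: one (B, H, T, T) tensor per layer

# Importance = negative mean attention entropy (more focused heads score higher).
scores = []
for a in attentions:
    probs = a.clamp_min(1e-9)
    entropy = -(probs * probs.log()).sum(-1).mean()  # mean over batch, heads, queries
    scores.append(-entropy.item())

k = 4  # number of layers each client fine-tunes and communicates
critical = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

for i, layer in enumerate(model.encoder.layer):
    for p in layer.parameters():
        p.requires_grad = i in critical  # only critical layers enter the FL update
```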
Transformer-based language models have revolutionized the field of natural language processing (NLP). However, using these models often involves navigating multiple frameworks and tools, as well as writing repetitive boilerplate code. This complexity can discourage non-programmers and beginners, and even slow down prototyping for experienced developers. To address these challenges, we introduce Langformers, an open-source Python library designed to streamline NLP pipelines through a unified, factory-based interface for large language model (LLM) and masked language model (MLM) tasks. Langformers integrates conversational AI, MLM pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API, supporting popular platforms such as Hugging Face and Ollama. Key innovations include: (1) task-specific factories that abstract training, inference, and deployment complexities; (2) built-in memory and streaming for conversational agents; and (3) lightweight, modular design that prioritizes ease of use. Documentation: this https URL
https://arxiv.org/abs/2504.09170
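Since the abstract centers on task-specific factories, here is an illustrative factory pattern in plain Python; every name below is hypothetical and this is not the actual Langformers API (consult the library's documentation for real usage):

```python
# Illustrative factory-registry pattern only -- hypothetical names, not Langformers.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Task:
    name: str
    run: Callable[[str], object]

_REGISTRY: Dict[str, Callable[..., Task]] = {}

def register(name: str):
    def deco(factory: Callable[..., Task]):
        _REGISTRY[name] = factory
        return factory
    return deco

@register("classifier")
def make_classifier(labels):
    # A real factory would hide tokenizer/model loading and deployment details.
    return Task("classifier", run=lambda text: labels[len(text) % len(labels)])

def create_task(name: str, **kwargs) -> Task:
    return _REGISTRY[name](**kwargs)

clf = create_task("classifier", labels=["positive", "negative"])
print(clf.run("great library"))
```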
Traditionally, authorship attribution (AA) tasks relied on statistical data analysis and classification based on stylistic features extracted from texts. In recent years, pre-trained language models (PLMs) have attracted significant attention in text classification tasks. However, although they demonstrate excellent performance on large-scale short-text datasets, their effectiveness remains under-explored for small samples, particularly in AA tasks. Additionally, a key challenge is how to effectively leverage PLMs in conjunction with traditional feature-based methods to advance AA research. In this study, we aimed to significantly improve performance using an integrated ensemble of traditional feature-based and modern PLM-based methods on an AA task with a small sample. For the experiment, we used two corpora of literary works to classify 10 authors each. The results indicate that BERT is effective, even for small-sample AA tasks. Both the BERT-based and classifier ensembles outperformed their respective stand-alone models, and the integrated ensemble approach further improved the scores significantly. For the corpus that was not included in the pre-training data, the integrated ensemble improved the F1 score by approximately 14 points compared to the best-performing single model. Our methodology provides a viable solution for the efficient use of the ever-expanding array of data processing tools in the foreseeable future.
https://arxiv.org/abs/2504.08527
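A hedged sketch of the integrated-ensemble idea: class probabilities from a character n-gram stylometric classifier are averaged with those of a (stubbed) BERT classifier. The ensemble members, features, and equal weights are our assumptions, not the paper's exact configuration:

```python
# Minimal soft-voting ensemble over a traditional branch and a PLM branch.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts, authors = ["sample text one", "sample text two"], [0, 1]

# Traditional branch: character n-grams approximate stylistic features.
style_clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
).fit(texts, authors)

def bert_proba(batch):
    # Stand-in for a fine-tuned BERT classifier's predict_proba.
    return np.full((len(batch), 2), 0.5)

p_style = style_clf.predict_proba(texts)
p_bert = bert_proba(texts)
ensemble_pred = (0.5 * p_style + 0.5 * p_bert).argmax(axis=1)
```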
With the introduction of the PSD2 regulation in the EU, which established the Open Banking framework, a new window of opportunity has opened for banks and fintechs to explore and enrich bank transaction descriptions, with the aim of building a better understanding of customer behavior and using this understanding to prevent fraud, reduce risk, and offer more competitive and tailored services. Although natural language processing models and techniques have seen incredible progress across various applications and domains over the past few years, custom applications based on domain-specific text corpora remain largely unaddressed, especially in the banking sector. In this paper, we introduce a language-based Open Banking transaction classification system with a focus on the French market and French-language text. The system encompasses data collection, labeling, preprocessing, modeling, and evaluation stages. Unlike previous studies that focus on general classification approaches, this system is specifically tailored to address the challenges posed by training a language model with a specialized text corpus (banking data in the French context). By incorporating language-specific techniques and domain knowledge, the proposed system demonstrates enhanced performance and efficiency compared to generic approaches.
https://arxiv.org/abs/2504.12319
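As a rough illustration of such a pipeline, a hedged baseline for French transaction strings; the categories, normalization, and model choice below are ours, not the paper's richer system:

```python
# Hedged baseline: accent-stripped character n-grams + logistic regression,
# which is robust to the abbreviations typical of bank transaction labels.
import unicodedata
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def normalize_fr(s: str) -> str:
    # Strip accents and casing: transaction strings are noisy and abbreviated.
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
    return s.lower()

samples = [
    ("CB CARREFOUR PARIS 15", "groceries"),
    ("VIR SEPA EDF FACTURE", "utilities"),
    ("PRLV NETFLIX", "subscriptions"),
    ("CB SNCF INTERNET", "transport"),
]
X = [normalize_fr(t) for t, _ in samples]
y = [c for _, c in samples]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000),
).fit(X, y)

print(clf.predict([normalize_fr("CB CARREFOUR MARKET LYON")]))
```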
This study presents a multi-stage approach to mental health classification by leveraging traditional machine learning algorithms, deep learning architectures, and transformer-based models. A novel data set was curated and utilized to evaluate the performance of various methods, starting with conventional classifiers and advancing through neural networks. To broaden the architectural scope, recurrent neural networks (RNNs) such as LSTM and GRU were also evaluated to explore their effectiveness in modeling sequential patterns in the data. Subsequently, transformer models such as BERT were fine-tuned to assess the impact of contextual embeddings in this domain. Beyond these baseline evaluations, the core contribution of this study lies in a novel training strategy involving a dual-model architecture composed of a teacher and a student network. Unlike standard distillation techniques, this method does not rely on soft label transfer; instead, it facilitates information flow through both the teacher model's output and its latent representations by modifying the loss function. The experimental results highlight the effectiveness of each modeling stage and demonstrate that the proposed loss function and teacher-student interaction significantly enhance the model's learning capacity in mental health prediction tasks.
https://arxiv.org/abs/2504.07245
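A sketch of the modified teacher-student loss described above, under our own assumptions about the exact terms and weights: hard-label cross-entropy plus alignment of the student's latent representation and output vector with the teacher's, rather than temperature-softened label transfer:

```python
# Dual-model loss sketch: CE on gold labels + MSE alignment on the teacher's
# latent representation and raw output vector. Weights alpha/beta are ours.
import torch
import torch.nn.functional as F

def dual_model_loss(student_logits, student_hidden,
                    teacher_logits, teacher_hidden,
                    labels, alpha=0.5, beta=0.5):
    ce = F.cross_entropy(student_logits, labels)
    latent = F.mse_loss(student_hidden, teacher_hidden.detach())
    output = F.mse_loss(student_logits, teacher_logits.detach())
    return ce + alpha * latent + beta * output

# Example shapes: batch of 8, hidden size 256, 4 mental-health classes.
s_logits = torch.randn(8, 4, requires_grad=True)
s_hid = torch.randn(8, 256, requires_grad=True)
t_logits, t_hid = torch.randn(8, 4), torch.randn(8, 256)
loss = dual_model_loss(s_logits, s_hid, t_logits, t_hid, torch.randint(0, 4, (8,)))
loss.backward()
```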
Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' system. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents certain limitations -- e.g. studying only English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.
https://arxiv.org/abs/2504.08001
Embedding fusion has emerged as an effective approach for enhancing performance across various NLP tasks. However, systematic guidelines for selecting optimal layers and developing effective fusion strategies for the integration of LLMs remain underexplored. In this study, we propose a layer-aware embedding selection method and investigate how to quantitatively evaluate different layers to identify the most important ones for downstream NLP tasks, showing that the critical layers vary depending on the dataset. We also explore how combining embeddings from multiple LLMs, without requiring model fine-tuning, can improve performance. Experiments on four English text classification datasets (SST-2, MR, R8, and R52) demonstrate that different layers in LLMs exhibit varying degrees of representational strength for classification, and that combining embeddings from different models can enhance performance if the models exhibit complementary characteristics. Additionally, we discuss resource overhead (memory and inference time) to provide a balanced perspective on the real-world feasibility of embedding fusion. Future work will explore multilingual and domain-specific datasets, as well as techniques for automating layer selection, to improve both performance and scalability.
https://arxiv.org/abs/2504.05764
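A hedged sketch of layer-aware selection and fusion: probe every hidden layer with a linear classifier on held-out data, keep the best layer, and concatenate the selected embeddings from two models. The mean-pooling and probing setup are our assumptions:

```python
# Layer-aware embedding selection sketch (no fine-tuning of the LLMs).
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

def layer_embeddings(name, texts):
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_hidden_states=True)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hs = model(**batch).hidden_states  # (n_layers+1) tensors of (B, T, D)
    return [h.mean(dim=1).numpy() for h in hs]  # mean-pool tokens per layer

def best_layer(train_layers, y_train, val_layers, y_val):
    # Score each layer with a cheap linear probe; return the winning index.
    scores = [LogisticRegression(max_iter=1000).fit(tr, y_train).score(va, y_val)
              for tr, va in zip(train_layers, val_layers)]
    return int(max(range(len(scores)), key=scores.__getitem__))

# Fusion step: concatenate each model's best-layer embedding along features,
# e.g. fused = np.concatenate([emb_a[best_a], emb_b[best_b]], axis=1)
```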
Natural language processing models often face challenges due to limited labeled data, especially in domain-specific areas, e.g., clinical trials. To overcome this, text augmentation techniques are commonly used to increase sample size by transforming the original input data into artificial samples with the label preserved. However, traditional text classification methods ignore the relationship between augmented texts and treat them as independent samples, which may introduce classification error. Therefore, we propose a novel approach called 'Batch Aggregation' (BAGG), which explicitly models the dependence of text inputs generated through augmentation by incorporating an additional layer that aggregates results from correlated texts. Through studying multiple benchmark datasets across different domains, we found that BAGG can improve classification accuracy. We also found that the performance gain from BAGG is more pronounced in domain-specific datasets, with accuracy improvements of 10-29%. Through the analysis of benchmark data, the proposed method addresses limitations of traditional techniques and improves robustness in text classification tasks. Our results demonstrate that BAGG offers more robust results and outperforms traditional approaches when training data is limited.
https://arxiv.org/abs/2504.05020
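A minimal sketch of the aggregation layer, assuming mean-pooled logits over each group of correlated texts (one source text plus its augmentations); the grouping and pooling choices are ours:

```python
# BAGG-style aggregation sketch: correlated samples share a group id, and
# their logits are averaged so each source text contributes one loss term.
import torch
import torch.nn.functional as F

def bagg_loss(logits, group_ids, labels_per_group):
    """logits: (N, C) over originals+augmentations; group_ids: (N,) in [0, G);
    labels_per_group: (G,) one label per source text."""
    G, C = labels_per_group.size(0), logits.size(1)
    sums = torch.zeros(G, C).index_add_(0, group_ids, logits)
    ones = torch.ones_like(group_ids, dtype=torch.float)
    counts = torch.zeros(G).index_add_(0, group_ids, ones)
    agg = sums / counts.unsqueeze(1)          # mean logits per group
    return F.cross_entropy(agg, labels_per_group)

# 2 source texts, 3 augmented views each -> 6 rows, 2 aggregated predictions.
logits = torch.randn(6, 4, requires_grad=True)
loss = bagg_loss(logits, torch.tensor([0, 0, 0, 1, 1, 1]), torch.tensor([2, 0]))
loss.backward()
```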
Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary language. Evaluation of these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks are disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this gap, we introduce GlotEval, a lightweight framework designed for massively multilingual evaluation. Supporting seven key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation) that span dozens to hundreds of languages, GlotEval emphasizes consistent multilingual benchmarking, language-specific prompt templates, and non-English-centric machine translation. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval's applicability for multilingual and language-specific evaluations.
https://arxiv.org/abs/2504.04155
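Purely illustrative of the language-specific prompt templates the abstract mentions; the template strings, language codes, and lookup scheme below are assumptions, not GlotEval's actual configuration format:

```python
# Hypothetical per-language template registry with an English fallback.
TEMPLATES = {
    "classification": {
        "eng": "Text: {text}\nLabel ({labels}):",
        "zho": "文本:{text}\n标签({labels}):",
        "fin": "Teksti: {text}\nLuokka ({labels}):",
    },
}

def build_prompt(task: str, lang: str, **fields) -> str:
    # Fall back to English if a language lacks a curated template.
    table = TEMPLATES[task]
    return table.get(lang, table["eng"]).format(**fields)

print(build_prompt("classification", "fin",
                   text="Hyvä elokuva!", labels="pos/neg"))
```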
Zero-shot text classification typically relies on prompt engineering, but the inherent prompt brittleness of large language models undermines its reliability. Minor changes in a prompt can cause significant discrepancies in model performance. We attribute this prompt brittleness largely to the narrow focus on next-token probabilities in existing methods. To address this, we propose Placeholding Parallel Prediction (P3), a novel approach that predicts token probabilities across multiple positions and simulates comprehensive sampling of generation paths in a single run of a language model. Experiments show improved accuracy and up to a 98% reduction in the standard deviation across prompts, boosting robustness. Even without a prompt, P3 maintains comparable performance, reducing the need for prompt engineering.
https://arxiv.org/abs/2504.03159
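A hedged approximation of the placeholding idea: append k placeholder tokens, run the model once, and read next-token distributions at every appended position. The placeholder token (EOS reused as padding) and the averaging rule are our assumptions, not the paper's exact procedure:

```python
# One forward pass yields label-word probabilities at k positions at once.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Review: the plot was dull and predictable. Sentiment:"
k = 4
pad_id = tok.eos_token_id  # gpt2 has no pad token; reuse EOS as placeholder
ids = tok(prompt, return_tensors="pt").input_ids
ids = torch.cat([ids, torch.full((1, k), pad_id)], dim=1)

with torch.no_grad():
    logits = model(ids).logits  # (1, T+k, V)

probs = logits[0, -k - 1 : -1].softmax(-1)  # distributions at the k slots
label_ids = {lab: tok.encode(" " + lab)[0] for lab in ("positive", "negative")}
scores = {lab: probs[:, i].mean().item() for lab, i in label_ids.items()}
print(max(scores, key=scores.get))
```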
Automatic text classification (ATC) has experienced remarkable advancements in the past decade, best exemplified by recent small and large language models (SLMs and LLMs) built on Transformer architectures. Despite recent effectiveness improvements, the literature still lacks a comprehensive cost-benefit analysis investigating whether the effectiveness gains of these recent approaches compensate for their much higher costs compared to more traditional text classification approaches such as SVMs and Logistic Regression. In this context, this work's main contributions are twofold: (i) we provide a scientifically sound comparative analysis of the cost-benefit of twelve traditional and recent ATC solutions, including five open LLMs, and (ii) a large benchmark comprising 22 datasets, covering sentiment analysis and topic classification, with (train-validation-test) partitions based on folded cross-validation procedures, along with documentation and code. The release of code, data, and documentation enables the community to replicate experiments and advance the field in a more scientifically sound manner. Our comparative experimental results indicate that LLMs outperform traditional approaches (by up to 26%, and 7.1% on average) and SLMs (by up to 4.9%, and 1.9% on average) in terms of effectiveness. However, LLMs incur significantly higher computational costs due to fine-tuning, being, on average, 590x and 8.5x slower than traditional methods and SLMs, respectively. The results suggest the following recommendations: (1) LLMs for applications that require the best possible effectiveness and can afford the costs; (2) traditional methods such as Logistic Regression and SVM for resource-limited applications or those that cannot afford the cost of tuning large LLMs; and (3) SLMs such as RoBERTa for a near-optimal effectiveness-efficiency trade-off.
https://arxiv.org/abs/2504.01930
The Tsetlin Machine (TM) is a propositional logic based model that uses conjunctive clauses to learn patterns from data. As with typical neural networks, the performance of a Tsetlin Machine is largely dependent on its parameter count, with a larger number of parameters producing higher accuracy but slower execution. Knowledge distillation in neural networks transfers information from an already-trained teacher model to a smaller student model to increase accuracy in the student without increasing execution time. We propose a novel approach to implementing knowledge distillation in Tsetlin Machines by utilizing the probability distributions of each output sample in the teacher to provide additional context to the student. Additionally, we propose a novel clause-transfer algorithm that weighs the importance of each clause in the teacher and initializes the student with only the most essential data. We find that our algorithm can significantly improve performance in the student model without negatively impacting latency in the tested domains of image recognition and text classification.
https://arxiv.org/abs/2504.01798
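An illustrative sketch of the clause-transfer step under our own assumptions about Tsetlin Machine state layout (real TM implementations differ): rank the teacher's clauses by weight and initialize the smaller student from the top ones:

```python
# Clause-transfer sketch: keep only the teacher's most influential clauses.
# Teacher weights and TA states below are random stand-ins for a trained TM.
import numpy as np

rng = np.random.default_rng(0)
teacher_clauses, student_clauses, n_literals = 2000, 400, 256

teacher_weights = rng.integers(1, 50, size=teacher_clauses)       # clause importance
teacher_ta_state = rng.integers(0, 200, size=(teacher_clauses, n_literals))

# Keep the most influential clauses only, copying their automaton states.
keep = np.argsort(teacher_weights)[::-1][:student_clauses]
student_ta_state = teacher_ta_state[keep].copy()
student_weights = teacher_weights[keep].copy()

print(f"retained {student_clauses}/{teacher_clauses} clauses; "
      f"min kept weight = {student_weights.min()}")
```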
In Biomedical Natural Language Processing (BioNLP) tasks, such as Relation Extraction, Named Entity Recognition, and Text Classification, the scarcity of high-quality data remains a significant challenge. This limitation hinders large language models from correctly understanding relationships between biological entities, such as molecules and diseases, or drug interactions, and can further result in the misinterpretation of biomedical documents. To address this issue, current approaches generally adopt the Synthetic Data Augmentation method, which involves similarity computation followed by word replacement, but this often generates counterfactual data. As a result, these methods disrupt meaningful word sets or produce sentences whose meanings deviate substantially from the original context, rendering them ineffective in improving model performance. To this end, this paper proposes a biomedical-dedicated, rationale-based synthetic data augmentation method. Beyond naive lexical similarity, specific bio-relation similarity is measured so that each augmented instance retains a strong correlation with its bio-relation, rather than simply increasing the diversity of the augmented data. Moreover, a multi-agent reflection mechanism helps the model iteratively distinguish different usages of similar entities to avoid falling into the mis-replacement trap. We evaluate our method on the BLURB and BigBIO benchmarks, which include 9 common datasets spanning four major BioNLP tasks. Our experimental results demonstrate consistent performance improvements across all tasks, highlighting the effectiveness of our approach in addressing the challenges associated with data scarcity and enhancing the overall performance of biomedical NLP models.
https://arxiv.org/abs/2503.23673
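A hedged sketch of similarity-gated augmentation in the spirit described above: a candidate word swap is kept only if the augmented sentence stays close to the original in an embedding space standing in for bio-relation similarity. The encoder below is a placeholder; the paper's bio-relation measure is richer:

```python
# Gate each candidate augmentation on embedding similarity to the original.
import numpy as np

def embed(sentence: str) -> np.ndarray:
    # Placeholder: substitute a biomedical sentence encoder (e.g. a PubMed-
    # pretrained model) in practice; this stub just returns a unit vector.
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def accept_swap(original: str, augmented: str, threshold: float = 0.85) -> bool:
    sim = float(embed(original) @ embed(augmented))
    return sim >= threshold  # keep only relation-preserving augmentations

orig = "Aspirin inhibits platelet aggregation."
cand = "Aspirin suppresses platelet aggregation."
print(accept_swap(orig, cand))
```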
We investigate the effectiveness of fine-tuning large language models (LLMs) on small medical datasets for text classification and named entity recognition tasks. Using a German cardiology report dataset and the i2b2 Smoking Challenge dataset, we demonstrate that fine-tuning small LLMs locally on limited training data can improve performance achieving comparable results to larger models. Our experiments show that fine-tuning improves performance on both tasks, with notable gains observed with as few as 200-300 training examples. Overall, the study highlights the potential of task-specific fine-tuning of LLMs for automating clinical workflows and efficiently extracting structured data from unstructured medical text.
https://arxiv.org/abs/2503.21349
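A minimal fine-tuning recipe consistent with the regime above (a few hundred labeled examples, local training); the model, hyperparameters, and toy smoking-status data are illustrative choices, not the paper's exact configuration:

```python
# Small-data sequence-classification fine-tuning with the HF Trainer.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

name = "distilbert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

data = Dataset.from_dict({
    "text": ["current smoker, 1 pack/day", "denies tobacco use"] * 150,
    "label": [1, 0] * 150,   # ~300 examples, matching the regime in the study
}).map(lambda b: tok(b["text"], truncation=True, padding="max_length",
                     max_length=64), batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data,
).train()
```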
Cardiovascular disease remains one of the leading causes of mortality worldwide, underscoring the need for accurate as well as interpretable diagnostic machine learning tools. In this work, we investigate heart disease classification using electrocardiogram (ECG) data from two widely utilized datasets: the MIT-BIH Arrhythmia and PTB-XL datasets. We adapt a hierarchical attention network (HAN), originally developed for text classification, to an ECG-based heart-disease classification task. Our adapted HAN incorporates two attention layers that focus on ECG data segments of varying sizes. We conduct a comparative analysis between our adapted HAN and a more sophisticated state-of-the-art architecture featuring a network with convolution, attention, and transformer layers (CAT-Net). Our empirical evaluation encompasses multiple aspects including test accuracy (quantified by 0-1 loss); model complexity (measured by the number of model parameters); and interpretability (through attention map visualization). Our adapted HAN demonstrates comparable test accuracy with significant reductions in model complexity and enhanced interpretability: for the MIT-BIH dataset, our adapted HAN achieves 98.55% test accuracy compared to 99.14% for CAT-Net, while reducing the number of model parameters by a factor of 15.6. For the PTB-XL dataset, our adapted HAN achieves a 19.3-fold reduction in model complexity compared to CAT-Net, with only a 5% drop in test accuracy. From an interpretability perspective, the significantly simpler architecture and the hierarchical nature of our adapted HAN model facilitate a more straightforward interpretability analysis based on visualizing attention weights. Building on this advantage, we conduct an interpretability analysis of our HAN that highlights the regions of the ECG signal most relevant to the model's decisions.
https://arxiv.org/abs/2504.03703
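A compact sketch of the two-level attention idea, with our own sizes and projections: an inner attention pools samples within each segment, and an outer attention pools segments into a record vector whose weights can be visualized for interpretability:

```python
# Hierarchical attention sketch for segmented ECG records.
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
    def forward(self, x):                    # x: (B, N, D)
        w = self.score(x).softmax(dim=1)     # attention over the N items
        return (w * x).sum(dim=1), w         # pooled (B, D) + weights

class ECGHAN(nn.Module):
    def __init__(self, dim=64, n_classes=5):
        super().__init__()
        self.embed = nn.Linear(1, dim)       # per-sample embedding
        self.inner, self.outer = AttnPool(dim), AttnPool(dim)
        self.head = nn.Linear(dim, n_classes)
    def forward(self, ecg):                  # ecg: (B, n_seg, seg_len)
        B, S, L = ecg.shape
        h = torch.tanh(self.embed(ecg.reshape(B * S, L, 1)))
        seg_vec, sample_w = self.inner(h)    # attention within each segment
        seg_vec = seg_vec.reshape(B, S, -1)
        rec_vec, seg_w = self.outer(seg_vec) # attention across segments
        return self.head(rec_vec), seg_w     # seg_w: which segments mattered

logits, seg_weights = ECGHAN()(torch.randn(8, 15, 20))  # 15 segments of 20 samples
```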
Automatic math correction aims to check students' solutions to mathematical problems via artificial intelligence technologies. Most existing studies focus on judging the final answer at the problem level, while ignoring detailed feedback on each step of the math problem-solving process, which requires semantic understanding and reasoning abilities. In this paper, we propose a reinforcement learning (RL)-based method to boost large language models (LLMs) for step-level automatic math correction, named StepAMC. In particular, we cast step-level automatic math correction, conventionally framed as a text classification task, as an RL problem to enhance the reasoning capabilities of LLMs. We then design a space-constrained policy network to improve the stability of RL, and introduce a fine-grained reward network to convert binary human feedback into a continuous value. We conduct extensive experiments on two benchmark datasets, and the results show that our model outperforms eleven strong baselines.
https://arxiv.org/abs/2503.18432
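A sketch of the fine-grained reward component under our assumptions: a small MLP maps a step representation to a continuous score in (0, 1), fit on binary correct/incorrect feedback with BCE, and then usable as a dense RL reward. The input features are random stand-ins for step embeddings:

```python
# Fine-grained reward network sketch: binary feedback -> continuous reward.
import torch
import torch.nn as nn

reward_net = nn.Sequential(
    nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid(),
)
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
bce = nn.BCELoss()

step_embs = torch.randn(64, 768)                  # embeddings of solution steps
binary_fb = torch.randint(0, 2, (64, 1)).float()  # human correct/incorrect labels

for _ in range(100):
    opt.zero_grad()
    loss = bce(reward_net(step_embs), binary_fb)
    loss.backward()
    opt.step()

continuous_reward = reward_net(step_embs[:1]).item()  # dense signal for the policy
```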
Tibetan, a minority language in China, features a highly intricate grammatical structure, characterized by four verb tenses and a tense system with frequent irregularities, contributing to its extensive inflectional diversity. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in many domains. Despite the success in other fields, current LLMs often fall short in catering to the needs of domain experts like Tibetans, and the potential of LLMs for Tibetan culture is under-explored. The intrinsic reasons are the immense and intricate nature of Tibetan culture as well as the necessity for higher granularity and richness in knowledge. Simultaneously, the complexity and uniqueness of its grammatical structure, coupled with its status as a minority ethnic language, contribute to data scarcity, which remains a fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which is expert in various Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan's linguistic features. We also propose TIB-STC, a comprehensive dataset comprising diverse Tibetan texts such as literature, religious scripts, news, and conversational data, which is also the first large-scale dataset for Tibetan culture. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise for Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks, like language modeling, text classification, machine translation, and syntactic analysis. Moreover, it excels in low-resource scenarios, showcasing strong generalization capabilities.
https://arxiv.org/abs/2503.18288
In-context learning (ICL) has transformed the use of large language models (LLMs) for NLP tasks, enabling few-shot learning by conditioning on labeled examples without fine-tuning. Despite its effectiveness, ICL is prone to errors, especially for challenging examples. With the goal of improving the performance of ICL, we propose corrective in-context learning (CICL), an approach that incorporates a model's incorrect predictions alongside ground truth corrections into the prompt, aiming to enhance classification accuracy through self-correction. However, contrary to our hypothesis, extensive experiments on text classification tasks demonstrate that CICL consistently underperforms standard ICL, with performance degrading as the proportion of corrections in the prompt increases. Our findings indicate that CICL introduces confusion by disrupting the model's task understanding, rather than refining its predictions. Additionally, we observe that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable criterion for effective selection. By presenting these negative results, we provide important insights into the limitations of self-corrective mechanisms in LLMs and offer directions for future research.
https://arxiv.org/abs/2503.16022
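A sketch of how a CICL prompt differs from a standard ICL prompt, following the description above; the exact wording of the correction template is our assumption:

```python
# Each corrective exemplar pairs the model's earlier wrong guess with the gold label.
def icl_prompt(examples, query):
    shots = "\n".join(f"Text: {t}\nLabel: {y}" for t, y in examples)
    return f"{shots}\nText: {query}\nLabel:"

def cicl_prompt(examples, corrections, query):
    shots = "\n".join(f"Text: {t}\nLabel: {y}" for t, y in examples)
    fixes = "\n".join(
        f"Text: {t}\nIncorrect label: {wrong}\nCorrect label: {y}"
        for t, wrong, y in corrections
    )
    return f"{shots}\n{fixes}\nText: {query}\nLabel:"

demo = [("great acting", "positive"), ("waste of time", "negative")]
fixes = [("not bad at all", "negative", "positive")]  # prior error + gold label
print(cicl_prompt(demo, fixes, "surprisingly touching"))
```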
Text classification, a fundamental task in natural language processing (NLP), aims to categorize textual data into predefined labels. Traditional methods struggled with complex linguistic structures and semantic dependencies. The advent of deep learning, particularly recurrent neural networks (RNNs) and Transformer-based models, has significantly advanced the field by enabling nuanced feature extraction and context-aware predictions. Despite improvements, existing models exhibit limitations in balancing interpretability, computational efficiency, and long-range contextual understanding. This paper proposes the Dynamic Bidirectional Elman with Attention Network (DBEAN), which integrates bidirectional temporal modelling with self-attention mechanisms. DBEAN dynamically assigns weights to critical segments of input, improving contextual representation while maintaining computational efficiency.
https://arxiv.org/abs/2503.15469
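A hedged sketch of the DBEAN idea: a bidirectional Elman RNN (PyTorch's nn.RNN) encodes the sequence, and an additive attention layer dynamically weights time steps before classification; all sizes are illustrative:

```python
# Bidirectional Elman RNN + attention pooling sketch.
import torch
import torch.nn as nn

class DBEANSketch(nn.Module):
    def __init__(self, vocab=10000, emb=128, hid=128, n_classes=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.RNN(emb, hid, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid, 1)
        self.head = nn.Linear(2 * hid, n_classes)
    def forward(self, tokens):                    # tokens: (B, T)
        h, _ = self.rnn(self.emb(tokens))         # (B, T, 2*hid)
        w = self.attn(torch.tanh(h)).softmax(1)   # dynamic per-step weights
        ctx = (w * h).sum(dim=1)                  # weighted context vector
        return self.head(ctx), w.squeeze(-1)      # weights expose critical segments

logits, weights = DBEANSketch()(torch.randint(0, 10000, (8, 32)))
```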
This study compares the performance of two open-source large language models (LLMs), Llama3-70B and DeepSeekR1-distill-Llama3-70B, on six biomedical text classification tasks. Four tasks involve data from social media, while two focus on clinical notes from electronic health records; all experiments were performed in zero-shot settings. Performance metrics, including precision, recall, and F1 scores, were measured for each task, along with their 95% confidence intervals. Results demonstrated that DeepSeekR1-distill-Llama3-70B generally performs better in terms of precision on most tasks, with mixed results on recall. While the zero-shot LLMs demonstrated high F1 scores for some tasks, they grossly underperformed on others, for data from both sources. The findings suggest that model selection should be guided by the specific requirements of the health-related text classification task, particularly when considering precision-recall trade-offs, and that, in the presence of annotated data, supervised classification approaches may be more reliable than zero-shot LLMs.
https://arxiv.org/abs/2503.15169
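A sketch of the evaluation protocol on toy predictions: precision, recall, and F1 with 95% bootstrap confidence intervals (the abstract does not state its CI method, so the bootstrap is our assumption):

```python
# Metric + bootstrap-CI sketch for a binary classification task.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 500)
y_pred = np.where(rng.random(500) < 0.8, y_true, 1 - y_true)  # ~80% agreement

def f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05):
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        _, _, f1, _ = precision_recall_fscore_support(
            y_true[idx], y_pred[idx], average="binary", zero_division=0)
        stats.append(f1)
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"F1 = {f1:.3f}, 95% CI = {f1_ci(y_true, y_pred)}")
```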