We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pre-trained causal LLM and fine-tuning on the task (using the LLM's final token embedding as a sequence representation), and (2) instruction-tuning the LLM in a prompt->response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two datasets - a proprietary single-label dataset and the public WIPO-Alpha patent dataset (extreme multi-label classification) - show that the embedding-based method significantly outperforms the instruction-tuned method in F1-score, and is very competitive with - even surpassing - fine-tuned domain-specific models (e.g. BERT) on the same tasks. These results demonstrate that directly leveraging the internal representations of causal LLMs, along with efficient fine-tuning techniques, yields impressive classification performance under limited computational resources. We discuss the advantages of each approach while outlining practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.
我们探讨了在资源受限条件下,为下游文本分类任务高效微调解码器独享的大规模语言模型(LLMs)的策略。研究调查了两种方法:(1) 将分类头附加到预训练的因果LMM上,并针对特定任务进行微调(使用LMM最终的标记嵌入作为序列表示),以及 (2) 采用指令调整,以提示-响应格式对LLM进行微调用于分类。为了实现单GPU环境下高达80亿参数模型的有效微调,我们结合了4位模型量化与低秩适应(LoRA)技术,以达到高效的训练效果。 在两个数据集上进行了实验:一个是专有的单一标签数据集和公开的WIPO-Alpha专利数据集(极端多标签分类)。实验结果显示,基于嵌入的方法在F1分数方面显著优于指令调优方法,并且与针对特定领域的微调模型(如BERT)相比,在相同的任务中表现出色甚至超越。 这些结果表明,在计算资源有限的情况下,直接利用因果LMM内部表示以及高效的微调技术可以实现卓越的分类性能。我们讨论了每种方法的优势,并提出了优化LLM在分类场景下的微调的实际指南和未来方向。
https://arxiv.org/abs/2512.12677
Semantic distance measurement is a fundamental problem in computational linguistics, providing a quantitative characterization of similarity or relatedness between text segments, and underpinning tasks such as text retrieval and text classification. From a mathematical perspective, a semantic distance can be viewed as a metric defined on a space of texts or on a representation space derived from them. However, most classical semantic distance methods are essentially fixed, making them difficult to adapt to specific data distributions and task requirements. In this paper, a semantic distance measure based on multi-kernel Gaussian processes (MK-GP) was proposed. The latent semantic function associated with texts was modeled as a Gaussian process, with its covariance function given by a combined kernel combining Matérn and polynomial components. The kernel parameters were learned automatically from data under supervision, rather than being hand-crafted. This semantic distance was instantiated and evaluated in the context of fine-grained sentiment classification with large language models under an in-context learning (ICL) setup. The experimental results demonstrated the effectiveness of the proposed measure.
语义距离测量是计算语言学中的一个基本问题,它提供了一种对文本片段之间相似性或相关性的定量描述,并支持诸如文本检索和分类等任务。从数学角度来看,语义距离可以被视为在文本空间或由其衍生的表示空间上定义的一种度量标准。然而,大多数传统的语义距离方法本质上是固定的,这使得它们难以适应特定的数据分布和任务需求。 本文提出了一种基于多核高斯过程(MK-GP)的语义距离测量方法。该方法将与文本相关的潜在语义函数建模为高斯过程,并使用由Matérn核和多项式成分结合而成的协方差函数来定义其内核参数。这些内核参数是从数据中自动学习得到的,而不是手工设置的。这种语义距离在大规模语言模型的细粒度情感分类任务中的情境学习(ICL)环境中进行了实例化并进行评估。 实验结果表明了所提出的方法的有效性。
https://arxiv.org/abs/2512.12238
LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone's embeddings with the LLM-derived per-class scores -- obtained through structured prompt-engineering strategies -- and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains -- achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification -- while enabling practical trade-offs between accuracy, latency, and cost.
LabelFusion 是一种用于文本分类的融合集成方法,它能够学习将传统的基于变压器的分类器(例如 RoBERTa)与一个或多个大型语言模型(LLM,如 OpenAI GPT、Google Gemini 或 DeepSeek)结合在一起,以在多类和多标签任务中提供准确且成本意识强的预测。该软件包提供了简单易用的高级接口 (AutoFusionClassifier),可以进行端到端的完整流水线训练,并需要最少的配置;同时,它还为高级用户提供了一个灵活的 API。 从内部机制来看,LabelFusion 通过将机器学习骨干模型(ML backbone)的嵌入与大型语言模型(LLM)生成的每个类别的得分相连接来整合来自两个来源向量信号——这些得分是通过结构化提示工程策略获得的。然后将这种联合表示输入到一个紧凑的多层感知器(FusionMLP)中,以产生最终预测结果。 这种方法通过学习融合能够捕捉大型语言模型推理和传统基于变压器分类器各自的优势,在多个领域中均表现出稳健性能——在 AG News 数据集上达到了 92.4% 的准确率,在包含十个类别的 Reuters 21578 主题分类数据集上则为 92.3%,同时还能实现准确性、延迟和成本之间的实用权衡。
https://arxiv.org/abs/2512.10793
This study proposes a text classification algorithm based on large language models, aiming to address the limitations of traditional methods in capturing long-range dependencies, understanding contextual semantics, and handling class imbalance. The framework includes text encoding, contextual representation modeling, attention-based enhancement, feature aggregation, and classification prediction. In the representation stage, deep semantic embeddings are obtained through large-scale pretrained language models, and attention mechanisms are applied to enhance the selective representation of key features. In the aggregation stage, global and weighted strategies are combined to generate robust text-level vectors. In the classification stage, a fully connected layer and Softmax output are used to predict class distributions, and cross-entropy loss is employed to optimize model parameters. Comparative experiments introduce multiple baseline models, including recurrent neural networks, graph neural networks, and Transformers, and evaluate them on Precision, Recall, F1-Score, and AUC. Results show that the proposed method outperforms existing models on all metrics, with especially strong improvements in Recall and AUC. In addition, sensitivity experiments are conducted on hyperparameters and data conditions, covering the impact of hidden dimensions on AUC and the impact of class imbalance ratios on Recall. The findings demonstrate that proper model configuration has a significant effect on performance and reveal the adaptability and stability of the model under different conditions. Overall, the proposed text classification method not only achieves effective performance improvement but also verifies its robustness and applicability in complex data environments through systematic analysis.
这项研究提出了一种基于大规模语言模型的文本分类算法,旨在解决传统方法在捕捉长距离依赖性、理解上下文语义以及处理类别不平衡方面的局限。该框架包括文本编码、上下文表示建模、注意力机制增强、特征聚合和分类预测阶段。 在表示阶段,通过大规模预训练的语言模型获得深层语义嵌入,并应用注意机制来增强对关键特征的选择性表示。在聚合阶段,结合全局和加权策略生成稳健的文本级向量。在分类阶段,使用全连接层和Softmax输出来预测类别分布,并采用交叉熵损失来优化模型参数。 比较实验引入了多个基准模型,包括递归神经网络、图神经网络和Transformer,并从精度(Precision)、召回率(Recall)、F1-Score以及AUC等指标评估它们的性能。结果显示,所提出的方法在所有度量标准上均优于现有模型,特别是在召回率和AUC方面有显著改进。 此外,还进行了对超参数和数据条件进行敏感性实验,涵盖了隐藏维度对AUC的影响及类别不平衡比例对召回率的影响。研究发现表明,合适的模型配置会对性能产生重要影响,并揭示了该模型在不同条件下具有适应性和稳定性。 总体而言,所提出的文本分类方法不仅实现了有效的性能提升,还通过系统分析验证了其在复杂数据环境中的鲁棒性和适用性。
https://arxiv.org/abs/2512.09444
Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF--IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at this https URL.
前沿语言模型的质量越来越依赖于我们组织网络规模文本语料库的能力进行训练。当前主导的工具在速度和灵活性之间做出权衡:例如,FastText 等词汇分类器速度快但仅限于生成分类输出分数;而变换器文本嵌入模型则可以灵活支持多种工作流程(如聚类、分类及检索),但由于计算成本高昂,并不具备高效性。我们推出了 Luxical 库,该库用于高速“词汇密集”型文本嵌入,旨在结合这两种方法的最佳特性来实现大规模网络文本组织。 Luxical 结合了稀疏的 TF-IDF 特征、一个小的 ReLU 网络和知识蒸馏训练程序,以较低的成本近似大型变换器模型。在这份技术报告中,我们将描述 Luxical 的架构及其培训目标,并在两个不同的应用中评估具体的 Luxical 模型:定向网络爬取文档检索测试及基于文本分类的语言模型数据管理任务。 在这两项任务中,相对于不同大小的神经基线模型,Luxical 能提供从 3 倍到 100 倍的速度提升,并且在数据管理工作流程中接近 FastText 模型推理速度。根据这些评估结果,在大规模文本组织方面,测试过的 Luxical 模型展示出有利的计算/质量权衡,与神经基线模型的质量相当。 Luxical 可以作为开源软件通过此链接访问(请将“this https URL”替换为实际链接地址)。
https://arxiv.org/abs/2512.09015
In this study, we propose a structured methodology that utilizes large language models (LLMs) in a cost-efficient and parsimonious manner, integrating the strengths of scholars and machines while offsetting their respective weaknesses. Our methodology, facilitated through a chain of thought and few-shot learning prompting from computer science, extends best practices for co-author teams in qualitative research to human-machine teams in quantitative research. This allows humans to utilize abductive reasoning and natural language to interrogate not just what the machine has done but also what the human has done. Our method highlights how scholars can manage inherent weaknesses OF LLMs using careful, low-cost techniques. We demonstrate how to use the methodology to interrogate human-machine rating discrepancies for a sample of 1,934 press releases announcing pharmaceutical alliances (1990-2017).
在这项研究中,我们提出了一种结构化的方法论,该方法利用大型语言模型(LLMs),以成本效益高且简洁的方式运作,并整合学者和机器的优势,同时弥补它们各自的不足。我们的方法论通过计算机科学中的思维链和少量样本学习提示来实现,将质性研究中最佳的合作作者团队实践扩展到了量化研究中的人机团队合作。这种方法使得人类能够利用溯因推理和自然语言,不仅质疑机器做了什么,还能质疑人自己做了什么。我们提出的方法强调了学者如何通过仔细且低成本的技术手段来管理LLMs固有的弱点。我们展示了如何使用该方法论来审问一组1,934份宣布医药联盟的新闻稿(1990-2017年)中的人机评分差异。
https://arxiv.org/abs/2512.07583
We present LOCUS (LOw-cost Customization for Universal Specialization), a pipeline that consumes few-shot data to streamline the construction and training of NLP models through targeted retrieval, synthetic data generation, and parameter-efficient tuning. With only a small number of labeled examples, LOCUS discovers pertinent data in a broad repository, synthesizes additional training samples via in-context data generation, and fine-tunes models using either full or low-rank (LoRA) parameter adaptation. Our approach targets named entity recognition (NER) and text classification (TC) benchmarks, consistently outperforming strong baselines (including GPT-4o) while substantially lowering costs and model sizes. Our resultant memory-optimized models retain 99% of fully fine-tuned accuracy while using barely 5% of the memory footprint, also beating GPT-4o on several benchmarks with less than 1% of its parameters.
我们介绍了LOCUS(LOw-cost Customization for Universal Specialization),这是一种通过目标检索、合成数据生成和参数高效调整来消耗少量样本数据,从而简化自然语言处理模型构建和训练的流水线。仅使用少量标记示例,LOCUS可以在广泛的数据存储库中发现相关数据,利用上下文数据生成额外的训练样本,并采用全参数或低秩(LoRA)参数适应方法进行微调。我们的方法针对命名实体识别(NER)和文本分类(TC)基准测试,在性能上持续优于强大的基线模型(包括GPT-4o),同时大幅降低成本和模型规模。我们优化后的内存模型在使用不到全量微调所需内存的5%的情况下,保留了99%的精度,并且在某些基准测试中,参数仅为GPT-4o的不到1%时依然超越了后者的表现。
https://arxiv.org/abs/2512.06239
Large Language Models (LLMs) demonstrate strong in-context learning abilities, yet their effectiveness in text classification depends heavily on prompt design and incurs substantial computational cost. Conformal In-Context Learning (CICLe) has been proposed as a resource-efficient framework that integrates a lightweight base classifier with Conformal Prediction to guide LLM prompting by adaptively reducing the set of candidate classes. However, its broader applicability and efficiency benefits beyond a single domain have not yet been systematically explored. In this paper, we present a comprehensive evaluation of CICLe across diverse NLP classification benchmarks. The results show that CICLe consistently improves over its base classifier and outperforms few-shot prompting baselines when the sample size is sufficient for training the base classifier, and performs comparably in low-data regimes. In terms of efficiency, CICLe reduces the number of shots and prompt length by up to 34.45% and 25.16%, respectively, and enables the use of smaller models with competitive performance. CICLe is furthermore particularly advantageous for text classification tasks with high class imbalance. These findings highlight CICLe as a practical and scalable approach for efficient text classification, combining the robustness of traditional classifiers with the adaptability of LLMs, and achieving substantial gains in data and computational efficiency.
大型语言模型(LLMs)展现了强大的上下文学习能力,但它们在文本分类中的效果严重依赖于提示设计,并且计算成本高。为了解决资源效率问题,提出了符合上下文学习(CICLe)框架,该框架将轻量级的基础分类器与符合预测结合,通过自适应减少候选类别集来指导LLM的提示生成。然而,其在单个领域之外的应用广度和效率优势尚未系统地探索。 本文中,我们对CICLe进行了全面评估,覆盖了多样化的NLP分类基准测试。结果显示,在基础分类器训练所需的样本量充足时,CICLe始终优于它的基础分类器,并且在少量提示基线的条件下表现出色;而在低数据环境中表现相当。 从效率角度来看,CICLe将示例数量和提示长度分别减少了最多34.45%和25.16%,并且允许使用性能竞争的小型模型。此外,在高类别不平衡的文本分类任务中,CICLe尤其具有优势。 这些发现突显了CICLe作为一种实用且可扩展的方法,用于高效的文本分类,结合传统分类器的稳健性与LLM的适应性,并在数据和计算效率方面取得了显著改进。
https://arxiv.org/abs/2512.05732
Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output - here, the category assigned to a text - depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies --codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting - with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
由于其架构和庞大的预训练数据,大型语言模型(LLMs)展现了强大的文本分类性能。然而,LLM的输出——即为文本分配的类别——在很大程度上依赖于提示语的措辞。尽管有关提示工程的研究正在扩展,但很少有研究关注分类任务,并且几乎没有研究针对像心理学这样的领域,在这些领域中,构念具有精确、理论驱动的定义,这可能不会很好地反映在预训练数据中。我们提出了一种通过提示工程优化LLM性能以识别文本中构念的经验框架。我们在零样本和少样本分类中实验性地评估了五种提示策略——代码本引导的经验提示选择、自动提示工程、人格提示、链式思维推理以及解释性提示。我们发现,人格、链式思维和解释并不能完全解决因措辞不佳的提示而导致的表现下降问题。相反,提示中最具有影响力的特征是构念定义、任务框架,并且在一定程度上,提供的示例也起到作用。在三个构念和两个模型中进行测试后,最接近专家判断的分类结果来自一种结合了代码本引导的经验提示选择与自动提示工程的少样本提示。基于我们的研究发现,我们建议研究人员尽可能生成并评估各种各样的提示变体,无论是人工设计、自动生成还是理想情况下同时采用这两种方式,并根据训练数据中的实证表现来选择提示和示例,在保留集中进一步验证最终的方法。这一程序为在需要与专家判断保持一致的环境中优化LLM提示提供了一种实用、系统且理论驱动的方法。
https://arxiv.org/abs/2512.03818
Transformer models often exhibit brittle extrapolation, failing on inputs that are longer or structurally more complex than those seen during training. We introduce Counter-Example-Driven Curricula (CEDC), an automated framework that improves model robustness by iteratively focusing on its own failures. At each step, CEDC uses the current model to generate a diverse set of candidate problems, employs a fast, executable verifier to identify incorrect predictions (counter-examples), and then fine-tunes the model on a dataset enriched with these discovered failures. We evaluate CEDC on a suite of algorithmic and natural language tasks, including integer addition, sorting, Dyck-2 language recognition, and three text classification benchmarks. Compared to static training and standard curriculum learning baselines, CEDC achieves up to 30x greater length extrapolation, is 3.75x more computationally efficient than uniform data augmentation, and requires no manual difficulty heuristics. We provide a detailed analysis of the counter-examples, showing how the curriculum naturally adapts to target progressively more complex error modes. Our findings establish verifier-guided, failure-driven learning as a simple, powerful, and efficient paradigm for enhancing the generalization capabilities of Transformer models.
翻译: Transformer 模型常常在面对超出训练数据长度或结构复杂度的输入时表现出脆弱性。我们引入了一种名为“反例驱动课程学习”(CEDC)的自动化框架,通过迭代关注模型自身的错误来提高其鲁棒性。在每一步中,CEDC 使用当前模型生成一系列多样化的候选问题集,使用快速可执行验证器识别出不正确的预测(反例),然后对包含这些发现失败案例的数据集进行微调。 我们在整数加法、排序、Dyck-2 语言识别和三项文本分类基准上评估了 CEDC。与静态训练以及标准课程学习基线相比,CEDC 在长度外推方面取得了高达30倍的进步,其计算效率比均匀数据增强高出3.75倍,并且不需要手动难度启发式方法。 我们提供了对反例的详细分析,展示了该课程如何自然地适应目标并逐渐解决更复杂的错误模式。我们的研究结果确立了由验证器引导、以失败为导向的学习作为一种简单、强大而高效的范式,用于增强 Transformer 模型的一般化能力。
https://arxiv.org/abs/2512.01187
This study presents an unsupervised method to infer discreteness, syntax and temporal structures of fruit-bats vocalizations, as a case study of graded vocal systems, and evaluates the complexity of communication patterns in relation with behavioral context. The method improved the baseline for unsupervised labeling of vocal units (i.e. syllables) through manifold learning, by investigating how dimen- sionality reduction on mel-spectrograms affects labeling, and comparing it with unsupervised labels based on acoustic similarity. We then encoded vocalizations as syllabic sequences to analyze the type of syntax, and extracted the Maximal Repetitions (MRs) to evaluate syntactical structures. We found evidence for: i) associative syntax, rather than combinatorial (context classification is unaffected by permutation of sequences, F 1 > 0.9); ii) context-dependent use of syllables (Wilcoxon rank-sum tests, p-value < 0.05); iii) heavy-tail distribution of MRs (truncated power-law, exponent {\alpha} < 2), indicative of mechanism encoding com- binatorial complexity. Analysis of MRs and syllabic transition networks revealed that mother-pupil interactions were characterized by repetitions, while commu- nication in conflict-contexts exhibited higher complexity (longer MRs and more interconnected vocal sequences) than non-agonistic contexts. We propose that communicative complexity is higher in scenarios of disagreement, reflecting lower compressibility of information.
这项研究提出了一种无监督方法,用于推断果蝠叫声中的离散性、语法结构和时间结构,并以此作为分级声音系统的案例研究。该方法评估了在行为背景下的交流模式的复杂性。 具体而言,本方法通过流形学习改进了基于无监督标签的声音单元(即音节)的基本标注水平,探讨了梅尔频谱图降维对标注的影响,并将其与基于声学相似性的无监督标签进行了比较。然后我们将声音编码为音节序列以分析语法类型,并提取最大重复次数(MRs)来评估语法结构。 研究发现有以下几点证据: i) 关联性语法规则,而不是组合性规则(序列的排列对上下文分类没有影响,F1 > 0.9) ii) 音节使用的背景依赖性(威尔科克森秩和检验,p值 < 0.05) iii) 最大重复次数的幂律分布(截断幂律,指数{\alpha} < 2),表明存在编码组合复杂性的机制 通过对最大重复次数和音节转换网络进行分析发现,母蝙蝠与幼崽之间的互动特征是重复性更强,而冲突背景下的交流则表现出更高的复杂性(更长的MRs以及更加相互连接的声音序列)。研究提出,在分歧情景下,沟通的复杂度更高,反映了信息可压缩性的降低。
https://arxiv.org/abs/2512.01033
Accurately dating historical texts is essential for organizing and interpreting cultural heritage collections. This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models. We integrate five feature categories - compression-based, lexical structure, readability, neologism detection, and distance features - to predict the temporal origin of English texts spanning five centuries. Comparative analysis shows that these feature domains provide complementary temporal signals, with combined models outperforming any individual feature set. On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above random baselines (20% and 2.3%). Under relaxed temporal precision, performance increases to 96.0% top-2 accuracy for centuries and 85.8% top-10 accuracy for decades. The final model exhibits strong ranking capabilities with AUCROC up to 94.8% and AUPRC up to 83.3%, and maintains controlled errors with mean absolute deviations of 27 years and 30 years, respectively. For authentication-style tasks, binary models around key thresholds (e.g., 1850-1900) reach 85-98% accuracy. Feature importance analysis identifies distance features and lexical structure as most informative, with compression-based features providing complementary signals. SHAP explainability reveals systematic linguistic evolution patterns, with the 19th century emerging as a pivot point across feature domains. Cross-dataset evaluation on Project Gutenberg highlights domain adaptation challenges, with accuracy dropping by 26.4 percentage points, yet the computational efficiency and interpretability of tree-based models still offer a scalable, explainable alternative to neural architectures.
https://arxiv.org/abs/2511.23056
We propose SemImage, a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences and an additional boundary row is inserted between sentences to mark semantic transitions. Each pixel is not a typical RGB value but a vector in a disentangled HSV color space, encoding different linguistic features: the Hue with two components H_cos and H_sin to account for circularity encodes the topic, Saturation encodes the sentiment, and Value encodes intensity or certainty. We enforce this disentanglement via a multi-task learning framework: a ColorMapper network maps each word embedding to the HSV space, and auxiliary supervision is applied to the Hue and Saturation channels to predict topic and sentiment labels, alongside the main task objective. The insertion of dynamically computed boundary rows between sentences yields sharp visual boundaries in the image when consecutive sentences are semantically dissimilar, effectively making paragraph breaks salient. We integrate SemImage with standard 2D CNNs (e.g., ResNet) for document classification. Experiments on multi-label datasets (with both topic and sentiment annotations) and single-label benchmarks demonstrate that SemImage can achieve competitive or better accuracy than strong text classification baselines (including BERT and hierarchical attention networks) while offering enhanced interpretability. An ablation study confirms the importance of the multi-channel HSV representation and the dynamic boundary rows. Finally, we present visualizations of SemImage that qualitatively reveal clear patterns corresponding to topic shifts and sentiment changes in the generated image, suggesting that our representation makes these linguistic features visible to both humans and machines.
我们提出了一种名为SemImage的新方法,用于将文本文档表示为二维语义图像,以便使用卷积神经网络(CNN)进行处理。在SemImage中,每个单词都作为2D图像中的一个像素来表示:行对应于句子,并且会在每两个句子之间插入额外的边界行以标记语义转换。每个像素不是典型的RGB值,而是在解耦合HSV颜色空间中的向量,编码不同的语言特征:色调(包括用于处理循环性的H_cos和H_sin两部分)表示主题,饱和度表示情感,亮度表示强度或确定性。我们通过多任务学习框架强制执行这种解耦:ColorMapper网络将每个词嵌入映射到HSV空间,并对色调和饱和度通道进行辅助监督以预测主题和情感标签,同时实现主要的任务目标。在句子之间插入动态计算的边界行会产生当连续句子语义不相似时图像中的清晰视觉边界,从而有效地使段落之间的分隔变得显著。我们将SemImage与标准2D CNN(例如ResNet)集成用于文档分类。在具有主题和情感标注的多标签数据集以及单标签基准上的实验表明,SemImage可以达到甚至优于强大的文本分类基线(包括BERT和层次注意力网络)的准确性,并且提供增强的可解释性。消融研究表明HSV多通道表示和动态边界行的重要性。最后,我们展示了SemImage的可视化效果,这些可视化的定性结果揭示了生成图像中对应于主题转换和情感变化的明显模式,表明我们的表示方法使语言特征对人和机器都可见。
https://arxiv.org/abs/2512.00088
The ubiquity of time series data creates a strong demand for general-purpose foundation models, yet developing them for classification remains a significant challenge, largely due to the high cost of labeled data. Foundation models capable of in-context learning (ICL) offer a powerful solution, adapting to new tasks with minimal examples and reducing the need for extensive retraining. However, prior work on large-scale time series models has predominantly focused on forecasting, leaving a critical gap for versatile, fine-tuning-free classification. To address this, we introduce TiCT (Time-series in-Context Transformer), a transformer-based model pre-trained exclusively on synthetic data to perform in-context classification. We make two primary technical contributions: 1) a novel architecture featuring a scalable bit-based label encoding and a special output attention mechanism to handle an arbitrary number of classes; and 2) a synthetic pre-training framework that combines a Mixup-inspired process with data augmentation to foster generalization and noise invariance. Extensive evaluations on the UCR Archive show that TiCT achieves competitive performance against state-of-the-art supervised methods. Crucially, this is accomplished using only in-context examples at inference time, without updating a single model weight.
https://arxiv.org/abs/2511.19694
In our daily lives, newspapers are an essential information source that impacts how the public talks about present-day issues. However, effectively navigating the vast amount of news content from different newspapers and online news portals can be challenging. Newspaper headlines with sentiment analysis tell us what the news is about (e.g., politics, sports) and how the news makes us feel (positive, negative, neutral). This helps us quickly understand the emotional tone of the news. This research presents a state-of-the-art approach to Bangla news headline classification combined with sentiment analysis applying Natural Language Processing (NLP) techniques, particularly the hybrid transfer learning model BERT-CNN-BiLSTM. We have explored a dataset called BAN-ABSA of 9014 news headlines, which is the first time that has been experimented with simultaneously in the headline and sentiment categorization in Bengali newspapers. Over this imbalanced dataset, we applied two experimental strategies: technique-1, where undersampling and oversampling are applied before splitting, and technique-2, where undersampling and oversampling are applied after splitting on the In technique-1 oversampling provided the strongest performance, both headline and sentiment, that is 78.57\% and 73.43\% respectively, while technique-2 delivered the highest result when trained directly on the original imbalanced dataset, both headline and sentiment, that is 81.37\% and 64.46\% respectively. The proposed model BERT-CNN-BiLSTM significantly outperforms all baseline models in classification tasks, and achieves new state-of-the-art results for Bangla news headline classification and sentiment analysis. These results demonstrate the importance of leveraging both the headline and sentiment datasets, and provide a strong baseline for Bangla text classification in low-resource.
https://arxiv.org/abs/2511.18618
Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model's label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels(e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected directives and decision outputs. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants, under varying few-shot and an adversarial setting. Our results show that the ability of LDD to recover performance lost to the adversarial attack varies across models and alias choices. For every model evaluated, LDD is able to restore a portion of the accuracy degradation caused by the attack. Moreover, for the vast majority of models, we can identify more than one alias pair that achieves higher accuracy than the under-attack baseline, in which the model relies solely on few-shot learning without any defensive mechanism. A linguistic analysis further reveals that semantically aligned alias labels(e.g., good vs. bad) yield stronger robustness than unaligned symbols(e.g., blue vs. yellow). Overall, this study demonstrates that label semantics can serve as an effective defense layer, transforming meaning itself into a shield against prompt injection.
https://arxiv.org/abs/2511.21752
News text classification is a crucial task in natural language processing, essential for organizing and filtering the massive volume of digital content. Traditional methods typically rely on statistical features like term frequencies or TF-IDF values, which are effective at capturing word-level importance but often fail to reflect contextual meaning. In contrast, modern deep learning approaches utilize semantic features to understand word usage within context, yet they may overlook simple, high-impact statistical indicators. This paper introduces an Attention-Guided Feature Fusion (AGFF) model that combines statistical and semantic features in a unified framework. The model applies an attention-based mechanism to dynamically determine the relative importance of each feature type, enabling more informed classification decisions. Through evaluation on benchmark news datasets, the AGFF model demonstrates superior performance compared to both traditional statistical models and purely semantic deep learning models. The results confirm that strategic integration of diverse feature types can significantly enhance classification accuracy. Additionally, ablation studies validate the contribution of each component in the fusion process. The findings highlight the model's ability to balance and exploit the complementary strengths of statistical and semantic representations, making it a practical and effective solution for real-world news classification tasks.
https://arxiv.org/abs/2511.17184
Understanding sentiment in financial documents is crucial for gaining insights into market behavior. These reports often contain obfuscated language designed to present a positive or neutral outlook, even when underlying conditions may be less favorable. This paper presents a novel approach using Aspect-Based Sentiment Analysis (ABSA) to decode obfuscated sentiment in Thai financial annual reports. We develop specific guidelines for annotating obfuscated sentiment in these texts and annotate more than one hundred financial reports. We then benchmark various text classification models on this annotated dataset, demonstrating strong performance in sentiment classification. Additionally, we conduct an event study to evaluate the real-world implications of our sentiment analysis on stock prices. Our results suggest that market reactions are selectively influenced by specific aspects within the reports. Our findings underscore the complexity of sentiment analysis in financial texts and highlight the importance of addressing obfuscated language to accurately assess market sentiment.
https://arxiv.org/abs/2511.13481
The rise of social networks has not only facilitated communication but also allowed the spread of harmful content. Although significant advances have been made in detecting toxic language in textual data, the exploration of concept-based explanations in toxicity detection remains limited. In this study, we leverage various subtype attributes present in toxicity detection datasets, such as obscene, threat, insult, identity attack, and sexual explicit as concepts that serve as strong indicators to identify whether language is toxic. However, disproportionate attribution of concepts towards the target class often results in classification errors. Our work introduces an interpretability technique based on the Concept Gradient (CG) method which provides a more causal interpretation by measuring how changes in concepts directly affect the output of the model. This is an extension of traditional gradient-based methods in machine learning, which often focus solely on input features. We propose the curation of Targeted Lexicon Set, which captures toxic words that contribute to misclassifications in text classification models. To assess the significance of these lexicon sets in misclassification, we compute Word-Concept Alignment (WCA) scores, which quantify the extent to which these words lead to errors due to over-attribution to toxic concepts. Finally, we introduce a lexicon-free augmentation strategy by generating toxic samples that exclude predefined toxic lexicon sets. This approach allows us to examine whether over-attribution persists when explicit lexical overlap is removed, providing insights into the model's attribution on broader toxic language patterns.
https://arxiv.org/abs/2511.16689
Large Language Models have advanced clinical text classification, but their opaque predictions remain a critical barrier to practical adoption in research and clinical settings where investigators and physicians need to understand which parts of a patient's record drive risk signals. To address this challenge, we introduce \textbf{CALM}, short for \textbf{Classification with Additive Large Language Models}, an interpretable framework for semi-structured text where inputs are composed of semantically meaningful components, such as sections of an admission note or question-answer fields from an intake form. CALM predicts outcomes as the additive sum of each component's contribution, making these contributions part of the forward computation itself and enabling faithful explanations at both the patient and population level. The additive structure also enables clear visualizations, such as component-level risk curves similar to those used in generalized additive models, making the learned relationships easier to inspect and communicate. Although CALM expects semi-structured inputs, many clinical documents already have this form, and similar structure can often be automatically extracted from free-text notes. CALM achieves performance comparable to conventional LLM classifiers while improving trust, supporting quality-assurance checks, and revealing clinically meaningful patterns during model development and auditing.
https://arxiv.org/abs/2511.11922