System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU -- an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's $O(n^2)$ memory makes long-context classification (8K--32K tokens) impossible: at 8K tokens, three concurrent classifiers need ${\sim}$4.5\,GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and memory problems. \emph{Stage~1}: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from $O(n^2)$ to $O(n)$ and end-to-end (E2E) latency from 4{,}918\,ms to 127\,ms (\textbf{38.7$\times$}), enabling 8K--32K tokens where SDPA OOMs. \emph{Stage~2}: classical NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) reduces all inputs to ${\sim}$512 tokens without neural inference, capping both latency and GPU memory at a constant regardless of original prompt length (E2E 127$\to$62\,ms, \textbf{2.0$\times$}). \emph{Stage~3}: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (E2E 62$\to$50\,ms, \textbf{1.2$\times$}). Cumulatively: a \textbf{98$\times$} improvement (4{,}918\,ms to 50\,ms), 16K-token routing in 108\,ms, and a total router GPU footprint under 800\,MB -- small enough to share a GPU with LLM serving, eliminating the need for a dedicated accelerator. Stage~1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages~2 and~3 are hardware-agnostic.
https://arxiv.org/abs/2603.12646
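The Stage-2 compression described above combines four classical signals; the exact weighting is not given here, so the following is a minimal sketch using only two of them (TF-IDF salience and position weighting) with an illustrative token budget:

```python
import math
from collections import Counter

def compress(sentences, budget=512):
    """Toy sketch of Stage-2-style prompt compression: score sentences by
    TF-IDF salience plus a position bonus and keep the top scorers, in
    original order, within a token budget. (The router additionally uses
    TextRank and novelty scoring, omitted here.)"""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # inverse document frequency over the sentence "corpus"
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        salience = sum(tf[w] / len(d) * idf[w] for w in tf)
        position = 1.0 / (1.0 + i)  # earlier sentences weigh more
        scored.append((salience + position, i))
    kept, used = [], 0
    for _, i in sorted(scored, reverse=True):
        if used + len(docs[i]) <= budget:
            kept.append(i)
            used += len(docs[i])
    return [sentences[i] for i in sorted(kept)]
```

Because the scoring uses no neural inference, latency and memory stay constant in the original prompt length, which is the property the abstract emphasizes.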
Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families (NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs), encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4--12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
https://arxiv.org/abs/2603.11991
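The embedding-model family evaluated above matches a text to the label whose description it is most similar to. A minimal sketch of that mechanism, with a toy bag-of-words vector standing in for a real embedding model (e.g. the paper's GTE-large-en-v1.5):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real text-embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(text, label_descriptions):
    """Assign the label whose natural-language description the text is
    most similar to -- no task-specific training or labeled examples."""
    v = embed(text)
    return max(label_descriptions,
               key=lambda lab: cosine(v, embed(label_descriptions[lab])))
```

The same text-vs-description scoring underlies the reranker and NLI cross-encoder families; only the similarity function changes.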
This work describes an automatic text classification method implemented in a software tool called NETHIC, which takes advantage of the inner capabilities of highly scalable neural networks combined with the expressiveness of hierarchical taxonomies. As such, NETHIC succeeds in bringing about a mechanism for text classification that proves to be significantly effective as well as efficient. The tool underwent an experimentation process against both a generic and a domain-specific corpus, yielding promising results. On the basis of this experimentation, NETHIC has now been further refined and extended by adding a document embedding mechanism, which has shown performance improvements both on the individual networks and on the whole hierarchical model.
https://arxiv.org/abs/2603.11770
Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot prompting, few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models, but that these models still fail to match the performance delivered by proprietary LLMs.
https://arxiv.org/abs/2603.11295
Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.
https://arxiv.org/abs/2603.10876
The pseudo-projector is a lightweight modification that can be integrated into existing language models and other neural networks without altering their core architecture. It can be viewed as a hidden-representation corrector that reduces sensitivity to noise by suppressing directions induced by label-irrelevant input content. The design is inspired by the multigrid (MG) paradigm, originally developed to accelerate the convergence of iterative solvers for partial differential equations and boundary value problems, and later extended to more general linear systems through algebraic multigrid methods. We refer to the method as a pseudo-projector because its linear prototype corresponds to a strictly idempotent orthogonal projector, whereas the practical formulation employs learnable restriction and prolongation operators and therefore does not, in general, satisfy the properties of an exact orthogonal projection. We evaluate the proposed approach on transformer-based text classification tasks, as well as controlled synthetic benchmarks, demonstrating its effectiveness in improving training dynamics and robustness. Experimental results, together with supporting theoretical heuristics, indicate consistent improvements in training behavior across a range of settings, with no adverse effects observed otherwise. Our next step will be to extend this approach to language models.
https://arxiv.org/abs/2603.09815
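To make the naming concrete: for a full-row-rank linear restriction $R \in \mathbb{R}^{m \times n}$ ($m < n$) with the prolongation fixed to $R^{\top}$, as in the classical multigrid prototype, the induced coarse-space operator is the orthogonal projector onto $\operatorname{range}(R^{\top})$ (notation ours; this is the standard construction, not necessarily the paper's exact formulation):
\[
P \;=\; R^{\top}\bigl(R R^{\top}\bigr)^{-1} R,
\qquad P^{2} = P, \qquad P^{\top} = P.
\]
Once the restriction $R_{\theta}$ and prolongation $P_{\phi}$ are learned independently, the composite $P_{\phi} R_{\theta}$ is in general neither idempotent nor symmetric, hence "pseudo-projector".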
This study examines the role of uncertainty estimation (UE) methods in multilingual text classification under noisy and non-topical conditions. Using a complex-vs-simple sentence classification task across several languages, we evaluate a range of UE techniques against multiple metrics to assess their contribution to making more robust predictions. Results indicate that while methods relying on softmax outputs remain competitive in high-resource in-domain settings, their reliability declines in low-resource or domain-shift scenarios. In contrast, Monte Carlo dropout approaches demonstrate consistently strong performance across all languages, offering more robust calibration, stable decision thresholds, and greater discriminative power even under adverse conditions. We further demonstrate the positive impact of UE on non-topical classification: abstaining from predicting the 10\% most uncertain instances increases the macro F1 score from 0.81 to 0.85 in the Readme task. By integrating UE with trustworthiness metrics, this study provides actionable insights for developing more reliable NLP systems in real-world multilingual environments. See this https URL
https://arxiv.org/abs/2603.07330
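The abstention result can be illustrated with a small selective-prediction sketch (toy data and plain accuracy rather than the paper's macro F1; the mechanism, dropping the most uncertain fraction of inputs, is the same):

```python
def selective_predict(uncertainties, abstain_frac=0.10):
    """Return the indices kept after abstaining on the abstain_frac
    most uncertain inputs (e.g. the 10% with highest uncertainty)."""
    n = len(uncertainties)
    n_abstain = int(round(n * abstain_frac))
    by_unc = sorted(range(n), key=lambda i: uncertainties[i], reverse=True)
    abstained = set(by_unc[:n_abstain])
    return [i for i in range(n) if i not in abstained]

def accuracy(preds, labels, idx):
    return sum(preds[i] == labels[i] for i in idx) / len(idx)

preds  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
# a well-calibrated model is least certain exactly where it errs (3 and 9)
unc    = [.1, .1, .2, .9, .1, .2, .1, .1, .2, .8]
kept = selective_predict(unc, abstain_frac=0.10)
```

Metrics on the retained subset improve only insofar as the UE method ranks errors as uncertain, which is why calibration quality is the paper's central concern.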
Density aggregation is a central problem in machine learning, for instance when combining predictions from a Deep Ensemble. The choice of aggregation remains an open question with two commonly proposed approaches being linear pooling (probability averaging) and geometric pooling (logit averaging). In this work, we address this question by studying the normalized generalized mean of order $r \in \mathbb{R} \cup \{-\infty,+\infty\}$ through the lens of log-likelihood, the standard evaluation criterion in machine learning. This provides a unifying aggregation formalism and shows different optimal configurations for different situations. We show that the regime $r \in [0,1]$ is the only range ensuring systematic improvements relative to individual distributions, thereby providing a principled justification for the reliability and widespread practical use of linear ($r=1$) and geometric ($r=0$) pooling. In contrast, we show that aggregation rules with $r \notin [0,1]$ may fail to provide consistent gains with explicit counterexamples. Finally, we corroborate our theoretical findings with empirical evaluations using Deep Ensembles on image and text classification benchmarks.
https://arxiv.org/abs/2603.04204
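A minimal sketch of the normalized generalized mean for discrete distributions (assuming strictly positive probabilities so the $r=0$ geometric limit is well defined; the $r=\pm\infty$ min/max cases are omitted):

```python
import math

def generalized_mean_pool(dists, r):
    """Normalized generalized mean of order r over K probability vectors.
    r=1 recovers linear pooling (probability averaging); r=0 is the
    limiting case, geometric pooling (logit averaging)."""
    K, n = len(dists), len(dists[0])
    if r == 0:
        # geometric mean per class, then renormalize
        raw = [math.exp(sum(math.log(d[j]) for d in dists) / K)
               for j in range(n)]
    else:
        raw = [(sum(d[j] ** r for d in dists) / K) ** (1.0 / r)
               for j in range(n)]
    z = sum(raw)
    return [v / z for v in raw]
```

The paper's result is that only orders in $[0,1]$, the interval bracketed by these two rules, guarantee a log-likelihood no worse than the individual members'.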
Short text classification (STC) remains a challenging task due to the scarcity of contextual information and labeled data. However, existing approaches have predominantly focused on English because most benchmark datasets for STC are available primarily in English. Consequently, existing methods seldom incorporate the linguistic and structural characteristics of Korean, such as its agglutinative morphology and flexible word order. To address these limitations, we propose LIGRAM, a hierarchical heterogeneous graph model for Korean short-text classification. The proposed model constructs sub-graphs at the morpheme, part-of-speech, and named-entity levels and hierarchically integrates them to compensate for the limited contextual information in short texts while precisely capturing the grammatical and semantic dependencies inherent in Korean. In addition, we apply Semantics-aware Contrastive Learning (SemCon) to reflect semantic similarity across documents, enabling the model to establish clearer decision boundaries even in short texts where class distinctions are often ambiguous. We evaluate LIGRAM on four Korean short-text datasets, where it consistently outperforms existing baseline models. These outcomes validate that integrating language-specific graph representations with SemCon provides an effective solution for short text classification in agglutinative languages such as Korean.
https://arxiv.org/abs/2603.03652
OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration from the user. The library also provides integrated data quality diagnostics, configurable out-of-distribution (OOD) detection, and large language model (LLM) features, all within a minimal low-code API. The demo app is accessible at this https URL.
https://arxiv.org/abs/2603.01824
Cybercrime forums play a central role in the cybercrime ecosystem, serving as hubs for the exchange of illicit goods, services, and knowledge. Previous studies have explored the market and social structures of these forums, but less is known about the behavioral dynamics of users, particularly regarding participants' disclosure of criminal activity. This study provides the first large-scale assessment of crime disclosure patterns in a major cybercrime forum, analysing over 3.5 million posts from nearly 300k users. Using a three-level classification scheme (benign, grey, and crime) and a scalable labelling pipeline powered by large language models (LLMs), we measure the level of crime disclosure present in initial posts, analyse how participants switch between levels, and assess how crime disclosure behavior relates to private communications. Our results show that crime disclosure is relatively normative: one quarter of initial posts include explicit crime-related content, and more than one third of users disclose criminal activity at least once in their initial posts. At the same time, most participants show restraint, with over two-thirds posting only benign or grey content and typically escalating disclosure gradually. Grey initial posts are particularly prominent, indicating that many users avoid overt statements and instead anchor their activity in ambiguous content. The study highlights the value of LLM-based text classification and Markov chain modelling for capturing crime disclosure patterns, offering insights for law enforcement efforts aimed at distinguishing benign, grey, and criminal content in cybercrime forums.
https://arxiv.org/abs/2603.01624
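The Markov-chain component mentioned above can be sketched as a maximum-likelihood transition-matrix estimate over the three disclosure levels (toy sequences below; the paper's data is 3.5M posts):

```python
from collections import defaultdict

LEVELS = ["benign", "grey", "crime"]

def transition_matrix(sequences):
    """First-order Markov chain over disclosure levels, estimated from
    per-user sequences of post labels by counting observed transitions."""
    counts = {a: defaultdict(int) for a in LEVELS}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    matrix = {}
    for a in LEVELS:
        total = sum(counts[a].values())
        matrix[a] = {b: counts[a][b] / total if total else 0.0
                     for b in LEVELS}
    return matrix
```

Row probabilities then quantify how users escalate (e.g. benign to grey to crime) versus switch back, which is the switching behavior the study analyses.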
This study addresses the issues of semantic entanglement, unclear label structure, and insufficient feature representation in few-shot text classification, and proposes an optimization framework based on structured prompts to enhance semantic understanding and task adaptation under low-resource conditions. The framework first uses a pretrained language model to encode the input text and obtain basic semantic representations. It then introduces structured prompts composed of multi-dimensional semantic factors and integrates them with text features through a learnable combination mechanism, which forms task-related representations with clear boundaries in the latent space. To further strengthen the consistency between text representations and label semantics, the method constructs a structured label embedding matrix and employs a cross-space alignment mechanism to ensure stable matching between textual features and label attributes. In addition, the model applies prompt orthogonality constraints and a joint optimization objective to maintain independence across different semantic factors in the prompts, allowing the structured prompts to provide transparent and controllable guidance for classification decisions. Three sensitivity experiments (learning rate, prompt length, and data scale) are designed to evaluate the stability and robustness of the framework under different conditions. Experimental results show that the proposed structured prompt optimization framework effectively alleviates semantic conflicts and label ambiguity in few-shot text classification. It significantly improves performance on accuracy, precision, recall, and AUC, and demonstrates strong cross-task applicability.
https://arxiv.org/abs/2602.23753
In this work, we study idiosyncrasies in the caption models and their downstream impact on text-to-image models. We design a systematic analysis: given either a generated caption or the corresponding image, we train neural networks to predict the originating caption model. Our results show that text classification yields very high accuracy (99.70\%), indicating that captioning models embed distinctive stylistic signatures. In contrast, these signatures largely disappear in the generated images, with classification accuracy dropping to at most 50\% even for the state-of-the-art Flux model. To better understand this cross-modal discrepancy, we further analyze the data and find that the generated images fail to preserve key variations present in captions, such as differences in the level of detail, emphasis on color and texture, and the distribution of objects within a scene. Overall, our classification-based framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.
https://arxiv.org/abs/2602.22734
Customer-provided reviews have become an important source of information for business owners and other customers alike. However, effectively analyzing millions of unstructured reviews remains challenging. While large language models (LLMs) show promise for natural language understanding, their application to large-scale review analysis has been limited by computational costs and scalability concerns. This study proposes a hybrid approach that uses LLMs for aspect identification while employing classic machine-learning methods for sentiment classification at scale. Using ChatGPT to analyze sampled restaurant reviews, we identified key aspects of dining experiences and developed sentiment classifiers using human-labeled reviews, which we subsequently applied to 4.7 million reviews collected over 17 years from a major online platform. Regression analysis reveals that our machine-labeled aspects significantly explain variance in overall restaurant ratings across different aspects of dining experiences, cuisines, and geographical regions. Our findings demonstrate that combining LLMs with traditional machine learning approaches can effectively automate aspect-based sentiment analysis of large-scale customer feedback, suggesting a practical framework for both researchers and practitioners in the hospitality industry and potentially, other service sectors.
https://arxiv.org/abs/2602.21082
Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently with deep-level grammatical and semantic features. Existing models capture features by learning from large corpora with transformer models, which are computationally intensive and unsuitable for resource-constrained environments. Therefore, our proposed study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classification model without resorting to fully parameterised transformer models or heavy deep learning architectures. The novelty of our approach lies in its explicit encoding of sentence-level grammatical structure, including syntactic composition, phrase patterns, and complexity indicators, into a compact grammar vector, which is then fused with frozen contextual embeddings. These heterogeneous elements are unified into a single representation that captures both the structural and semantic characteristics of the text. Deep learning models such as Deep Belief Networks (DBNs), Long Short-Term Memory (LSTMs), BiLSTMs, and transformer-based BERT and XLNet were used to train and evaluate the model, with the number of epochs varied. Based on experimental results, the unified feature representation model captures both the semantic and structural properties of text, outperforming baseline models by 2%-15% and enabling more effective learning across heterogeneous domains. Unlike prior syntax-aware transformer models that inject grammatical structure through additional attention layers, tree encoders, or full fine-tuning, the proposed framework treats grammar as an explicit inductive bias rather than a learnable module, resulting in a very lightweight model that delivers better performance on edge devices.
https://arxiv.org/abs/2602.20749
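The fusion step described above can be sketched in a few lines; the feature names below are hypothetical illustrations, not the paper's exact grammar indicators:

```python
import numpy as np

# hypothetical sentence-level grammar indicators (illustrative only)
GRAMMAR_FEATURES = ["n_noun_phrases", "n_verb_phrases", "max_clause_depth",
                    "n_subordinate_clauses", "avg_sentence_length"]

def grammar_vector(feats):
    """Encode grammatical structure as a compact fixed-order vector."""
    return np.array([float(feats.get(k, 0.0)) for k in GRAMMAR_FEATURES])

def unified_representation(frozen_embedding, feats):
    """Fuse the grammar vector with a frozen contextual embedding by
    concatenation, so grammar acts as an explicit inductive bias rather
    than a learnable module."""
    return np.concatenate([frozen_embedding, grammar_vector(feats)])
```

Because the embedding stays frozen and the grammar features are computed by rules, only the small downstream classifier needs training, which is what keeps the model edge-friendly.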
This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real-world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine-tuned transformer-based BERT model. The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99\%, but required significantly longer training times and greater computational resources. The BiLSTM model provided a strong compromise, reaching approximately 98.56\% accuracy while maintaining moderate training costs and offering robust contextual understanding. Naive Bayes proved to be the fastest to train, on the order of milliseconds, yet delivered the lowest accuracy, averaging around 94.5\%. Class imbalance influenced all methods, particularly in the recognition of minority categories. A fully functional demonstrative system was implemented to validate practical applicability, enabling automated routing of technical requests with throughput unattainable through manual processing. The study concludes that BiLSTM offers the most balanced solution for the examined scenario, while also outlining opportunities for future improvements and further exploration of transformer architectures.
https://arxiv.org/abs/2602.20336
This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1-score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.
https://arxiv.org/abs/2602.19333
Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013--2022) for topic discovery. The online keyword-driven data collection results in a significant amount of irrelevant content. We explore four approaches to filter relevant content: (1) translating English annotated data into target languages to build language-specific models, (2) translating unlabelled data from all languages into English to create a single model based on English annotations, (3) applying English fine-tuned multilingual transformers directly to each target language's data, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modeling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.
https://arxiv.org/abs/2602.17051
This paper presents a system description for Arabic medical text classification across 82 distinct categories. Our primary architecture utilizes a fine-tuned AraBERTv2 encoder enhanced with a hybrid pooling strategy, combining attention and mean representations, and multi-sample dropout for robust regularization. We systematically benchmark this approach against a suite of multilingual and Arabic-specific encoders, as well as several large-scale causal decoders, including zero-shot re-ranking via Llama 3.3 70B and feature extraction from Qwen 3B hidden states. Our findings demonstrate that specialized bidirectional encoders significantly outperform causal decoders in capturing the precise semantic boundaries required for fine-grained medical text classification. We show that causal decoders, optimized for next-token prediction, produce sequence-biased embeddings that are less effective for categorization compared to the global context captured by bidirectional attention. Despite significant class imbalance and label noise identified within the training data, our results highlight the superior semantic compression of fine-tuned encoders for specialized Arabic NLP tasks. Final performance metrics on the test set, including Accuracy and Macro-F1, are reported and discussed.
https://arxiv.org/abs/2603.10008
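A sketch of the two components named above, hybrid (attention + mean) pooling and multi-sample dropout, with random weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def hybrid_pool(hidden, w_att):
    """Concatenate attention-pooled and mean-pooled token states.
    hidden: (seq_len, dim) token representations; w_att: (dim,) learned
    scoring vector (randomly initialized here for illustration)."""
    scores = hidden @ w_att
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # softmax attention weights
    return np.concatenate([alpha @ hidden, hidden.mean(axis=0)])

def multi_sample_dropout_logits(pooled, W, n_samples=4, p=0.5):
    """Multi-sample dropout: average classifier logits over several
    dropout masks of the same pooled representation."""
    outs = []
    for _ in range(n_samples):
        mask = (rng.random(pooled.shape) >= p) / (1.0 - p)  # inverted dropout
        outs.append((pooled * mask) @ W)
    return np.mean(outs, axis=0)
```

Averaging over masks regularizes the classifier head at negligible cost, since the encoder forward pass is shared across samples.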
We present our approach to the AbjadGenEval shared task on detecting AI-generated Arabic text. We fine-tuned the multilingual E5-large encoder for binary classification, and we explored several strategies for pooling token representations, including weighted layer pooling, multi-head attention pooling, and gated fusion. Interestingly, none of these outperformed simple mean pooling, which achieved an F1 of 0.75 on the test set. We believe this is because complex pooling methods introduce additional parameters that need more data to train properly, whereas mean pooling offers a stable baseline that generalizes well even with limited examples. We also observe a clear pattern in the data: human-written texts tend to be significantly longer than machine-generated ones.
https://arxiv.org/abs/2603.10007
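The winning strategy, simple mean pooling, reduces to a few lines once padding is masked out; a minimal sketch:

```python
import numpy as np

def masked_mean_pool(token_states, attention_mask):
    """Mean-pool token representations over real (non-padding) positions.
    token_states: (seq_len, dim); attention_mask: (seq_len,) of 0/1."""
    mask = attention_mask.astype(token_states.dtype)[:, None]
    return (token_states * mask).sum(axis=0) / mask.sum()
```

With no parameters of its own, this pooling cannot overfit the limited shared-task data, which is the stability argument the abstract makes.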