Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
https://arxiv.org/abs/2602.08951
Large language models (LLMs) such as GPT-4o and Claude Sonnet 4.5 have demonstrated strong capabilities in open-ended reasoning and generative language tasks, leading to their widespread adoption across a broad range of NLP applications. However, for structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone, overlooking operational constraints encountered in production systems. In this work, we present a systematic comparison of two contrasting paradigms for text classification: zero- and few-shot prompt-based large language models, and fully fine-tuned encoder-only architectures. We evaluate these approaches across four canonical benchmarks (IMDB, SST-2, AG News, and DBPedia), measuring predictive quality (macro F1), inference latency, and monetary cost. We frame model evaluation as a multi-objective decision problem and analyze trade-offs using Pareto frontier projections and a parameterized utility function reflecting different deployment regimes. Our results show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance while operating at one to two orders of magnitude lower cost and latency compared to zero- and few-shot LLM prompting. Overall, our findings suggest that indiscriminate use of large language models for standard text classification workloads can lead to suboptimal system-level outcomes. Instead, fine-tuned encoders emerge as robust and efficient components for structured NLP pipelines, while LLMs are better positioned as complementary elements within hybrid architectures. We release all code, datasets, and evaluation protocols to support reproducibility and cost-aware NLP system design.
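The multi-objective framing described above can be sketched concretely. The toy below is illustrative only (not the paper's released code): model names, prices, and latencies are hypothetical, and the linear utility is one simple instance of a "parameterized utility function reflecting different deployment regimes".

```python
# Illustrative sketch: Pareto filtering and a parameterized utility over
# (macro-F1, cost, latency) model profiles. All numbers are hypothetical.

def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one. Profiles are (f1, cost_usd_per_1k, latency_ms);
    F1 is maximized, cost and latency are minimized."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better

def pareto_front(profiles):
    return {name: p for name, p in profiles.items()
            if not any(dominates(q, p) for q in profiles.values())}

def utility(p, w_f1=1.0, w_cost=0.1, w_lat=0.01):
    """Linear utility for one deployment regime: larger cost/latency
    weights model stricter production constraints."""
    f1, cost, lat = p
    return w_f1 * f1 - w_cost * cost - w_lat * lat

profiles = {
    "bert-finetuned": (0.93, 0.02, 15.0),
    "llm-zero-shot":  (0.91, 2.00, 900.0),
    "llm-few-shot":   (0.94, 4.50, 1400.0),
}

front = pareto_front(profiles)
best = max(front, key=lambda n: utility(front[n]))
```

Under these made-up numbers the zero-shot LLM is dominated outright, while the few-shot LLM survives on F1 alone but loses once cost and latency enter the utility.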
https://arxiv.org/abs/2602.06370
Large language models (LLMs) are widely used as zero-shot and few-shot classifiers, where task behaviour is largely controlled through prompting. A growing number of works have observed that LLMs are sensitive to prompt variations, with small changes leading to large changes in performance. However, in many cases, the investigation of sensitivity is performed using underspecified prompts that provide minimal task instructions and weakly constrain the model's output space. In this work, we argue that a significant portion of the observed prompt sensitivity can be attributed to prompt underspecification. We systematically study and compare the sensitivity of underspecified prompts and prompts that provide specific instructions. Utilising performance analysis, logit analysis, and linear probing, we find that underspecified prompts exhibit higher performance variance and lower logit values for relevant tokens, while instruction-prompts suffer less from such problems. However, linear probing analysis suggests that the effects of prompt underspecification have only a marginal impact on the internal LLM representations, instead emerging in the final layers. Overall, our findings highlight the need for more rigour when investigating and mitigating prompt sensitivity.
https://arxiv.org/abs/2602.04297
Finding effective prompts for language models (LMs) is critical yet notoriously difficult: the prompt space is combinatorially large, and rewards are sparse because target-LM evaluation is expensive. Moreover, existing RL-based prompt optimizers often rely on on-policy updates and a meta-prompt sampled from a fixed distribution, leading to poor sample efficiency. We propose GFlowPO, a probabilistic prompt optimization framework that casts prompt search as a posterior inference problem over latent prompts regularized by a meta-prompted reference-LM prior. In the first step, we fine-tune a lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective, using a replay-based training policy that reuses past prompt evaluations to enable sample-efficient exploration. In the second step, we introduce Dynamic Memory Update (DMU), a training-free mechanism that updates the meta-prompt by injecting both (i) diverse prompts from a replay buffer and (ii) top-performing prompts from a small priority queue, thereby progressively concentrating the search process on high-reward regions. Across few-shot text classification, instruction induction benchmarks, and question answering tasks, GFlowPO consistently outperforms recent discrete prompt optimization baselines.
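The Dynamic Memory Update step can be sketched with a replay buffer plus a small priority queue. This is a minimal stand-in for the mechanism the abstract describes, not the paper's implementation; the prompts, rewards, and meta-prompt format are all hypothetical.

```python
# Sketch of a DMU-style step: the meta-prompt is rebuilt from (i) diverse
# prompts sampled from a replay buffer and (ii) top-performing prompts kept
# in a small priority queue. Rewards and prompt strings are hypothetical.
import heapq
import random

class DynamicMemory:
    def __init__(self, k_top=3, k_diverse=2, seed=0):
        self.top = []            # min-heap of (reward, prompt), size <= k_top
        self.replay = []         # every evaluated prompt
        self.k_top = k_top
        self.k_diverse = k_diverse
        self.rng = random.Random(seed)

    def record(self, prompt, reward):
        self.replay.append(prompt)
        heapq.heappush(self.top, (reward, prompt))
        if len(self.top) > self.k_top:
            heapq.heappop(self.top)   # drop the queue's current worst

    def meta_prompt(self):
        best = [p for _, p in sorted(self.top, reverse=True)]
        diverse = self.rng.sample(self.replay,
                                  min(self.k_diverse, len(self.replay)))
        return "Examples:\n" + "\n".join(best + diverse)

mem = DynamicMemory()
for prompt, reward in [("Classify politely:", 0.4), ("Label the topic:", 0.7),
                       ("Answer with one word:", 0.9), ("Think step by step:", 0.6),
                       ("Output the class name:", 0.8)]:
    mem.record(prompt, reward)
```

The priority queue concentrates the meta-prompt on high-reward regions while the replay sample keeps exploration alive.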
https://arxiv.org/abs/2602.03358
One-shot prediction enables rapid adaptation of pretrained foundation models to new tasks using only one labeled example, but lacks principled uncertainty quantification. While conformal prediction provides finite-sample coverage guarantees, standard split conformal methods are inefficient in the one-shot setting due to data splitting and reliance on a single predictor. We propose Conformal Aggregation of One-Shot Predictors (CAOS), a conformal framework that adaptively aggregates multiple one-shot predictors and uses a leave-one-out calibration scheme to fully exploit scarce labeled data. Despite violating classical exchangeability assumptions, we prove that CAOS achieves valid marginal coverage using a monotonicity-based argument. Experiments on one-shot facial landmarking and RAFT text classification tasks show that CAOS produces substantially smaller prediction sets than split conformal baselines while maintaining reliable coverage.
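The leave-one-out calibration idea can be illustrated with a generic sketch: each labeled example induces one toy one-shot predictor, and when scoring example i the predictor built from it is left out, so every scarce label contributes to calibration. This is not the paper's CAOS procedure or its monotonicity argument, just a simplified leave-one-out conformal interval with a hypothetical predictor.

```python
# Generic leave-one-out conformal sketch (not the paper's CAOS).
import math

def one_shot_predictor(x0, y0):
    """Toy one-shot 'predictor': extends its single example linearly
    (a stand-in for a foundation model adapted from one example)."""
    return lambda x: y0 + (x - x0)

def loo_conformal_interval(labeled, x_new, alpha=0.2):
    preds = [one_shot_predictor(x, y) for x, y in labeled]
    scores = []
    for i, (x, y) in enumerate(labeled):
        others = preds[:i] + preds[i + 1:]         # leave predictor i out
        yhat = sum(p(x) for p in others) / len(others)
        scores.append(abs(y - yhat))
    scores.sort()
    n = len(scores)
    k = min(n - 1, math.ceil((1 - alpha) * (n + 1)) - 1)  # conformal quantile
    q = scores[k]
    center = sum(p(x_new) for p in preds) / len(preds)    # aggregate all
    return center - q, center + q

# On data the toy predictors fit exactly (y = x + 1), all scores collapse
# to zero and the interval is tight around the aggregated prediction.
lo, hi = loo_conformal_interval([(1, 2), (2, 3), (3, 4)], x_new=5)
```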
https://arxiv.org/abs/2601.05219
Automated peer review has evolved from simple text classification to structured feedback generation. However, current state-of-the-art systems still struggle with "surface-level" critiques: they excel at summarizing content but often fail to accurately assess novelty and significance or identify deep methodological flaws because they evaluate papers in a vacuum, lacking the external context a human expert possesses. In this paper, we introduce ScholarPeer, a search-enabled multi-agent framework designed to emulate the cognitive processes of a senior researcher. ScholarPeer employs a dual-stream process of context acquisition and active verification. It dynamically constructs a domain narrative using a historian agent, identifies missing comparisons via a baseline scout, and verifies claims through a multi-aspect Q&A engine, grounding the critique in live web-scale literature. We evaluate ScholarPeer on DeepReview-13K and the results demonstrate that ScholarPeer achieves significant win-rates against state-of-the-art approaches in side-by-side evaluations and reduces the gap to human-level diversity.
https://arxiv.org/abs/2601.22638
The increasing availability of unstructured clinical narratives in electronic health records (EHRs) has created new opportunities for automated disease characterization, cohort identification, and clinical decision support. However, modeling long, domain-specific clinical text remains challenging due to limited labeled data, severe class imbalance, and the high computational cost of adapting large pretrained language models. This study presents a GPT-based architecture for clinical text classification that adapts a pretrained decoder-only Transformer using a selective fine-tuning strategy. Rather than updating all model parameters, the majority of the GPT-2 backbone is frozen, and training is restricted to the final Transformer block, the final layer normalization, and a lightweight classification head. This approach substantially reduces the number of trainable parameters while preserving the representational capacity required to model complex clinical language. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset using uncertainty-aware CheXpert-style labels derived directly from report text. Experiments cover multiple problem formulations, including multi-label classification of radiographic findings, binary per-label classification under different uncertainty assumptions, and aggregate disease outcome prediction. Across varying dataset sizes, the model exhibits stable convergence behavior and strong classification performance, particularly in settings dominated by non-mention and negated findings. Overall, the results indicate that selective fine-tuning of pretrained generative language models provides an efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data while significantly reducing computational complexity.
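The selective fine-tuning strategy amounts to a parameter-name filter: freeze the backbone and mark only the final Transformer block, the final layer norm, and the classification head as trainable. The sketch below mimics Hugging Face GPT-2 parameter naming (`transformer.h.<i>`, `transformer.ln_f`) with an illustrative, heavily truncated parameter list; it is not the paper's code.

```python
# Sketch of selective fine-tuning: only the last block, final layer norm,
# and a classification head are trainable. Names follow the Hugging Face
# GPT-2 convention; the parameter list here is illustrative, not complete.

N_LAYERS = 12   # GPT-2 small has 12 blocks: transformer.h.0 ... h.11

def is_trainable(param_name, n_layers=N_LAYERS):
    last_block = f"transformer.h.{n_layers - 1}."
    return (param_name.startswith(last_block)
            or param_name.startswith("transformer.ln_f.")
            or param_name.startswith("classifier."))

params = (
    [f"transformer.h.{i}.attn.c_attn.weight" for i in range(N_LAYERS)]
    + [f"transformer.h.{i}.mlp.c_fc.weight" for i in range(N_LAYERS)]
    + ["transformer.wte.weight", "transformer.wpe.weight",
       "transformer.ln_f.weight", "transformer.ln_f.bias",
       "classifier.weight", "classifier.bias"]
)

trainable = [p for p in params if is_trainable(p)]
```

In an actual PyTorch setup the same predicate would drive `param.requires_grad = is_trainable(name)` over `model.named_parameters()`.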
https://arxiv.org/abs/2601.21955
Hierarchical text classification (HTC) depends on taxonomies that organize labels into structured hierarchies. However, many real-world taxonomies introduce ambiguities, such as identical leaf names under similar parent nodes, which prevent language models (LMs) from learning clear decision boundaries. In this paper, we present TaxMorph, a framework that uses large language models (LLMs) to transform entire taxonomies through operations such as renaming, merging, splitting, and reordering. Unlike prior work, our method revises the full hierarchy to better match the semantics encoded by LMs. Experiments across three HTC benchmarks show that LLM-refined taxonomies consistently outperform human-curated ones across settings, by up to 2.9 percentage points in F1. To better understand these improvements, we compare how well LMs can assign leaf nodes to parent nodes and vice versa across human-curated and LLM-refined taxonomies. We find that human-curated taxonomies lead to more easily separable clusters in embedding space. However, the LLM-refined taxonomies align more closely with the model's actual confusion patterns during classification. In other words, even though they are harder to separate, they better reflect the model's inductive biases. These findings suggest that LLM-guided refinement creates taxonomies that are more compatible with how models learn, improving HTC performance.
https://arxiv.org/abs/2601.18375
Intent detection, a fundamental text classification task, aims to identify and label the semantics of user queries, playing a vital role in numerous business applications. Despite the dominance of deep learning techniques in this field, the internal mechanisms enabling Recurrent Neural Networks (RNNs) to solve intent detection tasks are poorly understood. In this work, we apply dynamical systems theory to analyze how RNN architectures address this problem, using both the balanced SNIPS and the imbalanced ATIS datasets. By interpreting sentences as trajectories in the hidden state space, we first show that on the balanced SNIPS dataset, the network learns an ideal solution: the state space, constrained to a low-dimensional manifold, is partitioned into distinct clusters corresponding to each intent. The application of this framework to the imbalanced ATIS dataset then reveals how this ideal geometric solution is distorted by class imbalance, causing the clusters for low-frequency intents to degrade. Our framework decouples geometric separation from readout alignment, providing a novel, mechanistic explanation for real world performance disparities. These findings provide new insights into RNN dynamics, offering a geometric interpretation of how dataset properties directly shape a network's computational solution.
https://arxiv.org/abs/2601.17156
Introduction: Clinical text classification using natural language processing (NLP) models requires adequate training data to achieve optimal performance. In practice, 200-500 documents are typically annotated, a number constrained by time and cost rather than justified by sample-size requirements or their relationship to text vocabulary properties. Methods: Using the publicly available MIMIC-III dataset containing hospital discharge notes with ICD-9 diagnoses as labels, we employed pre-trained BERT embeddings followed by Random Forest classifiers to identify 10 randomly selected diagnoses, varying training corpus sizes from 100 to 10,000 documents, and analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words embeddings. Results: Learning curves varied significantly across the 10 classification tasks despite identical preprocessing and algorithms, with 600 documents sufficient to achieve 95% of the performance attainable with 10,000 documents for all tasks. Vocabulary analysis revealed that more strong predictors and fewer noisy predictors were associated with steeper learning curves, where every 100 additional noisy words decreased accuracy by approximately 0.02 while 100 additional strong predictors increased maximum accuracy by approximately 0.04.
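The sample-size analysis above can be sketched in a few lines. The learning-curve values below are hypothetical (only the 600-document sufficiency point and the reported linear vocabulary effects come from the abstract).

```python
# Sketch of the sample-size analysis: find the smallest corpus size that
# reaches 95% of the accuracy attained with the full 10,000 documents.
# The curve values are hypothetical illustrations.

curve = {100: 0.71, 300: 0.80, 600: 0.84, 1000: 0.85,
         3000: 0.87, 10000: 0.88}

def sufficient_size(curve, fraction=0.95):
    target = fraction * max(curve.values())
    return min(n for n, acc in curve.items() if acc >= target)

def vocab_effect(extra_strong, extra_noisy):
    """The abstract's reported linear effects: +0.04 maximum accuracy per
    100 additional strong predictors, -0.02 accuracy per 100 noisy words."""
    return 0.04 * extra_strong / 100 - 0.02 * extra_noisy / 100

n_sufficient = sufficient_size(curve)
```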
https://arxiv.org/abs/2601.15846
Open-set learning and discovery (OSLD) is a challenging machine learning task in which samples from new (unknown) classes can appear at test time. It can be seen as a generalization of zero-shot learning, where the new classes are not known a priori, hence involving the active discovery of new classes. While zero-shot learning has been extensively studied in text classification, especially with the emergence of pre-trained language models, open-set learning and discovery is a comparatively new setup for the text domain. To this end, we introduce the first multilingual open-set learning and discovery (MOSLD) benchmark for text categorization by topic, comprising 960K data samples across 12 languages. To construct the benchmark, we (i) rearrange existing datasets and (ii) collect new data samples from the news domain. Moreover, we propose a novel framework for the OSLD task, which integrates multiple stages to continuously discover and learn new classes. We evaluate several language models, including our own, to obtain results that can be used as reference for future work. We release our benchmark at this https URL.
https://arxiv.org/abs/2601.13437
The increasing prevalence of Large Language Models (LLMs) demands effective safeguards for their operation, particularly concerning their tendency to generate out-of-context responses. A key challenge is accurately detecting when LLMs stray from expected conversational norms, manifesting as topic shifts, factual inaccuracies, or outright hallucinations. Traditional anomaly detection struggles to apply directly within contextual semantics. This paper outlines our experiment in exploring the use of Representation Engineering (RepE) and a One-Class Support Vector Machine (OCSVM) to identify subspaces within the internal states of LLMs that represent a specific context. By training the OCSVM on in-context examples, we establish a robust boundary within the LLM's hidden state latent space. We evaluate our approach with two open-source LLMs, Llama and Qwen, in a specific contextual domain. Our approach entailed identifying the optimal layers within the LLM's internal state subspaces that most strongly associate with the context of interest. Our evaluation shows promising results in identifying the subspace for a specific context. Aside from being useful for detecting in-context and out-of-context conversation threads, this research work contributes to the study of better interpreting LLMs.
https://arxiv.org/abs/2601.12286
Bengali text classification is a significant task in natural language processing (NLP), where text is categorized into predefined labels. Unlike English, Bengali faces challenges due to the lack of extensive annotated datasets and pre-trained language models. This study explores the effectiveness of large language models (LLMs) in classifying Bengali newspaper articles. The dataset used, obtained from Kaggle, consists of articles from Prothom Alo, a major Bangladeshi newspaper. Three instruction-tuned LLMs (LLaMA 3.1 8B Instruct, LLaMA 3.2 3B Instruct, and Qwen 2.5 7B Instruct) were evaluated for this task under the same classification framework. Among the evaluated models, Qwen 2.5 achieved the highest classification accuracy of 72%, showing particular strength in the "Sports" category. In comparison, LLaMA 3.1 and LLaMA 3.2 attained accuracies of 53% and 56%, respectively. The findings highlight the effectiveness of LLMs in Bengali text classification, despite the scarcity of resources for Bengali NLP. Future research will focus on exploring additional models, addressing class imbalance issues, and refining fine-tuning approaches to improve classification performance.
https://arxiv.org/abs/2601.12132
Large language models (LLMs) are challenging to deploy for domain-specific tasks due to their massive scale. While distilling a fine-tuned LLM into a smaller student model is a promising alternative, the capacity gap between teacher and student often leads to suboptimal performance. This raises a key question: when and how can a student model match or even surpass its teacher on domain-specific tasks? In this work, we propose a novel theoretical insight: a student can outperform its teacher if its advantage on a Student-Favored Subdomain (SFS) outweighs its deficit on the Teacher-Favored Subdomain (TFS). Guided by this insight, we propose Scheduled Checkpoint Distillation (SCD), which reduces the TFS deficit by emulating the teacher's convergence process during supervised fine-tuning (SFT) on the domain task, and a sample-wise Adaptive Weighting (AW) mechanism to preserve student strengths on SFS. Experiments across diverse domain tasks, including QA, NER, and text classification in multiple languages, show that our method consistently outperforms existing distillation approaches, allowing the student model to match or even exceed the performance of its fine-tuned teacher.
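One way to picture a sample-wise adaptive weighting rule is as a sigmoid of the student-teacher loss gap. This is an illustrative guess at the general idea, not the paper's exact AW mechanism: samples where the student already beats the teacher (SFS) down-weight the distillation term, while teacher-favored samples (TFS) keep it dominant.

```python
# Hypothetical sample-wise adaptive weighting (not the paper's exact rule).
import math

def adaptive_weight(student_loss, teacher_loss, temperature=1.0):
    """Sigmoid of the loss gap: near 1 where the teacher is clearly
    better on this sample, near 0 where the student already is."""
    gap = (student_loss - teacher_loss) / temperature
    return 1.0 / (1.0 + math.exp(-gap))

def total_loss(student_loss, distill_loss, teacher_loss):
    w = adaptive_weight(student_loss, teacher_loss)
    return (1.0 - w) * student_loss + w * distill_loss
```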
https://arxiv.org/abs/2601.10114
Deep neural networks have achieved remarkable success across a variety of tasks, yet they often suffer from unreliable probability estimates. As a result, they can be overconfident in their predictions. Conformal Prediction (CP) offers a principled framework for uncertainty quantification, yielding prediction sets with rigorous coverage guarantees. Existing conformal training methods optimize for overall set size, but shaping the prediction sets in a class-conditional manner is not straightforward and typically requires prior knowledge of the data distribution. In this work, we introduce Class Adaptive Conformal Training (CaCT), which formulates conformal training as an augmented Lagrangian optimization problem that adaptively learns to shape prediction sets class-conditionally without making any distributional assumptions. Experiments on multiple benchmark datasets, including standard and long-tailed image recognition as well as text classification, demonstrate that CaCT consistently outperforms prior conformal training methods, producing significantly smaller and more informative prediction sets while maintaining the desired coverage guarantees.
https://arxiv.org/abs/2601.09522
We consider the problem of distinguishing human-written creative fiction (excerpts from novels) from similar text generated by an LLM. Our results show that, while human observers perform poorly (near chance levels) on this binary classification task, a variety of machine-learning models achieve accuracy in the range 0.93 - 0.98 over a previously unseen test set, even using only short samples and single-token (unigram) features. We therefore employ an inherently interpretable (linear) classifier (with a test accuracy of 0.98), in order to elucidate the underlying reasons for this high accuracy. In our analysis, we identify specific unigram features indicative of LLM-generated text, one of the most important being that the LLM tends to use a larger variety of synonyms, thereby skewing the probability distributions in a manner that is easy to detect for a machine learning classifier, yet very difficult for a human observer. Four additional explanation categories were also identified, namely, temporal drift, Americanisms, foreign language usage, and colloquialisms. As identification of the AI-generated text depends on a constellation of such features, the classification appears robust, and therefore not easy to circumvent by malicious actors intent on misrepresenting AI-generated text as human work.
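An inherently interpretable unigram classifier of this kind can be sketched as log-odds word weights. The training corpora below are tiny hypothetical stand-ins (and the "synonym variety" effect is only hinted at by the word choices); the paper's actual classifier and data are not reproduced here.

```python
# Interpretable linear unigram sketch: each word's weight is the log-odds
# of its smoothed relative frequency in (hypothetical) LLM-generated vs
# human-written text; a document's score is the sum of its words' weights.
import math
from collections import Counter

def unigram_weights(llm_docs, human_docs, smoothing=1.0):
    llm = Counter(w for d in llm_docs for w in d.split())
    hum = Counter(w for d in human_docs for w in d.split())
    vocab = set(llm) | set(hum)
    n_llm = sum(llm.values()) + smoothing * len(vocab)
    n_hum = sum(hum.values()) + smoothing * len(vocab)
    return {w: math.log((llm[w] + smoothing) / n_llm)
             - math.log((hum[w] + smoothing) / n_hum) for w in vocab}

def score(weights, doc):
    """Positive scores lean LLM-generated, negative lean human-written."""
    return sum(weights.get(w, 0.0) for w in doc.split())

weights = unigram_weights(
    llm_docs=["the gleaming azure sky", "a resplendent azure morning"],
    human_docs=["the blue sky", "a blue morning sky"],
)
```

Because every weight is attached to a single visible word, the model's reasons for a decision can be read off directly, which is the property the analysis above exploits.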
https://arxiv.org/abs/2601.07368
Transformers require positional encodings to represent sequence order, yet most prior work focuses on designing new positional encodings rather than examining how positional information is fused with token embeddings. In this paper, we study whether the fusion mechanism itself affects performance, particularly in long-sequence settings. We conduct a controlled empirical study comparing three canonical fusion strategies (element-wise addition, concatenation with projection, and scalar gated fusion) under identical Transformer architectures, data splits, and random seeds. Experiments on three text classification datasets spanning short (AG News), medium (IMDB), and long (ArXiv) sequences show that fusion choice has negligible impact on short texts but produces consistent gains on long documents. To verify that these gains are structural rather than stochastic, we perform paired-seed analysis and cross-dataset comparison across sequence-length regimes. Additional experiments on the ArXiv dataset indicate that the benefit of learnable fusion generalizes across multiple positional encoding families. Finally, we explore a lightweight convolutional gating mechanism that introduces local inductive bias at the fusion level, evaluated on long documents only. Our results indicate that positional-encoding fusion is a non-trivial design choice for long-sequence Transformers and should be treated as an explicit modeling decision rather than a fixed default.
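The three fusion strategies compared above can be shown on toy vectors. The embedding dimension, values, projection matrix, and gate value below are all hypothetical fixed numbers; in a Transformer the projection and gate would be learned.

```python
# Toy sketch of the three canonical fusion strategies (d = 4).
d = 4
tok = [0.5, -0.2, 0.1, 0.9]   # token embedding
pos = [0.1, 0.3, -0.4, 0.2]   # positional encoding

def fuse_add(t, p):
    """Element-wise addition, the standard default."""
    return [ti + pi for ti, pi in zip(t, p)]

def fuse_concat_proj(t, p, W):
    """Concatenation followed by a learned projection W from 2d back to d."""
    x = t + p                                   # concatenation, length 2d
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def fuse_gated(t, p, g=0.25):
    """Scalar gate g in [0, 1] balancing token vs positional information."""
    return [(1 - g) * ti + g * pi for ti, pi in zip(t, p)]

# An averaging projection: this particular W reduces concat+proj to
# 0.5 * (tok + pos), i.e. the gated fusion with g = 0.5.
W = [[0.5 if j == i or j == i + d else 0.0 for j in range(2 * d)]
     for i in range(d)]
```

The point of the comparison is that with a learnable W or g the model can weight positional information differently per dimension or per layer, which the fixed addition cannot.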
https://arxiv.org/abs/2601.05807
This paper proposes an automatic speech recognition (ASR) model for hate speech using large language models (LLMs). The proposed method integrates the encoder of the ASR model with the decoder of the LLMs, enabling simultaneous transcription and censorship tasks to prevent the exposure of harmful content. Instruction tuning of the LLM to mask hate-related words with specific tokens requires an annotated hate speech dataset, which is limited. We generate text samples using an LLM with the Chain-of-Thought (CoT) prompting technique guided by cultural context and examples and then convert them into speech samples using a text-to-speech (TTS) system. However, some of the generated samples are non-hate speech that nonetheless contains hate-related words, which degrades censorship performance. We therefore keep only the samples that text classification models correctly label as hate content. By adjusting the threshold on the number of models that must label a sample correctly, we can control the level of hate in the generated dataset, allowing us to train the LLMs through curriculum learning in a gradual manner. Experimental results show that the proposed method achieves a masking accuracy of 58.6% for hate-related words, surpassing previous baselines. We also confirm that the curriculum training contributes to the efficiency of both transcription and censorship tasks.
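The vote-threshold filtering that drives the curriculum can be sketched directly. The samples and classifier votes below are hypothetical; the idea is that a higher threshold yields a cleaner, easier early stage, and lowering it gradually admits harder samples.

```python
# Sketch of vote-based curriculum filtering: a generated sample is kept for
# a stage only if at least `threshold` of the text classifiers label it as
# hate content. Samples and votes are hypothetical.

samples = [
    ("sample_a", [1, 1, 1]),   # 1 = this classifier labels it hate
    ("sample_b", [1, 1, 0]),
    ("sample_c", [1, 0, 0]),
    ("sample_d", [0, 0, 0]),
]

def curriculum_stage(samples, threshold):
    return [name for name, votes in samples if sum(votes) >= threshold]

# Relaxing the threshold stage by stage grows the training set gradually.
stages = {t: curriculum_stage(samples, t) for t in (3, 2, 1)}
```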
https://arxiv.org/abs/2601.04654
User-Defined Text Classification (UDTC) considers the challenge of classifying input text to user-specified, previously unseen classes, a setting that arises frequently in real-world applications such as enterprise analytics, content moderation, and domain-specific information retrieval. We propose a soft-contextualized encoder architecture for UDTC which contextualizes each candidate label with the label set and a static soft prompt representation of the input query. Training on diverse, multi-source datasets enables the model to generalize effectively to zero-shot classification over entirely unseen topic sets drawn from arbitrary domains. We evaluate the proposed architecture both on held-out in-distribution test data and on multiple unseen UDTC benchmarks. Across datasets, the model achieves state-of-the-art performance, consistently outperforming or matching the baselines.
https://arxiv.org/abs/2601.03450
Semantic text classification has undergone significant advances in recent years due to the rise of large language models (LLMs) and their high-dimensional embeddings. While LLM embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure of semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster, the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. We outline how this approach can be used to classify scientific abstracts as a case study. This enables the data-driven discovery of research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as 20 Newsgroups and the IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally, we discuss possible applications in scientometrics and topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.
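The nesting behaviour can be illustrated in one dimension, where "density" reduces to a maximum gap between neighbouring points. The scalar "embeddings" below are hypothetical stand-ins for high-dimensional LLM embeddings, and gap-based splitting is a deliberate simplification of density clustering.

```python
# 1-D sketch of nested density clustering: as the density criterion (here a
# maximum neighbour gap `eps`) is relaxed, tight clusters merge into more
# diffuse ones until a single root cluster remains.

def clusters_at(points, eps):
    """Split the sorted points wherever the gap to the next point exceeds eps."""
    pts = sorted(points)
    groups, current = [], [pts[0]]
    for a, b in zip(pts, pts[1:]):
        if b - a <= eps:
            current.append(b)
        else:
            groups.append(current)
            current = [b]
    groups.append(current)
    return groups

points = [0.0, 0.1, 0.2, 1.0, 1.1, 5.0]
# Sweeping eps from tight to loose yields the levels of the tree:
tree_levels = {eps: clusters_at(points, eps) for eps in (0.15, 1.0, 5.0)}
```

Each cluster at a tight threshold is contained in exactly one cluster at any looser threshold, which is what makes the sweep a tree rather than an arbitrary sequence of partitions.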
https://arxiv.org/abs/2512.23471