Large language models (LLMs) are often ensembled to improve overall reliability and robustness, but in practice the models are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and the predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models using a Gaussian copula and show an information-theoretic error floor for ensemble performance. Motivated by these results, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We test our approach on two question-answering datasets and one binary sentiment classification dataset: MEDMCQA, MMLU, and IMDB movie reviews. Across all datasets, our method consistently outperforms strong baselines under the same query budget.
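The greedy loop described above can be illustrated with a small plug-in estimator over empirical prediction/label counts. This is a minimal sketch with hypothetical function names and a naive empirical mutual-information estimator; the paper's actual estimator and budget accounting may differ.

```python
# Sketch: greedy ensemble selection by empirical mutual information.
# `mutual_information` uses a simple plug-in estimate over joint counts;
# all names here are illustrative, not the paper's implementation.
from collections import Counter
from math import log2

def mutual_information(preds, labels):
    """Empirical I(Y; predictions), where preds is a list of hashable
    per-example prediction tuples and labels the true labels."""
    n = len(labels)
    joint = Counter(zip(preds, labels))
    p_x = Counter(preds)
    p_y = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * log2(p_xy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi

def greedy_select(model_preds, labels, budget):
    """model_preds: dict name -> list of per-example predictions.
    Greedily add the model that most increases I(Y; selected preds)."""
    selected = []
    while len(selected) < budget:
        best, best_mi = None, -1.0
        for name in model_preds:
            if name in selected:
                continue
            cand = selected + [name]
            # Joint prediction tuples of the candidate ensemble.
            joint_preds = list(zip(*(model_preds[m] for m in cand)))
            mi = mutual_information(joint_preds, labels)
            if mi > best_mi:
                best, best_mi = name, mi
        selected.append(best)
    return selected
```

With a perfectly informative model "A" and an uninformative constant model "B", the greedy step picks "A" first, matching the intuition that correlated or uninformative models add little information.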
https://arxiv.org/abs/2602.08003
Despite remarkable advances in natural language processing, developing effective systems for low-resource languages remains a formidable challenge, with performance typically lagging far behind that of high-resource counterparts due to data scarcity and insufficient linguistic resources. Cross-lingual knowledge transfer has emerged as a promising approach to address this challenge by leveraging resources from high-resource languages. In this paper, we investigate methods for transferring linguistic knowledge from high-resource languages to low-resource languages, where the number of labeled training instances is in the hundreds. We focus on sentence-level and word-level tasks. We introduce a novel method, GETR (Graph-Enhanced Token Representation), for cross-lingual knowledge transfer, along with two adopted baselines: (a) augmentation in hidden layers and (b) token embedding transfer through token translation. Experimental results demonstrate that our GNN-based approach significantly outperforms existing multilingual and cross-lingual baseline methods, achieving 13 percentage point improvements on truly low-resource languages (Mizo, Khasi) for POS tagging, and 20 and 27 percentage point improvements in macro-F1 on simulated low-resource languages (Marathi, Bangla, Malayalam) for sentiment classification and NER tasks, respectively. We also present a detailed analysis of the transfer mechanisms and identify key factors that contribute to successful knowledge transfer in this linguistic context.
https://arxiv.org/abs/2602.05599
Live streaming platforms require real-time monitoring and reaction to social signals, utilizing partial and asynchronous evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight streaming encoder, escalates hard/ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained using (i) a cross-modal contrastive term to align visual/audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation), and the results show that StreamSense achieves higher accuracy than VLM-only streaming while only occasionally invoking the VLM, thereby reducing average latency and compute. Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks. Code is publicly available on GitHub.
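The IoU-weighted loss idea (down-weighting examples whose context window overlaps the labeled target segment poorly) can be sketched as follows. The `temporal_iou` helper and the normalization are illustrative assumptions, not StreamSense's exact training objective.

```python
# Sketch: IoU-weighted loss over temporal segments (illustrative only).
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def iou_weighted_loss(losses, windows, targets):
    """Scale each per-example loss by the IoU of its context window with
    its labeled target segment, so poorly overlapping segments contribute
    less and label interference across boundaries is mitigated."""
    weights = [temporal_iou(w, t) for w, t in zip(windows, targets)]
    total_w = sum(weights)
    if total_w == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total_w
```

An example whose window does not overlap its target at all receives zero weight, so its (possibly mislabeled) loss is ignored entirely.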
https://arxiv.org/abs/2601.22738
Urdu, spoken by 230 million people worldwide, lacks dedicated transformer-based language models and curated corpora. While multilingual models provide limited Urdu support, they suffer from poor performance, high computational costs, and cultural inaccuracies due to insufficient training data. To address these challenges, we present UrduLM, a pretrained Urdu monolingual language model trained in low-resource settings. We curate a 33GB Urdu corpus from diverse sources, develop a custom BPE tokenizer that reduces tokenization overhead by at least 20-30% compared to multilingual alternatives, and pretrain a 100M-parameter decoder-only model. In few-shot evaluations, UrduLM achieves competitive performance with multilingual models up to 30x its size, reaching 66.6% accuracy on sentiment classification and BLEU scores exceeding 30 on grammar correction tasks. The complete methodology -- including corpus, tokenizer, model weights, and evaluation benchmarks -- is released openly to establish a baseline for Urdu NLP research and provide a scalable framework for other underrepresented languages.
https://arxiv.org/abs/2601.17664
From school playgrounds to corporate boardrooms, status hierarchies -- rank orderings based on respect and perceived competence -- are universal features of human social organization. Language models trained on human-generated text inevitably encounter these hierarchical patterns embedded in language, raising the question of whether they might reproduce such dynamics in multi-agent settings. This thesis investigates when and how language models form status hierarchies by adapting Berger et al.'s (1972) expectation states framework. I create multi-agent scenarios where separate language model instances complete sentiment classification tasks, are introduced with varying status characteristics (e.g., credentials, expertise), then have opportunities to revise their initial judgments after observing their partner's responses. The dependent variable is deference, the rate at which models shift their ratings toward their partner's position based on status cues rather than task information. Results show that language models form significant status hierarchies when capability is equal (35 percentage point asymmetry, p < .001), but capability differences dominate status cues, with the most striking effect being that high-status assignments reduce higher-capability models' deference rather than increasing lower-capability models' deference. The implications for AI safety are significant: status-seeking behavior could introduce deceptive strategies, amplify discriminatory biases, and scale across distributed deployments far faster than human hierarchies form organically. This work identifies emergent social behaviors in AI systems and highlights a previously underexplored dimension of the alignment challenge.
https://arxiv.org/abs/2601.17577
This study investigates the use of prompt engineering to enhance large language models (LLMs), specifically GPT-4o-mini and gemini-1.5-flash, in sentiment analysis tasks. It evaluates advanced prompting techniques like few-shot learning, chain-of-thought prompting, and self-consistency against a baseline. Key tasks include sentiment classification, aspect-based sentiment analysis, and detecting subtle nuances such as irony. The research details the theoretical background, datasets, and methods used, assessing performance of LLMs as measured by accuracy, recall, precision, and F1 score. Findings reveal that advanced prompting significantly improves sentiment analysis, with the few-shot approach excelling in GPT-4o-mini and chain-of-thought prompting boosting irony detection in gemini-1.5-flash by up to 46%. Thus, while advanced prompting techniques overall improve performance, the fact that few-shot prompting works best for GPT-4o-mini and chain-of-thought excels in gemini-1.5-flash for irony detection suggests that prompting strategies must be tailored to both the model and the task. This highlights the importance of aligning prompt design with both the LLM's architecture and the semantic complexity of the task.
https://arxiv.org/abs/2601.08302
Identifying the strengths and limitations of a research paper is a core component of any literature review. However, traditional summaries reflect only the authors' self-presented perspective. Analyzing how other researchers discuss and cite the paper can offer a deeper, more practical understanding of its contributions and shortcomings. In this research, we introduce SECite, a novel approach for evaluating scholarly impact through sentiment analysis of citation contexts. We develop a semi-automated pipeline to extract citations referencing nine research papers and apply advanced natural language processing (NLP) techniques with unsupervised machine learning to classify these citation statements as positive or negative. Beyond sentiment classification, we use generative AI to produce sentiment-specific summaries that capture the strengths and limitations of each target paper, derived both from clustered citation groups and from the full text. Our findings reveal meaningful patterns in how the academic community perceives these works, highlighting areas of alignment and divergence between external citation feedback and the authors' own presentation. By integrating citation sentiment analysis with LLM-based summarization, this study provides a comprehensive framework for assessing scholarly contributions.
https://arxiv.org/abs/2601.07939
Multimodal aspect-based sentiment analysis (MABSA) aims to identify aspect-level sentiments by jointly modeling textual and visual information, which is essential for fine-grained opinion understanding in social media. Existing approaches mainly rely on discriminative classification with complex multimodal fusion, yet lack explicit sentiment explainability. In this paper, we reformulate MABSA as a generative and explainable task, proposing a unified framework that simultaneously predicts aspect-level sentiment and generates natural language explanations. Built on multimodal large language models (MLLMs), our approach employs a prompt-based generative paradigm, jointly producing sentiment and explanation. To further enhance aspect-oriented reasoning capabilities, we propose a dependency-syntax-guided sentiment cue strategy. This strategy prunes and textualizes the aspect-centered dependency syntax tree, guiding the model to distinguish different sentiment aspects and enhancing its explainability. To enable explainability, we use MLLMs to construct new datasets with sentiment explanations for fine-tuning. Experiments show that our approach not only achieves consistent gains in sentiment classification accuracy but also produces faithful, aspect-grounded explanations.
https://arxiv.org/abs/2601.06848
The emergence of large language models (LLMs) has significantly transformed natural language processing (NLP), enabling more generalized models to perform various tasks with minimal training. However, traditional sentiment analysis methods, which focus on individual tasks such as sentiment classification or aspect-based analysis, are not practical for real-world applications that usually require handling multiple tasks. While offering flexibility, LLMs often fall short of the required accuracy on sentiment-specific tasks. Techniques like fine-tuning and evolutionary model merging help integrate models into a unified framework, which can improve learning performance while reducing computational costs. The use of task metadata and curriculum learning to optimize learning processes remains underexplored, even though sentiment analysis is a critical NLP task that requires high accuracy and scalability across multiple subtasks. In this study, we propose a hybrid learning model called Multi-stage Evolutionary Model Merging with Metadata-driven Curriculum Learning (MEM-MCL) to enhance sentiment analysis with large language models. In particular, expert models are created through instruction tuning for specific sentiment tasks and then merged using evolutionary algorithms to form a unified model. The merging process is optimized with weakly labeled data to enhance performance across tasks. Curriculum learning is incorporated to provide a learning sequence based on task difficulty, improving knowledge extraction from LLMs. Experiment results demonstrate that the proposed MEM-MCL model outperforms conventional LLMs in a majority of sentiment analysis tasks, achieving superior results across various subtasks.
https://arxiv.org/abs/2601.06780
The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish--a hybrid of Hindi and English--used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
https://arxiv.org/abs/2601.05091
Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) while substantially enhancing explanation faithfulness, increasing IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Although explanations become more concise, sufficiency improves (+2.3 pp) and fairness remains stable, indicating more faithful rationales without performance or bias trade-offs.
https://arxiv.org/abs/2601.03481
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their black-box nature raises concerns about transparency and faithfulness. Input attribution methods aim to highlight each input token's contribution to the model's output, but existing approaches are typically model-agnostic and do not account for transformer-specific architectures, limiting their faithfulness. To address this, we propose Grad-ELLM, a gradient-based attribution method for decoder-only transformer-based LLMs. By aggregating channel importance from gradients of the output logit with respect to attention layers and spatial importance from attention maps, Grad-ELLM generates heatmaps at each generation step without requiring architectural modifications. Additionally, we introduce two faithfulness metrics, $\pi$-Soft-NC and $\pi$-Soft-NS, which are modifications of Soft-NC/NS that provide fairer comparisons by controlling the amount of information kept when perturbing the text. We evaluate Grad-ELLM on sentiment classification, question answering, and open-generation tasks using different models. Experiment results show that Grad-ELLM consistently achieves higher faithfulness than other attribution methods.
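The gradient-and-attention aggregation resembles Grad-CAM applied to attention maps. The sketch below is an assumed simplification (one layer, head-level channel importance from pooled gradients, spatial importance from the attention map), not the authors' exact formulation.

```python
# Sketch: Grad-CAM-style relevance over attention maps (illustrative).
import numpy as np

def attention_heatmap(grads, attn):
    """grads, attn: (heads, seq, seq) arrays for one attention layer.
    grads holds d(logit)/d(attention); returns per-token relevance."""
    # Channel (head) importance: global-average-pooled gradients.
    head_w = grads.mean(axis=(1, 2))                  # (heads,)
    # Head-weighted combination of attention maps; ReLU keeps only
    # positively contributing evidence.
    cam = np.maximum(np.einsum("h,hij->ij", head_w, attn), 0.0)
    # Token relevance: total attention mass received by each source token.
    scores = cam.sum(axis=0)                          # (seq,)
    return scores / scores.max() if scores.max() > 0 else scores
```

In a real model, `grads` would come from backpropagating the target logit to the attention tensors at a given generation step; here it is just an input array.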
https://arxiv.org/abs/2601.03089
Large Language Models (LLMs) have emerged as prominent AI models for solving many natural language tasks due to their high performance (e.g., accuracy) and their ability to generate high-quality responses to given inputs. However, their large computational cost, huge memory footprints, and high processing power/energy make embedded deployment challenging. Among efforts toward tiny LLMs, recent works have proposed spike-driven language models (SLMs) that significantly reduce the processing power/energy of LLMs. However, their memory footprints still remain too large for low-cost and resource-constrained embedded devices. A manual quantization approach can effectively compress SLM memory footprints, but it requires huge design time and compute power to find the quantization setting for each network, making it not scalable across different networks, performance requirements, and memory budgets. To bridge this gap, we propose QSLM, a novel framework that performs automated quantization to compress pre-trained SLMs while meeting performance and memory constraints. To achieve this, QSLM first identifies the hierarchy of the given network architecture and the sensitivity of network layers under quantization, then employs a tiered quantization strategy (e.g., global-, block-, and module-level quantization) while leveraging a multi-objective performance-and-memory trade-off function to select the final quantization setting. Experimental results indicate that QSLM reduces memory footprint by up to 86.5% and power consumption by up to 20%, while maintaining high performance across different tasks (i.e., up to 84.4% accuracy on sentiment classification on the SST-2 dataset and a perplexity of 23.2 for text generation on the WikiText-2 dataset), close to the original non-quantized model, and meeting the performance and memory constraints.
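The tiered idea of giving quantization-sensitive modules higher precision can be sketched with plain symmetric uniform quantization. The bit assignment rule and the names below are illustrative assumptions; QSLM's actual multi-objective search over global-, block-, and module-level settings is more involved.

```python
# Sketch: symmetric uniform quantization plus a tiered bit assignment
# (illustrative only; not QSLM's actual search procedure).
import numpy as np

def quantize(w, bits):
    """Fake-quantize a weight array to `bits` bits (symmetric, uniform):
    scale to an integer grid, round, clip, and rescale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.round(w / scale).clip(-qmax, qmax) * scale

def assign_bits(modules, sensitive, default_bits=4, sensitive_bits=8):
    """Tiered setting: quantization-sensitive modules keep higher
    precision, everything else gets the aggressive default."""
    return {name: (sensitive_bits if name in sensitive else default_bits)
            for name in modules}
```

As expected, the reconstruction error shrinks as the bit-width grows, which is exactly the sensitivity signal a tiered strategy can exploit.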
https://arxiv.org/abs/2601.00679
In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models - DistilBERT, MiniLM, and ALBERT - across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.
https://arxiv.org/abs/2601.00444
Social media (SM) platforms (e.g., Facebook, Twitter, and Reddit) are increasingly leveraged to share opinions and emotions, especially during challenging events such as natural disasters, pandemics, and political elections, as well as joyful occasions like festivals and celebrations. Among SM platforms, Reddit provides a unique space for its users to anonymously express their experiences and thoughts on sensitive issues such as health and daily life. In this work, we present a novel dataset, called NepEMO, for multi-label emotion (MLE) and sentiment classification (SC) on posts from the Nepali subreddit. We curate and build a manually annotated dataset of 4,462 posts (January 2019 to June 2025) written in English, Romanised Nepali, and Devanagari script, labeled for five emotions (fear, anger, sadness, joy, and depression) and three sentiment classes (positive, negative, and neutral). We perform a detailed analysis of the posts to capture linguistic insights, including emotion trends, co-occurrence of emotions, sentiment-specific n-grams, and topic modelling using Latent Dirichlet Allocation and TF-IDF keyword extraction. Finally, we compare various traditional machine learning (ML), deep learning (DL), and transformer models on the MLE and SC tasks. The results show that transformer models consistently outperform the ML and DL models on both tasks.
https://arxiv.org/abs/2512.22823
Financial sentiment analysis plays a crucial role in informing investment decisions, assessing market risk, and predicting stock price trends. Existing works in financial sentiment analysis have not considered the impact of stock prices or market feedback on sentiment analysis. In this paper, we propose an adaptive framework that integrates large language models (LLMs) with real-world stock market feedback to improve sentiment classification in the context of the Indian stock market. The proposed methodology fine-tunes the LLaMA 3.2 3B model using instruction-based learning on the SentiFin dataset. To enhance sentiment predictions, a retrieval-augmented generation (RAG) pipeline is employed that dynamically selects multi-source contextual information based on the cosine similarity of the sentence embeddings. Furthermore, a feedback-driven module is introduced that adjusts the reliability of the source by comparing predicted sentiment with actual next-day stock returns, allowing the system to iteratively adapt to market behavior. To generalize this adaptive mechanism across temporal data, a reinforcement learning agent trained using proximal policy optimization (PPO) is incorporated. The PPO agent learns to optimize source weighting policies based on cumulative reward signals from sentiment-return alignment. Experimental results on NIFTY 50 news headlines collected from 2024 to 2025 demonstrate that the proposed system significantly improves classification accuracy, F1-score, and market alignment over baseline models and static retrieval methods. The results validate the potential of combining instruction-tuned LLMs with dynamic feedback and reinforcement learning for robust, market-aware financial sentiment modeling.
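The feedback-weighted retrieval step (cosine-similarity ranking scaled by a per-source reliability weight that is nudged by sentiment/next-day-return agreement) might look like the following. The multiplicative update rule and the learning rate are assumptions for illustration, not the paper's exact mechanism.

```python
# Sketch: reliability-weighted source ranking with return-based feedback
# (illustrative names and update rule).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_sources(query_emb, source_embs, weights):
    """Score each source by embedding similarity to the query, scaled by
    its current reliability weight; return names best-first."""
    scores = {s: weights[s] * cosine(query_emb, e)
              for s, e in source_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def update_weight(weights, source, pred_up, realized_return, lr=0.1):
    """Feedback step: boost a source when its predicted sentiment
    direction agrees with the realized next-day return, else shrink it."""
    agree = (realized_return > 0) == pred_up
    weights[source] *= (1 + lr) if agree else (1 - lr)
    return weights
```

Over many headlines, sources whose sentiment repeatedly disagrees with next-day returns are progressively down-weighted in retrieval; the PPO agent in the paper plays a similar role at the policy level.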
https://arxiv.org/abs/2512.20082
Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available. this https URL
https://arxiv.org/abs/2512.19379
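The Macro-F1 figures reported above weight each of the seven emotion classes equally, which is why the metric is informative on a long-tailed class distribution: rare emotions count as much as common ones. A minimal sketch of the metric in plain Python (illustrative, not the authors' evaluation code):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1, then average with equal
    weight per class, so minority classes are not drowned out."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

On a toy 4-example, 3-class case (one confusion between classes "a" and "b") this yields (2/3 + 2/3 + 1) / 3 ≈ 0.778, illustrating how a single error on a rare class drags the macro average down.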
Financial sentiment analysis enhances market understanding; however, standard natural language processing approaches encounter significant challenges when applied to small datasets. This study provides a comparative evaluation of embedding-based methods for financial news sentiment classification in resource-constrained environments. Word2Vec, GloVe, and sentence transformer representations are evaluated in combination with gradient boosting on manually labeled headlines. Experimental results identify a substantial gap between validation and test performance, with models performing worse than trivial baselines despite strong validation metrics. The analysis demonstrates that pretrained embeddings yield diminishing returns below a critical data sufficiency threshold, and that small validation sets contribute to overfitting during model selection. Practical application is illustrated through weekly sentiment aggregation and narrative summarization for market monitoring workflows. The findings offer empirical evidence that embedding quality alone cannot address fundamental data scarcity in sentiment classification. For practitioners operating with limited resources, the results indicate the need to consider alternative approaches such as few-shot learning, data augmentation, or lexicon-enhanced hybrid methods when labeled samples are scarce.
https://arxiv.org/abs/2512.13749
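The validation/test gap this study identifies can be reproduced in miniature: when several models are ranked on a small validation set, the winner's validation score is optimistically biased even if every candidate is equally mediocre. A toy stdlib-only simulation (all numbers are illustrative assumptions, not the paper's setup):

```python
import random

random.seed(0)
TRUE_ACC = 0.55          # assume every candidate model is equally mediocre
N_MODELS, N_VAL, N_TRIALS = 10, 25, 2000

gap_sum = 0.0
for _ in range(N_TRIALS):
    # Validation accuracy of each model: N_VAL Bernoulli(TRUE_ACC) outcomes.
    val_scores = [
        sum(random.random() < TRUE_ACC for _ in range(N_VAL)) / N_VAL
        for _ in range(N_MODELS)
    ]
    best = max(val_scores)        # model selection picks the luckiest model
    gap_sum += best - TRUE_ACC    # optimism of the selected validation score

mean_gap = gap_sum / N_TRIALS     # expected over-estimate of true accuracy
```

With 25 validation examples, each score has a standard error of roughly 0.1, so selecting the best of ten models over-estimates true accuracy by around 15 points on average; this is the mechanism behind "strong validation metrics, worse-than-baseline test performance."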
Semantic distance measurement is a fundamental problem in computational linguistics, providing a quantitative characterization of similarity or relatedness between text segments, and underpinning tasks such as text retrieval and text classification. From a mathematical perspective, a semantic distance can be viewed as a metric defined on a space of texts or on a representation space derived from them. However, most classical semantic distance methods are essentially fixed, making them difficult to adapt to specific data distributions and task requirements. In this paper, a semantic distance measure based on multi-kernel Gaussian processes (MK-GP) was proposed. The latent semantic function associated with texts was modeled as a Gaussian process, with its covariance function given by a composite kernel that combines Matérn and polynomial components. The kernel parameters were learned automatically from data under supervision, rather than being hand-crafted. This semantic distance was instantiated and evaluated in the context of fine-grained sentiment classification with large language models under an in-context learning (ICL) setup. The experimental results demonstrated the effectiveness of the proposed measure.
https://arxiv.org/abs/2512.12238
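A positive-definite kernel induces a distance on its feature space via d(x, y)² = k(x, x) + k(y, y) − 2k(x, y), which is presumably how a learned kernel becomes a semantic distance here. A stdlib-only sketch with a Matérn-3/2 plus polynomial combination; the weights and hyperparameters are fixed illustrative values, whereas the paper learns them from data:

```python
import math

def matern32(x, y, length=1.0):
    """Matérn-3/2 kernel on the Euclidean distance between vectors x and y."""
    a = math.sqrt(3) * math.dist(x, y) / length
    return (1.0 + a) * math.exp(-a)

def poly(x, y, c=1.0, degree=2):
    """Polynomial kernel (x . y + c)^degree."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + c) ** degree

def combined_kernel(x, y, w_m=0.7, w_p=0.3):
    # Illustrative fixed weights; in the MK-GP setting these would be
    # learned under supervision along with the kernel hyperparameters.
    return w_m * matern32(x, y) + w_p * poly(x, y)

def semantic_distance(x, y):
    """Kernel-induced distance: d(x, y)^2 = k(x,x) + k(y,y) - 2 k(x,y)."""
    d2 = combined_kernel(x, x) + combined_kernel(y, y) - 2.0 * combined_kernel(x, y)
    return math.sqrt(max(d2, 0.0))
```

The induced distance is zero between identical inputs and symmetric in its arguments, so adapting the kernel to the task reshapes the metric without breaking these properties.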
The rapid integration of generative artificial intelligence into education has driven digital transformation in e-teaching, yet user perceptions of AI educational apps remain underexplored. This study performs a sentiment-driven evaluation of user reviews from top AI ed-apps on the Google Play Store to assess efficacy, challenges, and pedagogical implications. Our pipeline involved scraping app data and reviews, RoBERTa for binary sentiment classification, GPT-4o for key point extraction, and GPT-5 for synthesizing top positive/negative themes. Apps were categorized into seven types (e.g., homework helpers, math solvers, language tools), with overlaps reflecting multifunctional designs. Results indicate predominantly positive sentiments, with homework apps like Edu AI (95.9% positive) and another homework app (92.7%) leading in accuracy, speed, and personalization, while language/LMS apps (e.g., Teacher AI at 21.8% positive) lag due to instability and limited features. Positives emphasize efficiency in brainstorming, problem-solving, and engagement; negatives center on paywalls, inaccuracies, ads, and glitches. Trends show that homework helpers outperform specialized tools, highlighting AI's democratizing potential amid risks of dependency and inequity. The discussion proposes future ecosystems with hybrid AI-human models, VR/AR for immersive learning, and a roadmap for developers (adaptive personalization) and policymakers (monetization regulation for inclusivity). This underscores generative AI's role in advancing e-teaching by enabling ethical refinements that foster equitable, innovative environments. The full dataset is publicly available.
https://arxiv.org/abs/2512.11934
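The per-app percentages above (e.g., 95.9% positive for Edu AI) are straightforward aggregations of the binary sentiment labels produced by the classifier stage. A minimal stdlib sketch of that aggregation step, with made-up review labels rather than the study's data:

```python
from collections import defaultdict

# Hypothetical (app, sentiment) pairs from a binary classifier: 1 = positive.
reviews = [("Edu AI", 1), ("Edu AI", 1), ("Edu AI", 0),
           ("Teacher AI", 0), ("Teacher AI", 1), ("Teacher AI", 0)]

counts = defaultdict(lambda: [0, 0])        # app -> [positives, total]
for app, label in reviews:
    counts[app][0] += label
    counts[app][1] += 1

pct_positive = {app: 100.0 * pos / total for app, (pos, total) in counts.items()}
ranking = sorted(pct_positive, key=pct_positive.get, reverse=True)
```

Ranking apps by this percentage is what surfaces the trend reported above, with homework helpers at the top and language/LMS apps at the bottom.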