Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
https://arxiv.org/abs/2604.15153
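The paper's merging encoder is learned; as a minimal illustration of the compression arithmetic only, the sketch below replaces the lightweight encoder with simple mean pooling (an assumption, not the paper's method) and shows how merging blocks of K embeddings yields the reported length reduction:

```python
import numpy as np

def merge_k_tokens(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Merge each contiguous block of k token embeddings into one.

    The paper trains a lightweight encoder for this step; mean pooling
    is used here only as a stand-in to show the shape arithmetic.
    """
    n, d = embeddings.shape
    pad = (-n) % k                      # right-pad so n is divisible by k
    if pad:
        embeddings = np.vstack([embeddings, np.zeros((pad, d))])
    return embeddings.reshape(-1, k, d).mean(axis=1)

seq = np.random.randn(100, 768)        # 100 token embeddings (dims assumed)
compressed = merge_k_tokens(seq, k=4)
print(compressed.shape)                # (25, 768)
```

With k=4 the sequence shrinks to a quarter of its length, matching the abstract's "up to 75% input length reduction."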
While Aspect-based Sentiment Analysis (ABSA) systems have achieved high accuracy in identifying sentiment polarities, they often operate as "black boxes," lacking the explicit reasoning capabilities characteristic of human affective cognition. Humans do not merely categorize sentiment; they construct causal explanations for their judgments. To bridge this gap, we propose ABSA-R1, a large language model framework designed to mimic this "reason-before-predict" cognitive process. By leveraging reinforcement learning (RL), ABSA-R1 learns to articulate the "why" behind the "what," generating natural language justifications that ground its sentiment predictions. We introduce a Cognition-Aligned Reward Model that enforces consistency between the generated reasoning path and the final emotional label. Furthermore, inspired by metacognitive monitoring, we implement a performance-driven rejection sampling strategy that selectively targets hard cases where the model's internal reasoning is uncertain or inconsistent. Experimental results on four benchmarks demonstrate that equipping models with this explicit reasoning capability not only enhances interpretability but also yields superior performance in sentiment classification and triplet extraction compared to non-reasoning baselines.
https://arxiv.org/abs/2604.13398
Hate speech detection in Devanagari-scripted social media memes presents compounded challenges: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three-class sentiment classification: positive, neutral, negative). We propose a hybrid cross-modal attention fusion architecture that combines CLIP (ViT-B/32) for visual encoding with BGE-M3 for multilingual text representation, connected through 4-head self-attention and a learnable gating network that dynamically weights modality contributions on a per-sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines on Subtask A, while uncovering two unexpected but critical findings: English-centric vision models exhibit near-random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N ≈ 850 per fold) due to correlated overfitting. The code can be accessed at this https URL
https://arxiv.org/abs/2604.14218
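The core of the per-sample gating idea can be written in a few lines. The sketch below is an assumed minimal form (a single sigmoid gate over concatenated features, random placeholder weights, shared feature dimension with projections omitted), not the paper's trained network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(text_feat, img_feat, W, b):
    """Per-sample scalar gate g in (0, 1) weighting the two modalities.

    W and b parameterize the gating network; in the paper they are learned,
    here they are random placeholders. Both features are assumed to share
    one dimension (projection layers omitted).
    """
    g = sigmoid(np.concatenate([text_feat, img_feat]) @ W + b)
    return g * text_feat + (1.0 - g) * img_feat

d = 512
rng = np.random.default_rng(0)
t, v = rng.standard_normal(d), rng.standard_normal(d)   # BGE-M3 / CLIP stand-ins
W = rng.standard_normal(2 * d) * 0.01
fused = gated_fusion(t, v, W, 0.0)
print(fused.shape)                                      # (512,)
```

Because the gate is a scalar in (0, 1), the fused vector is an elementwise convex combination of the two modality features; a learned gate lets each sample lean on whichever modality is informative.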
[Background:] Thematic analysis of free-text justifications in human experiments provides significant qualitative insights. Yet, it is costly because reliable annotations require multiple domain experts. Large language models (LLMs) seem ideal candidates to replace human annotators. [Problem:] Coding security-specific aspects (code identifiers mentioned, lines-of-code mentioned, security keywords mentioned) may require deeper contextual understanding than sentiment classification. [Objective:] Explore whether LLMs can act as automated annotators for technical security comments by human subjects. [Method:] We prompt four top-performing LLMs on LiveBench to detect nine security-relevant codes in free-text comments by human subjects analyzing vulnerable code snippets. Outputs are compared to human annotators using Cohen's Kappa (chance-corrected accuracy). We test different prompts mimicking annotation best practices, including emerging codes, detailed codebooks with examples, and conflicting examples. [Negative Results:] We observed marked improvements only when using detailed code descriptions; however, these improvements are not uniform across codes and are insufficient to reliably replace a human annotator. [Limitations:] Additional studies with more LLMs and annotation tasks are needed.
https://arxiv.org/abs/2604.10834
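The study's agreement measure, Cohen's Kappa, corrects raw percent agreement for chance agreement and reduces to a few lines for two annotators. A self-contained sketch with hypothetical binary code labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical presence/absence labels for one security-relevant code
human = [1, 1, 0, 1, 0, 0, 1, 0]
model = [1, 1, 0, 0, 0, 0, 1, 1]
print(round(cohens_kappa(human, model), 3))  # 0.5
```

Here raw agreement is 0.75, but with balanced labels chance agreement is 0.5, so kappa drops to 0.5; this gap between raw and chance-corrected agreement is why the paper reports kappa rather than accuracy.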
We present a systematic empirical study of the spectral structure of LoRA weight updates. Through 2D Discrete Cosine Transform (DCT) analysis of trained adaptation matrices across BERT-base and RoBERTa-base on four GLUE benchmarks (SST-2, MNLI, CoLA, QQP), we establish that LoRA updates are universally dominated by low-frequency components: on average, just 33% of DCT coefficients capture 90% of total spectral energy. Retaining only 10% of frequency coefficients reduces adapter storage by 10x while sacrificing only 1.95pp on SST-2. Notably, frequency masking at k=50% improves over full LoRA on 3 of 8 model-task pairs, suggesting high-frequency components act as adaptation noise. We further discover that RoBERTa-base is systematically more spectrally compressible than BERT-base across all tasks, and that task complexity governs spectral sensitivity -- NLI tasks require more frequency budget than sentiment classification. These findings motivate a new design principle for PEFT: spectral sparsity in adaptation.
https://arxiv.org/abs/2604.10649
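The key measurement, the fraction of 2D-DCT coefficients needed to capture a target share of spectral energy, is easy to reproduce. The sketch below is an illustrative reimplementation (orthonormal DCT-II built from scratch; the smooth test matrix is a placeholder, not a real LoRA update):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are frequency components)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.cos(np.pi * (m + 0.5) * k / n) * np.sqrt(2.0 / n)
    c[0] /= np.sqrt(2.0)
    return c

def energy_fraction_for(delta_w, target=0.90):
    """Smallest fraction of 2D-DCT coefficients holding `target` energy."""
    cr, cc = dct_matrix(delta_w.shape[0]), dct_matrix(delta_w.shape[1])
    spec = cr @ delta_w @ cc.T                 # 2D DCT of the update matrix
    e = np.sort(spec.ravel() ** 2)[::-1]       # coefficient energies, descending
    cum = np.cumsum(e) / e.sum()
    return (np.searchsorted(cum, target) + 1) / e.size

smooth = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))  # placeholder update
print(energy_fraction_for(smooth))             # well below 1.0: energy is low-frequency
```

Applied to trained LoRA matrices, this statistic is what yields the paper's "33% of coefficients capture 90% of energy" finding.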
The exponential growth of user-generated movie reviews on digital platforms has made accurate text sentiment classification a cornerstone task in natural language processing. Traditional models, including standard BERT and recurrent architectures, frequently struggle to capture long-distance semantic dependencies and resolve ambiguous emotional expressions in lengthy review texts. This paper proposes a novel hybrid framework that seamlessly integrates dynamic adaptive multi-head attention with supervised contrastive learning into a BERT-based Transformer encoder. The dynamic adaptive attention module employs a global context pooling vector to dynamically regulate the contribution of each attention head, thereby focusing on critical sentiment-bearing tokens while suppressing noise. Simultaneously, the supervised contrastive learning branch enforces tighter intra-class compactness and larger inter-class separation in the embedding space. Extensive experiments on the IMDB dataset demonstrate that the proposed model achieves competitive performance with an accuracy of 94.67%, outperforming strong baselines by 1.5--2.5 percentage points. The framework is lightweight, efficient, and readily extensible to other text classification tasks.
https://arxiv.org/abs/2604.10459
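The head-gating mechanism can be sketched in numpy. The exact gate below (a softmax over a linear projection of the mean-pooled context) is an assumed form of the paper's module, with placeholder dimensions and random weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gate_heads(head_outputs, W_gate):
    """Scale each attention head by a weight derived from pooled context.

    head_outputs: (H, L, d) per-head outputs; W_gate: (d, H) projection.
    The softmax gate over a projection of the mean-pooled sequence is an
    assumed instantiation of the paper's dynamic adaptive attention.
    """
    context = head_outputs.mean(axis=(0, 1))       # global context pooling, (d,)
    g = softmax(context @ W_gate)                  # (H,): one weight per head
    return (g[:, None, None] * head_outputs).sum(axis=0), g

rng = np.random.default_rng(1)
heads = rng.standard_normal((12, 16, 64))          # 12 heads, seq len 16, dim 64
mixed, gates = gate_heads(heads, rng.standard_normal((64, 12)))
print(mixed.shape)                                 # (16, 64); gates sum to 1
```

Because the gate is computed from the pooled input itself, the weighting changes per example, which is what lets the module emphasize sentiment-bearing heads and suppress noisy ones.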
Existing Indonesian sentiment analysis models classify text in isolation, ignoring the topical context that often determines whether a statement is positive, negative, or neutral. We introduce IndoBERT-Sentiment, a context-conditioned sentiment classifier that takes both a topical context and a text as input, producing sentiment predictions grounded in the topic being discussed. Built on IndoBERT Large (335M parameters) and trained on 31,360 context-text pairs labeled across 188 topics, the model achieves an F1 macro of 0.856 and accuracy of 88.1%. In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points. We show that context-conditioning, previously demonstrated for relevancy classification, transfers effectively to sentiment analysis and enables the model to correctly classify texts that are systematically misclassified by context-free approaches.
https://arxiv.org/abs/2604.07057
This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then create NER graphs to explore the relationships between collective identities and places. We apply a mixed methods approach to analyse the named entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socioeconomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.
https://arxiv.org/abs/2603.25051
Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
https://arxiv.org/abs/2603.21094
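The Annotator Effort Proxy (AEP) as defined here, the proportion of labels revised after exposure to reasoning, is a one-liner over paired annotation passes. A sketch with hypothetical sentiment labels:

```python
def annotator_effort_proxy(first_pass, second_pass):
    """Fraction of instances whose label changed between the two passes."""
    assert len(first_pass) == len(second_pass) and first_pass
    changed = sum(a != b for a, b in zip(first_pass, second_pass))
    return changed / len(first_pass)

# Hypothetical labels before and after seeing model-generated reasoning
before = ["pos", "neg", "neu", "pos", "neg"]
after  = ["pos", "neg", "pos", "pos", "neg"]
print(annotator_effort_proxy(before, after))  # 0.2
```

A low AEP alongside rising inter-annotator agreement is the paper's signature pattern: reasoning nudges only the ambiguous cases rather than triggering wholesale relabeling.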
This study examines how different artificial intelligence architectures interpret sentiment in conflict-related media discourse, using the 2023 Gaza War as a case study. Drawing on a corpus of 10,990 Arabic news headlines (Eleraqi 2026), the research conducts a comparative analysis between three large language models and six fine-tuned Arabic BERT models. Rather than evaluating accuracy against a single human-annotated gold standard, the study adopts an epistemological approach that treats sentiment classification as an interpretive act produced by model architectures. To quantify systematic differences across models, the analysis employs information-theoretic and distributional metrics, including Shannon Entropy, Jensen-Shannon Distance, and a Variance Score measuring deviation from aggregate model behavior. The results reveal pronounced and non-random divergence in sentiment distributions. Fine-tuned BERT models, particularly MARBERT, exhibit a strong bias toward neutral classifications, while LLMs consistently amplify negative sentiment, with LLaMA-3.1-8B showing near-total collapse into negativity. Frame-conditioned analysis further demonstrates that GPT-4.1 adjusts sentiment judgments in line with narrative frames (e.g., humanitarian, legal, security), whereas other LLMs display limited contextual modulation. These findings suggest that the choice of model constitutes a choice of interpretive lens, shaping how conflict narratives are algorithmically framed and emotionally evaluated. The study contributes to media studies and computational social science by foregrounding algorithmic discrepancy as an object of analysis and by highlighting the risks of treating automated sentiment outputs as neutral or interchangeable measures of media tone in contexts of war and crisis.
https://arxiv.org/abs/2604.08566
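The divergence metrics used here reduce to a few lines over per-model sentiment distributions. A base-2 sketch (the example distributions are hypothetical stand-ins shaped like the reported behaviors, not the paper's numbers):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def js_distance(p, q):
    """Jensen-Shannon distance: sqrt of the base-2 JS divergence."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    div = shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))
    return np.sqrt(max(div, 0.0))

# Hypothetical (neg, neu, pos) distributions mimicking the reported tendencies
bert_like = [0.15, 0.70, 0.15]   # neutral-heavy, like the fine-tuned BERTs
llm_like  = [0.85, 0.10, 0.05]   # negativity-amplifying, like the LLMs
print(shannon_entropy(bert_like), js_distance(bert_like, llm_like))
```

In base 2 the JS distance is bounded by 1, so values near that bound, as between the two stand-in distributions above, indicate strongly divergent interpretive lenses.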
For millions of users in developing economies who depend on mobile banking as their primary gateway to financial services, app quality directly shapes financial access. The study analyzed 5,652 Google Play reviews in English and Bangla (filtered from 11,414 raw reviews) for four Bangladeshi government banking apps. The authors used a hybrid labeling approach that combined each review's star rating with an independent XLM-RoBERTa classifier, yielding moderate inter-method agreement (kappa = 0.459). Traditional models outperformed transformer-based ones: Random Forest produced the highest accuracy (0.815), while Linear SVM produced the highest weighted F1 score (0.804); both were higher than the performance of fine-tuned XLM-RoBERTa (0.793). McNemar's test confirmed that all classical models were significantly superior to the off-the-shelf XLM-RoBERTa (p < 0.05), while differences with the fine-tuned variant were not statistically significant. DeBERTa-v3 was applied to analyze sentiment at the aspect level across the reviews for the four apps; reviewers expressed dissatisfaction primarily with transaction speed and poor interface design, and the eJanata app received the worst ratings of all four apps. Three policy recommendations follow from these findings: remediation of app quality, trust-centred release management, and Bangla-first NLP adoption, to assist state-owned banks in improving their digital services through data-driven methods. Notably, a 16.1-percentage-point accuracy gap between Bangla and English text highlights the need for low-resource language model development.
https://arxiv.org/abs/2604.13057
Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds--crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.
https://arxiv.org/abs/2603.15981
The recent escalation of the Iran-Israel-USA conflict in 2026 has triggered widespread global discussions across social media platforms. As people increasingly use these platforms to express opinions, analyzing public sentiment from these discussions can provide valuable insights into global public perception. This study aims to analyze global public sentiment regarding the Iran-Israel-USA conflict by mining user-generated comments from YouTube news channels. The work contributes to public opinion analysis by introducing a privacy-preserving framework that combines topic-wise sentiment analysis with modern deep learning techniques and Federated Learning. To achieve this, approximately 19,000 YouTube comments were collected from major international news channels and preprocessed to remove noise and normalize text. Sentiment labels were initially generated using the VADER sentiment analyzer and later validated through manual inspection to improve reliability. Latent Dirichlet Allocation (LDA) was applied to identify key discussion topics related to the conflict. Several transformer-based models, including BERT, RoBERTa, XLNet, DistilBERT, ModernBERT, and ELECTRA, were fine-tuned for sentiment classification. The best-performing model was further integrated into a federated learning environment to enable distributed training while preserving user data privacy. Additionally, Explainable Artificial Intelligence (XAI) techniques using SHAP were applied to interpret model predictions and identify influential words affecting sentiment classification. Experimental results demonstrate that transformer models perform effectively, and among them, ELECTRA achieved the best performance with 91.32% accuracy. The federated learning setup also maintained strong performance while preserving privacy, achieving 89.59% accuracy in a two-client configuration.
https://arxiv.org/abs/2603.13655
Implicit Discourse Relation Recognition (IDRR) remains a challenging task due to the requirement for deep semantic understanding in the absence of explicit discourse markers. A further limitation is that existing methods only predict relations without providing any supporting explanations. Recent advances in large language models (LLMs) have shown strong reasoning capabilities in both deep language understanding and natural language explanation generation. In this work, we propose a simple yet effective approach to distill the reasoning capabilities of LLMs into lightweight IDRR models to improve both performance and interpretability. Specifically, we first prompt an LLM to generate explanations for each training instance conditioned on its gold label. Then, we introduce a novel classification-generation framework that jointly performs relation prediction and explanation generation, and train it with the additional supervision of LLM-generated explanations. Our framework is plug-and-play, enabling easy integration with most existing IDRR models. Experimental results on PDTB demonstrate that our approach significantly improves IDRR performance, while human evaluation further confirms that the generated explanations enhance model interpretability. Furthermore, we validate the generality of our approach on sentiment classification and natural language inference.
https://arxiv.org/abs/2602.21763
Customer-provided reviews have become an important source of information for business owners and other customers alike. However, effectively analyzing millions of unstructured reviews remains challenging. While large language models (LLMs) show promise for natural language understanding, their application to large-scale review analysis has been limited by computational costs and scalability concerns. This study proposes a hybrid approach that uses LLMs for aspect identification while employing classic machine-learning methods for sentiment classification at scale. Using ChatGPT to analyze sampled restaurant reviews, we identified key aspects of dining experiences and developed sentiment classifiers using human-labeled reviews, which we subsequently applied to 4.7 million reviews collected over 17 years from a major online platform. Regression analysis reveals that our machine-labeled aspects significantly explain variance in overall restaurant ratings across different aspects of dining experiences, cuisines, and geographical regions. Our findings demonstrate that combining LLMs with traditional machine learning approaches can effectively automate aspect-based sentiment analysis of large-scale customer feedback, suggesting a practical framework for both researchers and practitioners in the hospitality industry and potentially, other service sectors.
https://arxiv.org/abs/2602.21082
We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high-quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting-based baseline using the same model and instructions. Both methods are evaluated across three ABSA subtasks (Aspect Term Extraction (ATE), Aspect Sentiment Classification (ATSC), and Aspect Sentiment Pair Extraction (ASPE)), four SemEval datasets, and two encoder-decoder models: T5-Base and Tk-Instruct. Our results show that the agentic augmentation outperforms raw prompting in label preservation of the augmented data, especially when the tasks require aspect term generation. In addition, when combined with real data, agentic augmentation provides higher gains, consistently outperforming prompting-based generation. These benefits are most pronounced for T5-Base, while the more heavily pretrained Tk-Instruct exhibits smaller improvements. As a result, augmented data helps T5-Base achieve performance comparable to Tk-Instruct.
https://arxiv.org/abs/2602.16379
The rapid growth of the global poultry industry, driven by rising demand for affordable animal protein, has intensified public discourse surrounding production practices, housing, management, animal welfare, and supply-chain transparency. Social media platforms such as X (formerly Twitter) generate large volumes of unstructured textual data that capture stakeholder sentiment across the poultry industry. Extracting accurate sentiment signals from this domain-specific discourse remains challenging due to contextual ambiguity, linguistic variability, and limited domain awareness in general-purpose language models. This study presents PoultryLeX-Net, a lexicon-enhanced, domain-adaptive dual-stream transformer framework for fine-grained sentiment analysis in poultry-related text. The proposed architecture integrates sentiment classification, topic modeling, and contextual representation learning through domain-specific embeddings and gated cross-attention mechanisms. A lexicon-guided stream captures poultry-specific terminology and sentiment cues, while a contextual stream models long-range semantic dependencies. Latent Dirichlet Allocation is employed to identify dominant thematic structures associated with production management and welfare-related discussions, providing complementary interpretability to sentiment predictions. PoultryLeX-Net was evaluated against multiple baseline models, including convolutional neural networks and pre-trained transformer architectures such as DistilBERT and RoBERTa. PoultryLeX-Net consistently outperformed all baselines, achieving an accuracy of 97.35%, an F1 score of 96.67%, and an area under the receiver operating characteristic curve (AUC-ROC) of 99.61% across sentiment classification tasks. Overall, domain adaptation and dual-stream attention markedly improve sentiment classification, enabling scalable intelligence for poultry production decision support.
https://arxiv.org/abs/2603.09991
This study advances aspect-based sentiment analysis (ABSA) for Persian-language user reviews in the tourism domain, addressing challenges of low-resource languages. We propose a hybrid BERT-based model with Top-K routing and auxiliary losses to mitigate routing collapse and improve efficiency. The pipeline includes: (1) overall sentiment classification using BERT on 9,558 labeled reviews, (2) multi-label aspect extraction for six tourism-related aspects (host, price, location, amenities, cleanliness, connectivity), and (3) integrated ABSA with dynamic routing. The dataset consists of 58,473 preprocessed reviews from the Iranian accommodation platform Jabama, manually annotated for aspects and sentiments. The proposed model achieves a weighted F1-score of 90.6% for ABSA, outperforming baseline BERT (89.25%) and a standard hybrid approach (85.7%). Key efficiency gains include a 39% reduction in GPU power consumption compared to dense BERT, supporting sustainable AI deployment in alignment with UN SDGs 9 and 12. Analysis reveals high mention rates for cleanliness and amenities as critical aspects. This is the first ABSA study focused on Persian tourism reviews, and we release the annotated dataset to facilitate future multilingual NLP research in tourism.
https://arxiv.org/abs/2602.12778
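The Top-K routing and the anti-collapse auxiliary loss can be sketched compactly. The load-balancing penalty below (expert count times the sum of squared mean gate mass) is one standard form assumed for illustration; the paper's auxiliary losses may differ:

```python
import numpy as np

def topk_route(logits, k):
    """Keep the top-k router logits per token; softmax over the survivors."""
    idx = np.argsort(logits, axis=-1)[:, -k:]            # (T, k) expert ids
    gates = np.zeros_like(logits)
    top = np.take_along_axis(logits, idx, axis=-1)
    w = np.exp(top - top.max(axis=-1, keepdims=True))
    np.put_along_axis(gates, idx, w / w.sum(axis=-1, keepdims=True), axis=-1)
    return gates

def load_balance_loss(gates):
    """Penalize uneven expert usage; minimized (at 1.0) by uniform usage."""
    usage = gates.mean(axis=0)                           # (E,) mean gate mass
    return gates.shape[1] * (usage ** 2).sum()

rng = np.random.default_rng(0)
gates = topk_route(rng.standard_normal((6, 8)), k=2)     # 6 tokens, 8 experts
print(gates.sum(axis=1))                                 # each row sums to 1
print(load_balance_loss(gates))                          # > 1 unless usage is uniform
```

When routing collapses onto one expert, the loss grows toward the number of experts, so minimizing it alongside the task loss pushes tokens back toward balanced expert usage.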
Large language models (LLMs) are often ensembled to improve overall reliability and robustness, but in practice the models are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and the predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models using a Gaussian copula and show an information-theoretic error floor for the performance of the ensemble. Motivated by these, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We test our approach on two question-answering datasets and one binary sentiment classification dataset: MEDMCQA, MMLU, and IMDB movie reviews. Across all datasets, we observe that our method consistently outperforms strong baselines under the same query budget.
https://arxiv.org/abs/2602.08003
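The greedy selection step can be sketched with a plug-in mutual-information estimate over discrete predictions. This is a minimal illustration (joint predictions treated as one discrete variable, feasible only for small alphabets and ensembles; the paper's estimator may differ):

```python
from collections import Counter
from math import log2

def mutual_information(joint_preds, labels):
    """Plug-in estimate of I(label; predictions) from paired samples."""
    n = len(labels)
    pxy = Counter(zip(joint_preds, labels))
    px, py = Counter(joint_preds), Counter(labels)
    return sum(c / n * log2((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def greedy_select(model_preds, labels, budget):
    """Greedily add the model whose inclusion most raises MI with the label."""
    chosen = []
    for _ in range(budget):
        def gain(i):
            joint = tuple(zip(*(model_preds[j] for j in chosen + [i])))
            return mutual_information(joint, labels)
        remaining = [i for i in range(len(model_preds)) if i not in chosen]
        chosen.append(max(remaining, key=gain))
    return chosen

labels = [0, 1, 0, 1, 0, 1, 0, 1]
preds = [[0] * 8,                   # constant model: zero information
         [0, 1, 0, 1, 0, 1, 0, 1],  # perfect model
         [0, 1, 0, 1, 1, 0, 0, 1]]  # noisy model
print(greedy_select(preds, labels, budget=2))  # picks the perfect model first
```

Correlated models add little joint information beyond what the first selected model provides, which is exactly the saturation effect the Gaussian-copula analysis formalizes.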
Despite remarkable advances in natural language processing, developing effective systems for low-resource languages remains a formidable challenge, with performance typically lagging far behind high-resource counterparts due to data scarcity and insufficient linguistic resources. Cross-lingual knowledge transfer has emerged as a promising approach to address this challenge by leveraging resources from high-resource languages. In this paper, we investigate methods for transferring linguistic knowledge from high-resource languages to low-resource languages, where the number of labeled training instances is in the hundreds. We focus on sentence-level and word-level tasks. We introduce a novel method, GETR (Graph-Enhanced Token Representation), for cross-lingual knowledge transfer, along with two adopted baselines: (a) augmentation in hidden layers and (b) token embedding transfer through token translation. Experimental results demonstrate that our GNN-based approach significantly outperforms existing multilingual and cross-lingual baseline methods, achieving a 13-percentage-point improvement on truly low-resource languages (Mizo, Khasi) for POS tagging, and 20- and 27-percentage-point macro-F1 improvements on simulated low-resource languages (Marathi, Bangla, Malayalam) for sentiment classification and NER, respectively. We also present a detailed analysis of the transfer mechanisms and identify key factors that contribute to successful knowledge transfer in this linguistic context.
https://arxiv.org/abs/2602.05599