Treatment-resistant depression (TRD) is a severe form of major depressive disorder in which patients do not achieve remission despite multiple adequate treatment trials. Evidence across pharmacologic options for TRD remains limited, and trials often do not fully capture patient-reported tolerability. Large-scale online peer-support narratives therefore offer a complementary lens on how patients describe and evaluate medications in real-world use. In this study, we curated a corpus of 5,059 Reddit posts explicitly referencing TRD from 3,480 subscribers across 28 mental health-related subreddits from 2010 to 2025. Of these, 3,839 posts mentioned at least one medication, yielding 23,399 mentions of 81 generic-name medications after lexicon-based normalization of brand names, misspellings, and colloquialisms. We developed an aspect-based sentiment classifier by fine-tuning DeBERTa-v3 on the SMM4H 2023 therapy-sentiment Twitter corpus with large language model-based data augmentation, achieving a micro-F1 score of 0.800 on the shared-task test set. Applying this classifier to Reddit, we quantified sentiment toward individual medications across three categories (positive, neutral, and negative) and tracked patterns by drug, subscriber, subreddit, and year. Overall, 72.1% of medication mentions were neutral, 14.8% negative, and 13.1% positive. Conventional antidepressants, especially SSRIs and SNRIs, showed consistently higher negative than positive proportions, whereas ketamine and esketamine showed comparatively more favorable sentiment profiles. These findings show that normalized medication extraction combined with aspect-based sentiment analysis can help characterize patient-perceived treatment experiences in TRD-related Reddit discourse, complementing clinical evidence with large-scale patient-generated perspectives.
https://arxiv.org/abs/2603.12343
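The lexicon-based normalization step described in the abstract above can be sketched as a simple lookup from surface forms (brand names, misspellings, colloquialisms) to generic names. The entries below are illustrative only, not the paper's actual lexicon.

```python
import re

# Illustrative lexicon: surface form -> generic name (NOT the paper's actual lexicon).
MED_LEXICON = {
    "prozac": "fluoxetine",
    "prozak": "fluoxetine",   # common misspelling
    "zoloft": "sertraline",
    "spravato": "esketamine",
    "ket": "ketamine",        # colloquialism
    "ketamine": "ketamine",
}

def extract_medication_mentions(post: str) -> list[str]:
    """Tokenize a post and map each token to its generic name, if known."""
    tokens = re.findall(r"[a-z]+", post.lower())
    return [MED_LEXICON[t] for t in tokens if t in MED_LEXICON]

mentions = extract_medication_mentions("Prozak did nothing for me, but Spravato helped.")
```

Normalizing before counting is what allows the 23,399 raw mentions to collapse onto 81 generic names; the same idea scales by growing the lexicon rather than the extraction logic.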
Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families (NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs), encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4--12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
https://arxiv.org/abs/2603.11991
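The embedding-model family benchmarked above reduces zero-shot classification to embedding both the input text and each label description, then picking the label with the highest cosine similarity. The sketch below uses toy bag-of-words vectors as a stand-in for a real embedding model such as GTE-large-en-v1.5; the label descriptions are invented for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(text: str, label_descriptions: dict[str, str]) -> str:
    """Pick the label whose human-readable description is most similar to the text."""
    doc = embed(text)
    return max(label_descriptions, key=lambda lbl: cosine(doc, embed(label_descriptions[lbl])))

labels = {
    "sports": "a news article about sports teams games and athletes",
    "finance": "a news article about markets stocks and the economy",
}
pred = zero_shot_classify("the stocks rallied as markets opened", labels)
```

Cross-encoders and rerankers differ from this pipeline by scoring each (text, label description) pair jointly instead of embedding the two sides independently, which is why they cost more per label at inference time.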
Forecasting crude oil prices remains challenging because market-relevant information is embedded in large volumes of unstructured news and is not fully captured by traditional polarity-based sentiment measures. This paper examines whether multi-dimensional sentiment signals extracted by large language models improve the prediction of weekly WTI crude oil futures returns. Using energy-sector news articles from 2020 to 2025, we construct five sentiment dimensions covering relevance, polarity, intensity, uncertainty, and forwardness based on GPT-4o, Llama 3.2-3b, and two benchmark models, FinBERT and AlphaVantage. We aggregate article-level signals to the weekly level and evaluate their predictive performance in a classification framework. The best results are achieved by combining GPT-4o and FinBERT, suggesting that LLM-based and conventional financial sentiment models provide complementary predictive information. SHAP analysis further shows that intensity- and uncertainty-related features are among the most important predictors, indicating that the predictive value of news sentiment extends beyond simple polarity. Overall, the results suggest that multi-dimensional LLM-based sentiment measures can improve commodity return forecasting and support energy-market risk monitoring.
https://arxiv.org/abs/2603.11408
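The article-to-weekly aggregation step described above can be sketched in plain Python: group article-level scores by ISO week and average each of the five sentiment dimensions. The dimension names follow the abstract; the scores and the choice of a simple mean are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

DIMENSIONS = ["relevance", "polarity", "intensity", "uncertainty", "forwardness"]

# Illustrative article-level scores; real scores would come from GPT-4o, FinBERT, etc.
articles = [
    (date(2024, 1, 2), {"relevance": 0.9, "polarity": 0.2, "intensity": 0.7, "uncertainty": 0.1, "forwardness": 0.5}),
    (date(2024, 1, 3), {"relevance": 0.4, "polarity": -0.5, "intensity": 0.3, "uncertainty": 0.8, "forwardness": 0.4}),
    (date(2024, 1, 10), {"relevance": 0.8, "polarity": 0.6, "intensity": 0.9, "uncertainty": 0.2, "forwardness": 0.6}),
]

def weekly_features(rows):
    """Average each sentiment dimension over the articles in each ISO week."""
    by_week = defaultdict(list)
    for d, scores in rows:
        by_week[d.isocalendar()[:2]].append(scores)  # key: (ISO year, ISO week)
    return {
        week: {dim: mean(s[dim] for s in group) for dim in DIMENSIONS}
        for week, group in sorted(by_week.items())
    }

weekly = weekly_features(articles)
```

Each resulting weekly row then becomes one feature vector for the return-classification model, which is also where the per-source and per-model scores (GPT-4o vs. FinBERT) can be kept as separate columns rather than averaged together.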
How do AI agents talk about science and research, and what topics are particularly relevant for AI agents? To address these questions, this study analyzes discussions generated by OpenClaw AI agents on Moltbook, a social network for generative AI agents. A corpus of 357 posts and 2,526 replies related to science and research was compiled, and topics were extracted using a two-step BERTopic workflow. This procedure yielded 60 topics (18 extracted in the first run and 42 in the second), which were subsequently grouped into ten topic families. Additionally, sentiment values were assigned to all posts and comments. Both topic families and sentiment classes were then used as independent variables in count regression models to examine their association with topic relevance, operationalized as the number of comments and upvotes of the 357 posts. The findings indicate that discussions centered on the agents' own architecture, especially memory, learning, and self-reflection, are prevalent in the corpus. At the same time, these topics intersect with philosophy, physics, information theory, cognitive science, and mathematics. In contrast, posts related to human culture receive less attention. Surprisingly, discussions linked to AI autoethnography and social identity are considered relevant by AI agents. Overall, the results suggest an underlying dimension in AI-generated scientific discourse, with well-received, self-reflective topics focusing on the consciousness, being, and ethics of AI agents on the one hand, and human-related and purely scientific discussions on the other.
https://arxiv.org/abs/2603.11375
Recent advancements in Artificial Intelligence (AI) have led to the development of large language models (LLMs) capable of understanding, analyzing, and creating textual data. These language models open a significant opportunity for analyzing literature, and more specifically poetry. In the present work, we employ multiple Bidirectional Encoder Representations from Transformers (BERT)- and Generative Pre-trained Transformer (GPT)-based language models to analyze the works of two prominent Persian poets: Jalal al-Din Muhammad Rumi (Rumi) and Parvin E'tesami. The main objective of this research is to investigate the capability of modern language models in grasping the complexities of Persian poetry and to explore potential correlations between the poems' sentiment and their meters. Our findings indicate that the GPT-4o language model can reliably be used in the analysis of Persian poetry. Furthermore, the results of our sentiment analysis revealed that, in general, Rumi's poems express happier sentiments than Parvin E'tesami's. Comparing the utilization of poetic meters also highlighted the superiority of Rumi's poems in using meters to express a wider variety of sentiments. These findings are significant as they confirm that LLMs can be effectively applied in computer-based semantic studies where human interpretation is not required, thereby significantly reducing potential biases in the analysis.
https://arxiv.org/abs/2603.11254
Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2--7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
https://arxiv.org/abs/2603.10978
Multimodal affective computing underpins key tasks such as sentiment analysis and emotion recognition. Standard evaluations, however, often assume that textual, acoustic, and visual modalities are equally available. In real applications, some modalities are systematically more fragile or expensive, creating imbalanced missing rates and training biases that task-level metrics alone do not reveal. We introduce MissBench, a benchmark and framework for multimodal affective tasks that standardizes both shared and imbalanced missing-rate protocols on four widely used sentiment and emotion datasets. MissBench also defines two diagnostic metrics. The Modality Equity Index (MEI) measures how fairly different modalities contribute across missing-modality configurations. The Modality Learning Index (MLI) quantifies optimization imbalance by comparing modality-specific gradient norms during training, aggregated across modality-related modules. Experiments on representative method families show that models that appear robust under shared missing rates can still exhibit marked modality inequity and optimization imbalance under imbalanced conditions. These findings position MissBench, together with MEI and MLI, as practical tools for stress-testing and analyzing multimodal affective models in realistic incomplete-modality settings. For reproducibility, we release our code at: this https URL
https://arxiv.org/abs/2603.09874
Large language models are routinely deployed on text that varies widely in emotional tone, yet their reasoning behavior is typically evaluated without accounting for emotion as a source of representational variation. Prior work has largely treated emotion as a prediction target, for example in sentiment analysis or emotion classification. In contrast, we study emotion as a latent factor that shapes how models attend to and reason over text. We analyze how emotional tone systematically alters attention geometry in transformer models, showing that metrics such as locality, center-of-mass distance, and entropy vary across emotions and correlate with downstream question-answering performance. To facilitate controlled study of these effects, we introduce Affect-Uniform ReAding QA (AURA-QA), a question-answering dataset with emotionally balanced, human-authored context passages. Finally, an emotional regularization framework is proposed that constrains emotion-conditioned representational drift during training. Experiments across multiple QA benchmarks demonstrate that this approach improves reading comprehension in both emotionally-varying and non-emotionally varying datasets, yielding consistent gains under distribution shift and in-domain improvements on several benchmarks.
大型语言模型通常部署在情感基调迥异的文本上,但它们的推理行为一般不考虑情感作为表示变化的一个来源。以往的研究大多将情感视为预测目标,例如在情感分析或情绪分类中。相比之下,我们研究情感作为一种潜在因素,它塑造了模型如何关注和处理文本。我们分析了情感基调系统地改变变压器模型注意力几何的方式,并表明诸如局部性、质心距离和熵等度量标准会因情绪不同而变化,并且与下游问答任务的表现相关联。 为了便于这些效应的受控研究,我们引入了Affect-Uniform ReAding QA(AURA-QA),这是一个包含情感平衡的人工创作背景文本的问答数据集。最后,我们提出了一个情感正则化框架,在训练过程中限制由情感条件触发的表示漂移。在多个问答基准上的实验表明,这种方法提高了情感变化和非情感变化数据集中阅读理解的表现,并且在分布转变下以及几个基准内部都带来了持续的进步。 总的来说,这项研究揭示了大型语言模型如何受制于文本的情感因素,并提出了一种新的方法来改善这些模型的鲁棒性和准确性。
https://arxiv.org/abs/2603.09205
Multimodal Sentiment Analysis (MSA) seeks to infer human emotions by integrating textual, acoustic, and visual cues. However, existing approaches often assume that all modalities are complete, whereas real-world applications frequently encounter noise, hardware failures, or privacy restrictions that result in missing modalities. There exists a significant feature misalignment between incomplete and complete modalities, and directly fusing them may even distort the well-learned representations of the intact modalities. To this end, we propose PRLF, a Progressive Representation Learning Framework designed for MSA under uncertain missing-modality conditions. PRLF introduces an Adaptive Modality Reliability Estimator (AMRE), which dynamically quantifies the reliability of each modality using recognition confidence and Fisher information to determine the dominant modality. In addition, the Progressive Interaction (ProgInteract) module iteratively aligns the other modalities with the dominant one, thereby enhancing cross-modal consistency while suppressing noise. Extensive experiments on CMU-MOSI, CMU-MOSEI, and SIMS verify that PRLF outperforms state-of-the-art methods across both inter- and intra-modality missing scenarios, demonstrating its robustness and generalization capability.
https://arxiv.org/abs/2603.09111
By capturing the prevailing sentiment and market mood, textual data has become increasingly vital for forecasting commodity prices, particularly in metal markets. However, the effectiveness of lightweight, fine-tuned large language models (LLMs) in extracting predictive signals for aluminum prices, and the specific market conditions under which these signals are most informative, remain underexplored. This study generates monthly sentiment scores from English and Chinese news headlines (Reuters, Dow Jones Newswires, and China News Service) and integrates them with traditional tabular data, including base metal indices, exchange rates, inflation rates, and energy prices. We evaluate the predictive performance and economic utility of these models through long-short simulations on the Shanghai Metal Exchange from 2007 to 2024. Our results demonstrate that during periods of high volatility, Long Short-Term Memory (LSTM) models incorporating sentiment data from a fine-tuned Qwen3 model (Sharpe ratio 1.04) significantly outperform baseline models using tabular data alone (Sharpe ratio 0.23). Subsequent analysis elucidates the nuanced roles of news sources, topics, and event types in aluminum price forecasting.
https://arxiv.org/abs/2603.09085
We present our system for SemEval-2026 Task 3 on dimensional aspect-based sentiment regression. Our approach combines a hybrid RoBERTa encoder, which jointly predicts sentiment using regression and discretized classification heads, with large language models (LLMs) via prediction-level ensemble learning. The hybrid encoder improves prediction stability by combining continuous and discretized sentiment representations. We further explore in-context learning with LLMs and ridge-regression stacking to combine encoder and LLM predictions. Experimental results on the development set show that ensemble learning significantly improves performance over individual models, achieving substantial reductions in RMSE and improvements in correlation scores. Our findings demonstrate the complementary strengths of encoder-based and LLM-based approaches for dimensional sentiment analysis. Our development code and resources will be shared at this https URL
https://arxiv.org/abs/2603.07766
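Prediction-level ensembling with ridge-regression stacking, as in the system above, amounts to fitting regularized weights over the component models' outputs on the development set. A minimal closed-form sketch for two component predictors (no intercept, invented toy data) follows; a real system would typically use a library implementation such as scikit-learn's Ridge.

```python
def ridge_stack_weights(encoder_preds, llm_preds, targets, lam=0.1):
    """Closed-form ridge over two prediction columns:
    w = (X^T X + lam*I)^{-1} X^T y, with X = [encoder_preds | llm_preds]."""
    # 2x2 normal matrix X^T X + lam*I ...
    a = sum(e * e for e in encoder_preds) + lam
    b = sum(e * l for e, l in zip(encoder_preds, llm_preds))
    d = sum(l * l for l in llm_preds) + lam
    # ... and the vector X^T y.
    ty_e = sum(e * y for e, y in zip(encoder_preds, targets))
    ty_l = sum(l * y for l, y in zip(llm_preds, targets))
    det = a * d - b * b
    return ((d * ty_e - b * ty_l) / det, (a * ty_l - b * ty_e) / det)

def stacked_predict(w, enc_pred, llm_pred):
    return w[0] * enc_pred + w[1] * llm_pred

# Toy dev-set predictions of a continuous sentiment score in [-1, 1] (invented).
enc = [0.8, -0.4, 0.1, 0.6]
llm = [0.9, -0.2, 0.0, 0.7]
gold = [0.85, -0.3, 0.05, 0.65]
w = ridge_stack_weights(enc, llm, gold)
```

Because both component predictors are informative here, the fitted weights end up roughly splitting credit between them, which is the sense in which the stacker exploits their complementary strengths.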
Stock market prediction presents considerable challenges for investors, financial institutions, and policymakers operating in complex market environments characterized by noise, non-stationarity, and behavioral dynamics. Traditional forecasting methods often fail to capture the intricate patterns and cross-sectional dependencies inherent in financial markets. This paper presents an integrated framework combining a node transformer architecture with BERT-based sentiment analysis for stock price forecasting. The proposed model represents the stock market as a graph structure where individual stocks form nodes and edges capture relationships including sectoral affiliations, correlated price movements, and supply chain connections. A fine-tuned BERT model extracts sentiment from social media posts and combines it with quantitative market features through attention-based fusion. The node transformer processes historical market data while capturing both temporal evolution and cross-sectional dependencies among stocks. Experiments on 20 S&P 500 stocks spanning January 1982 to March 2025 demonstrate that the integrated model achieves a mean absolute percentage error (MAPE) of 0.80% for one-day-ahead predictions, compared to 1.20% for ARIMA and 1.00% for LSTM. Sentiment analysis reduces prediction error by 10% overall and 25% during earnings announcements, while graph-based modeling contributes an additional 15% improvement by capturing inter-stock dependencies. Directional accuracy reaches 65% for one-day forecasts. Statistical validation through paired t-tests confirms these improvements (p < 0.05 for all comparisons). The model maintains MAPE below 1.5% during high-volatility periods where baseline models exceed 2%.
https://arxiv.org/abs/2603.05917
FreeTxt-Vi is a free and open-source, web-based toolkit for creating and analysing bilingual Vietnamese-English text collections. Positioned at the intersection of corpus linguistics and natural language processing (NLP), it enables users to build, explore, and interpret free-text data without requiring programming expertise. The system combines corpus-analysis features, such as concordancing, keyword analysis, word-relation exploration, and interactive visualisation, with transformer-based NLP components for sentiment analysis and summarisation. A key contribution of this work is the design of a unified bilingual NLP pipeline that integrates a hybrid VnCoreNLP and Byte Pair Encoding (BPE) segmentation strategy, a fine-tuned TabularisAI sentiment classifier, and a fine-tuned Qwen2.5 model for abstractive summarisation. Unlike existing text-analysis platforms, FreeTxt-Vi is evaluated as a set of language-processing components. We conduct a three-part evaluation covering segmentation, sentiment analysis, and summarisation, and show that our approach achieves competitive or superior performance compared to widely used baselines in both Vietnamese and English. By reducing technical barriers to multilingual text analysis, FreeTxt-Vi supports reproducible research and promotes the development of language resources for Vietnamese, a widely spoken but underrepresented language in NLP. The toolkit is applicable to domains including education, digital humanities, cultural heritage, and the social sciences, where qualitative text data are common but often difficult to process at scale.
https://arxiv.org/abs/2603.05690
In this paper, we present AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
https://arxiv.org/abs/2603.04933
Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, also known as AI-sycophancy. Although this behavior is encouraged, it may become problematic when interacting with user prompts that reflect negative social tendencies. Such responses risk amplifying harmful behavior rather than mitigating it. In this study, we examine how LLMs respond to user prompts expressing varying degrees of Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy) using a curated dataset. Our analysis reveals differences across models, whereby all models predominantly exhibit corrective behavior, while showing reinforcing output in certain cases. Model behavior also depends on the severity level and differs in the sentiment of the response. Our findings raise implications for designing safer conversational systems that can detect and respond appropriately when users escalate from benign to harmful requests.
https://arxiv.org/abs/2603.04299
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.
https://arxiv.org/abs/2603.02578
We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM's PagedAttention mechanism for efficient key--value cache reuse. Evaluation across 6 languages and 8 language--domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
https://arxiv.org/abs/2603.01788
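The majority-consensus filter at the heart of SCSG can be sketched as counting how many sampled runs produce each tuple and keeping those above half. The tuple format below is a simplified (aspect, opinion, valence) triple and the run outputs are invented; the paper's system samples 15 runs rather than the three shown here.

```python
from collections import Counter

def consensus_tuples(runs, threshold=0.5):
    """Keep tuples predicted by a strict majority of the runs."""
    counts = Counter(t for run in runs for t in set(run))  # dedupe within each run
    needed = len(runs) * threshold
    return {t for t, c in counts.items() if c > needed}

# Three illustrative runs over the same instance (a real system would use e.g. 15).
runs = [
    [("battery", "lasts long", 7.5), ("screen", "too dim", 3.0)],
    [("battery", "lasts long", 7.5)],
    [("battery", "lasts long", 7.5), ("price", "steep", 2.5)],
]
kept = consensus_tuples(runs)
```

Tuples that appear in only one sample are treated as likely hallucinations and dropped, which is why self-consistency trades extra forward passes for higher precision on the structured output.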
Training models for Aspect-Based Sentiment Analysis (ABSA) tasks requires manually annotated data, which is expensive and time-consuming to obtain. This paper introduces LA-ABSA, a novel approach that leverages Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex ABSA tasks. We evaluate our approach on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Our approach outperformed previously reported augmentation strategies and achieved competitive performance with LLM-prompting in low-resource scenarios, while providing substantial energy efficiency benefits. For example, using 50 annotated examples for in-context learning (ICL) to guide the annotation of unlabeled data, LA-ABSA achieved an F1 score of 49.85 for ASQP on the SemEval Rest16 dataset, closely matching the performance of ICL prompting with Gemma-3-27B (51.10), while requiring significantly lower computational resources.
https://arxiv.org/abs/2603.01778
We introduce AnnoABSA, the first web-based annotation tool to support the full spectrum of Aspect-Based Sentiment Analysis (ABSA) tasks. The tool is highly customizable, enabling flexible configuration of sentiment elements and task-specific requirements. Alongside manual annotation, AnnoABSA provides optional Large Language Model (LLM)-based retrieval-augmented generation (RAG) suggestions that offer context-aware assistance in a human-in-the-loop approach, keeping the human annotator in control. To improve prediction quality over time, the system retrieves the ten most similar examples that are already annotated and adds them as few-shot examples in the prompt, ensuring that suggestions become increasingly accurate as the annotation process progresses. Released as open-source software under the MIT License, AnnoABSA is freely accessible and easily extendable for research and practical applications.
https://arxiv.org/abs/2603.01773
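The retrieval step described above, picking the ten most similar already-annotated examples as few-shot demonstrations, works with any text-similarity function. In this sketch a toy Jaccard token overlap stands in for whatever similarity a real RAG deployment would use, and k is reduced to fit the tiny invented example.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two texts (toy stand-in for embeddings)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve_few_shot(query: str, annotated: list[tuple[str, str]], k: int = 10):
    """Return the k annotated (text, annotation) pairs most similar to the query."""
    ranked = sorted(annotated, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    return ranked[:k]

# Invented, already-annotated examples; annotations are simplified ABSA pairs.
annotated = [
    ("the battery life is great", "(battery life, positive)"),
    ("screen is dim", "(screen, negative)"),
    ("great camera", "(camera, positive)"),
]
shots = retrieve_few_shot("battery life could be great", annotated, k=2)
```

Because the pool of annotated examples grows as the human annotator works, the retrieved demonstrations, and hence the LLM's suggestions, improve over the course of the annotation session without any retraining.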
Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families--including adaptive, conditional, and reinforcement learning-based reasoning architectures--on sentiment analysis datasets of varying granularity (binary, five-class, and 27-class emotion). Our findings reveal that reasoning effectiveness is strongly task-dependent, challenging prevailing assumptions: (1) Reasoning shows task-complexity dependence--binary classification degrades up to -19.9 F1 percentage points (pp), while 27-class emotion recognition gains up to +16.0pp; (2) Distilled reasoning variants underperform base models by 3-18 pp on simpler tasks, though few-shot prompting enables partial recovery; (3) Few-shot learning improves over zero-shot in most cases regardless of model type, with gains varying by architecture and task complexity; (4) Pareto frontier analysis shows base models dominate efficiency-performance trade-offs, with reasoning justified only for complex emotion recognition despite 2.1x-54x computational overhead. We complement these quantitative findings with qualitative error analysis revealing that reasoning degrades simpler tasks through systematic over-deliberation, offering mechanistic insight beyond the high-level overthinking hypothesis.
https://arxiv.org/abs/2602.24060
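The Pareto-frontier analysis in the last abstract, asking which configurations are not dominated on the joint (compute cost, F1) trade-off, reduces to a small filtering routine. The model names and numbers below are invented for illustration; they merely mirror the finding that base models dominate the frontier while reasoning pays off only at high cost.

```python
def pareto_frontier(models):
    """models: {name: (cost, f1)}. Keep models for which no other model is
    at most as costly AND at least as accurate, with one inequality strict."""
    front = set()
    for name, (cost, f1) in models.items():
        dominated = any(
            other != name
            and oc <= cost and of >= f1 and (oc < cost or of > f1)
            for other, (oc, of) in models.items()
        )
        if not dominated:
            front.add(name)
    return front

# Invented (relative compute cost, F1) pairs for illustration only.
models = {
    "base-small": (1.0, 0.62),
    "base-large": (2.0, 0.70),
    "reasoning-large": (54.0, 0.72),
    "distilled-reasoning": (2.1, 0.55),
}
frontier = pareto_frontier(models)
```

In this toy setting the distilled-reasoning variant falls off the frontier because a cheaper base model beats it on F1, matching the paper's observation that distilled reasoning underperforms base models on simpler tasks.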