Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.
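The first- and second-order update rules described above can be sketched as a toy scalar simulation (valence only; the gains `alpha` and `beta` are illustrative assumptions, not the paper's values):

```python
import numpy as np

def first_order_update(state, signal, alpha=0.3):
    """Exponential smoothing: the state relaxes toward the instantaneous signal."""
    return (1 - alpha) * state + alpha * signal

def second_order_update(state, velocity, signal, alpha=0.3, beta=0.7):
    """Momentum dynamics: velocity accumulates the pull toward the signal,
    producing affective inertia (and overshoot) that grow with beta."""
    velocity = beta * velocity + alpha * (signal - state)
    return state + velocity, velocity

# Toy 25-turn protocol: a single negative valence impulse at turn 5.
signals = np.zeros(25)
signals[5] = -1.0

s1 = 0.0
s2, v = 0.0, 0.0
traj1, traj2 = [], []
for sig in signals:
    s1 = first_order_update(s1, sig)
    s2, v = second_order_update(s2, v, sig)
    traj1.append(s1)
    traj2.append(s2)
```

Under these toy gains the second-order trajectory bottoms out later and deeper than the first-order one, illustrating the inertia and hysteresis the abstract describes, while both recover toward baseline unlike a stateless estimator.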

https://arxiv.org/abs/2601.16087
The use of transfer learning and Transformers has steadily improved accuracy and has contributed significantly to solving complex computational problems. However, this transformer-led accuracy improvement in applied AI analytics, specifically in sentiment analytics, comes with a dark side. Our experiments show that many of these transformer-led accuracy gains for one sentiment class come at the cost of polarizing another sentiment class and losing neutrality. This lack of neutrality poses an acute problem in the applied NLP space, which relies heavily on the computational outputs of sentiment analytics for reliable, industry-ready tasks.
https://arxiv.org/abs/2601.15509
Misinformation and fake news have become a pressing societal challenge, driving the need for reliable automated detection methods. Prior research has highlighted sentiment as an important signal in fake news detection, either by analyzing which sentiments are associated with fake news or by using sentiment and emotion features for classification. However, this poses a vulnerability, since adversaries can manipulate sentiment to evade detectors, especially with the advent of large language models (LLMs). A few studies have explored adversarial samples generated by LLMs, but they mainly focus on stylistic features such as the writing style of news publishers. Thus, the crucial vulnerability of sentiment manipulation remains largely unexplored. In this paper, we investigate the robustness of state-of-the-art fake news detectors under sentiment manipulation. We introduce AdSent, a sentiment-robust detection framework designed to ensure consistent veracity predictions across both original and sentiment-altered news articles. Specifically, we (1) propose controlled sentiment-based adversarial attacks using LLMs, and (2) analyze the impact of sentiment shifts on detection performance. We show that changing the sentiment heavily impacts the performance of fake news detection models, indicating a bias toward classifying neutral articles as real and non-neutral articles as fake. (3) We introduce a novel sentiment-agnostic training strategy that enhances robustness against such perturbations. Extensive experiments on three benchmark datasets demonstrate that AdSent significantly outperforms competitive baselines in both accuracy and robustness, while also generalizing effectively to unseen datasets and adversarial scenarios.
https://arxiv.org/abs/2601.15277
Human cognition exhibits strong circadian modulation, yet its influence on high-dimensional semantic behavior remains poorly understood. Using large-scale Reddit data, we quantify time-of-day variation in language use by embedding text into a pretrained transformer model and measuring semantic entropy as an index of linguistic exploration-exploitation, for which we show a robust circadian rhythmicity that could be entrained by seasonal light cues. Distinguishing between local and global semantic entropy reveals a systematic temporal dissociation: local semantic exploration peaks in the morning, reflecting broader exploration of semantic space, whereas global semantic diversity peaks later in the day as submissions accumulate around already established topics, consistent with "rich-get-richer" dynamics. These patterns are not explained by sentiment or affective valence, indicating that semantic exploration captures a cognitive dimension distinct from mood. The observed temporal structure aligns with known diurnal patterns in neuromodulatory systems, suggesting that biological circadian rhythms extend to the semantic domain.
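Semantic entropy in this exploration-exploitation sense can be illustrated with a toy calculation over topic assignments; the topic labels and the morning/evening split below are invented for illustration, not drawn from the study's data:

```python
import math
from collections import Counter

def semantic_entropy(topic_labels):
    """Shannon entropy (bits) of the topic distribution for one time-of-day bin.
    Higher entropy = posts spread across many topics (exploration);
    lower entropy = posts concentrated on established topics (exploitation)."""
    counts = Counter(topic_labels)
    n = len(topic_labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Morning: posts spread evenly over four topics (broad exploration).
morning = ["a", "b", "c", "d"] * 5
# Evening: posts pile onto one established topic (rich-get-richer dynamics).
evening = ["a"] * 17 + ["b", "c", "d"]

h_morning = semantic_entropy(morning)
h_evening = semantic_entropy(evening)
```

The even morning distribution yields maximal entropy (2 bits over 4 topics), while the concentrated evening distribution yields a lower value, mirroring the diurnal contrast the abstract reports.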
https://arxiv.org/abs/2601.15091
Qualitative research often contains personal, contextual, and organizational details that pose privacy risks if not handled appropriately. Manual anonymization is time-consuming, inconsistent, and frequently omits critical identifiers. Existing automated tools tend to rely on pattern matching or fixed rules, which fail to capture context and may alter the meaning of the data. This study uses local LLMs to build a reliable, repeatable, and context-aware anonymization process for detecting and anonymizing sensitive data in qualitative transcripts. We introduce a Structured Framework for Adaptive Anonymizer (SFAA) that includes three steps: detection, classification, and adaptive anonymization. The SFAA incorporates four anonymization strategies: rule-based substitution, context-aware rewriting, generalization, and suppression. These strategies are applied based on the identifier type and the risk level. The identifiers handled by the SFAA are guided by major international privacy and research ethics standards, including the GDPR, HIPAA, and OECD guidelines. This study followed a dual-method evaluation that combined manual and LLM-assisted processing. Two case studies were used to support the evaluation. The first includes 82 face-to-face interviews on gamification in organizations. The second involves 93 machine-led interviews using an AI-powered interviewer to test LLM awareness and workplace privacy. Two local models, LLaMA and Phi, were used to evaluate the performance of the proposed framework. The results indicate that the LLMs found more sensitive data than a human reviewer. Phi outperformed LLaMA in finding sensitive data but made slightly more errors. Phi found over 91% of the sensitive data, and 94.8% of its anonymized output kept the same sentiment as the original text, indicating accuracy high enough not to affect the analysis of the qualitative data.
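The type- and risk-conditioned strategy dispatch can be sketched as below. The dispatch rules, placeholder strings, and the `anonymize` signature are illustrative assumptions, and the context-aware rewriting step (a local-LLM call in SFAA) is stubbed out:

```python
import re

def anonymize(text, identifier, id_type, risk):
    """Choose one of the four SFAA strategies from the identifier type and
    risk level (the rules below are illustrative, not the paper's policy)."""
    if id_type == "pattern":          # emails, phone numbers, record IDs
        # Rule-based substitution.
        return re.sub(re.escape(identifier), "[REDACTED]", text)
    if risk == "high":
        # Suppression: remove the identifier entirely.
        return text.replace(identifier, "[OMITTED]")
    if id_type == "location":
        # Generalization: widen to a less specific category.
        return text.replace(identifier, "a large city")
    # Context-aware rewriting would call a local LLM; stubbed as a neutral phrase.
    return text.replace(identifier, "a colleague")
```

A usage pass would run detection and classification first, then feed each detected identifier through this dispatcher.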
https://arxiv.org/abs/2601.14683
Repeated exposure to violence and abusive content in music and song lyrics can influence listeners' emotions and behaviours, potentially normalising aggression or reinforcing harmful stereotypes. In this study, we explore the use of generative artificial intelligence (GenAI) and Large Language Models (LLMs) to automatically transform abusive words (vocal delivery) and lyrical content in popular music. Rather than simply muting or replacing a single word, our approach transforms the tone, intensity, and sentiment, thus altering not just the lyrics but how they are expressed. We present a comparative analysis of four selected English songs and their transformed counterparts, evaluating changes through both acoustic and sentiment-based lenses. Our findings indicate that GenAI significantly reduces vocal aggressiveness, with acoustic analysis showing improvements in Harmonic-to-Noise Ratio, Cepstral Peak Prominence, and Shimmer. Sentiment analysis showed aggression reduced by 63.3-85.6% across artists, with major improvements in chorus sections (up to an 88.6% reduction). The transformed versions maintained musical coherence while mitigating harmful content, offering a promising alternative to traditional content moderation that avoids triggering the "forbidden fruit" effect, where censored content becomes more appealing simply because it is restricted. This approach demonstrates the potential for GenAI to create safer listening experiences while preserving artistic expression.
https://arxiv.org/abs/2601.15348
Multimodal Sentiment Analysis integrates linguistic, visual, and acoustic modalities. Mainstream approaches, based on modality-invariant and modality-specific factorization or on complex fusion, still rely on spatiotemporally mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA (Temporal-Spatial Decouple before Act), which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial representations. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor-specific supervision and decorrelation regularization reduce cross-factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for the downstream task. Extensive experiments show that TSDA outperforms baselines, and ablation studies confirm the necessity and interpretability of the design.
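The "align temporal with temporal, spatial with spatial" constraint can be sketched as a cosine alignment loss over the decoupled features; the feature layout (`feats[modality] = (temporal_vec, spatial_vec)`) and the loss form are assumptions for illustration:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def factor_consistent_alignment(feats):
    """Pull each modality's temporal features toward the others' temporal
    features, and spatial toward spatial; cross-factor pairs are excluded."""
    mods = list(feats)
    loss = 0.0
    for i in range(len(mods)):
        for j in range(i + 1, len(mods)):
            t_i, s_i = feats[mods[i]]
            t_j, s_j = feats[mods[j]]
            loss += (1 - cos(t_i, t_j)) + (1 - cos(s_i, s_j))  # same-factor only
    return loss

t_vec, s_vec = np.array([1.0, 0.0]), np.array([0.0, 1.0])
aligned = {"linguistic": (t_vec, s_vec), "visual": (t_vec, s_vec), "acoustic": (t_vec, s_vec)}
loss_aligned = factor_consistent_alignment(aligned)
swapped = {"linguistic": (t_vec, s_vec), "visual": (s_vec, t_vec)}
loss_swapped = factor_consistent_alignment(swapped)
```

Perfectly factor-consistent features incur zero loss, while swapping a modality's temporal and spatial streams is penalized, which is the asymmetry the alignment stage is meant to remove.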
https://arxiv.org/abs/2601.13659
Production LLM systems often rely on separate models for safety and other classification-heavy steps, increasing latency, VRAM footprint, and operational complexity. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation. We frame classification as representation selection over the full token-layer hidden-state tensor, rather than committing to a fixed token or fixed layer (e.g., first-token logits or final-layer pooling). To implement this, we introduce a two-stage aggregator that (i) summarizes tokens within each layer and (ii) aggregates across layer summaries to form a single representation for classification. We instantiate this template with direct pooling, a 100K-parameter scoring-attention gate, and a downcast multi-head self-attention (MHA) probe with up to 35M trainable parameters. Across safety and sentiment benchmarks our probes improve over logit-only reuse (e.g., MULI) and are competitive with substantially larger task-specific baselines, while preserving near-serving latency and avoiding the VRAM and latency costs of a separate guard-model pipeline.
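A minimal sketch of the two-stage aggregator, with mean pooling for stage (i) and a softmax-weighted layer combination for stage (ii); the pooling choices and tensor shapes are illustrative assumptions, not the paper's exact probes:

```python
import numpy as np

def two_stage_aggregate(hidden, layer_logits=None):
    """Aggregate a (layers, tokens, d) hidden-state tensor into one vector:
    (i) mean-pool tokens within each layer, then
    (ii) combine the layer summaries with softmax weights."""
    L, T, d = hidden.shape
    layer_summaries = hidden.mean(axis=1)            # stage (i): (L, d)
    if layer_logits is None:
        layer_logits = np.zeros(L)                   # uniform weights by default
    w = np.exp(layer_logits - layer_logits.max())
    w /= w.sum()
    return (w[:, None] * layer_summaries).sum(axis=0)  # stage (ii): (d,)

rng = np.random.default_rng(0)
H = rng.normal(size=(12, 8, 16))   # 12 layers, 8 tokens, hidden size 16
rep = two_stage_aggregate(H)
# A lightweight linear probe then maps `rep` to class logits
# in the same forward pass used for generation.
```

With uniform layer weights this reduces to global mean pooling over tokens and layers; learned `layer_logits` (or the scoring-attention gate the abstract mentions) let the probe select where in the token-layer tensor the signal lives.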
https://arxiv.org/abs/2601.13288
Memes are a dominant medium for online communication and manipulation because meaning emerges from interactions between embedded text, imagery, and cultural context. Existing meme research is distributed across tasks (hate, misogyny, propaganda, sentiment, humour) and languages, which limits cross-domain generalization. To address this gap, we propose MemeLens, a unified multilingual and multitask explanation-enhanced Vision Language Model (VLM) for meme understanding. We consolidate 38 public meme datasets, filtering and mapping dataset-specific labels into a shared taxonomy of 20 tasks spanning harm, targets, figurative/pragmatic intent, and affect. We present a comprehensive empirical analysis across modeling paradigms, task categories, and datasets. Our findings suggest that robust meme understanding requires multimodal training, exhibits substantial variation across semantic categories, and remains sensitive to over-specialization when models are fine-tuned on individual datasets rather than trained in a unified setting. We will make the experimental resources and datasets publicly available for the community.
https://arxiv.org/abs/2601.12539
Existing image emotion editing methods struggle to disentangle emotional cues from latent content representations, often yielding weak emotional expression and distorted visual structures. To bridge this gap, we propose EmoKGEdit, a novel training-free framework for precise and structure-preserving image emotion editing. Specifically, we construct a Multimodal Sentiment Association Knowledge Graph (MSA-KG) to disentangle the intricate relationships among objects, scenes, attributes, visual clues, and emotion. MSA-KG explicitly encodes the object-attribute-emotion causal chain and serves as external knowledge to support chain-of-thought reasoning, guiding the multimodal large model to infer plausible emotion-related visual cues and generate coherent instructions. In addition, based on MSA-KG, we design a disentangled structure-emotion editing module that explicitly separates emotional attributes from layout features within the latent space, which ensures that the target emotion is effectively injected while strictly maintaining visual spatial coherence. Extensive experiments demonstrate that EmoKGEdit achieves excellent performance in both emotion fidelity and content preservation, and outperforms state-of-the-art methods.
https://arxiv.org/abs/2601.12326
We propose EmoLat, a novel emotion latent space that enables fine-grained, text-driven image sentiment transfer by modeling cross-modal correlations between textual semantics and visual emotion features. Within EmoLat, an emotion semantic graph is constructed to capture the relational structure among emotions, objects, and visual attributes. To enhance the discriminability and transferability of emotion representations, we employ adversarial regularization, aligning the latent emotion distributions across modalities. Building upon EmoLat, a cross-modal sentiment transfer framework is proposed to manipulate image sentiment via joint embedding of text and EmoLat features. The network is optimized using a multi-objective loss incorporating semantic consistency, emotion alignment, and adversarial regularization. To support effective modeling, we construct EmoSpace Set, a large-scale benchmark dataset comprising images with dense annotations on emotions, object semantics, and visual attributes. Extensive experiments on EmoSpace Set demonstrate that our approach significantly outperforms existing state-of-the-art methods in both quantitative metrics and qualitative transfer fidelity, establishing a new paradigm for controllable image sentiment editing guided by textual input. The EmoSpace Set and all the code are available at this http URL.
https://arxiv.org/abs/2601.12079
In federated learning, Transformer, as a popular architecture, faces critical challenges in defending against gradient attacks and improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradient of Position Embeddings (PEs) in Transformer contains sufficient information, which can be used to reconstruct the input data. To mitigate this issue, we introduce a Masked Jigsaw Puzzle (MJP) framework. MJP starts with random token shuffling to break the token order, and then a learnable unknown (unk) position embedding is used to mask out the PEs of the shuffled tokens. In this manner, the local spatial information which is encoded in the position embeddings is disrupted, and the models are forced to learn feature representations that are less reliant on the local spatial information. Notably, with the careful use of MJP, we can not only improve models' robustness against gradient attacks, but also boost their performance in both vision and text application scenarios, such as classification for images (e.g., ImageNet-1K) and sentiment analysis for text (e.g., Yelp and Amazon). Experimental results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks. Code is publicly available via this https URL
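The MJP operation itself is easy to sketch in isolation, here with NumPy arrays standing in for token and position embeddings; the mask ratio and shapes are illustrative, not the paper's settings:

```python
import numpy as np

def masked_jigsaw(tokens, pos_emb, unk_pe, mask_ratio=0.5, rng=None):
    """Shuffle a random subset of tokens (the jigsaw), then overwrite the
    position embeddings at those positions with a shared 'unk' embedding."""
    rng = rng or np.random.default_rng()
    n = len(tokens)
    idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    shuffled = np.array(tokens, dtype=object)
    shuffled[idx] = shuffled[rng.permutation(idx)]   # reorder only the chosen slots
    pe = pos_emb.copy()
    pe[idx] = unk_pe                                 # mask their position info
    return list(shuffled), pe

tokens = [f"tok{i}" for i in range(8)]
pos_emb = np.arange(32, dtype=float).reshape(8, 4)   # toy learned PEs
unk_pe = np.full(4, -1.0)                            # shared learnable unk PE
out_tokens, out_pe = masked_jigsaw(tokens, pos_emb, unk_pe,
                                   rng=np.random.default_rng(0))
```

The token multiset is preserved while their order and positional grounding are disrupted, which is exactly the information a PE-gradient attacker would exploit.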
https://arxiv.org/abs/2601.12051
Customer reviews contain rich signals about product weaknesses and unmet user needs, yet existing analytic methods rarely move beyond descriptive tasks such as sentiment analysis or aspect extraction. While large language models (LLMs) can generate free-form suggestions, their outputs often lack accuracy and depth of reasoning. In this paper, we present a multi-agent, LLM-based framework for prescriptive decision support, which transforms large-scale review corpora into actionable business advice. The framework integrates four components: clustering to select representative reviews, advice generation, iterative evaluation, and feasibility-based ranking. This design couples corpus distillation with feedback-driven advice refinement to produce outputs that are specific, actionable, and practical. Experiments across three service domains and multiple model families show that our framework consistently outperforms single-model baselines on actionability, specificity, and non-redundancy, with medium-sized models approaching the performance of large-model frameworks.
https://arxiv.org/abs/2601.12024
Anxiety affects hundreds of millions of individuals globally, yet large-scale screening remains limited. Social media language provides an opportunity for scalable detection, but current models often lack interpretability, keyword-robustness validation, and rigorous user-level data integrity. This work presents a transparent approach to social media-based anxiety detection through linguistically interpretable feature-grounded modeling and cross-domain validation. Using a substantial dataset of Reddit posts, we trained a logistic regression classifier on carefully curated subreddits for training, validation, and test splits. Comprehensive evaluation included feature ablation, keyword masking experiments, and varying-density difference analyses comparing anxious and control groups, along with external validation using clinically interviewed participants with diagnosed anxiety disorders. The model achieved strong performance while maintaining high accuracy even after sentiment removal or keyword masking. Early detection using minimal post history significantly outperformed random classification, and cross-domain analysis demonstrated strong consistency with clinical interview data. Results indicate that transparent linguistic features can support reliable, generalizable, and keyword-robust anxiety detection. The proposed framework provides a reproducible baseline for interpretable mental health screening across diverse online contexts.
https://arxiv.org/abs/2601.11758
Debt collection is a critical function within the banking, financial services, and insurance (BFSI) sector, relying heavily on large-scale human-to-human conversational interactions conducted primarily in Vietnamese contact centers. These conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning, which pose significant challenges for traditional natural language processing systems. This paper introduces Credit C-GPT, a domain-specialized large language model with seven billion parameters, fine-tuned for conversational understanding in Vietnamese debt collection scenarios. The proposed model integrates multiple conversational intelligence tasks, including dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction, within a single reasoning-based framework. We describe the data construction process, annotation strategy, and training methodology, and evaluate the model on proprietary human-annotated datasets. Experimental results show consistent improvements over traditional pipeline-based approaches, indicating that domain-specialized conversational language models provide a scalable and privacy-aware solution for real-time assistance and post-call analytics in enterprise contact centers.
https://arxiv.org/abs/2601.10167
We present a hybrid transformer architecture that replaces discrete middle layers with a continuous-depth Neural Ordinary Differential Equation (ODE) block, enabling inference-time control over generation attributes via a learned steering signal. Unlike standard transformers that process representations through fixed discrete layers, our approach treats depth as a continuous variable governed by a learned vector field $F_\theta(H, \tau, u)$, where $u$ is a low-dimensional control signal injected via explicit concatenation. We validate the architecture through four experiments: (1) gradient flow stability with zero exploding/vanishing gradient events, (2) semantic steering achieving 98\%/88\% accuracy for positive/negative sentiment control, (3) continuous interpolation validated by a negligible 0.068\% trajectory divergence between fixed and adaptive solvers, and (4) efficiency benchmarking demonstrating latency parity with standard discrete baselines. Additionally, we show that adaptive ODE solvers reveal geometric structure in the learned dynamics: the control signal partitions the vector field into distinct dynamical regimes with different curvature characteristics. The adjoint method enables $O(1)$ memory training regardless of integration depth. Our results demonstrate that continuous-depth dynamics with learned control signals provide a viable, efficient mechanism for steerable language generation.
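The control-signal injection can be illustrated with a fixed-step Euler integrator and a toy linear-tanh field standing in for the learned $F_\theta(H, \tau, u)$; all shapes, the field itself, and the step counts are assumptions for illustration:

```python
import numpy as np

def ode_block(H, u, W, depth=1.0, n_steps=16):
    """Integrate dH/dtau = F(H, tau, u) with fixed-step Euler. The control
    signal u and the depth variable tau are concatenated onto the state
    before each evaluation of the (toy) vector field."""
    tau, dt = 0.0, depth / n_steps
    for _ in range(n_steps):
        inp = np.concatenate([H, u, [tau]])
        H = H + dt * np.tanh(inp @ W)   # toy linear-tanh stand-in for F_theta
        tau += dt
    return H

rng = np.random.default_rng(1)
d, c = 4, 2                               # state dim, control dim
W = 0.1 * rng.normal(size=(d + c + 1, d))
H0 = rng.normal(size=d)
h_pos = ode_block(H0, np.array([1.0, 0.0]), W)   # one steering setting
h_neg = ode_block(H0, np.array([0.0, 1.0]), W)   # an alternative setting
```

Changing only `u` changes the final representation (the steering effect), and refining the step count barely moves the trajectory, mirroring the fixed-vs-adaptive solver agreement the abstract reports.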
https://arxiv.org/abs/2601.10007
Most Multimodal Sentiment Analysis research has focused on point-wise regression. While straightforward, this approach is sensitive to label noise and neglects whether one sample is more positive than another, resulting in unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks emerged to address this gap, capturing relative order by learning from comparisons. Yet, they introduce two new trade-offs: First, they assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples. Second, they employ static ranking margins, which fail to reflect the varying semantic distances between sentiment groups. To address this, we propose a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization (GRPO). Our framework resolves these trade-offs by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples. Specifically, Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 then employs an MAE-driven objective to align prediction magnitudes. To validate its generalizability, we extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection. GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.
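Stage 1's advantage-weighted dynamic-margin ranking loss can be sketched as follows. The margin schedule, the GRPO-style group normalization, and the weighting rule are illustrative guesses at the described design, not the paper's exact formulation:

```python
import numpy as np

def grcf_stage1_loss(pred, label, base_margin=0.1, scale=1.0):
    """Pairwise hinge ranking loss over one group: the margin grows with the
    label gap (dynamic margin), and pairs are re-weighted by a group-normalized
    advantage so hard-to-rank pairs dominate the gradient."""
    losses = []
    n = len(pred)
    for i in range(n):
        for j in range(n):
            if label[i] > label[j]:
                margin = base_margin + scale * (label[i] - label[j])
                losses.append(max(0.0, margin - (pred[i] - pred[j])))
    losses = np.array(losses)
    # GRPO-style advantage: normalize each pair's loss within the group.
    adv = (losses - losses.mean()) / (losses.std() + 1e-8)
    weights = 1.0 + np.maximum(adv, 0.0)   # up-weight harder-than-average pairs
    return float((weights * losses).mean())

label = np.array([1.0, 0.5, 0.0])
loss_good = grcf_stage1_loss(np.array([2.0, 1.0, 0.0]), label)  # correct order
loss_bad = grcf_stage1_loss(np.array([0.0, 1.0, 2.0]), label)   # reversed order
```

Correctly ordered predictions with comfortable gaps incur zero loss, while a reversed ordering is penalized, with the largest-gap pair driving the heaviest weighted term; Stage 2's MAE objective would then calibrate absolute magnitudes.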
https://arxiv.org/abs/2601.09606
Standardized Student Evaluations of Teaching often suffer from low reliability, restricted response options, and response distortion. Existing machine learning methods that mine open-ended comments usually reduce feedback to binary sentiment, which overlooks concrete concerns such as content clarity, feedback timeliness, and instructor demeanor, and provides limited guidance for instructional improvement. We propose TeachPro, a multi-label learning framework that systematically assesses five key teaching dimensions: professional expertise, instructional behavior, pedagogical efficacy, classroom experience, and other performance metrics. We first propose a Dimension-Anchored Evidence Encoder, which integrates three core components: (i) a pre-trained text encoder that transforms qualitative feedback annotations into contextualized embeddings; (ii) a prompt module that represents the five teaching dimensions as learnable semantic anchors; and (iii) a cross-attention mechanism that aligns evidence with pedagogical dimensions within a structured semantic space. We then propose a Cross-View Graph Synergy Network to represent student comments. This network comprises two components: (i) a Syntactic Branch that extracts explicit grammatical dependencies from parse trees, and (ii) a Semantic Branch that models latent conceptual relations derived from BERT-based similarity graphs. A BiAffine fusion module aligns syntactic and semantic units, while a differential regularizer disentangles embeddings to encourage complementary representations. Finally, a cross-attention mechanism bridges the dimension-anchored evidence with the multi-view comment representations. We also contribute a novel benchmark dataset featuring expert qualitative annotations and multi-label scores. Extensive experiments demonstrate that TeachPro offers superior diagnostic granularity and robustness across diverse evaluation settings.
https://arxiv.org/abs/2601.09246
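The anchor-to-evidence alignment in the Dimension-Anchored Evidence Encoder can be sketched as standard cross-attention: each learnable dimension anchor acts as a query over the evidence embeddings, yielding one pooled vector per teaching dimension. This is a minimal dependency-free sketch under assumed shapes (five anchors of dimension `d`, evidence tokens of the same dimension); it is not the TeachPro implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(anchors, evidence):
    """Each dimension anchor (query) attends over evidence embeddings
    (keys = values), returning one pooled vector per dimension.

    anchors:  list of query vectors, one per teaching dimension
    evidence: list of contextualized embeddings from the text encoder
    """
    d = len(anchors[0])
    scale = math.sqrt(d)
    pooled = []
    for q in anchors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in evidence]
        attn = softmax(scores)  # attention weights over evidence tokens
        pooled.append([sum(a * v[t] for a, v in zip(attn, evidence))
                       for t in range(d)])
    return pooled
```

Because each pooled vector is a convex combination of the evidence embeddings, every output coordinate stays within the range spanned by the evidence, which keeps the dimension summaries grounded in the observed comments.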
This study investigates the use of prompt engineering to enhance large language models (LLMs), specifically GPT-4o-mini and gemini-1.5-flash, in sentiment analysis tasks. It evaluates advanced prompting techniques such as few-shot learning, chain-of-thought prompting, and self-consistency against a baseline. Key tasks include sentiment classification, aspect-based sentiment analysis, and detecting subtle nuances such as irony. The research details the theoretical background, datasets, and methods used, assessing the performance of the LLMs as measured by accuracy, recall, precision, and F1 score. Findings reveal that advanced prompting significantly improves sentiment analysis, with the few-shot approach excelling in GPT-4o-mini and chain-of-thought prompting boosting irony detection in gemini-1.5-flash by up to 46%. Thus, while advanced prompting techniques improve performance overall, the fact that few-shot prompting works best for GPT-4o-mini while chain-of-thought excels in gemini-1.5-flash for irony detection suggests that prompting strategies must be tailored to both the model and the task, aligning prompt design with the LLM's architecture and the semantic complexity of the task.
https://arxiv.org/abs/2601.08302
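Of the techniques compared, self-consistency is the easiest to misread from the name alone: it samples several chain-of-thought completions and majority-votes the final answers. The sketch below illustrates just that voting step; `sample_fn` is a placeholder for any LLM sampling call (no specific API is assumed), and the agreement ratio it returns is a simple confidence proxy, not something the paper defines.

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n_samples=5):
    """Draw several stochastic completions for the same prompt and
    majority-vote their final answers.

    sample_fn: callable taking a prompt and returning one final answer
               (stands in for a temperature-sampled LLM call)
    Returns (majority_answer, agreement_fraction).
    """
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    label, count = Counter(answers).most_common(1)[0]
    return label, count / n_samples
```

With a stubbed sampler whose five draws are "positive", "positive", "negative", "positive", "neutral", the vote returns "positive" with 0.6 agreement, showing how sampling noise on individual chains is averaged away.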
Despite remarkable progress in large language models, Urdu, a language spoken by over 230 million people, remains critically underrepresented in modern NLP systems. Existing multilingual models demonstrate poor performance on Urdu-specific tasks, struggling with the language's complex morphology, right-to-left Nastaliq script, and rich literary traditions. Even the base LLaMA-3.1 8B-Instruct model shows limited capability in generating fluent, contextually appropriate Urdu text. We introduce Qalb, an Urdu language model developed through a two-stage approach: continued pre-training followed by supervised fine-tuning. Starting from LLaMA 3.1 8B, we perform continued pre-training on a dataset of 1.97 billion tokens. This corpus comprises 1.84 billion tokens of diverse Urdu text spanning news archives, classical and contemporary literature, government documents, and social media, combined with 140 million tokens of English Wikipedia data to prevent catastrophic forgetting. We then fine-tune the resulting model on the Alif Urdu-instruct dataset. Through extensive evaluation on Urdu-specific benchmarks, Qalb demonstrates substantial improvements, achieving a weighted average score of 90.34 and outperforming the previous state-of-the-art Alif-1.0-Instruct model (87.1) by 3.24 points, while also surpassing the base LLaMA-3.1 8B-Instruct model by 44.64 points. Qalb achieves state-of-the-art performance with comprehensive evaluation across seven diverse tasks including Classification, Sentiment Analysis, and Reasoning. Our results demonstrate that continued pre-training on diverse, high-quality language data, combined with targeted instruction fine-tuning, effectively adapts foundation models to low-resource languages.
https://arxiv.org/abs/2601.08141
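The corpus composition implies roughly a 93/7 Urdu-to-English token split (140M English out of ~1.98B combined tokens, reported as 1.97B after rounding). One simple way to realize such a mix, sketched below purely as an illustration, is to sample the source corpus for each pre-training batch in proportion to token counts; the abstract does not specify how the mixing is done, so this proportional sampler is an assumption.

```python
import random

def sample_language(urdu_tokens=1.84e9, english_tokens=0.14e9, rng=random):
    """Choose which corpus the next pre-training batch is drawn from,
    proportionally to token counts. The ~7% English replay is the
    abstract's stated guard against catastrophic forgetting; the
    sampling scheme itself is an illustrative assumption."""
    p_english = english_tokens / (urdu_tokens + english_tokens)
    return "english" if rng.random() < p_english else "urdu"
```

Over many batches this converges to the corpus proportions, so the model keeps seeing a small, steady stream of English while the bulk of updates come from Urdu text.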