LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and supports empathetic communication in asynchronous healthcare.
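As a rough illustration of the kind of classifier the abstract describes, the sketch below fits a TF-IDF plus logistic-regression baseline on a handful of hypothetical queries; the binary label scheme, example texts, and features are placeholders, not the EAF benchmark or the paper's models.

```python
# Minimal sketch of an empathy-applicability classifier of the kind described
# in the abstract. Labels and example queries are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical annotated queries: 1 = emotional reaction applicable, 0 = not.
queries = [
    "I was just diagnosed with diabetes and I'm scared about what comes next.",
    "What is the recommended adult dose of ibuprofen?",
    "My father passed away from this illness and now I have the same symptoms.",
    "How long should I fast before a routine blood test?",
]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(queries, labels)

print(clf.predict(["I can't stop worrying about my biopsy results."]))
```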
https://arxiv.org/abs/2601.09696
With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when evaluated on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: this https URL
https://arxiv.org/abs/2601.09270
We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI. This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage. To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks. Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks. The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use. By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. Our models are available at this https URL. For technical inquiries, please contact midm-llm@kt.com.
https://arxiv.org/abs/2601.09066
Large language models generate judgments that resemble those of humans. Yet the extent to which these models align with human judgments in interpreting figurative and socially grounded language remains uncertain. To investigate this, human participants and four instruction-tuned LLMs of different sizes (GPT-4, Gemma-2-9B, Llama-3.2, and Mistral-7B) rated 240 dialogue-based sentences representing six linguistic traits: conventionality, sarcasm, funny, emotional, idiomacy, and slang. Each of the 240 sentences was paired with 40 interpretive questions, and both humans and LLMs rated these sentences on a 10-point Likert scale. Results indicated that LLMs aligned with humans at the surface level but diverged significantly at the representational level, especially in interpreting figurative sentences involving idioms and Gen Z slang. GPT-4 most closely approximates human representational patterns, while all models struggle with context-dependent and socio-pragmatic expressions like sarcasm, slang, and idiomacy.
https://arxiv.org/abs/2601.09041
A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction-driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms the facial expression from one designated expression to another. The Instruction-driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets. The results show that our framework can generate facial expression trajectories according to text instruction. Considering that text prompts allow us to make diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications. More information about our project can be found at this https URL
https://arxiv.org/abs/2601.08179
Subject-independent EEG emotion recognition is challenged by pronounced inter-subject variability and the difficulty of learning robust representations from short, noisy recordings. To address this, we propose a fusion framework that integrates (i) local, channel-wise descriptors and (ii) global, trial-level descriptors, improving cross-subject generalization on the SEED-VII dataset. Local representations are formed per channel by concatenating differential entropy with graph-theoretic features, while global representations summarize time-domain, spectral, and complexity characteristics at the trial level. These representations are fused in a dual-branch transformer with attention-based fusion and domain-adversarial regularization, with samples filtered by an intensity threshold. Experiments under a leave-one-subject-out protocol demonstrate that the proposed method consistently outperforms single-view and classical baselines, achieving approximately 40% mean accuracy in 7-class subject-independent emotion recognition. The code has been released at this https URL.
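A minimal sketch of the local, channel-wise descriptor mentioned above: per-channel differential entropy (DE) in a few frequency bands, which for a Gaussian segment reduces to 0.5 ln(2πe σ²). The sampling rate, band edges, and channel count are assumptions, and the graph-theoretic features and dual-branch transformer are not reproduced here.

```python
# Per-channel, per-band differential entropy as a local EEG descriptor.
# Band edges, sampling rate, and channel count are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 200  # assumed sampling rate (Hz)
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def differential_entropy(x):
    """DE of a zero-mean Gaussian signal segment: 0.5 * ln(2*pi*e*var)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x) + 1e-12)

def channel_descriptors(eeg):
    """eeg: (channels, samples) -> (channels, n_bands) DE features."""
    feats = []
    for low, high in BANDS.values():
        b, a = butter(4, [low / (FS / 2), high / (FS / 2)], btype="band")
        filtered = filtfilt(b, a, eeg, axis=1)
        feats.append([differential_entropy(ch) for ch in filtered])
    return np.array(feats).T  # (channels, bands)

demo = np.random.randn(62, 4 * FS)      # 62 channels, 4 s of synthetic data
print(channel_descriptors(demo).shape)  # (62, 4)
```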
https://arxiv.org/abs/2601.08094
The increasing adoption of smart classroom technologies in higher education has mainly focused on automating attendance, with limited attention given to students' emotional and cognitive engagement during lectures. This limits instructors' ability to identify disengagement and adapt teaching strategies in real time. This paper presents SCASED (Smart Classroom Attendance System with Emotion Detection), an IoT-based system that integrates automated attendance tracking with facial emotion recognition to support classroom engagement monitoring. The system uses a Raspberry Pi camera and OpenCV for face detection, and a finetuned MobileNetV2 model to classify four learning-related emotional states: engagement, boredom, confusion, and frustration. A session-based mechanism is implemented to manage attendance and emotion monitoring by recording attendance once per session and performing continuous emotion analysis thereafter. Attendance and emotion data are visualized through a cloud-based dashboard to provide instructors with insights into classroom dynamics. Experimental evaluation using the DAiSEE dataset achieved an emotion classification accuracy of 89.5%. The results show that integrating attendance data with emotion analytics can provide instructors with additional insight into classroom dynamics and support more responsive teaching practices.
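The classification component can be pictured as a MobileNetV2 backbone whose head is swapped for the four learning-related states; the sketch below shows that replacement only, with random weights and a dummy input standing in for the pretrained model, OpenCV face crops, and DAiSEE training details.

```python
# MobileNetV2 with its classifier head replaced for four emotional states.
# Pretrained ImageNet weights would normally be loaded before fine-tuning;
# random initialisation keeps this sketch self-contained and offline.
import torch
import torch.nn as nn
from torchvision import models

STATES = ["engagement", "boredom", "confusion", "frustration"]

model = models.mobilenet_v2(weights=None)
model.classifier[1] = nn.Linear(model.last_channel, len(STATES))
model.eval()

face = torch.randn(1, 3, 224, 224)  # stand-in for a face crop from the detector
probs = torch.softmax(model(face), dim=1)
print(dict(zip(STATES, probs.squeeze().tolist())))
```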
https://arxiv.org/abs/2601.08049
As emotional support chatbots have recently gained significant traction across both research and industry, a common evaluation strategy has emerged: use help-seeker simulators to interact with supporter chatbots. However, current simulators suffer from two critical limitations: (1) they fail to capture the behavioral diversity of real-world seekers, often portraying them as overly cooperative, and (2) they lack the controllability required to simulate specific seeker profiles. To address these challenges, we present a controllable seeker simulator driven by nine psychological and linguistic features that underpin seeker behavior. Using authentic Reddit conversations, we train our model via a Mixture-of-Experts (MoE) architecture, which effectively differentiates diverse seeker behaviors into specialized parameter subspaces, thereby enhancing fine-grained controllability. Our simulator achieves superior profile adherence and behavioral diversity compared to existing approaches. Furthermore, evaluating 7 prominent supporter models with our system uncovers previously obscured performance degradations. These findings underscore the utility of our framework in providing a more faithful and stress-tested evaluation for emotional support chatbots.
https://arxiv.org/abs/2601.07698
Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks--a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies--adaptively integrated through hierarchical dynamic gating for context-aware feature selection. Enhanced multimodal representations are integrated with LLMs via pseudo token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, with superior cross-lingual robustness revealing universal patterns in multimodal emotional expressions across English and Chinese. We will release the source code publicly.
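A compact sketch of the expert-gated fusion idea: three stand-in expert networks over a shared fused feature, combined by a learned softmax gate. The dimensions, the experts' internals, and the hierarchical aspect of the gating are simplifications, not the paper's implementation.

```python
# Three expert networks over a fused multimodal feature, mixed by a learned gate.
# Dimensions and expert architectures are illustrative assumptions.
import torch
import torch.nn as nn

class ExpertGatedFusion(nn.Module):
    def __init__(self, dim=512, n_experts=3):
        super().__init__()
        # Stand-ins for the local, semantic-correlation, and global experts.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(n_experts)]
        )
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):                                # x: (batch, dim)
        weights = torch.softmax(self.gate(x), dim=-1)    # (batch, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)

fused = ExpertGatedFusion()(torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 512])
```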
https://arxiv.org/abs/2601.07565
This tutorial paper provides a step-by-step workflow for building and analysing semantic networks from short creative texts. We introduce and compare two widely used text-to-network approaches: word co-occurrence networks and textual forma mentis networks (TFMNs). We also demonstrate how they can be used in machine learning to predict human creativity ratings. Using a corpus of 1029 short stories, we guide readers through text preprocessing, network construction, feature extraction (structural measures, spreading-activation indices, and emotion scores), and application of regression models. We evaluate how network-construction choices influence both network topology and predictive performance. Across all modelling settings, TFMNs consistently outperformed co-occurrence networks, achieving lower prediction errors (best MAE = 0.581 for TFMN vs 0.592 for co-occurrence with window size 3). Network-structural features dominated predictive performance (MAE = 0.591 for TFMN), whereas emotion features performed worse (MAE = 0.711 for TFMN) and spreading-activation measures contributed little (MAE = 0.788 for TFMN). This paper offers practical guidance for researchers interested in applying network-based methods in cognitive fields like creativity research. We show when syntactic networks are preferable to surface co-occurrence models, and provide an open, reproducible workflow accessible to newcomers in the field, while also offering deeper methodological insight for experienced researchers.
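To make the co-occurrence branch of the workflow concrete, the sketch below builds a window-3 word co-occurrence network with networkx, extracts a few structural measures, and fits a regressor against creativity ratings; the toy stories, ratings, and feature set are placeholders rather than the 1029-story corpus or the tutorial's exact measures.

```python
# Co-occurrence network -> structural features -> regression on creativity ratings.
# Toy texts, ratings, and the feature set are placeholders.
import networkx as nx
import numpy as np
from sklearn.linear_model import Ridge

def cooccurrence_network(tokens, window=3):
    g = nx.Graph()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:  # link words within the window
            if w != v:
                g.add_edge(w, v)
    return g

def structural_features(g):
    return [
        g.number_of_nodes(),
        g.number_of_edges(),
        nx.density(g),
        nx.average_clustering(g),
    ]

stories = [
    "the dragon wrote letters to the moon every night".split(),
    "a quiet clerk filed the same report again and again".split(),
]
ratings = [4.5, 1.5]  # hypothetical human creativity ratings

X = np.array([structural_features(cooccurrence_network(s)) for s in stories])
model = Ridge().fit(X, ratings)
print(model.predict(X))
```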
https://arxiv.org/abs/2601.07327
Balancing dialogue, music, and sound effects with accompanying video is crucial for immersive storytelling, yet current audio mixing workflows remain largely manual and labor-intensive. While recent advancements have introduced the visually guided acoustic highlighting task, which implicitly rebalances audio sources using multimodal guidance, it remains unclear which visual aspects are most effective as conditioning. We address this gap through a systematic study of whether deep video understanding improves audio remixing. Using textual descriptions as a proxy for visual analysis, we prompt large vision-language models to extract six types of visual-semantic aspects, including object and character appearance, emotion, camera focus, tone, scene background, and inferred sound-related cues. Extensive experiments show that camera focus, tone, and scene background consistently yield the largest improvements in perceptual mix quality over state-of-the-art baselines. Our findings (i) identify which visual-semantic cues most strongly support coherent and visually aligned audio remixing, and (ii) outline a practical path toward automating cinema-grade sound design using lightweight guidance derived from large vision-language models.
https://arxiv.org/abs/2601.08871
Emotion recognition from electroencephalography (EEG) signals remains challenging due to high inter-subject variability, limited labeled data, and the lack of interpretable reasoning in existing approaches. While recent multimodal large language models (MLLMs) have advanced emotion analysis, they have not been adapted to handle the unique spatiotemporal characteristics of neural signals. We present E^2-LLM (EEG-to-Emotion Large Language Model), the first MLLM framework for interpretable emotion analysis from EEG. E^2-LLM integrates a pretrained EEG encoder with Qwen-based LLMs through learnable projection layers, employing a multi-stage training pipeline that encompasses emotion-discriminative pretraining, cross-modal alignment, and instruction tuning with chain-of-thought reasoning. We design a comprehensive evaluation protocol covering basic emotion prediction, multi-task reasoning, and zero-shot scenario understanding. Experiments on the dataset across seven emotion categories demonstrate that E^2-LLM achieves excellent performance on emotion classification, with larger variants showing enhanced reliability and superior zero-shot generalization to complex reasoning scenarios. Our work establishes a new paradigm combining physiological signals with LLM reasoning capabilities, showing that model scaling improves both recognition accuracy and interpretable emotional understanding in affective computing.
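One way to picture the encoder-to-LLM bridge is a learnable projection that maps a pooled EEG feature into a fixed number of pseudo tokens in the LLM's embedding space; the dimensions, token count, and two-layer design below are assumptions, not E^2-LLM's actual projection layers.

```python
# A learnable projection from EEG encoder features to LLM pseudo-token embeddings.
# eeg_dim, llm_dim, and n_tokens are illustrative assumptions.
import torch
import torch.nn as nn

class EEGToLLMPrefix(nn.Module):
    """Project a pooled EEG feature into n_tokens pseudo-token embeddings."""
    def __init__(self, eeg_dim=256, llm_dim=1024, n_tokens=8):
        super().__init__()
        self.n_tokens, self.llm_dim = n_tokens, llm_dim
        self.proj = nn.Sequential(
            nn.Linear(eeg_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, n_tokens * llm_dim),
        )

    def forward(self, eeg_feat):                 # (batch, eeg_dim)
        out = self.proj(eeg_feat)                # (batch, n_tokens * llm_dim)
        return out.view(-1, self.n_tokens, self.llm_dim)

prefix = EEGToLLMPrefix()(torch.randn(2, 256))
print(prefix.shape)  # (2, 8, 1024): prepended to the LLM's text embeddings
```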
https://arxiv.org/abs/2601.07877
South Africa's escalating mental health crisis, compounded by limited access to culturally responsive care, calls for innovative and contextually grounded interventions. While large language models show considerable promise for mental health support, their predominantly Western-centric training data limit cultural and linguistic applicability in African contexts. This study introduces a proof-of-concept framework that integrates cognitive behavioral therapy with the African philosophy of Ubuntu to create a culturally sensitive, emotionally intelligent, AI-driven mental health dialogue system. Guided by a design science research methodology, the framework applies both deep theoretical and therapeutic adaptations as well as surface-level linguistic and communicative cultural adaptations. Key CBT techniques, including behavioral activation and cognitive restructuring, were reinterpreted through Ubuntu principles that emphasize communal well-being, spiritual grounding, and interconnectedness. A culturally adapted dataset was developed through iterative processes of language simplification, spiritual contextualization, and Ubuntu-based reframing. The fine-tuned model was evaluated through expert-informed case studies, employing UniEval for conversational quality assessment alongside additional measures of CBT reliability and cultural linguistic alignment. Results demonstrate that the model effectively engages in empathetic, context-aware dialogue aligned with both therapeutic and cultural objectives. Although real-time end-user testing has not yet been conducted, the model underwent rigorous review and supervision by domain specialist clinical psychologists. The findings highlight the potential of culturally embedded emotional intelligence to enhance the contextual relevance, inclusivity, and effectiveness of AI-driven mental health interventions across African settings.
https://arxiv.org/abs/2601.06875
Psychological research has long utilized circumplex models to structure emotions, placing similar emotions adjacently and opposing ones diagonally. Although frequently used to interpret deep learning representations, these models are rarely directly incorporated into the representation learning of language models, leaving their geometric validity unexplored. This paper proposes a method to induce circular emotion representations within language model embeddings via contrastive learning on a hypersphere. We show that while this circular alignment offers superior interpretability and robustness against dimensionality reduction, it underperforms compared to conventional designs in high-dimensional settings and fine-grained classification. Our findings elucidate the trade-offs involved in applying psychological circumplex models to deep learning architectures.
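A minimal sketch of the circular-alignment idea, assuming evenly spaced circumplex angles: each emotion label gets a unit anchor on a circle inside the embedding space, embeddings are L2-normalised onto the hypersphere, and an InfoNCE-style loss pulls each one toward its label's anchor. The labels, dimensionality, and temperature are illustrative, not the paper's training setup.

```python
# Contrastive alignment of normalised embeddings to circumplex anchors.
# Labels, angles, dimensions, and temperature are illustrative assumptions.
import math
import torch
import torch.nn.functional as F

LABELS = ["joy", "surprise", "fear", "anger", "sadness", "boredom", "calm", "content"]
DIM = 16

def circumplex_anchors(n_labels=len(LABELS), dim=DIM):
    anchors = torch.zeros(n_labels, dim)
    for i in range(n_labels):
        theta = 2 * math.pi * i / n_labels     # evenly spaced circumplex angles
        anchors[i, 0], anchors[i, 1] = math.cos(theta), math.sin(theta)
    return anchors                              # unit vectors on a 2D circle

def circular_contrastive_loss(embeddings, label_ids, temperature=0.1):
    z = F.normalize(embeddings, dim=-1)         # points on the hypersphere
    logits = z @ circumplex_anchors().T / temperature
    return F.cross_entropy(logits, label_ids)

emb = torch.randn(4, DIM, requires_grad=True)
loss = circular_contrastive_loss(emb, torch.tensor([0, 2, 5, 7]))
loss.backward()
print(float(loss))
```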
https://arxiv.org/abs/2601.06575
Computational narrative analysis aims to capture rhythm, tension, and emotional dynamics in literary texts. Existing large language models can generate long stories but overly focus on causal coherence, neglecting the complex story arcs and orchestration inherent in human narratives. This creates a structural misalignment between model- and human-generated narratives. We propose VISTA Space, a high-dimensional representational framework for narrative orchestration that unifies human and model narrative perspectives. We further introduce LitVISTA, a structurally annotated benchmark grounded in literary texts, enabling systematic evaluation of models' narrative orchestration capabilities. We conduct oracle evaluations on a diverse selection of frontier LLMs, including GPT, Claude, Grok, and Gemini. Results reveal systematic deficiencies: existing models fail to construct a unified global narrative view, struggling to jointly capture narrative function and structure. Furthermore, even advanced thinking modes yield only limited gains for such literary narrative understanding.
https://arxiv.org/abs/2601.06445
Complex networks provide powerful tools for analyzing and understanding the intricate structures present in various systems, including natural language. Here, we analyze the topology of growing word-adjacency networks constructed from Chinese and English literary works written in different periods. Unconventionally, instead of considering dictionary words only, we also include punctuation marks as if they were ordinary words. Our approach is based on two arguments: (1) punctuation carries genuine information related to emotional state, allows for logical grouping of content, provides a pause in reading, and facilitates understanding by avoiding ambiguity, and (2) our previous works have shown that punctuation marks behave like words in a Zipfian analysis and, if considered together with regular words, can improve authorship attribution in stylometric studies. We focus on the functional dependence of the average shortest path length $L(N)$ on the network size $N$ for different epochs and individual novels in their original language as well as for translations of selected novels into the other language. We approximate the empirical results with a growing network model and obtain satisfactory agreement between the two. We also observe that $L(N)$ behaves asymptotically similarly for both languages if punctuation marks are included but becomes sizably larger for Chinese if punctuation marks are neglected.
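The construction itself is simple to sketch: tokenise so that punctuation marks survive as ordinary tokens, link adjacent tokens, and track the average shortest path length $L(N)$ as the network grows. The tokeniser and sample sentence below are simplified placeholders for the literary corpora.

```python
# Word-adjacency network that keeps punctuation as ordinary tokens,
# with L(N) tracked as the network grows. Text and tokeniser are placeholders.
import re
import networkx as nx

def tokens_with_punctuation(text):
    # Keep words and punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def adjacency_network(tokens):
    g = nx.Graph()
    for a, b in zip(tokens, tokens[1:]):
        g.add_edge(a, b)
    return g

text = "It was late, very late; still, she wrote. She wrote, and she waited."
toks = tokens_with_punctuation(text)

for cut in range(4, len(toks) + 1, 4):          # grow the network in steps
    g = adjacency_network(toks[:cut])
    if nx.is_connected(g):
        print(g.number_of_nodes(), round(nx.average_shortest_path_length(g), 3))
```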
https://arxiv.org/abs/2601.06361
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using ``global token perplexity'', which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
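For reference, the ``global token perplexity'' being critiqued is simply the text formula applied to speech tokens, i.e. the exponential of the mean negative log-likelihood over the generated token sequence; the per-token log-probabilities below are made up for illustration.

```python
# Global token perplexity: exp of the mean negative log-likelihood per token.
import math

def global_token_perplexity(token_logprobs):
    """token_logprobs: per-token log p(token | history) from the model."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for a generated speech-token sequence.
logprobs = [-1.2, -0.4, -2.1, -0.9, -1.5]
print(round(global_token_perplexity(logprobs), 3))
```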
https://arxiv.org/abs/2601.06329
Generalisation to unseen subjects in EEG-based emotion classification remains a challenge due to high inter- and intra-subject variability. Continual learning (CL) offers a promising solution by learning from a sequence of tasks while mitigating catastrophic forgetting. Regularisation-based CL approaches, such as Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI), and Memory Aware Synapses (MAS), are commonly used as baselines in EEG-based CL studies, yet their suitability for this problem remains underexplored. This study theoretically and empirically finds that regularisation-based CL methods show limited performance for EEG-based emotion classification on the DREAMER and SEED datasets. We identify a fundamental misalignment in the stability-plasticity trade-off, where regularisation-based methods prioritise mitigating catastrophic forgetting (backward transfer) over adapting to new subjects (forward transfer). We investigate this limitation under subject-incremental sequences and observe that: (1) the heuristics for estimating parameter importance become less reliable under noisy data and covariate shift, (2) gradients on parameters deemed important by these heuristics often interfere with gradient updates required for new subjects, moving optimisation away from the minimum, (3) importance values accumulated across tasks over-constrain the model, and (4) performance is sensitive to subject order. Forward transfer showed no statistically significant improvement over sequential fine-tuning (p > 0.05 across approaches and datasets). The high variability of EEG signals means past subjects provide limited value to future subjects. Regularisation-based continual learning approaches are therefore limited for robust generalisation to unseen subjects in EEG-based emotion classification.
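As a reminder of what the regularisation-based baselines do, the sketch below applies an EWC-style quadratic penalty that anchors parameters to their values after the previous subject, weighted by an importance estimate; the stand-in model, the random placeholder importances, and the lambda are illustrative, and SI and MAS differ mainly in how importance is estimated.

```python
# EWC-style penalty: anchor important parameters to their post-previous-task values.
# The model, Fisher estimate, and lambda below are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Linear(32, 7)                       # stand-in EEG emotion classifier
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.rand_like(p) for n, p in model.named_parameters()}  # placeholder importance

def ewc_penalty(model, old_params, fisher, lam=100.0):
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * loss

x, y = torch.randn(8, 32), torch.randint(0, 7, (8,))
task_loss = nn.functional.cross_entropy(model(x), y)
total = task_loss + ewc_penalty(model, old_params, fisher)
total.backward()
print(float(task_loss), float(total))
```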
https://arxiv.org/abs/2601.07858
Driven by the rapid advancement of Large Language Models (LLMs), particularly Audio-LLMs and Omni-models, spoken dialogue systems have evolved significantly, progressively narrowing the gap between human-machine and human-human interactions. Achieving truly ``human-like'' communication necessitates a dual capability: emotional intelligence to perceive and resonate with users' emotional states, and robust interaction mechanisms to navigate the dynamic, natural flow of conversation, such as real-time turn-taking. Therefore, we launched the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026 to benchmark these dual capabilities. Anchored by a sizable dataset derived from authentic human conversations, this initiative establishes a fair evaluation platform across two tracks: (1) Emotional Intelligence, targeting long-term emotion understanding and empathetic generation; and (2) Full-Duplex Interaction, systematically evaluating real-time decision-making under ``listening-while-speaking'' conditions. This paper summarizes the dataset, track configurations, and the final results.
https://arxiv.org/abs/2601.05564
In this work, we introduce the Keep Emotional and Essential Memory (KEEM) dataset, a novel generation-based dataset designed to enhance memory updates in long-term conversational systems. Unlike existing approaches that rely on simple accumulation or operation-based methods, which often result in information conflicts and difficulties in accurately tracking a user's current state, KEEM dynamically generates integrative memories. This process not only preserves essential factual information but also incorporates emotional context and causal relationships, enabling a more nuanced understanding of user interactions. By seamlessly updating a system's memory with both emotional and essential data, our approach promotes deeper empathy and enhances the system's ability to respond meaningfully in open-domain conversations.
https://arxiv.org/abs/2601.05548