Electroencephalographic (EEG) signals have long been applied in the field of affective brain-computer interfaces (aBCIs). Cross-subject EEG-based emotion recognition has demonstrated significant potential in practical applications due to its applicability across diverse individuals. However, most studies on cross-subject EEG-based emotion recognition neglect inter-individual variability and the negative transfer phenomenon during model training. To address this issue, this paper introduces a cross-subject EEG-based emotion recognition method built on source selection with an adversarial strategy. The proposed method comprises two modules: the source selection network (SS) and the adversarial strategies network (AS). The SS uses domain labels to reverse-engineer the training process of domain adaptation. Its key idea is to disrupt class separability and magnify inter-domain differences, thereby raising the classification difficulty and forcing the model to learn domain-invariant yet emotion-relevant representations. The AS receives the source-domain selection results and the pretrained domain discriminators from the SS. The pretrained domain discriminators compute a novel loss aimed at enhancing domain-classification performance during adversarial training, ensuring the balance of the adversarial strategies. This paper provides theoretical insights into the proposed method, which achieves outstanding performance on two EEG-based emotion datasets, SEED and SEED-IV. The code can be found at this https URL.
https://arxiv.org/abs/2512.13458
This study investigates emotion drift, the change in emotional state across a single text, within mental health-related messages. While sentiment analysis typically classifies an entire message as positive, negative, or neutral, the nuanced shift of emotions over the course of a message is often overlooked. This study detects sentence-level emotions and measures emotion drift scores using pre-trained transformer models such as DistilBERT and RoBERTa. The results provide insights into patterns of emotional escalation or relief in mental health conversations. This methodology can be applied to better understand emotional dynamics in text content.
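A minimal sketch of how a drift score might be computed from sentence-level emotion scores. The abstract does not define the exact metric, so the definition below (mean of the later half minus mean of the earlier half, with positive values indicating escalation) and the scores standing in for a transformer classifier's outputs are assumptions for illustration only:

```python
# Illustrative, assumed drift definition: signed difference between the mean
# sentence-level negativity of a message's second half and its first half.
# Positive = emotional escalation toward the end; negative = relief.
def drift_score(sentence_scores):
    if len(sentence_scores) < 2:
        return 0.0  # a single sentence cannot drift
    mid = len(sentence_scores) // 2
    first, second = sentence_scores[:mid], sentence_scores[mid:]
    return sum(second) / len(second) - sum(first) / len(first)

# Hypothetical per-sentence negativity scores for one four-sentence message,
# standing in for DistilBERT/RoBERTa classifier outputs.
scores = [0.1, 0.2, 0.7, 0.8]
print(round(drift_score(scores), 2))  # 0.6
```

Any monotone summary of the sentence-level trajectory (e.g. total variation, slope of a fitted line) could replace the half-difference here; the point is only that drift is measured within a message rather than from a single whole-message label.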
https://arxiv.org/abs/2512.13363
Debate has been widely adopted as a strategy to enhance critical thinking skills in English Language Arts (ELA). One important skill in debate is forming effective argumentation, which requires debaters to select supportive evidence from literature and construct compelling claims. However, the training of this skill largely depends on human coaching, which is labor-intensive and difficult to scale. To better support students in preparing for debates, this study explores the potential of leveraging artificial intelligence to generate effective arguments. Specifically, we prompted GPT-4 to create an evidence card and compared it to evidence cards produced by human debaters. The evidence cards outline the arguments students will present and how those arguments will be delivered, including components such as literature-based evidence quotations, summaries of core ideas, verbatim reading scripts, and tags (i.e., titles of the arguments). We compared the quality of the arguments in the evidence cards created by GPT and student debaters using Aristotle's rhetorical principles: ethos (credibility), pathos (emotional appeal), and logos (logical reasoning). Through a systematic qualitative and quantitative analysis, grounded in the rhetorical principles, we identify the strengths and limitations of humans and GPT in debate reasoning, outlining areas where AI's focus and justifications align with or diverge from human reasoning. Our findings contribute to the evolving role of AI-assisted learning interventions, offering insights into how student debaters can develop strategies that enhance their argumentation and reasoning skills.
https://arxiv.org/abs/2512.12817
As artificial intelligence (AI) systems become increasingly embedded in our daily life, the ability to recognize and adapt to human emotions is essential for effective human-computer interaction. Facial expression recognition (FER) provides a primary channel for inferring affective states, but the dynamic and culturally nuanced nature of emotions requires models that can learn continuously without forgetting prior knowledge. In this work, we propose a hybrid framework for FER in a continual learning setting that mitigates catastrophic forgetting. Our approach integrates two complementary modalities: deep convolutional features and facial Action Units (AUs) derived from the Facial Action Coding System (FACS). The combined representation is modelled through Bayesian Gaussian Mixture Models (BGMMs), which provide a lightweight, probabilistic solution that avoids retraining while offering strong discriminative power. Using the Compound Facial Expression of Emotion (CFEE) dataset, we show that our model can first learn basic expressions and then progressively recognize compound expressions. Experiments demonstrate improved accuracy, stronger knowledge retention, and reduced forgetting. This framework contributes to the development of emotionally intelligent AI systems with applications in education, healthcare, and adaptive user interfaces.
https://arxiv.org/abs/2512.12277
Understanding emotional responses in children with Autism Spectrum Disorder (ASD) during social interaction remains a critical challenge in both developmental psychology and human-robot interaction. This study presents a novel deep learning pipeline for emotion recognition in autistic children in response to a name-calling event by a humanoid robot (NAO), under controlled experimental settings. The dataset comprises around 50,000 facial frames extracted from video recordings of 15 children with ASD. A hybrid model combines a fine-tuned ResNet-50-based Convolutional Neural Network (CNN) with a three-layer Graph Convolutional Network (GCN), trained on both visual and geometric features extracted from MediaPipe FaceMesh landmarks. Emotions were probabilistically labeled using a weighted ensemble of two models, DeepFace and FER, each contributing to soft-label generation across seven emotion classes. Final classification leveraged a fused embedding optimized via Kullback-Leibler divergence. The proposed method demonstrates robust performance in modeling subtle affective responses and offers significant promise for affective profiling of ASD children in clinical and therapeutic human-robot interaction contexts, as the pipeline effectively captures micro emotional cues in neurodivergent children, addressing a major gap in autism-specific HRI research. This work represents the first such large-scale, real-world dataset and pipeline from India on autism-focused emotion analysis using social robotics, contributing an essential foundation for future personalized assistive technologies.
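A small sketch of the soft-label idea described above: two labelers' per-class probabilities are blended into a weighted ensemble target, and a prediction is scored against it with Kullback-Leibler divergence. The ensemble weights and all probability vectors below are hypothetical stand-ins, not values from the paper:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def soft_labels(p_a, p_b, w_a=0.6, w_b=0.4):
    """Weighted ensemble of two per-class probability vectors (weights assumed)."""
    return [w_a * a + w_b * b for a, b in zip(p_a, p_b)]

# Hypothetical outputs of two labelers (stand-ins for DeepFace and FER)
# over seven emotion classes.
labeler_a = [0.70, 0.10, 0.05, 0.05, 0.04, 0.03, 0.03]
labeler_b = [0.50, 0.30, 0.05, 0.05, 0.04, 0.03, 0.03]
target = soft_labels(labeler_a, labeler_b)

# A model prediction close to the target yields a small KL loss.
model_pred = [0.60, 0.20, 0.05, 0.05, 0.04, 0.03, 0.03]
print(round(kl_divergence(target, model_pred), 4))
```

In the actual pipeline this scalar would be the training objective backpropagated through the fused CNN+GCN embedding; the sketch only shows the loss computation itself.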
https://arxiv.org/abs/2512.12208
Social media serves as a critical medium in modern politics because it both reflects politicians' ideologies and facilitates communication with younger generations. We present MultiParTweet, a multilingual tweet corpus from X that connects politicians' social media discourse with the German political corpus GerParCor, thereby enabling comparative analyses between online communication and parliamentary debates. MultiParTweet contains 39,546 tweets, including 19,056 media items. Furthermore, we enriched the corpus with emotion, sentiment, and topic annotations produced by nine text-based models and one vision-language model (VLM). The automated annotations are evaluated against a manually annotated subset. MultiParTweet can be reconstructed using our tool, TTLABTweetCrawler, which provides a framework for collecting data from X. As a methodological demonstration, we examine whether each model's outputs can be predicted from those of the remaining models. In summary, we provide MultiParTweet, a resource integrating automatic text- and media-based annotations validated against human annotations, and TTLABTweetCrawler, a general-purpose X data-collection tool. Our analysis shows that the models are mutually predictable. In addition, VLM-based annotations were preferred by human annotators, suggesting that multimodal representations align more closely with human interpretation.
https://arxiv.org/abs/2512.11567
Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and therefore lack the semantic grounding and temporal structures needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations based on ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit's coefficients, enabling high-fidelity expressive animation. KeyframeFace and our LLM-based framework together establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at this https URL.
https://arxiv.org/abs/2512.11321
Speech deepfake detection has been widely explored using low-level acoustic descriptors. However, each study tends to select different feature sets, making it difficult to establish a unified representation for the task. Moreover, such features are not intuitive for humans to perceive, as the distinction between bona fide and synthesized speech becomes increasingly subtle with the advancement of deepfake generation techniques. Emotion, on the other hand, remains a unique human attribute that current deepfake generators struggle to fully replicate, reflecting the gap toward true artificial general intelligence. Interestingly, many existing acoustic and semantic features have implicit correlations with emotion. For instance, speech features recognized by automatic speech recognition systems often vary naturally with emotional expression. Based on this insight, we propose a novel training framework that leverages emotion as a bridge between conventional deepfake features and emotion-oriented representations. Experiments on the widely used FakeOrReal and In-the-Wild datasets demonstrate consistent and substantial improvements in accuracy, with increases of up to approximately 6% and 2%, respectively, and in equal error rate (EER), with reductions of up to about 4% and 1%, respectively, while achieving comparable results on ASVspoof2019. This approach provides a unified training strategy for all features and an interpretable feature direction for deepfake detection, while improving model performance through emotion-informed learning.
https://arxiv.org/abs/2512.11241
Incomplete multi-modal emotion recognition (IMER) aims at understanding human intentions and sentiments by comprehensively exploring the partially observed multi-source data. Although the multi-modal data is expected to provide more abundant information, the performance gap and modality under-optimization problem hinder effective multi-modal learning in practice, and are exacerbated in the confrontation of the missing data. To address this issue, we devise a novel Cross-modal Prompting (ComP) method, which emphasizes coherent information by enhancing modality-specific features and improves the overall recognition accuracy by boosting each modality's performance. Specifically, a progressive prompt generation module with a dynamic gradient modulator is proposed to produce concise and consistent modality semantic cues. Meanwhile, cross-modal knowledge propagation selectively amplifies the consistent information in modality features with the delivered prompts to enhance the discrimination of the modality-specific output. Additionally, a coordinator is designed to dynamically re-weight the modality outputs as a complement to the balance strategy to improve the model's efficacy. Extensive experiments on 4 datasets with 7 SOTA methods under different missing rates validate the effectiveness of our proposed method.
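The coordinator's dynamic re-weighting can be sketched in miniature: per-modality outputs are fused with weights derived from confidence scores, so a modality degraded by missing data contributes less. The softmax weighting, modality names, and all numbers below are illustrative assumptions, not the paper's actual coordinator:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def coordinate(modal_logits, confidences):
    """Fuse per-modality class logits with softmax weights over confidences."""
    weights = softmax(confidences)
    n_classes = len(next(iter(modal_logits.values())))
    fused = [0.0] * n_classes
    for w, logits in zip(weights, modal_logits.values()):
        for i, logit in enumerate(logits):
            fused[i] += w * logit
    return fused

# Hypothetical two-class logits; the audio stream is partially missing,
# so its confidence (and hence its weight) is low.
outputs = {"text": [2.0, 0.5], "audio": [0.2, 1.8], "video": [1.5, 0.7]}
fused = coordinate(outputs, confidences=[2.0, 0.1, 1.0])
print([round(x, 3) for x in fused])
```

The down-weighted audio stream barely shifts the fused decision, which is the behavior a balance-strategy complement would want under high missing rates.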
https://arxiv.org/abs/2512.11239
Emotions are central to politics and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze the display of emotions, the emergence of multimodal generative AI promises great advances. However, we lack evidence about the effectiveness of multimodal AI in emotion analysis. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in video-based analysis of emotional arousal in two complementary data sets of human-labeled video recordings. I find that under ideal circumstances, mLLMs' emotional arousal ratings are highly reliable and show little to no indication of demographic bias. However, in recordings of speakers in real-world parliamentary debates, mLLMs' arousal ratings fail to deliver on this promise, with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a suitable, replicable framework.
https://arxiv.org/abs/2512.10882
This study analyzes the emotional tone of dialogue in J. R. R. Tolkien's The Hobbit (1937) using computational text analysis. Dialogue was extracted with regular expressions, then preprocessed, and scored using the NRC-VAD lexicon to quantify emotional dimensions. The results show that the dialogue maintains a generally positive (high valence) and calm (low arousal) tone, with a gradually increasing sense of agency (dominance) as the story progresses. These patterns reflect the novel's emotional rhythm: moments of danger and excitement are regularly balanced by humor, camaraderie, and relief. Visualizations -- including emotional trajectory graphs and word clouds -- highlight how Tolkien's language cycles between tension and comfort. By combining computational tools with literary interpretation, this study demonstrates how digital methods can uncover subtle emotional structures in literature, revealing the steady rhythm and emotional modulation that shape the storytelling in The Hobbit.
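The extraction-and-scoring pipeline described above can be sketched compactly: a regular expression pulls quoted dialogue, and each utterance is averaged against a valence-arousal-dominance lexicon. The three-word lexicon below is a tiny hypothetical stand-in for NRC-VAD, and the sample passage is invented for illustration:

```python
import re

# Hypothetical stand-in for the NRC-VAD lexicon:
# word -> (valence, arousal, dominance), each in [0, 1].
VAD = {
    "danger": (0.20, 0.85, 0.40),
    "friend": (0.90, 0.35, 0.60),
    "home":   (0.85, 0.20, 0.55),
}

def extract_dialogue(text):
    """Return all double-quoted spans, a simple dialogue heuristic."""
    return re.findall(r'"([^"]+)"', text)

def mean_vad(utterance):
    """Average (V, A, D) over lexicon hits; None if no word is covered."""
    words = re.findall(r"[a-z']+", utterance.lower())
    hits = [VAD[w] for w in words if w in VAD]
    if not hits:
        return None
    return tuple(sum(dim) / len(hits) for dim in zip(*hits))

passage = 'Bilbo sighed. "There is no danger here, friend; we are close to home."'
for line in extract_dialogue(passage):
    print(line, mean_vad(line))
```

Tracking these per-utterance averages in narrative order is what yields the emotional-trajectory graphs the study visualizes; preprocessing (negation handling, stopwords, lemmatization) is omitted here.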
https://arxiv.org/abs/2512.10865
This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students' cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data-including textual semantics, prosodic speech features, and temporal behavioral trends-to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.
https://arxiv.org/abs/2512.10441
Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy that strategically filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performance in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.
https://arxiv.org/abs/2512.09636
Understanding how humans and AI systems interpret ambiguous visual stimuli offers critical insight into the nature of perception, reasoning, and decision-making. This paper examines image labeling performance across human participants and deep neural networks, focusing on low-resolution, perceptually degraded stimuli. Drawing from computational cognitive science, cognitive architectures, and connectionist-symbolic hybrid models, we contrast human strategies such as analogical reasoning, shape-based recognition, and confidence modulation with AI's feature-based processing. Grounded in Marr's tri-level hypothesis, Simon's bounded rationality, and Thagard's frameworks of representation and emotion, we analyze participant responses in relation to Grad-CAM visualizations of model attention. Human behavior is further interpreted through cognitive principles modeled in ACT-R and Soar, revealing layered and heuristic decision strategies under uncertainty. Our findings highlight key parallels and divergences between biological and artificial systems in representation, inference, and confidence calibration. The analysis motivates future neuro-symbolic architectures that unify structured symbolic reasoning with connectionist representations. Such architectures, informed by principles of embodiment, explainability, and cognitive alignment, offer a path toward AI systems that are not only performant but also interpretable and cognitively grounded.
https://arxiv.org/abs/2512.09340
Understanding how driver mental states differ between active and autonomous driving is critical for designing safe human-vehicle interfaces. This paper presents the first EEG-based comparison of cognitive load, fatigue, valence, and arousal across the two driving modes. Using data from 31 participants performing identical tasks in both scenarios of three different complexity levels, we analyze temporal patterns, task-complexity effects, and channel-wise activation differences. Our findings show that although both modes evoke similar trends across complexity levels, the intensity of mental states and the underlying neural activation differ substantially, indicating a clear distribution shift between active and autonomous driving. Transfer-learning experiments confirm that models trained on active driving data generalize poorly to autonomous driving and vice versa. We attribute this distribution shift primarily to differences in motor engagement and attentional demands between the two driving modes, which lead to distinct spatial and temporal EEG activation patterns. Although autonomous driving results in lower overall cortical activation, participants continue to exhibit measurable fluctuations in cognitive load, fatigue, valence, and arousal associated with readiness to intervene, task-evoked emotional responses, and monotony-related passive fatigue. These results emphasize the need for scenario-specific data and models when developing next-generation driver monitoring systems for autonomous vehicles.
https://arxiv.org/abs/2512.09190
Cognitive trust, the belief that a robot is capable of accurately performing tasks, is recognized as a central factor in fostering high-quality human-robot interactions. It is well established that performance factors such as the robot's competence and its reliability shape cognitive trust. Recent studies suggest that affective factors, such as robotic attentiveness, also play a role in building cognitive trust. This work explores the interplay between these two factors that shape cognitive trust. Specifically, we evaluated whether different combinations of robotic competence and attentiveness introduce a compensatory mechanism, where one factor compensates for the lack of the other. In the experiment, participants performed a search task with a robotic dog in a 2x2 experimental design that included two factors: competence (high or low) and attentiveness (high or low). The results revealed that high attentiveness can compensate for low competence. Participants who collaborated with a highly attentive robot that performed poorly reported trust levels comparable to those working with a highly competent robot. When the robot did not demonstrate attentiveness, low competence resulted in a substantial decrease in cognitive trust. The findings indicate that building cognitive trust in human-robot interaction may be more complex than previously believed, involving emotional processes that are typically overlooked. We highlight an affective compensatory mechanism that adds a layer to consider alongside traditional competence-based models of cognitive trust.
https://arxiv.org/abs/2512.09105
A remote robot operator's affective state can significantly impact the resulting robot's motions, leading to unexpected consequences even when the user follows protocol and performs permitted tasks. The recognition of an operator's affective states in remote robot control scenarios is, however, underexplored. Current emotion recognition methods rely on reading the user's vital signs or body language, but the devices and user participation these measures require would add limitations to remote robot control. We demonstrate that the functional movements of a remote-controlled robotic avatar, which was not designed for emotional expression, can be used to infer the emotional state of the human operator via a machine-learning system. Specifically, our system achieved 83.3% accuracy in recognizing the user's emotional state expressed through the robot movements produced by their hand motions. We discuss the implications of this system for prominent current and future remote robot operation and affective robotics contexts.
https://arxiv.org/abs/2512.09086
Narratives about artificial intelligence (AI) entangle autonomy, the capacity to self-govern, with sentience, the capacity to sense and feel. AI agents that perform tasks autonomously and companions that recognize and express emotions may activate mental models of autonomy and sentience, respectively, provoking distinct reactions. To examine this possibility, we conducted three pilot studies (N = 374) and four preregistered vignette experiments describing an AI as autonomous, sentient, both, or neither (N = 2,702). Activating a mental model of sentience increased general mind perception (cognition and emotion) and moral consideration more than autonomy, but autonomy increased perceived threat more than sentience. Sentience also increased perceived autonomy more than vice versa. Based on a within-paper meta-analysis, sentience changed reactions more than autonomy on average. By disentangling different mental models of AI, we can study human-AI interaction with more precision to better navigate the detailed design of anthropomorphized AI and prompting interfaces.
https://arxiv.org/abs/2512.09085
Music improvisation is fascinating to study, being essentially a live demonstration of a creative process. In jazz, musicians often improvise across predefined chord progressions (leadsheets). How do we assess the creativity of jazz improvisations? And can we capture this in automated metrics for creativity for current LLM-based generative systems? Demonstration of emotional involvement is closely linked with creativity in improvisation. Analysing musical audio, can we detect emotional involvement? This study hypothesises that if an improvisation contains more evidence of emotion-laden content, it is more likely to be recognised as creative. An embeddings-based method is proposed for capturing the emotional content in musical improvisations, using a psychologically-grounded classification of musical characteristics associated with emotions. Resulting 'emovectors' are analysed to test the above hypothesis, comparing across multiple improvisations. Capturing emotional content in this quantifiable way can contribute towards new metrics for creativity evaluation that can be applied at scale.
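Comparing 'emovectors' across improvisations can be sketched as vector similarity over emotion-linked musical characteristics. The feature dimensions, values, and use of cosine similarity below are illustrative assumptions; the paper's actual emovectors come from an embeddings-based method with a psychologically grounded feature classification:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical emovectors: each dimension is an emotion-linked musical
# characteristic (e.g. tempo variability, mode, dynamic range, articulation).
emovector_a = [0.8, 0.3, 0.9, 0.6]  # an expressive, dynamic solo
emovector_b = [0.7, 0.4, 0.8, 0.5]  # a similarly expressive performance
emovector_c = [0.1, 0.9, 0.2, 0.1]  # a flatter, darker one

print(round(cosine(emovector_a, emovector_b), 3))
print(round(cosine(emovector_a, emovector_c), 3))
```

Under the study's hypothesis, improvisations whose emovectors carry more emotion-laden mass (and cluster with other expressive performances) would be the ones more likely rated creative; a similarity measure like this is one way to compare across multiple improvisations at scale.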
https://arxiv.org/abs/2512.08812
Agentic LLM frameworks promise autonomous behavior via task decomposition, tool use, and iterative planning, but most deployed systems remain brittle. They lack runtime introspection, cannot diagnose their own failure modes, and do not improve over time without human intervention. In practice, many agent stacks degrade into decorated chains of LLM calls with no structural mechanisms for reliability. We present VIGIL (Verifiable Inspection and Guarded Iterative Learning), a reflective runtime that supervises a sibling agent and performs autonomous maintenance rather than task execution. VIGIL ingests behavioral logs, appraises each event into a structured emotional representation, maintains a persistent EmoBank with decay and contextual policies, and derives an RBT diagnosis that sorts recent behavior into strengths, opportunities, and failures. From this analysis, VIGIL generates both guarded prompt updates that preserve core identity semantics and read-only code proposals produced by a strategy engine that operates on log evidence and code hotspots. VIGIL functions as a state-gated pipeline. Illegal transitions produce explicit errors rather than allowing the LLM to improvise. In a reminder latency case study, VIGIL identified elevated lag, proposed prompt and code repairs, and when its own diagnostic tool failed due to a schema conflict, it surfaced the internal error, produced a fallback diagnosis, and emitted a repair plan. This demonstrates meta-level self-repair in a deployed agent runtime.
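The state-gating idea, where an illegal transition raises an explicit error instead of letting the model improvise, can be shown in a few lines. The stage names and transition table below are hypothetical, not VIGIL's actual states:

```python
class IllegalTransition(RuntimeError):
    """Raised when a pipeline stage is entered out of order."""

# Hypothetical stage graph: each stage maps to the set of stages it may enter.
ALLOWED = {
    "ingest_logs":     {"appraise"},
    "appraise":        {"diagnose"},
    "diagnose":        {"propose_repairs"},
    "propose_repairs": set(),  # terminal stage
}

class GatedPipeline:
    def __init__(self, start="ingest_logs"):
        self.state = start

    def advance(self, next_state):
        # The gate: an unlisted transition fails loudly instead of proceeding.
        if next_state not in ALLOWED.get(self.state, set()):
            raise IllegalTransition(f"{self.state} -> {next_state}")
        self.state = next_state
        return self.state

p = GatedPipeline()
p.advance("appraise")
p.advance("diagnose")
try:
    p.advance("ingest_logs")  # jumping backward is not in the table
except IllegalTransition as e:
    print("blocked:", e)
```

The explicit exception is the point: downstream machinery (or the supervising runtime) sees a structured error it can diagnose and repair, rather than silently continuing from an inconsistent state.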
https://arxiv.org/abs/2512.07094