This paper presents our system for SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection (Track A), which focuses on multi-label emotion detection in short texts. We propose a feature-centric framework that dynamically adapts document representations and learning algorithms to optimize language-specific performance. Our study evaluates three key components (document representation, dimensionality reduction, and model training) across 28 languages, highlighting five for detailed analysis. The results show that TF-IDF remains highly effective for low-resource languages, while subword embeddings such as FastText and transformer-based document representations like those produced by Sentence-BERT exhibit language-specific strengths. Principal Component Analysis (PCA) reduces training time without compromising performance, particularly benefiting FastText and neural models such as Multi-Layer Perceptrons (MLP). Computational efficiency analysis underscores the trade-off between model complexity and processing cost. Our framework provides a scalable solution for multilingual emotion detection, addressing the challenges of linguistic diversity and resource constraints.
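To make the described pipeline concrete, here is a minimal scikit-learn sketch of this kind of feature-centric setup (TF-IDF features, a PCA-style reduction, and an MLP trained for multi-label emotion detection). It is an illustration under our own assumptions, not the paper's released system; TruncatedSVD stands in for PCA because it operates directly on sparse TF-IDF matrices.

```python
# Sketch: TF-IDF document vectors -> PCA-style reduction -> multi-label MLP.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier

texts = ["I am so happy today!", "This is terrifying and sad."]   # toy corpus
labels = [[1, 0, 0, 0, 0], [0, 1, 1, 0, 0]]                       # joy / fear / sadness / anger / surprise

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("reduce", TruncatedSVD(n_components=2)),                     # sparse-friendly PCA analogue
    ("mlp", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)),  # accepts multi-label targets
])
pipeline.fit(texts, labels)
print(pipeline.predict(["What a wonderful surprise"]))
```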
https://arxiv.org/abs/2507.08499
Pain is an inherent part of human existence, manifesting as both physical and emotional experiences, and can be categorized as either acute or chronic. Over the years, extensive research has been conducted to understand the causes of pain and explore potential treatments, with contributions from various scientific disciplines. However, earlier studies often overlooked the role of gender in pain experiences. In this study, we utilized Natural Language Processing (NLP) to analyze and gain deeper insights into individuals' pain experiences, with a particular focus on gender differences. We successfully classified posts into male and female corpora using the Hidden Attribute Model-Convolutional Neural Network (HAM-CNN), achieving an F1 score of 0.86 by aggregating posts based on usernames. Our analysis revealed linguistic differences between genders, with female posts tending to be more emotionally focused. Additionally, the study highlighted that conditions such as migraine and sinusitis are more prevalent among females and explored how pain medication affects individuals differently based on gender.
https://arxiv.org/abs/2507.08241
Emotion and intent recognition from speech is essential and has been widely investigated in human-computer interaction. The rapid development of social media platforms, chatbots, and other technologies has led to a large volume of speech data streaming from users. Nevertheless, annotating such data manually is expensive, making it challenging to train machine learning models for recognition purposes. To this end, we propose applying semi-supervised learning to incorporate large-scale unlabelled data alongside a relatively smaller set of labelled data. We train end-to-end acoustic and linguistic models, each employing multi-task learning for emotion and intent recognition. Two semi-supervised learning approaches, fix-match learning and full-match learning, are compared. The experimental results demonstrate that the semi-supervised learning approaches improve model performance in speech emotion and intent recognition from both acoustic and text data. The late fusion of the best models outperforms the acoustic and text baselines by joint recognition balance metrics of 12.3% and 10.4%, respectively.
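For readers unfamiliar with the fix-match idea, the following PyTorch-style sketch shows the standard FixMatch-style objective on unlabelled data: confidence-thresholded pseudo-labels from a weakly augmented view supervise a strongly augmented view. The names `model`, `weak_aug`, `strong_aug`, and the threshold `tau` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def fixmatch_loss(model, x_unlabelled, weak_aug, strong_aug, tau=0.95):
    # 1) Pseudo-label from a weakly augmented view, without gradients.
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x_unlabelled)), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= tau).float()          # keep only confident predictions
    # 2) Train the strongly augmented view to match the pseudo-label.
    logits = model(strong_aug(x_unlabelled))
    per_example = F.cross_entropy(logits, pseudo, reduction="none")
    return (mask * per_example).mean()
```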
https://arxiv.org/abs/2507.07806
The growing need for psychological support due to increasing pressures has exposed the scarcity of relevant datasets, particularly in non-English languages. To address this, we propose a framework that leverages limited real-world data and expert knowledge to fine-tune two large language models: Dialog Generator and Dialog Modifier. The Generator creates large-scale psychological counseling dialogues based on predefined paths, which guide system response strategies and user interactions, forming the basis for effective support. The Modifier refines these dialogues to align with real-world data quality. Through both automated and manual review, we construct the Chinese Psychological support Dialogue Dataset (CPsDD), containing 68K dialogues across 13 groups, 16 psychological problems, 13 causes, and 12 support focuses. Additionally, we introduce the Comprehensive Agent Dialogue Support System (CADSS), where a Profiler analyzes user characteristics, a Summarizer condenses dialogue history, a Planner selects strategies, and a Supporter generates empathetic responses. The experimental results of the Strategy Prediction and Emotional Support Conversation (ESC) tasks demonstrate that CADSS achieves state-of-the-art performance on both CPsDD and ESConv datasets.
https://arxiv.org/abs/2507.07509
Nowadays, speech emotion recognition (SER) plays a vital role in the field of human-computer interaction (HCI) and the evolution of artificial intelligence (AI). Our proposed DCRF-BiLSTM model recognizes seven emotions: neutral, happy, sad, angry, fear, disgust, and surprise, and is trained on five datasets: RAVDESS (R), TESS (T), SAVEE (S), EmoDB (E), and CREMA-D (C). The model achieves high accuracy on individual datasets, including 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% on CREMA-D, and a perfect 100% on both TESS and EmoDB. For the combined (R+T+S) datasets, it achieves 98.82% accuracy, outperforming previously reported results. To our knowledge, no existing study has evaluated a single SER model across all five benchmark datasets (i.e., R+T+S+C+E) simultaneously. In our work, we introduce this comprehensive combination and achieve a remarkable overall accuracy of 93.76%. These results confirm the robustness and generalizability of our DCRF-BiLSTM framework across diverse datasets.
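As a rough illustration of the recurrent backbone such a model relies on, the sketch below builds a plain BiLSTM classifier over acoustic frames for the seven emotion classes; the DCRF component and the paper's training details are omitted, and all dimensions are assumed.

```python
import torch
import torch.nn as nn

class BiLSTMEmotionClassifier(nn.Module):
    def __init__(self, n_features=40, hidden=128, n_emotions=7):
        super().__init__()
        self.bilstm = nn.LSTM(n_features, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, frames):                 # frames: (batch, time, n_features)
        out, _ = self.bilstm(frames)
        return self.head(out.mean(dim=1))      # pool over time, then classify

logits = BiLSTMEmotionClassifier()(torch.randn(4, 300, 40))   # toy batch of MFCC-like frames
print(logits.shape)                                           # torch.Size([4, 7])
```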
https://arxiv.org/abs/2507.07046
Multi-modal emotion recognition has garnered increasing attention in recent years owing to its significant role in human-computer interaction (HCI). Because different discrete emotions may exist at the same time, emotion distribution learning (EDL), which identifies a mixture of basic emotions, has gradually emerged as a trend over single-class emotion recognition. However, existing EDL methods face challenges in mining the heterogeneity among multiple modalities. Besides, rich semantic correlations across arbitrary basic emotions are not fully exploited. In this paper, we propose a multi-modal emotion distribution learning framework, named HeLo, aimed at fully exploring the heterogeneity and complementary information in multi-modal emotional data and label correlation within mixed basic emotions. Specifically, we first adopt cross-attention to effectively fuse the physiological data. Then, an optimal transport (OT)-based heterogeneity mining module is devised to mine the interaction and heterogeneity between the physiological and behavioral representations. To facilitate label correlation learning, we introduce a learnable label embedding optimized by correlation matrix alignment. Finally, the learnable label embeddings and label correlation matrices are integrated with the multi-modal representations through a novel label correlation-driven cross-attention mechanism for accurate emotion distribution learning. Experimental results on two publicly available datasets demonstrate the superiority of our proposed method in emotion distribution learning.
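The cross-attention fusion step can be pictured with a short PyTorch sketch in which one physiological stream attends to another; the modality names and dimensions below are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
eeg = torch.randn(8, 32, 64)          # (batch, EEG tokens, features) -- assumed shapes
peripheral = torch.randn(8, 16, 64)   # (batch, peripheral-signal tokens, features)

# EEG tokens query the peripheral stream, yielding fused physiological features.
fused, _ = attn(query=eeg, key=peripheral, value=peripheral)
print(fused.shape)                    # torch.Size([8, 32, 64])
```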
https://arxiv.org/abs/2507.06821
Ensuring safety in human-robot interaction (HRI) is essential to foster user trust and enable the broader adoption of robotic systems. Traditional safety models primarily rely on sensor-based measures, such as relative distance and velocity, to assess physical safety. However, these models often fail to capture subjective safety perceptions, which are shaped by individual traits and contextual factors. In this paper, we introduce and analyze a parameterized general safety model that bridges the gap between physical and perceived safety by incorporating a personalization parameter, $\rho$, into the safety measurement framework to account for individual differences in safety perception. Through a series of hypothesis-driven human-subject studies in a simulated rescue scenario, we investigate how emotional state, trust, and robot behavior influence perceived safety. Our results show that $\rho$ effectively captures meaningful individual differences, driven by affective responses, trust in task consistency, and clustering into distinct user types. Specifically, our findings confirm that predictable and consistent robot behavior, as well as the elicitation of positive emotional states, significantly enhances perceived safety. Moreover, responses cluster into a small number of user types, supporting adaptive personalization based on shared safety models. Notably, participant role significantly shapes safety perception, and repeated exposure reduces perceived safety for participants in the casualty role, emphasizing the impact of physical interaction and experiential change. These findings highlight the importance of adaptive, human-centered safety models that integrate both psychological and behavioral dimensions, offering a pathway toward more trustworthy and effective HRI in safety-critical domains.
https://arxiv.org/abs/2507.06700
Recent breakthroughs in singing voice synthesis (SVS) have heightened the demand for high-quality annotated datasets, yet manual annotation remains prohibitively labor-intensive and resource-intensive. Existing automatic singing annotation (ASA) methods, however, primarily tackle isolated aspects of the annotation pipeline. To address this fundamental challenge, we present STARS, which is, to our knowledge, the first unified framework that simultaneously addresses singing transcription, alignment, and refined style annotation. Our framework delivers comprehensive multi-level annotations encompassing: (1) precise phoneme-audio alignment, (2) robust note transcription and temporal localization, (3) expressive vocal technique identification, and (4) global stylistic characterization including emotion and pace. The proposed architecture employs hierarchical acoustic feature processing across frame, word, phoneme, note, and sentence levels. The novel non-autoregressive local acoustic encoders enable structured hierarchical representation learning. Experimental validation confirms the framework's superior performance across multiple evaluation dimensions compared to existing annotation approaches. Furthermore, applications in SVS training demonstrate that models utilizing STARS-annotated data achieve significantly enhanced perceptual naturalness and precise style control. This work not only overcomes critical scalability challenges in the creation of singing datasets but also pioneers new methodologies for controllable singing voice synthesis. Audio samples are available at this https URL.
https://arxiv.org/abs/2507.06670
This project introduces a new measure of elite polarization via actor and subject detection using artificial intelligence. I identify when politicians mention one another in parliamentary speeches, note who is speaking and who is being addressed, and assess the emotional temperature behind these evaluations. This maps how elites evaluate their various out-parties, allowing us to create an index of mutual out-party hostility, that is, elite polarization. While I analyzed polarization data over the past four decades for the UK, and two decades for Hungary and Italy, my approach lays the groundwork for a twenty-year, EU-wide time-series dataset on elite polarization. The results can be aggregated by party and quarter. The resulting index demonstrates good face validity: it reacts to events such as electoral campaigns, country- and party-level crises, and to parties losing and assuming power.
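As an illustration of the aggregation step (our own sketch, not the author's pipeline), the pandas snippet below turns mention-level hostility scores into a party-by-quarter index.

```python
import pandas as pd

# Toy mention-level data: who spoke, which out-party was addressed, and the
# emotional temperature of the evaluation (higher = more hostile).
mentions = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-10", "2023-02-03", "2023-04-21"]),
    "speaker_party": ["A", "A", "B"],
    "target_party": ["B", "B", "A"],
    "hostility": [0.8, 0.4, 0.9],
})

mentions["quarter"] = mentions["date"].dt.to_period("Q")
index = (mentions.groupby(["speaker_party", "quarter"])["hostility"]
         .mean()
         .rename("out_party_hostility"))
print(index)
```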
https://arxiv.org/abs/2507.06658
This study investigates how stylized, voiced agents shape user interaction in a multimodal language learning environment. We conducted a mixed-methods evaluation of 54 participants interacting with anime-inspired characters powered by large language models and expressive text-to-speech synthesis. These agents responded in Japanese character language, offering users asynchronous, semi-structured conversation in varying speech styles and emotional tones. We analyzed user engagement patterns, perceived usability, emotional responses, and learning behaviors, with particular attention to how agent stylization influenced interaction across language proficiency levels and cultural backgrounds. Our findings reveal that agent design, especially voice, persona, and linguistic style, substantially affected user experience, motivation, and strategy. This work contributes to the understanding of affective, culturally stylized agents in human-agent interaction and offers guidance for designing more engaging, socially responsive systems.
https://arxiv.org/abs/2507.06483
In recent years, affective computing and its applications have become a fast-growing research topic. Despite significant advancements, the lack of affective multi-modal datasets remains a major bottleneck in developing accurate emotion recognition systems. Furthermore, the use of contact-based devices during emotion elicitation often unintentionally influences the emotional experience, reducing or altering the genuine spontaneous emotional response. This limitation highlights the need for methods capable of extracting affective cues from multiple modalities without physical contact, such as remote physiological emotion recognition. To address this, we present the Contactless Affective States Through Physiological Signals Database (CAST-Phys), a novel high-quality dataset explicitly designed for multi-modal remote physiological emotion recognition using facial and physiological cues. The dataset includes diverse physiological signals, such as photoplethysmography (PPG), electrodermal activity (EDA), and respiration rate (RR), alongside high-resolution uncompressed facial video recordings, enabling the potential for remote signal recovery. Our analysis highlights the crucial role of physiological signals in realistic scenarios where facial expressions alone may not provide sufficient emotional information. Furthermore, we demonstrate the potential of remote multi-modal emotion recognition by evaluating the impact of individual and fused modalities, showcasing its effectiveness in advancing contactless emotion recognition technologies.
https://arxiv.org/abs/2507.06080
Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs-including text descriptions and reference expression images-to guide the generation of user-specified facial expressions. With MetaHuman as the primary target platform, our generated results can be conveniently integrated into the industrial production pipeline.
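The cross-reconstruction step can be pictured with a short, schematic PyTorch sketch; it assumes paired clips that share spoken content but differ in emotion, and the encoders, decoder, and dimensions are toy stand-ins rather than the MEDTalk architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

content_enc, emotion_enc = nn.Linear(128, 32), nn.Linear(128, 16)
decoder = nn.Linear(32 + 16, 128)

clip_a, clip_b = torch.randn(4, 128), torch.randn(4, 128)   # toy motion features
c_a, e_a = content_enc(clip_a), emotion_enc(clip_a)
c_b, e_b = content_enc(clip_b), emotion_enc(clip_b)

# If the clips share content, content of A + emotion of B should reconstruct clip B
# (and vice versa), which pushes content and emotion into separate embedding spaces.
recon_b = decoder(torch.cat([c_a, e_b], dim=-1))
recon_a = decoder(torch.cat([c_b, e_a], dim=-1))
loss = F.mse_loss(recon_b, clip_b) + F.mse_loss(recon_a, clip_a)
loss.backward()
print(loss.item())
```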
https://arxiv.org/abs/2507.06071
This paper describes the approach of the Unibuc - NLP team in tackling the SemEval 2025 Workshop, Task 11: Bridging the Gap in Text-Based Emotion Detection. We mainly focused on experiments using large language models (Gemini, Qwen, DeepSeek) with either few-shot prompting or fine-tuning. With our final system, for the multi-label emotion detection track (track A), we got an F1-macro of $0.7546$ (26/96 teams) for the English subset, $0.1727$ (35/36 teams) for the Portuguese (Mozambican) subset and $0.325$ (\textbf{1}/31 teams) for the Emakhuwa subset.
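A hedged sketch of the kind of few-shot prompt such a system might use is shown below; the label set, examples, and template are illustrative, and the calls to the actual LLMs (Gemini, Qwen, DeepSeek) are omitted.

```python
# Illustrative few-shot prompt construction for multi-label emotion detection.
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]   # assumed label set

FEW_SHOT = [
    ("I can't believe they cancelled the show again.", ["anger", "sadness"]),
    ("We won the finals!!!", ["joy", "surprise"]),
]

def build_prompt(text: str) -> str:
    lines = [f"Label the text with all applicable emotions from: {', '.join(EMOTIONS)}."]
    for example_text, example_labels in FEW_SHOT:
        lines.append(f"Text: {example_text}\nEmotions: {', '.join(example_labels)}")
    lines.append(f"Text: {text}\nEmotions:")
    return "\n\n".join(lines)

print(build_prompt("Why would you say something like that to me?"))
```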
https://arxiv.org/abs/2507.05918
This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language model-based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly computes the rewards based on neural codec tokens, rather than relying on synthesized audio. Furthermore, we employ the Gumbel-Softmax technique to render the reward function differentiable, thereby streamlining the RLHF training process. Additionally, we introduce a multi-task reward (MTR) model which can provide feedback from different perspectives and find that it can augment the system's capability to follow instructions. Experimental results indicate that DiffRO significantly improves the pronunciation accuracy of the TTS system, achieving state-of-the-art (SOTA) WER results on the seed-tts-eval benchmark. Moreover, with the integration of the MTR model, we demonstrate the ability to control emotional and quality attributes in a zero-shot manner.
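The Gumbel-Softmax trick that makes the reward differentiable can be illustrated in a few lines of PyTorch: straight-through soft samples over codec tokens let gradients from a token-level reward reach the TTS logits. The codebook and the scalar reward below are placeholders, not the DiffRO reward model.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 10, 512, requires_grad=True)           # (batch, steps, codebook size)
soft_tokens = F.gumbel_softmax(logits, tau=1.0, hard=True)      # one-hot forward, soft backward

codebook = torch.randn(512, 64)             # embedding table for codec tokens
token_embeddings = soft_tokens @ codebook   # differentiable token lookup
reward = token_embeddings.mean()            # placeholder for a learned reward model
reward.backward()                           # gradients flow back into the TTS logits
print(logits.grad.shape)
```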
https://arxiv.org/abs/2507.05911
Creating a cast of characters by attending to their relational dynamics is a critical aspect of most long-form storywriting. However, our formative study (N=14) reveals that writers struggle to envision new characters that could influence existing ones, to balance similarities and differences among characters, and to intricately flesh out their relationships. Based on these observations, we designed Constella, an LLM-based multi-agent tool that supports storywriters' interconnected character creation process. Constella suggests related characters (FRIENDS DISCOVERY feature), reveals the inner mindscapes of several characters simultaneously (JOURNALS feature), and manifests relationships through inter-character responses (COMMENTS feature). Our 7-8 day deployment study with storywriters (N=11) shows that Constella enabled the creation of expansive communities composed of related characters, facilitated the comparison of characters' thoughts and emotions, and deepened writers' understanding of character relationships. We conclude by discussing how multi-agent interactions can help distribute writers' attention and effort across the character cast.
https://arxiv.org/abs/2507.05820
Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs' intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model's ability to predict ROP and its bias patterns. Empirical results derived from the CROP dataset yield three principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing contributes to mitigating predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and emphasize the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems.
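To illustrate what prompt-level affective framing can look like (our own sketch, not the released Affective-ROPTester code), the snippet below prepends a positive or negative emotional frame to an instruction-style risk prompt.

```python
# Illustrative affective framings prepended to an instruction-style ROP risk prompt.
FRAMINGS = {
    "neutral":  "",
    "positive": "You are a calm, confident neonatologist. ",
    "negative": "You are an anxious clinician worried about missing a severe case. ",
}

def build_prompt(record: str, framing: str = "neutral") -> str:
    return (FRAMINGS[framing]
            + "Classify the retinopathy of prematurity risk of the following "
              "admission record as low, medium, or high.\n\n" + record)

print(build_prompt("Gestational age 29 weeks, birth weight 1200 g.", "positive"))
```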
https://arxiv.org/abs/2507.05816
The exponential growth of social media and generative AI has transformed information dissemination, fostering connectivity but also accelerating the spread of misinformation. Understanding information propagation dynamics and developing effective control strategies is essential to mitigate harmful content. Traditional models, such as SIR, provide basic insights but inadequately capture the complexities of online interactions. Advanced methods, including attention mechanisms and graph neural networks, enhance accuracy but typically overlook user psychology and behavioral dynamics. Large language models (LLMs), with their human-like reasoning, offer new potential for simulating psychological aspects of information spread. We introduce an LLM-based simulation environment capturing agents' evolving attitudes, emotions, and responses. Initial experiments, however, revealed significant gaps between LLM-generated behaviors and authentic human dynamics, especially in stance detection and psychological realism. A detailed evaluation through Social Information Processing Theory identified major discrepancies in goal-setting and feedback evaluation, stemming from the lack of emotional processing in standard LLM training. To address these issues, we propose the Social Information Processing-based Chain of Thought (SIP-CoT) mechanism enhanced by emotion-guided memory. This method improves the interpretation of social cues, personalization of goals, and evaluation of feedback. Experimental results confirm that SIP-CoT-enhanced LLM agents more effectively process social information, demonstrating behaviors, attitudes, and emotions closer to real human interactions. In summary, this research highlights critical limitations in current LLM-based propagation simulations and demonstrates how integrating SIP-CoT and emotional memory significantly enhances the social intelligence and realism of LLM agents.
https://arxiv.org/abs/2507.05638
Multimodal emotion and intent recognition is essential for automated human-computer interaction. It aims to analyze users' speech, text, and visual information to predict their emotions or intent. One of the significant challenges is missing modalities caused by sensor malfunctions or incomplete data. Traditional methods that attempt to reconstruct missing information often suffer from over-coupling and imprecise generation processes, leading to suboptimal outcomes. To address these issues, we introduce an Attention-based Diffusion model for Missing Modalities feature Completion (ADMC). Our framework independently trains feature extraction networks for each modality, preserving their unique characteristics and avoiding over-coupling. The Attention-based Diffusion Network (ADN) generates missing modality features that closely align with the authentic multimodal distribution, enhancing performance across all missing-modality scenarios. Moreover, ADN's cross-modal generation offers improved recognition even in full-modality contexts. Our approach achieves state-of-the-art results on the IEMOCAP and MIntRec benchmarks, demonstrating its effectiveness in both missing and complete modality scenarios.
https://arxiv.org/abs/2507.05624
According to what we call the Emotional Alignment Design Policy, artificial entities should be designed to elicit emotional reactions from users that appropriately reflect the entities' capacities and moral status, or lack thereof. This principle can be violated in two ways: by designing an artificial system that elicits stronger or weaker emotional reactions than its capacities and moral status warrant (overshooting or undershooting), or by designing a system that elicits the wrong type of emotional reaction (hitting the wrong target). Although the policy is presumably attractive, its practical implementation faces several challenges, including: How can we respect user autonomy while promoting appropriate responses? How should we navigate expert and public disagreement and uncertainty about facts and values? What if emotional alignment seems to require creating or destroying entities with moral status? To what extent should designs conform to versus attempt to alter user assumptions and attitudes?
https://arxiv.org/abs/2507.06263
The U.S. Supreme Court's 2022 ruling in Dobbs v. Jackson Women's Health Organization marked a turning point in the national debate over reproductive rights. While the ideological divide over abortion is well documented, less is known about how gender and local sociopolitical contexts interact to shape public discourse. Drawing on nearly 10 million abortion-related posts on X (formerly Twitter) from users with inferred gender, ideology and location, we show that gender significantly moderates abortion attitudes and emotional expression, particularly in conservative regions, and independently of ideology. This creates a gender gap in abortion attitudes that grows more pronounced in conservative regions. The leak of the Dobbs draft opinion further intensified online engagement, disproportionately mobilizing pro-abortion women in areas where access was under threat. These findings reveal that abortion discourse is not only ideologically polarized but also deeply structured by gender and place, highlighting the central role of identity in shaping political expression during moments of institutional disruption.
https://arxiv.org/abs/2507.05443