Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.
https://arxiv.org/abs/2503.14295
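To make the control described in the PC-Talk abstract concrete, here is a minimal sketch of how lip-audio and emotion controls could be composed as additive offsets on implicit facial keypoints. This is an illustration only, not the paper's implementation: the keypoint count, the region masks, and the scalar loudness/intensity controls are all assumptions.

```python
import numpy as np

NUM_KPTS = 21  # assumed number of implicit keypoints (illustrative)

def compose_keypoints(base_kpts, lip_offsets, emo_offsets,
                      loudness=1.0, emo_intensity=1.0, lip_mask=None, emo_mask=None):
    """Combine audio-driven lip offsets and pure emotional offsets on top of base keypoints.

    base_kpts:   (N, 3) neutral implicit keypoints
    lip_offsets: (N, 3) audio-aligned lip deformation
    emo_offsets: (N, 3) emotional deformation
    loudness / emo_intensity: scalar controls for lip-movement scale and emotion strength
    lip_mask / emo_mask: optional (N,) masks restricting each control to facial regions
    """
    lip_mask = np.ones(len(base_kpts)) if lip_mask is None else lip_mask
    emo_mask = np.ones(len(base_kpts)) if emo_mask is None else emo_mask
    return (base_kpts
            + loudness * lip_mask[:, None] * lip_offsets
            + emo_intensity * emo_mask[:, None] * emo_offsets)

# Toy usage: louder speech plus a half-intensity emotion applied to the upper face only.
base = np.zeros((NUM_KPTS, 3))
lips = np.random.randn(NUM_KPTS, 3) * 0.01
emo = np.random.randn(NUM_KPTS, 3) * 0.01
upper_face = np.array([1.0] * 10 + [0.0] * 11)  # assumed region mask
out = compose_keypoints(base, lips, emo, loudness=1.5, emo_intensity=0.5, emo_mask=upper_face)
```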
Visual perceptual tasks aim to predict human judgment of images (e.g., emotions invoked by images, image quality assessment). Unlike objective tasks such as object/scene recognition, perceptual tasks rely on subjective human assessments, making their data labeling difficult. The scarcity of such human-annotated data results in small datasets, leading to poor generalization. Typically, specialized models have been designed for each perceptual task, tailored to its unique characteristics and its own training dataset. We propose a unified architectural framework for solving multiple different perceptual tasks leveraging CLIP as a prior. Our approach is based on recent cognitive findings which indicate that CLIP correlates well with human judgment. While CLIP was explicitly trained to align images and text, it implicitly also learned human inclinations. We attribute this to the inclusion of human-written image captions in CLIP's training data, which contain not only factual image descriptions, but inevitably also human sentiments and emotions. This makes CLIP a particularly strong prior for perceptual tasks. Accordingly, we suggest that minimal adaptation of CLIP suffices for solving a variety of perceptual tasks. Our simple unified framework employs a lightweight adaptation to fine-tune CLIP to each task, without requiring any task-specific architectural changes. We evaluate our approach on three tasks: (i) Image Memorability Prediction, (ii) No-reference Image Quality Assessment, and (iii) Visual Emotion Analysis. Our model achieves state-of-the-art results on all three tasks, while demonstrating improved generalization across different datasets.
https://arxiv.org/abs/2503.13260
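The "frozen CLIP backbone plus lightweight task head" recipe described above can be sketched as follows, assuming the open-source OpenAI CLIP package; the head architecture, loss, and training step are placeholder choices rather than the paper's exact adaptation.

```python
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone, preprocess = clip.load("ViT-B/32", device=device)
for p in backbone.parameters():          # keep the CLIP prior frozen
    p.requires_grad = False

# Lightweight head regressing a scalar perceptual score (memorability, quality, emotion, ...).
head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1)).to(device)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def train_step(images, scores):
    """images: preprocessed batch (B, 3, 224, 224); scores: (B,) human judgments."""
    with torch.no_grad():
        feats = backbone.encode_image(images.to(device)).float()
    pred = head(feats).squeeze(-1)
    loss = nn.functional.mse_loss(pred, scores.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```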
This paper introduces MAVEN (Multi-modal Attention for Valence-Arousal Emotion Network), a novel architecture for dynamic emotion recognition through dimensional modeling of affect. The model uniquely integrates visual, audio, and textual modalities via a bi-directional cross-modal attention mechanism with six distinct attention pathways, enabling comprehensive interactions between all modality pairs. Our proposed approach employs modality-specific encoders to extract rich feature representations from synchronized video frames, audio segments, and transcripts. The architecture's novelty lies in its cross-modal enhancement strategy, where each modality representation is refined through weighted attention from other modalities, followed by self-attention refinement through modality-specific encoders. Rather than directly predicting valence-arousal values, MAVEN predicts emotions in a polar coordinate form, aligning with psychological models of the emotion circumplex. Experimental evaluation on the Aff-Wild2 dataset demonstrates the effectiveness of our approach, with performance measured using Concordance Correlation Coefficient (CCC). The multi-stage architecture demonstrates superior ability to capture the complex, nuanced nature of emotional expressions in conversational videos, advancing the state-of-the-art (SOTA) in continuous emotion recognition in-the-wild. Code can be found at: this https URL.
https://arxiv.org/abs/2503.12623
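The polar-coordinate output described in the MAVEN abstract can be illustrated with a small head that predicts an (intensity, angle) pair and converts it back to valence-arousal, consistent with the circumplex model; the fused-feature size and the exact parameterization below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PolarVAHead(nn.Module):
    """Predict emotion in polar form (radius, angle) and convert to valence-arousal."""

    def __init__(self, in_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, 2)  # -> (raw_radius, raw_angle)

    def forward(self, fused_features):
        raw = self.proj(fused_features)
        radius = torch.sigmoid(raw[:, 0])          # emotion intensity in [0, 1]
        angle = torch.tanh(raw[:, 1]) * torch.pi   # angle on the circumplex in [-pi, pi]
        valence = radius * torch.cos(angle)        # back to Cartesian VA space
        arousal = radius * torch.sin(angle)
        return torch.stack([valence, arousal], dim=-1)

va = PolarVAHead()(torch.randn(4, 512))  # (4, 2) valence-arousal pairs in [-1, 1]
```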
This study presents an emotion-aware navigation framework -- EmoBipedNav -- using deep reinforcement learning (DRL) for bipedal robots walking in socially interactive environments. The inherent locomotion constraints of bipedal robots challenge their safe maneuvering capabilities in dynamic environments. When combined with the intricacies of social environments, including pedestrian interactions and social cues, such as emotions, these challenges become even more pronounced. To address these coupled problems, we propose a two-stage pipeline that considers both bipedal locomotion constraints and complex social environments. Specifically, social navigation scenarios are represented using sequential LiDAR grid maps (LGMs), from which we extract latent features, including collision regions, emotion-related discomfort zones, social interactions, and the spatio-temporal dynamics of evolving environments. The extracted features are directly mapped to the actions of reduced-order models (ROMs) through a DRL architecture. Furthermore, the proposed framework incorporates full-order dynamics and locomotion constraints during training, effectively accounting for tracking errors and restrictions of the locomotion controller while planning the trajectory with ROMs. Comprehensive experiments demonstrate that our approach exceeds both model-based planners and DRL-based baselines. The hardware videos and open-source code are available at this https URL.
https://arxiv.org/abs/2503.12538
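As a rough illustration of the pipeline above (sequential LiDAR grid maps mapped to reduced-order-model actions through a DRL policy), the toy module below encodes a stack of grid maps with a small CNN and outputs a bounded ROM command. The grid size, stack length, and action definition (planar velocity plus heading rate) are assumptions, and the DRL training loop itself is not shown.

```python
import torch
import torch.nn as nn

class LGMPolicy(nn.Module):
    """Toy policy head: a stack of T LiDAR grid maps -> ROM command."""

    def __init__(self, t=5, action_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(t, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer flattened size for the assumed 64x64 grids
            flat = self.encoder(torch.zeros(1, t, 64, 64)).shape[1]
        self.head = nn.Sequential(nn.Linear(flat, 256), nn.ReLU(), nn.Linear(256, action_dim))

    def forward(self, lgm_stack):                               # (B, T, 64, 64)
        return torch.tanh(self.head(self.encoder(lgm_stack)))   # bounded (vx, vy, heading rate)

action = LGMPolicy()(torch.zeros(2, 5, 64, 64))  # (2, 3)
```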
Consensus building is inherently challenging due to the diverse opinions held by stakeholders. Effective facilitation is crucial to support the consensus building process and enable efficient group decision making. However, the effectiveness of facilitation is often constrained by human factors such as limited experience and scalability. In this research, we propose a Parallel Thinking-based Facilitation Agent (PTFA) that facilitates online, text-based consensus building processes. The PTFA automatically collects textual posts and leverages large language models (LLMs) to perform all of the six distinct roles of the well-established Six Thinking Hats technique in parallel thinking. To illustrate the potential of PTFA, a pilot study was carried out and PTFA's ability in idea generation, emotional probing, and deeper analysis of ideas was demonstrated. Furthermore, a comprehensive dataset that contains not only the conversational content among the participants but also between the participants and the agent is constructed for future study.
https://arxiv.org/abs/2503.12499
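A minimal sketch of the parallel Six Thinking Hats facilitation described above: the six roles are run concurrently over the collected posts. The `call_llm` function is a hypothetical placeholder for whichever LLM API is used, and the role prompts are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

HATS = {
    "white":  "Summarize only the facts and information in the discussion.",
    "red":    "Surface the emotions and gut feelings expressed by participants.",
    "black":  "Point out risks, weaknesses, and reasons for caution.",
    "yellow": "Highlight benefits and the value in the proposals.",
    "green":  "Generate new ideas and creative alternatives.",
    "blue":   "Manage the process: summarize progress and suggest next steps.",
}

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM API call (e.g., a chat-completion request)."""
    raise NotImplementedError

def facilitate(posts: list[str]) -> dict[str, str]:
    """Run all six facilitation roles in parallel over the collected text posts."""
    discussion = "\n".join(posts)

    def run(hat_role):
        hat, role = hat_role
        prompt = f"You are the {hat} thinking hat. {role}\n\nDiscussion:\n{discussion}"
        return hat, call_llm(prompt)

    with ThreadPoolExecutor(max_workers=len(HATS)) as pool:
        return dict(pool.map(run, HATS.items()))
```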
We propose a novel dual-loop system that synergistically combines responsive neurostimulation (RNS) implants with artificial intelligence-driven wearable devices for treating post-traumatic stress disorder (PTSD) and enabling naturalistic brain research. In PTSD Therapy Mode, an implanted closed-loop neural device monitors amygdala activity and provides on-demand stimulation upon detecting pathological theta oscillations, while an ensemble of wearables (smart glasses, smartwatches, smartphones) uses multimodal large language model (LLM) analysis of sensory data to detect environmental or physiological PTSD triggers and deliver timely audiovisual interventions. Logged events from both the neural and wearable loops are analyzed to personalize trigger detection and progressively transition patients to non-invasive interventions. In Neuroscience Research Mode, the same platform is adapted for real-world brain activity capture. Wearable-LLM systems recognize naturalistic events (social interactions, emotional situations, compulsive behaviors, decision making) and signal implanted RNS devices (via wireless triggers) to record synchronized intracranial data during these moments. This approach builds on recent advances in mobile intracranial EEG recording and closed-loop neuromodulation in humans (BRAIN Initiative, 2023; Mobbs et al., 2021). We discuss how our interdisciplinary system could revolutionize PTSD therapy and cognitive neuroscience by enabling 24/7 monitoring, context-aware intervention, and rich data collection outside traditional labs. The vision is a future where AI-enhanced devices continuously collaborate with the human brain, offering therapeutic support and deep insights into neural function, with the resulting context-rich, real-world neural data in turn accelerating the development of more biologically grounded and human-centric AI.
https://arxiv.org/abs/2503.12334
Multimodal emotion recognition has recently drawn considerable interest in affective computing, as it has immense potential to outperform isolated unimodal approaches. Audio and visual modalities are the two predominant contact-free channels in videos and are often expected to carry a complementary relationship with each other. However, audio and visual channels may not always be complementary, resulting in poor audio-visual feature representations and thereby degrading system performance. In this paper, we propose a flexible audio-visual fusion model that can adapt to weak complementary relationships using a gated attention mechanism. Specifically, we extend the recursive joint cross-attention model by introducing a gating mechanism in every iteration to control the flow of information between the input features and the attended features, depending on the strength of their complementary relationship. For instance, if the modalities exhibit a strong complementary relationship, the gating mechanism selects the cross-attended features; otherwise, it selects the non-attended features. To further improve performance, we introduce a stage gating mechanism that controls the flow of information across the gated outputs of each iteration. By adding more flexibility to the recursive joint cross-attention mechanism, the proposed model therefore improves performance even when the audio and visual modalities do not have a strong complementary relationship with each other. The proposed model has been evaluated on the challenging Aff-Wild2 dataset and significantly outperforms state-of-the-art fusion approaches.
https://arxiv.org/abs/2503.12261
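The gating idea described above can be sketched as a single fusion step: a learned sigmoid gate blends cross-attended features with the original (non-attended) features, depending on how useful the complementary modality appears. Feature sizes are assumptions, and the paper applies this recursively with an additional stage-level gate not shown here.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """One iteration of gated audio-visual fusion (illustrative simplification)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x_a, x_v):
        # Audio attends to video: cross-attended audio features.
        att_a, _ = self.cross_attn(query=x_a, key=x_v, value=x_v)
        # Gate in [0, 1]: high when the complementary (attended) features help,
        # low to fall back to the non-attended input features.
        g = torch.sigmoid(self.gate(torch.cat([x_a, att_a], dim=-1)))
        return g * att_a + (1.0 - g) * x_a

fused_audio = GatedCrossAttentionFusion()(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
```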
We present our contribution to the 8th ABAW challenge at CVPR 2025, where we tackle valence-arousal estimation, emotion recognition, and facial action unit detection as three independent challenges. Our approach leverages the well-known Dual-Direction Attention Mixed Feature Network (DDAMFN) for all three tasks, achieving results that surpass the proposed baselines. Additionally, we explore the use of CLIP for the emotion recognition challenge as an additional experiment. We provide insights into the architectural choices that contribute to the strong performance of our methods.
https://arxiv.org/abs/2503.12260
Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker's voice demonstrated in a short reference audio clip. This task demands that the model bridge character performances and complicated prosody structures to build a high-quality video-synchronized dubbing track. The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinders the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. First, we propose prosody-enhanced acoustic pre-training to develop robust acoustic modeling capabilities. Then, we freeze the pre-trained acoustic system and design a disentangled framework to model prosodic text features and dubbing style while maintaining acoustic quality. Additionally, we incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies, thereby enhancing emotion-prosody alignment. Extensive experiments show that our method performs favorably against the state-of-the-art models on two primary benchmarks. The demos are available at this https URL.
https://arxiv.org/abs/2503.12042
Digital twins (DTs) are redefining healthcare by paving the way for more personalized, proactive, and intelligent medical interventions. As the shift toward personalized care intensifies, there is a growing need for an individual's virtual replica that delivers the right treatment at the optimal time and in the most effective manner. The emerging concept of a Human Digital Twin (HDT) holds the potential to revolutionize the traditional healthcare system much like digital twins have transformed manufacturing and aviation. An HDT mirrors the physical entity of a human body through a dynamic virtual model that continuously reflects changes in molecular, physiological, emotional, and lifestyle factors. This digital representation not only supports remote monitoring, diagnosis, and prescription but also facilitates surgery, rehabilitation, and overall personalized care, thereby relieving pressure on conventional healthcare frameworks. Despite its promising advantages, there are considerable research challenges to overcome as HDT technology evolves. In this study, I will initially delineate the distinctions between traditional digital twins and HDTs, followed by an exploration of the networking architecture integral to their operation--from data acquisition and communication to computation, management, and decision-making--thereby offering insights into how these innovations may reshape the modern healthcare industry.
https://arxiv.org/abs/2503.11944
In controlled text generation using large language models (LLMs), gaps arise between the language model's interpretation and human expectations. We look at the problem of controlling emotions in keyword-based sentence generation for both GPT-4 and LLaMA-3. We selected four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. Our human evaluation looked at the Human-LLM alignment for each representation, as well as the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., "angry") rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. However, we found that converting the originally-numeric VAD scales to Lexical scales (e.g., +4.0 becomes "High") dramatically improved agreement. Furthermore, the perception of how much a generated sentence conveys an emotion is highly dependent on the LLM, representation type, and which emotion it is.
https://arxiv.org/abs/2503.11881
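The numeric-to-lexical VAD conversion reported to improve agreement (e.g., +4.0 becoming "High") can be sketched as a simple binning function; the scale range and the three-level label set are assumptions, since the paper's exact mapping is not given here.

```python
def vad_to_lexical(value: float, lo: float = -5.0, hi: float = 5.0) -> str:
    """Map a numeric VAD score onto a coarse lexical scale (e.g., +4.0 -> "High").
    The [-5, 5] range and the three-level label set are illustrative assumptions."""
    labels = ["Low", "Medium", "High"]
    value = max(lo, min(hi, value))
    idx = min(int((value - lo) / (hi - lo) * len(labels)), len(labels) - 1)
    return labels[idx]

# e.g. build a lexical prompt instead of a numeric one
v, a, d = 4.0, 2.5, -1.0
prompt = (f"Write a sentence with Valence: {vad_to_lexical(v)}, "
          f"Arousal: {vad_to_lexical(a)}, Dominance: {vad_to_lexical(d)}.")
```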
Sarcasm detection, with its figurative nature, poses unique challenges for affective systems designed to perform sentiment analysis. While these systems typically perform well at identifying direct expressions of emotion, they struggle with sarcasm's inherent contradiction between literal and intended sentiment. Since transformer-based language models (LMs) are known for their ability to efficiently capture contextual meanings, we propose a method that leverages LMs and prototype-based networks, enhanced by sentiment embeddings, to conduct interpretable sarcasm detection. Our approach is intrinsically interpretable without extra post-hoc interpretability techniques. We test our model on three public benchmark datasets and show that it outperforms the current state-of-the-art. At the same time, the prototypical layer enhances the model's inherent interpretability by generating explanations through similar examples at inference time. Furthermore, in an ablation study we demonstrate the effectiveness of the incongruity loss, which we construct using sentiment prototypes.
https://arxiv.org/abs/2503.11838
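A minimal sketch of a prototype-based head of the kind described above: the classifier operates on similarities between the encoded input (language-model plus sentiment features) and learnable prototypes, and those similarities can be traced back to nearby training examples as explanations. Dimensions and the cosine-similarity choice are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassifier(nn.Module):
    """Prototype-based head: classify from similarities to learnable prototype vectors.
    At inference, each prototype's nearest training example can serve as an explanation."""

    def __init__(self, dim=768, num_prototypes=10, num_classes=2):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.classifier = nn.Linear(num_prototypes, num_classes)

    def forward(self, encoded):                        # (B, dim) LM + sentiment features
        sims = F.cosine_similarity(encoded.unsqueeze(1),
                                   self.prototypes.unsqueeze(0), dim=-1)  # (B, P)
        return self.classifier(sims), sims             # logits and per-prototype evidence

logits, sims = PrototypeClassifier()(torch.randn(4, 768))
```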
Affective Image Manipulation (AIM) aims to alter an image's emotional impact by adjusting multiple visual elements to evoke specific emotions. AIM is inherently complex, necessitating a collaborative approach that involves identifying semantic cues within source images, manipulating these elements to elicit desired emotional responses, and verifying that the combined adjustments successfully evoke the target emotion. To address these challenges, we introduce EmoAgent, the first multi-agent collaboration framework for AIM. By emulating the cognitive behaviors of a human painter, EmoAgent incorporates three specialized agents responsible for planning, editing, and critical evaluation. Furthermore, we develop an emotion-factor knowledge retriever, a decision-making tree space, and a tool library to enhance EmoAgent's effectiveness in handling AIM. Experiments demonstrate that the proposed multi-agent framework outperforms existing methods, offering more reasonable and effective emotional expression.
https://arxiv.org/abs/2503.11290
Compound Expression Recognition (CER) is crucial for understanding human emotions and improving human-computer interaction. However, CER faces challenges due to the complexity of facial expressions and the difficulty of capturing subtle emotional cues. To address these issues, we propose a novel approach leveraging Large Vision-Language Models (LVLMs). Our method employs a two-stage fine-tuning process: first, pre-trained LVLMs are fine-tuned on basic facial expressions to establish foundational patterns; second, the model is further optimized on a compound-expression dataset to refine visual-language feature interactions. Our approach achieves advanced accuracy on the RAF-DB dataset and demonstrates strong zero-shot generalization on the C-EXPR-DB dataset, showcasing its potential for real-world applications in emotion analysis and human-computer interaction.
https://arxiv.org/abs/2503.11241
Understanding why people trust or distrust one another, institutions, or information is a complex task that has led scholars from various fields of study to employ diverse epistemological and methodological approaches. Despite the challenges, it is generally agreed that the antecedents of trust (and distrust) encompass a multitude of emotional and cognitive factors, including a general disposition to trust and an assessment of trustworthiness factors. In an era marked by increasing political polarization, cultural backlash, widespread disinformation and fake news, and the use of AI software to produce news content, the need to study trust in the news has gained significant traction. This study presents the findings of a trust in the news experiment designed in collaboration with Spanish and UK journalists, fact-checkers, and the CardiffNLP Natural Language Processing research group. The purpose of this experiment, conducted in June 2023, was to examine the extent to which people trust a set of fake news articles based on previously identified disinformation narratives related to gender, climate change, and COVID-19. The online experiment participants (801 in Spain and 800 in the UK) were asked to read three fake news items and rate their level of trust on a scale from 1 (not true) to 8 (true). The pieces used a combination of factors, including stance (favourable, neutral, or against the narrative), presence of toxic expressions, clickbait titles, and sources of information to test which elements influenced people's responses the most. Half of the pieces were produced by humans and the other half by ChatGPT. The results show that the topic of news articles, stance, people's age, gender, and political ideologies significantly affected their levels of trust in the news, while the authorship (humans or ChatGPT) does not have a significant impact.
https://arxiv.org/abs/2503.11116
Speech-driven 3D facial animation seeks to produce lifelike facial expressions that are synchronized with the speech content and its emotional nuances, finding applications in various multimedia fields. However, previous methods often overlook emotional facial expressions or fail to disentangle them effectively from the speech content. To address these challenges, we present EmoDiffusion, a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. Specifically, our method employs two Variational Autoencoders (VAEs) to separately generate the upper face region and mouth region, thereby learning a more refined representation of the facial sequence. Unlike traditional methods that use diffusion models to connect facial expression sequences with audio inputs, we perform the diffusion process in the latent space. Furthermore, we introduce an Emotion Adapter to evaluate upper face movements accurately. Given the paucity of 3D emotional talking face data in the animation industry, we capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone. This effort results in the creation of an innovative 3D blendshape emotional talking face dataset (3D-BEF) used to train our network. Extensive experiments and perceptual evaluations validate the effectiveness of our approach, confirming its superiority in generating realistic and emotionally rich facial animations.
https://arxiv.org/abs/2503.11028
Despite recent advances in text-to-speech (TTS) models, audio-visual to audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multi-modal guidance with CFM, our model robustly preserves speaker-specific characteristics and significantly enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating robust speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements.
https://arxiv.org/abs/2503.11026
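For reference, the generic conditional flow matching objective (in its rectified-flow form) with speaker and visual conditioning looks like the sketch below; this is the standard recipe rather than the paper's exact formulation, and the feature and conditioning dimensions are assumptions.

```python
import torch
import torch.nn as nn

def cfm_loss(velocity_net: nn.Module, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Generic conditional flow matching loss (rectified-flow form, sigma = 0).

    x1:   target features, e.g. mel-spectrogram frames, shape (B, D)
    cond: conditioning vector, e.g. concatenated x-vector + visual/emotion embedding
    The network predicts the velocity field v_theta(x_t, t, cond) and is trained to
    match the straight-path velocity (x1 - x0).
    """
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], 1, device=x1.device)
    x_t = (1.0 - t) * x0 + t * x1                  # point on the straight path
    target_v = x1 - x0
    pred_v = velocity_net(torch.cat([x_t, t, cond], dim=-1))
    return torch.mean((pred_v - target_v) ** 2)

# Toy usage with an assumed MLP velocity network.
D, C = 80, 256
net = nn.Sequential(nn.Linear(D + 1 + C, 512), nn.SiLU(), nn.Linear(512, D))
loss = cfm_loss(net, x1=torch.randn(8, D), cond=torch.randn(8, C))
```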
Emotional Mimicry Intensity (EMI) estimation serves as a critical technology for understanding human social behavior and enhancing human-computer interaction experiences, where the core challenge lies in dynamic correlation modeling and robust fusion of multimodal temporal signals. To address the limitations of existing methods in insufficient exploitation of modal synergistic effects, noise sensitivity, and limited fine-grained alignment capabilities, this paper proposes a dual-stage cross-modal alignment framework. First, we construct vision-text and audio-text contrastive learning networks based on an improved CLIP architecture, achieving preliminary alignment in the feature space through modality-decoupled pre-training. Subsequently, we design a temporal-aware dynamic fusion module that combines Temporal Convolutional Networks (TCN) and gated bidirectional LSTM to respectively capture the macro-evolution patterns of facial expressions and local dynamics of acoustic features. Innovatively, we introduce a quality-guided modality fusion strategy that enables modality compensation under occlusion and noisy scenarios through differentiable weight allocation. Experimental results on the Hume-Vidmimic2 dataset demonstrate that our method achieves an average Pearson correlation coefficient of 0.35 across six emotion dimensions, outperforming the best baseline by 40\%. Ablation studies further validate the effectiveness of the dual-stage training strategy and dynamic fusion mechanism, providing a novel technical pathway for fine-grained emotion analysis in open environments.
https://arxiv.org/abs/2503.10603
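The quality-guided fusion strategy mentioned above can be sketched as differentiable weighting: each modality gets a learned quality score, and a softmax over the scores down-weights occluded or noisy modalities. The quality estimator and feature sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QualityGuidedFusion(nn.Module):
    """Fuse modality features with differentiable weights from learned quality scores,
    so a noisy or occluded modality is down-weighted."""

    def __init__(self, dim=256, num_modalities=3):
        super().__init__()
        self.quality = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_modalities))

    def forward(self, feats):                      # list of (B, dim) modality features
        scores = torch.cat([q(f) for q, f in zip(self.quality, feats)], dim=-1)  # (B, M)
        weights = torch.softmax(scores, dim=-1)                                  # (B, M)
        stacked = torch.stack(feats, dim=1)                                      # (B, M, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                      # (B, dim)

fused = QualityGuidedFusion()([torch.randn(4, 256) for _ in range(3)])
```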
In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results on the ABAW 8th competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well-suited for real-time applications in mobile and embedded computing environments.
https://arxiv.org/abs/2503.10530
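A single-scale sketch of the temporal MLP-Mixer aggregation described above: a token-mixing MLP operates across the frame axis and a channel-mixing MLP across features. The multi-scale 3D structure of the paper's module is not reproduced here, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class TemporalMixerBlock(nn.Module):
    """One MLP-Mixer block over time: token-mixing across frames, then channel-mixing."""

    def __init__(self, num_frames=16, dim=256, hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(num_frames, hidden), nn.GELU(),
                                       nn.Linear(hidden, num_frames))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                         nn.Linear(hidden, dim))

    def forward(self, x):                          # (B, T, dim) backbone frame features
        y = self.norm1(x).transpose(1, 2)          # mix information across the time axis
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))    # mix information across channels
        return x

out = TemporalMixerBlock()(torch.randn(2, 16, 256))
```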
This article presents our results for the eighth Affective Behavior Analysis in-the-Wild (ABAW) competition. We combine facial emotional descriptors extracted by pre-trained models, namely, our EmotiEffLib library, with acoustic features and embeddings of texts recognized from speech. The frame-level features are aggregated and fed into simple classifiers, e.g., multi-layered perceptron (feed-forward neural network with one hidden layer), to predict ambivalence/hesitancy and facial expressions. In the latter case, we also use the pre-trained facial expression recognition model to select high-score video frames and prevent their processing with a domain-specific video classifier. The video-level prediction of emotional mimicry intensity is implemented by simply aggregating frame-level features and training a multi-layered perceptron. Experimental results for three tasks from the ABAW challenge demonstrate that our approach significantly increases validation metrics compared to existing baselines.
https://arxiv.org/abs/2503.10399
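The frame-aggregation-plus-MLP pipeline described above can be sketched in a few lines; the mean/std pooling, feature dimension, and toy labels are assumptions, with EmotiEffLib (or any frame-level extractor) standing in as the source of per-frame descriptors.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def aggregate_frames(frame_feats: np.ndarray) -> np.ndarray:
    """Pool frame-level descriptors (T, D) into one video-level vector.
    Mean + std statistics are an assumed choice of aggregation."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

# Toy training data: in the described pipeline, facial, acoustic, and text embeddings
# would be concatenated per frame before aggregation.
rng = np.random.default_rng(0)
X = np.stack([aggregate_frames(rng.normal(size=(50, 128))) for _ in range(32)])
y = rng.integers(0, 2, size=32)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)  # one hidden layer
clf.fit(X, y)
```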