Expressive voice conversion (VC) performs speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on vocoder performance. A major challenge of expressive VC lies in emotion prosody modeling. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Code and samples are publicly available.
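As a rough illustration of the conditioning scheme described above, the sketch below wires content units, an emotion embedding, and a speaker embedding into a DDPM denoiser. The module layout, dimensions, and the simple additive conditioning are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Illustrative DDPM denoiser conditioned on content units,
    an emotion embedding, and a speaker embedding (dimensions assumed)."""

    def __init__(self, n_mels=80, unit_dim=256, emo_dim=128, spk_dim=192, hidden=256):
        super().__init__()
        self.unit_proj = nn.Linear(unit_dim, hidden)
        self.emo_proj = nn.Linear(emo_dim, hidden)
        self.spk_proj = nn.Linear(spk_dim, hidden)
        self.time_proj = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.in_proj = nn.Linear(n_mels, hidden)
        self.backbone = nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU(),
                                      nn.Linear(hidden, hidden), nn.SiLU())
        self.out_proj = nn.Linear(hidden, n_mels)

    def forward(self, x_t, t, units, emo, spk):
        # x_t: (B, T, n_mels) noisy mel; units: (B, T, unit_dim);
        # emo: (B, emo_dim); spk: (B, spk_dim); t: (B,) diffusion step in [0, 1].
        cond = self.unit_proj(units)                      # frame-level content
        cond = cond + self.emo_proj(emo).unsqueeze(1)     # utterance-level emotion style
        cond = cond + self.spk_proj(spk).unsqueeze(1)     # utterance-level speaker identity
        cond = cond + self.time_proj(t.view(-1, 1)).unsqueeze(1)
        h = self.backbone(self.in_proj(x_t) + cond)
        return self.out_proj(h)                           # predicted noise, same shape as x_t

# Training step (simplified): predict the noise added at a random timestep.
# loss = F.mse_loss(model(x_t, t, units, emo, spk), noise)
```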
https://arxiv.org/abs/2405.01730
This paper presents the TartuNLP team's submission to the EvaLatin 2024 shared task on emotion polarity detection for historical Latin texts. Our system relies on two distinct approaches to annotating training data for supervised learning: 1) creating heuristics-based labels by adopting the polarity lexicon provided by the organizers, and 2) generating labels with GPT-4. We employed parameter-efficient fine-tuning using the adapters framework and experimented with both monolingual and cross-lingual knowledge transfer for training language and task adapters. Our submission with the LLM-generated labels achieved overall first place in the emotion polarity detection task. Our results show that LLM-based annotation is a promising approach for texts in Latin.
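A minimal sketch of the first annotation approach, heuristic labelling with a polarity lexicon; the lexicon entries, token matching, and the majority-count rule are assumptions rather than the shared task's exact heuristic.

```python
from collections import Counter

# Hypothetical polarity lexicon: lemma -> "positive" / "negative".
POLARITY_LEXICON = {"amor": "positive", "laetitia": "positive",
                    "dolor": "negative", "ira": "negative"}

def heuristic_label(tokens, lexicon=POLARITY_LEXICON):
    """Assign a coarse polarity label to a Latin sentence by counting lexicon hits."""
    counts = Counter(lexicon[t] for t in tokens if t in lexicon)
    if not counts:
        return "neutral"
    pos, neg = counts["positive"], counts["negative"]
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "mixed"   # equal evidence for both polarities

print(heuristic_label(["magnus", "dolor", "et", "ira"]))  # -> "negative"
```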
https://arxiv.org/abs/2405.01159
Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations present in the training data. In this paper, we focus on bias where the effect of the training data is unclear, and instead address the question: do language models still exhibit gender bias in non-stereotypical settings? To do so, we introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score based on pretraining data statistics to determine whether a sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (Winobias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not solely stem from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation. We release the full dataset and code at this https URL.
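The abstract does not give USE's exact scoring function; the sketch below illustrates one plausible formulation of a sentence-level word-gender association score from pretraining co-occurrence statistics (a PMI-style association per word, aggregated over the sentence). All counts, inputs, and the aggregation rule are assumptions.

```python
import math

def word_gender_pmi(word, cooc, counts, gender_words, total):
    """PMI between `word` and a set of gendered words, from corpus counts (illustrative)."""
    joint = sum(cooc.get((word, g), 0) for g in gender_words)
    marginal = sum(counts.get(g, 0) for g in gender_words)
    if joint == 0 or counts.get(word, 0) == 0 or marginal == 0:
        return 0.0
    return math.log((joint * total) / (counts[word] * marginal))

def sentence_gender_score(tokens, cooc, counts, male_words, female_words, total):
    """Max absolute male/female association over the sentence; lower = more stereotype-free."""
    scores = []
    for tok in tokens:
        m = word_gender_pmi(tok, cooc, counts, male_words, total)
        f = word_gender_pmi(tok, cooc, counts, female_words, total)
        scores.append(abs(m - f))
    return max(scores) if scores else 0.0
```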
https://arxiv.org/abs/2405.00588
Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain to be addressed: 1) previous studies have focused on short sequential video emotion analysis while overlooking long sequential videos. However, the emotions in short sequential videos only reflect instantaneous emotions, which may be deliberately guided or hidden; in contrast, long sequential videos can reveal authentic emotions. 2) Previous studies commonly utilize various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, due to the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address these limitations, we construct EALD, a dataset for Emotion Analysis in Long-sequential and De-identity videos, by collecting and processing sequences of athletes' post-match interviews. In addition to annotations of the overall emotional state of each video, we also provide Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue for understanding emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate Multimodal Large Language Models (MLLMs) with de-identified signals (e.g., visual, speech, and NFBLs) to perform emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve performance comparable to, or even better than, supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on an open-source platform.
https://arxiv.org/abs/2405.00574
Speech emotion recognition (SER) has garnered increasing attention due to its wide range of applications in various fields, including human-machine interaction, virtual assistants, and mental health assistance. However, existing SER methods often overlook the information gap between the pre-training speech recognition task and the downstream SER task, resulting in sub-optimal performance. Moreover, current methods require considerable time for fine-tuning on each specific speech dataset, such as IEMOCAP, which limits their effectiveness in real-world scenarios with large-scale noisy data. To address these issues, we propose AFTER, an active learning (AL)-based fine-tuning framework for SER that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. Specifically, we first use TAPT to minimize the information gap between the pre-training speech recognition task and the downstream speech emotion recognition task. Then, AL methods are employed to iteratively select a subset of the most informative and diverse samples for fine-tuning, thereby reducing time consumption. Experiments demonstrate that AFTER, using only 20% of the samples, improves accuracy by 8.45% and reduces time consumption by 79%. Additional extensions of AFTER and ablation studies further confirm its effectiveness and applicability to various real-world scenarios. Our source code is available on GitHub for reproducibility (this https URL).
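AFTER's acquisition function is not specified in the abstract; the sketch below shows one common way to combine informativeness (predictive entropy) with diversity (clustering over utterance embeddings) when selecting a fine-tuning subset, purely as an illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_samples(probs, embeddings, budget, seed=0):
    """Pick `budget` samples: cluster embeddings for diversity, then take the
    highest-entropy (most uncertain) sample from each cluster."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(embeddings)
    selected = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        selected.append(members[np.argmax(entropy[members])])
    return np.array(selected)

# probs: (N, n_emotions) model posteriors; embeddings: (N, D) utterance features.
```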
https://arxiv.org/abs/2405.00307
This paper explores the impact of incorporating sentiment, emotion, and domain-specific lexicons into a transformer-based model for depression symptom estimation. Lexicon information is added by marking words in the input transcripts of patient-therapist conversations as well as in social media posts. Overall results show that introducing external knowledge into pre-trained language models can be beneficial for prediction performance, while different lexicons show distinct behaviours depending on the targeted task. Additionally, new state-of-the-art results are obtained for the estimation of depression level over patient-therapist interviews.
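The abstract states that lexicon information is added by marking words in the input; one simple way to do this with inline marker tokens is sketched below. The marker format and the toy lexicons are illustrative assumptions, not the paper's exact scheme.

```python
SENTIMENT = {"hopeless", "worthless", "tired"}   # illustrative lexicon entries
EMOTION = {"sad", "angry", "afraid"}

def mark_transcript(text, lexicons={"sent": SENTIMENT, "emo": EMOTION}):
    """Wrap lexicon hits in marker tokens before feeding the text to the transformer."""
    marked = []
    for tok in text.split():
        tags = [name for name, lex in lexicons.items() if tok.lower().strip(".,") in lex]
        if tags:
            marked.append(f"[{'|'.join(tags)}] {tok} [/{'|'.join(tags)}]")
        else:
            marked.append(tok)
    return " ".join(marked)

print(mark_transcript("I feel hopeless and tired lately."))
```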
https://arxiv.org/abs/2404.19359
Singing voice beautifying is a novel task with application value in daily life, aiming to correct the pitch of a singing voice and improve its expressiveness without changing the original timbre and content. Existing methods rely on paired data or concentrate only on pitch correction. However, professional and amateur songs from the same person are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects such as emotion and rhythm. We therefore propose ConTuner, a fast and high-fidelity singing voice beautifying system: a diffusion model combined with a modified condition to generate the beautified mel-spectrogram, where the modified condition is composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping from MIDI and spectral envelope to pitch. To make amateur singing more expressive, we propose an expressiveness enhancer in the latent space that converts an amateur vocal tone to a professional one. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and the generator-based acceleration method in ConTuner are effective.
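For the MIDI-to-pitch mapping mentioned above, the underlying note-to-frequency relation is standard; the per-frame correction rule toward the score pitch shown here is only an illustrative assumption, not ConTuner's actual mapping.

```python
import numpy as np

def midi_to_f0(midi_note):
    """Standard equal-temperament mapping: MIDI note number -> frequency in Hz."""
    return 440.0 * 2.0 ** ((np.asarray(midi_note, dtype=float) - 69.0) / 12.0)

def correct_pitch(f0_amateur, midi_track, strength=0.8):
    """Pull each voiced frame's F0 toward the score pitch (illustrative correction rule)."""
    target = midi_to_f0(midi_track)
    voiced = f0_amateur > 0
    corrected = f0_amateur.copy()
    corrected[voiced] = (1 - strength) * f0_amateur[voiced] + strength * target[voiced]
    return corrected

print(midi_to_f0(69))  # 440.0 Hz (A4)
```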
https://arxiv.org/abs/2404.19187
The dynamic nature of esports makes game situations relatively complicated for average viewers. Esports broadcasts involve expert game casters, but caster-dependent commentary alone is not enough to fully understand the game situation; it becomes richer when it includes diverse multimodal esports information, such as audience talk and emotions, game audio, and match event information. This paper introduces GAME-MUG, a new multimodal game situation understanding and audience-engaged commentary generation dataset, together with a strong baseline. Our dataset is collected from 2020-2022 LOL game live streams from YouTube and Twitch and includes multimodal esports game information, including text, audio, and time-series event logs, for detecting the game situation. In addition, we propose a new audience-conversation-augmented commentary dataset covering game situation and audience conversation understanding, and introduce a robust joint multimodal dual learning model as a baseline. We examine the model's game situation/event understanding ability and commentary generation capability to show the effectiveness of the multimodal coverage and the joint integration learning approach.
https://arxiv.org/abs/2404.19175
Head avatars animated by visual signals have gained popularity, particularly in cross-driving synthesis, where the driver differs from the animated character, a challenging but highly practical setting. The recently presented MegaPortraits model has demonstrated state-of-the-art results in this domain. We conduct a deep examination and evaluation of this model, with a particular focus on its latent space for facial expression descriptors, and uncover several limitations in its ability to express intense face motions. To address these limitations, we propose substantial changes to both the training pipeline and the model architecture to introduce our EMOPortraits model, in which we: enhance the model's capability to faithfully support intense, asymmetric face expressions, setting a new state-of-the-art result in the emotion transfer task and surpassing previous methods in both metrics and quality; incorporate a speech-driven mode into our model, achieving top-tier performance in audio-driven facial animation and making it possible to drive the source identity through diverse modalities, including visual signal, audio, or a blend of both; and propose a novel multi-view video dataset featuring a wide range of intense and asymmetric facial expressions, filling the gap left by the absence of such data in existing datasets.
https://arxiv.org/abs/2404.19110
Speech-driven 3D facial animation technology has been developed for years, but its practical application still falls short of expectations. The main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions. Although lip alignment has seen many related studies, existing methods struggle to synthesize natural and realistic expressions, resulting in a mechanical and stiff appearance of facial animations. Even when some research extracts emotional features from speech, the randomness of facial movements limits the effective expression of emotions. To address this issue, this paper proposes a method called CSTalk (Correlation Supervised) that models the correlations among different regions of facial movement and supervises the training of the generative model to generate realistic expressions that conform to human facial motion patterns. To generate more intricate animations, we employ a rich set of control parameters based on the MetaHuman character model and capture a dataset of five different emotions. We train a generative network using an autoencoder structure and input an emotion embedding vector to achieve the generation of user-controlled expressions. Experimental results demonstrate that our method outperforms existing state-of-the-art methods.
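The abstract describes supervising correlations among facial regions; one plausible instantiation is to penalize the difference between the region-wise correlation matrices of generated and ground-truth motion, as sketched below. The region representation and the squared-error penalty are assumptions, not necessarily CSTalk's actual loss.

```python
import torch

def region_correlation(motion):
    """motion: (T, R) per-frame activation of R facial regions -> (R, R) Pearson correlation."""
    x = motion - motion.mean(dim=0, keepdim=True)
    x = x / (x.std(dim=0, keepdim=True) + 1e-8)
    return (x.T @ x) / motion.shape[0]

def correlation_loss(pred_motion, gt_motion):
    """Encourage generated facial motion to reproduce the ground-truth inter-region correlations."""
    return torch.mean((region_correlation(pred_motion) - region_correlation(gt_motion)) ** 2)
```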
https://arxiv.org/abs/2404.18604
Speech emotion recognition (SER) has been gaining attention in recent years due to its potential applications in diverse fields and the possibilities offered by deep learning technologies. However, recent studies have shown that deep learning models can be vulnerable to adversarial attacks. In this paper, we systematically assess this problem by examining the impact of various adversarial white-box and black-box attacks on different languages and genders within the context of SER. We first propose a suitable methodology for audio data processing, feature extraction, and the CNN-LSTM architecture. The observed outcomes highlight the significant vulnerability of CNN-LSTM models to adversarial examples (AEs): all the considered adversarial attacks are able to significantly reduce the performance of the constructed models. Furthermore, when assessing the efficacy of the attacks, only minor differences were noted between the languages analyzed as well as between male and female speech. In summary, this work contributes to the understanding of the robustness of CNN-LSTM models, particularly in SER scenarios, and the impact of AEs. Our findings serve as a baseline for a) developing more robust algorithms for SER, b) designing more effective attacks, c) investigating possible defenses, d) improving the understanding of vocal differences between languages and genders, and e) overall, enhancing our comprehension of the SER task.
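The paper considers a range of white- and black-box attacks; as a concrete illustration of how such adversarial examples are generated, the sketch below applies one standard white-box attack (FGSM) to the input features of a CNN-LSTM emotion classifier. The feature shape and epsilon are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, features, label, epsilon=0.01):
    """Fast Gradient Sign Method: perturb input features in the direction that
    increases the classification loss. `features`: (B, T, D), e.g. MFCC frames."""
    features = features.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(features), label)
    loss.backward()
    adversarial = features + epsilon * features.grad.sign()
    return adversarial.detach()
```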
https://arxiv.org/abs/2404.18514
Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on the ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
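EP-Align aligns emotional features across modalities with contrastive learning; a minimal symmetric InfoNCE-style loss over paired embeddings from two modalities is sketched below. The temperature and the in-batch pairing scheme are assumptions, not necessarily the module's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings from two modalities.
    emb_a, emb_b: (B, D); matching rows are positives, all other rows negatives."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(emb_a.shape[0], device=emb_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```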
https://arxiv.org/abs/2404.18398
The complex information processing system of humans generates many objective and subjective evaluations, making the exploration of human cognitive products of great theoretical value. In recent years, deep learning technologies, which are inspired by biological brain mechanisms, have made significant strides in applications to psychological and cognitive scientific research, particularly in the memorization and recognition of facial data. This paper investigates through experimental research how neural networks process and store facial expression data and associate these data with a range of psychological attributes produced by humans. Researchers utilized the deep learning model VGG16, demonstrating that neural networks can learn and reproduce key features of facial data, thereby storing image memories. Moreover, the experimental results reveal the potential of deep learning models in understanding human emotions and cognitive processes, and establish a manifold visualization interpretation of cognitive products or psychological attributes from a non-Euclidean space perspective, offering new insights into enhancing the explainability of AI. This study not only advances the application of AI technology in the field of psychology but also provides a new psychological and theoretical understanding of the information processing of AI. The code is available here: this https URL.
https://arxiv.org/abs/2404.18352
This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across visual and audio modalities. By utilizing a pre-trained masked autoencoder model, MultiMAE-DER is realized through simple, straightforward fine-tuning. Its performance is enhanced by optimizing six fusion strategies for multimodal input sequences; these strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. In comparison to state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER enhances the weighted average recall (WAR) by 4.41% on the RAVDESS dataset and by 2.06% on CREMA-D. Furthermore, when compared with the state-of-the-art model for multimodal self-supervised learning, MultiMAE-DER achieves a 1.86% higher WAR on the IEMOCAP dataset.
https://arxiv.org/abs/2404.18327
Recent advancements in large language models (LLMs) have significantly boosted the rise of Role-Playing Language Agents (RPLAs), i.e., specialized AI systems designed to simulate assigned personas. By harnessing multiple advanced abilities of LLMs, including in-context learning, instruction following, and social intelligence, RPLAs achieve a remarkable sense of human likeness and vivid role-playing performance. RPLAs can mimic a wide range of personas, ranging from historical figures and fictional characters to real-life individuals. Consequently, they have catalyzed numerous AI applications, such as emotional companions, interactive video games, personalized assistants and copilots, and digital clones. In this paper, we conduct a comprehensive survey of this field, illustrating the evolution and recent progress in RPLAs integrating with cutting-edge LLM technologies. We categorize personas into three types: 1) Demographic Persona, which leverages statistical stereotypes; 2) Character Persona, focused on well-established figures; and 3) Individualized Persona, customized through ongoing user interactions for personalized services. We begin by presenting a comprehensive overview of current methodologies for RPLAs, followed by details for each persona type, covering corresponding data sourcing, agent construction, and evaluation. Afterward, we discuss the fundamental risks, existing limitations, and future prospects of RPLAs. Additionally, we provide a brief review of RPLAs in AI applications, which reflects the practical user demands that shape and drive RPLA research. Through this work, we aim to establish a clear taxonomy of RPLA research and applications, facilitate future research in this critical and ever-evolving field, and pave the way for a future where humans and RPLAs coexist in harmony.
https://arxiv.org/abs/2404.18231
This paper explores how artificial intelligence (AI) technology can contribute to progress on good health and well-being, one of the United Nations' 17 Sustainable Development Goals. It is estimated that one in ten of the global population lives with a mental disorder. Inspired by studies showing that engaging with and viewing beautiful natural images can make people feel happier and less stressed, lead to higher emotional well-being, and even have therapeutic value, we explore how AI can help to promote mental health by developing automatic algorithms for finding beautiful and happy images. We first construct a large image database consisting of nearly 20K very high resolution colour photographs of natural scenes, where each image is labelled with beautifulness and happiness scores by about 10 observers. Statistics of the database show that there is a good correlation between the beautifulness and happiness scores, which provides anecdotal evidence to corroborate that engaging with beautiful natural images can potentially benefit mental well-being. Building on this unique database, the very first of its kind, we have developed a deep-learning-based model for automatically predicting the beautifulness and happiness scores of natural images. Experimental results show that it is possible to develop AI algorithms to automatically assess an image's beautifulness and happiness values, which can in turn be used to develop applications for promoting mental health and well-being.
https://arxiv.org/abs/2404.18109
Audio signals can reveal intimate details about a person's life, including their conversations, health status, emotions, location, and personal preferences. Unauthorized access to or misuse of this information can have profound personal and social implications. In an era increasingly populated by devices capable of audio recording, safeguarding user privacy is a critical obligation. This work studies the ethical and privacy concerns in current audio classification systems. We discuss the challenges and research directions in designing privacy-preserving audio sensing systems, and we propose privacy-preserving audio features that can be used to classify a wide range of audio classes while preserving privacy.
https://arxiv.org/abs/2404.18002
Automatic Speech Understanding (ASU) aims at human-like speech interpretation, providing nuanced intent, emotion, sentiment, and content understanding from speech and from the language (text) content conveyed in speech. Typically, training a robust ASU model relies heavily on acquiring large-scale, high-quality speech and associated transcriptions. However, it is often challenging to collect or use speech data for training ASU due to concerns such as privacy. To approach this setting of enabling ASU when the speech (audio) modality is missing, we propose TI-ASU, which uses a pre-trained text-to-speech model to impute the missing speech. We report extensive experiments evaluating TI-ASU at various missing-data scales, in both multi- and single-modality settings, and with the use of LLMs. Our findings show that TI-ASU yields substantial benefits for improving ASU in scenarios where even up to 95% of the training speech is missing. Moreover, we show that TI-ASU is adaptive to dropout training, improving model robustness in handling missing speech during inference.
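A minimal sketch of the imputation idea: when a training example's audio is unavailable, synthesize a stand-in waveform from its transcript with a pre-trained TTS model. The `tts` callable and the example dictionary fields are placeholders, not a specific library API.

```python
def impute_missing_speech(examples, tts):
    """For each (transcript, audio) training example, fill missing audio via TTS.
    `tts` is any callable mapping text -> waveform (placeholder interface)."""
    completed = []
    for ex in examples:
        audio = ex["audio"] if ex.get("audio") is not None else tts(ex["text"])
        completed.append({"text": ex["text"], "audio": audio, "label": ex["label"]})
    return completed
```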
https://arxiv.org/abs/2404.17983
Neural Machine Translation (NMT) is the task of translating text from one language to another with a trained neural network. Several existing works aim at incorporating external information into NMT models to improve or control predicted translations (e.g. sentiment, politeness, gender). In this work, we propose to improve translation quality by adding another external source of information: the automatically recognized emotion in the voice. This work is motivated by the assumption that each emotion is associated with a specific lexicon that can overlap between emotions. Our proposed method follows a two-stage procedure. First, we select a state-of-the-art Speech Emotion Recognition (SER) model to predict dimensional emotion values from all input audio in the dataset. Then, we use these predicted emotions as source tokens added at the beginning of input texts to train our NMT model. We show that integrating emotion information, especially arousal, into NMT systems leads to better translations.
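The method prepends predicted emotion values as source tokens; the sketch below shows one way to discretize a predicted arousal value into such a token and tag a source sentence. The bin thresholds and token names are assumptions.

```python
def arousal_token(arousal, low=0.33, high=0.66):
    """Discretize a predicted arousal value in [0, 1] into a control token (bins assumed)."""
    if arousal < low:
        return "<arousal_low>"
    if arousal < high:
        return "<arousal_mid>"
    return "<arousal_high>"

def tag_source(sentence, arousal):
    """Prepend the emotion token to the NMT source sentence."""
    return f"{arousal_token(arousal)} {sentence}"

print(tag_source("I can't believe you did that!", 0.85))
# -> "<arousal_high> I can't believe you did that!"
```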
https://arxiv.org/abs/2404.17968
Efficiently capturing consistent and complementary semantic features in a multimodal conversation context is crucial for Multimodal Emotion Recognition in Conversation (MERC). Existing methods mainly use graph structures to model dialogue context semantic dependencies and employ Graph Neural Networks (GNNs) to capture multimodal semantic features for emotion recognition. However, these methods are limited by some inherent characteristics of GNNs, such as over-smoothing and low-pass filtering, resulting in the inability to learn long-distance consistency information and complementary information efficiently. Since consistency and complementarity information correspond to low-frequency and high-frequency information, respectively, this paper revisits the problem of multimodal emotion recognition in conversation from the perspective of the graph spectrum. Specifically, we propose a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework, GS-MCC. First, GS-MCC uses a sliding window to construct a multimodal interaction graph to model conversational relationships and uses efficient Fourier graph operators to extract long-distance high-frequency and low-frequency information, respectively. Then, GS-MCC uses contrastive learning to construct self-supervised signals that reflect complementarity and consistent semantic collaboration with the high- and low-frequency signals, thereby improving the ability of the high- and low-frequency information to reflect real emotions. Finally, GS-MCC feeds the collaborative high- and low-frequency information into an MLP network and a softmax function for emotion prediction. Extensive experiments demonstrate the superiority of the proposed GS-MCC architecture on two benchmark datasets.
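GS-MCC separates consistency (low-frequency) from complementarity (high-frequency) information with Fourier graph operators; the sketch below shows the underlying graph Fourier decomposition of node features via the normalized Laplacian. The half-spectrum cutoff is an assumption, and the paper's operators avoid the full eigendecomposition used here for clarity.

```python
import numpy as np

def graph_frequency_split(adj, features, cutoff_ratio=0.5):
    """Split node features into low- and high-frequency components using the
    eigenvectors of the symmetric normalized graph Laplacian.
    adj: (N, N) symmetric adjacency; features: (N, D) node features."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    laplacian = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(laplacian)       # ascending eigenvalues = frequencies
    k = int(cutoff_ratio * len(eigvals))
    spectrum = eigvecs.T @ features                    # graph Fourier transform
    low = eigvecs[:, :k] @ spectrum[:k]                # smooth / consistency part
    high = eigvecs[:, k:] @ spectrum[k:]               # oscillatory / complementarity part
    return low, high
```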
https://arxiv.org/abs/2404.17862