Holistic perception of affective attributes is an important human perceptual ability. However, this ability is far from being realized in current affective computing, as not all of the attributes are well studied and their interrelationships are poorly understood. In this work, we investigate the relationship between two affective attributes: personality and emotion, from a transfer learning perspective. Specifically, we transfer Transformer-based and wav2vec-based emotion recognition models to perceive personality from speech across corpora. Compared with previous studies, our results show that transferring emotion recognition is effective for personality perception. Moreover, this allows for better use and exploration of small personality corpora. We also provide novel findings on the relationship between personality and emotion that will aid future research on holistic affect recognition.
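To make the transfer setup above concrete, here is a minimal sketch of attaching a personality head to a pre-trained speech encoder; the checkpoint name, pooling, and five-trait regression head are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch: reuse a speech encoder (stand-in for an emotion-tuned model)
# and attach a new head that regresses Big Five personality scores.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # stand-in for an SER-trained encoder
personality_head = nn.Linear(encoder.config.hidden_size, 5)        # OCEAN trait regression (assumed design)

waveform = torch.randn(1, 16000)                                   # ~1 s of 16 kHz audio
hidden = encoder(waveform).last_hidden_state.mean(dim=1)           # pooled utterance embedding
print(personality_head(hidden))                                    # trait predictions from the (untrained) head
```

In practice the encoder would first be fine-tuned on an emotion corpus and then adapted on the smaller personality corpus.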
https://arxiv.org/abs/2305.16076
In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing the ASR performance on emotion corpora and examining the distribution of word errors and confidence scores in ASR transcripts to gain insight into how emotion affects ASR. We utilize four ASR systems, namely Kaldi ASR, wav2vec, Conformer, and Whisper, and three corpora: IEMOCAP, MOSI, and MELD to ensure generalizability. Additionally, we conduct text-based SER on ASR transcripts with increasing word error rates to investigate how ASR affects SER. The objective of this study is to uncover the relationship and mutual impact of ASR and SER, in order to facilitate ASR adaptation to emotional speech and the use of SER in the real world.
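As a rough illustration of the analysis described above, the sketch below scores ASR hypotheses against reference transcripts grouped by emotion label using the jiwer package; the toy triples stand in for corpus data and are not from the paper.

```python
# Hedged sketch: per-emotion word error rate over (reference, hypothesis, emotion) triples.
from collections import defaultdict
import jiwer  # pip install jiwer

samples = [  # toy stand-ins for utterances from an emotion corpus such as IEMOCAP
    ("i can't believe you did that", "i can't believe you did that", "angry"),
    ("this is the best day ever", "this is the best day other", "happy"),
    ("i don't really want to talk", "i don't want to walk", "sad"),
]

per_emotion = defaultdict(lambda: {"refs": [], "hyps": []})
for ref, hyp, emo in samples:
    per_emotion[emo]["refs"].append(ref)
    per_emotion[emo]["hyps"].append(hyp)

for emo, pair in per_emotion.items():
    wer = jiwer.wer(pair["refs"], pair["hyps"])  # corpus-level WER restricted to this emotion
    print(f"{emo}: WER = {wer:.2%}")
```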
https://arxiv.org/abs/2305.16065
Longitudinal Dialogues (LD) are the most challenging type of conversation for human-machine dialogue systems. LDs include the recollections of events, personal thoughts, and emotions specific to each individual in a sparse sequence of dialogue sessions. Dialogue systems designed for LDs should uniquely interact with the users over multiple sessions and long periods of time (e.g. weeks), and engage them in personal dialogues to elaborate on their feelings, thoughts, and real-life events. In this paper, we study the task of response generation in LDs. We evaluate whether general-purpose Pre-trained Language Models (PLM) are appropriate for this purpose. We fine-tune two PLMs, GePpeTto (GPT-2) and iT5, using a dataset of LDs. We experiment with different representations of the personal knowledge extracted from LDs for grounded response generation, including the graph representation of the mentioned events and participants. We evaluate the performance of the models via automatic metrics and the contribution of the knowledge via the Integrated Gradients technique. We categorize the natural language generation errors via human evaluations of contextualization, appropriateness and engagement of the user.
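Since the knowledge contribution is measured with Integrated Gradients, a self-contained Riemann-sum version of that attribution technique is sketched below on a toy network; the actual analysis runs on the fine-tuned PLMs, so the model and input here are placeholders.

```python
# Hedged sketch: Integrated Gradients approximated by a Riemann sum along the
# straight path from a zero baseline to the input.
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
x = torch.randn(1, 4)              # stand-in for an input representation
baseline = torch.zeros_like(x)

steps = 50
total_grad = torch.zeros_like(x)
for k in range(1, steps + 1):
    point = baseline + (k / steps) * (x - baseline)
    point.requires_grad_(True)
    model(point).sum().backward()  # gradient of the output w.r.t. this interpolation point
    total_grad += point.grad

attributions = (x - baseline) * total_grad / steps   # per-feature contribution scores
print(attributions)
```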
https://arxiv.org/abs/2305.15908
The term "Code Mixed" refers to the use of more than one language in the same text. This phenomenon is predominantly observed on social media platforms, with adoption increasing over time. It is critical to detect foreign elements in a language and process them correctly, as a considerable number of individuals use code-mixed language that cannot be comprehended by understanding only one of its constituent languages. In this work, we focus on the low-resource Hindi-English code-mixed setting and on enhancing the performance of different code-mixed natural language processing tasks such as sentiment analysis, emotion recognition, and hate speech identification. We perform a comparative analysis of different Transformer-based language models pre-trained using unsupervised approaches. We include code-mixed models such as HingBERT, HingRoBERTa, HingRoBERTa-Mixed, and mBERT, and non-code-mixed models such as AlBERT, BERT, and RoBERTa, for comparative analysis of code-mixed Hindi-English downstream tasks. We report state-of-the-art results on the respective datasets using HingBERT-based models, which are specifically pre-trained on real code-mixed text. Our HingBERT-based models provide significant improvements, thus highlighting the poor performance of vanilla BERT models on code-mixed text.
https://arxiv.org/abs/2305.15722
Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks. While LLMs are increasingly deployed in many forms including conversational agents that interact with humans, we lack a grounded benchmark to measure how well LLMs understand social language. Here, we introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge which we group into five categories: humor & sarcasm, offensiveness, sentiment & emotion, and trustworthiness. In tests on the benchmark, we demonstrate that current models attain only moderate performance but reveal significant potential for task transfer among different types and categories of tasks, which were predicted from theory. Through zero-shot evaluations, we show that pretrained models already possess some innate but limited capabilities of social language understanding and training on one category of tasks can improve zero-shot testing on others. Our benchmark provides a systematic way to analyze model performance on an important dimension of language and points to clear room for improvement to build more socially-aware LLMs. The associated resources are released at this https URL.
https://arxiv.org/abs/2305.14938
We present metrics for evaluating dialog systems through a psychologically-grounded "human" lens: conversational agents express a diversity of both states (short-term factors like emotions) and traits (longer-term factors like personality) just as people do. These interpretable metrics consist of five measures from established psychology constructs that can be applied both across dialogs and on turns within dialogs: emotional entropy, linguistic style and emotion matching, as well as agreeableness and empathy. We compare these human metrics against 6 state-of-the-art automatic metrics (e.g. BARTScore and BLEURT) on 7 standard dialog system data sets. We also introduce a novel data set, the Three Bot Dialog Evaluation Corpus, which consists of annotated conversations from ChatGPT, GPT-3, and BlenderBot. We demonstrate that the proposed human metrics offer novel information, are uncorrelated with automatic metrics, and lead to increased accuracy beyond existing automatic metrics for predicting crowd-sourced dialog judgements. The interpretability and unique signal of our proposed human-centered framework make it a valuable tool for evaluating and improving dialog systems.
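One of the proposed measures, emotional entropy, reduces to Shannon entropy over an agent's emotion distribution; a small sketch (illustrative, not the authors' implementation) follows.

```python
# Hedged sketch: emotional entropy from per-turn emotion counts or probabilities.
import numpy as np

def emotional_entropy(emotion_probs):
    """Shannon entropy (in bits) of a dialog's emotion distribution."""
    p = np.asarray(emotion_probs, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# e.g. counts of predicted emotions across an agent's turns: joy, anger, sadness, neutral
print(emotional_entropy([8, 1, 1, 10]))   # low entropy: a narrow emotional range
print(emotional_entropy([5, 5, 5, 5]))    # maximal entropy: a uniform emotional range
```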
https://arxiv.org/abs/2305.14757
While several previous studies have analyzed gender bias in research, we are still missing a comprehensive analysis of gender differences in the AI community, covering diverse topics and different development trends. Using the AI Scholar dataset of 78K researchers in the field of AI, we identify several gender differences: (1) Although female researchers tend to have fewer overall citations than males, this citation difference does not hold for all academic-age groups; (2) There exists strong gender homophily in co-authorship on AI papers; (3) Female first-authored papers show distinct linguistic styles, such as longer text, more positive emotion words, and catchier titles than male first-authored papers. Our analysis provides a window into the current demographic trends in our AI community, and encourages more gender equality and diversity in the future. Our code and data are at this https URL.
https://arxiv.org/abs/2305.14597
The most meaningful connections between people are often fostered through expression of shared vulnerability and emotional experiences in personal narratives. We introduce a new task of identifying similarity in personal stories based on empathic resonance, i.e., the extent to which two people empathize with each others' experiences, as opposed to raw semantic or lexical similarity, as has predominantly been studied in NLP. Using insights from social psychology, we craft a framework that operationalizes empathic similarity in terms of three key features of stories: main events, emotional trajectories, and overall morals or takeaways. We create EmpathicStories, a dataset of 1,500 personal stories annotated with our empathic similarity features, and 2,000 pairs of stories annotated with empathic similarity scores. Using our dataset, we fine-tune a model to compute empathic similarity of story pairs, and show that this outperforms semantic similarity models on automated correlation and retrieval metrics. Through a user study with 150 participants, we also assess the effect our model has on retrieving stories that users empathize with, compared to naive semantic similarity-based retrieval, and find that participants empathized significantly more with stories retrieved by our model. Our work has strong implications for the use of empathy-aware models to foster human connection and empathy between people.
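For contrast with the fine-tuned empathic model, the sketch below shows the kind of off-the-shelf semantic-similarity baseline the paper compares against; the encoder name and example stories are assumptions.

```python
# Hedged sketch: generic bi-encoder cosine similarity between two personal stories.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic semantic encoder, not tuned on EmpathicStories

story_a = "I failed my first driving test and felt like everyone was disappointed in me."
story_b = "After months of practice, my recital went badly and I couldn't face my family."

emb = model.encode([story_a, story_b], convert_to_tensor=True)
print(float(util.cos_sim(emb[0], emb[1])))  # raw semantic similarity; the empathic model is instead trained on annotated pairs
```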
https://arxiv.org/abs/2305.14246
In Emotion Recognition in Conversations (ERC), the emotions of target utterances are closely dependent on their context. Therefore, existing works train the model to generate the response of the target utterance, which aims to recognise emotions leveraging contextual information. However, adjacent response generation ignores long-range dependencies and provides limited affective information in many cases. In addition, most ERC models learn a unified distributed representation for each utterance, which lacks interpretability and robustness. To address these issues, we propose a VAD-disentangled Variational AutoEncoder (VAD-VAE), which first introduces a target utterance reconstruction task based on Variational Autoencoder, then disentangles three affect representations Valence-Arousal-Dominance (VAD) from the latent space. We also enhance the disentangled representations by introducing VAD supervision signals from a sentiment lexicon and minimising the mutual information between VAD distributions. Experiments show that VAD-VAE outperforms the state-of-the-art model on two datasets. Further analysis proves the effectiveness of each proposed module and the quality of disentangled VAD representations. The code is available at this https URL.
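The core objective can be pictured as a VAE reconstruction term plus lexicon-based VAD supervision on three latent slices; the sketch below is a compact approximation with assumed dimensions, and it omits the mutual-information minimisation term.

```python
# Hedged sketch: a latent space split into valence/arousal/dominance chunks,
# each supervised by a VAD target, alongside utterance reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVADVAE(nn.Module):
    def __init__(self, in_dim=768, z_dim=3 * 16):
        super().__init__()
        self.enc_mu = nn.Linear(in_dim, z_dim)
        self.enc_logvar = nn.Linear(in_dim, z_dim)
        self.dec = nn.Linear(z_dim, in_dim)
        self.vad_heads = nn.ModuleList([nn.Linear(16, 1) for _ in range(3)])  # one scalar per V/A/D chunk

    def forward(self, h):
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        recon = self.dec(z)
        chunks = torch.chunk(z, 3, dim=-1)                        # valence / arousal / dominance slices
        vad_pred = torch.cat([head(c) for head, c in zip(self.vad_heads, chunks)], dim=-1)
        return recon, vad_pred, mu, logvar

def loss_fn(h, vad_target, model, beta=1.0, gamma=1.0):
    recon, vad_pred, mu, logvar = model(h)
    rec = F.mse_loss(recon, h)                                    # utterance reconstruction
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    sup = F.mse_loss(vad_pred, vad_target)                        # lexicon-derived VAD supervision
    return rec + beta * kld + gamma * sup

model = TinyVADVAE()
h = torch.randn(4, 768)            # stand-in utterance encodings
vad = torch.rand(4, 3)             # stand-in lexicon VAD targets in [0, 1]
print(loss_fn(h, vad, model).item())
```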
https://arxiv.org/abs/2305.14071
Speech Emotion Recognition (SER) is a critical enabler of emotion-aware communication in human-computer interactions. Deep Learning (DL) has improved the performance of SER models by improving model complexity. However, designing DL architectures requires prior experience and experimental evaluations. Encouragingly, Neural Architecture Search (NAS) allows automatic search for an optimum DL model. In particular, Differentiable Architecture Search (DARTS) is an efficient method of using NAS to search for optimised models. In this paper, we propose DARTS for a joint CNN and LSTM architecture for improving SER performance. Our choice of the CNN-LSTM coupling is inspired by results showing that similar models offer improved performance. While SER researchers have considered CNNs and RNNs separately, the viability of using DARTS jointly for a CNN and LSTM still needs exploration. Experimenting with the IEMOCAP dataset, we demonstrate that our approach outperforms the best-reported results using DARTS for SER.
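The continuous relaxation at the heart of DARTS, applied to a CNN cell feeding an LSTM, can be sketched as follows; the operation choices, dimensions, and single mixed cell are assumptions for illustration.

```python
# Hedged sketch: a DARTS-style mixed operation whose architecture weights (alpha)
# softly select among candidate 1-D convolutions before an LSTM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2),
            nn.Identity(),                      # skip-connection candidate
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)  # continuous relaxation over candidate ops
        return sum(w * op(x) for w, op in zip(weights, self.ops))

x = torch.randn(2, 40, 100)                     # (batch, feature channels, frames), e.g. 40-dim mel features
cell = MixedOp(channels=40)
lstm = nn.LSTM(input_size=40, hidden_size=64, batch_first=True)
seq, _ = lstm(cell(x).transpose(1, 2))          # searched CNN features feed the LSTM
print(seq.shape)
```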
https://arxiv.org/abs/2305.14402
Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for seen speakers during training, without consideration of the generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at this https URL.
https://arxiv.org/abs/2305.13831
We propose ChatGPT-EDSS, an empathetic dialogue speech synthesis (EDSS) method using ChatGPT for extracting dialogue context. ChatGPT is a chatbot that can deeply understand the content and purpose of an input prompt and appropriately respond to the user's request. We focus on ChatGPT's reading comprehension and introduce it to EDSS, a task of synthesizing speech that can empathize with the interlocutor's emotion. Our method first gives chat history to ChatGPT and asks it to generate three words representing the intention, emotion, and speaking style for each line in the chat. Then, it trains an EDSS model using the embeddings of ChatGPT-derived context words as the conditioning features. The experimental results demonstrate that our method performs comparably to ones using emotion labels or neural network-derived context embeddings learned from chat histories. The collected ChatGPT-derived context information is available at this https URL.
https://arxiv.org/abs/2305.13724
Fusing multiple modalities for affective computing tasks has proven effective for performance improvement. However, how multimodal fusion works is not well understood, and its use in the real world usually results in large model sizes. In this work, on sentiment and emotion analysis, we first analyze how the salient affective information in one modality can be affected by the other in crossmodal attention. We find that inter-modal incongruity exists at the latent level due to crossmodal attention. Based on this finding, we propose a lightweight model via Hierarchical Crossmodal Transformer with Modality Gating (HCT-MG), which determines a primary modality according to its contribution to the target task and then hierarchically incorporates auxiliary modalities to alleviate inter-modal incongruity and reduce information redundancy. The experimental evaluation on three benchmark datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP verifies the efficacy of our approach, showing that it: 1) outperforms major prior work by achieving competitive results and can successfully recognize hard samples; 2) mitigates the inter-modal incongruity at the latent level when modalities have mismatched affective tendencies; 3) reduces model size to less than 1M parameters while outperforming existing models of similar sizes.
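A hedged sketch of the primary-plus-auxiliary fusion idea follows: the primary modality attends over an auxiliary one and a learned gate controls how much of the attended signal is injected; the module names, gate design, and dimensions are mine, not the HCT-MG implementation.

```python
# Hedged sketch: gated crossmodal attention where a primary modality queries an auxiliary one.
import torch
import torch.nn as nn

class GatedCrossmodalBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, primary, auxiliary):
        attended, _ = self.attn(query=primary, key=auxiliary, value=auxiliary)
        g = self.gate(torch.cat([primary, attended], dim=-1))   # per-position gate in (0, 1)
        return primary + g * attended                           # gated residual fusion

text = torch.randn(2, 20, 64)     # primary modality, e.g. text tokens
audio = torch.randn(2, 50, 64)    # auxiliary modality, e.g. acoustic frames
print(GatedCrossmodalBlock()(text, audio).shape)   # torch.Size([2, 20, 64])
```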
https://arxiv.org/abs/2305.13583
Emotion understanding is an essential but highly challenging component of artificial general intelligence. The absence of extensively annotated datasets has significantly impeded advancements in this field. We present EmotionCLIP, the first pre-training paradigm to extract visual emotion representations from verbal and nonverbal communication using only uncurated data. Compared to numerical labels or descriptions used in previous methods, communication naturally contains emotion information. Furthermore, acquiring emotion representations from communication is more congruent with the human learning process. We guide EmotionCLIP to attend to nonverbal emotion cues through subject-aware context encoding and verbal emotion cues using sentiment-guided contrastive learning. Extensive experiments validate the effectiveness and transferability of EmotionCLIP. Using merely linear-probe evaluation protocol, EmotionCLIP outperforms the state-of-the-art supervised visual emotion recognition methods and rivals many multimodal approaches across various benchmarks. We anticipate that the advent of EmotionCLIP will address the prevailing issue of data scarcity in emotion understanding, thereby fostering progress in related domains. The code and pre-trained models are available at this https URL.
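The sentiment-guided contrastive objective builds on the standard CLIP-style InfoNCE loss between paired embeddings, sketched generically below (the sentiment-guided weighting itself is not reproduced here).

```python
# Hedged sketch: symmetric InfoNCE loss between matched video and text embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature               # pairwise cosine similarities
    targets = torch.arange(len(v))               # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

video = torch.randn(8, 256)                      # stand-in clip embeddings
text = torch.randn(8, 256)                       # stand-in subtitle embeddings
print(clip_contrastive_loss(video, text).item())
```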
https://arxiv.org/abs/2305.13500
Automatic detection and classification of animal sounds has many applications in biodiversity monitoring and animal behaviour. In the past twenty years, the volume of digitised wildlife sound available has massively increased, and automatic classification through deep learning now shows strong results. However, bioacoustics is not a single task but a vast range of small-scale tasks (such as individual ID, call type, emotional indication) with wide variety in data characteristics, and most bioacoustic tasks do not come with strongly-labelled training data. The standard paradigm of supervised learning, focussed on a single large-scale dataset and/or a generic pre-trained algorithm, is insufficient. In this work we recast bioacoustic sound event detection within the AI framework of few-shot learning. We adapt this framework to sound event detection, such that a system can be given the annotated start/end times of as few as 5 events, and can then detect events in long-duration audio -- even when the sound category was not known at the time of algorithm training. We introduce a collection of open datasets designed to strongly test a system's ability to perform few-shot sound event detections, and we present the results of a public contest to address the task. We show that prototypical networks are a strong-performing method, when enhanced with adaptations for general characteristics of animal sounds. We demonstrate that widely-varying sound event durations are an important factor in performance, as well as non-stationarity, i.e. gradual changes in conditions throughout the duration of a recording. For fine-grained bioacoustic recognition tasks without massive annotated training data, our results demonstrate that few-shot sound event detection is a powerful new method, strongly outperforming traditional signal-processing detection methods in the fully automated scenario.
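The prototypical-network approach reported as strong here reduces, in its basic form, to computing class prototypes from the few annotated events and labelling query frames by nearest prototype; the sketch below uses random embeddings purely for illustration.

```python
# Hedged sketch: few-shot event detection via prototypes (class-mean embeddings)
# and nearest-prototype assignment of query frames.
import torch

def prototypes(support_emb, support_labels, n_classes):
    return torch.stack([support_emb[support_labels == c].mean(0) for c in range(n_classes)])

def classify(query_emb, protos):
    dists = torch.cdist(query_emb, protos)       # Euclidean distance to each prototype
    return dists.argmin(dim=1)                   # nearest prototype wins

support = torch.randn(10, 128)                   # embeddings of 5 event + 5 background snippets
labels = torch.tensor([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
queries = torch.randn(200, 128)                  # embeddings of frames from a long recording
protos = prototypes(support, labels, n_classes=2)
print(classify(queries, protos).shape)           # per-frame event / background decision
```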
https://arxiv.org/abs/2305.13210
The increasing adoption of text-to-speech technologies has led to a growing demand for natural and emotive voices that adapt to a conversation's context and emotional tone. This need is particularly relevant for interactive narrative-driven systems such as video games, TV shows, and graphic novels. To address this need, we present the Emotive Narrative Storytelling (EMNS) corpus, a dataset of high-quality British English speech with labelled utterances designed to enhance interactive experiences with dynamic and expressive language. We provide high-quality clean audio recordings and natural language description pairs with transcripts and user-reviewed and self-reported labels for features such as word emphasis, expressiveness, and emotion labels. EMNS improves on existing datasets by providing higher quality and clean recording to aid more natural and expressive speech synthesis techniques for interactive narrative-driven experiences. Additionally, we release our remote and scalable data collection system to the research community.
https://arxiv.org/abs/2305.13137
This paper presents Reflective Linguistic Programming (RLP), a unique approach to conversational AI that emphasizes self-awareness and strategic planning. RLP encourages models to introspect on their own predefined personality traits, emotional responses to incoming messages, and planned strategies, enabling contextually rich, coherent, and engaging interactions. A striking illustration of RLP's potential involves a toy example, an AI persona with an adversarial orientation, a demon named `Bogus' inspired by the children's fairy tale Hansel & Gretel. Bogus exhibits sophisticated behaviors, such as strategic deception and sensitivity to user discomfort, that spontaneously arise from the model's introspection and strategic planning. These behaviors are not pre-programmed or prompted, but emerge as a result of the model's advanced cognitive modeling. The potential applications of RLP in socially-aware AGI (Social AGI) are vast, from nuanced negotiations and mental health support systems to the creation of diverse and dynamic AI personas. Our exploration of deception serves as a stepping stone towards a new frontier in AGI, one filled with opportunities for advanced cognitive modeling and the creation of truly human `digital souls'.
https://arxiv.org/abs/2305.12647
New-age conversational agent systems perform both speech emotion recognition (SER) and automatic speech recognition (ASR) using two separate and often independent approaches for real-world application in noisy environments. In this paper, we investigate a joint ASR-SER multitask learning approach in a low-resource setting and show that improvements are observed not only in SER, but also in ASR. We also investigate the robustness of such jointly trained models to the presence of background noise, babble, and music. Experimental results on the IEMOCAP dataset show that joint learning can improve ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3% respectively in clean scenarios. In noisy scenarios, results on data augmented with MUSAN show that the joint approach outperforms the independent ASR and SER approaches across many noisy conditions. Overall, the joint ASR-SER approach yielded more noise-resistant models than the independent ASR and SER approaches.
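A hedged sketch of the multitask idea follows: a shared encoder feeds a CTC head for transcription and a pooled classification head for emotion, with a weighted sum of the two losses; the architecture, dimensions, and loss weight are assumptions rather than the paper's configuration.

```python
# Hedged sketch: joint ASR (CTC) + SER (cross-entropy) training from a shared encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointASRSER(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, vocab=30, n_emotions=4):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.asr_head = nn.Linear(2 * hidden, vocab)        # per-frame token logits for CTC
        self.ser_head = nn.Linear(2 * hidden, n_emotions)   # utterance-level emotion logits

    def forward(self, feats):
        enc, _ = self.encoder(feats)
        return self.asr_head(enc), self.ser_head(enc.mean(dim=1))

model = JointASRSER()
feats = torch.randn(2, 120, 80)                             # (batch, frames, filterbank dims)
asr_logits, ser_logits = model(feats)

log_probs = F.log_softmax(asr_logits, dim=-1).transpose(0, 1)   # CTC expects (T, N, C)
targets = torch.randint(1, 30, (2, 20))                         # dummy token targets (0 is the blank)
ctc = F.ctc_loss(log_probs, targets,
                 input_lengths=torch.full((2,), 120),
                 target_lengths=torch.full((2,), 20))
ser = F.cross_entropy(ser_logits, torch.tensor([0, 3]))
print((ctc + 0.5 * ser).item())                              # illustrative loss weighting
```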
https://arxiv.org/abs/2305.12540
We present JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora lack phrase or emotion diversity, which makes it difficult to analyze NVs and support downstream tasks like emotion recognition. We first propose a corpus-design method that contains two phases: (1) collecting NVs phrases based on crowd-sourcing; (2) recording NVs by stimulating speakers with emotional scenarios. We then collect 420 audio clips from 4 speakers that cover 6 emotions based on the proposed method. Results of comprehensive objective and subjective experiments demonstrate that the collected NVs have high emotion recognizability and authenticity that are comparable to previous corpora of English NVs. Additionally, we analyze the distributions of vowel types in Japanese NVs. To our best knowledge, JNV is currently the largest Japanese NVs corpus in terms of phrase and emotion diversities.
https://arxiv.org/abs/2305.12445
Accurately modeling affect dynamics, which refers to the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. By analyzing affect dynamics, we can gain insights into how people communicate, respond to different situations, and form relationships. However, modeling affect dynamics is challenging due to contextual factors, such as the complex and nuanced nature of interpersonal relationships, the situation, and other factors that influence affective displays. To address this challenge, we propose a Cross-person Memory Transformer (CPM-T) framework which is able to explicitly model affective dynamics (intrapersonal and interpersonal influences) by identifying verbal and non-verbal cues, and with a large language model to utilize the pre-trained knowledge and perform verbal reasoning. The CPM-T framework maintains memory modules to store and update the contexts within the conversation window, enabling the model to capture dependencies between earlier and later parts of a conversation. Additionally, our framework employs cross-modal attention to effectively align information from multi-modalities and leverage cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and generalizability of our approach on three publicly available datasets for joint engagement, rapport, and human beliefs prediction tasks. Remarkably, the CPM-T framework outperforms baseline models in average F1-scores by up to 7.3%, 9.3%, and 2.0% respectively. Finally, we demonstrate the importance of each component in the framework via ablation studies with respect to multimodal temporal behavior.
https://arxiv.org/abs/2305.12369