Accurate recognition of human emotions is a crucial challenge in affective computing and human-robot interaction (HRI). Emotional states play a vital role in shaping behaviors, decisions, and social interactions. However, emotional expressions can be influenced by contextual factors, leading to misinterpretations if context is not considered. Multimodal fusion, combining modalities like facial expressions, speech, and physiological signals, has shown promise in improving affect recognition. This paper proposes a transformer-based multimodal fusion approach that leverages facial thermal data, facial action units, and textual context information for context-aware emotion recognition. We explore modality-specific encoders to learn tailored representations, which are then fused using additive fusion and processed by a shared transformer encoder to capture temporal dependencies and interactions. The proposed method is evaluated on a dataset collected from participants engaged in a tangible tabletop Pacman game designed to induce various affective states. Our results demonstrate the effectiveness of incorporating contextual information and multimodal fusion for affective state recognition.
https://arxiv.org/abs/2409.11906
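A minimal PyTorch sketch of the fusion scheme described in the abstract above: modality-specific encoders, additive fusion, and a shared transformer encoder over time. Layer sizes, modality dimensions (thermal features, action units, textual context), and the four-class head are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: modality-specific encoders + additive fusion + shared transformer.
import torch
import torch.nn as nn

class AdditiveFusionTransformer(nn.Module):
    def __init__(self, dims=None, d_model=128, n_heads=4, n_layers=2, n_classes=4):
        super().__init__()
        dims = dims or {"thermal": 64, "aus": 17, "context": 300}  # assumed feature sizes
        # One small encoder per modality, mapping each stream to a shared width.
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, d_model), nn.ReLU(), nn.LayerNorm(d_model))
            for name, d in dims.items()
        })
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, inputs):  # inputs: dict of (batch, time, dim) tensors
        # Additive fusion: sum the per-modality representations at each time step.
        fused = sum(self.encoders[name](x) for name, x in inputs.items())
        h = self.shared(fused)           # capture temporal dependencies and interactions
        return self.head(h.mean(dim=1))  # pool over time, predict the affective state

x = {"thermal": torch.randn(2, 50, 64),
     "aus": torch.randn(2, 50, 17),
     "context": torch.randn(2, 50, 300)}
print(AdditiveFusionTransformer()(x).shape)  # torch.Size([2, 4])
```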
Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, situations from emotions, and perform a variety of other forms of \emph{affective cognition}. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are ``superhuman'' -- they better predict modal human judgements than the average human does. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.
https://arxiv.org/abs/2409.11733
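To make the "superhuman" comparison concrete, here is a small illustrative computation with made-up labels: a model counts as superhuman on this metric when it matches the modal (most common) human judgement more often than the average individual participant does. The toy data and the agreement definition are assumptions for illustration, not the paper's exact protocol.

```python
from collections import Counter

human_judgements = [  # per scenario: the labels given by several human raters
    ["joy", "joy", "fear", "joy"],
    ["anger", "anger", "anger", "sadness"],
    ["fear", "surprise", "fear", "fear"],
]
model_judgements = ["joy", "anger", "surprise"]

def modal(labels):
    return Counter(labels).most_common(1)[0][0]

modes = [modal(h) for h in human_judgements]
model_acc = sum(m == mode for m, mode in zip(model_judgements, modes)) / len(modes)

# Average individual human accuracy against the modal judgement.
human_acc = sum(
    sum(label == mode for label in raters) / len(raters)
    for raters, mode in zip(human_judgements, modes)
) / len(modes)

print(f"model vs. mode: {model_acc:.2f}, average human vs. mode: {human_acc:.2f}")
```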
Phone scams pose a significant threat to individuals and communities, causing substantial financial losses and emotional distress. Despite ongoing efforts to combat these scams, scammers continue to adapt and refine their tactics, making it imperative to explore innovative countermeasures. This research explores the potential of large language models (LLMs) to detect fraudulent phone calls. By analyzing the conversational dynamics between scammers and victims, LLM-based detectors can identify potential scams as they occur, offering immediate protection to users. While such approaches demonstrate promising results, we also acknowledge the challenges of biased datasets, relatively low recall, and hallucinations that must be addressed for further advancement in this field.
https://arxiv.org/abs/2409.11643
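A hedged sketch of what call-time LLM screening could look like: build a prompt from the running transcript and ask a chat model for a verdict. The prompt wording and the `ask_llm` callable are hypothetical placeholders; the paper does not prescribe a specific API.

```python
# Minimal prompt-based scam screening over an ongoing call transcript.
def build_prompt(turns):
    dialogue = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return ("You monitor phone calls for fraud. Given the conversation so far, "
            "answer with SCAM or SAFE and one short reason.\n\n" + dialogue)

def screen_call(turns, ask_llm):
    verdict = ask_llm(build_prompt(turns))       # any chat LLM backend
    return verdict.strip().upper().startswith("SCAM")

# Example with a stub standing in for a real model:
turns = [("caller", "This is your bank, we need your one-time passcode now."),
         ("victim", "Why do you need the code?")]
print(screen_call(turns, ask_llm=lambda p: "SCAM: requests an OTP under pressure"))
```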
Pain is a more intuitive and user-friendly way of communicating problems, making it especially useful in rehabilitation nurse training robots. While most previous methods have focused on classifying or recognizing pain expressions, these approaches often result in unnatural, jiggling robot faces. We introduce PainDiffusion, a model that generates facial expressions in response to pain stimuli, with controllable pain expressiveness and emotion status. PainDiffusion leverages diffusion forcing to roll out predictions over arbitrary lengths using a conditioned temporal U-Net. It operates as a latent diffusion model within EMOCA's facial expression latent space, ensuring a compact data representation and quick rendering time. For training data, we process the BioVid Heatpain Database, extracting expression codes and subject identity configurations. We also propose a novel set of metrics to evaluate pain expressions, focusing on expressiveness, diversity, and the appropriateness of model-generated outputs. Finally, we demonstrate that PainDiffusion outperforms the autoregressive method, both qualitatively and quantitatively. Code, videos, and further analysis are available at: this https URL.
https://arxiv.org/abs/2409.11635
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. However, building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. In this paper, we propose an approach to enhance SER performance in languages with scarce SER resources by leveraging data from high-resource languages. Specifically, we employ expressive Speech-to-Speech translation (S2ST) combined with a novel bootstrapping data selection pipeline to generate labeled data in the target language. Extensive experiments demonstrate that our method is both effective and generalizable across different upstream models and languages. Our results suggest that this approach can facilitate the development of more scalable and robust multilingual SER systems.
https://arxiv.org/abs/2409.10985
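A rough sketch of the bootstrapping data-selection idea under stated assumptions: translate labeled high-resource speech into the target language with an expressive S2ST system (stubbed here), keep only samples on which the current SER model confidently agrees with the source label, and retrain. The `s2st_translate`, `predict_proba`, and `retrain` interfaces and the 0.8 threshold are placeholders, not the paper's exact pipeline.

```python
def bootstrap_ser(source_data, s2st_translate, ser_model, retrain,
                  threshold=0.8, rounds=3):
    """source_data: list of (waveform, emotion_label) pairs in a high-resource language."""
    selected = []
    for _ in range(rounds):
        # Translate each labeled utterance into the target language (stubbed S2ST).
        candidates = [(s2st_translate(wav), label) for wav, label in source_data]
        for wav_tgt, label in candidates:
            # predict_proba is assumed to return a dict: emotion label -> probability.
            probs = ser_model.predict_proba(wav_tgt)
            # Keep the sample only if the model confidently agrees with the source label.
            if probs.get(label, 0.0) >= threshold:
                selected.append((wav_tgt, label))
        ser_model = retrain(ser_model, selected)  # retrain on the growing selected set
    return ser_model, selected
```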
Audio-driven 3D facial animation has made impressive progress in both research and application development. The newest approaches focus on Transformer-based and diffusion-based methods; however, a gap in vividness and emotional expression remains between the generated animation and a real human face. To tackle this limitation, we propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction. This method generates varied and realistic human facial movements by predicting the 3D vertex trajectory on a 3D facial template with a diffusion policy, instead of generating the face frame by frame. It takes audio and vertex states as observations to predict the vertex trajectory and imitate real human facial expressions, preserving the continuous and natural flow of human emotions. The experiments show that our approach is effective in synthesizing varied and dynamic facial motion.
https://arxiv.org/abs/2409.10848
Speech Emotion Recognition (SER) systems rely on speech input and emotional labels annotated by humans. However, various emotion databases collect perceptual evaluations in different ways. For instance, the IEMOCAP dataset uses video clips with sound for annotators to provide their emotional perceptions. In contrast, the most significant English emotion dataset, the MSP-PODCAST, provides only speech for raters to choose the emotional ratings. Nevertheless, using speech as input is the standard approach to training SER systems. Therefore, the open question is which elicitation scenario yields emotional labels that are most effective for training SER systems. We comprehensively compare the effectiveness of SER systems trained with labels elicited by different modality stimuli and evaluate the SER systems under various testing conditions. We also introduce an all-inclusive label that combines all labels elicited by the various modalities. We show that using labels elicited by voice-only stimuli for training yields better performance on the test set.
https://arxiv.org/abs/2409.10762
Emojis have become an integral part of digital communication, enriching text by conveying emotions, tone, and intent. Existing emoji recommendation methods are primarily evaluated based on their ability to match the exact emoji a user chooses in the original text. However, they ignore the essence of users' behavior on social media in that each text can correspond to multiple reasonable emojis. To better assess a model's ability to align with such real-world emoji usage, we propose a new semantics preserving evaluation framework for emoji recommendation, which measures a model's ability to recommend emojis that maintain the semantic consistency with the user's text. To evaluate how well a model preserves semantics, we assess whether the predicted affective state, demographic profile, and attitudinal stance of the user remain unchanged. If these attributes are preserved, we consider the recommended emojis to have maintained the original semantics. The advanced abilities of Large Language Models (LLMs) in understanding and generating nuanced, contextually relevant output make them well-suited for handling the complexities of semantics preserving emoji recommendation. To this end, we construct a comprehensive benchmark to systematically assess the performance of six proprietary and open-source LLMs using different prompting techniques on our task. Our experiments demonstrate that GPT-4o outperforms other LLMs, achieving a semantics preservation score of 79.23%. Additionally, we conduct case studies to analyze model biases in downstream classification tasks and evaluate the diversity of the recommended emojis.
https://arxiv.org/abs/2409.10760
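An illustrative scoring routine for the semantics-preserving evaluation described above: a recommended emoji counts as preserving semantics when the predicted affective state, demographic profile, and attitudinal stance are unchanged after the emoji is appended. The `classify(text, attribute)` interface is an assumption standing in for whatever downstream classifiers are used.

```python
ATTRIBUTES = ("affect", "demographic", "stance")

def preserves_semantics(text, emoji, classify):
    # Predict each attribute before and after appending the recommended emoji.
    before = {a: classify(text, a) for a in ATTRIBUTES}
    after = {a: classify(text + " " + emoji, a) for a in ATTRIBUTES}
    return before == after  # preserved only if every attribute is unchanged

def preservation_score(samples, classify):
    # samples: list of (text, recommended_emoji) pairs; returns the fraction preserved.
    kept = [preserves_semantics(t, e, classify) for t, e in samples]
    return sum(kept) / len(kept)
```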
Social-emotional learning (SEL) skills are essential for children to develop, providing a foundation for future relational and academic success. Using art as a medium for creation or as a topic to provoke conversation is a well-known method of SEL learning. Similarly, social robots have been used to teach SEL competencies like empathy, but the combination of art and social robotics has been minimally explored. In this paper, we present a novel child-robot interaction designed to foster empathy and promote SEL competencies via a conversation about art scaffolded by a social robot. Participants (N=11, age range: 7-11) conversed with a social robot about emotional and neutral art. Analysis of video and speech data demonstrated that this interaction design successfully engaged children in the practice of SEL skills, like emotion recognition and self-awareness, and greater rates of empathetic reasoning were observed when children engaged with the robot about emotional art. This study demonstrated that art-based reflection with a social robot, particularly on emotional art, can foster empathy in children, and that interactions with a social robot help alleviate discomfort when sharing deep or vulnerable emotions.
https://arxiv.org/abs/2409.10710
Emotions are an essential element in verbal communication, so understanding individuals' affect during a human-robot interaction (HRI) becomes imperative. This paper investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models to individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from different human subjects having pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech samples from the participants. Our results show that fine-tuning vision transformers on benchmark datasets, and then either using these fine-tuned models directly or ensembling ViT/BEiT models, yields the highest per-individual classification accuracy when identifying four primary emotions from speech: neutral, happy, sad, and angry, compared to fine-tuning vanilla ViTs or BEiTs.
https://arxiv.org/abs/2409.10687
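A minimal sketch of the ensembling step: average the softmax outputs of already fine-tuned ViT- and BEiT-style classifiers over speech spectrogram images and take the argmax over the four emotions. The input shape, stand-in models, and simple late fusion are assumptions for illustration.

```python
import torch

EMOTIONS = ["neutral", "happy", "sad", "angry"]

@torch.no_grad()
def ensemble_predict(spectrogram_batch, vit_model, beit_model):
    # Late fusion: average the class probabilities from the two classifiers.
    probs_vit = torch.softmax(vit_model(spectrogram_batch), dim=-1)
    probs_beit = torch.softmax(beit_model(spectrogram_batch), dim=-1)
    avg = (probs_vit + probs_beit) / 2
    return [EMOTIONS[i] for i in avg.argmax(dim=-1).tolist()]

# Usage with stand-in models mapping (B, 3, 224, 224) spectrogram images to 4 logits:
vit = beit = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 4))
print(ensemble_predict(torch.randn(2, 3, 224, 224), vit, beit))
```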
Empathetic response generation necessitates the integration of emotional and intentional dynamics to foster meaningful interactions. Existing research either neglects the intricate interplay between emotion and intent, leading to suboptimal controllability of empathy, or resorts to large language models (LLMs), which incur significant computational overhead. In this paper, we introduce ReflectDiffu, a lightweight and comprehensive framework for empathetic response generation. This framework incorporates emotion contagion to augment emotional expressiveness and employs an emotion-reasoning mask to pinpoint critical emotional elements. Additionally, it integrates intent mimicry within reinforcement learning for refinement during diffusion. By harnessing an intent twice-reflect mechanism of Exploring-Sampling-Correcting, ReflectDiffu adeptly translates emotional decision-making into precise intent actions, thereby addressing empathetic response misalignments stemming from emotional misrecognition. Through reflection, the framework maps emotional states to intents, markedly enhancing both response empathy and flexibility. Comprehensive experiments reveal that ReflectDiffu outperforms existing models regarding relevance, controllability, and informativeness, achieving state-of-the-art results in both automatic and human evaluations.
https://arxiv.org/abs/2409.10289
Traditional approaches for analyzing RGB frames can provide a fine-grained understanding of a face from different angles by inferring emotions, poses, shapes, and landmarks. However, when it comes to subtle movements, standard RGB cameras may fall behind due to their latency, making it hard to detect the micro-movements that carry highly informative cues about a subject's true emotions. To address this issue, the use of event cameras to analyze faces is gaining increasing interest. Nonetheless, the expertise matured for RGB processing is not directly transferable to neuromorphic data due to a strong domain shift and intrinsic differences in how the data is represented. The lack of labeled data can be considered one of the main causes of this gap, yet gathering data is harder in the event domain since it cannot be crawled from the web, and labeling frames must take into account event aggregation rates and the fact that static parts may not be visible in certain frames. In this paper, we first present FACEMORPHIC, a multimodal, temporally synchronized face dataset comprising both RGB videos and event streams. The data is labeled at the video level with facial Action Units and also contains streams collected with a variety of applications in mind, ranging from 3D shape estimation to lip reading. We then show how temporal synchronization allows effective neuromorphic face analysis without the need to manually annotate videos: instead, we leverage cross-modal supervision to bridge the domain gap by representing face shapes in a 3D space.
https://arxiv.org/abs/2409.10213
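A hedged sketch of cross-modal supervision in the spirit described above: a frozen RGB regressor produces 3D face-shape/expression codes that serve as targets for a trainable event-stream encoder on temporally synchronized data, so no manual annotation is needed. Both networks are stand-ins and the 50-dimensional code is an assumption.

```python
import torch
import torch.nn as nn

rgb_teacher = nn.Linear(3 * 64 * 64, 50).eval()   # stand-in frozen RGB -> 3D code regressor
event_student = nn.Linear(2 * 64 * 64, 50)        # trainable event-stream encoder
opt = torch.optim.Adam(event_student.parameters(), lr=1e-3)

rgb_frames = torch.randn(16, 3 * 64 * 64)         # temporally synchronized RGB inputs
event_frames = torch.randn(16, 2 * 64 * 64)       # matching event-frame inputs

with torch.no_grad():
    target_code = rgb_teacher(rgb_frames)          # pseudo-labels in the 3D shape space
pred_code = event_student(event_frames)
loss = nn.functional.mse_loss(pred_code, target_code)  # cross-modal supervision signal
loss.backward()
opt.step()
print(float(loss))
```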
Current emotional text-to-speech (TTS) models predominantly conduct supervised training to learn the conversion from text and a desired emotion to the corresponding emotional speech, focusing on a single emotion per text-speech pair. These models only learn the correct emotional outputs without fully comprehending other emotion characteristics, which limits their ability to capture the nuances between different emotions. We propose a controllable Emo-DPO approach, which employs direct preference optimization to differentiate subtle emotional nuances by optimizing towards preferred emotions over less preferred ones. Instead of relying on the traditional neural architectures used in existing emotional TTS models, we propose an emotion-aware LLM-TTS neural architecture that leverages LLMs' in-context learning and instruction-following capabilities. Comprehensive experiments confirm that our proposed method outperforms the existing baselines.
https://arxiv.org/abs/2409.10157
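For reference, a worked sketch of the standard direct preference optimization (DPO) objective that an Emo-DPO-style approach builds on: given sequence log-likelihoods under the trainable policy and a frozen reference model for a preferred-emotion rendering and a less-preferred one, maximize the margin between the two log-ratios. The beta value and tensor inputs are illustrative; the paper's exact conditioning is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pref, policy_logp_rej,
             ref_logp_pref, ref_logp_rej, beta=0.1):
    # Log-ratio of policy vs. reference for the preferred and rejected outputs.
    pref_ratio = policy_logp_pref - ref_logp_pref
    rej_ratio = policy_logp_rej - ref_logp_rej
    # Push the preferred output's ratio above the rejected one's.
    return -F.logsigmoid(beta * (pref_ratio - rej_ratio)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-11.9]),
                torch.tensor([-12.0]), torch.tensor([-12.1]))
print(float(loss))
```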
This paper presents a novel deep neural network-based architecture tailored for Speech Emotion Recognition (SER). The architecture capitalises on dense interconnections among multiple layers of bidirectional dilated convolutions. A linear kernel dynamically fuses the outputs of these layers to yield the final emotion class prediction. This innovative architecture is denoted as TBDM-Net: Temporally-Aware Bi-directional Dense Multi-Scale Network. We conduct a comprehensive performance evaluation of TBDM-Net, including an ablation study, across six widely-acknowledged SER datasets for unimodal speech emotion recognition. Additionally, we explore the influence of gender-informed emotion prediction by appending either golden or predicted gender labels to the architecture's inputs or predictions. The implementation of TBDM-Net is accessible at: this https URL
https://arxiv.org/abs/2409.10056
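A hedged sketch of the named ingredients, not TBDM-Net itself: dilated 1-D convolutions run over the sequence in both temporal directions, outputs densely concatenated across layers, and a final linear layer fusing the multi-scale features into emotion logits. Channel counts, kernel size, and dilations are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseBiDilatedSER(nn.Module):
    def __init__(self, in_dim=40, channels=32, dilations=(1, 2, 4), n_classes=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        dim = in_dim
        for d in dilations:
            self.blocks.append(nn.Conv1d(dim, channels, kernel_size=3,
                                         dilation=d, padding=d))
            dim += 2 * channels  # dense growth: forward + backward outputs appended
        self.fuse = nn.Linear(dim, n_classes)  # linear fusion of multi-scale features

    def forward(self, x):            # x: (batch, features, time)
        feats = x
        for conv in self.blocks:
            fwd = torch.relu(conv(feats))
            bwd = torch.relu(conv(feats.flip(-1))).flip(-1)   # reversed-time pass
            feats = torch.cat([feats, fwd, bwd], dim=1)       # dense interconnection
        return self.fuse(feats.mean(dim=-1))                  # pool over time, classify

print(DenseBiDilatedSER()(torch.randn(2, 40, 100)).shape)  # torch.Size([2, 4])
```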
Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language instructions across a range of tasks, including robotic manipulation and navigation. However, existing methods are primarily designed for static environments and do not leverage the agent's own experiences to refine its initial plans. Given that real-world environments are inherently stochastic, initial plans based solely on LLMs' general knowledge may fail to achieve their objectives, unlike in static scenarios. To address this limitation, this study introduces the Experience-and-Emotion Map (E2Map), which integrates not only LLM knowledge but also the agent's real-world experiences, drawing inspiration from human emotional responses. The proposed methodology enables one-shot behavior adjustments by updating the E2Map based on the agent's experiences. Our evaluation in stochastic navigation environments, including both simulations and real-world scenarios, demonstrates that the proposed method significantly enhances performance in stochastic environments compared to existing LLM-based approaches. Code and supplementary materials are available at this https URL.
https://arxiv.org/abs/2409.10027
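An illustrative sketch of the kind of experience-driven map update described above: after a negative experience at a location, add a localized penalty so later planning avoids that region in one shot. The grid representation, Gaussian bump, and parameter values are assumptions for illustration, not the paper's update rule.

```python
import numpy as np

def update_emotion_map(e2map, event_xy, intensity=1.0, sigma=2.0):
    # Add a Gaussian penalty centered on the location of the negative experience.
    h, w = e2map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    bump = intensity * np.exp(-((xs - event_xy[0]) ** 2 +
                                (ys - event_xy[1]) ** 2) / (2 * sigma ** 2))
    return e2map + bump  # higher values mean "avoid this region" during planning

e2map = np.zeros((20, 20))
e2map = update_emotion_map(e2map, event_xy=(12, 5))  # e.g., a collision happened here
print(round(e2map[5, 12], 3))  # peak penalty at the event location
```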
Singing voice synthesis and conversion have emerged as significant subdomains of voice generation, leading to much demand for prompt-conditioned generation. Unlike common voice data, generating a singing voice requires an understanding of various associated vocal and musical characteristics, such as the vocal tone of the singer or emotional expressions. However, existing open-source audio-text datasets for voice generation tend to capture only a very limited range of attributes, often missing the musical characteristics of the audio. To fill this gap, we introduce S2Cap, an audio-text pair dataset with a diverse set of attributes. S2Cap consists of pairs of textual prompts and music audio samples with a wide range of vocal and musical attributes, including pitch, volume, tempo, mood, singer's gender and age, and musical genre and emotional expression. Utilizing S2Cap, we suggest an effective novel baseline algorithm for singing style captioning. Singing style captioning, a task we propose here for the first time, is related to voice generation and produces text descriptions of vocal characteristics. First, to mitigate the misalignment between the audio encoder and the text decoder, we present a novel mechanism called CRESCENDO, which uses positive-pair similarity learning to synchronize the embedding space of a pretrained audio encoder with that of a text encoder. We additionally supervise the model using the singer's voice, demixed from the accompaniment. This supervision allows the model to more accurately capture vocal characteristics, leading to improved singing style captions that better reflect the style of the singer. The dataset and the codes are available at this https URL.
https://arxiv.org/abs/2409.09866
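A minimal sketch of positive-pair similarity learning in the spirit of the mechanism described above: project the pretrained audio encoder's embedding and pull it toward the paired text embedding with a cosine objective, so the two encoders end up in a shared space. The projection head, dimensions, and loss form are assumptions, not CRESCENDO's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioToTextAligner(nn.Module):
    def __init__(self, audio_dim=768, text_dim=512):
        super().__init__()
        self.proj = nn.Linear(audio_dim, text_dim)  # trainable projection head

    def loss(self, audio_emb, text_emb):
        # Positive-pair objective: 1 - cosine similarity between matched embeddings.
        a = F.normalize(self.proj(audio_emb), dim=-1)
        t = F.normalize(text_emb, dim=-1)
        return (1 - (a * t).sum(dim=-1)).mean()

aligner = AudioToTextAligner()
print(float(aligner.loss(torch.randn(8, 768), torch.randn(8, 512))))
```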
Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
https://arxiv.org/abs/2409.09785
Micro-expressions are involuntary facial movements that cannot be consciously controlled, conveying subtle cues with substantial real-world applications. The analysis of micro-expressions generally involves two main tasks: spotting micro-expression intervals in long videos and recognizing the emotions associated with these intervals. Previous deep learning methods have primarily relied on classification networks utilizing sliding windows. However, fixed window sizes and window-level hard classification introduce numerous constraints. Additionally, these methods have not fully exploited the potential of complementary pathways for spotting and recognition. In this paper, we present a novel temporal state transition architecture grounded in the state space model, which replaces conventional window-level classification with video-level regression. Furthermore, by leveraging the inherent connections between spotting and recognition tasks, we propose a synergistic strategy that enhances overall analysis performance. Extensive experiments demonstrate that our method achieves state-of-the-art performance. The codes and pre-trained models are available at this https URL.
https://arxiv.org/abs/2409.09707
Expressing stressful experiences in words is proven to improve mental and physical health, but individuals often disengage with writing interventions as they struggle to organize their thoughts and emotions. Reflective prompts have been used to provide direction, and large language models (LLMs) have demonstrated the potential to provide tailored guidance. Current systems often limit users' flexibility to direct their reflections. We thus present ExploreSelf, an LLM-driven application designed to empower users to control their reflective journey. ExploreSelf allows users to receive adaptive support through dynamically generated questions. Through an exploratory study with 19 participants, we examine how participants explore and reflect on personal challenges using ExploreSelf. Our findings demonstrate that participants valued the balance between guided support and freedom to control their reflective journey, leading to deeper engagement and insight. Building on our findings, we discuss implications for designing LLM-driven tools that promote user empowerment through effective reflective practices.
https://arxiv.org/abs/2409.09662
As minimally verbal autistic (MVA) children communicate with parents through few words and nonverbal cues, parents often struggle to encourage their children to express subtle emotions and needs and to grasp their nuanced signals. We present AACessTalk, a tablet-based, AI-mediated communication system that facilitates meaningful exchanges between an MVA child and a parent. AACessTalk provides real-time guides to the parent to engage the child in conversation and, in turn, recommends contextual vocabulary cards to the child. Through a two-week deployment study with 11 MVA child-parent dyads, we examine how AACessTalk fosters everyday conversation practice and mutual engagement. Our findings show high engagement from all dyads, leading to increased frequency of conversation and turn-taking. AACessTalk also encouraged parents to explore their own interaction strategies and empowered the children to have more agency in communication. We discuss the implications of designing technologies for balanced communication dynamics in parent-MVA child interaction.
https://arxiv.org/abs/2409.09641