This review aims to systematically assess the current status and prospects of artificial intelligence (AI) in the rehabilitation management of patients with schizophrenia and its impact on the rehabilitation process. We selected 70 studies from 2012 to the present, focusing on the applications, technology categories, products, and data types of machine learning, deep learning, reinforcement learning, and other technologies in mental health interventions and management. The results indicate that AI can be widely used in symptom monitoring, relapse risk prediction, and rehabilitation treatment by analyzing ecological momentary assessment, behavioral, and speech data. This review further explores the potential challenges and future directions of emerging AI-based products, technologies, and analytical methods in rehabilitation, such as social media analysis, serious games, and large language models. In summary, this study systematically reviews the application status of AI in schizophrenia rehabilitation management and provides valuable insights and recommendations for future research.
https://arxiv.org/abs/2405.10883
The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which efficiently generates high-quality motion codes. Moreover, we introduce a novel conditioning method for the TTS system, which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that generalises to unseen identities.
https://arxiv.org/abs/2405.10272
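As a rough illustration of the conditional-flow-matching idea behind the motion sampler, here is a minimal, self-contained training-step sketch; the network, dimensions, and conditioning input are hypothetical stand-ins, not the paper's architecture.

```python
# Minimal conditional flow matching (CFM) training step; placeholders only.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the velocity field v(x_t, t, cond) for motion codes."""
    def __init__(self, motion_dim=128, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(model, x1, cond):
    """x1: target motion codes; cond: e.g. text/audio conditioning."""
    x0 = torch.randn_like(x1)            # noise sample
    t = torch.rand(x1.size(0), 1)        # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # linear interpolation path
    v_target = x1 - x0                   # constant velocity along that path
    return ((model(x_t, t, cond) - v_target) ** 2).mean()

model = VelocityNet()
loss = cfm_loss(model, torch.randn(8, 128), torch.randn(8, 256))
loss.backward()
```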
Trigger points are a concept introduced by Mau, Lux, and Westheuser (2023) in qualitative focus group research to understand polarisation in Germany. When people communicate, trigger points are moments when individuals feel that their understanding of what is fair, normal, or appropriate in society is being questioned. In the original studies, individuals react affectively to such triggers and show strong, negative emotional responses. In this paper, we introduce the first systematic, large-scale study of individual words as trigger points by analysing a large corpus of social media posts. We examine online deliberations on Reddit between 2020 and 2022 and collect >100 million posts from subreddits related to a set of words identified as trigger points in UK politics. We find that such trigger words affect user engagement and have noticeable consequences for animosity in online discussions. We provide empirical evidence that trigger words cause animosity and create incentives for hate speech, adversarial debates, and disagreements. Our work is the first to introduce trigger points to computational studies of online communication. Our findings are relevant to researchers interested in online harms and in how citizens debate politics and society in light of affective polarisation.
https://arxiv.org/abs/2405.10213
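The engagement comparison described above follows a simple computational pattern; a hypothetical sketch with illustrative column names and an illustrative word list (not the paper's data or trigger-word set):

```python
# Contrast engagement/animosity for posts with vs. without trigger words.
import pandas as pd

posts = pd.read_json("reddit_posts.jsonl", lines=True)  # assumed dump
trigger_words = {"immigration", "brexit", "benefits"}   # illustrative only

def has_trigger(text: str) -> bool:
    tokens = set(text.lower().split())
    return bool(tokens & trigger_words)

posts["triggered"] = posts["text"].map(has_trigger)

# Mean engagement and toxicity for the two groups (columns are assumed).
summary = posts.groupby("triggered")[["num_comments", "toxicity"]].mean()
print(summary)
```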
Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of the high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on the Luganda Common Voice recordings of multiple speakers aged 20-49. Although the generated speech is intelligible, it is still of lower quality than that of a model trained on studio-grade recordings, owing to insufficient preprocessing of the Common Voice recordings. Furthermore, speech convergence is harder to achieve because of varying intonations and background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can be improved by training on multiple speakers of close intonation and by further preprocessing of the training data. Specifically, we selected six female speakers with close intonation, determined by subjectively listening to and comparing their recordings. In addition to trimming silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also used a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to retain only recordings with an estimated MOS above 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers show that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
https://arxiv.org/abs/2405.10211
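The preprocessing chain described above (silence trimming, enhancement, MOS-based filtering) can be sketched as follows; `enhance_fn` and `mos_fn` stand in for the pre-trained enhancement and MOS-estimation models, which are not named here:

```python
# Trim silent edges, enhance, keep only clips with estimated MOS > 3.5.
import librosa
import soundfile as sf

def preprocess(path, enhance_fn, mos_fn, out_path, mos_threshold=3.5):
    audio, sr = librosa.load(path, sr=None)
    trimmed, _ = librosa.effects.trim(audio, top_db=30)  # cut silent edges
    enhanced = enhance_fn(trimmed, sr)                   # e.g. a denoiser
    if mos_fn(enhanced, sr) > mos_threshold:             # keep high quality
        sf.write(out_path, enhanced, sr)
        return True
    return False
```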
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and the rich information in the N-best list, GER is highly effective in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but deviate from the source speech content; 2) N-best hypotheses usually differ in only a few tokens, making it redundant to send all of them for GER, which can confuse the LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) that receives the source speech as extra input to improve the fidelity of the correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets.
https://arxiv.org/abs/2405.10025
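To make the cloze reformatting concrete, here is a toy sketch of one way to turn an N-best list into a cloze test, assuming (unrealistically) that hypotheses are already token-aligned; the paper's actual procedure, including logits calibration, is more involved:

```python
# Positions where all hypotheses agree become fixed context; disagreeing
# positions become blanks the LLM must fill.
def make_cloze(nbest: list[list[str]], blank: str = "<mask>"):
    length = len(nbest[0])
    assert all(len(h) == length for h in nbest), "assumes aligned hypotheses"
    cloze, options = [], []
    for i in range(length):
        candidates = {h[i] for h in nbest}
        if len(candidates) == 1:          # all hypotheses agree: keep token
            cloze.append(nbest[0][i])
        else:                             # disagreement: mask, record choices
            cloze.append(blank)
            options.append(sorted(candidates))
    return " ".join(cloze), options

text, opts = make_cloze([
    "i want to go home".split(),
    "i want to no home".split(),
    "i want to know home".split(),
])
print(text)   # i want to <mask> home
print(opts)   # [['go', 'know', 'no']]
```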
Monaural speech enhancement on drones is challenging because the ego-noise from the rotating motors and propellers leads to extremely low signal-to-noise ratios at onboard microphones. Although recent masking-based deep neural network methods excel at monaural speech enhancement, they struggle in the challenging drone-noise scenario. Furthermore, existing drone noise datasets are limited, causing models to overfit. Considering the harmonic nature of drone noise, this paper proposes a frequency-domain bottleneck adapter to enable transfer learning. Specifically, the adapter's parameters are trained on drone noise while the parameters of the pre-trained Frequency Recurrent Convolutional Recurrent Network (FRCRN) remain fixed. Evaluation results demonstrate that the proposed method can effectively enhance speech quality. Moreover, it is a more efficient alternative to fine-tuning models for various drone types, which typically requires substantial computational resources.
https://arxiv.org/abs/2405.10022
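The adapter-based transfer setup can be sketched generically as below; the GRU backbone is a placeholder for the pre-trained FRCRN, and the adapter placement and sizes are illustrative assumptions:

```python
# Bottleneck adapter along the frequency axis; backbone frozen, adapter trained.
import torch
import torch.nn as nn

class FreqBottleneckAdapter(nn.Module):
    def __init__(self, n_freq=257, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(n_freq, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, n_freq)

    def forward(self, x):                            # x: (batch, time, freq)
        return x + self.up(self.act(self.down(x)))   # residual adapter

backbone = nn.GRU(257, 257, batch_first=True)        # stand-in for FRCRN
adapter = FreqBottleneckAdapter()

for p in backbone.parameters():                      # freeze pre-trained weights
    p.requires_grad = False

opt = torch.optim.Adam(adapter.parameters(), lr=1e-4)  # train adapter only
x = torch.randn(4, 100, 257)
h, _ = backbone(x)
out = adapter(h)
```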
In this work, we present Semantic Gesticulator, a novel framework designed to synthesize realistic gestures accompanying speech with strong semantic correspondence. Semantically meaningful gestures are crucial for effective non-verbal communication, but such gestures often fall within the long tail of the distribution of natural human motion. The sparsity of these movements makes it challenging for deep learning-based systems, trained on moderately sized datasets, to capture the relationship between the movements and the corresponding speech semantics. To address this challenge, we develop a generative retrieval framework based on a large language model. This framework efficiently retrieves suitable semantic gesture candidates from a motion library in response to the input speech. To construct this motion library, we summarize a comprehensive list of commonly used semantic gestures based on findings in linguistics, and we collect a high-quality motion dataset encompassing both body and hand movements. We also design a novel GPT-based model with strong generalization to audio, capable of generating high-quality gestures that match the rhythm of speech. Furthermore, we propose a semantic alignment mechanism to efficiently align the retrieved semantic gestures with the GPT's output, ensuring the naturalness of the final animation. Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit, as evidenced by a comprehensive collection of examples. User studies confirm the quality and human-likeness of our results, and show that our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
https://arxiv.org/abs/2405.09814
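As a toy sketch of LLM-based retrieval from a motion library, one might prompt the model to name a gesture and look it up; the library entries and `llm_complete` function are placeholders, and the paper's generative retrieval is considerably more sophisticated:

```python
# Prompt an LLM to choose a semantic gesture, then fetch the motion clip.
MOTION_LIBRARY = {
    "thumbs_up": "clips/thumbs_up.bvh",     # illustrative entries
    "head_shake": "clips/head_shake.bvh",
    "open_palms": "clips/open_palms.bvh",
}

def retrieve_gesture(speech_text: str, llm_complete) -> str | None:
    prompt = (
        "Pick the most fitting semantic gesture for this utterance from "
        f"{sorted(MOTION_LIBRARY)}: '{speech_text}'. Answer with one name."
    )
    tag = llm_complete(prompt).strip()
    return MOTION_LIBRARY.get(tag)   # None if the LLM names no valid entry
```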
Recent advances in generative language modeling applied to discrete speech tokens have presented a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their textual counterparts, are scalable, probabilistic, and context-aware. While they can produce diverse and natural outputs, they sometimes face issues such as unintelligibility and the inclusion of non-speech noises or hallucinations. As the adoption of this innovative paradigm in speech synthesis increases, there is a clear need for an in-depth evaluation of its capabilities and limitations. In this paper, we evaluate TTS from a discrete token-based SLM through both automatic metrics and listening tests. We examine five key dimensions: speaking style, intelligibility, speaker consistency, prosodic variation, and spontaneous behaviour. Our results highlight the model's strength in generating varied prosody and spontaneous outputs. It is also rated higher in naturalness and context appropriateness in listening tests than a conventional TTS system. However, the model's performance in intelligibility and speaker consistency lags behind traditional TTS. Additionally, we show that increasing the scale of SLMs offers a modest boost in robustness. Our findings aim to serve as a benchmark for future advancements in generative SLMs for speech synthesis.
https://arxiv.org/abs/2405.09768
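Two of the evaluated dimensions have natural automatic proxies; a sketch under the assumption that an ASR system and a speaker-embedding extractor are available (the paper's exact metrics may differ):

```python
# Intelligibility as WER between input text and an ASR transcript of the
# synthesized audio; speaker consistency as embedding cosine similarity.
import numpy as np
from jiwer import wer

def intelligibility(text: str, audio, asr_transcribe) -> float:
    return wer(text.lower(), asr_transcribe(audio).lower())

def speaker_consistency(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
```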
Spoken language interaction is at the heart of interpersonal communication, and people flexibly adapt their speech to different individuals and environments. It is surprising that robots, and by extension other digital devices, are not equipped to adapt their speech and instead rely on fixed speech parameters, which often hinder comprehension by the user. We conducted a speech comprehension study involving 39 participants who were exposed to different environmental and contextual conditions. During the experiment, the robot articulated words using different vocal parameters, and the participants were tasked with both recognising the spoken words and rating their subjective impression of the robot's speech. The experiment's primary outcome shows that spaces with good acoustic quality correlate positively with intelligibility and user experience. However, increasing the distance between the user and the robot degraded the user experience, while distracting background sounds significantly reduced speech recognition accuracy and user satisfaction. We next built an adaptive voice for the robot. For this, the robot needs to know how difficult it is for a user to understand spoken language in a particular setting. We present a prediction model that rates how annoying the ambient acoustic environment is and, consequently, how hard it is to understand someone in this setting. Then, we develop a convolutional neural network model to adapt the robot's speech parameters to different users and spaces, while taking into account the influence of ambient acoustics on intelligibility. Finally, we present an evaluation with 27 users, demonstrating superior intelligibility and user experience with adaptive voice parameters compared to a fixed voice.
https://arxiv.org/abs/2405.09708
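The adaptation loop might look roughly like the following sketch: a small CNN scores ambient annoyance from a log-mel spectrogram, and the score drives the voice parameters. The architecture and the parameter mapping are illustrative guesses, not the paper's model:

```python
# Predict ambient annoyance in [0, 1], then adjust speech parameters.
import torch
import torch.nn as nn

class AnnoyanceCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, mel):                     # mel: (batch, 1, freq, time)
        return torch.sigmoid(self.head(self.features(mel).flatten(1)))

def adapt_voice(annoyance: float):
    """Map predicted annoyance to TTS parameters (mapping is illustrative)."""
    return {
        "volume_db": 6.0 * annoyance,           # speak up in noisy rooms
        "rate": 1.0 - 0.2 * annoyance,          # slow down slightly
    }
```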
In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security has received more attention than ever before, primarily due to the susceptibility of deep neural networks to adversarial examples. Previous studies have shown that surreptitiously crafted adversarial perturbations can manipulate speech recognition systems into producing malicious commands. These attack methods mostly require adding noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples based on Text-to-Speech (TTS) synthesis audio. However, style modifications driven by optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of a Style Transfer Attack (STA), which applies style transfer and adversarial attack in sequence. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method can meet the need for user-customized styles and achieves an attack success rate of 82%, while preserving sound naturalness according to our user study.
https://arxiv.org/abs/2405.09470
Generative artificial intelligence (AI) technologies and large models are producing realistic outputs across various domains, such as images, text, speech, and music. Creating these advanced generative models requires significant resources, particularly large and high-quality datasets. To minimize training expenses, many algorithm developers use data created by the models themselves as a cost-effective training solution. However, not all synthetic data effectively improve model performance, necessitating a strategic balance in the use of real versus synthetic data to optimize outcomes. Currently, the previously well-controlled integration of real and synthetic data is becoming uncontrollable. The widespread and unregulated dissemination of synthetic data online leads to the contamination of datasets traditionally compiled through web scraping, now mixed with unlabeled synthetic data. This trend portends a future where generative AI systems may increasingly rely on blindly consuming self-generated data, raising concerns about model performance and ethical issues. What will happen if generative AI continuously consumes itself without discernment? What measures can we take to mitigate the potential adverse effects? There is a significant gap in the scientific literature regarding the impact of synthetic data use in generative AI, particularly in terms of the fusion of multimodal information. To address this research gap, this review investigates the consequences of blindly integrating synthetic data into the training of generative AI on both image and text modalities, and explores strategies to mitigate these effects. The goal is to offer a comprehensive view of synthetic data's role, advocating for a balanced approach to its use and exploring practices that promote the sustainable development of generative AI technologies in the era of large models.
https://arxiv.org/abs/2405.09597
Our study addresses a significant gap in online hate speech detection research by focusing on homophobia, an area often neglected in sentiment analysis research. Utilising advanced sentiment analysis models, particularly BERT, and traditional machine learning methods, we developed a nuanced approach to identify homophobic content on X/Twitter. This research is pivotal due to the persistent underrepresentation of homophobia in detection models. Our findings reveal that while BERT outperforms traditional methods, the choice of validation technique can impact model performance. This underscores the importance of contextual understanding in detecting nuanced hate speech. By releasing the largest open-source labelled English dataset for homophobia detection known to us, together with an analysis of various models' performance and our strongest BERT-based model, we aim to enhance online safety and inclusivity. Future work will extend to broader LGBTQIA+ hate speech detection, addressing the challenges of sourcing diverse datasets. Through this endeavour, we contribute to the larger effort against online hate, advocating for a more inclusive digital landscape. Our study not only offers insights into the effective detection of homophobic content by improving on previous research results, but also lays the groundwork for future advancements in hate speech analysis.
https://arxiv.org/abs/2405.09221
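A minimal fine-tuning sketch of a BERT classifier with Hugging Face transformers; the CSV files, column names, and hyperparameters are placeholders rather than the released dataset's actual format:

```python
# Fine-tune BERT for binary homophobic-content classification.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)          # homophobic vs. not

# Placeholder files assumed to hold "text" and "label" columns.
data = load_dataset("csv", data_files={"train": "train.csv",
                                       "test": "test.csv"})
data = data.map(lambda b: tok(b["text"], truncation=True,
                              padding="max_length"), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
```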
It remains a challenge to effectively control emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At inference time, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.
https://arxiv.org/abs/2405.09171
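The hierarchical ED extraction can be pictured as pooling frame-level emotion intensities over phoneme, word, and utterance spans; a sketch assuming frame-level scores and segment boundaries are given (the paper's extractor is not reproduced here):

```python
# Pool per-frame emotion intensities to phoneme, word, and utterance levels.
import numpy as np

def pool_segments(frame_scores: np.ndarray, boundaries: list[tuple[int, int]]):
    """Average per-frame intensities within each (start, end) segment."""
    return np.array([frame_scores[s:e].mean() for s, e in boundaries])

def hierarchical_ed(frame_scores, phone_bounds, word_bounds):
    """frame_scores: (num_frames,) intensity of one emotion per frame."""
    phoneme_ed = pool_segments(frame_scores, phone_bounds)   # per phoneme
    word_ed = pool_segments(frame_scores, word_bounds)       # per word
    utterance_ed = frame_scores.mean()                       # whole utterance
    return phoneme_ed, word_ed, utterance_ed
```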
Current speaker diarization systems rely on an external voice activity detection (VAD) model prior to speaker embedding extraction on the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs as well as or better than comparable supervised VAD systems. Subsequently, speaker diarization can be performed efficiently by extracting the VAD logits and the corresponding speaker embedding simultaneously, alleviating the need for, and computational overhead of, an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline using ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy achieves state-of-the-art performance on the AMI, VoxConverse and DIHARD III diarization benchmarks.
https://arxiv.org/abs/2405.09142
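The core reuse of attention as VAD might be sketched as follows, assuming a model interface that exposes per-frame attention weights alongside the embedding; the normalization and thresholding details are illustrative:

```python
# One forward pass yields both the speaker embedding and a VAD decision.
import numpy as np

def embed_and_vad(model, waveform, threshold=0.5):
    # Assumed interface: returns (embedding, per-frame attention weights).
    embedding, attn = model(waveform)
    # Min-max normalize attention weights; high-attention frames = speech.
    scores = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    speech_mask = scores > threshold
    return embedding, speech_mask

# speech_mask can then replace the output of an external VAD model
# in the diarization pipeline.
```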
Binaural Audio Telepresence (BAT) aims to encode the acoustic scene at the far end into binaural signals for the user at the near end. BAT encompasses an immense range of applications that vary between two extreme modes: Immersive BAT (I-BAT) and Enhanced BAT (E-BAT). With I-BAT, the goal is to preserve the full ambience as if we were at the far end, while with E-BAT, the goal is to enhance the far-end conversation with significantly improved speech quality and intelligibility. To this end, this paper presents a tunable BAT system that varies between these two modes with a desired application-specific balance. Microphone signals are converted into binaural signals with a prescribed ambience factor. A novel Spatial COherence REpresentation (SCORE) is proposed as an input feature for model training so that the network remains robust to different array setups. Experimental results demonstrate the superior performance of the proposed BAT, even for array configurations not included in the training phase.
https://arxiv.org/abs/2405.08742
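The tunable trade-off between the two modes can be illustrated with a simple ambience-factor mix; this is a conceptual sketch, not the paper's rendering pipeline:

```python
# Blend an enhanced speech estimate with the ambient binaural rendering:
# alpha = 1 approximates I-BAT (full ambience), alpha = 0 approximates
# E-BAT (enhanced speech only). Signal shapes: (2, num_samples).
import numpy as np

def tunable_bat(binaural_ambience: np.ndarray,
                binaural_speech: np.ndarray,
                alpha: float) -> np.ndarray:
    assert 0.0 <= alpha <= 1.0
    return alpha * binaural_ambience + (1.0 - alpha) * binaural_speech
```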
This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments by evaluating our models on various audio classification benchmarks, including environmental sounds, speech and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that it significantly impacts the model's quality. In particular, we notice that some effective design choices in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.
https://arxiv.org/abs/2405.08679
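A stripped-down sketch of the JEPA recipe on mel-spectrograms, with toy linear encoders and a fixed context/target split along time; real systems use stronger encoders and typically an EMA copy of the context encoder as the target encoder:

```python
# Encode a context crop, predict the (stop-gradient) representation of a
# disjoint target crop, regress one onto the other.
import torch
import torch.nn as nn

context_enc = nn.Sequential(nn.Flatten(), nn.Linear(80 * 64, 256))
target_enc = nn.Sequential(nn.Flatten(), nn.Linear(80 * 64, 256))
predictor = nn.Linear(256, 256)

def jepa_loss(mel):                    # mel: (batch, 80 mels, 128 frames)
    context, target = mel[..., :64], mel[..., 64:]   # split along time
    z_ctx = context_enc(context)
    with torch.no_grad():              # target encoder provides fixed targets
        z_tgt = target_enc(target)
    return ((predictor(z_ctx) - z_tgt) ** 2).mean()

loss = jepa_loss(torch.randn(4, 80, 128))
loss.backward()
```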
The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning, an effective tool for detecting newly emerged deepfake audio while maintaining performance on older types, still lacks a well-constructed and user-friendly evaluation framework. To address this gap, we introduce EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA includes classic datasets from the Anti-Spoofing Voice series, the Chinese fake audio detection series, and newly generated deepfake audio from models like GPT-4 and GPT-4o. It supports various continual learning techniques, such as Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), and recent methods like Regularized Adaptive Weight Modification (RAWM) and Radian Weight Modification (RWM). Additionally, EVDA facilitates the development of robust algorithms by providing an open interface for integrating new continual learning methods.
https://arxiv.org/abs/2405.08596
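As an example of one supported method, a generic Elastic Weight Consolidation penalty can be written as below; `fisher` and `old_params` are snapshots from the previous task, and the weighting `lam` is illustrative:

```python
# EWC: keep parameters important for old detection tasks (per the Fisher
# information) close to their previous values while learning new deepfakes.
import torch

def ewc_penalty(model, fisher, old_params, lam=100.0):
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# total_loss = task_loss + ewc_penalty(model, fisher, old_params)
# where fisher/old_params are recorded after finishing the previous task.
```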
When studying political communication, combining the information from text, audio, and video signals promises to reflect the richness of human communication more comprehensively than confining it to individual modalities alone. However, when modeling such multimodal data, its heterogeneity, connectedness, and interaction are challenging to address. We argue that aligning the respective modalities can be an essential step in fully exploiting the potential of multimodal data, because alignment informs the model with human understanding. Exploring aligned modalities unlocks promising analytical leverage. First, it allows us to make the most of the information in the data, which inter alia opens the door to better-quality predictions. Second, it makes it possible to answer research questions that span multiple modalities with cross-modal queries. Finally, alignment addresses concerns about model interpretability. We illustrate the utility of this approach by analyzing how German MPs address members of the far-right AfD in their speeches, and by predicting the tone of video advertising in the context of the 2020 US presidential race. Our paper offers important insights to anyone keen to analyze multimodal data effectively.
https://arxiv.org/abs/2405.08454
Neural audio coding has emerged as a vibrant research direction, promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art: a discrete representation must be learned in the bottleneck of the autoencoder that allows for efficient transmission of the input audio signal. This discrete representation is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ), and a lot of effort has been spent on alleviating the drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose simple alternatives to VQ based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters, or codebook storage, thereby simplifying the training of neural audio codecs. Furthermore, we propose a new causal network architecture for neural speech coding that shows good performance at very low computational complexity.
https://arxiv.org/abs/2405.08417
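A sketch of what a projected scalar quantizer with a straight-through estimator might look like; the projection sizes and the number of levels are illustrative, and the paper's exact formulation may differ:

```python
# Project to a low-dimensional space, round each dimension to a fixed grid,
# and pass gradients through. No codebook, no auxiliary losses.
import torch
import torch.nn as nn

class ProjectedSQ(nn.Module):
    def __init__(self, in_dim=256, proj_dim=8, levels=15):
        super().__init__()
        self.down = nn.Linear(in_dim, proj_dim)
        self.up = nn.Linear(proj_dim, in_dim)
        # An odd number of levels keeps the grid symmetric in [-1, 1].
        self.scale = (levels - 1) / 2

    def forward(self, z):
        h = torch.tanh(self.down(z))                  # bound values to [-1, 1]
        q = torch.round(h * self.scale) / self.scale  # scalar-quantize per dim
        h = h + (q - h).detach()                      # straight-through estimator
        return self.up(h)

codec_bottleneck = ProjectedSQ()
out = codec_bottleneck(torch.randn(4, 256))
```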
Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that fine-tuning all layers of the learned model leads to lower performance compared to resetting the top layers. This phenomenon is attributed to the "autoencoder" behavior: the top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced "autoencoder" behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the training procedure result in faster convergence and competitive performance on downstream tasks.
https://arxiv.org/abs/2405.08402
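A layer-wise inspection of HuBERT's hidden states, as this kind of analysis requires, can be done directly with the transformers library; the probing task itself (labels and probe) is left open here:

```python
# Extract every hidden layer from HuBERT for layer-wise analysis.
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960",
                                    output_hidden_states=True)
model.eval()

wav = torch.randn(1, 16000)                      # 1 s of audio at 16 kHz
with torch.no_grad():
    out = model(wav)

# hidden_states: one (1, frames, 768) tensor per layer, plus the input layer.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i:2d}: frames={h.shape[1]}, dim={h.shape[2]}")
```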