Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model being the most popular generative model, numerous works have attempted two active tasks: text-to-speech and speech enhancement. This work surveys audio diffusion models, complementing existing surveys that either lack the recent progress of diffusion-based speech synthesis or only give an overall picture of applying diffusion models across multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion models. For the text-to-speech task, we divide methods into three categories based on the stage at which the diffusion model is adopted: acoustic model, vocoder, and end-to-end framework. Moreover, we categorize various speech enhancement tasks by whether certain signals are removed from or added to the input speech. Comparisons of experimental results and discussions are also covered in this survey.
As one of the major branches of automatic speech recognition, attention-based models greatly improve the feature representation ability of the model. In particular, a multi-head mechanism is employed in the attention, in the hope of learning speech features from more aspects in different attention subspaces. For speech recognition of complex languages, on the one hand, a small number of heads leads to an obvious shortage of learnable aspects; on the other hand, increasing the number of heads requires reducing the dimension of each subspace to keep the size of the overall feature space unchanged, which significantly weakens the representation ability of each subspace. Therefore, this paper explores how to use small attention subspaces to represent complete speech features while retaining many heads. We propose a novel neural network architecture, namely a pyramid multi-branch fusion DCNN with multi-head self-attention. The proposed architecture is inspired by Dilated Convolutional Neural Networks (DCNN): it uses multiple DCNN branches to extract features of the input speech under different receptive fields. To reduce the number of parameters, every two branches are merged until all branches are merged into one, so the architecture's shape resembles a pyramid rotated by 90 degrees. We demonstrate that on AISHELL-1, a widely used Mandarin speech dataset, our model achieves a character error rate (CER) of 6.45% on the test set.
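The head-count versus subspace-dimension trade-off discussed in this abstract can be made concrete with a short sketch (NumPy; the sequence length and model dimension below are hypothetical, not taken from the paper):

```python
import numpy as np

def split_heads(x, num_heads):
    """Split a feature tensor of shape (seq_len, d_model) into
    (num_heads, seq_len, d_head) with d_head = d_model // num_heads."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

x = np.zeros((100, 256))                 # hypothetical: 100 frames, d_model=256
print(split_heads(x, 4).shape)           # (4, 100, 64): few heads, wide subspaces
print(split_heads(x, 16).shape)          # (16, 100, 16): many heads, narrow subspaces
```

With the total feature dimension fixed, every extra head shrinks each subspace, which is exactly the representation bottleneck the paper targets.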
Transformer-based models have recently made significant achievements in end-to-end (E2E) automatic speech recognition (ASR), making it possible to deploy E2E ASR systems on smart devices. However, these models still have the disadvantage of requiring a large number of parameters. To overcome this drawback of universal Transformer models for ASR on edge devices, we propose a solution that reuses blocks in Transformer models for small-footprint ASR systems, meeting the objective of accommodating resource limitations without compromising recognition accuracy. Specifically, we design a novel block-reusing strategy for the speech Transformer (BRST) to enhance parameter efficiency and propose an adapter module (ADM) that produces a compact and adaptable model, with only a few additional trainable parameters accompanying each reused block. We conducted experiments with the proposed method on the public AISHELL-1 corpus, and the results show that the proposed approach achieves character error rates (CER) of 9.3% and 6.63% with only 7.6M and 8.3M parameters without and with the ADM, respectively. In addition, we provide a deeper analysis of the effect of the ADM in the general block-reusing method.
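A rough, back-of-the-envelope parameter count illustrates why reusing a single Transformer block with small per-reuse adapters shrinks the footprint. All sizes below are hypothetical and the per-block formula is a simplification; this is not the BRST/ADM implementation:

```python
def transformer_params(d_model, d_ff):
    # crude per-block count: 4 attention projections + 2 FFN matrices (no biases)
    return 4 * d_model * d_model + 2 * d_model * d_ff

def total_params(num_layers, d_model=256, d_ff=1024, reuse=False, adapter_dim=32):
    block = transformer_params(d_model, d_ff)
    if not reuse:
        return num_layers * block
    # one shared block + a small bottleneck adapter (down + up projection) per reuse
    adapter = 2 * d_model * adapter_dim
    return block + num_layers * adapter

baseline = total_params(12)               # 12 independent blocks
reused = total_params(12, reuse=True)     # 1 shared block + 12 tiny adapters
print(baseline, reused)
```

Even in this toy accounting, the shared-block model keeps roughly one block's worth of weights plus a small adapter tax, mirroring the paper's 7.6M vs. 8.3M trade-off in spirit.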
Gesture synthesis has gained significant attention as a critical research area, focusing on producing contextually appropriate and natural gestures corresponding to speech or textual input. Although deep learning-based approaches have achieved remarkable progress, they often overlook the rich semantic information present in the text, leading to less expressive and meaningful gestures. We propose GesGPT, a novel approach to gesture generation that leverages the semantic analysis capabilities of Large Language Models (LLMs), such as GPT. By capitalizing on the strengths of LLMs for text analysis, we design prompts to extract gesture-related information from textual input. Our method entails developing prompt principles that transform gesture generation into an intention classification problem based on GPT, and utilizing a curated gesture library and integration module to produce semantically rich co-speech gestures. Experimental results demonstrate that GesGPT effectively generates contextually appropriate and expressive gestures, offering a new perspective on semantic co-speech gesture generation.
We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at this https URL.
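The coarse-to-fine residual vector quantization that LMCodec builds on can be sketched as follows (NumPy; the codebook sizes and vector dimension are illustrative, not the codec's actual configuration):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes the residual left by the previous one.
    x: (dim,) vector; codebooks: list of (num_codes, dim) arrays."""
    residual = x.copy()
    indices, quantized = [], np.zeros_like(x)
    for cb in codebooks:
        idx = np.argmin(np.sum((cb - residual) ** 2, axis=1))  # nearest codeword
        indices.append(int(idx))
        quantized += cb[idx]
        residual -= cb[idx]
    return indices, quantized

rng = np.random.default_rng(0)
x = rng.normal(size=8)
# later (finer) codebooks are scaled down, as they only refine the residual
codebooks = [rng.normal(size=(16, 8)) * (0.5 ** i) for i in range(4)]
idx, xq = rvq_encode(x, codebooks)
print(idx, np.linalg.norm(x - xq))
```

The first indices are the "coarse" tokens and the rest are "fine" tokens; LMCodec's language model then predicts the fine ones from the coarse ones so that fewer codes need to be transmitted.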
We show that training a multi-headed self-attention-based deep network to predict deleted, information-dense 2-8 Hz speech modulations over a 1.5-second section of a speech utterance is an effective way to make machines learn to extract speech modulations using time-domain contextual information. Our work shows that, once trained on large volumes of unlabelled data, the outputs of the self-attention layers vary in time with a modulation peak at 4 Hz. These pre-trained layers can be used to initialize parts of an automatic speech recognition system to greatly reduce its reliance on labeled speech data.
This work focuses on sign language retrieval, a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos not only contain visual signals but also carry abundant semantic meanings themselves, since sign languages are also natural languages. Considering this characteristic, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed cross-lingual contrastive learning. Another challenge is raised by the data scarcity issue: sign language datasets are orders of magnitude smaller in scale than those for speech recognition. We alleviate this issue by transferring a domain-agnostic sign encoder pre-trained on large-scale sign videos to the target domain via pseudo-labeling. Our framework, termed domain-aware sign language retrieval via Cross-lingual Contrastive learning (CiCo for short), outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on the How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on the PHOENIX-2014T dataset. Code and models are available at: this https URL.
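The joint-embedding contrastive objective can be illustrated with a symmetric InfoNCE loss over paired embeddings (a generic sketch in NumPy, not CiCo's exact loss; the batch size and embedding dimension are made up):

```python
import numpy as np

def contrastive_loss(sign_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired sign-video/text embeddings.
    Matched pairs sit on the diagonal of the similarity matrix."""
    sign = sign_emb / np.linalg.norm(sign_emb, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = sign @ text.T / temperature
    labels = np.arange(len(logits))
    # cross-entropy in both retrieval directions (sign->text and text->sign)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_s2t = -log_probs[labels, labels].mean()
    log_probs_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2s = -log_probs_t[labels, labels].mean()
    return (loss_s2t + loss_t2s) / 2

rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 32))
aligned = contrastive_loss(emb, emb)          # matched pairs are identical
shuffled = contrastive_loss(emb, emb[::-1])   # pairs deliberately mismatched
print(aligned, shuffled)
```

The loss is low only when each sign video is closest to its own text, which is the pressure that pulls the two languages into one embedding space.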
The advancement of speech technologies has been remarkable, yet their integration with African languages remains limited due to the scarcity of African speech corpora. To address this issue, we present AfroDigits, a minimalist, community-driven dataset of spoken digits for African languages, currently covering 38 African languages. As a demonstration of the practical applications of AfroDigits, we conduct audio digit classification experiments on six African languages [Igbo (ibo), Yoruba (yor), Rundi (run), Oshiwambo (kua), Shona (sna), and Oromo (gax)] using the Wav2Vec2.0-Large and XLS-R models. Our experiments reveal a useful insight into the effect of mixing African speech corpora during finetuning. AfroDigits is the first published audio digit dataset for African languages, and we believe it will, among other things, pave the way for Afro-centric speech applications such as the recognition of telephone numbers and street numbers. We release the dataset and platform publicly at this https URL and this https URL respectively.
Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve performance for real-world applications, mainly due to the long-tail words challenge. In this paper, we propose a text with knowledge graph augmented transformer (TextKG) for video captioning. Notably, TextKG is a two-stream transformer, formed by an external stream and an internal stream. The external stream is designed to absorb additional knowledge: it models the interactions between the additional knowledge, e.g., a pre-built knowledge graph, and the built-in information of videos, e.g., the salient object regions, speech transcripts, and video captions, to mitigate the long-tail words challenge. Meanwhile, the internal stream is designed to exploit the multi-modality information in videos (e.g., the appearance of video frames, speech transcripts, and video captions) to ensure the quality of caption results. In addition, a cross-attention mechanism is used between the two streams for sharing information. In this way, the two streams can help each other produce more accurate results. Extensive experiments conducted on four challenging video captioning datasets, i.e., YouCookII, ActivityNet Captions, MSRVTT, and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods. Specifically, the proposed TextKG method outperforms the best published results, improving absolute CIDEr scores by 18.7% on the YouCookII dataset.
In recent years, end-to-end speech recognition technology based on deep learning has developed rapidly. Due to the lack of Turkish speech data, the performance of Turkish speech recognition systems is poor. Firstly, this paper studies a series of speech recognition tuning techniques. The results show that model performance is best when data augmentation combining speed perturbation with noise addition is adopted and the beam search width is set to 16. Secondly, to maximize the use of effective feature information and improve the accuracy of feature extraction, this paper proposes a new feature extractor, LSPC. LSPC and a LiGRU network are combined to form a shared encoder structure, and model compression is realized. The results show that the performance of LSPC is better than that of MSPC and VGGnet when only Fbank features are used, improving the WER by 1.01% and 2.53%, respectively. Finally, based on the above two points, a new multi-feature fusion network is proposed as the main structure of the encoder. The results show that the proposed feature fusion network based on LSPC improves the WER by a further 0.82% and 1.94% compared with single-feature extraction (Fbank features and Spectrogram features, respectively) using LSPC. Our model achieves performance comparable to that of advanced end-to-end models.
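Since the results above are reported as WER, here is a minimal token-level error-rate computation via Levenshtein distance (a standard formulation, not specific to this paper):

```python
def error_rate(reference, hypothesis):
    """Levenshtein distance over tokens, normalized by reference length.
    Pass strings for CER or word lists for WER."""
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = minimum edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(r)

print(error_rate("speech", "spech"))  # 1 deletion / 6 chars ≈ 0.167
```

The same function computes the CER figures quoted elsewhere in this collection when called on character sequences.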
Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in both online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets, CALLHOME and the Fisher Corpus (Parts 1 and 2), and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between the SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, outperforming the current state-of-the-art end-to-end neural diarization model despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 seconds. Finally, we show that the separated signals can also be readily used for automatic speech recognition, reaching performance close to using oracle sources in some configurations.
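The separate-then-VAD pipeline at the heart of SSGD can be sketched with a toy energy-based VAD over already-separated streams (the frame size and threshold below are hypothetical; real systems use learned, causal VAD modules):

```python
import numpy as np

def ssgd_diarize(separated, frame=160, threshold=0.01):
    """SSGD second stage sketch: energy-based VAD on each separated stream.
    Returns per-speaker boolean activity per frame."""
    activity = []
    for stream in separated:
        n = len(stream) // frame
        frames = stream[: n * frame].reshape(n, frame)
        energy = (frames ** 2).mean(axis=1)
        activity.append(energy > threshold)
    return np.array(activity)

# toy input: speaker 1 talks first, speaker 2 talks second
rng = np.random.default_rng(0)
spk1 = np.concatenate([rng.normal(0, 0.3, 1600), np.zeros(1600)])
spk2 = np.concatenate([np.zeros(1600), rng.normal(0, 0.3, 1600)])
act = ssgd_diarize([spk1, spk2])
print(act.astype(int))
```

Diarization output follows directly: each row of `act` is one speaker's activity track, and overlapped speech simply shows up as multiple rows active in the same frame.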
Personalized TTS is an exciting and highly desired application that allows users to train their TTS voice using only a few recordings. However, TTS training typically requires many hours of recording and a large model, making it unsuitable for deployment on mobile devices. To overcome this limitation, related works typically require fine-tuning a pre-trained TTS model to preserve its ability to generate high-quality audio samples while adapting to the target speaker's voice. This process is commonly referred to as ``voice cloning.'' Although related works have achieved significant success in changing the TTS model's voice, they are still required to fine-tune from a large pre-trained model, resulting in a significant size for the voice-cloned model. In this paper, we propose applying trainable structured pruning to voice cloning. By training the structured pruning masks with voice-cloning data, we can produce a unique pruned model for each target speaker. Our experiments demonstrate that using learnable structured pruning, we can compress the model size to 7 times smaller while achieving comparable voice-cloning performance.
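As a simplified illustration of structured pruning, the sketch below drops whole output channels of a weight matrix by L1-norm magnitude. Note the paper learns its pruning masks from voice-cloning data, whereas this uses a plain magnitude heuristic with made-up sizes:

```python
import numpy as np

def prune_channels(weight, keep_ratio):
    """Structured pruning sketch: rank output channels of a (out, in) weight
    matrix by L1 norm and keep only the strongest fraction."""
    norms = np.abs(weight).sum(axis=1)
    k = max(1, int(len(norms) * keep_ratio))
    keep = np.sort(np.argsort(norms)[::-1][:k])  # indices of surviving channels
    return weight[keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
w_pruned, kept = prune_channels(w, keep_ratio=0.25)
print(w.size, w_pruned.size)  # 8192 -> 2048
```

Because entire rows are removed rather than individual weights, the pruned matrix is genuinely smaller and faster, which is what makes structured (as opposed to unstructured) pruning attractive for on-device deployment.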
The remarkable success of transformers in the field of natural language processing has sparked the interest of the speech-processing community, leading to an exploration of their potential for modeling long-range dependencies within speech sequences. Recently, transformers have gained prominence across various speech-related domains, including automatic speech recognition, speech synthesis, speech translation, speech para-linguistics, speech enhancement, spoken dialogue systems, and numerous multimodal applications. In this paper, we present a comprehensive survey that aims to bridge research studies from diverse subfields within speech technology. By consolidating findings from across the speech technology landscape, we provide a valuable resource for researchers interested in harnessing the power of transformers to advance the field. We identify the challenges encountered by transformers in speech processing while also offering insights into potential solutions to address these issues.
The Deep Speech Enhancement Challenge is the 5th edition of the deep noise suppression (DNS) challenges, organized at the ICASSP 2023 Signal Processing Grand Challenges. DNS challenges were organized during 2019-2023 to stimulate research in deep speech enhancement (DSE). Previous DNS challenges were organized at INTERSPEECH 2020, ICASSP 2021, INTERSPEECH 2021, and ICASSP 2022. From prior editions, we learned that improving signal quality (SIG) is challenging, particularly in the presence of simultaneously active interfering talkers and noise. This challenge aims to develop models for joint denoising, dereverberation, and suppression of interfering talkers. When the primary talker wears a headphone, certain acoustic properties of their speech, such as the direct-to-reverberation ratio (DRR) and signal-to-noise ratio (SNR), make it possible to suppress neighboring talkers even without enrollment data for the primary talker. This motivated us to create two tracks for this challenge: (i) Track-1 Headset; (ii) Track-2 Speakerphone. Both tracks have fullband (48 kHz) training data and test sets, and each test clip has corresponding enrollment data (10-30 s duration) for the primary talker. Each track invited submissions of personalized and non-personalized models, all of which were evaluated through the same subjective evaluation. Most models submitted to the challenge were personalized; the same team won both tracks, with the best models improving the challenge's score by 0.145 and 0.141 compared to the noisy blind test set.
Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to more acoustic/linguistic domains, languages, and modalities, these investigations are limited to single-source speech with one primary speaker in the recording. This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective. This objective encourages the model to identify the number of sources, separate and understand the context, and infer the content of masked regions represented as discovered units. Cocktail HuBERT outperforms state-of-the-art results with 69% lower WER on multi-speaker ASR, 31% lower DER on diarization, and is competitive on single- and multi-speaker tasks from SUPERB.
Speech-driven 3D face animation aims to generate realistic facial expressions that match the speech content and emotion. However, existing methods often neglect emotional facial expressions or fail to disentangle them from speech content. To address this issue, this paper proposes an end-to-end neural network to disentangle different emotions in speech so as to generate rich 3D facial expressions. Specifically, we introduce the emotion disentangling encoder (EDE) to disentangle the emotion and content in the speech by cross-reconstructed speech signals with different emotion labels. Then an emotion-guided feature fusion decoder is employed to generate a 3D talking face with enhanced emotion. The decoder is driven by the disentangled identity, emotional, and content embeddings so as to generate controllable personal and emotional styles. Finally, considering the scarcity of the 3D emotional talking face data, we resort to the supervision of facial blendshapes, which enables the reconstruction of plausible 3D faces from 2D emotional data, and contribute a large-scale 3D emotional talking face dataset (3D-ETF) to train the network. Our experiments and user studies demonstrate that our approach outperforms state-of-the-art methods and exhibits more diverse facial movements. We recommend watching the supplementary video: this https URL
The time-delay neural network (TDNN) is one of the state-of-the-art models for text-independent speaker verification. However, it is difficult for a conventional TDNN to capture global context, which has been proven critical for robust speaker representations and long-duration speaker verification in many recent works. Besides, the common solutions, e.g., self-attention, have quadratic complexity in the number of input tokens, which makes them computationally unaffordable when applied to the large feature maps in a TDNN. To address these issues, we propose the Global Filter for TDNN, which applies log-linear-complexity FFT/IFFT and a set of differentiable frequency-domain filters to efficiently model the long-term dependencies in speech. Besides, a dynamic filtering strategy and a sparse regularization method are specially designed to enhance the performance of the global filter and prevent it from overfitting. Furthermore, we construct a dual-stream TDNN (DS-TDNN), which splits the basic channels for complexity reduction and employs the global filter to increase recognition performance. Experiments on the VoxCeleb and SITW databases show that the DS-TDNN achieves an approximately 10% improvement while reducing complexity and parameter count by over 28% and 15%, respectively, compared with the ECAPA-TDNN. Besides, it has the best trade-off between efficiency and effectiveness compared with other popular baseline systems when facing long-duration speech. Finally, visualizations and a detailed ablation study further reveal the advantages of the DS-TDNN.
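The core idea of the global filter, frequency-domain filtering at log-linear FFT/IFFT cost instead of quadratic self-attention, can be sketched as follows (NumPy; the shapes and the identity filter are illustrative, and a real model would learn the filter weights):

```python
import numpy as np

def global_filter(x, freq_filter):
    """Frequency-domain filtering in O(n log n): FFT along time, multiply by
    a (learnable, here fixed) complex filter, then inverse FFT."""
    spec = np.fft.rfft(x, axis=-1)
    return np.fft.irfft(spec * freq_filter, n=x.shape[-1], axis=-1)

x = np.random.default_rng(0).normal(size=(4, 256))  # (channels, time frames)
identity = np.ones(129)                             # rfft of length 256 -> 129 bins
y = global_filter(x, identity)
print(np.allclose(x, y))  # True: an all-ones filter passes the input through
```

Because the elementwise multiply in the frequency domain corresponds to a circular convolution over the whole sequence, every output frame depends on every input frame, giving the global receptive field that self-attention provides, at far lower cost.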
Code-switching speech refers to a means of expression that mixes two or more languages within a single utterance. Automatic Speech Recognition (ASR) with End-to-End (E2E) modeling for such speech can be a challenging task due to the lack of data. In this study, we investigate text generation and injection for improving the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T), in Mandarin-English code-switching speech recognition. We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model, either explicitly via Text-To-Speech (TTS) conversion or implicitly by tying the speech and text latent spaces. Experimental results on a T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to injecting generated code-switching text significantly boost the performance of T-T models, i.e., a 16% relative Token-based Error Rate (TER) reduction averaged over three evaluation sets, and the approach of tying speech and text latent spaces is superior to TTS conversion on the evaluation set that contains data more homogeneous with the training set.
End-to-end speech recognition models are improved by incorporating external text sources, typically by fusion with an external language model. Such language models have to be retrained whenever the corpus of interest changes. Furthermore, since they store the entire corpus in their parameters, rare words can be challenging to recall. In this work, we propose augmenting a transducer-based ASR model with a retrieval language model, which directly retrieves from an external text corpus plausible completions for a partial ASR hypothesis. These completions are then integrated into subsequent predictions by an adapter, which is trained once, so that the corpus of interest can be switched without incurring the computational overhead of retraining. Our experiments show that the proposed model significantly improves the performance of a transducer baseline on a pair of question-answering datasets. Further, it outperforms shallow fusion on recognition of named entities by about 7% relative; when the two are combined, the relative improvement increases to 13%.
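A toy version of the retrieval step, finding plausible completions for a partial hypothesis in an external corpus, might look like the sketch below. This is simple prefix matching over a made-up corpus; the actual system's retrieval and adapter integration are more sophisticated:

```python
def retrieve_completions(partial_hypothesis, corpus, k=2):
    """Toy retrieval step: find corpus sentences extending a partial ASR
    hypothesis and return their continuations as candidate completions."""
    prefix = partial_hypothesis.lower()
    hits = [s for s in corpus if s.lower().startswith(prefix)]
    return [s[len(partial_hypothesis):].strip() for s in hits[:k]]

corpus = [
    "the eiffel tower is in paris",
    "the eiffel tower was completed in 1889",
    "the louvre is in paris",
]
print(retrieve_completions("the eiffel tower", corpus))
# ['is in paris', 'was completed in 1889']
```

Because completions come straight from the corpus rather than from model parameters, swapping the corpus changes the available continuations immediately, which is the retraining-free property the abstract highlights.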
This paper proposes a methodology for discovering meaningful properties in data by exploring the latent space of unsupervised deep generative models. We combine manipulation of individual latent variables to extreme values outside the training range with methods inspired by causal inference into an approach we call causal disentanglement with extreme values (CDEV) and show that this approach yields insights for model interpretability. Using this technique, we can infer what properties of unknown data the model encodes as meaningful. We apply the methodology to test what is meaningful in the communication system of sperm whales, one of the most intriguing and understudied animal communication systems. We train a network that has been shown to learn meaningful representations of speech and test whether we can leverage such unsupervised learning to decipher the properties of another vocal communication system for which we have no ground truth. The proposed technique suggests that sperm whales encode information using the number of clicks in a sequence, the regularity of their timing, and audio properties such as the spectral mean and the acoustic regularity of the sequences. Some of these findings are consistent with existing hypotheses, while others are proposed for the first time. We also argue that our models uncover rules that govern the structure of communication units in the sperm whale communication system and apply them while generating innovative data not shown during training. This paper suggests that an interpretation of the outputs of deep neural networks with causal methodology can be a viable strategy for approaching data about which little is known and presents another case of how deep learning can limit the hypothesis space. Finally, the proposed approach combining latent space manipulation and causal inference can be extended to other architectures and arbitrary datasets.
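The latent-probing half of CDEV, sweeping one latent dimension to values outside the training range and observing the decoder's output, can be sketched with a toy linear decoder (everything below is hypothetical; the paper's models are deep generative networks trained on whale vocalizations):

```python
import numpy as np

def probe_latent(decoder, z, dim, values):
    """Sweep one latent dimension across values (including values far outside
    the training range) and record the decoder outputs for comparison."""
    outputs = []
    for v in values:
        z_probe = z.copy()
        z_probe[dim] = v
        outputs.append(decoder(z_probe))
    return outputs

# toy linear "decoder": latent dim 1 drives the second output coordinate with gain 10
decoder = lambda z: z @ np.array([[1.0, 0.0], [0.0, 10.0]])
z = np.zeros(2)
outs = probe_latent(decoder, z, dim=1, values=[-5.0, 0.0, 5.0])
print(outs)
```

If an output property changes systematically as one latent is pushed to extreme values while others stay fixed, CDEV treats that as evidence the model encodes the property as meaningful; the causal-inference part of the method then formalizes that comparison.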