While supervised fine-tuning (SFT) has been a straightforward approach for tailoring the output of a foundation large language model (LLM) to specific preferences, concerns have been raised about the depth of this alignment, with some critiques suggesting it is merely "superficial". We critically examine this hypothesis within the scope of cross-lingual generation tasks, proposing that the effectiveness of SFT may be constrained by its reliance on prior tokens to guide cross-lingual generation. Based on this crucial insight, and in response to the challenges posed by the costly and limited availability of non-English data for SFT, we introduce a novel training-free alignment method named PreTTY, which employs minimal task-related prior tokens to bridge the foundation LLM and the SFT LLM, achieving comparable performance without training. Experiments on machine translation and part-of-speech tagging across eight languages demonstrate the efficacy of PreTTY in cross-lingual settings. Remarkably, by initiating the decoding process with only one or two prior tokens, foundation LLMs can achieve performance comparable to that of their SFT counterparts. This method presents a cost-effective alternative to SFT and advances the democratization of multilingual LLMs.
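As a hedged illustration of the prior-token idea (our own sketch under assumed names, not the released PreTTY code), the snippet below ends a translation prompt with a single target-language prior token before greedy decoding with a foundation causal LLM; the model name, prompt format, and the token "The" are placeholders.

```python
# Hedged sketch of prior-token decoding; the model name, prompt format, and the prior
# token "The" are illustrative assumptions, not the authors' released implementation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder foundation LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

src = "Der Bericht wurde gestern veröffentlicht."
# Ending the prompt with one or two target-language prior tokens ("The") nudges the
# foundation model toward SFT-like cross-lingual behaviour without any training.
prompt = f"Translate German to English.\nGerman: {src}\nEnglish: The"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
continuation = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("The" + continuation)  # prior token plus the foundation model's continuation
```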
https://arxiv.org/abs/2404.16766
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate the WER of a given speech utterance and transcript pair. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent). These are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR system-independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained using data that simulates ASR system output. Hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method reaches performance similar to that of ASR system-dependent WER estimators on in-domain data and achieves state-of-the-art performance on out-of-domain data. On the out-of-domain data, the SIWE model outperformed the baseline estimators in root mean square error and Pearson correlation coefficient by a relative 17.58% and 18.21%, respectively, on Switchboard and CALLHOME. The performance was further improved when the WER of the training set was close to the WER of the evaluation dataset.
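For reference, WER is the word-level edit distance between the reference and the hypothesis divided by the reference length; a minimal implementation of this standard metric (background only, not the proposed estimator) is:

```python
# Standard WER: (substitutions + deletions + insertions) / number of reference words.
def wer(ref_words, hyp_words):
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # all deletions
    for j in range(m + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / max(n, 1)

print(wer("the cat sat on the mat".split(), "the cat sit on mat".split()))  # 2/6 ≈ 0.33
```

The SIWE estimator predicts this quantity directly from an utterance and a (possibly simulated) hypothesis, without access to the ASR system that produced it.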
https://arxiv.org/abs/2404.16743
This paper presents the multi-speaker multi-lingual few-shot voice cloning system developed by the THU-HCSI team for the LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. To further improve speaker similarity and speech quality, we introduce a speaker-aware text encoder and a flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix them with the pre-training data, and adopt a speaker-balanced sampling strategy to guarantee effective fine-tuning for target speakers. The official evaluations in track 1 show that our system achieves the best speaker similarity MOS of 4.25 and obtains a considerable naturalness MOS of 3.97.
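A minimal sketch of what a speaker-balanced sampling strategy can look like (our assumption of the general idea, not the team's implementation): batches draw speakers uniformly, so few-shot target speakers are not drowned out by the much larger pre-training pool.

```python
# Hypothetical speaker-balanced sampler; field names ("speaker", "path") are placeholders.
import random
from collections import defaultdict

def speaker_balanced_batches(utterances, batch_size, rng=random):
    """Yield batches that sample speakers uniformly rather than utterances uniformly."""
    by_speaker = defaultdict(list)
    for utt in utterances:                      # utt = {"speaker": ..., "path": ...}
        by_speaker[utt["speaker"]].append(utt)
    speakers = list(by_speaker)
    while True:
        batch = []
        for _ in range(batch_size):
            spk = rng.choice(speakers)          # uniform over speakers, not utterances
            batch.append(rng.choice(by_speaker[spk]))
        yield batch
```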
https://arxiv.org/abs/2404.16619
This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done by employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been carried out at the phonetic level, allowing general speech recognition applications, even though a simplified task (digit and natural number recognition) has been considered for model evaluation. Different kinds of phone models have been tested, including context-independent models and two variations of context-dependent models. Furthermore, many experiments have been done with bigram language models to tune some of the system parameters. System performance over various speaker subsets with different sex, age and dialect has also been examined. Results are compared to previous similar studies, showing a remarkable improvement.
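For context, the bigram language models mentioned above factor the word sequence probability over adjacent word pairs, with maximum-likelihood estimates taken from counts (the textbook formulation; the paper's smoothing choices are not spelled out in the abstract):

\[
P(w_1, \dots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}.
\]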
https://arxiv.org/abs/2404.16547
Scale has opened new frontiers in natural language processing, but at a high cost. In response, by learning to activate only a subset of parameters in training and inference, Mixture-of-Experts (MoE) architectures have been proposed as an energy-efficient path to even larger and more capable language models, and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works incorporating MoE into ASR models have complex designs, such as routing frames via a supplementary embedding network, improving multilingual ability for the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We find that such delicate designs are not necessary; an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is sufficient for the ASR task. To be more specific, we benchmark our proposed model on a large-scale in-house dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate research on scaling speech foundation models without sacrificing deployment efficiency.
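A minimal sketch of the "embarrassingly simple" substitution described above, assuming a top-1 router and vanilla FFN experts; dimensions and routing details are illustrative rather than the paper's exact configuration.

```python
# Hedged sketch: an MoE layer that can stand in for a Conformer FFN block.
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, time, d_model)
        gates = self.router(x).softmax(dim=-1)  # routing probabilities per frame
        top_gate, top_idx = gates.max(dim=-1)   # top-1 expert per frame
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                 # frames routed to this expert
            if mask.any():
                out[mask] = expert(x[mask]) * top_gate[mask].unsqueeze(-1)
        return out

y = MoEFeedForward()(torch.randn(2, 50, 256))   # drop-in replacement for an FFN block
```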
https://arxiv.org/abs/2404.16407
Sequence modeling is a crucial area across various domains, including Natural Language Processing (NLP), speech recognition, time series forecasting, music generation, and bioinformatics. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks have historically dominated sequence modeling tasks like Machine Translation, Named Entity Recognition (NER), etc. However, the advancement of transformers has led to a shift in this paradigm, given their superior performance. Yet, transformers suffer from $O(N^2)$ attention complexity and challenges in handling inductive bias. Several variations have been proposed to address these issues, using spectral networks or convolutions, and have performed well on a range of tasks. However, they still have difficulty dealing with long sequences. State Space Models (SSMs) have emerged as promising alternatives for sequence modeling paradigms in this context, especially with the advent of S4 and its variants, such as S4nd, Hippo, Hyena, Diagonal State Spaces (DSS), Gated State Spaces (GSS), Linear Recurrent Unit (LRU), Liquid-S4, Mamba, etc. In this survey, we categorize the foundational SSMs based on three paradigms, namely Gating architectures, Structural architectures, and Recurrent architectures. This survey also highlights diverse applications of SSMs across domains such as vision, video, audio, speech, language (especially long sequence modeling), medicine (including genomics), chemistry (like drug design), recommendation systems, and time series analysis, including tabular data. Moreover, we consolidate the performance of SSMs on benchmark datasets like Long Range Arena (LRA), WikiText, GLUE, Pile, ImageNet, Kinetics-400, sstv2, as well as video datasets such as Breakfast, COIN, and LVU, and various time series datasets. The project page for the Mamba-360 work is available at this https URL.
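For readers new to the area, the surveyed models build on the standard linear state space formulation and its discretization (the textbook form, shown here for orientation rather than any one paper's notation):

\[
x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t),
\]
\[
x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k + D\,u_k,
\]

where \(\bar{A}\) and \(\bar{B}\) come from a chosen discretization rule (e.g. zero-order hold); the S4-family variants listed above differ mainly in how \(A\) is structured and how the recurrence is computed efficiently.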
https://arxiv.org/abs/2404.16112
We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where they are merged with audio features to manipulate each Gaussian attribute. This design exploits spatial-aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. This is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Experimental results showcase GaussianTalker's superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Specifically, GaussianTalker achieves a remarkable rendering speed of 120 FPS, surpassing previous benchmarks. Our code is made available at this https URL.
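A rough sketch of the spatial-audio attention idea, with per-Gaussian implicit features attending to audio features to predict frame-wise attribute offsets; dimensions, head count, and the offset layout are assumptions, not the authors' code.

```python
# Hedged sketch: per-Gaussian features (queries) attend to audio features (keys/values).
import torch
import torch.nn as nn

class SpatialAudioAttention(nn.Module):
    def __init__(self, feat_dim=64, audio_dim=64, offset_dim=10):
        super().__init__()
        # offset_dim is a placeholder for e.g. position + rotation + scale + opacity offsets
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                          kdim=audio_dim, vdim=audio_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, offset_dim)

    def forward(self, gaussian_feats, audio_feats):
        # gaussian_feats: (B, N_gaussians, feat_dim); audio_feats: (B, T_audio, audio_dim)
        fused, _ = self.attn(gaussian_feats, audio_feats, audio_feats)
        return self.head(fused)                 # per-Gaussian offsets for the current frame

offsets = SpatialAudioAttention()(torch.randn(1, 5000, 64), torch.randn(1, 20, 64))
```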
https://arxiv.org/abs/2404.16012
This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource, endangered language, and before Killkan there were no resources that allowed Kichwa to be incorporated into natural language processing applications. The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the format of Universal Dependencies. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset with a special focus on the agglutinative morphology of Kichwa and frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite its small size. This dataset, the ASR model, and the code used to develop them will be publicly available. Thus, our study positively showcases resource building and its applications for low-resource languages and their communities.
https://arxiv.org/abs/2404.15501
Deepfake speech represents a real and growing threat to systems and society. Many detectors have been created to aid in defense against speech deepfakes. While these detectors implement myriad methodologies, many rely on low-level fragments of the speech generation process. We hypothesize that breath, a higher-level component of speech, is a key part of natural speech, and that improper generation of breath in deepfake speech is therefore a performant discriminator. To evaluate this, we create a breath detector and leverage it against a custom dataset of online news article audio to discriminate between real and deepfake speech. Additionally, we make this custom dataset publicly available to facilitate comparison in future work. Applying our simple breath detector as a deepfake speech discriminator to in-the-wild samples allows for accurate classification (perfect 1.0 AUPRC and 0.0 EER on test data) across 33.6 hours of audio. We compare our model with the state-of-the-art SSL-wav2vec model and show that this complex deep learning model completely fails to classify the same in-the-wild samples (0.72 AUPRC and 0.99 EER).
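For reference, the reported metrics can be computed from detector scores as follows (standard definitions, not the authors' evaluation script); here labels use 1 for real speech and higher scores mean "more likely real".

```python
# Standard EER and AUPRC computation from detector scores; toy data for illustration only.
import numpy as np
from sklearn.metrics import roc_curve, average_precision_score

def eer(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)    # labels: 1 = real speech
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # operating point where FPR == FNR
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.3, 0.4, 0.7, 0.2])
print("EER:", eer(labels, scores), "AUPRC:", average_precision_score(labels, scores))
```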
https://arxiv.org/abs/2404.15143
Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting a gesture's beginning and end. Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech. This work addresses this gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and differences in sampling rate between modalities. We investigate extended speech time windows and employ separate backbone models for each modality to address the temporal misalignment and sampling rate differences. We utilize Transformer encoders in cross-modal and early fusion techniques to effectively align and integrate speech and skeletal sequences. The study results show that combining visual and speech information significantly enhances gesture detection performance. Our findings indicate that expanding the speech buffer beyond visual time segments improves performance and that multimodal integration using cross-modal and early fusion techniques outperforms baseline methods using unimodal and late fusion methods. Additionally, we find a correlation between the models' gesture prediction confidence and low-level speech frequency features potentially associated with gestures. Overall, the study provides a better understanding and detection methods for co-speech gestures, facilitating the analysis of multimodal communication.
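A simplified sketch of the early-fusion setup described above, with separate per-modality projections standing in for the backbone models and a Transformer encoder fusing the concatenated speech and skeletal sequences; all dimensions and the pooling choice are assumptions, not the authors' model.

```python
# Hedged sketch of multimodal early fusion for co-speech gesture detection.
import torch
import torch.nn as nn

class EarlyFusionDetector(nn.Module):
    def __init__(self, speech_dim=40, skel_dim=54, d_model=128, num_classes=2):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, d_model)   # e.g. mel-filterbank frames
        self.skel_proj = nn.Linear(skel_dim, d_model)       # e.g. 2D keypoint coordinates
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)         # gesture vs. no gesture per window

    def forward(self, speech, skeleton):
        # The speech window may be longer than the visual window (different sampling rates).
        fused = torch.cat([self.speech_proj(speech), self.skel_proj(skeleton)], dim=1)
        return self.head(self.encoder(fused).mean(dim=1))   # pool over the fused sequence

logits = EarlyFusionDetector()(torch.randn(2, 100, 40), torch.randn(2, 30, 54))
```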
https://arxiv.org/abs/2404.14952
While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the inherent expressiveness in text lacks sufficient attention, especially for ETTS of artistic works. In this paper, we introduce StoryTTS, a highly expressive TTS dataset that contains rich expressiveness from both acoustic and textual perspectives, built from the recording of a Mandarin storytelling show. A systematic and comprehensive labeling framework is proposed for textual expressiveness. Drawing on linguistics, rhetoric, etc., we analyze and define speech-related textual expressiveness in StoryTTS along five distinct dimensions. We then employ large language models, prompting them with a few manual annotation examples, for batch annotation. The resulting corpus contains 61 hours of consecutive and highly prosodic speech equipped with accurate text transcriptions and rich textual expressiveness annotations. StoryTTS can therefore aid future ETTS research in fully mining the abundant intrinsic textual and acoustic features. Experiments are conducted to validate that TTS models can generate speech with improved expressiveness when integrating the annotated textual labels in StoryTTS.
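A hedged sketch of the batch-annotation step: prompt a large language model with a few manually labeled examples and ask it to label new sentences. The dimension names and labels below are placeholders, not the paper's actual labeling scheme.

```python
# Hypothetical few-shot annotation prompt; example sentences and labels are placeholders.
few_shot_examples = [
    ("他猛地一拍桌子,喝道:'住手!'", "rhetoric: exclamation; emotion: anger"),
    ("月光静静洒在湖面上。", "imagery: scene description; emotion: calm"),
]

def build_annotation_prompt(sentence):
    lines = ["Label the textual expressiveness of each Mandarin sentence."]
    for text, label in few_shot_examples:
        lines.append(f"Sentence: {text}\nLabel: {label}")
    lines.append(f"Sentence: {sentence}\nLabel:")
    return "\n\n".join(lines)

prompt = build_annotation_prompt("她轻声叹了口气,转身离开。")  # send to any LLM API in batches
```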
https://arxiv.org/abs/2404.14946
It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can monotonically improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.
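A minimal sketch of the observation adding (OA) post-processing mentioned above; the mixing weight here is a placeholder, and the paper's exact formulation and weighting may differ.

```python
# Hedged sketch of OA post-processing: interpolate the enhanced and observed signals
# so that artifact errors introduced by the SE front-end are diluted before ASR.
import numpy as np

def observation_adding(enhanced, observed, weight=0.2):
    """Interpolate enhanced and observed signals; `weight` is a hypothetical coefficient."""
    return (1.0 - weight) * enhanced + weight * observed

observed = np.random.randn(16000)             # 1 s of noisy speech at 16 kHz (toy data)
enhanced = 0.8 * observed                     # stand-in for a single-channel SE front-end output
asr_input = observation_adding(enhanced, observed)
```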
https://arxiv.org/abs/2404.14860
Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in the dialogue history, without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayesian in-Context example Selection method (ByCS) for ICL. Extending the inference probability conditioned on in-context examples based on Bayes' theorem, ByCS focuses on the inverse inference conditioned on the test input. Following the assumption that an accurate inverse inference probability (likelihood) will result in an accurate inference probability (posterior), in-context examples are selected based on their inverse inference results. Diverse and extensive cross-task and cross-modality experiments are performed with speech, text, and image examples. Experimental results show the efficacy and robustness of our ByCS method on various models, tasks and modalities.
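One way to write the Bayes relation behind this idea (our paraphrase for orientation, not the paper's exact notation): for a test input \(x\) and a candidate in-context example \(d_i = (x_i, y_i)\),

\[
p(y \mid x, d_i) = \frac{p(d_i \mid x, y)\, p(y \mid x)}{p(d_i \mid x)},
\]

so a candidate example whose likelihood term, i.e. the "inverse inference" of the example conditioned on the test input, is accurate is expected to yield an accurate posterior; ByCS ranks and selects in-context examples by this inverse-inference quality.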
https://arxiv.org/abs/2404.14716
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5\% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in this https URL.
https://arxiv.org/abs/2404.14700
One key aspect differentiating data-driven single- and multi-channel speech enhancement and dereverberation methods is that both the problem formulation and the complexity of the solutions are considerably more challenging in the latter case. Additionally, with limited computational resources, it is cumbersome to train models that require the management of larger datasets or have more complex designs. In this scenario, an unverified hypothesis, that single-channel methods can be adapted to multi-channel scenarios simply by processing each channel independently, holds significant implications: it would boost compatibility between sound scene capture and system input-output formats, while also allowing modern research to focus on other challenging aspects, such as full-bandwidth audio enhancement, competitive noise suppression, and unsupervised learning. This study verifies this hypothesis by comparing the enhancement promoted by a basic single-channel speech enhancement and dereverberation model with that of two multi-channel models tailored to separate clean speech from noisy 3D mixes. A direction-of-arrival estimation model was used to objectively evaluate the capacity to preserve spatial information by comparing the output signals with ground-truth coordinate values. Consequently, a trade-off arises: the more straightforward single-channel solution preserves spatial information at the cost of lower gains in intelligibility scores.
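A minimal sketch of the hypothesis under test: apply any single-channel enhancer to each channel of a multi-channel mixture independently (the enhancer below is a trivial placeholder, not any of the models compared in the paper).

```python
# Hedged sketch: channel-wise application of a single-channel SE/dereverberation model.
import numpy as np

def enhance_single_channel(x):
    return x                                   # placeholder for any mono enhancement model

def enhance_multichannel(mix):                 # mix: (num_channels, num_samples)
    return np.stack([enhance_single_channel(ch) for ch in mix])

enhanced = enhance_multichannel(np.random.randn(4, 16000))   # e.g. 4 channels, 1 s at 16 kHz
```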
https://arxiv.org/abs/2404.14564
Data-driven approaches have revolutionized scientific research. Machine learning and statistical analysis are commonly utilized in this type of research. Despite their widespread use, these methodologies differ significantly in their techniques and objectives. Few studies have utilized a consistent dataset to demonstrate these differences within the social sciences, particularly in language and cognitive sciences. This study leverages the Buckeye Speech Corpus to illustrate how both machine learning and statistical analysis are applied in data-driven research to obtain distinct insights. This study significantly enhances our understanding of the diverse approaches employed in data-driven strategies.
https://arxiv.org/abs/2404.14052
Understanding cognitive processes in the brain demands sophisticated models capable of replicating neural dynamics at large scales. We present a physiologically inspired speech recognition architecture, compatible and scalable with deep learning frameworks, and demonstrate that end-to-end gradient descent training leads to the emergence of neural oscillations in the central spiking neural network. Significant cross-frequency couplings, indicative of these oscillations, are measured within and across network layers during speech processing, whereas no such interactions are observed when handling background noise inputs. Furthermore, our findings highlight the crucial inhibitory role of feedback mechanisms, such as spike frequency adaptation and recurrent connections, in regulating and synchronising neural activity to improve recognition performance. Overall, on top of developing our understanding of synchronisation phenomena notably observed in the human auditory pathway, our architecture exhibits dynamic and efficient information processing, with relevance to neuromorphic technology.
https://arxiv.org/abs/2404.14024
With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show the superior performance of the proposed RAD framework over baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets. Further sample analysis indicates that the retriever consistently retrieves samples mostly from the same speaker with acoustic characteristics highly consistent with the query audio, thereby improving detection performance.
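A rough sketch of the retrieval step in a retrieval-augmented detection setup (our assumption of the general recipe, not the authors' code): retrieve the k most similar training embeddings by cosine similarity and pass them to the classifier together with the query.

```python
# Hedged sketch of retrieval augmentation for deepfake speech detection; toy embeddings only.
import numpy as np

def retrieve(query_emb, bank_embs, k=4):
    sims = bank_embs @ query_emb / (
        np.linalg.norm(bank_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    return bank_embs[np.argsort(-sims)[:k]]    # cosine-similarity top-k neighbours

bank = np.random.randn(1000, 256)              # embeddings of bona-fide/spoofed training audio
query = np.random.randn(256)                   # embedding of the test utterance
augmented_input = np.concatenate([query[None], retrieve(query, bank)])  # fed to the classifier
```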
https://arxiv.org/abs/2404.13892
Speech emotion recognition is crucial in human-computer interaction, but extracting and using emotional cues from audio poses challenges. This paper introduces MFHCA, a novel method for Speech Emotion Recognition using Multi-Spatial Fusion and Hierarchical Cooperative Attention on spectrograms and raw audio. We employ the Multi-Spatial Fusion module (MF) to efficiently identify emotion-related spectrogram regions and integrate Hubert features for higher-level acoustic information. Our approach also includes a Hierarchical Cooperative Attention module (HCA) to merge features from various auditory levels. We evaluate our method on the IEMOCAP dataset and achieve 2.6\% and 1.87\% improvements on the weighted accuracy and unweighted accuracy, respectively. Extensive experiments demonstrate the effectiveness of the proposed method.
https://arxiv.org/abs/2404.13509
The rise of deep learning has marked significant progress in fields such as computer vision, natural language processing, and medical imaging, primarily through the adaptation of pre-trained models for specific tasks. Traditional fine-tuning methods, involving adjustments to all parameters, face challenges due to high computational and memory demands. This has led to the development of Parameter-Efficient Fine-Tuning (PEFT) techniques, which selectively update parameters to balance computational efficiency with performance. This review examines PEFT approaches, offering a detailed comparison of various strategies and highlighting applications across different domains, including text generation, medical imaging, protein modeling, and speech synthesis. By assessing the effectiveness of PEFT methods in reducing computational load, speeding up training, and lowering memory usage, this paper contributes to making deep learning more accessible and adaptable, facilitating its wider application and encouraging innovation in model optimization. Ultimately, the paper aims to contribute insights into PEFT's evolving landscape, guiding researchers and practitioners in overcoming the limitations of conventional fine-tuning approaches.
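As one concrete illustration of the kind of technique such a review covers, here is a generic LoRA-style low-rank adapter sketch (not tied to any specific method in the paper): the pre-trained weight stays frozen and only two small matrices are trained.

```python
# Hedged sketch of a LoRA-style parameter-efficient adapter around a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pre-trained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)             # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only A and B are updated: 8*512 + 512*8 = 8192 parameters
```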
https://arxiv.org/abs/2404.13506