This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. Mixat was developed to address the shortcomings of current speech recognition resources when applied to Emirati speech, and in particular to bilingual Emirati speakers, who often mix and switch between their local dialect and English. The dataset consists of 15 hours of speech derived from two public podcasts featuring native Emirati speakers, one of which takes the form of conversations between the host and a guest. The collection therefore contains examples of Emirati-English code-switching in both formal and natural conversational contexts. In this paper, we describe the process of data collection and annotation, and present some features and statistics of the resulting dataset. In addition, we evaluate the performance of pre-trained Arabic and multilingual ASR systems on our dataset, demonstrating the shortcomings of existing models on this low-resource dialectal Arabic variety and the additional challenge that code-switching poses for ASR. The dataset will be made publicly available for research use.
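A minimal sketch of the kind of ASR evaluation described above, using the `jiwer` package to score hypotheses against code-switched references; the transcripts and the English lexicon are hypothetical placeholders, not Mixat data.

```python
# Minimal sketch: scoring ASR output on code-switched transcripts with
# word error rate (WER). All transcripts below are hypothetical.
import jiwer

references = [
    "yalla let's go to the meeting",   # placeholder code-switched utterance
    "shukran for the update",
]
hypotheses = [
    "yalla lets go to the meeting",
    "shukran for the data",
]

wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.3f}")

# A simple code-mixing statistic: fraction of English tokens per utterance,
# assuming a (hypothetical) lexicon that flags English words.
english_lexicon = {"let's", "lets", "go", "to", "the", "meeting", "for", "update"}
for ref in references:
    tokens = ref.split()
    ratio = sum(t in english_lexicon for t in tokens) / len(tokens)
    print(f"{ref!r}: {ratio:.0%} English tokens")
```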
https://arxiv.org/abs/2405.02578
The complex challenge of detecting sarcasm in Arabic speech on social media is compounded by the language's diversity and the nature of sarcastic expressions. There is a significant gap in the ability of existing models to interpret sarcasm in Arabic effectively, which calls for more sophisticated and precise detection methods. In this paper, we investigate the impact of a fundamental preprocessing component on sarcasm detection. While emojis play a crucial role in compensating for the absence of body language and facial expressions in modern communication, their impact on automated text analysis, particularly in sarcasm detection, remains underexplored. We investigate the impact of excluding emojis from datasets on the performance of sarcasm detection models for social media content in Arabic, a lexically rich language. This investigation includes the adaptation and enhancement of AraBERT pre-training models, specifically by excluding emojis, to improve sarcasm detection capabilities. We use AraBERT pre-training to refine the specified models, demonstrating that the removal of emojis can significantly boost the accuracy of sarcasm detection. This approach facilitates a more refined interpretation of language, eliminating the potential confusion introduced by non-textual elements. Through the focused strategy of emoji removal, the evaluated AraBERT models adeptly navigate the complexities of Arabic sarcasm. This study establishes new benchmarks in Arabic natural language processing and presents valuable insights for social media platforms.
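One plausible way to implement the emoji-exclusion preprocessing step is a regex filter over emoji code-point ranges; the ranges below are a common approximation, not necessarily the exact filter used in the paper.

```python
# Sketch of an emoji-exclusion preprocessing step: strip emoji code
# points before tokenization. The unicode ranges are a common
# approximation, not necessarily the paper's exact filter.
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # symbols, pictographs, supplemental pictographs
    "\U00002600-\U000027BF"  # misc symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) symbols
    "\U0000FE00-\U0000FE0F"  # variation selectors
    "]+",
    flags=re.UNICODE,
)

def remove_emojis(text: str) -> str:
    """Return text with emoji characters removed."""
    return EMOJI_PATTERN.sub("", text).strip()

print(remove_emojis("يا سلام 😂🔥"))  # -> "يا سلام"
```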
https://arxiv.org/abs/2405.02195
Generalization is a main issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which ever more accurate synthesis methods are developed, it is very important to design techniques that also work well on data they were not trained on. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework, and fake audios are exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake-detection or speaker-verification datasets. At detection time, only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and strong generalization ability, rivaling supervised methods on in-distribution data and largely surpassing them on out-of-distribution data.
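A minimal sketch of the verification-style paradigm under stated assumptions: `embed()` stands in for any frozen large pre-trained speech model, and the similarity threshold is arbitrary.

```python
# Sketch of verification-style deepfake detection: an utterance is
# flagged as fake when its embedding is too far from enrollment samples
# of the claimed identity. No fake audio is needed at any point.
import numpy as np

def embed(waveform: np.ndarray) -> np.ndarray:
    """Placeholder for a frozen pre-trained speech embedding model."""
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % (2**32))
    return rng.standard_normal(256)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_fake(test_wave, enrollment_waves, threshold=0.5):
    test_emb = embed(test_wave)
    scores = [cosine(test_emb, embed(w)) for w in enrollment_waves]
    # Low similarity to every genuine sample of the claimed identity
    # exposes a voice mismatch, i.e. a likely deepfake.
    return max(scores) < threshold

enrollment = [np.random.randn(16000) for _ in range(3)]  # genuine clips
probe = np.random.randn(16000)                           # clip under test
print("fake" if is_fake(probe, enrollment) else "genuine")
```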
https://arxiv.org/abs/2405.02179
The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, there is still room to improve the performance of these methods. In this paper, we present GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning), a novel HuBERT-based adaptive transfer learning framework for SER. Specifically, GMP-ATL initially employs the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level gender-augmented multi-scale pseudo-labels. Then, to fully leverage the obtained frame-level and utterance-level emotion labels, we incorporate model retraining and fine-tuning to further optimize GMP-ATL. Experiments on IEMOCAP show that GMP-ATL achieves superior recognition performance, with a WAR of 80.0% and a UAR of 82.0%, surpassing state-of-the-art unimodal SER methods while also yielding results comparable to multimodal SER approaches.
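A sketch of the multi-scale k-means pseudo-labelling idea using scikit-learn; the feature dimensions and cluster counts below are illustrative, not the paper's settings.

```python
# Sketch of multi-scale k-means pseudo-labelling on frame-level
# features: cluster the same features at several codebook sizes to
# obtain coarse-to-fine frame labels. Dimensions are illustrative.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(1000, 768)  # e.g. HuBERT frame embeddings (dummy)

pseudo_labels = {}
for k in (50, 100, 500):             # multiple clustering scales
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    pseudo_labels[k] = km.fit_predict(frames)
    print(f"k={k}: first 10 frame labels -> {pseudo_labels[k][:10]}")
```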
https://arxiv.org/abs/2405.02151
Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech-foundation-encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs, to promote reproducible research.
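A sketch of one possible projector configuration of the kind such ablations cover: frame stacking for temporal downsampling followed by a small MLP into the LLM embedding space. All dimensions are illustrative assumptions.

```python
# Sketch of a projector bridging a speech encoder and an LLM: stack
# adjacent frames to downsample in time, then map to the LLM width.
import torch
import torch.nn as nn

class FrameStackProjector(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):                      # x: (batch, frames, enc_dim)
        b, t, d = x.shape
        t = t - t % self.stack                 # drop ragged tail frames
        x = x[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(x)                    # (batch, frames/stack, llm_dim)

enc_out = torch.randn(2, 100, 1024)            # dummy encoder output
print(FrameStackProjector()(enc_out).shape)    # torch.Size([2, 25, 4096])
```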
https://arxiv.org/abs/2405.02132
In this paper, we present a novel approach to text-independent phone-to-audio alignment based on phoneme recognition, representation learning, and knowledge transfer. Our method leverages a self-supervised model (wav2vec2) fine-tuned for phoneme recognition with a Connectionist Temporal Classification (CTC) loss, a dimension-reduction model, and a frame-level phoneme classifier trained on forced-alignment labels (produced with the Montreal Forced Aligner) to obtain multilingual phonetic representations, thus requiring minimal additional training. We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English, respectively. Our proposed model outperforms the state of the art (charsiu) on statistical metrics and has applications in language learning and speech processing systems. We leave experiments on other languages for future work, but the design of the system makes it easily adaptable to them.
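A sketch of the final alignment step under simplifying assumptions: frame-level phoneme posteriors (random here) are collapsed into time-stamped segments; the phoneme set and frame rate are placeholders.

```python
# Sketch: turn frame-level phoneme posteriors into time-aligned
# segments. Posteriors, frame rate, and phoneme set are placeholders.
import numpy as np

PHONES = ["sil", "h", "e", "l", "o"]
FRAME_SEC = 0.02                                  # 20 ms hop (assumed)

posteriors = np.random.rand(50, len(PHONES))      # (frames, phones), dummy
frame_ids = posteriors.argmax(axis=1)             # hard decision per frame

segments, start = [], 0
for i in range(1, len(frame_ids) + 1):
    # Close a segment whenever the winning phoneme changes (or at the end).
    if i == len(frame_ids) or frame_ids[i] != frame_ids[start]:
        segments.append((PHONES[frame_ids[start]],
                         round(start * FRAME_SEC, 2),
                         round(i * FRAME_SEC, 2)))
        start = i

for phone, t0, t1 in segments[:5]:
    print(f"{phone:>4s}  {t0:5.2f}-{t1:5.2f} s")
```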
https://arxiv.org/abs/2405.02124
Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature low computational complexity and a processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While all algorithms perform similarly in diffuse noise, the binaural deep learning approach performs best in the presence of spatial interferers. A post-hoc analysis attributes this to improvements at low SNRs and to precise spatial filtering.
https://arxiv.org/abs/2405.01967
This paper proposes a new approach to Machine Learning (ML) that focuses on unsupervised, continuous, context-dependent learning of complex patterns. Although the proposal is partly inspired by current knowledge about the structural and functional properties of the mammalian brain, we do not claim that biological systems work in an analogous way (nor the opposite). Based on some properties of the cerebellar cortex and adjacent structures, a proposal suitable for practical problems is presented. A synthetic structure capable of identifying and predicting complex temporal series is defined and experimentally tested. The system relies heavily on prediction to help identify and learn patterns based on previously acquired contextual knowledge. As a proof of concept, the proposed system is shown to be able to learn, identify, and predict a remarkably complex temporal series such as human speech, with no prior knowledge. From raw data, without any adaptation of the core algorithm, the system is able to identify certain speech structures in a set of Spanish sentences. Unlike conventional ML, the proposal can learn from a reduced training set. Although the idea can be applied to a constrained problem, such as the detection of unknown vocabulary in speech, it could be used in other applications, such as vision, or (by incorporating the missing biological periphery) fit into other ML techniques. Given the trivial computational primitives used, a potential hardware implementation would be remarkably frugal. Coincidentally, the proposed model not only conforms to a plausible functional framework for biological systems but may also explain many elusive cognitive phenomena.
https://arxiv.org/abs/2405.02371
To address the limitations of current hate speech detection models, we introduce \textsf{SGHateCheck}, a novel framework designed for the linguistic and cultural context of Singapore and Southeast Asia. It extends the functional testing approach of HateCheck and MHC, employing large language models for translation and paraphrasing into Singapore's main languages and refining these with native annotators. \textsf{SGHateCheck} reveals critical flaws in state-of-the-art models, highlighting their inadequacy in sensitive content moderation. This work aims to foster the development of more effective hate speech detection tools for diverse linguistic environments, particularly in Singaporean and Southeast Asian contexts.
https://arxiv.org/abs/2405.01842
Diagnosing autism spectrum disorder (ASD) by identifying abnormal speech patterns from examiner-patient dialogues presents significant challenges due to the subtle and diverse manifestations of speech-related symptoms in affected individuals. This study presents a comprehensive approach to identify distinctive speech patterns through the analysis of examiner-patient dialogues. Utilizing a dataset of recorded dialogues, we extracted 40 speech-related features, categorized into frequency, zero-crossing rate, energy, spectral characteristics, Mel Frequency Cepstral Coefficients (MFCCs), and balance. These features encompass various aspects of speech such as intonation, volume, rhythm, and speech rate, reflecting the complex nature of communicative behaviors in ASD. We employed machine learning for both classification and regression tasks to analyze these speech features. The classification model aimed to differentiate between ASD and non-ASD cases, achieving an accuracy of 87.75%. Regression models were developed to predict speech pattern related variables and a composite score from all variables, facilitating a deeper understanding of the speech dynamics associated with ASD. The effectiveness of machine learning in interpreting intricate speech patterns and the high classification accuracy underscore the potential of computational methods in supporting the diagnostic processes for ASD. This approach not only aids in early detection but also contributes to personalized treatment planning by providing insights into the speech and communication profiles of individuals with ASD.
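A sketch of extracting a few of the listed feature families with `librosa`; the bundled example clip stands in for a dialogue recording, and the study's exact 40-feature set is not reproduced.

```python
# Sketch: extract MFCCs, zero-crossing rate, energy (RMS), and a
# spectral characteristic with librosa, then summarize per clip.
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))       # stand-in for a dialogue clip

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
zcr = librosa.feature.zero_crossing_rate(y)
rms = librosa.feature.rms(y=y)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

# Summarize each per-frame feature with its mean, as is common before
# feeding classical classifiers/regressors.
features = np.concatenate([
    mfcc.mean(axis=1), zcr.mean(axis=1), rms.mean(axis=1), centroid.mean(axis=1),
])
print(features.shape)                              # (16,) summary vector
```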
https://arxiv.org/abs/2405.05126
This paper introduces a novel convolutional neural network (CNN) framework tailored for end-to-end audio deep learning models, presenting advancements in efficiency and explainability. By benchmarking experiments on three standard speech emotion recognition datasets with five-fold cross-validation, our framework outperforms Mel spectrogram features by up to seven percent. It can potentially replace Mel-Frequency Cepstral Coefficients (MFCCs) while remaining lightweight. Furthermore, we demonstrate the efficiency and interpretability of the front-end layer using the PhysioNet Heart Sound Database, illustrating its ability to handle and capture intricate long waveform patterns. Our contributions offer a portable solution for building efficient and interpretable models for raw waveform data.
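A sketch of a learnable convolutional front-end on raw waveforms, the general kind of module benchmarked here; the architecture shown is an assumption, not the paper's.

```python
# Sketch: a learnable front-end on raw audio that could stand in for
# Mel/MFCC features. Kernel sizes and widths are illustrative.
import torch
import torch.nn as nn

class RawWaveFrontEnd(nn.Module):
    def __init__(self, n_filters=64):
        super().__init__()
        self.net = nn.Sequential(
            # ~25 ms windows with ~10 ms hop at 16 kHz
            nn.Conv1d(1, n_filters, kernel_size=400, stride=160),
            nn.BatchNorm1d(n_filters),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, wave):                # wave: (batch, samples)
        return self.net(wave.unsqueeze(1))  # (batch, n_filters, frames)

wave = torch.randn(4, 16000)                # 1 s of dummy 16 kHz audio
print(RawWaveFrontEnd()(wave).shape)        # torch.Size([4, 64, 98])
```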
https://arxiv.org/abs/2405.01815
When querying a large language model (LLM), the context, i.e. personal, demographic, and cultural information specific to an end-user, can significantly shape the response of the LLM. For example, asking the model to explain Newton's second law with the context "I am a toddler" yields a different answer compared to the context "I am a physics professor." Proper usage of the context enables the LLM to generate personalized responses, whereas inappropriate contextual influence can lead to stereotypical and potentially harmful generations (e.g. associating "female" with "housekeeper"). In practice, striking the right balance when leveraging context is a nuanced and challenging problem that is often situation-dependent. One common approach to address this challenge is to fine-tune LLMs on contextually appropriate responses. However, this approach is expensive, time-consuming, and not controllable for end-users in different situations. In this work, we propose Context Steering (CoS) - a simple training-free method that can be easily applied to autoregressive LLMs at inference time. By measuring the contextual influence in terms of token prediction likelihood and modulating it, our method enables practitioners to determine the appropriate level of contextual influence based on their specific use case and end-user base. We showcase a variety of applications of CoS including amplifying the contextual influence to achieve better personalization and mitigating unwanted influence for reducing model bias. In addition, we show that we can combine CoS with Bayesian Inference to quantify the extent of hate speech on the internet. We demonstrate the effectiveness of CoS on state-of-the-art LLMs and benchmarks.
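A minimal sketch of the logit-space modulation that CoS describes; the model call is a placeholder, and `lam` is the user-chosen steering strength (lambda > 1 amplifies the context, lambda < 0 suppresses it).

```python
# Sketch of the core CoS idea: measure contextual influence as the
# difference between next-token logits with and without the context,
# then scale that difference by a user-chosen lambda.
import torch

def next_token_logits(prompt: str) -> torch.Tensor:
    """Placeholder for a forward pass of an autoregressive LLM."""
    torch.manual_seed(len(prompt))
    return torch.randn(32000)           # dummy vocabulary-sized logits

def steered_logits(context, query, lam):
    with_ctx = next_token_logits(context + " " + query)
    without_ctx = next_token_logits(query)
    return without_ctx + lam * (with_ctx - without_ctx)

logits = steered_logits("I am a toddler.", "Explain Newton's second law.", lam=1.5)
probs = torch.softmax(logits, dim=-1)   # sample/argmax from these as usual
print(probs.argmax().item())
```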
https://arxiv.org/abs/2405.01768
Expressive voice conversion (VC) conducts speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on the performance of vocoders. A major challenge of expressive VC lies in emotion prosody modeling. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Codes and samples are publicly available.
https://arxiv.org/abs/2405.01730
Reduced articulatory precision is common in speech, but its acoustic properties and pragmatic functions in dialog have been little studied. Here we try to remedy this gap. This technical report contains content that was omitted from the journal article (Ward et al. 2024, submitted). Specifically, we report 1) lessons learned about annotating for perceived reduction, 2) the finding that, unlike in read speech, the correlates of reduction in dialog include high pitch, wide pitch range, and intensity, and 3) a baseline model for predicting reduction in dialog, using simple acoustic/prosodic features, that achieves correlations with human perceptions of 0.24 for English and 0.17 for Spanish. We also provide examples of additional possible pragmatic functions of reduction in English, along with various discussion, observations, and speculations.
https://arxiv.org/abs/2405.01376
This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). Results are compared to the current best-performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN). An optimal InterCTC setting is first established using a Conformer encoder. This setting is then used to train a model with an E-branchformer encoder, and the performance of the two architectures is compared. A multi-task fine-tuning approach is adopted for language model (LM) shallow fusion. The experiments yielded an improvement in DID accuracy of 10.8% relative to a baseline ECAPA-TDNN, and WER performance approaching that of the TDNN-HMM model. This multi-task approach emerges as a promising strategy for Irish low-resource ASR and DID.
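A sketch of an InterCTC-style objective in PyTorch, mixing an intermediate-layer CTC loss with the final-layer one; the shapes and the weight `w` are illustrative assumptions.

```python
# Sketch: Intermediate CTC (InterCTC) objective. CTC losses computed
# from intermediate encoder layers are mixed with the final CTC loss.
# Shapes follow torch.nn.CTCLoss conventions.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
T, N, C, S = 50, 4, 30, 12               # frames, batch, classes, target len

final_logp = torch.randn(T, N, C).log_softmax(-1)
inter_logp = torch.randn(T, N, C).log_softmax(-1)   # from an intermediate layer
targets = torch.randint(1, C, (N, S))               # avoid the blank index 0
in_lens = torch.full((N,), T, dtype=torch.long)
tgt_lens = torch.full((N,), S, dtype=torch.long)

w = 0.3                                   # intermediate-loss weight (assumed)
loss = (1 - w) * ctc(final_logp, targets, in_lens, tgt_lens) \
       + w * ctc(inter_logp, targets, in_lens, tgt_lens)
print(loss.item())
```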
https://arxiv.org/abs/2405.01293
We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone-conduction speech enhancement, suitable for mobile and wearable platforms. Bone-conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state-of-the-art models with memory footprints of hundreds of MBs and methods better suited to resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train TRAMBA with audio speech datasets that are widely available. Then, users fine-tune with a small amount of bone-conduction data. TRAMBA outperforms state-of-the-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order-of-magnitude smaller memory footprint and an inference speedup of up to 465 times. We integrate TRAMBA into real systems and show that TRAMBA (i) improves the battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher-quality voice in noisy environments than over-the-air speech; and (iii) requires a memory footprint of less than 20.0 MB.
https://arxiv.org/abs/2405.01242
Membership Inference (MI) poses a substantial privacy threat to the training data of Automatic Speech Recognition (ASR) systems, while also offering an opportunity to audit these models with regard to user data. This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models. To the best of our knowledge, this approach has not yet been investigated. We compare our proposed features with commonly used error-based features and find that the proposed features greatly enhance performance for sample-level MI. For speaker-level MI, these features improve results, though by a smaller margin, as error-based features already obtained a high performance for this task. Our findings emphasise the importance of considering different feature sets and levels of access to target models for effective MI in ASR systems, providing valuable insights for auditing such models.
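A sketch of constructing loss-based MI features with Gaussian perturbations; `model_loss()` is a placeholder for whatever level of access one has to the target ASR model's loss.

```python
# Sketch: loss-based membership-inference features. The feature vector
# combines the sample's loss under the target model with loss shifts
# under Gaussian perturbations of the input.
import numpy as np

def model_loss(waveform: np.ndarray) -> float:
    """Placeholder for the target ASR model's loss on one utterance."""
    return float(np.mean(waveform ** 2))   # dummy stand-in

def mi_features(waveform, n_perturb=8, sigma=0.01):
    base = model_loss(waveform)
    rng = np.random.default_rng(0)
    perturbed = [model_loss(waveform + sigma * rng.standard_normal(waveform.shape))
                 for _ in range(n_perturb)]
    # Training members tend to sit in sharper loss minima, so the loss
    # shift under perturbation is informative alongside the raw loss.
    return np.array([base, np.mean(perturbed) - base, np.std(perturbed)])

x = np.random.randn(16000)
print(mi_features(x))   # feed vectors like this to a binary attack classifier
```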
https://arxiv.org/abs/2405.01207
Recent transformer-based ASR models have achieved word-error rates (WER) below 4%, surpassing human annotator accuracy, yet they demand extensive server resources, contributing to significant carbon footprints. The traditional server-based architecture of ASR also presents privacy concerns, alongside reliability and latency issues due to network dependencies. In contrast, on-device (edge) ASR enhances privacy, boosts performance, and promotes sustainability by effectively balancing energy use and accuracy for specific applications. This study examines the effects of quantization, memory demands, and energy consumption on the performance of various ASR models during inference on the NVIDIA Jetson Orin Nano. By analyzing WER and transcription speed across models using FP32, FP16, and INT8 quantization on clean and noisy datasets, we highlight the crucial trade-offs between accuracy, speed, quantization, energy efficiency, and memory needs. We found that changing precision from FP32 to FP16 halves the energy consumption for audio transcription across different models, with minimal performance degradation. Neither a larger model size nor a higher parameter count guarantees better resilience to noise or predicts the energy consumption for a given transcription load. These, along with several other findings, offer novel insights for optimizing ASR systems within energy- and memory-limited environments, crucial for the development of efficient on-device ASR solutions. The code and input data needed to reproduce the results in this article are open-source and available at [this https URL].
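A sketch of the FP32-versus-half-precision comparison; the tiny model is a stand-in for an ASR network, and wall-clock time is used here as a rough proxy (the study measures energy consumption directly on the device).

```python
# Sketch: compare inference time at full vs. half precision. On CPU a
# bfloat16 fallback is used, since float16 matmuls are GPU-oriented.
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
)
x = torch.randn(256, 512)

device = "cuda" if torch.cuda.is_available() else "cpu"
half = torch.float16 if device == "cuda" else torch.bfloat16
for dtype in (torch.float32, half):
    m = model.to(device=device, dtype=dtype)
    xi = x.to(device=device, dtype=dtype)
    t0 = time.perf_counter()
    with torch.no_grad():
        for _ in range(100):
            m(xi)
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"{dtype}: {time.perf_counter() - t0:.3f} s for 100 batches")
```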
https://arxiv.org/abs/2405.01004
Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results on a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) biases. Despite that, we show that only model-related biases are amplified by quantization, affecting low-resource languages and smaller models more. Searching for a better compression approach, we propose DistilWhisper, an approach that is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages on both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.
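A sketch of the knowledge-distillation term such an approach could use: a temperature-scaled KL divergence between teacher and student token distributions, mixed with cross-entropy; the shapes and hyperparameters are illustrative assumptions.

```python
# Sketch: teacher-student distillation loss. A KL term pulls the
# student's temperature-softened token distribution toward the
# teacher's; alpha mixes it with ordinary cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(8, 51865, requires_grad=True)  # multilingual Whisper vocab size
teacher = torch.randn(8, 51865)
labels = torch.randint(0, 51865, (8,))
print(distillation_loss(student, teacher, labels).item())
```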
https://arxiv.org/abs/2405.00966
Empathy requires perspective-taking: empathetic responses require a person to reason about what another has experienced and to communicate that understanding in language. However, most NLP approaches to empathy do not explicitly model this alignment process. Here, we introduce a new approach to recognizing alignment in empathetic speech, grounded in Appraisal Theory. We introduce a new dataset of over 9.2K span-level annotations of different types of appraisals of a person's experience and over 3K empathetic alignments between a speaker's and an observer's speech. Through computational experiments, we show that these appraisals and alignments can be accurately recognized. In experiments on over 9.2M Reddit conversations, we find that appraisals capture meaningful groupings of behavior, but that most responses show minimal alignment. However, we find that mental health professionals engage in substantially more empathetic alignment.
https://arxiv.org/abs/2405.00948