Monaural speech enhancement on drones is challenging because ego-noise from the rotating motors and propellers leads to extremely low signal-to-noise ratios at onboard microphones. Although recent masking-based deep neural network methods excel at monaural speech enhancement, they struggle in the challenging drone-noise scenario. Furthermore, existing drone noise datasets are limited, causing models to overfit. Considering the harmonic nature of drone noise, this paper proposes a frequency-domain bottleneck adapter to enable transfer learning. Specifically, the adapter's parameters are trained on drone noise while the parameters of the pre-trained Frequency Recurrent Convolutional Recurrent Network (FRCRN) are kept fixed. Evaluation results demonstrate that the proposed method can effectively enhance speech quality. Moreover, it is a more efficient alternative to fine-tuning models for various drone types, which typically requires substantial computational resources.
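A bottleneck adapter of this kind is typically a small down-projection/up-projection pair with a residual connection, trained while the backbone stays frozen. The following plain-Python sketch (hypothetical dimensions and a zero-initialised up-projection, not the paper's actual FRCRN integration) illustrates the idea on a single frame of frequency-bin features:

```python
import random

def linear(x, W, b):
    # y = W x + b, with W stored as a list of output rows
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_adapter(freq_bins, bottleneck, rng):
    # Small random init for the down-projection; the up-projection is
    # zero-initialised so the adapter starts out as an identity mapping.
    W_down = [[rng.gauss(0, 0.01) for _ in range(freq_bins)]
              for _ in range(bottleneck)]
    b_down = [0.0] * bottleneck
    W_up = [[0.0] * bottleneck for _ in range(freq_bins)]
    b_up = [0.0] * freq_bins
    return (W_down, b_down, W_up, b_up)

def adapter_forward(x, params):
    W_down, b_down, W_up, b_up = params
    h = relu(linear(x, W_down, b_down))       # down-project along frequency
    y = linear(h, W_up, b_up)                 # up-project back to freq_bins
    return [xi + yi for xi, yi in zip(x, y)]  # residual connection
```

Only the adapter tuple would receive gradient updates in training; the frozen FRCRN backbone is untouched, which is why adapting to a new drone type is cheap.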
https://arxiv.org/abs/2405.10022
This article describes the Data-Efficient Low-Complexity Acoustic Scene Classification Task in the DCASE 2024 Challenge and the corresponding baseline system. The task setup is a continuation of previous editions (2022 and 2023), which focused on recording device mismatches and low-complexity constraints. This year's edition introduces an additional real-world problem: participants must develop data-efficient systems for five scenarios, which progressively limit the available training data. The provided baseline system is based on an efficient, factorized CNN architecture constructed from inverted residual blocks and uses Freq-MixStyle to tackle the device mismatch problem. The baseline system's accuracy ranges from 42.40% on the smallest to 56.99% on the largest training set.
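Freq-MixStyle, as generally described in the DCASE literature, mixes per-frequency-bin feature statistics between two recordings so the model becomes less sensitive to device-specific spectral colouration. A minimal sketch, assuming spectrograms stored as frequency-by-time lists (toy shapes, not the baseline's actual tensors):

```python
import math

def freq_mixstyle(x_a, x_b, lam, eps=1e-5):
    """Mix per-frequency-bin statistics of spectrogram x_a with those of x_b.

    x_a, x_b: lists of frequency rows, each row a list of time frames.
    lam: mixing coefficient in [0, 1] (drawn from a Beta distribution
         in practice; fixed here for illustration).
    """
    out = []
    for row_a, row_b in zip(x_a, x_b):
        mu_a = sum(row_a) / len(row_a)
        mu_b = sum(row_b) / len(row_b)
        sd_a = math.sqrt(sum((v - mu_a) ** 2 for v in row_a) / len(row_a) + eps)
        sd_b = math.sqrt(sum((v - mu_b) ** 2 for v in row_b) / len(row_b) + eps)
        # interpolate mean and std between the two recordings
        mu_mix = lam * mu_a + (1 - lam) * mu_b
        sd_mix = lam * sd_a + (1 - lam) * sd_b
        # normalise row_a per frequency bin, then re-scale with mixed statistics
        out.append([(v - mu_a) / sd_a * sd_mix + mu_mix for v in row_a])
    return out
```

Because the statistics are computed per frequency bin over time, device-dependent frequency responses are what get randomised, while the temporal content of the scene is preserved.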
https://arxiv.org/abs/2405.10018
Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating a multi-scale framework that effectively captures coarse-grained note information and ensures fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also establish a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experimental findings reveal that ROSVOT achieves state-of-the-art transcription accuracy with either clean or noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, affirming its capability for practical application. Audio samples are available at this https URL.
https://arxiv.org/abs/2405.09940
Recent advances in generative language modeling applied to discrete speech tokens have opened a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their textual counterparts, are scalable, probabilistic, and context-aware. While they can produce diverse and natural outputs, they sometimes face issues such as unintelligibility and the inclusion of non-speech noises or hallucinations. As the adoption of this innovative paradigm in speech synthesis increases, there is a clear need for an in-depth evaluation of its capabilities and limitations. In this paper, we evaluate TTS from a discrete token-based SLM, through both automatic metrics and listening tests. We examine five key dimensions: speaking style, intelligibility, speaker consistency, prosodic variation, and spontaneous behaviour. Our results highlight the model's strength in generating varied prosody and spontaneous outputs. It is also rated higher in naturalness and context appropriateness in listening tests than a conventional TTS system. However, the model's performance in intelligibility and speaker consistency lags behind traditional TTS. Additionally, we show that increasing the scale of SLMs offers a modest boost in robustness. Our findings aim to serve as a benchmark for future advancements in generative SLMs for speech synthesis.
https://arxiv.org/abs/2405.09768
In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security concerns have received much more attention than ever before, primarily due to the susceptibility of Deep Neural Networks. Previous studies have illustrated that surreptitiously crafted adversarial perturbations enable the manipulation of speech recognition systems, resulting in the production of malicious commands. These attack methods mostly require adding noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples based on Text-to-Speech (TTS) synthesis audio. However, style modifications based on optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of a Style Transfer Attack (STA), which combines style transfer and adversarial attack in sequential order. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method can meet the need for user-customized styles and achieve an attack success rate of 82%, while preserving sound naturalness, as confirmed by our user study.
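The $\ell_p$-constrained perturbations mentioned above are commonly realised with sign-gradient steps projected onto an $\ell_\infty$ ball. A hedged, framework-free sketch of one such step (the gradient values here are placeholders; a real attack would backpropagate a loss through the ASR model):

```python
def project_linf(delta, eps):
    """Project a perturbation onto the l_inf ball of radius eps
    (i.e., clip each sample of the perturbation to [-eps, eps])."""
    return [max(-eps, min(eps, d)) for d in delta]

def fgsm_step(audio, grad, eps):
    """One FGSM-style step: move each audio sample in the sign direction
    of the loss gradient, so the perturbation satisfies ||delta||_inf <= eps."""
    sign = lambda g: (g > 0) - (g < 0)
    delta = project_linf([eps * sign(g) for g in grad], eps)
    return [a + d for a, d in zip(audio, delta)]
```

Iterating such steps with re-projection gives PGD-style attacks; the abstract's point is that this bounded additive noise still leaves audible artifacts, which motivates style-based attacks instead.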
https://arxiv.org/abs/2405.09470
In this work, we present Score MUsic Graph (SMUG)-Explain, a framework for generating and visualizing explanations of graph neural networks applied to arbitrary prediction tasks on musical scores. Our system allows the user to visualize the contribution of input notes (and note features) to the network output, directly in the context of the musical score. We provide an interactive interface based on the music notation engraving library Verovio. We showcase the usage of SMUG-Explain on the task of cadence detection in classical music. All code is available on this https URL.
https://arxiv.org/abs/2405.09241
We propose a new graph convolutional block, called MusGConv, specifically designed for the efficient processing of musical score data and motivated by general perceptual principles. It focuses on two fundamental dimensions of music, pitch and rhythm, and considers both relative and absolute representations of these components. We evaluate our approach on four different musical understanding problems: monophonic voice separation, harmonic analysis, cadence detection, and composer identification, which, in abstract terms, translate to different graph learning problems, namely node classification, link prediction, and graph classification. Our experiments demonstrate that MusGConv improves performance on three of the aforementioned tasks while being conceptually very simple and efficient. We interpret this as evidence that it is beneficial to include perception-informed processing of fundamental musical concepts when developing graph network applications on musical score data.
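The relative/absolute distinction can be made concrete on pitch: absolute features describe each note (node) on its own, while relative features describe the interval along a graph edge. A toy illustration using MIDI numbers (hypothetical feature choices, not MusGConv's actual feature set):

```python
def absolute_pitch_features(midi_pitch):
    """Absolute representation: pitch class (0-11) and octave of a MIDI number."""
    return {"pitch_class": midi_pitch % 12, "octave": midi_pitch // 12 - 1}

def relative_pitch_feature(midi_src, midi_dst):
    """Relative representation: the signed interval (in semitones)
    along a graph edge between two notes."""
    return midi_dst - midi_src

# A tiny score graph: C4, E4, G4 with edges between consecutive notes.
notes = [60, 64, 67]
edges = [(0, 1), (1, 2)]
node_feats = [absolute_pitch_features(p) for p in notes]
edge_feats = [relative_pitch_feature(notes[s], notes[d]) for s, d in edges]
```

A message-passing layer could then combine node features (absolute) with edge features (relative) during aggregation, which is the perceptual intuition the block builds on: listeners track both where a note sits and how far it moves.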
https://arxiv.org/abs/2405.09224
It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.
https://arxiv.org/abs/2405.09171
Current speaker diarization systems rely on an external voice activity detection (VAD) model prior to speaker embedding extraction on the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs as well as or better than comparable supervised VAD systems. Subsequently, speaker diarization can be performed efficiently by extracting the VAD logits and the corresponding speaker embedding simultaneously, alleviating the need for, and computational overhead of, an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline using ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy attains state-of-the-art performance on the AMI, VoxConverse, and DIHARD III diarization benchmarks.
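The idea of reading VAD decisions out of the pooling attention can be sketched with attentive mean pooling: the same frame-level logits that weight the embedding average double as speech/non-speech scores. A toy version with scalar logits and tiny embeddings (not the actual ECAPA2 attention mechanism):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attentive_pool(frame_embs, attn_logits):
    """Attention-weighted mean of frame embeddings (attentive pooling).
    Frames with high logits dominate the pooled speaker embedding."""
    w = softmax(attn_logits)
    dim = len(frame_embs[0])
    return [sum(w_t * emb[d] for w_t, emb in zip(w, frame_embs))
            for d in range(dim)]

def vad_from_attention(attn_logits, threshold=0.0):
    """Binary speech/non-speech decision per frame by thresholding the
    same logits used for pooling -- the 'internal VAD' reading."""
    return [1 if a > threshold else 0 for a in attn_logits]
```

One forward pass thus yields both outputs: the pooled embedding for clustering and the per-frame VAD mask, which is where the efficiency gain over a separate VAD model comes from.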
https://arxiv.org/abs/2405.09142
In this article, we explore the potential of using latent diffusion models, a family of powerful generative models, for the task of reconstructing naturalistic music from electroencephalogram (EEG) recordings. Unlike simpler music with limited timbres, such as MIDI-generated tunes or monophonic pieces, the focus here is on intricate music featuring a diverse array of instruments, voices, and effects, rich in harmonics and timbre. This study represents an initial foray into achieving high-quality general music reconstruction using non-invasive EEG data, employing an end-to-end training approach directly on raw data, without the need for manual pre-processing and channel selection. We train our models on the public NMED-T dataset and perform quantitative evaluation, proposing neural embedding-based metrics. We additionally perform song classification based on the generated tracks. Our work contributes to the ongoing research in neural decoding and brain-computer interfaces, offering insights into the feasibility of using EEG data for complex auditory information reconstruction.
https://arxiv.org/abs/2405.09062
Binaural Audio Telepresence (BAT) aims to encode the acoustic scene at the far end into binaural signals for the user at the near end. BAT encompasses an immense range of applications that can vary between two extreme modes: Immersive BAT (I-BAT) and Enhanced BAT (E-BAT). With I-BAT, the goal is to preserve the full ambience as if we were at the far end, while with E-BAT, the goal is to enhance the far-end conversation with significantly improved speech quality and intelligibility. To this end, this paper presents a tunable BAT system that varies between these two AT modes with a desired application-specific balance. Microphone signals are converted into binaural signals with a prescribed ambience factor. A novel Spatial COherence REpresentation (SCORE) is proposed as an input feature for model training so that the network remains robust to different array setups. Experimental results demonstrate the superior performance of the proposed BAT system, even when the array configurations were not included in the training phase.
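The tunable trade-off between I-BAT and E-BAT can be pictured as a per-ear linear blend controlled by the ambience factor. A simplified sketch, treating signals as plain per-channel sample lists (the actual system renders the binaural signals with a trained network rather than by direct mixing):

```python
def tunable_binaural_mix(speech_lr, ambience_lr, ambience_factor):
    """Blend enhanced far-end speech with the ambient scene, per ear.

    speech_lr, ambience_lr: [left_channel, right_channel] sample lists
    ambience_factor a in [0, 1]:
      a = 1.0 approximates I-BAT (full ambience preserved),
      a = 0.0 approximates E-BAT (enhanced speech only).
    """
    a = ambience_factor
    return [[(1 - a) * s + a * amb for s, amb in zip(sp_ch, amb_ch)]
            for sp_ch, amb_ch in zip(speech_lr, ambience_lr)]
```

Intermediate factors give the application-specific balance the abstract describes, e.g. a teleconference might sit near 0.2 while a virtual-presence application sits near 1.0 (illustrative values).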
https://arxiv.org/abs/2405.08742
The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning acts as an effective tool for detecting newly emerged deepfake audio while maintaining performance on older types, but it lacks a well-constructed and user-friendly evaluation framework. To address this gap, we introduce EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA includes classic datasets from the Anti-Spoofing Voice series, the Chinese fake audio detection series, and newly generated deepfake audio from models like GPT-4 and GPT-4o. It supports various continual learning techniques, such as Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), and recent methods like Regularized Adaptive Weight Modification (RAWM) and Radian Weight Modification (RWM). Additionally, EVDA facilitates the development of robust algorithms by providing an open interface for integrating new continual learning methods.
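Of the supported methods, EWC is the easiest to state compactly: a quadratic penalty that keeps parameters with high Fisher information (important for detecting older deepfake types) close to their values from earlier tasks. A minimal sketch with toy scalar parameters and hypothetical Fisher estimates:

```python
def ewc_penalty(params, old_params, fisher, lam):
    """Elastic Weight Consolidation regulariser.

    L_total = L_new_task + (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2

    params:     current parameter values (flat list for illustration)
    old_params: values after training on the previous task
    fisher:     diagonal Fisher information, one weight per parameter
    lam:        strength of the consolidation penalty
    """
    return 0.5 * lam * sum(f * (p - q) ** 2
                           for p, q, f in zip(params, old_params, fisher))
```

During training on a new deepfake type, this penalty is added to the new-task loss, so only parameters the old tasks did not rely on (low Fisher) are free to move.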
https://arxiv.org/abs/2405.08596
Neural audio coding has emerged as a vivid research direction by promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art, where a discrete representation in the bottleneck of the autoencoder has to be learned that allows for efficient transmission of the input audio signal. This discrete representation is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ) and a lot of effort has been spent to alleviate drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose simple alternatives to VQ, which are based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters or codebook storage thereby simplifying the training of neural audio codecs. Furthermore, we propose a new causal network architecture for neural speech coding that shows good performance at very low computational complexity.
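Projected scalar quantization in this spirit maps each latent coordinate into a bounded interval and rounds it onto a uniform grid, with no codebook to store and no auxiliary losses to schedule. A toy sketch (grid size chosen arbitrarily; during training a straight-through estimator would pass gradients through the rounding step unchanged):

```python
def scalar_quantize(x, levels=15):
    """Quantize one latent coordinate, already projected into [-1, 1],
    onto a uniform grid of `levels` points."""
    x = max(-1.0, min(1.0, x))        # projection onto [-1, 1]
    step = 2.0 / (levels - 1)         # grid spacing
    return round(x / step) * step     # nearest grid point

# Quantizing a few latent values; each code is one of `levels` grid points,
# so an index needs only ceil(log2(levels)) bits for transmission.
codes = [scalar_quantize(v) for v in (-1.2, -0.33, 0.0, 0.5, 0.97)]
```

Compare this with VQ: there, the decoder must hold a learned codebook and training needs commitment/codebook losses; here the "codebook" is just the fixed grid, which is the simplification the paper exploits.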
https://arxiv.org/abs/2405.08417
With the rapid advancement of generative AI, multimodal deepfakes, which manipulate both audio and visual modalities, have drawn increasing public concern. Currently, deepfake detection has emerged as a crucial strategy in countering these growing threats. However, as a key factor in training and validating deepfake detectors, most existing deepfake datasets primarily focus on the visual modality; the few that are multimodal employ outdated techniques, and their audio content is limited to a single language, thereby failing to represent the cutting-edge advancements and globalization trends in current deepfake technologies. To address this gap, we propose a novel, multilingual, and multimodal deepfake dataset: PolyGlotFake. It includes content in seven languages, created using a variety of cutting-edge and popular Text-to-Speech, voice cloning, and lip-sync technologies. We conduct comprehensive experiments using state-of-the-art detection methods on the PolyGlotFake dataset. These experiments demonstrate the dataset's significant challenges and its practical value in advancing research into multimodal deepfake detection.
https://arxiv.org/abs/2405.08838
Respiratory disease, the third leading cause of death globally, is considered a high-priority ailment requiring significant research on identification and treatment. Stethoscope-recorded lung sounds and artificial intelligence-powered devices have been used to identify lung disorders and aid specialists in making accurate diagnoses. In this study, the audio-spectrogram vision transformer (AS-ViT), a new approach for identifying abnormal respiration sounds, was developed. The sounds of the lungs are converted into visual representations called spectrograms using the short-time Fourier transform (STFT). These images are then analyzed by a vision transformer to identify different types of respiratory sounds. The classification was carried out using the ICBHI 2017 database, which includes various types of lung sounds with different frequencies, noise levels, and backgrounds. The proposed AS-ViT method was evaluated using three metrics, achieving unweighted average recall and overall scores of 79.1% and 59.8% for the 60:40 split ratio and 86.4% and 69.3% for the 80:20 split ratio, respectively, for respiratory sound detection, surpassing previous state-of-the-art results.
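The STFT front end can be sketched as a naive windowed DFT; real systems use FFTs, longer frames, and mel scaling, so the parameters below are purely illustrative:

```python
import cmath
import math

def stft_magnitude(signal, frame_len=8, hop=4):
    """Magnitude spectrogram via a naive short-time Fourier transform:
    Hann-window each frame, DFT it, keep the positive-frequency magnitudes.
    Returns a time-by-frequency grid (the 'image' a ViT would consume)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # positive frequencies only
            z = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
            bins.append(abs(z))
        frames.append(bins)
    return frames
```

The resulting grid is what gets rendered as a spectrogram image and split into patches for the vision transformer; lung-sound classes then differ by where energy concentrates in this time-frequency plane.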
https://arxiv.org/abs/2405.08342
Semantic communications have been utilized to execute numerous intelligent tasks by transmitting task-related semantic information instead of bits. In this article, we propose a semantic-aware speech-to-text transmission system, named SAC-ST, for single-user multiple-input multiple-output (MIMO) and multi-user MIMO communication scenarios. In particular, we first design a semantic communication system to serve the speech-to-text task at the receiver, which compresses the semantic information and generates low-dimensional semantic features by leveraging a transformer module. In addition, a novel semantic-aware network is proposed to facilitate transmission with high semantic fidelity by identifying the critical semantic information and guaranteeing that it is recovered accurately. Furthermore, we extend SAC-ST with a neural network-enabled channel estimation network to mitigate the dependence on accurate channel state information and validate the feasibility of SAC-ST in practical communication environments. Simulation results show that the proposed SAC-ST outperforms the communication framework without the semantic-aware network for speech-to-text transmission over MIMO channels in terms of speech-to-text metrics, especially in the low signal-to-noise regime. Moreover, SAC-ST with the developed channel estimation network is comparable to SAC-ST with perfect channel state information.
https://arxiv.org/abs/2405.08096
Understanding degraded speech is demanding, requiring increased listening effort (LE). Evaluating processed and unprocessed speech with respect to LE can objectively indicate if speech enhancement systems benefit listeners. However, existing methods for measuring LE are complex and not widely applicable. In this study, we propose a simple method to evaluate speech intelligibility and LE simultaneously without additional strain on subjects or operators. We assess this method using results from two independent studies in Norway and Denmark, testing 76 (50+26) subjects across 9 (6+3) processing conditions. Despite differences in evaluation setups, subject recruitment, and processing systems, trends are strikingly similar, demonstrating the proposed method's robustness and ease of implementation into existing practices.
https://arxiv.org/abs/2405.07641
The application of Automatic Speech Recognition (ASR) technology in soccer offers numerous opportunities for sports analytics. Specifically, extracting audio commentaries with ASR provides valuable insights into the events of the game, and opens the door to several downstream applications such as automatic highlight generation. This paper presents SoccerNet-Echoes, an augmentation of the SoccerNet dataset with automatically generated transcriptions of audio commentaries from soccer game broadcasts, enhancing video content with rich layers of textual information derived from the game audio using ASR. These textual commentaries, generated using the Whisper model and translated with Google Translate, extend the usefulness of the SoccerNet dataset in diverse applications such as enhanced action spotting, automatic caption generation, and game summarization. By incorporating textual data alongside visual and auditory content, SoccerNet-Echoes aims to serve as a comprehensive resource for the development of algorithms specialized in capturing the dynamics of soccer games. We detail the methods involved in the curation of this dataset and the integration of ASR. We also highlight the implications of a multimodal approach in sports analytics, and how the enriched dataset can support diverse applications, thus broadening the scope of research and development in the field of sports analytics.
https://arxiv.org/abs/2405.07354
Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals during silent articulations. ETS models usually consist of an EMG encoder which converts EMG signals to acoustic speech features, and a vocoder which then synthesises the speech signals. Due to an inadequate amount of available data and noisy signals, the synthesised speech often exhibits a low level of naturalness. In this work, we propose Diff-ETS, an ETS model which uses a score-based diffusion probabilistic model to enhance the naturalness of synthesised speech. The diffusion model is applied to improve the quality of the acoustic features predicted by an EMG encoder. In our experiments, we evaluated fine-tuning the diffusion model on predictions of a pre-trained EMG encoder, and training both models in an end-to-end fashion. We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicated the proposed Diff-ETS significantly improved speech naturalness over the baseline.
https://arxiv.org/abs/2405.08021
Neural networks and deep learning are often deployed for the sake of the most comprehensive music generation with as little involvement as possible from the human musician. Implementations that aid, or serve as tools for, music practitioners are sparse. This paper proposes the integration of generative stacked autoencoder structures for rhythm generation within a conventional melodic step-sequencer. It further aims to make the implementation accessible to the average electronic music practitioner. Several model architectures have been trained and tested for their creative potential. While the current implementations do display limitations, they represent viable creative solutions for music practitioners.
https://arxiv.org/abs/2405.07034