This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as "dark data," such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2) utterance-level dynamic selection of data cleansing methods based on training data quality, and (3) evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS) to estimate each utterance's impact on model performance. Furthermore, TTSOps jointly optimizes the corpus and the TTS model in a closed-loop framework by dynamically adapting both data selection and data cleansing processes to the characteristics of the target TTS model. Extensive experiments on Japanese YouTube data demonstrate that TTSOps outperforms conventional acoustic-quality-based baselines in both the naturalness and speaker diversity of synthesized speech.
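The abstract describes the evaluation-in-the-loop selection step only at a high level. As a rough, hypothetical sketch of that idea (not the authors' implementation), the loop below retrains a candidate TTS model as data shards are added and keeps a shard only if it raises the automatically predicted MOS on a fixed validation text set; `train_tts`, `synthesize`, and `predict_mos` are placeholder callables supplied by the caller.

```python
"""Rough sketch of evaluation-in-the-loop data selection (not the authors' code).

Assumed placeholder callables, supplied by the caller:
  train_tts(utterances)        -> a trained (or fine-tuned) TTS model
  synthesize(model, sentences) -> list of synthesized waveforms
  predict_mos(waveforms)       -> list of automatic MOS estimates (floats)
"""
from statistics import mean

def select_shards(shards, val_sentences, train_tts, synthesize, predict_mos):
    """Greedily keep data shards whose inclusion raises predicted MOS on a fixed text set."""
    kept, best_score = [], float("-inf")
    for shard in shards:
        candidate = [utt for s in kept + [shard] for utt in s]
        model = train_tts(candidate)                  # (re)train on the candidate corpus
        waveforms = synthesize(model, val_sentences)  # synthesize the validation sentences
        score = mean(predict_mos(waveforms))          # automatic MOS as the selection signal
        if score > best_score:                        # keep the shard only if the model improves
            kept.append(shard)
            best_score = score
    return kept
```

In practice one would fine-tune rather than retrain per shard and batch the decisions, but the control flow, data selection driven by the predicted MOS of the resulting model, is the point of the sketch.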
https://arxiv.org/abs/2506.15614
Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates (or even eliminates) this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time, computation that would otherwise go unused.
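A toy illustration of the input-time speculation the paper describes might look like the following; `generate_first_sentence` stands in for an LLM call and `partial_transcripts` for the stream of ASR hypotheses produced while the user is still speaking, both assumptions for illustration rather than PredGen's actual interface.

```python
def speculative_first_sentence(partial_transcripts, final_transcript, generate_first_sentence):
    """Speculate on the response's first sentence while the user is still speaking.

    partial_transcripts: iterable of growing ASR hypotheses produced during user speech
    final_transcript:    the transcript once the user has finished
    generate_first_sentence(text) -> str, a placeholder for the LLM call

    Returns (first_sentence, was_reused). On a hit, the cached sentence can be handed
    to the sentence-level TTS immediately, hiding the LLM's first-sentence latency.
    """
    cached_prefix, cached_sentence = None, None
    for hypothesis in partial_transcripts:                       # compute that would otherwise sit idle
        cached_prefix = hypothesis
        cached_sentence = generate_first_sentence(hypothesis)    # speculative decode on the partial input
    if cached_sentence is not None and final_transcript == cached_prefix:
        return cached_sentence, True                             # speculation hit: no extra LLM wait
    return generate_first_sentence(final_transcript), False      # miss: fall back to normal decoding
```

A real system would run the speculation concurrently with ASR and accept near-matches rather than exact prefix equality; the sketch only shows why a hit removes the first-sentence wait.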
https://arxiv.org/abs/2506.15556
Many music AI models learn a map between music content and human-defined labels. However, many annotations, such as chords, can be naturally expressed within the music modality itself, e.g., as sequences of symbolic notes. This observation enables both understanding tasks (e.g., chord recognition) and conditional generation tasks (e.g., chord-conditioned melody generation) to be unified under a music-for-music sequence modeling paradigm. In this work, we propose parameter-efficient solutions for a variety of symbolic music-for-music tasks. The high-level idea is that (1) we utilize a pretrained Language Model (LM) for both the reference and the target sequence and (2) we link these two LMs via a lightweight adapter. Experiments show that our method achieves superior performance across tasks such as chord recognition, melody generation, and drum track generation. All demos, code and model weights are publicly available.
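The abstract does not spell out the adapter design, so the following PyTorch sketch only illustrates the general pattern it describes: a frozen reference-sequence LM produces hidden states, a small trainable adapter projects them into the target LM's embedding space, and the projected states condition the (also frozen) target LM as a soft prefix. The module interfaces and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CrossLMAdapter(nn.Module):
    """Bridge a frozen reference-sequence LM to a frozen target-sequence LM via a small adapter.

    Assumes ref_lm(tokens) returns hidden states of shape (batch, T_ref, ref_dim) and
    tgt_lm(embeddings) accepts input embeddings of shape (batch, T, tgt_dim); both
    interfaces are placeholders for whatever pretrained symbolic-music LMs are used.
    """

    def __init__(self, ref_lm: nn.Module, tgt_lm: nn.Module, ref_dim: int, tgt_dim: int):
        super().__init__()
        self.ref_lm, self.tgt_lm = ref_lm, tgt_lm
        for p in list(ref_lm.parameters()) + list(tgt_lm.parameters()):
            p.requires_grad_(False)                          # only the adapter is trained
        self.adapter = nn.Sequential(                        # lightweight bottleneck projection
            nn.Linear(ref_dim, ref_dim // 4), nn.GELU(), nn.Linear(ref_dim // 4, tgt_dim)
        )

    def forward(self, ref_tokens: torch.Tensor, tgt_embeds: torch.Tensor) -> torch.Tensor:
        ref_hidden = self.ref_lm(ref_tokens)                 # e.g. hidden states of the melody
        prefix = self.adapter(ref_hidden)                    # project into the target LM's space
        return self.tgt_lm(torch.cat([prefix, tgt_embeds], dim=1))  # condition e.g. chord generation
```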
https://arxiv.org/abs/2506.15548
Breakthroughs in text-to-music generation models are transforming the creative landscape, equipping musicians with innovative tools for composition and experimentation like never before. However, controlling the generation process to achieve a specific desired outcome remains a significant challenge. Even a minor change in the text prompt, combined with the same random seed, can drastically alter the generated piece. In this paper, we explore the application of existing text-to-music diffusion models for instrument editing. Specifically, for an existing audio track, we aim to leverage a pretrained text-to-music diffusion model to edit the instrument while preserving the underlying content. Based on the insight that the model first focuses on the overall structure or content of the audio, then adds instrument information, and finally refines the quality, we show that selecting a well-chosen intermediate timestep, identified through an instrument classifier, yields a balance between preserving the original piece's content and achieving the desired timbre. Our method does not require additional training of the text-to-music diffusion model, nor does it compromise the generation process's speed.
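One plausible reading of the timestep-selection idea, in the spirit of SDEdit-style partial inversion, is sketched below with generic placeholder functions (`encode`, `add_noise`, `denoise_from`, `source_instrument_probability`); none of these correspond to a specific diffusion library or to the paper's code.

```python
def edit_instrument(audio, new_prompt, timesteps,
                    encode, add_noise, denoise_from, source_instrument_probability):
    """Choose an intermediate timestep with an instrument classifier, then regenerate from it.

    All arguments after `timesteps` are placeholder callables:
      encode(audio) -> latent; add_noise(latent, t) -> noisy latent at step t;
      denoise_from(noisy, t, prompt) -> edited audio by resuming the reverse process;
      source_instrument_probability(audio) -> prob. the original instrument is still audible.

    Smaller t keeps more of the original content; the classifier is used to find the
    mildest edit at which the source timbre is gone, balancing content and timbre.
    """
    latent = encode(audio)
    chosen_t = max(timesteps)                                # fall back to the most aggressive edit
    for t in sorted(timesteps):                              # try mild edits first
        edited = denoise_from(add_noise(latent, t), t, new_prompt)
        if source_instrument_probability(edited) < 0.5:      # source instrument no longer recognized
            chosen_t = t
            break
    return denoise_from(add_noise(latent, chosen_t), chosen_t, new_prompt)
```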
https://arxiv.org/abs/2506.15530
Automatic lyrics transcription (ALT) remains a challenging task in the field of music information retrieval, despite great advances in automatic speech recognition (ASR) brought about by transformer-based architectures in recent years. One of the major challenges in ALT is the high amplitude of interfering audio signals relative to conventional ASR due to musical accompaniment. Recent advances in music source separation have enabled automatic extraction of high-quality separated vocals, which could potentially improve ALT performance. However, the effect of source separation has not been systematically investigated in order to establish best practices for its use. This work examines the impact of source separation on ALT using Whisper, a state-of-the-art open source ASR model. We evaluate Whisper's performance on original audio, separated vocals, and vocal stems across short-form and long-form transcription tasks. For short-form, we suggest a concatenation method that results in a consistent reduction in Word Error Rate (WER). For long-form, we propose an algorithm using source separation as a vocal activity detector to derive segment boundaries, which results in a consistent reduction in WER relative to Whisper's native long-form algorithm. Our approach achieves state-of-the-art results for an open source system on the Jam-ALT long-form ALT benchmark, without any training or fine-tuning. We also publish MUSDB-ALT, the first dataset of long-form lyric transcripts following the Jam-ALT guidelines for which vocal stems are publicly available.
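The long-form algorithm is only described at a high level; a rough take on "use the separated vocals as a vocal activity detector" is sketched below, where `separated_vocals` is a mono NumPy array and the returned (start, end) times would each be cropped and passed to Whisper. The frame size, threshold, and gap-merging values are arbitrary.

```python
import numpy as np

def vocal_segments(separated_vocals, sr, frame_s=0.05, thresh_db=-40.0, max_gap_s=1.0):
    """Derive segment boundaries (in seconds) from the energy of source-separated vocals."""
    frame = int(frame_s * sr)
    n = len(separated_vocals) // frame
    frames = np.asarray(separated_vocals[: n * frame], dtype=float).reshape(n, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    level_db = 20 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)   # level relative to the loudest frame
    active = level_db > thresh_db                                 # frames with vocal energy

    segments, start = [], None
    for i, is_active in enumerate(active):                        # group active frames into runs
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            segments.append((start * frame_s, i * frame_s))
            start = None
    if start is not None:
        segments.append((start * frame_s, n * frame_s))

    merged = []                                                   # bridge short gaps between runs
    for seg in segments:
        if merged and seg[0] - merged[-1][1] <= max_gap_s:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged
```

Each returned window would then be cropped from the audio and transcribed with Whisper in place of its native long-form segmentation.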
https://arxiv.org/abs/2506.15514
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
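The abstract names the two distillation targets (HuBERT for phoneme-level structure, LaBSE for lexical cues) but not the exact losses, so the snippet below only shows a generic way such terms might be combined, assuming the phonetic latents are already aligned to HuBERT's frame rate and dimensionality; the weights and the cosine-distance form are illustrative, not HAC's actual objective.

```python
import torch
import torch.nn.functional as F

def hac_style_loss(recon, target, phonetic_latents, hubert_feats, lexical_latent, labse_emb,
                   w_phon=1.0, w_lex=1.0):
    """Reconstruction plus two distillation terms (illustrative weights and distance choices).

    Assumes phonetic_latents and hubert_feats share frame rate and dimensionality, and that
    lexical_latent and labse_emb are utterance-level vectors of matching dimensionality.
    """
    recon_loss = F.l1_loss(recon, target)                                                # codec reconstruction
    phon_loss = 1 - F.cosine_similarity(phonetic_latents, hubert_feats, dim=-1).mean()   # phoneme-level structure
    lex_loss = 1 - F.cosine_similarity(lexical_latent, labse_emb, dim=-1).mean()         # lexical / semantic cue
    return recon_loss + w_phon * phon_loss + w_lex * lex_loss
```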
https://arxiv.org/abs/2506.15456
While the use of social robots for language teaching has been explored, there remains limited work on task-specific synthesized voices for language teaching robots. Given that language is a verbal task, this gap may have severe consequences for the effectiveness of robots for language teaching tasks. We address this lack of L2 teaching robot voices through three contributions: 1. We address the need for a lightweight and expressive robot voice. Using a fine-tuned version of Matcha-TTS, we use emoji prompting to create an expressive voice that shows a range of expressivity over time. The voice can run in real time with limited compute resources. Through case studies, we found this voice more expressive, socially appropriate, and suitable for long periods of expressive speech, such as storytelling. 2. We explore how to adapt a robot's voice to physical and social ambient environments to deploy our voices in various locations. We found that increasing pitch and pitch rate in noisy and high-energy environments makes the robot's voice appear more appropriate and makes it seem more aware of its current environment. 3. We create an English TTS system with improved clarity for L2 listeners using known linguistic properties of vowels that are difficult for these listeners. We used a data-driven, perception-based approach to understand how L2 speakers use duration cues to interpret challenging words with minimal tense (long) and lax (short) vowels in English. We found that the duration of vowels strongly influences perception for L2 listeners and created an "L2 clarity mode" for Matcha-TTS that applies a lengthening to tense vowels while leaving lax vowels unchanged. Our clarity mode was found to be more respectful, intelligible, and encouraging than base Matcha-TTS while reducing transcription errors in these challenging tense/lax minimal pairs.
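As a toy version of the "L2 clarity mode" duration rule described in contribution 3 (lengthen tense vowels, leave lax vowels unchanged), the snippet below stretches tense-vowel durations in an ARPAbet-like phone sequence; the vowel sets and the 1.3x factor are illustrative choices, not the paper's values.

```python
# Illustrative tense ("long") and lax ("short") English vowel sets, ARPAbet-style symbols.
TENSE_VOWELS = {"IY", "EY", "UW", "OW", "AO"}
LAX_VOWELS = {"IH", "EH", "UH", "AH", "AE"}

def l2_clarity_durations(phones, durations, stretch=1.3):
    """Lengthen tense vowels only, exaggerating the tense/lax duration cue for L2 listeners."""
    out = []
    for phone, dur in zip(phones, durations):
        symbol = phone.rstrip("012")                        # drop stress digits, e.g. "IY1" -> "IY"
        out.append(dur * stretch if symbol in TENSE_VOWELS else dur)
    return out

# "beat" vs. "bit": only the tense vowel in "beat" is lengthened (durations in seconds).
print(l2_clarity_durations(["B", "IY1", "T"], [0.05, 0.12, 0.06]))
print(l2_clarity_durations(["B", "IH1", "T"], [0.05, 0.10, 0.06]))
```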
https://arxiv.org/abs/2506.15107
Speech enhancement, particularly denoising, is vital in improving the intelligibility and quality of speech signals for real-world applications, especially in noisy environments. While prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and speaker-specific feature preservation, leaving a critical research gap in their comparative performance evaluation. This study benchmarks three state-of-the-art models, Wave-U-Net, CMGAN, and U-Net, on the diverse SpEAR, VPQAD, and Clarkson datasets. These models were chosen due to their relevance in the literature and code accessibility. The evaluation reveals that U-Net achieves high noise suppression with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN excels in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, making it well-suited for applications prioritizing natural and intelligible speech. Wave-U-Net balances these attributes with improvements in speaker-specific feature retention, evidenced by VeriSpeak score gains of +10.84% on SpEAR and +27.38% on VPQAD. This research indicates how advanced methods can optimize trade-offs between noise suppression, perceptual quality, and speaker recognition. The findings may contribute to advancing voice biometrics, forensic audio analysis, telecommunication, and speaker verification in challenging acoustic conditions.
https://arxiv.org/abs/2506.15000
The Audio Mostly (AM) conference has long been a platform for exploring the intersection of sound, technology, and culture. Despite growing interest in sonic cultures, discussions on the role of cultural diversity in sound design and sonification remain limited. This paper investigates the implicit biases and gaps within the discourse on music and sound aesthetics, challenging the notion of music as a 'universal language'. Through a historical and cross-cultural analysis of musicology and ethnomusicology, the profound influence of cultural context on auditory perception and aesthetic appraisal is highlighted. By drawing parallels between historical music practices and contemporary sound design, the paper advocates for a more inclusive approach that recognizes the diversity of sonic traditions. Using music as a case study, we underscore broader implications for sound design and sonification, emphasizing the need to integrate cultural perspectives into auditory design practices. A reevaluation of existing frameworks in sound design and sonification is proposed, emphasizing the necessity of culturally informed practices that resonate with global audiences. Ultimately, embracing cultural diversity in sound design is suggested to lead to richer, more meaningful auditory experiences and to foster greater inclusivity within the field.
https://arxiv.org/abs/2506.14877
In this paper, we propose NSD-MS2S, a novel neural speaker diarization system that integrates a memory-aware multi-speaker embedding module with a sequence-to-sequence (Seq2Seq) architecture. The system leverages a memory module to enhance speaker embeddings and employs the Seq2Seq framework to efficiently map acoustic features to speaker labels. Additionally, we explore the application of mixture of experts in speaker diarization, and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and enhance performance. Incorporating SS-MoE leads to the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including the CHiME-6, DiPCo, Mixer 6 and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.
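The abstract does not detail the SS-MoE block; one possible reading, a shared expert whose output is added to a softly (densely) weighted mixture of routed experts, is sketched in PyTorch below. The layer sizes, router, and residual wiring are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class SharedSoftMoE(nn.Module):
    """Shared expert plus a softly (densely) weighted mixture of routed experts.

    One possible reading of an 'SS-MoE' block; expert sizes, the router,
    and the residual wiring are illustrative choices.
    """

    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 256):
        super().__init__()

        def make_ffn():
            return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

        self.shared_expert = make_ffn()                                   # always applied
        self.experts = nn.ModuleList([make_ffn() for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)                         # soft, dense mixing weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:                  # x: (batch, time, dim)
        weights = torch.softmax(self.router(x), dim=-1)                   # (batch, time, num_experts)
        routed = torch.stack([expert(x) for expert in self.experts], dim=-1)  # (batch, time, dim, E)
        mixture = (routed * weights.unsqueeze(-2)).sum(dim=-1)            # weighted sum over experts
        return x + self.shared_expert(x) + mixture                        # residual combination
```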
https://arxiv.org/abs/2506.14750
Jamming requires coordination, anticipation, and collaborative creativity between musicians. Current generative models of music produce expressive output but are not able to generate in an online manner, meaning simultaneously with other musicians (human or otherwise). We propose ReaLchords, an online generative model for improvising chord accompaniment to user melody. We start with an online model pretrained by maximum likelihood, and use reinforcement learning to finetune the model for online use. The finetuning objective leverages both a novel reward model that provides feedback on harmonic and temporal coherency between melody and chords, and a divergence term that implements a novel type of distillation from a teacher model that can see the future melody. Through quantitative experiments and listening tests, we demonstrate that the resulting model adapts well to unfamiliar input and produces fitting accompaniment. ReaLchords opens the door to live jamming, as well as simultaneous co-creation in other modalities.
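Schematically, the finetuning objective combines a reward term with a divergence toward a future-aware teacher; the snippet below only shows how such terms might be mixed and omits the policy-gradient machinery of the actual RL finetuning. The shapes and the weighting `beta` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def realchords_style_loss(online_logits, teacher_logits, reward, beta=0.1):
    """Schematic mix of a reward term and a divergence toward a future-aware teacher.

    online_logits:  (batch, time, vocab) chord logits from the online (causal) model
    teacher_logits: (batch, time, vocab) chord logits from a teacher that also sees future melody
    reward:         (batch,) coherency reward for the sampled accompaniment

    Only the combination of terms is shown; how the reward is propagated through
    sampling (the policy-gradient part) is omitted.
    """
    log_p_online = F.log_softmax(online_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
    kl = (log_p_online.exp() * (log_p_online - log_p_teacher)).sum(-1).mean()  # KL(online || teacher)
    return -reward.mean() + beta * kl
```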
https://arxiv.org/abs/2506.14723
Automatic sample identification (ASID), the detection and identification of portions of audio recordings that have been reused in new musical works, is an essential but challenging task in the field of audio query-based retrieval. While a related task, audio fingerprinting, has made significant progress in accurately retrieving musical content under "real world" (noisy, reverberant) conditions, ASID systems struggle to identify samples that have undergone musical modifications. Thus, a system robust to common music production transformations such as time-stretching, pitch-shifting, effects processing, and underlying or overlaying music is an important open challenge. In this work, we propose a lightweight and scalable encoding architecture employing a Graph Neural Network within a contrastive learning framework. Our model uses only 9% of the trainable parameters compared to the current state-of-the-art system while achieving comparable performance, reaching a mean average precision (mAP) of 44.2%. To enhance retrieval quality, we introduce a two-stage approach consisting of an initial coarse similarity search for candidate selection, followed by a cross-attention classifier that rejects irrelevant matches and refines the ranking of retrieved candidates, an essential capability absent in prior models. In addition, because queries in real-world applications are often short in duration, we benchmark our system for short queries using new fine-grained annotations for the Sample100 dataset, which we publish as part of this work.
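The two-stage retrieval can be outlined as a coarse cosine-similarity shortlist followed by classifier-based rejection and re-ranking; in the sketch below, `rerank_score` is a placeholder for the cross-attention classifier, and the threshold and shortlist size are illustrative.

```python
import numpy as np

def two_stage_retrieval(query_emb, ref_embs, ref_ids, query_audio, ref_audios,
                        rerank_score, top_k=50, accept_threshold=0.5):
    """Stage 1: cosine-similarity shortlist. Stage 2: classifier rejects and re-ranks.

    rerank_score(query_audio, ref_audio) -> float stands in for the cross-attention classifier.
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-9)
    r = ref_embs / (np.linalg.norm(ref_embs, axis=1, keepdims=True) + 1e-9)
    coarse = r @ q                                                      # similarity to every reference
    shortlist = np.argsort(-coarse)[:top_k]                             # candidates for the expensive stage
    scored = [(ref_ids[i], rerank_score(query_audio, ref_audios[i])) for i in shortlist]
    accepted = [(rid, s) for rid, s in scored if s >= accept_threshold]  # reject irrelevant matches
    return sorted(accepted, key=lambda pair: -pair[1])                  # refined ranking
```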
https://arxiv.org/abs/2506.14684
This chapter reconsiders the concept of pitch in contemporary popular music (CPM), particularly in electronic contexts where traditional assumptions may fail. Drawing on phenomenological and inductive methods, it argues that pitch is not an ontologically objective property but a perceptual construct shaped by listeners and conditions. Analyses of quasi-harmonic tones reveal that a single tone can convey multiple pitches, giving rise to tonal fission. The perception of pitch may also be multistable, varying for the same listener over time. In this framework, the tuning system may emerge from a tone's internal structure. A parallel with the coastline paradox supports a model of pitch grounded in perceptual variability, challenging inherited theoretical norms.
https://arxiv.org/abs/2506.14504
This paper introduces ORD-CC32, an open research dataset derived from the 1932 Cairo Congress of Arab Music recordings, a historically significant collection representing diverse Arab musical traditions. The dataset includes structured metadata, melodic and rhythmic mode tags (maqam and iqa), manually labeled tonic information, and acoustic features extracted using state-of-the-art pitch detection methods. These resources support computational studies of tuning, temperament, and regional variations in Arab music. A case study using pitch histograms demonstrates the potential for data-driven analysis of microtonal differences across regions. By making this dataset openly available, we aim to enable interdisciplinary research in computational ethnomusicology, music information retrieval (MIR), cultural studies, and digital heritage preservation. ORD-CC32 is shared on Zenodo with tools for feature extraction and metadata retrieval.
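As an example of the kind of pitch-histogram analysis the case study refers to, the snippet below folds an f0 track into cents above a labeled tonic and histograms it; the octave folding and 10-cent bins are arbitrary choices, not the dataset's released tooling.

```python
import numpy as np

def folded_pitch_histogram(f0_hz, tonic_hz, bin_cents=10):
    """Octave-folded pitch histogram in cents above the tonic (e.g. to inspect neutral intervals)."""
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0[f0 > 0]                                   # drop unvoiced frames (f0 == 0)
    cents = 1200 * np.log2(voiced / tonic_hz)             # distance from the labeled tonic in cents
    cents = np.mod(cents, 1200)                           # fold all octaves onto one
    bins = np.arange(0, 1200 + bin_cents, bin_cents)
    hist, _ = np.histogram(cents, bins=bins, density=True)
    return bins[:-1], hist
```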
https://arxiv.org/abs/2506.14503
There has been increasing interest in unifying streaming and non-streaming automatic speech recognition (ASR) models to reduce development, training, and deployment costs. We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. We propose to use dynamic right-context through chunked attention masking in the training of zipformer-based ASR models. We demonstrate that using right-context is more effective in zipformer models compared to other conformer models due to its multi-scale nature. We analyze the effect of varying the number of right-context frames on the accuracy and latency of the streaming ASR models. We use Librispeech and large in-house conversational datasets to train different versions of streaming and non-streaming models and evaluate them in a production-grade server-client setup across diverse test sets from different domains. The proposed strategy reduces the word error rate by a relative 7.9% with a small degradation in user-perceived latency. By adding more right-context frames, we are able to achieve streaming performance close to that of non-streaming models. Our approach also allows flexible control of the latency-accuracy tradeoff according to customer requirements.
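A generic chunked attention mask with configurable right-context, where each frame attends to everything up to the end of its chunk plus a fixed number of future frames, can be built as follows; this is a simplified construction for intuition, not the exact zipformer training recipe.

```python
import numpy as np

def chunked_attention_mask(num_frames, chunk_size, right_context):
    """Boolean mask: each frame may attend to the past, its own chunk, and `right_context` future frames."""
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for t in range(num_frames):
        chunk_end = ((t // chunk_size) + 1) * chunk_size              # end (exclusive) of t's chunk
        allowed_end = min(num_frames, chunk_end + right_context)      # extend by the right-context frames
        mask[t, :allowed_end] = True
    return mask

# Example: 8 frames, chunks of 4, 2 right-context frames; frame 1 may attend to frames 0-5.
print(chunked_attention_mask(8, 4, 2).astype(int))
```

Setting `right_context` to zero recovers plain chunked (streaming) attention, while letting it grow toward the sequence length approaches the non-streaming case, which is the latency-accuracy knob the abstract describes.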
https://arxiv.org/abs/2506.14434
Solutions for defending against deepfake speech fall into two categories: proactive watermarking models and passive conventional deepfake detectors. While both address common threats, their differences in training, optimization, and evaluation prevent a unified protocol for jointly evaluating them and selecting the best solution for a given case. This work proposes a framework to evaluate both model types in deepfake speech detection. To ensure fair comparison and minimize discrepancies, all models were trained and tested on common datasets, with performance evaluated using a shared metric. We also analyze their robustness against various adversarial attacks, showing that different models exhibit distinct vulnerabilities to different speech attribute distortions. Our training and evaluation code is available on GitHub.
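One shared metric that applies equally to watermark verifiers and passive detectors is the equal error rate over their decision scores; a standard EER computation (not taken from the paper's released code) looks like this, assuming higher scores mean "bona fide".

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """EER: the operating point where false acceptance and false rejection rates meet."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)), np.zeros(len(spoof_scores))])
    order = np.argsort(scores)                                       # sweep the threshold over sorted scores
    labels = labels[order]
    frr = np.cumsum(labels) / max(labels.sum(), 1)                   # bona fide rejected below threshold
    far = 1 - np.cumsum(1 - labels) / max((1 - labels).sum(), 1)     # spoofs accepted above threshold
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)
```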
https://arxiv.org/abs/2506.14398
With the development of audio deepfake techniques, attacks using partially deepfake audio are beginning to rise. Compared to fully deepfake audio, partially deepfake audio is much harder for a detector to identify because only small portions are covertly manipulated, resulting in higher security risks. Although some studies have been launched, there is no comprehensive review that systematically introduces the current situation and development trends for addressing this issue. Thus, in this survey, we are the first to provide a systematic introduction to partially deepfake audio manipulated region localization tasks, covering the fundamentals, the branches of existing methods, current limitations, and potential trends, offering a revealing insight into this area.
https://arxiv.org/abs/2506.14396
We present Sleeping-DISCO 9M, a large-scale pre-training dataset for music and song. To the best of our knowledge, there is no open-source, high-quality dataset representing popular and well-known songs for generative music modeling tasks such as text-to-music generation, music captioning, singing-voice synthesis, melody reconstruction, and cross-modal retrieval. Past contributions have focused on isolated and constrained factors, centered on creating synthetic or re-recorded music corpora (e.g., GTSinger, M4Singer), while arbitrarily large-scale audio datasets (e.g., DISCO-10M and LAIONDISCO-12M) have been another focus for the community. Unfortunately, adoption of these datasets in the generative music community has been limited, as they fail to reflect real-world music and its flavour. Our dataset changes this narrative by providing a corpus constructed from actual popular music by world-renowned artists.
https://arxiv.org/abs/2506.14293
Short-utterance speaker verification presents significant challenges due to the limited information in brief speech segments, which can undermine accuracy and reliability. Recently, zero-shot text-to-speech (ZS-TTS) systems have made considerable progress in preserving speaker identity. In this study, we explore, for the first time, the use of ZS-TTS systems for test-time data augmentation for speaker verification. We evaluate three state-of-the-art pre-trained ZS-TTS systems, NatureSpeech 3, CosyVoice, and MaskGCT, on the VoxCeleb 1 dataset. Our experimental results show that combining real and synthetic speech samples leads to 10%-16% relative equal error rate (EER) reductions across all durations, with particularly notable improvements for short utterances, all without retraining any existing systems. However, our analysis reveals that longer synthetic speech does not yield the same benefits as longer real speech in reducing EERs. These findings highlight the potential and challenges of using ZS-TTS for test-time speaker verification, offering insights for future research.
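The abstract does not say how real and synthetic samples are combined at test time; one straightforward reading, pooling speaker embeddings of the short test utterance with embeddings of ZS-TTS clones of that speaker before cosine scoring, is sketched below with placeholder `embed` and `zs_tts_clone` functions.

```python
import numpy as np

def augmented_verification_score(enroll_wav, test_wav, texts, embed, zs_tts_clone):
    """Cosine score with test-time ZS-TTS augmentation of a short test utterance.

    embed(wav) -> 1-D speaker embedding; zs_tts_clone(ref_wav, text) -> synthetic speech in
    the reference speaker's voice. Both are placeholders, as is the embedding-level pooling.
    """
    test_embs = [embed(test_wav)]
    for text in texts:                                   # clone the test speaker saying extra sentences
        test_embs.append(embed(zs_tts_clone(test_wav, text)))
    pooled = np.mean(test_embs, axis=0)                  # pool real + synthetic embeddings
    e = embed(enroll_wav)
    return float(np.dot(e, pooled) / (np.linalg.norm(e) * np.linalg.norm(pooled) + 1e-9))
```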
https://arxiv.org/abs/2506.14226
Speech pre-processing techniques such as denoising, de-reverberation, and separation, are commonly employed as front-ends for various downstream speech processing tasks. However, these methods can sometimes be inadequate, resulting in residual noise or the introduction of new artifacts. Such deficiencies are typically not captured by metrics like SI-SNR but are noticeable to human listeners. To address this, we introduce SpeechRefiner, a post-processing tool that utilizes Conditional Flow Matching (CFM) to improve the perceptual quality of speech. In this study, we benchmark SpeechRefiner against recent task-specific refinement methods and evaluate its performance within our internal processing pipeline, which integrates multiple front-end algorithms. Experiments show that SpeechRefiner exhibits strong generalization across diverse impairment sources, significantly enhancing speech perceptual quality. Audio demos can be found at this https URL.
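Independent of SpeechRefiner's architecture, conditional flow matching has a standard training objective; a generic linear-path version, conditioning the velocity model on the front-end output, is sketched below (the representation and the conditioning interface are assumptions).

```python
import torch

def cfm_training_step(velocity_model, clean, degraded):
    """One generic conditional flow matching step: regress the velocity of a straight path.

    velocity_model(x_t, t, cond) -> tensor shaped like x_t; `clean` is the target speech
    representation and `degraded` the imperfect front-end output used as the condition.
    """
    x1 = clean
    x0 = torch.randn_like(x1)                                       # noise endpoint at t = 0
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1                                     # linear interpolation path
    target_velocity = x1 - x0                                       # constant velocity of that path
    pred = velocity_model(x_t, t, degraded)                         # condition on the enhanced input
    return torch.mean((pred - target_velocity) ** 2)
```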
https://arxiv.org/abs/2506.13709