Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform large models into smaller ones at the cost of significant performance degradation or require prolonged training of the smaller models to reach better performance. To address these issues, we introduce an effective two-step representation-learning-based approach capable of producing several small models from a single large model while ensuring considerably better performance in a limited number of epochs. Comprehensive experimentation on ASR benchmarks reveals the efficacy of our approach, achieving a three-fold training speed-up and up to a 12.54% word error rate improvement.
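To make the reported word-error-rate numbers concrete, here is a minimal sketch of how a relative WER improvement between two systems could be computed; the `jiwer` package and the toy transcripts are assumptions for illustration, not the paper's evaluation setup.

```python
import jiwer  # reference WER implementation; an assumption, not the paper's tooling

references = ["the cat sat on the mat", "speech recognition is useful"]
baseline_hyps = ["the cat sat on a mat", "speech recognition is hard"]      # 2 word errors
compressed_hyps = ["the cat sat on the mat", "speech recognition is hard"]  # 1 word error

wer_baseline = jiwer.wer(references, baseline_hyps)      # 2 / 10 = 0.20
wer_compressed = jiwer.wer(references, compressed_hyps)  # 1 / 10 = 0.10

# Relative WER improvement of the second system over the first.
improvement = (wer_baseline - wer_compressed) / wer_baseline
print(f"{improvement:.2%} relative WER improvement")     # -> 50.00%
```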
https://arxiv.org/abs/2505.16991
Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
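As a rough illustration of the pipeline described above, the sketch below synthesizes speech for text sentences, keeps only utterances whose ASR transcript stays close to the input text (a simple intelligibility proxy), and returns the survivors as synthetic training pairs. The `synthesize` and `transcribe` callables and the WER threshold are hypothetical placeholders; the paper defines its own TTS models and intelligibility framework.

```python
from typing import Callable, List, Tuple

import jiwer  # WER used here as a crude intelligibility proxy


def back_translate(
    sentences: List[str],
    synthesize: Callable[[str], bytes],   # text -> audio (e.g. wav bytes), from a TTS model
    transcribe: Callable[[bytes], str],   # audio -> text, from a seed ASR model
    max_wer: float = 0.3,                 # assumed intelligibility threshold
) -> List[Tuple[bytes, str]]:
    """Return (audio, text) pairs that pass the intelligibility filter."""
    kept = []
    for text in sentences:
        audio = synthesize(text)
        hypothesis = transcribe(audio)
        # Low WER between the input text and the ASR transcript of the
        # synthetic audio is taken as evidence the audio is intelligible.
        if jiwer.wer(text, hypothesis) <= max_wer:
            kept.append((audio, text))
    return kept
```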
https://arxiv.org/abs/2505.16972
In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extremely compressing multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.
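One ingredient of this recipe, truncating layer depth, can be sketched in a few lines with Hugging Face Transformers; the checkpoint name and the number of retained layers below are illustrative assumptions, not the paper's configuration.

```python
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"   # assumed multilingual encoder for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

keep_layers = 4                   # illustrative depth after truncation
# nn.ModuleList supports slicing, so we keep the bottom `keep_layers`
# transformer blocks and update the config to match.
model.encoder.layer = model.encoder.layer[:keep_layers]
model.config.num_hidden_layers = keep_layers

n_params = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Truncated encoder to {len(model.encoder.layer)} layers ({n_params:.1f}M parameters)")
```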
https://arxiv.org/abs/2505.16956
We introduce a new paradigm for active sound modification: Active Speech Enhancement (ASE). While Active Noise Cancellation (ANC) algorithms focus on suppressing external interference, ASE goes further by actively shaping the speech signal -- both attenuating unwanted noise components and amplifying speech-relevant frequencies -- to improve intelligibility and perceptual quality. To enable this, we propose a novel Transformer-Mamba-based architecture, along with a task-specific loss function designed to jointly optimize interference suppression and signal enrichment. Our method outperforms existing baselines across multiple speech processing tasks -- including denoising, dereverberation, and declipping -- demonstrating the effectiveness of active, targeted modulation in challenging acoustic environments.
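As a generic stand-in for the kind of joint objective described above (the paper's actual loss is not reproduced here), the sketch below combines a time-domain SI-SDR term for interference suppression with a spectral-magnitude term for signal fidelity; the weighting and STFT settings are assumptions.

```python
import torch
import torch.nn.functional as F

def si_sdr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for (batch, samples) tensors."""
    target = target - target.mean(dim=-1, keepdim=True)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    proj = (estimate * target).sum(-1, keepdim=True) * target / (target.pow(2).sum(-1, keepdim=True) + eps)
    noise = estimate - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def joint_loss(estimate: torch.Tensor, clean: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    suppression = -si_sdr(estimate, clean).mean()            # time-domain suppression term
    window = torch.hann_window(512, device=estimate.device)
    spec_est = torch.stft(estimate, 512, window=window, return_complex=True).abs()
    spec_ref = torch.stft(clean, 512, window=window, return_complex=True).abs()
    enrichment = F.l1_loss(spec_est, spec_ref)               # spectral fidelity term
    return suppression + alpha * enrichment

# Toy usage on random (batch, samples) signals.
est, ref = torch.randn(2, 16000), torch.randn(2, 16000)
print(joint_loss(est, ref))
```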
https://arxiv.org/abs/2505.16911
This paper introduces a method for detecting inappropriately targeting language in online conversations by integrating crowd and expert annotations with ChatGPT. We focus on English conversation threads from Reddit, examining comments that target individuals or groups. Our approach involves a comprehensive annotation framework that labels a diverse data set for various target categories and specific target words within the conversational context. We perform a comparative analysis of annotations from human experts, crowd annotators, and ChatGPT, revealing strengths and limitations of each method in recognizing both explicit hate speech and subtler discriminatory language. Our findings highlight the significant role of contextual factors in identifying hate speech and uncover new categories of targeting, such as social belief and body image. We also address the challenges and subjective judgments involved in annotation and the limitations of ChatGPT in grasping nuanced language. This study provides insights for improving automated content moderation strategies to enhance online safety and inclusivity.
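A hedged sketch of how a comment could be sent to ChatGPT for target-category annotation, in the spirit of the comparison above; the prompt wording, the category list, and the model name are illustrative assumptions, not the study's actual annotation protocol.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIES = ["ethnicity", "gender", "religion", "social belief", "body image", "none"]

def annotate_comment(comment: str, context: str) -> str:
    prompt = (
        "You are annotating online conversations for inappropriately targeting "
        f"language. Categories: {', '.join(CATEGORIES)}.\n"
        f"Conversation context:\n{context}\n\n"
        f"Comment to label:\n{comment}\n\n"
        "Answer with one category and the specific target words, if any."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the study used ChatGPT
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```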
https://arxiv.org/abs/2505.16847
Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a Constant Frame Rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR not optimal in terms of bitrate and token sequence length, hindering efficiency in real-time applications. In this work, we propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility, and maintains competitive performance even at lower frame rates. Our approach is promising for the integration with other efforts to develop low-frame-rate neural speech codecs for more efficient downstream tasks.
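A toy illustration of the allocation idea: low-entropy segments (e.g., silence) receive a coarse frame rate and high-entropy segments a fine one. The entropy measure, segment length, and rate range are assumptions for illustration; the codec defines its own allocation scheme.

```python
import numpy as np

def spectral_entropy(frame: np.ndarray) -> float:
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    p = spectrum / (spectrum.sum() + 1e-12)
    return float(-(p * np.log2(p + 1e-12)).sum())

def allocate_frame_rates(wav: np.ndarray, sr: int = 16000, segment_ms: int = 80) -> list:
    """Map each segment's entropy to a frame rate between 12.5 and 75 Hz (illustrative range)."""
    seg = int(sr * segment_ms / 1000)
    rates = []
    for start in range(0, len(wav) - seg + 1, seg):
        h = spectral_entropy(wav[start:start + seg])
        h_norm = min(h / np.log2(seg // 2 + 1), 1.0)   # normalize by maximum possible entropy
        rates.append(12.5 + h_norm * (75.0 - 12.5))
    return rates
```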
https://arxiv.org/abs/2505.16845
Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully-voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.
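A minimal sketch of the Recursive Narrative Bank idea: each new utterance is generated from the scene prompts plus the accumulated dialogue history, and the result is appended back into the bank. The prompt template and the `generate_dialogue` callable are illustrative assumptions.

```python
from typing import Callable, List


class RecursiveNarrativeBank:
    def __init__(self, generate_dialogue: Callable[[str], str]):
        self.history: List[str] = []
        self.generate_dialogue = generate_dialogue

    def next_utterance(self, setting_prompt: str, action_prompt: str, scene_caption: str) -> str:
        context = "\n".join(self.history) if self.history else "(story begins)"
        prompt = (
            f"Story so far:\n{context}\n\n"
            f"Scene setting: {setting_prompt}\n"
            f"Character action: {action_prompt}\n"
            f"Visual context: {scene_caption}\n"
            "Write the character's next line of dialogue, consistent with the story so far."
        )
        utterance = self.generate_dialogue(prompt)
        self.history.append(utterance)   # recursion: future scenes are conditioned on this line
        return utterance
```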
https://arxiv.org/abs/2505.16819
A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker embeddings, extracted from clean and noisy speech respectively, via the forward process of a diffusion model, and then reconstructs them into clean embeddings in the reverse process. During inference, all embeddings are regenerated via the diffusion process. Our method needs neither speaker labels nor any modification to the existing speaker recognition pipeline. Experiments on evaluation sets simulating environment-mismatch scenarios show that our method can improve recognition accuracy by up to 19.6% over baseline models while retaining performance on conventional scenarios. We publish our code at this https URL
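A sketch of the forward (noising) step applied to speaker embeddings, using the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps; the noise schedule, number of steps, and embedding size are generic assumptions, not the paper's exact settings.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative product \bar{alpha}_t

def forward_noise(embedding: torch.Tensor, t: int) -> torch.Tensor:
    """Progressively corrupt a (clean or noisy) speaker embedding at step t."""
    eps = torch.randn_like(embedding)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * embedding + (1.0 - a_bar).sqrt() * eps

# Example: noise a 192-dimensional embedding (a common speaker-embedding size).
x0 = torch.randn(192)
x_t = forward_noise(x0, t=500)
```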
https://arxiv.org/abs/2505.16798
Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings. They are also often unable to generalize for speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer based conditional flow matching speech decoder. We show that this architecture allows us to train a voice-conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features. Our model also manages to excel in zero-shot cross-lingual settings even for unseen languages.
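A sketch of the textless discrete-unit extraction step: self-supervised features are quantized with k-means to give unit sequences for training the decoder. The checkpoint, the layer used, and the number of clusters are illustrative assumptions, since unit-extraction details vary between systems.

```python
import torch
from sklearn.cluster import KMeans
from transformers import WavLMModel

model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

def extract_features(wav: torch.Tensor) -> torch.Tensor:
    """wav: (1, num_samples) float waveform at 16 kHz -> (frames, dim) features."""
    with torch.no_grad():
        return model(wav).last_hidden_state.squeeze(0)

# Fit a unit vocabulary on features pooled from audio (random stand-ins here).
features = torch.cat([extract_features(torch.randn(1, 16000)) for _ in range(8)])
kmeans = KMeans(n_clusters=100, n_init=10).fit(features.numpy())

# Discrete units for a new utterance: one cluster index per frame.
units = kmeans.predict(extract_features(torch.randn(1, 16000)).numpy())
print(units[:20])
```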
https://arxiv.org/abs/2505.16691
The integration of artificial intelligence in sports analytics has transformed soccer video understanding, enabling real-time, automated insights into complex game dynamics. Traditional approaches rely on isolated data streams, limiting their effectiveness in capturing the full context of a match. To address this, we introduce SoccerChat, a multimodal conversational AI framework that integrates visual and textual data for enhanced soccer video comprehension. Leveraging the extensive SoccerNet dataset, enriched with jersey color annotations and automatic speech recognition (ASR) transcripts, SoccerChat is fine-tuned on a structured video instruction dataset to facilitate accurate game understanding, event classification, and referee decision making. We benchmark SoccerChat on action classification and referee decision-making tasks, demonstrating its performance in general soccer event comprehension while maintaining competitive accuracy in referee decision making. Our findings highlight the importance of multimodal integration in advancing soccer analytics, paving the way for more interactive and explainable AI-driven sports analysis. this https URL
https://arxiv.org/abs/2505.16630
This paper addresses the problem of single-channel speech separation, where the number of speakers is unknown, and each speaker may speak multiple utterances. We propose a speech separation model that simultaneously performs separation, dynamically estimates the number of speakers, and detects individual speaker activities by integrating an attractor module. The proposed system outperforms existing methods by introducing an attractor-based architecture that effectively combines local and global temporal modeling for multi-utterance scenarios. To evaluate the method in reverberant and noisy conditions, a multi-speaker multi-utterance dataset was synthesized by combining Librispeech speech signals with WHAM! noise signals. The results demonstrate that the proposed system accurately estimates the number of sources. The system effectively detects source activities and separates the corresponding utterances into correct outputs in both known and unknown source count scenarios.
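A toy sketch of how an attractor module can yield a dynamic speaker-count estimate: each attractor carries an existence score, and attractors above a threshold are treated as active sources. The threshold and tensor shapes are illustrative assumptions.

```python
import torch

def estimate_num_speakers(existence_logits: torch.Tensor, threshold: float = 0.5) -> int:
    """existence_logits: (max_speakers,) raw scores from the attractor module."""
    probs = torch.sigmoid(existence_logits)
    return int((probs > threshold).sum().item())

logits = torch.tensor([4.2, 3.1, 0.7, -2.5, -3.9])   # stand-in attractor scores
print(estimate_num_speakers(logits))                 # -> 3 active speakers
```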
https://arxiv.org/abs/2505.16607
As the capabilities of large-scale pre-trained models evolve, understanding the determinants of their outputs becomes more important. Feature attribution aims to reveal which parts of the input elements contribute the most to model outputs. In speech processing, the unique characteristics of the input signal make the application of feature attribution methods challenging. We study how factors such as input type and aggregation and perturbation timespan impact the reliability of standard feature attribution methods, and how these factors interact with characteristics of each classification task. We find that standard approaches to feature attribution are generally unreliable when applied to the speech domain, with the exception of word-aligned perturbation methods when applied to word-based classification tasks.
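A sketch of the word-aligned perturbation idea referenced above: silence each word-aligned span in turn and record the drop in the classifier's probability for its predicted class. The classifier callable and the alignment format are assumptions for illustration.

```python
from typing import Callable, List, Tuple

import numpy as np


def word_aligned_attribution(
    wav: np.ndarray,
    sr: int,
    word_spans: List[Tuple[float, float]],         # (start_s, end_s) per word, from an aligner
    predict_proba: Callable[[np.ndarray], float],  # probability of the predicted class
) -> List[float]:
    base = predict_proba(wav)
    scores = []
    for start_s, end_s in word_spans:
        perturbed = wav.copy()
        perturbed[int(start_s * sr):int(end_s * sr)] = 0.0   # silence this word
        scores.append(base - predict_proba(perturbed))       # importance = probability drop
    return scores
```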
https://arxiv.org/abs/2505.16406
In practical application of speech codecs, a multitude of factors such as the quality of the radio connection, limiting hardware or required user experience necessitate trade-offs between achievable perceptual quality, resulting bitrate and computational complexity. Most conventional and neural speech codecs operate on wideband (WB) speech signals to achieve this compromise. To further enhance the perceptual quality of coded speech, bandwidth extension (BWE) of the transmitted speech is an attractive and popular technique in conventional speech coding. In contrast, neural speech codecs are typically trained end-to-end to a specific set of requirements and are often not easily adaptable. In particular, they are typically trained to operate at a single fixed sampling rate. With the Universal Bandwidth Extension Generative Adversarial Network (UBGAN), we propose a modular and lightweight GAN-based solution that increases the operational flexibility of a wide range of conventional and neural codecs. Our model operates in the subband domain and extends the bandwidth of WB signals from 8 kHz to 16 kHz, resulting in super-wideband (SWB) signals. We further introduce two variants, guided-UBGAN and blind-UBGAN, where the guided version transmits a quantized learned representation as side information at a very low bitrate in addition to the codec bitrate, while the blind version operates without such side information. Our subjective assessments demonstrate the advantage of UBGAN applied to WB codecs and highlight the generalization capacity of our proposed method across multiple codecs and bitrates.
https://arxiv.org/abs/2505.16404
We introduce X-ARES (eXtensive Audio Representation and Evaluation Suite), a novel open-source benchmark designed to systematically assess audio encoder performance across diverse domains. By encompassing tasks spanning speech, environmental sounds, and music, X-ARES provides two approaches for evaluating audio representations: linear fine-tuning and unparameterized evaluation. The framework includes 22 distinct tasks that cover essential aspects of audio processing, from speech recognition and emotion detection to sound event classification and music genre identification. Our extensive evaluation of state-of-the-art audio encoders reveals significant performance variations across different tasks and domains, highlighting the complexity of general audio representation learning.
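A generic linear-probe sketch in the spirit of the linear fine-tuning track: embeddings from a frozen audio encoder are fed to a linear classifier. The random embeddings below are placeholders; X-ARES's own tasks, splits, and protocol would be used in practice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Stand-ins for frozen-encoder embeddings of train/test clips (dim 768, 10 classes).
X_train, y_train = rng.normal(size=(500, 768)), rng.integers(0, 10, 500)
X_test, y_test = rng.normal(size=(200, 768)), rng.integers(0, 10, 200)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear-probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```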
https://arxiv.org/abs/2505.16369
Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.
https://arxiv.org/abs/2505.16351
Current movie dubbing technology can produce the desired speech using a reference voice and input video, maintaining perfect synchronization with the visuals while effectively conveying the intended emotions. However, crucial aspects of movie dubbing, including adaptation to various dubbing styles, effective handling of dialogue, narration, and monologues, as well as consideration of subtle details such as speaker age and gender, remain insufficiently explored. To tackle these challenges, we introduce a multi-modal generative framework. First, it utilizes a multi-modal large vision-language model (VLM) to analyze visual inputs, enabling the recognition of dubbing types and fine-grained attributes. Second, it produces high-quality dubbing using large speech generation models, guided by multi-modal inputs. Additionally, a movie dubbing dataset with annotations for dubbing types and subtle details is constructed to enhance movie understanding and improve dubbing quality for the proposed multi-modal framework. Experimental results across multiple benchmark datasets show superior performance compared to state-of-the-art (SOTA) methods. In detail, the LSE-D, SPK-SIM, EMO-SIM, and MCD exhibit improvements of up to 1.09%, 8.80%, 19.08%, and 18.74%, respectively.
https://arxiv.org/abs/2505.16279
The achievements of Large Language Models in Natural Language Processing, especially for high-resource languages, call for a better understanding of their characteristics from a cognitive perspective. Researchers have attempted to evaluate artificial models by testing their ability to predict behavioral (e.g., eye-tracking fixations) and physiological (e.g., brain responses) variables during language processing (e.g., reading/listening). In this paper, we propose using spontaneous speech corpora to derive production variables (speech reductions, prosodic prominences) and applying them in a similar fashion. More precisely, we extract these two variables from the corpora. We then test models trained with a standard procedure on different pretraining datasets (written, spoken, and mixed genres) for their ability to predict these two variables. Our results show that, after some fine-tuning, the models can predict these production variables well above baselines. We also observe that spoken genre training data provides more accurate predictions than written genres. These results contribute to the broader effort of using high-quality speech corpora as benchmarks for LLMs.
https://arxiv.org/abs/2505.16277
Social media and online forums are increasingly becoming popular. Unfortunately, these platforms are being used for spreading hate speech. In this paper, we design black-box techniques to protect users from hate speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate speech detection models, thereby decreasing their efficiency. We also ensure a minimal change in the original meaning of the hate speech. Our best perturbation attack successfully evades hate-speech detection for 86.8% of hateful text.
https://arxiv.org/abs/2505.16263
This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER) by adapting to each listener's unique way of interpreting emotion. Conventional SER systems rely on aggregated annotations, which often overlook individual subtleties and lead to inconsistent predictions. In contrast, Meta-PerSER leverages a Model-Agnostic Meta-Learning (MAML) approach enhanced with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates, enabling rapid adaptation with only a few labeled examples. By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios, highlighting its promise for personalized emotion recognition.
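A minimal first-order MAML sketch of the listener-personalization setup: an inner loop adapts a copy of the model to one listener's few labeled examples, and the outer loop applies the resulting gradients back to the shared initialization. Meta-PerSER's additions (combined-set meta-training, derivative annealing, per-layer per-step learning rates) are omitted, and all hyperparameters and the toy data are illustrative.

```python
import copy
import torch
import torch.nn as nn

def fomaml_step(model: nn.Module, support, query, inner_lr=1e-2, outer_lr=1e-3, inner_steps=3):
    loss_fn = nn.CrossEntropyLoss()

    # Inner loop: adapt a clone to the listener's few labeled examples.
    learner = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        x, y = support
        inner_opt.zero_grad()
        loss_fn(learner(x), y).backward()
        inner_opt.step()

    # Outer loop (first-order): gradients of the adapted model's query loss
    # are applied directly to the shared initialization.
    x_q, y_q = query
    learner.zero_grad()
    loss_fn(learner(x_q), y_q).backward()
    with torch.no_grad():
        for p, p_adapted in zip(model.parameters(), learner.parameters()):
            p -= outer_lr * p_adapted.grad

# Toy usage on random "embedding -> 4 emotion classes" data.
model = nn.Linear(128, 4)
support = (torch.randn(8, 128), torch.randint(0, 4, (8,)))
query = (torch.randn(8, 128), torch.randint(0, 4, (8,)))
fomaml_step(model, support, query)
```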
https://arxiv.org/abs/2505.16220
Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children's speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their application to child speech, including conversational scenarios, remains underexplored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promise and challenges of LLMs through experiments on two children's conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs.
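A sketch of the error-correction setup: the ASR hypothesis, optionally with preceding conversational turns as context, is handed to an LLM with instructions to return a corrected transcript. The prompt wording and the `llm` callable are illustrative assumptions.

```python
from typing import Callable, List, Optional

def correct_asr(hypothesis: str, llm: Callable[[str], str],
                context: Optional[List[str]] = None) -> str:
    """Ask an LLM to correct an ASR transcript of conversational child speech."""
    prompt = "Correct the errors in this ASR transcript of a child speaking.\n"
    if context:
        prompt += "Previous turns in the conversation:\n" + "\n".join(context) + "\n"
    prompt += f"ASR transcript: {hypothesis}\nCorrected transcript:"
    return llm(prompt)
```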
https://arxiv.org/abs/2505.16212