Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform large models into smaller ones at the cost of significant performance degradation, or require prolonged training of the smaller models to reach good performance. To address these issues, we introduce an effective two-step representation-learning-based approach capable of producing several small models from a single large model while ensuring considerably better performance within a limited number of epochs. Comprehensive experiments on ASR benchmarks reveal the efficacy of our approach, achieving a three-fold training speed-up and up to 12.54% word error rate improvement.
https://arxiv.org/abs/2505.16991
Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
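As a rough illustration of the intelligibility-based filtering step, the sketch below synthesizes speech from text, transcribes it back with an ASR model, and keeps only utterances whose round-trip word error rate stays under a threshold. The `tts` and `asr` callables, the 0.2 cutoff, and the `SyntheticUtterance` container are assumptions for the sketch, not the paper's calibrated pipeline.

```python
# Minimal sketch of intelligibility-based filtering for synthetic speech,
# assuming `tts` and `asr` wrap off-the-shelf models (any TTS returning a
# waveform, any ASR returning text). Threshold is illustrative.
from dataclasses import dataclass
from typing import Callable, List

from jiwer import wer  # pip install jiwer


@dataclass
class SyntheticUtterance:
    text: str
    audio: object  # waveform in whatever format the TTS returns
    score: float   # 1 - WER, a crude intelligibility proxy


def back_translate(
    sentences: List[str],
    tts: Callable[[str], object],
    asr: Callable[[object], str],
    max_wer: float = 0.2,
) -> List[SyntheticUtterance]:
    """Convert text into synthetic speech and keep only intelligible samples."""
    kept = []
    for sentence in sentences:
        audio = tts(sentence)              # text -> synthetic speech
        hypothesis = asr(audio)            # synthetic speech -> text
        error = wer(sentence, hypothesis)  # round-trip transcription error
        if error <= max_wer:               # filter by intelligibility
            kept.append(SyntheticUtterance(sentence, audio, 1.0 - error))
    return kept
```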
https://arxiv.org/abs/2505.16972
In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extremely compressing multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.
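The sketch below illustrates two of the structural compression steps named above, vocabulary trimming and layer truncation, on a generic PyTorch encoder. The helper names and the corpus-derived `keep_ids` set are assumptions; the authors' full pipeline also involves two-step distillation and shrinking the feed-forward and embedding sizes.

```python
# Minimal sketch of vocabulary trimming and depth truncation, assuming a
# generic encoder with an `embedding` table and a ModuleList of layers.
import torch
import torch.nn as nn
from typing import Dict, Set, Tuple


def trim_vocabulary(embedding: nn.Embedding, keep_ids: Set[int]) -> Tuple[nn.Embedding, Dict[int, int]]:
    """Keep only the embedding rows used by the target language."""
    old_ids = sorted(keep_ids)
    remap = {old: new for new, old in enumerate(old_ids)}   # old id -> new id
    new_emb = nn.Embedding(len(old_ids), embedding.embedding_dim)
    with torch.no_grad():
        new_emb.weight.copy_(embedding.weight[old_ids])     # copy kept rows
    return new_emb, remap


def truncate_layers(layers: nn.ModuleList, keep: int) -> nn.ModuleList:
    """Reduce depth by keeping only the bottom `keep` transformer layers."""
    return nn.ModuleList(list(layers)[:keep])


# In practice, `keep_ids` would be the set of subword ids observed when
# tokenizing a monolingual corpus of the target low-resource language.
```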
https://arxiv.org/abs/2505.16956
We cast nested named entity recognition (NNER) as a sequence labeling task by leveraging prior work that linearizes constituency structures, effectively reducing the complexity of this structured prediction problem to straightforward token classification. By combining these constituency linearizations with pretrained encoders, our method captures nested entities while performing exactly $n$ tagging actions. Our approach achieves competitive performance compared to less efficient systems, and it can be trained using any off-the-shelf sequence labeling library.
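As a simplified illustration of reducing nested entities to one tag per token, the sketch below stacks the labels of all spans covering each token. This label-stacking scheme is only a stand-in; the paper relies on constituency-tree linearizations from prior work, which encode richer structure.

```python
# Simplified illustration: turn nested entity spans into exactly one tag per
# token so an off-the-shelf sequence labeller can be used.
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def spans_to_tags(tokens: List[str], spans: List[Span]) -> List[str]:
    """One composite tag per token: labels of all covering spans, outermost first."""
    ordered = sorted(spans, key=lambda s: s[0] - s[1])  # wider spans first
    tags = []
    for i, _ in enumerate(tokens):
        covering = [label for start, end, label in ordered if start <= i < end]
        tags.append("|".join(covering) if covering else "O")
    return tags


tokens = ["The", "University", "of", "Vigo", "campus"]
spans = [(1, 4, "ORG"), (3, 4, "LOC")]  # "University of Vigo" nests "Vigo"
print(spans_to_tags(tokens, spans))
# ['O', 'ORG', 'ORG', 'ORG|LOC', 'O']
```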
https://arxiv.org/abs/2505.16855
Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.
https://arxiv.org/abs/2505.16814
A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker embeddings, extracted from clean and noisy speech respectively, via the forward process of a diffusion model, and then reconstructs them into clean embeddings in the reverse process. At inference time, all embeddings are regenerated via the diffusion process. Our method needs neither speaker labels nor any modification to the existing speaker recognition pipeline. Experiments on evaluation sets simulating environment-mismatch scenarios show that our method can improve recognition accuracy by up to 19.6% over baseline models while retaining performance on conventional scenarios. We publish our code at this https URL.
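A minimal sketch of the training objective described above, assuming a DDPM-style linear noise schedule and a small MLP denoiser: Gaussian noise is added to both clean and noisy speaker embeddings in the forward process, and the network learns to map the noised versions back to the clean embedding. Dimensions, schedule, and architecture are placeholders, not the paper's configuration.

```python
# Minimal sketch: noise speaker embeddings with a DDPM-style forward process
# and train a small network to recover the clean embedding.
import torch
import torch.nn as nn

T = 100                                         # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal fraction


class EmbeddingDenoiser(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the normalised timestep and predict the clean embedding.
        t_feat = (t.float() / T).unsqueeze(-1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))


def training_step(model, clean_emb, noisy_emb):
    """Noise both clean and noisy embeddings; always reconstruct the clean one."""
    x0 = torch.cat([clean_emb, noisy_emb], dim=0)      # forward-process inputs
    target = torch.cat([clean_emb, clean_emb], dim=0)  # always aim at clean
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    a = alpha_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise       # q(x_t | x_0)
    return nn.functional.mse_loss(model(x_t, t), target)
```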
https://arxiv.org/abs/2505.16798
Large Language Models have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine mathematical reasoning or superficial pattern recognition. Common evaluation metrics, such as final answer accuracy, fail to disentangle the underlying competencies involved, offering limited diagnostic value. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework. SMART decomposes mathematical problem solving into four distinct dimensions: understanding, reasoning, arithmetic, and reflection & refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. Crucially, SMART integrates an automated self-generating and self-validating mechanism to produce and verify benchmark data, ensuring both scalability and reliability. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings demonstrate the inadequacy of final answer accuracy as a sole metric and motivate a new holistic metric to better capture true problem-solving capabilities. Code and benchmarks will be released upon acceptance.
https://arxiv.org/abs/2505.16646
While the safety risks of image-based large language models have been extensively studied, their video-based counterparts (Video LLMs) remain critically under-examined. To systematically study this problem, we introduce VideoSafetyBench (VSB-77k), the first large-scale, culturally diverse benchmark for Video LLM safety, which comprises 77,646 video-query pairs and spans 19 principal risk categories across 10 language communities. We reveal that integrating the video modality degrades safety performance by an average of 42.3%, exposing systemic risks in multimodal attack exploitation. To address this vulnerability, we propose VideoSafety-R1, a dual-stage framework achieving unprecedented safety gains through two innovations: (1) Alarm Token-Guided Safety Fine-Tuning (AT-SFT) injects learnable alarm tokens into visual and textual sequences, enabling explicit harm perception across modalities via multitask objectives. (2) Safety-Guided GRPO then enhances defensive reasoning through dynamic policy optimization with rule-based rewards derived from dual-modality verification. These components synergize to shift safety alignment from passive harm recognition to active reasoning. The resulting framework achieves a 65.1% improvement on VSB-Eval-HH, and improves by 59.1%, 44.3%, and 15.0% on the image safety datasets MMBench, VLGuard, and FigStep, respectively. Our code is available in the supplementary materials. Warning: this paper contains examples of harmful language and videos, and reader discretion is recommended.
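A minimal sketch of the alarm-token idea, assuming one learnable token per modality prepended to the visual and textual sequences of a generic MLLM; the AT-SFT multitask objectives and the Safety-Guided GRPO stage are not shown.

```python
# Minimal sketch: learnable alarm tokens inserted into both modalities so the
# model has an explicit slot for harm-related signal.
import torch
import torch.nn as nn


class AlarmTokenInjector(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.visual_alarm = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.text_alarm = nn.Parameter(torch.randn(1, 1, dim) * 0.02)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        """Prepend one alarm token to each modality's (B, L, D) sequence."""
        b = visual_tokens.size(0)
        v = torch.cat([self.visual_alarm.expand(b, -1, -1), visual_tokens], dim=1)
        t = torch.cat([self.text_alarm.expand(b, -1, -1), text_tokens], dim=1)
        return v, t
```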
https://arxiv.org/abs/2505.16643
The integration of artificial intelligence in sports analytics has transformed soccer video understanding, enabling real-time, automated insights into complex game dynamics. Traditional approaches rely on isolated data streams, limiting their effectiveness in capturing the full context of a match. To address this, we introduce SoccerChat, a multimodal conversational AI framework that integrates visual and textual data for enhanced soccer video comprehension. Leveraging the extensive SoccerNet dataset, enriched with jersey color annotations and automatic speech recognition (ASR) transcripts, SoccerChat is fine-tuned on a structured video instruction dataset to facilitate accurate game understanding, event classification, and referee decision making. We benchmark SoccerChat on action classification and referee decision-making tasks, demonstrating its performance in general soccer event comprehension while maintaining competitive accuracy in referee decision making. Our findings highlight the importance of multimodal integration in advancing soccer analytics, paving the way for more interactive and explainable AI-driven sports analysis. Code is available at this https URL.
https://arxiv.org/abs/2505.16630
According to the EPA, only 25% of waste is recycled, and just 60% of U.S. municipalities offer curbside recycling. Plastics fare worse, with a recycling rate of only 8%; an additional 16% is incinerated, while the remaining 76% ends up in landfills. The low plastic recycling rate stems from contamination, poor economic incentives, and technical difficulties, making efficient recycling a challenge. To improve recovery, automated sorting plays a critical role. Companies like AMP Robotics and Greyparrot utilize optical systems for sorting, while Materials Recovery Facilities (MRFs) employ Near-Infrared (NIR) sensors to detect plastic types. Modern optical sorting uses advances in computer vision such as object recognition and instance segmentation, powered by machine learning. Two-stage detectors like Mask R-CNN use region proposals and classification with deep backbones like ResNet. Single-stage detectors like YOLO handle detection in one pass, trading some accuracy for speed. While such methods excel under ideal conditions with a large volume of labeled training data, challenges arise in realistic scenarios, emphasizing the need to further examine the efficacy of optical detection for automated sorting. In this study, we compiled novel datasets totaling 20,000+ images from varied sources. Using both public and custom machine learning pipelines, we assessed the capabilities and limitations of optical recognition for sorting. Grad-CAM, saliency maps, and confusion matrices were employed to interpret model behavior. We perform this analysis on our custom-trained models from the compiled datasets. To conclude, we find that optical recognition methods have limited success in accurately sorting real-world plastics at MRFs, primarily because they rely on physical properties such as color and shape.
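For the interpretability part, the sketch below shows a generic Grad-CAM pass over a torchvision classifier using forward/backward hooks; the backbone, target layer, and random input are illustrative stand-ins, not the custom-trained sorting models from the study.

```python
# Generic Grad-CAM sketch: weight the target layer's activations by the
# spatially averaged gradients of the class score, then upsample.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18


def grad_cam(model, image, target_layer, class_idx=None):
    """Return an (H, W) heatmap of class evidence for one (1, 3, H, W) image."""
    activations, gradients = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove()
    h2.remove()

    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)   # GAP of gradients
    cam = F.relu((weights * activations["a"]).sum(dim=1))     # weighted activation map
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:], mode="bilinear")
    return (cam / cam.max().clamp(min=1e-8)).squeeze()


model = resnet18(weights=None).eval()
heatmap = grad_cam(model, torch.randn(1, 3, 224, 224), model.layer4[-1])
```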
https://arxiv.org/abs/2505.16513
TAT-VPR is a ternary-quantized transformer that brings dynamic accuracy-efficiency trade-offs to visual SLAM loop-closure. By fusing ternary weights with a learned activation-sparsity gate, the model can cut computation by up to 40% at run time without degrading performance (Recall@1). The proposed two-stage distillation pipeline preserves descriptor quality, letting it run on micro-UAV and embedded SLAM stacks while matching state-of-the-art localization accuracy.
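A minimal sketch of the two mechanisms named above, ternary weight quantization and a magnitude-based activation-sparsity gate; the threshold ratio and keep ratio are illustrative, not TAT-VPR's learned settings.

```python
# Minimal sketch: quantise weights to {-1, 0, +1} * scale and zero out
# low-magnitude activations at run time.
import torch


def ternarize(w: torch.Tensor, delta_ratio: float = 0.05):
    """Quantise weights to {-1, 0, +1} with a per-tensor scale."""
    delta = delta_ratio * w.abs().max()
    q = torch.zeros_like(w)
    q[w > delta] = 1.0
    q[w < -delta] = -1.0
    scale = w[q != 0].abs().mean() if (q != 0).any() else w.new_tensor(0.0)
    return q, scale


def sparsity_gate(x: torch.Tensor, keep_ratio: float = 0.6) -> torch.Tensor:
    """Keep only the largest-magnitude activations per sample; zero the rest."""
    k = max(1, int(keep_ratio * x.size(-1)))
    thresh = x.abs().topk(k, dim=-1).values[..., -1:].expand_as(x)
    return torch.where(x.abs() >= thresh, x, torch.zeros_like(x))
```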
https://arxiv.org/abs/2505.16447
Pose-invariant face recognition has become a challenging problem for modern AI-based face recognition systems. It aims at matching a profile face captured in the wild with a frontal face registered in a database. Existing methods perform face frontalization via either generative models or learning a pose-robust feature representation. In this paper, a new method is presented to perform face frontalization and recognition within the feature space. First, a novel feature space pose frontalization module (FSPFM) is proposed to transform profile images with arbitrary angles into frontal counterparts. Second, a new training paradigm is proposed to maximize the potential of FSPFM and boost its performance. The latter consists of a pre-training and an attention-guided fine-tuning stage. Moreover, extensive experiments have been conducted on five popular face recognition benchmarks. Results show that our method not only outperforms the state of the art on the pose-invariant face recognition task but also maintains superior performance in other standard scenarios.
https://arxiv.org/abs/2505.16412
When emotions are repressed, an individual's true feelings may be revealed through micro-expressions. Consequently, micro-expressions are regarded as a genuine source of insight into an individual's authentic emotions. However, the transient and highly localised nature of micro-expressions poses a significant challenge to their accurate recognition, with the accuracy rate of micro-expression recognition being as low as 50%, even for professionals. In order to address these challenges, it is necessary to explore the field of dynamic micro-expression recognition (DMER) using multimodal fusion techniques, with special attention to the diverse fusion of temporal and spatial modal features. In this paper, we propose a novel Temporal and Spatial feature Fusion framework for DMER (TSFmicro). This framework integrates a Retention Network (RetNet) and a transformer-based DMER network, with the objective of efficient micro-expression recognition through the capture and fusion of temporal and spatial relations. Meanwhile, we propose a novel parallel time-space fusion method from the perspective of modal fusion, which fuses spatio-temporal information in high-dimensional feature space, resulting in complementary "where-how" relationships at the semantic level and providing richer semantic information for the model. The experimental results demonstrate the superior performance of the TSFmicro method in comparison to other contemporary state-of-the-art methods. This is evidenced by its effectiveness on three well-recognised micro-expression datasets.
https://arxiv.org/abs/2505.16372
We introduce X-ARES (eXtensive Audio Representation and Evaluation Suite), a novel open-source benchmark designed to systematically assess audio encoder performance across diverse domains. By encompassing tasks spanning speech, environmental sounds, and music, X-ARES provides two approaches for evaluating audio representations: linear fine-tuning and unparameterized evaluation. The framework includes 22 distinct tasks that cover essential aspects of audio processing, from speech recognition and emotion detection to sound event classification and music genre identification. Our extensive evaluation of state-of-the-art audio encoders reveals significant performance variations across different tasks and domains, highlighting the complexity of general audio representation learning.
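A minimal sketch of the two evaluation modes, assuming frozen, pre-computed audio embeddings: a linear probe trained on top of the encoder outputs and an unparameterized k-NN evaluation. The scikit-learn classifiers and random toy data stand in for X-ARES's actual task pipelines.

```python
# Minimal sketch: linear probe vs. unparameterized k-NN on frozen embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def linear_probe_accuracy(train_x, train_y, test_x, test_y) -> float:
    clf = LogisticRegression(max_iter=1000).fit(train_x, train_y)
    return float(clf.score(test_x, test_y))


def knn_accuracy(train_x, train_y, test_x, test_y, k: int = 5) -> float:
    clf = KNeighborsClassifier(n_neighbors=k).fit(train_x, train_y)
    return float(clf.score(test_x, test_y))


# Toy usage with random "embeddings"; real inputs would come from an audio encoder.
rng = np.random.default_rng(0)
x_tr, y_tr = rng.normal(size=(200, 128)), rng.integers(0, 4, 200)
x_te, y_te = rng.normal(size=(50, 128)), rng.integers(0, 4, 50)
print(linear_probe_accuracy(x_tr, y_tr, x_te, y_te), knn_accuracy(x_tr, y_tr, x_te, y_te))
```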
https://arxiv.org/abs/2505.16369
Current movie dubbing technology can produce the desired speech using a reference voice and input video, maintaining perfect synchronization with the visuals while effectively conveying the intended emotions. However, crucial aspects of movie dubbing, including adaptation to various dubbing styles, effective handling of dialogue, narration, and monologues, as well as consideration of subtle details such as speaker age and gender, remain insufficiently explored. To tackle these challenges, we introduce a multi-modal generative framework. First, it utilizes a multi-modal large vision-language model (VLM) to analyze visual inputs, enabling the recognition of dubbing types and fine-grained attributes. Second, it produces high-quality dubbing using large speech generation models, guided by multi-modal inputs. Additionally, a movie dubbing dataset with annotations for dubbing types and subtle details is constructed to enhance movie understanding and improve dubbing quality for the proposed multi-modal framework. Experimental results across multiple benchmark datasets show superior performance compared to state-of-the-art (SOTA) methods. Specifically, LSE-D, SPK-SIM, EMO-SIM, and MCD exhibit improvements of up to 1.09%, 8.80%, 19.08%, and 18.74%, respectively.
https://arxiv.org/abs/2505.16279
This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER) by adapting to each listener's unique way of interpreting emotion. Conventional SER systems rely on aggregated annotations, which often overlook individual subtleties and lead to inconsistent predictions. In contrast, Meta-PerSER leverages a Model-Agnostic Meta-Learning (MAML) approach enhanced with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates, enabling rapid adaptation with only a few labeled examples. By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios, highlighting its promise for personalized emotion recognition.
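A minimal sketch of an inner adaptation loop with learnable per-layer, per-step learning rates, in the spirit of the MAML-style setup described above; the toy classifier, cross-entropy loss, and two-step schedule are placeholders, not Meta-PerSER's configuration.

```python
# Minimal sketch: inner-loop adaptation with a learnable learning rate per
# parameter tensor and per inner step, using functional parameter updates.
import torch
import torch.nn as nn
from torch.func import functional_call

model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 4))
inner_steps = 2
inner_lrs = nn.ParameterDict({
    name.replace(".", "_"): nn.Parameter(torch.full((inner_steps,), 1e-2))
    for name, _ in model.named_parameters()
})
loss_fn = nn.CrossEntropyLoss()


def adapt(support_x, support_y):
    """Return listener-adapted parameters after a few gradient steps."""
    params = {name: p for name, p in model.named_parameters()}
    for step in range(inner_steps):
        logits = functional_call(model, params, (support_x,))
        grads = torch.autograd.grad(loss_fn(logits, support_y),
                                    list(params.values()), create_graph=True)
        params = {
            name: p - inner_lrs[name.replace(".", "_")][step] * g
            for (name, p), g in zip(params.items(), grads)
        }
    return params  # use with functional_call on the query set for the outer loss
```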
https://arxiv.org/abs/2505.16220
Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children's speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their application to child speech, including conversational scenarios, remains underexplored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promise and challenges of LLMs through experiments on two children's conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs.
https://arxiv.org/abs/2505.16212
Recent studies have highlighted the potential of discrete tokens derived from self-supervised learning (SSL) models for various speech-related tasks. These tokens serve not only as substitutes for text in language modeling but also as intermediate representations for tasks such as automatic speech recognition (ASR). However, discrete tokens are typically obtained via k-means clustering of SSL features independently of downstream tasks, making them suboptimal for specific applications. This paper proposes the use of differentiable k-means, enabling the joint optimization of tokenization and downstream tasks. This approach enables the fine-tuning of the SSL parameters and learning weights for outputs from multiple SSL layers. Experiments were conducted with ASR as a downstream task. ASR accuracy improved owing to the optimized tokens. The acquired tokens also exhibited greater purity of phonetic information and proved useful even in speech resynthesis.
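A minimal sketch of a differentiable k-means tokenizer: assignments are softmax-relaxed over negative distances so a downstream loss can back-propagate into both the centroids and the SSL features, while hard token ids remain available via argmax. The cluster count, feature dimension, and temperature are assumptions, not the paper's settings.

```python
# Minimal sketch: soft cluster assignments make the tokenization step
# differentiable end-to-end with the downstream task.
import torch
import torch.nn as nn


class DifferentiableKMeans(nn.Module):
    def __init__(self, num_clusters: int = 512, dim: int = 768, temperature: float = 0.1):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.temperature = temperature

    def forward(self, features: torch.Tensor):
        """features: (B, T, D) SSL frames -> quantized features, soft assignments, token ids."""
        batch_centroids = self.centroids.unsqueeze(0).expand(features.size(0), -1, -1)
        dist = torch.cdist(features, batch_centroids)                  # (B, T, K)
        soft_assign = torch.softmax(-dist / self.temperature, dim=-1)  # relaxed assignment
        token_ids = soft_assign.argmax(dim=-1)                         # discrete tokens
        quantized = soft_assign @ self.centroids                       # (B, T, D), differentiable
        return quantized, soft_assign, token_ids
```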
https://arxiv.org/abs/2505.16207
Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R^3 (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g., crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R^3 sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.
https://arxiv.org/abs/2505.16192
Recently, a method was proposed for synthesizing foreign-accented speech using only native speech data, based on discrete tokens obtained from self-supervised learning (SSL) models. Considering the limited availability of accented speech data, this method is expected to make it much easier to simulate foreign accents. By using the synthesized accented speech as listening material for humans or as training data for automatic speech recognition (ASR), both can acquire higher robustness against foreign accents. However, the previous method has a fatal flaw: it cannot reproduce duration-related accents. Durational accents are commonly seen when L2 speakers, whose native language has syllable-timed or mora-timed rhythm, speak stress-timed languages such as English. In this paper, we integrate duration modification into the previous method to simulate foreign accents more accurately. Experiments show that the proposed method successfully replicates durational accents seen in real L2 speech.
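A minimal sketch of duration modification on a discrete-token sequence: runs of repeated SSL tokens are treated as segments whose lengths are rescaled before resynthesis. The uniform ratio is illustrative; the paper derives segment-wise durations that mimic L2 rhythm.

```python
# Minimal sketch: stretch or compress each token run's length by a ratio.
from itertools import groupby
from typing import List


def modify_durations(tokens: List[int], ratio: float) -> List[int]:
    """Stretch (ratio > 1) or compress (ratio < 1) each run of repeated tokens."""
    out: List[int] = []
    for token, run in groupby(tokens):
        length = len(list(run))
        new_length = max(1, int(length * ratio + 0.5))  # keep every segment audible
        out.extend([token] * new_length)
    return out


print(modify_durations([7, 7, 7, 3, 3, 9], ratio=1.5))
# [7, 7, 7, 7, 7, 3, 3, 3, 9, 9]
```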
https://arxiv.org/abs/2505.16191