Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition

2025-01-08 09:45:14
Huimeng Wang, Xurong Xie, Mengzhe Geng, Shujie Hu, Haoning Xu, Youjun Chen, Zhaoqing Li, Jiajun Deng, Xunying Liu

Abstract

Extracted discrete tokens provide efficient and domain-adaptable speech features. Their application to disordered speech, which exhibits imprecise articulation and a large mismatch against normal voice, remains unexplored. To improve their phonetic discrimination, which is weakened during unsupervised K-means or vector quantization of continuous features, this paper proposes novel phone-purity guided (PPG) discrete tokens for dysarthric speech recognition. Phonetic label supervision is used to regularize the maximum likelihood and reconstruction error costs used in standard K-means and VAE-VQ based discrete token extraction. Experiments conducted on the UASpeech corpus suggest that the proposed PPG discrete token features extracted from HuBERT consistently outperform hybrid TDNN and End-to-End (E2E) Conformer systems using non-PPG based K-means or VAE-VQ tokens across varying codebook sizes, with statistically significant word error rate (WER) reductions of up to 0.99% and 1.77% absolute (3.21% and 4.82% relative) respectively on the UASpeech test set of 16 dysarthric speakers. The lowest WER of 23.25% was obtained by combining systems using different token features. Consistent improvements on the phone purity metric were also achieved. t-SNE visualization further shows that sharper decision boundaries are produced between K-means/VAE-VQ clusters after introducing phone-purity guidance.
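The abstract reports gains on a phone purity metric without defining it. The sketch below assumes the definition commonly used for discrete speech units: the fraction of frames whose phone label equals the most frequent phone label within their assigned cluster. The function name `phone_purity` and the toy frame labels are illustrative assumptions, not taken from the paper, and the paper's exact formulation may differ.

```python
from collections import Counter, defaultdict


def phone_purity(cluster_ids, phone_labels):
    """Fraction of frames whose phone matches the majority phone of their cluster.

    Assumed definition (not confirmed by the abstract): for each cluster, count
    the frames carrying its most frequent phone label, sum over clusters, and
    divide by the total number of frames.
    """
    if len(cluster_ids) != len(phone_labels):
        raise ValueError("cluster_ids and phone_labels must have the same length")
    if not cluster_ids:
        raise ValueError("need at least one frame")

    # Per-cluster histogram of phone labels.
    counts = defaultdict(Counter)
    for cluster, phone in zip(cluster_ids, phone_labels):
        counts[cluster][phone] += 1

    # Frames that agree with the majority phone of their own cluster.
    majority_total = sum(max(hist.values()) for hist in counts.values())
    return majority_total / len(cluster_ids)


# Toy example: 8 frames, 3 clusters (hypothetical data).
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
phones = ["aa", "aa", "iy", "iy", "iy", "s", "s", "z"]
print(phone_purity(clusters, phones))  # 0.75
```

In the toy example each cluster contributes two majority-phone frames, so purity is 6/8. Under this definition, a codebook whose clusters each map to a single phone would score 1.0, which is the direction the phone-purity guidance described above is meant to push the tokens.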

Abstract (translated)

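Discrete token extraction provides efficient and domain-adaptable speech features. Although these features have not yet been fully studied for disordered speech with imprecise articulation and a large mismatch against normal voice, this paper proposes a new phone-purity guided (PPG) discrete token method for dysarthric speech recognition. The method uses phonetic label supervision to regularize the maximum likelihood and reconstruction error costs used in the standard K-means and VAE-VQ (variational autoencoder with vector quantization) baseline models. Experiments on the UASpeech corpus show that, compared with hybrid TDNN systems and End-to-End (E2E) Conformer systems using non-PPG K-means or VAE-VQ tokens, the PPG discrete token features extracted from the HuBERT model perform better across different codebook sizes, with statistically significant word error rate (WER) reductions. Specifically, on the UASpeech test set of 16 dysarthric speakers, the PPG tokens yield absolute improvements of up to 0.99% and 1.77% over the hybrid and E2E baselines respectively, corresponding to relative improvements of 3.21% and 4.82%. The lowest WER of 23.25% was obtained by combining systems that use different token features. Consistent improvements on the phone purity metric were also achieved. t-SNE (t-distributed stochastic neighbor embedding) visualization further shows that the decision boundaries between K-means/VAE-VQ clusters become sharper and better separated after phone-purity guidance is introduced.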

URL

https://arxiv.org/abs/2501.04379

PDF

https://arxiv.org/pdf/2501.04379.pdf

