Abstract
Discrete tokens extracted from continuous speech features are efficient and domain-adaptable. Their application to disordered speech, which exhibits imprecise articulation and a large mismatch against normal speech, remains unexplored. Unsupervised K-means clustering or vector quantization of continuous features weakens the phonetic discrimination of such tokens. To address this, this paper proposes novel phone-purity guided (PPG) discrete tokens for dysarthric speech recognition. Phonetic label supervision is used to regularize the maximum likelihood and reconstruction error costs used in standard K-means and VAE-VQ based discrete token extraction. Experiments conducted on the UASpeech corpus suggest that hybrid TDNN and End-to-End (E2E) Conformer systems using the proposed PPG discrete token features extracted from HuBERT consistently outperform comparable systems using non-PPG K-means or VAE-VQ tokens across varying codebook sizes, with statistically significant word error rate (WER) reductions of up to 0.99% and 1.77% absolute (3.21% and 4.82% relative), respectively, on the UASpeech test set of 16 dysarthric speakers. The lowest WER of 23.25% was obtained by combining systems using different token features. Consistent improvements on the phone purity metric were also achieved. t-SNE visualization further shows that introducing phone-purity guidance produces sharper decision boundaries between K-means/VAE-VQ clusters.
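The abstract does not define the phone purity metric, so the following is only an illustrative sketch. It assumes the common cluster-purity definition: each frame has a discrete token ID and a phone label, and purity is the fraction of frames whose phone label matches the majority phone of their token cluster. The function name `phone_purity` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def phone_purity(token_ids, phone_labels):
    """Cluster purity of discrete tokens with respect to phone labels.

    Assumed definition (not confirmed by the abstract): sum, over token
    clusters, of the count of the most frequent phone in that cluster,
    divided by the total number of frames.
    """
    if len(token_ids) != len(phone_labels):
        raise ValueError("token_ids and phone_labels must have equal length")
    if not token_ids:
        raise ValueError("at least one frame is required")

    # Count how often each phone occurs within each token cluster.
    phones_per_token = defaultdict(Counter)
    for token, phone in zip(token_ids, phone_labels):
        phones_per_token[token][phone] += 1

    # Frames that agree with the majority phone of their cluster.
    majority_frames = sum(max(c.values()) for c in phones_per_token.values())
    return majority_frames / len(token_ids)


# Toy frame-level data (hypothetical): 8 frames, 3 token clusters.
tokens = [0, 0, 0, 1, 1, 2, 2, 2]
phones = ["aa", "aa", "iy", "iy", "iy", "s", "s", "aa"]
print(phone_purity(tokens, phones))  # 6 of 8 frames match -> 0.75
```

Under this assumed definition, a codebook whose tokens each map to a single phone scores 1.0, and scores fall as phones are mixed within clusters.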
Abstract (translated)
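Extracting discrete tokens yields efficient, domain-adaptable speech features. Their use on disordered speech, which is marked by imprecise articulation and a large mismatch with normal speech, has not yet been well studied. This paper proposes a new phone-purity guided (PPG) discrete token method for dysarthric speech recognition. The method uses phonetic label supervision to regularize the maximum likelihood and reconstruction error costs used in standard K-means and VAE-VQ (variational autoencoder with vector quantization) token extraction. Experiments on the UASpeech corpus show that PPG discrete token features extracted from HuBERT outperform hybrid TDNN and End-to-End (E2E) Conformer systems that use non-PPG K-means or VAE-VQ tokens across different codebook sizes, with statistically significant word error rate (WER) reductions. On the UASpeech test set of 16 dysarthric speakers, PPG tokens give absolute improvements of up to 0.99% and 1.77% over the hybrid and E2E baselines respectively, corresponding to relative improvements of 3.21% and 4.82%. The lowest WER, 23.25%, was obtained by combining systems that use different token features. Consistent improvements were also achieved on the phone purity metric. t-SNE (t-distributed stochastic neighbor embedding) visualization further shows that the decision boundaries between K-means/VAE-VQ clusters become sharper after phone-purity guidance is introduced.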
URL
https://arxiv.org/abs/2501.04379