Abstract
Large-scale end-to-end models such as Whisper have shown strong performance on diverse speech tasks, but their internal behavior on pathological speech remains poorly understood. Understanding how dysarthric speech is represented across layers is critical for building reliable and explainable clinical assessment tools. This study probes the encoder of the Whisper-Medium model for dysarthric speech detection and assessment (i.e., severity classification). We evaluate layer-wise embeddings with a linear classifier under both single-task and multi-task settings, and complement these results with Silhouette scores and mutual information to provide additional perspectives on layer informativeness. To examine adaptability, we repeat the analysis after fine-tuning Whisper on a dysarthric speech recognition task. Across metrics, the mid-level encoder layers (13-15) emerge as the most informative, while fine-tuning induces only modest changes. The findings improve the interpretability of Whisper's embeddings and highlight the potential of probing analyses to guide the use of large-scale pretrained models for pathological speech.
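The layer-wise probing setup described above can be sketched in a few lines: for each encoder layer, fit a linear classifier on that layer's utterance embeddings and also compute the Silhouette score of the class labels in embedding space. The snippet below is a minimal illustration using synthetic embeddings in place of real Whisper-Medium encoder outputs; the array shapes, the toy class separation, and the `accs`/`sils` names are assumptions for demonstration, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for per-layer utterance embeddings (hypothetical shapes):
# n_layers lists of (n_utterances, dim) arrays, with class separation that
# grows with depth purely to make the toy example behave plausibly.
n_layers, n_utts, dim = 4, 200, 32
y = rng.integers(0, 2, size=n_utts)  # toy labels: 0 = control, 1 = dysarthric
layers = []
for l in range(n_layers):
    X = rng.normal(size=(n_utts, dim))
    X[y == 1, 0] += 1.5 * (l + 1)  # assumed: deeper layers separate classes more
    layers.append(X)

accs, sils = [], []
for X in layers:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    # Linear probe: logistic regression on frozen embeddings.
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accs.append(probe.score(X_te, y_te))
    # Label-agnostic view of the same layer: cluster separation of the classes.
    sils.append(silhouette_score(X, y))

for l, (a, s) in enumerate(zip(accs, sils)):
    print(f"layer {l}: probe acc={a:.2f}, silhouette={s:.3f}")
```

In this sketch, higher probe accuracy at a given layer indicates that class information is linearly decodable there, while the Silhouette score offers a classifier-free check on how well the two groups cluster in that layer's embedding space.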
URL
https://arxiv.org/abs/2510.04219