Paper Reading AI Learner

Probing Whisper for Dysarthric Speech in Detection and Assessment

2025-10-05 14:21:39
Zhengjun Yue, Devendra Kayande, Zoran Cvetkovic, Erfan Loweimi

Abstract

Large-scale end-to-end models such as Whisper have shown strong performance on diverse speech tasks, but their internal behavior on pathological speech remains poorly understood. Understanding how dysarthric speech is represented across layers is critical for building reliable and explainable clinical assessment tools. This study probes the encoder of the Whisper-Medium model on dysarthric speech for detection and assessment (i.e., severity classification). We evaluate layer-wise embeddings with a linear classifier under both single-task and multi-task settings, and complement these results with Silhouette scores and mutual information to provide additional perspectives on layer informativeness. To examine adaptability, we repeat the analysis after fine-tuning Whisper on a dysarthric speech recognition task. Across metrics, the mid-level encoder layers (13-15) emerge as most informative, while fine-tuning induces only modest changes. The findings improve the interpretability of Whisper's embeddings and highlight the potential of probing analyses to guide the use of large-scale pretrained models for pathological speech.
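The layer-wise probing protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic arrays stand in for per-layer Whisper encoder embeddings (in practice these would come from mean-pooling `hidden_states[l]` over time from `transformers.WhisperModel(..., output_hidden_states=True)`), and the growing class separation across layers is an assumption made purely so the example produces a visible trend.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import silhouette_score
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic stand-in for per-layer encoder embeddings:
# layer_embeddings[l] has shape (n_utterances, dim). In the real setup
# these would be time-pooled hidden states from Whisper's encoder.
n, dim, n_layers = 200, 32, 24
labels = rng.integers(0, 2, size=n)  # 0 = control, 1 = dysarthric (hypothetical)
layer_embeddings = []
for l in range(n_layers + 1):
    sep = 0.2 * l  # assumption: deeper layers separate the classes more
    layer_embeddings.append(rng.normal(size=(n, dim)) + sep * labels[:, None])

# Probe each layer with a linear classifier, and complement the probe
# accuracy with Silhouette score and (mean per-dimension) mutual information.
for l, x in enumerate(layer_embeddings):
    acc = cross_val_score(LogisticRegression(max_iter=1000), x, labels, cv=5).mean()
    sil = silhouette_score(x, labels)
    mi = mutual_info_classif(x, labels, random_state=0).mean()
    print(f"layer {l:2d}: probe_acc={acc:.2f} silhouette={sil:.2f} mi={mi:.3f}")
```

With real Whisper embeddings, the layer whose probe accuracy (and agreement across the three metrics) peaks would be read as the most informative for detection or severity classification.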

URL

https://arxiv.org/abs/2510.04219

PDF

https://arxiv.org/pdf/2510.04219.pdf

