Abstract
Vocal tract configurations play a vital role in generating distinguishable speech sounds by modulating the airflow and creating different resonant cavities during speech production. They contain abundant information that can be used to better understand the underlying speech production mechanism. As a step towards automatically mapping vocal tract geometry to acoustics, this paper employs effective video action recognition techniques, such as Long-term Recurrent Convolutional Network (LRCN) models, to identify different vowel-consonant-vowel (VCV) sequences from the dynamic shaping of the vocal tract. Such a model typically combines a CNN-based deep hierarchical visual feature extractor with recurrent networks, which makes the network spatio-temporally deep enough to learn the sequential dynamics of a short video clip for video classification tasks. We use a database of 2D real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The comparative performance of this class of algorithms under various parameter settings and classification tasks is discussed. Interestingly, the results show a marked difference in model performance on speech classification relative to generic sequence or video classification tasks.
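The LRCN architecture described above (a per-frame CNN encoder whose features are fed to a recurrent network that classifies the whole clip) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, frame resolution, and class count are placeholder assumptions.

```python
import torch
import torch.nn as nn

class LRCN(nn.Module):
    """Sketch of a Long-term Recurrent Convolutional Network:
    a CNN encodes each frame independently, an LSTM models the
    frame sequence, and a linear head classifies the clip.
    All hyperparameters here are illustrative, not the paper's."""

    def __init__(self, num_classes: int, hidden: int = 128):
        super().__init__()
        # Per-frame visual feature extractor (grayscale MRI frames).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())  # -> 32 * 4 * 4 = 512 features
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 1, height, width)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # per-frame features
        _, (h, _) = self.lstm(feats)   # final hidden state summarizes the clip
        return self.head(h[-1])        # one score per VCV class

# Hypothetical usage: 2 clips of 8 grayscale 64x64 frames, 10 VCV classes.
model = LRCN(num_classes=10)
logits = model(torch.randn(2, 8, 1, 64, 64))
```

Collapsing the batch and time dimensions before the CNN lets the same convolutional weights process every frame, and the LSTM then supplies the temporal depth that a frame-wise classifier lacks.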
URL
https://arxiv.org/abs/1807.11089