Paper Reading AI Learner

Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI

2018-07-29 17:36:08
Pramit Saha, Praneeth Srungarapu, Sidney Fels

Abstract

Vocal tract configurations play a vital role in generating distinguishable speech sounds by modulating airflow and creating different resonant cavities during speech production. They contain abundant information that can be utilized to better understand the underlying speech production mechanism. As a step towards automatically mapping vocal tract shape geometry to acoustics, this paper employs effective video action recognition techniques, such as Long-term Recurrent Convolutional Network (LRCN) models, to identify different vowel-consonant-vowel (VCV) sequences from the dynamic shaping of the vocal tract. Such a model typically combines a CNN-based deep hierarchical visual feature extractor with recurrent networks, which ideally makes the network spatio-temporally deep enough to learn the sequential dynamics of a short video clip for video classification tasks. We use a database consisting of 2D real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The comparative performance of this class of algorithms under various parameter settings and for various classification tasks is discussed. Interestingly, the results show a marked difference in model performance on speech classification relative to generic sequence or video classification tasks.
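The LRCN idea described in the abstract (a per-frame CNN feature extractor whose outputs are fed through a recurrent network for clip-level classification) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, the number of output classes, and the input resolution are all placeholder assumptions.

```python
import torch
import torch.nn as nn

class LRCN(nn.Module):
    """Minimal LRCN sketch: per-frame CNN features -> LSTM -> classifier.

    All hyperparameters here (channel counts, feature/hidden dims,
    num_classes) are illustrative assumptions, not values from the paper.
    """

    def __init__(self, num_classes, feat_dim=128, hidden_dim=64):
        super().__init__()
        # Small per-frame encoder; stands in for the deep CNN extractor.
        # Input frames are assumed single-channel (grayscale MRI).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Recurrent part models the temporal dynamics of the frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):
        # clips: (batch, time, 1, H, W)
        b, t = clips.shape[:2]
        # Run the CNN on every frame by folding time into the batch axis.
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        # Classify the clip from the final time step's hidden state.
        return self.head(out[:, -1])

# Toy forward pass: 2 clips of 10 grayscale 64x64 frames, 8 hypothetical classes.
model = LRCN(num_classes=8)
logits = model(torch.randn(2, 10, 1, 64, 64))
```

Folding the time axis into the batch axis (`clips.flatten(0, 1)`) is the standard way to apply the same CNN to every frame before handing the feature sequence to the LSTM.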

URL

https://arxiv.org/abs/1807.11089

PDF

https://arxiv.org/pdf/1807.11089.pdf

