Paper Reading AI Learner

Exploring Turkish Speech Recognition via Hybrid CTC/Attention Architecture and Multi-feature Fusion Network

2023-03-22 04:11:35
Zeyu Ren, Nurmement Yolwas, Huiru Wang, Wushour Slamu

Abstract

In recent years, end-to-end speech recognition technology based on deep learning has developed rapidly. However, owing to the scarcity of Turkish speech data, Turkish speech recognition systems perform poorly. This paper first studies a series of speech recognition tuning techniques; the results show that the model performs best when data augmentation combining speed perturbation with noise addition is adopted and the beam search width is set to 16. Second, to make full use of effective feature information and improve the accuracy of feature extraction, this paper proposes a new feature extractor, LSPC. LSPC is combined with a LiGRU network to form a shared encoder structure, thereby also achieving model compression. The results show that LSPC outperforms MSPC and VGGnet when only Fbank features are used, reducing WER by 1.01% and 2.53%, respectively. Finally, building on these two points, a new multi-feature fusion network is proposed as the main structure of the encoder. The proposed LSPC-based feature fusion network reduces WER by a further 0.82% and 1.94% compared with single-feature extraction using LSPC (Fbank features and Spectrogram features, respectively). Our model achieves performance comparable to that of advanced end-to-end models.
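The best-performing augmentation combines speed perturbation with noise addition. The paper does not give implementation details, so the sketch below is a minimal NumPy illustration under assumed conventions: speed perturbation via linear-interpolation resampling (typical factors such as 0.9/1.0/1.1), and additive white Gaussian noise at a chosen signal-to-noise ratio; the function names and the 20 dB SNR are illustrative, not from the paper.

```python
import numpy as np

def speed_perturb(wave, factor):
    # Resample the waveform so it plays `factor` times faster:
    # factor < 1 stretches (more samples), factor > 1 compresses.
    n_out = int(round(len(wave) / factor))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, old_idx, wave)

def add_noise(wave, snr_db, rng):
    # Add white Gaussian noise scaled to a target SNR (in dB).
    sig_power = np.mean(wave ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

# Example: one second of a 5 Hz sine at 16 kHz, slowed to 0.9x,
# then corrupted with noise at 20 dB SNR.
rng = np.random.default_rng(0)
wave = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 16000))
aug = add_noise(speed_perturb(wave, 0.9), snr_db=20, rng=rng)
```

In practice, toolkits typically apply such perturbations on the fly during training so that each epoch sees a differently augmented copy of the data.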

URL

https://arxiv.org/abs/2303.12300

PDF

https://arxiv.org/pdf/2303.12300.pdf

