Dry, Focus, and Transcribe: End-to-End Integration of Dereverberation, Beamforming, and ASR

2019-04-19 01:36:37
Aswin Shanmugam Subramanian, Xiaofei Wang, Shinji Watanabe, Toru Taniguchi, Dung Tran, Yuya Fujita

Abstract

Sequence-to-sequence (S2S) modeling is becoming a popular paradigm for automatic speech recognition (ASR) because of its ability to jointly optimize all the conventional ASR components in an end-to-end (E2E) fashion. This paper extends the ability of E2E ASR from standard close-talk to far-field applications by encompassing entire multichannel speech enhancement and ASR components within the S2S model. There have been previous studies on jointly optimizing neural beamforming alongside E2E ASR for denoising. It is clear from both recent challenge outcomes and successful products that far-field systems would be incomplete without solving both denoising and dereverberation simultaneously. This paper proposes a novel architecture for far-field ASR by composing neural extensions of dereverberation and beamforming modules with the S2S ASR module as a single differentiable neural network and also clearly defining the role of each subnetwork. To our knowledge, this is the first successful demonstration of such a system, which we term DFTnet (dry, focus, and transcribe). It achieves better performance than conventional pipeline methods on the DIRHA English dataset and comparable performance on the REVERB dataset. It also has additional advantages of being neither iterative nor requiring parallel noisy and clean speech data.
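
The three-stage composition the abstract describes (dry, then focus, then transcribe) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`dry`, `focus`, `transcribe`, `dftnet_like_forward`) are hypothetical, the dereverberation stage is a per-channel delayed linear-prediction filter with fixed taps, the beamformer is a plain filter-and-sum with fixed weights, and the recognizer is replaced by a log-power feature stub. In the proposed system these quantities would be produced by neural subnetworks and trained jointly; here they are fixed inputs so that only the data flow is shown.

```python
import math
from typing import List, Sequence

Channel = List[complex]        # one frequency bin over time, one microphone
MultiChannel = List[Channel]   # indexed as [channel][frame]


def dry(x: MultiChannel, taps: Sequence[complex], delay: int = 2) -> MultiChannel:
    """Toy dereverberation (assumed stand-in): subtract a delayed linear
    prediction of each channel from itself."""
    out: MultiChannel = []
    for ch in x:
        y: Channel = []
        for t in range(len(ch)):
            pred = 0j
            for k, g in enumerate(taps):
                idx = t - delay - k
                if idx >= 0:
                    pred += g * ch[idx]
            y.append(ch[t] - pred)
        out.append(y)
    return out


def focus(x: MultiChannel, weights: Sequence[complex]) -> Channel:
    """Toy beamformer (assumed stand-in): filter-and-sum,
    y[t] = sum over channels c of conj(w_c) * x_c[t]."""
    frames = len(x[0])
    return [
        sum((w.conjugate() * ch[t] for w, ch in zip(weights, x)), 0j)
        for t in range(frames)
    ]


def transcribe(y: Channel) -> List[float]:
    """Placeholder for the S2S recognizer: returns log-power features only,
    not a transcript."""
    return [math.log(abs(v) ** 2 + 1e-10) for v in y]


def dftnet_like_forward(
    x: MultiChannel, taps: Sequence[complex], weights: Sequence[complex]
) -> List[float]:
    """Compose the three stages into one function from multichannel input
    to recognizer-side output."""
    return transcribe(focus(dry(x, taps), weights))


if __name__ == "__main__":
    x = [[1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j], [1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j]]
    print(dftnet_like_forward(x, taps=[0.5 + 0j], weights=[0.5 + 0j, 0.5 + 0j]))
```

Because each stage is an ordinary function of its input and parameters, the whole chain is one composed mapping. That is the property that would let a framework with automatic differentiation train all stages from the recognition loss alone, which matches the abstract's claim that no parallel clean speech is needed.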

URL

https://arxiv.org/abs/1904.09049

PDF

https://arxiv.org/pdf/1904.09049.pdf

