Paper Reading AI Learner

On-the-fly Text Retrieval for End-to-End ASR Adaptation

2023-03-20 08:54:40
Bolaji Yusuf, Aditya Gourav, Ankur Gandhe, Ivan Bulyko

Abstract

End-to-end speech recognition models can be improved by incorporating external text sources, typically by fusion with an external language model. Such language models have to be retrained whenever the corpus of interest changes. Furthermore, since they store the entire corpus in their parameters, rare words can be challenging to recall. In this work, we propose augmenting a transducer-based ASR model with a retrieval language model, which directly retrieves from an external text corpus plausible completions for a partial ASR hypothesis. These completions are then integrated into subsequent predictions by an adapter, which is trained once, so that the corpus of interest can be switched without incurring the computational overhead of retraining. Our experiments show that the proposed model significantly improves the performance of a transducer baseline on a pair of question-answering datasets. Further, it outperforms shallow fusion on recognition of named entities by about 7% relative; when the two are combined, the relative improvement increases to 13%.
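The abstract describes the core mechanism at a high level: given a partial ASR hypothesis, retrieve plausible completions from an external text corpus and use them to bias subsequent predictions, with the corpus swappable at no retraining cost. The Python sketch below is only a minimal illustration of that idea, not the authors' implementation: the toy corpus, the n-gram prefix index standing in for the paper's retriever, and the simple score boost standing in for the trained adapter are all assumptions made for the example.

# Toy sketch of retrieval-based biasing for a partial ASR hypothesis.
# The corpus, the prefix index, and the score boost are illustrative
# assumptions; the paper integrates retrievals with a trained adapter.
from collections import defaultdict

corpus = [  # hypothetical external text corpus, swappable without retraining
    "who wrote the adventures of huckleberry finn",
    "who wrote the great gatsby",
    "when was niels bohr born",
]

def build_index(sentences, order=2):
    """Map each word n-gram prefix to the words that follow it in the corpus."""
    index = defaultdict(set)
    for sent in sentences:
        words = sent.split()
        for i in range(len(words) - order):
            index[tuple(words[i:i + order])].add(words[i + order])
    return index

def retrieve_completions(index, partial_hypothesis, order=2):
    """Retrieve plausible next words for a partial ASR hypothesis."""
    words = partial_hypothesis.split()
    return set(index[tuple(words[-order:])]) if len(words) >= order else set()

def bias_scores(asr_scores, retrieved, weight=2.0):
    """Stand-in for the adapter: boost next-token scores suggested by retrieval."""
    return {tok: score + (weight if tok in retrieved else 0.0)
            for tok, score in asr_scores.items()}

if __name__ == "__main__":
    index = build_index(corpus)
    completions = retrieve_completions(index, "who wrote the")
    asr_scores = {"adventures": 0.1, "great": 0.3, "cat": 0.5}  # toy decoder scores
    print(completions)                        # e.g. {'adventures', 'great'}
    print(bias_scores(asr_scores, completions))

Because the corpus lives outside the model parameters, switching domains only requires rebuilding the index over new text; in the paper, this same property is what lets the adapter be trained once and reused across corpora.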

Abstract (translated)

End-to-end speech recognition models can be improved by incorporating external text sources, typically through fusion with an external language model. These language models must be retrained whenever the corpus of interest changes. Moreover, because they store the entire corpus in their parameters, rare words can be difficult to recall. In this work, we propose combining a transducer-based ASR model with a retrieval language model that directly retrieves, from an external text corpus, plausible completions for a partial ASR hypothesis. These completions are then integrated into subsequent predictions by an adapter that is trained only once, so the corpus of interest can be switched without the computational overhead of retraining. Our experimental results show that the proposed model significantly improves the performance of a transducer baseline on two question-answering datasets. Furthermore, it outperforms shallow fusion on the recognition of named entities by about 7% relative; when the two are combined, the relative improvement increases to 13%.

URL

https://arxiv.org/abs/2303.10942

PDF

https://arxiv.org/pdf/2303.10942.pdf

