Beyond Universal Transformer: block reusing with adaptor in Transformer for automatic speech recognit

2023-03-23 06:54:37
Haoyu Tang, Zhaoyi Liu, Chang Zeng, Xinfeng Li

Abstract

Transformer-based models have recently achieved strong results in end-to-end (E2E) automatic speech recognition (ASR), and they make it possible to deploy E2E ASR systems on smart devices. However, these models still require a large number of parameters. To overcome this drawback of universal Transformer models for ASR on edge devices, we propose reusing blocks within the Transformer to build a small-footprint ASR system that accommodates resource limitations without compromising recognition accuracy. Specifically, we design a novel block-reusing strategy for speech Transformer (BRST) to improve parameter efficiency, and we propose an adapter module (ADM) that yields a compact and adaptable model with only a few additional trainable parameters accompanying each reused block. We evaluated the proposed method on the public AISHELL-1 corpus. The results show that it achieves a character error rate (CER) of 9.3% with 7.6M parameters without the ADM, and 6.63% with 8.3M parameters with the ADM. We also provide a deeper analysis of the effect of the ADM in the general block-reusing method.
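
The parameter saving that block reuse aims for can be illustrated with some rough arithmetic. The sketch below is not the authors' implementation: it assumes a standard Transformer encoder block, a single block shared across all layers, and a bottleneck-style adapter (a common adapter design; the abstract does not specify the ADM's structure). The dimensions used (d_model=256, d_ff=2048, 12 layers, bottleneck of 32) are hypothetical and are not taken from the paper, so the counts will not match the 7.6M/8.3M figures reported above.

```python
def block_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one standard Transformer encoder block (assumed layout)."""
    # Q, K, V and output projections, each a d_model x d_model matrix plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward network: two linear layers with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + feed_forward + layer_norms


def adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameter count of a hypothetical bottleneck adapter (down- then up-projection)."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


def stacked_params(n_layers: int, d_model: int, d_ff: int) -> int:
    """Conventional stack: every layer owns its own block."""
    return n_layers * block_params(d_model, d_ff)


def reused_params(n_layers: int, d_model: int, d_ff: int, bottleneck: int = 0) -> int:
    """One shared block applied n_layers times, optionally with one adapter per reuse."""
    shared = block_params(d_model, d_ff)
    adapters = n_layers * adapter_params(d_model, bottleneck) if bottleneck > 0 else 0
    return shared + adapters


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    n_layers, d_model, d_ff, bottleneck = 12, 256, 2048, 32
    print("stacked:          ", stacked_params(n_layers, d_model, d_ff))
    print("reused, no ADM:   ", reused_params(n_layers, d_model, d_ff))
    print("reused, with ADM: ", reused_params(n_layers, d_model, d_ff, bottleneck))
```

Under these assumptions, the shared block dominates the total and each adapter adds only a small per-layer cost, which is the trade-off the abstract describes: a few extra trainable parameters per reused block in exchange for layer-specific adaptation.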

URL

https://arxiv.org/abs/2303.13072

PDF

https://arxiv.org/pdf/2303.13072.pdf

