
DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition

2024-03-26 12:27:32
Yi-Cheng Wang, Hsin-Wei Wang, Bi-Cheng Yan, Chi-Han Lin, Berlin Chen

Abstract

End-to-end automatic speech recognition (E2E ASR) systems often mistranscribe domain-specific phrases, such as named entities, which sometimes leads to catastrophic failures in downstream tasks. A family of fast and lightweight named entity correction (NEC) models for ASR has recently been proposed; these models normally build on phonetic-level edit distance algorithms and have shown impressive NEC performance. However, as the named entity (NE) list grows, phonetic confusion within the list is exacerbated; for example, homophone ambiguities increase substantially. In view of this, we propose a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions as additional information to mitigate phonetic confusion for NEC on ASR transcriptions. To this end, we introduce an efficient entity description augmented masked language model (EDA-MLM) that incorporates a dense retrieval model, enabling the MLM to adapt swiftly to domain-specific entities for the NEC task. A series of experiments conducted on the AISHELL-1 and Homophone datasets confirms the effectiveness of our modeling approach. DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), with a relative character error rate (CER) reduction of about 7% for named entities on AISHELL-1. More notably, when tested on Homophone, which contains named entities of high phonetic confusion, DANCER offers a more pronounced relative CER reduction of 46% over PED-NEC for named entities.
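
The homophone problem described in the abstract can be illustrated with a minimal sketch of phonetic edit-distance matching. This is not the authors' PED-NEC implementation: the entity names, the `TOY_PINYIN` table, and the `nearest_entities` helper are hypothetical, and a real system would derive pronunciations from a grapheme-to-phoneme tool rather than a hand-written table. The sketch only shows why distance over pronunciations alone cannot separate entities that sound the same.

```python
from typing import Dict, List, Tuple

# Hypothetical toy pronunciation table: toneless pinyin syllables per entity.
# These entries are illustrative assumptions, not data from the paper.
TOY_PINYIN: Dict[str, List[str]] = {
    "张伟": ["zhang", "wei"],
    "章炜": ["zhang", "wei"],  # homophone of the entry above
    "李娜": ["li", "na"],
}


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance computed over syllable sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def nearest_entities(span: List[str],
                     entities: Dict[str, List[str]]) -> Tuple[int, List[str]]:
    """Return the smallest phonetic distance and every entity that attains it."""
    scored = [(edit_distance(span, syl), name) for name, syl in entities.items()]
    best = min(d for d, _ in scored)
    return best, [name for d, name in scored if d == best]


if __name__ == "__main__":
    # A recognized span pronounced "zhang wei" ties between two candidates.
    print(nearest_entities(["zhang", "wei"], TOY_PINYIN))
```

Both homophones come back at distance 0, so a purely phonetic matcher has no basis for choosing between them. That tie is the gap the abstract says entity descriptions are meant to close.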

Abstract (translated)

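End-to-end automatic speech recognition (E2E ASR) systems often mistranscribe domain-specific phrases such as named entities, which sometimes leads to catastrophic failures in downstream tasks. A family of fast, lightweight named entity correction (NEC) models has recently been proposed; these models typically build on phonetic-level edit distance algorithms and have achieved impressive NEC performance. However, as the named entity (NE) list grows, phonetic confusion within the list becomes more pronounced; for example, homophone ambiguities increase substantially. To address this, we propose a new Description Augmented Named entity CorrEctoR (DANCER), which uses entity descriptions to supply additional information for NEC on ASR transcriptions and thereby mitigates phonetic confusion. To this end, we introduce an efficient entity description augmented masked language model (EDA-MLM) that lets the MLM adapt quickly to domain-specific entities for the NEC task. A series of experiments on the AISHELL-1 and Homophone datasets confirms the effectiveness of our modeling approach. On AISHELL-1, DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC). More notably, on the Homophone dataset, which contains entities of high phonetic confusion, DANCER achieves a relative CER reduction of 46% over PED-NEC.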

URL

https://arxiv.org/abs/2403.17645

PDF

https://arxiv.org/pdf/2403.17645.pdf

