
Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model

2023-01-29 03:47:59
Xian Shi, Yanni Chen, Shiliang Zhang, Zhijie Yan

Abstract

Conventional ASR systems use frame-level phoneme posteriors to perform forced alignment (FA) and provide timestamps, while end-to-end ASR systems, especially AED-based ones, lack this ability. This paper proposes to perform timestamp prediction (TP) during recognition by utilizing the continuous integrate-and-fire (CIF) mechanism in the non-autoregressive ASR model Paraformer. Focusing on the fire-place bias issue of CIF, we apply post-processing strategies including fire-delay and silence insertion. In addition, we propose scaled-CIF to smooth the weights of the CIF output, which proves beneficial for both the ASR and TP tasks. Accumulated averaging shift (AAS) and diarization error rate (DER) are adopted to measure the quality of timestamps, and we compare these metrics between the proposed system and a conventional hybrid forced-alignment system. Experimental results on a test set with manually marked timestamps show that the proposed optimization methods significantly improve the accuracy of CIF timestamps, reducing AAS and DER by 66.7% and 82.1% respectively. Compared to Kaldi forced alignment trained with the same data, the optimized CIF timestamps achieve a 12.3% relative AAS reduction.
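
To illustrate how CIF firing positions can serve as raw timestamps, here is a minimal sketch of the generic integrate-and-fire rule: per-frame weights are accumulated, and a token boundary fires whenever the running sum reaches a threshold. This is not the paper's implementation. The function names, the threshold of 1.0, and the 60 ms frame shift are assumptions made for illustration, and the fire-delay, silence insertion, and scaled-CIF steps from the abstract are not modeled.

```python
def cif_fire_frames(alphas, threshold=1.0):
    """Return the frame index at which each token fires.

    alphas: per-frame CIF weights, each assumed to lie in [0, 1).
    threshold: accumulated weight needed to fire (assumed to be 1.0).
    """
    fired = []
    acc = 0.0
    for t, alpha in enumerate(alphas):
        acc += alpha
        if acc >= threshold:
            fired.append(t)
            # Carry the remainder over to the next token.
            acc -= threshold
    return fired


def frames_to_seconds(frames, frame_shift=0.06):
    """Convert frame indices to seconds (frame_shift is an assumed value)."""
    return [round(f * frame_shift, 3) for f in frames]


if __name__ == "__main__":
    weights = [0.25, 0.5, 0.25, 0.5, 0.5, 0.125]
    frames = cif_fire_frames(weights)
    print(frames, frames_to_seconds(frames))
```

The raw firing frames mark only where the accumulated weight crosses the threshold, which is why the abstract describes post-processing to correct the bias of these positions before they are used as timestamps.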

URL

https://arxiv.org/abs/2301.12343

PDF

https://arxiv.org/pdf/2301.12343.pdf

