Abstract
Conventional ASR systems use frame-level phoneme posteriors to perform forced alignment~(FA) and provide timestamps, whereas end-to-end ASR systems, especially AED-based ones, lack this ability. This paper proposes performing timestamp prediction~(TP) during recognition by utilizing the continuous integrate-and-fire~(CIF) mechanism in the non-autoregressive ASR model Paraformer. Focusing on the fire-place bias of CIF, we apply post-processing strategies including fire-delay and silence insertion. In addition, we propose scaled-CIF to smooth the weights of the CIF output, which proves beneficial for both the ASR and TP tasks. Accumulated averaging shift~(AAS) and diarization error rate~(DER) are adopted to measure the quality of timestamps, and we compare these metrics between the proposed system and a conventional hybrid forced-alignment system. Experimental results on a test set with manually marked timestamps show that the proposed optimization methods significantly improve the accuracy of CIF timestamps, reducing AAS and DER by 66.7\% and 82.1\%, respectively. Compared with a Kaldi forced-alignment system trained on the same data, the optimized CIF timestamps achieve a 12.3\% relative AAS reduction.
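To make the idea of reading timestamps off CIF concrete, the following is a minimal sketch and not the Paraformer implementation. It assumes the generic CIF rule (accumulate per-frame weights and fire a token each time the accumulator reaches a threshold of 1.0), treats the frame index of each fire as the raw timestamp, and scores it with a plain mean absolute shift, used here only as a simplified stand-in for AAS. The function names, the `frame_shift_s` parameter, and the convention of mapping frame `t` to time `t * frame_shift_s` are all hypothetical; fire-delay, silence insertion, and scaled-CIF are not modelled.

```python
from typing import List, Sequence


def cif_fire_frames(alphas: Sequence[float], threshold: float = 1.0) -> List[int]:
    """Return the frame index at which each token fires.

    Generic CIF accumulation: weights are summed frame by frame and a token
    fires whenever the running sum reaches `threshold`.
    """
    fires: List[int] = []
    acc = 0.0
    for t, alpha in enumerate(alphas):
        acc += alpha
        # A small tolerance guards against floating-point round-off;
        # the loop allows one frame to fire more than once.
        while acc >= threshold - 1e-9:
            fires.append(t)
            acc -= threshold
    return fires


def frames_to_seconds(frames: Sequence[int], frame_shift_s: float) -> List[float]:
    """Map frame indices to seconds (assumed convention: t * frame shift)."""
    return [f * frame_shift_s for f in frames]


def mean_abs_shift(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute difference between predicted and reference timestamps.

    A simplified stand-in for AAS, not the paper's exact definition.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


if __name__ == "__main__":
    weights = [0.25, 0.5, 0.25, 0.0, 0.5, 0.5]  # toy per-frame CIF weights
    fire_frames = cif_fire_frames(weights)
    fire_times = frames_to_seconds(fire_frames, frame_shift_s=0.5)
    print(fire_frames, fire_times, mean_abs_shift(fire_times, [1.0, 2.0]))
```

In this toy case the accumulator reaches 1.0 at frames 2 and 5, so two tokens fire. Because a fire marks the frame where enough weight has accumulated rather than the acoustic boundary of the token, raw fire positions can be biased, which is the kind of issue the post-processing strategies in the abstract address.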
URL
https://arxiv.org/abs/2301.12343