Anchored Speech Recognition with Neural Transducers

2022-10-20 21:00:42

Desh Raj, Junteng Jia, Jay Mahadeokar, Chunyang Wu, Niko Moritz, Xiaohui Zhang, Ozlem Kalinli

arXiv_SD

arXiv_SD Speech_Recognition Recognition Embedding Pose Action Speech

Abstract

Neural transducers have gained popularity in production ASR systems, achieving human-level recognition accuracy on standard benchmark datasets. However, their performance degrades significantly in the presence of crosstalk, especially when the background speech or noise is non-negligible compared to the primary speech (i.e., at low signal-to-noise ratio). Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., a wake-word) to recognize device-directed speech while ignoring interfering background speech and noise. In this paper, we investigate anchored speech recognition in the context of neural transducers. We use a tiny auxiliary network to extract context information from the anchor segment, and explore encoder biasing and joiner gating to guide the transducer towards the target speech. Moreover, to improve the robustness of context embedding extraction, we propose auxiliary training objectives to disentangle lexical content from speaking style. Our proposed methods are evaluated on synthetic LibriSpeech-based mixtures, where they improve word error rates by up to 36% compared to a background augmentation baseline.
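
The two conditioning mechanisms named in the abstract, encoder biasing and joiner gating, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the mean-pooled context extractor, the additive form of the bias, the frame-wise sigmoid gate, and every function name, weight name, and dimension below are assumptions chosen only to make the data flow concrete.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_embedding(anchor_feats, w_ctx):
    """Hypothetical tiny auxiliary network: mean-pool anchor frames, then project.
    anchor_feats: (T_a, F) -> returns (D,)"""
    return np.tanh(anchor_feats.mean(axis=0) @ w_ctx)

def bias_encoder(enc_out, ctx, w_bias):
    """Assumed additive encoder biasing: shift every encoder frame by a
    projection of the context embedding. enc_out: (T, H) -> (T, H)"""
    return enc_out + ctx @ w_bias

def gated_joiner(enc_out, pred_out, ctx, w_ge, w_gc, w_out):
    """Assumed joiner gating: a per-frame gate in (0, 1), computed from the
    encoder frame and the context, scales the joint representation.
    enc_out: (T, H), pred_out: (U, H) -> logits: (T, U, V)"""
    joint = np.tanh(enc_out[:, None, :] + pred_out[None, :, :])  # (T, U, H)
    gate = sigmoid(enc_out @ w_ge + ctx @ w_gc)                  # (T, H)
    return (joint * gate[:, None, :]) @ w_out                    # (T, U, V)

# Toy dimensions and random weights, for shape checking only.
rng = np.random.default_rng(0)
T_a, T, U, F, D, H, V = 20, 50, 6, 80, 16, 32, 10
anchor = rng.normal(size=(T_a, F))    # anchor (wake-word) features
enc_out = rng.normal(size=(T, H))     # encoder output for the full utterance
pred_out = rng.normal(size=(U, H))    # prediction-network output
w_ctx = rng.normal(size=(F, D)) * 0.1
w_bias = rng.normal(size=(D, H)) * 0.1
w_ge = rng.normal(size=(H, H)) * 0.1
w_gc = rng.normal(size=(D, H)) * 0.1
w_out = rng.normal(size=(H, V)) * 0.1

ctx = context_embedding(anchor, w_ctx)
biased = bias_encoder(enc_out, ctx, w_bias)
logits = gated_joiner(biased, pred_out, ctx, w_ge, w_gc, w_out)
print(logits.shape)
```

In this sketch the gate depends on both the encoder frame and the anchor context, so frames that do not match the anchor could in principle be attenuated before the output projection; whether the paper's gate takes this form is not stated in the abstract.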

URL

https://arxiv.org/abs/2210.11588

PDF

https://arxiv.org/pdf/2210.11588.pdf