FDLP-Spectrogram: Capturing Speech Dynamics in Spectrograms for End-to-end Automatic Speech Recognition

Abstract
Abstract (translated)
URL
PDF

Abstract

We propose a technique to compute spectrograms using Frequency Domain Linear Prediction (FDLP) that uses all-pole models to fit the Hilbert envelope of speech in different frequency sub-bands. The spectrogram of a complete speech utterance is computed by overlap-add of contiguous all-pole model responses. The long context window of 1.5 seconds allows us to capture the low frequency temporal modulations of speech in the spectrogram. For an end-to-end automatic speech recognition task, the FDLP-spectrogram performs at-par with the standard mel-spectrogram features for clean read speech training and test data. For more realistic mismatched train-test situations and noisy, reverberated training data, the FDLP-spectrogram shows up to 25% and 22% WER improvements over mel-spectrogram respectively.

Abstract (translated)

URL

https://arxiv.org/abs/2103.14129

PDF

https://arxiv.org/pdf/2103.14129.pdf