Audiomer: A Convolutional Transformer for Keyword Spotting

2021-09-21 15:28:41

Surya Kant Sahu, Sai Mitheran, Juhi Kamdar, Meet Gandhi

arXiv_CL

Abstract

Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. In audio tasks, however, they are either infeasible to train because raw audio waveforms have extremely long sequence lengths, or they reach competitive performance only after feature extraction through Fourier-based methods, which incurs a loss floor. In this work, we introduce an architecture, Audiomer, that combines 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in Keyword Spotting on raw audio waveforms, outperforming all previous methods while being computationally cheaper and far more parameter- and data-efficient. Audiomer allows for deployment on compute-constrained devices and training on smaller datasets.
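The sequence-length problem mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It is not the Audiomer implementation and not Performer's FAVOR+ mechanism: it swaps in a simple exponential feature map (an assumption made for illustration) to show the reordering that linear-attention methods rely on, namely summing over keys first so that no n x n score matrix is ever built. The function names, the 16 kHz sample rate, and the one-second clip length are all hypothetical choices, not values taken from the abstract.

```python
import math


def feature_map(x):
    # Simple positive feature map, assumed for illustration only.
    # Performer's FAVOR+ uses random features instead.
    return [math.exp(v) for v in x]


def quadratic_attention(Q, K, V):
    """Kernel attention that builds every row of the n x n score matrix."""
    Qf = [feature_map(q) for q in Q]
    Kf = [feature_map(k) for k in K]
    dv = len(V[0])
    out = []
    for q in Qf:
        # One row of n similarity scores for this query.
        scores = [sum(a * b for a, b in zip(q, k)) for k in Kf]
        norm = sum(scores)
        out.append([sum(s * v[j] for s, v in zip(scores, V)) / norm
                    for j in range(dv)])
    return out


def linear_attention(Q, K, V):
    """Same result, computed by summarizing keys and values first."""
    Qf = [feature_map(q) for q in Q]
    Kf = [feature_map(k) for k in K]
    d, dv = len(Kf[0]), len(V[0])
    # d x dv summary of all keys and values, independent of the query.
    kv = [[sum(k[i] * v[j] for k, v in zip(Kf, V)) for j in range(dv)]
          for i in range(d)]
    # Per-feature sum of keys, used for normalization.
    ksum = [sum(k[i] for k in Kf) for i in range(d)]
    out = []
    for q in Qf:
        norm = sum(q[i] * ksum[i] for i in range(d))
        out.append([sum(q[i] * kv[i][j] for i in range(d)) / norm
                    for j in range(dv)])
    return out


def score_matrix_entries(n):
    # Number of entries in a full n x n attention score matrix.
    return n * n


# Assumed example: a one-second clip sampled at 16 kHz has 16000 samples.
SAMPLES = 16000
FULL_ENTRIES = score_matrix_entries(SAMPLES)

Q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
K = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

Under these assumptions, a single attention head over a raw one-second waveform would need 16000 x 16000 = 256,000,000 score entries per layer, whereas the reordered form only keeps a small d x dv summary of keys and values. Both functions return the same values up to floating-point rounding, which is why the cheaper ordering can replace the expensive one.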

URL

https://arxiv.org/abs/2109.10252

PDF

https://arxiv.org/pdf/2109.10252.pdf