Efficient Transformer for Direct Speech Translation

2021-07-07 08:13:40

Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà

arXiv_CL

Abstract

Transformer-based models have spread beyond text to other modalities. Speech, however, poses a problem: an audio input sequence is too long for the standard Transformer, whose self-attention cost grows quadratically with sequence length. The usual workaround is to add strided convolutional layers that shorten the sequence before it reaches the Transformer. In this paper, we propose a new approach for direct Speech Translation in which an efficient Transformer lets us work on the spectrogram without placing convolutional layers before the Transformer. The encoder can thus learn directly from the spectrogram, and no information is lost to downsampling. We have created an encoder-decoder model whose encoder is an efficient Transformer, the Longformer, and whose decoder is a traditional Transformer decoder. Our results, which are close to those obtained with the standard approach, show that this is a promising research direction.
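
To make the trade-off described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares the number of (query, key) attention pairs for three setups: full self-attention on the raw spectrogram, full self-attention after two strided convolutions, and Longformer-style sliding-window attention on the raw spectrogram. The frame count, kernel size, stride, padding, and window size are all illustrative assumptions and are not taken from the paper.

```python
def conv_output_length(length, kernel_size=3, stride=2, padding=1):
    """Length of a 1-D sequence after one strided convolution.

    Kernel size, stride, and padding are illustrative defaults (assumptions).
    """
    return (length + 2 * padding - kernel_size) // stride + 1


def full_attention_pairs(length):
    """Every position attends to every position: quadratic in length."""
    return length * length


def sliding_window_pairs(length, window):
    """Each position attends to at most window // 2 neighbours on each side
    plus itself, so the total grows linearly with length."""
    half = window // 2
    total = 0
    for i in range(length):
        lo = max(0, i - half)           # clip the window at the left edge
        hi = min(length - 1, i + half)  # clip the window at the right edge
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    frames = 1000  # hypothetical number of spectrogram frames
    window = 64    # hypothetical local attention window

    # Standard approach: two strided convolutions shorten the sequence first.
    reduced = conv_output_length(conv_output_length(frames))

    print("frames after two strided convolutions:", reduced)
    print("full attention, raw spectrogram:      ", full_attention_pairs(frames))
    print("full attention, after convolutions:   ", full_attention_pairs(reduced))
    print("sliding window, raw spectrogram:      ", sliding_window_pairs(frames, window))
```

Under these assumed settings, local attention on the full-length spectrogram costs roughly as many attention pairs as full attention on the convolution-shortened sequence, which is the kind of saving that makes it feasible to drop the convolutional front end.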

URL

https://arxiv.org/abs/2107.03069

PDF

https://arxiv.org/pdf/2107.03069.pdf