Neural Representations for Modeling Variation in English Speech

2020-11-25 11:19:12

Martijn Bartelds, Wietse de Vries, Faraz Sanal, Caitlin Richter, Mark Liberman, Martijn Wieling

arXiv_CL

Abstract

Variation in speech is often represented and investigated using phonetic transcriptions, but transcribing speech is time-consuming and error-prone. To create reliable representations of speech that are independent of phonetic transcriptions, we investigate the extraction of acoustic embeddings from several self-supervised neural models. We use these representations to compute word-based pronunciation differences between non-native and native speakers of English, and evaluate these differences by comparing them with human native-likeness judgments. We show that Transformer-based speech representations lead to significant performance gains over the use of phonetic transcriptions, and find that feature-based use of Transformer models is most effective with one or more middle layers instead of the final layer. We also demonstrate that these neural speech representations capture not only segmental differences, but also intonational and durational differences that cannot be represented by the set of discrete symbols used in phonetic transcriptions.
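
As a rough illustration of how a word-based pronunciation difference might be computed from such embeddings, the sketch below compares two variable-length sequences of frame-level vectors with dynamic time warping (DTW) and averages a learner's distance to several native reference recordings. The abstract does not name the alignment method or distance function, so DTW, the Euclidean frame distance, the path-length normalization, and the function names `dtw_distance` and `mean_difference_to_natives` are all assumptions made for this example. The frames are assumed to have been extracted beforehand, for instance from a middle layer of a Transformer speech model.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]  # one embedding vector for one audio frame


def euclidean(a: Frame, b: Frame) -> float:
    """Euclidean distance between two frame embeddings of equal size."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a: Sequence[Frame], seq_b: Sequence[Frame]) -> float:
    """Length-normalized DTW distance between two frame sequences.

    Assumption for illustration: the summed frame distance along the
    cheapest alignment path is divided by the number of path steps.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    # cost[i][j]: cheapest cumulative cost of aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    cost: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    # steps[i][j]: number of aligned frame pairs on that cheapest path.
    steps: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            # Pick the cheapest predecessor: insertion, deletion, or match.
            prev_cost, prev_steps = min(
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
            )
            cost[i][j] = d + prev_cost
            steps[i][j] = prev_steps + 1

    return cost[n][m] / steps[n][m]


def mean_difference_to_natives(
    learner: Sequence[Frame], natives: Sequence[Sequence[Frame]]
) -> float:
    """Average DTW distance from one learner word to native references."""
    if not natives:
        raise ValueError("at least one native reference is required")
    return sum(dtw_distance(learner, ref) for ref in natives) / len(natives)


if __name__ == "__main__":
    # Toy one-dimensional "embeddings"; real frames would be much larger.
    learner_word = [[0.0], [1.0], [2.0]]
    native_words = [[[0.0], [1.0], [1.0], [2.0]], [[0.0], [2.0]]]
    print(mean_difference_to_natives(learner_word, native_words))
```

Because DTW aligns frames before comparing them, a word spoken more slowly or quickly can still be matched to its reference. A score of this kind, averaged over words and speakers, is the sort of quantity that could then be correlated with human native-likeness judgments.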

URL

https://arxiv.org/abs/2011.12649

PDF

https://arxiv.org/pdf/2011.12649.pdf