Learning to Compute the Articulatory Representations of Speech with the MIRRORNET

2022-10-29 00:46:48

Yashish M. Siriwardena, Carol Espy-Wilson, Shihab Shamma

arXiv_SD

Abstract

Most organisms, including humans, function by coordinating and integrating sensory signals with motor actions to survive and accomplish desired tasks. Learning these complex sensorimotor mappings proceeds simultaneously and often in an unsupervised or semi-supervised fashion. This work explores an autoencoder architecture (MirrorNet), inspired by this sensorimotor learning paradigm, to learn how to control an articulatory synthesizer. The synthesizer takes as input control signals consisting of six vocal tract variables (TVs) and source features (voicing indicators and pitch), and generates the corresponding auditory spectrograms. Because the synthesizer is non-linear, the control parameters that produce a target speech signal are neither readily computable nor always unique. Here we demonstrate how to initialize the MirrorNet learning so that it produces a meaningful range of articulatory values. Once trained, the MirrorNet successfully estimates the TVs and source features needed to synthesize any arbitrary speech utterance. This approach outperforms the best previously designed 'speech inversion' systems on the Wisconsin X-ray microbeam (XRMB) dataset.
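
The point about non-unique control parameters and the role of initialization can be illustrated with a toy sketch. This is not the MirrorNet and not the articulatory synthesizer used in the paper: `toy_synthesizer`, `reconstruction_loss`, and `estimate_controls` are hypothetical names, the two-parameter synthesizer is invented for illustration, and plain gradient descent with finite differences stands in for the learned networks. It only shows that a non-linear synthesizer can be matched by more than one control vector, and that the starting point decides which one is found.

```python
def toy_synthesizer(controls):
    """Hypothetical non-linear 'synthesizer': 2 controls -> 2 output features.

    Because c0 only enters squared, (c0, c1) and (-c0, c1) give the same output,
    so the inverse mapping is not unique.
    """
    c0, c1 = controls
    return [c0 * c0 + c1, c0 * c0 * c1]


def reconstruction_loss(controls, target):
    """Squared error between the synthesized output and the target features."""
    output = toy_synthesizer(controls)
    return sum((o - t) ** 2 for o, t in zip(output, target))


def estimate_controls(target, init, lr=0.05, steps=5000, eps=1e-6):
    """Search for controls that reproduce `target`, starting from `init`.

    The synthesizer is treated as a black box: gradients are approximated
    with central finite differences (an assumption made for this sketch).
    """
    controls = list(init)
    for _ in range(steps):
        grad = []
        for i in range(len(controls)):
            plus = list(controls)
            minus = list(controls)
            plus[i] += eps
            minus[i] -= eps
            diff = reconstruction_loss(plus, target) - reconstruction_loss(minus, target)
            grad.append(diff / (2 * eps))
        controls = [c - lr * g for c, g in zip(controls, grad)]
    return controls


# Target produced by known controls; two different initializations are tried.
target = toy_synthesizer([1.0, 0.5])
est_pos = estimate_controls(target, init=[1.2, 0.1])
est_neg = estimate_controls(target, init=[-1.2, 0.1])
print("from positive init:", est_pos, reconstruction_loss(est_pos, target))
print("from negative init:", est_neg, reconstruction_loss(est_neg, target))
```

Both runs should reproduce the target closely while ending at control vectors of opposite sign in the first parameter, which is the kind of ambiguity that makes a sensible initialization matter.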

URL

https://arxiv.org/abs/2210.16454

PDF

https://arxiv.org/pdf/2210.16454.pdf