Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

2022-10-15 04:15:50

Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky

arXiv_SD

Abstract

Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. These representations are typically aggregated across time using descriptive statistics, in particular the first- and second-order statistics of the representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised models, based on the correlations between the coefficients of the representations, which we refer to as correlation pooling. We show improvements over mean pooling, and further gains when the pooling methods are combined via fusion. The code is publicly available.
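
To make the contrast between the two aggregation strategies concrete, the following is a minimal sketch and not the authors' released implementation. It assumes an utterance is given as a NumPy array of shape (frames, channels), pools it once by averaging over time and once by taking the Pearson correlations between channel pairs, and keeps only the upper triangle of the correlation matrix. The function names `mean_pooling` and `correlation_pooling`, the `eps` parameter, and the random input are illustrative assumptions; any details of the actual method beyond what the abstract states are not reproduced here.

```python
import numpy as np


def mean_pooling(frames: np.ndarray) -> np.ndarray:
    """First-order statistic: average each channel over time."""
    return frames.mean(axis=0)


def correlation_pooling(frames: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Channel-wise Pearson correlations, flattened to a vector.

    Hypothetical helper for illustration; `eps` guards against
    division by zero for constant channels.
    """
    # Remove the per-channel mean and scale to unit variance over time.
    centered = frames - frames.mean(axis=0, keepdims=True)
    normalized = centered / (centered.std(axis=0, keepdims=True) + eps)
    # (channels x channels) correlation matrix.
    corr = normalized.T @ normalized / frames.shape[0]
    # The matrix is symmetric with a unit diagonal, so keep only
    # the entries strictly above the diagonal.
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    return corr[rows, cols]


# Stand-in for frame-level features from a self-supervised model:
# 200 frames, 8 channels (real models use far more channels).
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 8))

mean_vec = mean_pooling(frames)         # one value per channel
corr_vec = correlation_pooling(frames)  # one value per channel pair
```

With C channels, mean pooling yields a C-dimensional vector while this form of correlation pooling yields C(C-1)/2 values, so in practice some reduction of the channel dimension before pooling is likely needed; that step is left out of the sketch.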

URL

https://arxiv.org/abs/2210.09513

PDF

https://arxiv.org/pdf/2210.09513.pdf