Discriminative Speaker Representation via Contrastive Learning with Class-Aware Attention in Angular Space

Abstract

Applying contrastive learning to speaker verification (SV) faces two challenges: the softmax-based contrastive loss lacks discriminative power, and hard negative pairs can easily influence learning. To overcome these challenges, we propose a contrastive learning SV framework that incorporates an additive angular margin into the supervised contrastive loss. The margin improves the discriminative ability of the speaker representation. We also introduce a class-aware attention mechanism through which hard negative samples contribute less to the supervised contrastive loss, and we employ a gradient-based multi-objective optimization approach to balance the classification and contrastive losses. Experimental results on CN-Celeb and VoxCeleb1 show that the new learning objective leads the encoder to an embedding space with strong speaker discrimination across languages.
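The idea of adding an angular margin to a supervised contrastive loss can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption: the function name `aam_supcon_loss`, the `margin` and `scale` parameters, the choice to apply the margin only to positive pairs, and the choice to use only different-speaker samples in the denominator. The class-aware attention and the multi-objective balancing are not shown.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine of two unit vectors, clamped so math.acos stays in its domain."""
    dot = sum(a * b for a, b in zip(u, v))
    return max(-1.0, min(1.0, dot))

def aam_supcon_loss(embeddings, labels, margin=0.2, scale=30.0):
    """Supervised contrastive loss with an additive angular margin (assumed form).

    For each anchor i and each same-speaker positive p, the positive logit is
    scale * cos(theta_ip + margin); negatives keep scale * cos(theta_ik).
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        neg_logits = [scale * cosine(z[i], z[k])
                      for k in range(n) if labels[k] != labels[i]]
        loss_i = 0.0
        for p in positives:
            theta = math.acos(cosine(z[i], z[p]))
            # Cap the angle at pi so the penalized cosine stays monotonic.
            pos_logit = scale * math.cos(min(theta + margin, math.pi))
            logits = [pos_logit] + neg_logits
            m = max(logits)  # log-sum-exp shift for numerical stability
            log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
            loss_i += log_denom - pos_logit
        total += loss_i / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0

# Toy batch: two utterances from each of two hypothetical speakers.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
print(aam_supcon_loss(toy_embeddings, toy_labels, margin=0.0, scale=10.0))
print(aam_supcon_loss(toy_embeddings, toy_labels, margin=0.2, scale=10.0))
```

In this sketch, a positive margin lowers the positive logit. The loss therefore stays higher until same-speaker embeddings are pulled closer together in angle than a plain softmax-based contrastive loss would require.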

URL

https://arxiv.org/abs/2210.16622

PDF

https://arxiv.org/pdf/2210.16622.pdf