Abstract
In this paper, we investigate representation learning for low-resource keyword spotting (KWS). The main challenges of KWS are limited labeled data and limited available device resources. To address these challenges, we explore representation learning for KWS through self-supervised contrastive learning and self-training with a pretrained model. First, local-global contrastive Siamese networks (LGCSiam) are designed to learn similar utterance-level representations for similar audio samples via a proposed local-global contrastive loss, without requiring ground-truth labels. Second, a self-supervised pretrained Wav2Vec 2.0 model is applied as a constraint module (WVC) to force the KWS model to learn frame-level acoustic representations. With the LGCSiam and WVC modules, the proposed small-footprint KWS model can be pretrained on unlabeled data. Experiments on the Speech Commands dataset show that the self-training WVC module and the self-supervised LGCSiam module significantly improve accuracy, especially when training on a small labeled dataset.
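The paper's exact LGCSiam loss is not given in the abstract; as a rough illustration of the underlying idea, the following sketch implements a standard NT-Xent-style contrastive objective over a batch of utterance embeddings, pulling two augmented "views" of the same audio together and pushing different utterances apart. All names and parameters here (`contrastive_loss`, `temperature`) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: a generic Siamese contrastive loss,
# not the paper's local-global LGCSiam formulation.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same utterances.

    Each row of z1 is treated as a query whose positive is the
    matching row of z2; all other rows in the batch are negatives.
    """
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    logits = z1 @ z2.T / temperature              # scaled cosine similarities
    idx = np.arange(len(z1))                      # positives sit on the diagonal
    # row-wise cross-entropy against the diagonal targets
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, idx].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 64))                      # toy utterance embeddings
aligned = contrastive_loss(z, z + 0.01 * rng.normal(size=z.shape))
shuffled = contrastive_loss(z, rng.permutation(z, axis=0))
print(aligned, shuffled)                          # matched views score lower
```

In this framing, "similar audio samples" are two augmentations of the same utterance: a well-trained encoder gives them nearly identical embeddings, so the loss is small, while mismatched pairs (the shuffled case above) incur a large loss.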
URL
https://arxiv.org/abs/2303.10912