Exploring Representation Learning for Small-Footprint Keyword Spotting

2023-03-20 07:09:26
Fan Cui, Liyong Guo, Quandong Wang, Peng Gao, Yujun Wang

Abstract

In this paper, we investigate representation learning for low-resource keyword spotting (KWS). The main challenges in KWS are limited labeled data and limited available device resources. To address these challenges, we explore representation learning for KWS through self-supervised contrastive learning and self-training with a pretrained model. First, local-global contrastive Siamese networks (LGCSiam) are designed to learn similar utterance-level representations for similar audio samples via the proposed local-global contrastive loss, without requiring ground-truth labels. Second, a self-supervised pretrained Wav2Vec 2.0 model is applied as a constraint module (WVC) to force the KWS model to learn frame-level acoustic representations. With the LGCSiam and WVC modules, the proposed small-footprint KWS model can be pretrained on unlabeled data. Experiments on the Speech Commands dataset show that the self-training WVC module and the self-supervised LGCSiam module significantly improve accuracy, especially when training on a small labeled dataset.
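Going by the abstract alone, the two training signals can be sketched roughly as below. This is a minimal, hypothetical illustration and not the authors' implementation: the toy encoder, projection head, tensor shapes, and the random placeholder standing in for frozen Wav2Vec 2.0 frame features are all assumptions made to keep the example self-contained.

```python
# Hedged sketch of the two pretraining signals described in the abstract:
# (1) a contrastive loss on utterance-level embeddings of two augmented views,
# (2) a frame-level constraint pulling KWS features toward Wav2Vec 2.0 features.
# The model, shapes, and loss details are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallKWSEncoder(nn.Module):
    """Toy small-footprint encoder: frame-level features + pooled utterance embedding."""
    def __init__(self, n_mels=40, dim=64):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, mel):              # mel: (batch, n_mels, frames)
        frames = self.frame_net(mel)     # (batch, dim, frames) frame-level representations
        utt = frames.mean(dim=-1)        # (batch, dim) utterance-level representation
        return frames, utt

def contrastive_loss(z1, z2, temperature=0.1):
    """NT-Xent-style loss on utterance embeddings of two augmented views of the
    same audio; a stand-in for the paper's local-global contrastive loss."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0))        # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

def wvc_loss(kws_frames, w2v_frames, proj):
    """Frame-level constraint: regress projected KWS frame features onto
    (frozen) Wav2Vec 2.0 frame features."""
    pred = proj(kws_frames.transpose(1, 2))   # (batch, frames, w2v_dim)
    return F.mse_loss(pred, w2v_frames)

if __name__ == "__main__":
    encoder = SmallKWSEncoder()
    proj = nn.Linear(64, 768)                 # 768 = Wav2Vec 2.0 base feature size
    view1 = torch.randn(8, 40, 98)            # two augmented mel views of the same clips
    view2 = torch.randn(8, 40, 98)
    f1, u1 = encoder(view1)
    _, u2 = encoder(view2)
    w2v_feats = torch.randn(8, 98, 768)       # placeholder for frozen Wav2Vec 2.0 outputs
    loss = contrastive_loss(u1, u2) + wvc_loss(f1, w2v_feats, proj)
    loss.backward()
    print(float(loss))
```

In this reading, both losses can be computed on unlabeled audio, which is what allows the small-footprint KWS encoder to be pretrained before fine-tuning on the limited labeled set.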

Abstract (translated)

This paper studies representation learning for low-resource keyword spotting (KWS). The main challenges of KWS are limited labeled data and limited available device resources. To address these problems, we adopt self-supervised contrastive learning and self-training with a pretrained model for KWS representation learning. First, we design a local-global contrastive Siamese network (LGCSiam) that learns similar utterance-level representations for similar audio samples through a local-global contrastive loss, without requiring ground truth. Second, we apply a self-supervised pretrained Wav2Vec 2.0 model as a constraint module (WVC) to force the KWS model to learn frame-level acoustic representations. Through the LGCSiam and WVC modules, the proposed small-footprint KWS model can be pretrained with unlabeled data. Experiments on the Speech Commands dataset show that the self-training WVC module and the self-supervised LGCSiam module significantly improve accuracy, especially when training with only a small amount of labeled data.

URL

https://arxiv.org/abs/2303.10912

PDF

https://arxiv.org/pdf/2303.10912.pdf
