Multi-Sample Dynamic Time Warping for Few-Shot Keyword Spotting

2024-04-23 10:36:23
Kevin Wilkinghoff, Alessia Cornaggia-Urrigshardt

Abstract

In multi-sample keyword spotting, each keyword class is represented by multiple spoken instances, called samples. A naïve approach to detecting keywords in a target sequence is to query all samples of all classes using sub-sequence dynamic time warping. However, the resulting processing time grows linearly with the number of samples per class. Alternatively, a single Fréchet mean can be queried for each class, which reduces processing time but usually also degrades detection performance, since the variability of the query samples is not captured sufficiently well. In this work, multi-sample dynamic time warping is proposed to compute class-specific cost tensors that capture the variability of all query samples. To significantly reduce the computational complexity during inference, these cost tensors are converted into cost matrices before applying dynamic time warping. Experimental evaluations for few-shot keyword spotting show that this method performs very similarly to using all individual query samples as templates, while its runtime is only slightly slower than that of using Fréchet means.
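
As a rough illustration of the core idea, here is a minimal NumPy sketch, not the authors' implementation: per-sample cost matrices are stacked into a class-specific cost tensor, the sample axis is collapsed into a single cost matrix, and sub-sequence DTW then runs once per class on that matrix. The element-wise minimum reduction and the assumption that all query samples of a class have already been warped to a common length (e.g., by aligning each to the class's Fréchet mean) are assumptions of this sketch, and the function names (pairwise_cost, multi_sample_cost_matrix, subsequence_dtw) are illustrative.

```python
import numpy as np

def pairwise_cost(query, target):
    """Frame-wise Euclidean distances between one query sample (M, F)
    and the target sequence (N, F); returns an (M, N) cost matrix."""
    return np.linalg.norm(query[:, None, :] - target[None, :, :], axis=-1)

def multi_sample_cost_matrix(samples, target, reduce=np.min):
    """Stack the K per-sample cost matrices into a (K, M, N) cost tensor,
    then collapse the sample axis into a single (M, N) cost matrix.
    Assumes all samples were warped to a common length M beforehand."""
    tensor = np.stack([pairwise_cost(s, target) for s in samples])
    return reduce(tensor, axis=0)

def subsequence_dtw(cost):
    """Sub-sequence DTW on an (M, N) cost matrix: the query may start and
    end anywhere in the target. Returns the minimal matching cost."""
    M, N = cost.shape
    D = np.full((M, N), np.inf)
    D[0, :] = cost[0, :]                  # free start anywhere in the target
    for i in range(1, M):
        D[i, 0] = cost[i, 0] + D[i - 1, 0]
        for j in range(1, N):
            D[i, j] = cost[i, j] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, :].min()                 # free end anywhere in the target

# Toy usage with random feature sequences (e.g., 13-dim MFCC-like frames):
rng = np.random.default_rng(0)
samples = [rng.normal(size=(20, 13)) for _ in range(5)]  # 5 aligned query samples
target = rng.normal(size=(200, 13))                      # long target sequence
score = subsequence_dtw(multi_sample_cost_matrix(samples, target))
```

Collapsing the tensor before alignment is what keeps inference close to Fréchet-mean speed: only one accumulated-cost matrix is computed per class instead of one per query sample, while the reduction over the sample axis still retains some of the samples' variability.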

URL

https://arxiv.org/abs/2404.14903

PDF

https://arxiv.org/pdf/2404.14903.pdf

