Abstract
This paper proposes a zero-shot learning approach for audio classification that relies only on textual information about class labels, without any audio samples from the target classes. We propose an audio classification system built on a bilinear model, which takes audio feature embeddings and semantic class label embeddings as input and measures the compatibility between an audio feature embedding and a class label embedding. We use VGGish to extract audio feature embeddings from audio recordings. We treat textual labels as semantic side information about audio classes and use Word2Vec to generate class label embeddings. Results on the ESC-50 dataset show that the proposed system can perform zero-shot audio classification with a small training dataset. It achieves accuracy better than random guess (10%) on each audio category, with an average of 26%, and reaches up to 39.7% for the category of natural audio classes.
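The bilinear compatibility idea in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: it assumes a 128-dimensional VGGish audio embedding and a 300-dimensional Word2Vec label embedding (both standard sizes for those models), with random weights standing in for the learned bilinear matrix.

```python
import numpy as np

# Hypothetical sketch of a bilinear compatibility model for zero-shot
# audio classification. Embedding sizes are assumptions: VGGish outputs
# 128-d audio embeddings; Word2Vec commonly uses 300-d label embeddings.
AUDIO_DIM, LABEL_DIM = 128, 300

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(AUDIO_DIM, LABEL_DIM))  # stand-in for learned weights

def compatibility(audio_emb, label_emb, W):
    """Bilinear compatibility score F(x, y) = x^T W y."""
    return audio_emb @ W @ label_emb

def classify(audio_emb, label_embs, W):
    """Predict the class whose label embedding is most compatible
    with the audio embedding (labels may be unseen at training time)."""
    scores = [compatibility(audio_emb, y, W) for y in label_embs]
    return int(np.argmax(scores))

# Toy usage: one audio clip scored against five candidate classes.
audio = rng.normal(size=AUDIO_DIM)
labels = rng.normal(size=(5, LABEL_DIM))
pred = classify(audio, labels, W)
```

At test time, only the Word2Vec embeddings of the (possibly unseen) target class labels are needed; the audio is assigned to whichever label embedding scores highest.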
URL
https://arxiv.org/abs/1905.01926