Abstract
Generalized Zero-Shot Learning (GZSL) is a challenging task that requires accurate classification of both seen and unseen classes. Within this domain, audio-visual GZSL is a particularly difficult variant, since it takes both visual and acoustic features as multi-modal inputs. Existing efforts in this field mostly rely on either embedding-based or generative methods. However, generative training is difficult and unstable, while embedding-based methods often suffer from the domain shift problem. We therefore find it promising to integrate both methods into a unified framework that leverages their advantages while mitigating their respective disadvantages. Our study introduces a general framework built around out-of-distribution (OOD) detection, aiming to harness the strengths of both approaches. We first employ generative adversarial networks to synthesize unseen features, which enables training an OOD detector alongside separate classifiers for seen and unseen classes. At test time, the detector determines whether a feature belongs to a seen or an unseen class, and the feature is then classified by the corresponding classifier. We evaluate our framework on three popular audio-visual datasets and observe a significant improvement over existing state-of-the-art methods. Code can be found at this https URL.
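The routing step described above (an OOD detector deciding which expert classifier handles each test feature) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random vectors stand in for audio-visual features and GAN-synthesized unseen features, the distance-based detector and median threshold are simplifying assumptions, and `classify`, `seen_classifier`, and `unseen_classifier` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the paper, features come from audio-visual
# backbones and unseen-class features are synthesized by a GAN; here we
# use random vectors purely to illustrate the routing logic.
seen_feats = rng.normal(0.0, 1.0, size=(100, 16))
synth_unseen_feats = rng.normal(3.0, 1.0, size=(100, 16))

# Toy OOD "detector": score a feature by its distance to the mean of
# the seen-class training features (an assumption, not the paper's model).
seen_mean = seen_feats.mean(axis=0)

def ood_score(x):
    return np.linalg.norm(x - seen_mean)

# Pick a threshold from the scores of seen and synthesized-unseen
# training features (again a simplification).
all_scores = [ood_score(x) for x in np.vstack([seen_feats, synth_unseen_feats])]
threshold = float(np.median(all_scores))

def classify(x, seen_classifier, unseen_classifier):
    # Route the test feature to the matching expert classifier.
    if ood_score(x) < threshold:
        return seen_classifier(x)
    return unseen_classifier(x)
```

The key design point is that neither expert classifier ever has to discriminate between seen and unseen classes itself; that burden is isolated in the detector, which is trainable because the GAN supplies surrogate unseen features.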
URL
https://arxiv.org/abs/2408.01284