Abstract
Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data, with both sound and visual information, has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e. VGGSound-FSL, UCF-FSL, and ActivityNet-FSL, on which we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at this https URL.
URL
https://arxiv.org/abs/2309.03869
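The abstract describes a two-stage pipeline: cross-modal attention fuses audio and visual features, and a text-conditioned diffusion process then generates features for novel classes. The following is a minimal NumPy sketch of those two ideas only, not the paper's actual AV-DIFF architecture: the single-head attention, the function names, and the toy one-line denoising update are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention.
    # queries: (Tq, d), keys_values: (Tk, d) -> (Tq, d)
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def fuse_audio_visual(audio, video):
    # Cross-modal fusion: each modality attends over the other,
    # then mean-pool over time and concatenate into a clip-level feature.
    a = cross_attention(audio, video).mean(axis=0)  # audio queries over video
    v = cross_attention(video, audio).mean(axis=0)  # video queries over audio
    return np.concatenate([a, v])  # shape (2d,)

def generate_feature(text_emb, denoise_fn, steps=10, seed=0):
    # Toy reverse-diffusion loop: start from Gaussian noise and
    # iteratively denoise toward a feature conditioned on the class
    # text embedding. A real diffusion model would use a learned
    # noise-prediction network and a proper noise schedule.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(text_emb.shape)
    for t in range(steps, 0, -1):
        eps_hat = denoise_fn(x, t, text_emb)  # predicted noise at step t
        x = x - eps_hat / steps               # crude denoising update
    return x
```

In a few-shot setting, features generated this way for novel class names could augment the handful of real labeled examples when training the classifier.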