Text-to-feature diffusion for audio-visual few-shot learning

2023-09-07 17:30:36
Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

Abstract

Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data, with sound and visual information, has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, VGGSound-FSL, UCF-FSL, and ActivityNet-FSL, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal audio and visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at this https URL.
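The abstract combines two technical ideas: fusing the audio and visual streams via cross-modal attention, and running text-conditioned diffusion in feature space to synthesize examples of novel classes. Below is a minimal PyTorch sketch of both stages. The module names (CrossModalFusion, TextToFeatureDiffusion), dimensions, and the linear noise schedule are all assumptions made for illustration, using standard DDPM training and a DDIM-style sampler; this is a sketch of the general technique, not the authors' implementation (see the paper and released code for that).

```python
# Hypothetical sketch of the two stages described in the abstract.
# All names, dimensions, and the noise schedule are assumed, not from the paper.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse temporal audio and visual features with cross-modal attention."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio, visual):
        # audio: (B, Ta, D), visual: (B, Tv, D)
        a, _ = self.a2v(audio, visual, visual)   # audio queries attend to visual
        v, _ = self.v2a(visual, audio, audio)    # visual queries attend to audio
        fused = torch.cat([a.mean(1), v.mean(1)], dim=-1)  # temporal pooling
        return self.proj(fused)                  # (B, D) joint embedding

class TextToFeatureDiffusion(nn.Module):
    """Denoising network that predicts the noise in a feature vector,
    conditioned on a class-text embedding and the diffusion timestep."""
    def __init__(self, dim=512, text_dim=512, steps=1000):
        super().__init__()
        self.steps = steps
        self.t_embed = nn.Embedding(steps, dim)
        self.net = nn.Sequential(
            nn.Linear(dim + text_dim + dim, dim * 2), nn.GELU(),
            nn.Linear(dim * 2, dim),
        )
        # linear noise schedule (a common default, assumed here)
        betas = torch.linspace(1e-4, 0.02, steps)
        self.register_buffer("abar", torch.cumprod(1.0 - betas, dim=0))

    def forward(self, x_t, text, t):
        return self.net(torch.cat([x_t, text, self.t_embed(t)], dim=-1))

    def loss(self, x0, text):
        # standard DDPM objective: predict the injected noise
        t = torch.randint(0, self.steps, (x0.size(0),), device=x0.device)
        eps = torch.randn_like(x0)
        ab = self.abar[t].unsqueeze(-1)
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
        return nn.functional.mse_loss(self(x_t, text, t), eps)

    @torch.no_grad()
    def sample(self, text):
        # reverse diffusion: start from Gaussian noise and iteratively
        # denoise, conditioned on the class-text embedding (DDIM, eta=0)
        x = torch.randn(text.size(0), self.t_embed.embedding_dim,
                        device=text.device)
        for i in reversed(range(self.steps)):
            t = torch.full((x.size(0),), i, device=x.device, dtype=torch.long)
            eps = self(x, text, t)
            ab = self.abar[i]
            ab_prev = (self.abar[i - 1] if i > 0
                       else torch.tensor(1.0, device=x.device))
            x0 = (x - (1 - ab).sqrt() * eps) / ab.sqrt()
            x = ab_prev.sqrt() * x0 + (1 - ab_prev).sqrt() * eps
        return x
```

In a (generalised) few-shot setting, one would train the fusion module and the diffusion model on the base classes, then call sample with the text embedding of a novel class to synthesize audio-visual features that augment the handful of real support examples when fitting the classifier.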

URL

https://arxiv.org/abs/2309.03869

PDF

https://arxiv.org/pdf/2309.03869.pdf

