Paper Reading AI Learner

Out-Of-Distribution Detection for Audio-visual Generalized Zero-Shot Learning: A General Framework

2024-08-02 14:10:20
Liuyuan Wen

Abstract

Generalized Zero-Shot Learning (GZSL) is a challenging task requiring accurate classification of both seen and unseen classes. Within this domain, Audio-visual GZSL emerges as an extremely exciting yet difficult task, given the inclusion of both visual and acoustic features as multi-modal inputs. Existing efforts in this field mostly utilize either embedding-based or generative-based methods. However, generative training is difficult and unstable, while embedding-based methods often encounter the domain shift problem. Thus, we find it promising to integrate both methods into a unified framework to leverage their advantages while mitigating their respective disadvantages. Our study introduces a general framework employing out-of-distribution (OOD) detection, aiming to harness the strengths of both approaches. We first employ generative adversarial networks to synthesize unseen features, enabling the training of an OOD detector alongside classifiers for seen and unseen classes. This detector determines whether a test feature belongs to the seen or unseen classes, after which the feature is classified by the corresponding classifier. We test our framework on three popular audio-visual datasets and observe a significant improvement compared to existing state-of-the-art works. Code can be found at this https URL.
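The inference pipeline the abstract describes — an OOD detector that routes each test feature to either a seen-class or an unseen-class classifier — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear heads `W_ood`, `W_seen`, and `W_unseen` are hypothetical stand-ins for the trained detector and classifiers, and the threshold is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the trained components: in the paper these
# would be learned (the OOD detector with GAN-synthesized unseen features),
# here they are random linear heads purely for illustration.
W_ood = rng.normal(size=(1, 64))      # OOD detector: scores "unseen-ness"
W_seen = rng.normal(size=(10, 64))    # classifier over 10 seen classes
W_unseen = rng.normal(size=(5, 64))   # classifier over 5 unseen classes

def classify(av_feature, threshold=0.0):
    """Route a fused audio-visual feature through the two-stage pipeline."""
    ood_score = float(W_ood @ av_feature)  # higher => more likely unseen
    if ood_score > threshold:
        # Detector flags the feature as unseen: use the unseen-class head.
        return "unseen", int(np.argmax(W_unseen @ av_feature))
    # Otherwise classify among the seen classes.
    return "seen", int(np.argmax(W_seen @ av_feature))

domain, label = classify(rng.normal(size=64))
print(domain, label)
```

The key design point is that the seen and unseen classifiers never compete directly on logits, which is how this routing mitigates the bias toward seen classes that embedding-based GZSL methods suffer from.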


URL

https://arxiv.org/abs/2408.01284

PDF

https://arxiv.org/pdf/2408.01284.pdf

