Paper Reading AI Learner

DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition

2024-04-23 10:17:42
Haozhe Cheng, Cheng Ju, Haicheng Wang, Jinxiang Liu, Mengting Chen, Qiang Hu, Xiaoyun Zhang, Yanfeng Wang

Abstract

As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently gained increasing attention with the development of vision-language pre-training. To generalize to arbitrary classes, existing methods treat class labels as text descriptions and formulate OVAR as evaluating the embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., contain misspellings and typos, which limits the real-world practicality of vanilla OVAR. To fill this research gap, this paper is the first to evaluate existing methods by simulating multiple levels of noise of various types, revealing their poor robustness. To tackle the noisy OVAR task, we further propose a novel DENOISER framework covering two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via a decoding process, i.e., it proposes text candidates and then uses inter-modal and intra-modal information to vote for the best one. In the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thereby obtaining richer semantics. For optimization, we alternately iterate between the generative and discriminative parts for progressive refinement: the denoised text classes help OVAR models classify visual samples more accurately, and in return the classified visual samples help better denoising. On three datasets, we carry out extensive experiments to show our superior robustness, and thorough ablations to dissect the effectiveness of each component.

Abstract (translated)

As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently received increasing attention with the development of vision-language pre-training. To generalize to arbitrary classes, existing methods treat class labels as text descriptions and formulate OVAR as evaluating the embedding similarity between visual samples and textual classes. However, one crucial issue has been completely overlooked: the class descriptions provided by users may be noisy, e.g., misspellings and typos, which limits the real-world applicability of vanilla OVAR. To fill this research gap, this paper evaluates existing methods by simulating multiple levels of noise of various types and reveals their fragility. To tackle the noisy OVAR task, we further propose a novel DENOISER framework consisting of two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via a decoding process, i.e., it proposes text candidates and then uses inter-modal and intra-modal information to vote for the best one. In the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thereby obtaining richer semantic information. For optimization, we alternately iterate between the generative and discriminative parts for progressive refinement. The denoised text classes help OVAR models classify visual samples more accurately; in return, the classified visual samples help better denoising. On three datasets, we conduct extensive experiments to demonstrate our superior robustness, together with thorough ablations on the effectiveness of each component.
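
To make the formulation in the abstract concrete, below is a minimal, self-contained NumPy sketch of the two parts it describes: a discriminative step that assigns a visual sample to the class text with the highest embedding similarity, and a generative step that scores text candidates for a noisy class name using intra-modal (text-text) and inter-modal (text-video) similarity. The encoders, candidate list, and scoring weights are illustrative placeholders under assumed names, not the paper's actual implementation.

```python
# Sketch only: random embeddings stand in for a frozen vision-language model
# (e.g. CLIP); encode_video/encode_text and the voting weight `alpha` are
# hypothetical stand-ins, not DENOISER's actual code.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512

def encode_video(video_id: str) -> np.ndarray:
    """Placeholder visual encoder: returns a unit-norm embedding."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_text(text: str) -> np.ndarray:
    """Placeholder text encoder: returns a unit-norm embedding."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def classify(video_emb: np.ndarray, class_embs: np.ndarray) -> int:
    """Discriminative part: assign the sample to the closest class text."""
    return int(np.argmax(class_embs @ video_emb))  # cosine sim (unit-norm vectors)

def vote(noisy_name: str, candidates: list[str],
         assigned_video_embs: np.ndarray, alpha: float = 0.5) -> str:
    """Generative part (one plausible reading): score each candidate by
    intra-modal similarity to the noisy name plus inter-modal similarity to
    the videos currently assigned to this class, and keep the best."""
    noisy_emb = encode_text(noisy_name)
    visual_proto = assigned_video_embs.mean(axis=0)
    visual_proto /= np.linalg.norm(visual_proto)
    scores = [alpha * float(encode_text(c) @ noisy_emb)
              + (1 - alpha) * float(encode_text(c) @ visual_proto)
              for c in candidates]
    return candidates[int(np.argmax(scores))]

# One alternation of the loop: classify videos with the current (noisy) class
# texts, then denoise one class name using the samples assigned to it.
noisy_classes = ["playing guittar", "ridng horse", "brushing teeth"]
class_embs = np.stack([encode_text(c) for c in noisy_classes])
video_embs = np.stack([encode_video(f"video_{i:03d}") for i in range(8)])
assignments = np.array([classify(v, class_embs) for v in video_embs])

candidates = ["playing guitar", "playing sitar", "playing guittar"]
mask = assignments == 0
if mask.any():
    print(vote("playing guittar", candidates, video_embs[mask]))
```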

URL

https://arxiv.org/abs/2404.14890

PDF

https://arxiv.org/pdf/2404.14890.pdf

