Paper Reading AI Learner

DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition

2024-04-23 10:17:42
Haozhe Cheng, Cheng Ju, Haicheng Wang, Jinxiang Liu, Mengting Chen, Qiang Hu, Xiaoyun Zhang, Yanfeng Wang

Abstract

As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently gained increasing attention with the development of vision-language pre-training. To generalize to arbitrary classes, existing methods treat class labels as text descriptions and formulate OVAR as evaluating the embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., contain misspellings and typos, which limits the real-world practicality of vanilla OVAR. To fill this research gap, this paper is the first to evaluate existing methods by simulating multiple levels of noise of various types, revealing their poor robustness. To tackle the noisy OVAR task, we further propose a novel DENOISER framework covering two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via a decoding process, i.e., it proposes text candidates and then uses inter-modal and intra-modal information to vote for the best one. In the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thereby obtaining richer semantics. For optimization, we alternately iterate between the generative and discriminative parts for progressive refinement: the denoised text classes help OVAR models classify visual samples more accurately, and in return the classified visual samples help better denoising. On three datasets, we carry out extensive experiments to show our superior robustness, and thorough ablations to dissect the effectiveness of each component.

Abstract (translated)

As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently received increasing attention with the development of vision-language pre-training. To generalize to arbitrary classes, existing methods treat class labels as text descriptions and formulate OVAR as evaluating the embedding similarity between visual samples and textual classes. However, one crucial issue has been completely overlooked: the class descriptions provided by users may be noisy, e.g., misspellings and typos, which limits the real-world applicability of vanilla OVAR. To fill this research gap, this paper evaluates existing methods by simulating multiple levels of noise of various types and reveals their fragility. To tackle the noisy OVAR task, we further propose a novel DENOISER framework consisting of two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via a decoding process, i.e., it proposes text candidates and then uses inter-modal and intra-modal information to vote for the best one. In the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thereby obtaining richer semantic information. For optimization, we alternately iterate between the generative and discriminative parts for progressive refinement. The denoised text classes help OVAR models classify visual samples more accurately; in return, the classified visual samples help better denoising. On three datasets, we conduct extensive experiments to demonstrate our superior robustness, together with thorough ablations on the effectiveness of each component.
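
To make the formulation in the abstract concrete, below is a minimal, self-contained NumPy sketch of the two parts it describes: a discriminative step that assigns a visual sample to the class text with the highest embedding similarity, and a generative step that scores text candidates for a noisy class name using intra-modal (text-text) and inter-modal (text-video) similarity. The encoders, candidate list, and scoring weights are illustrative placeholders under assumed names, not the paper's actual implementation.

```python
# Sketch only: random embeddings stand in for a frozen vision-language model
# (e.g. CLIP); encode_video/encode_text and the voting weight `alpha` are
# hypothetical stand-ins, not DENOISER's actual code.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512

def encode_video(video_id: str) -> np.ndarray:
    """Placeholder visual encoder: returns a unit-norm embedding."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_text(text: str) -> np.ndarray:
    """Placeholder text encoder: returns a unit-norm embedding."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def classify(video_emb: np.ndarray, class_embs: np.ndarray) -> int:
    """Discriminative part: assign the sample to the closest class text."""
    return int(np.argmax(class_embs @ video_emb))  # cosine sim (unit-norm vectors)

def vote(noisy_name: str, candidates: list[str],
         assigned_video_embs: np.ndarray, alpha: float = 0.5) -> str:
    """Generative part (one plausible reading): score each candidate by
    intra-modal similarity to the noisy name plus inter-modal similarity to
    the videos currently assigned to this class, and keep the best."""
    noisy_emb = encode_text(noisy_name)
    visual_proto = assigned_video_embs.mean(axis=0)
    visual_proto /= np.linalg.norm(visual_proto)
    scores = [alpha * float(encode_text(c) @ noisy_emb)
              + (1 - alpha) * float(encode_text(c) @ visual_proto)
              for c in candidates]
    return candidates[int(np.argmax(scores))]

# One alternation of the loop: classify videos with the current (noisy) class
# texts, then denoise one class name using the samples assigned to it.
noisy_classes = ["playing guittar", "ridng horse", "brushing teeth"]
class_embs = np.stack([encode_text(c) for c in noisy_classes])
video_embs = np.stack([encode_video(f"video_{i:03d}") for i in range(8)])
assignments = np.array([classify(v, class_embs) for v in video_embs])

candidates = ["playing guitar", "playing sitar", "playing guittar"]
mask = assignments == 0
if mask.any():
    print(vote("playing guittar", candidates, video_embs[mask]))
```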

URL

https://arxiv.org/abs/2404.14890

PDF

https://arxiv.org/pdf/2404.14890.pdf

