Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization

2024-07-09 16:44:04
Jeongseok Hyun, Su Ho Han, Hyolim Kang, Joon-Young Lee, Seon Joo Kim

Abstract

The vocabulary size in temporal action localization (TAL) is constrained by the scarcity of large-scale annotated datasets. To address this, recent works incorporate powerful pre-trained vision-language models (VLMs), such as CLIP, to perform open-vocabulary TAL (OV-TAL). However, unlike VLMs trained on extensive image/video-text pairs, existing OV-TAL methods still rely on small, fully labeled TAL datasets to train an action localizer. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our self-training approach consists of two stages. First, a class-agnostic action localizer is trained on a human-labeled TAL dataset and used to generate pseudo-labels for unlabeled videos. Second, the large-scale pseudo-labeled dataset is combined with the human-labeled dataset to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we highlight issues with existing OV-TAL evaluation schemes and propose a new evaluation protocol. Code is released at this https URL
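The abstract describes a two-stage self-training pipeline. The sketch below is only a rough illustration of that flow in plain Python, not the authors' implementation: the function names (`train_localizer`, `pseudo_label`), the 0.8 confidence threshold, and the file names are all hypothetical stand-ins, and the training/inference bodies are stubs.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float  # segment start time (seconds)
    end: float    # segment end time (seconds)
    score: float  # class-agnostic localizer confidence

# --- Stage 1: train a class-agnostic localizer on human-labeled
#     data, then pseudo-label unlabeled web videos. ---

def train_localizer(videos: List[str], anns: List[List[Segment]]):
    """Stub for training; returns a trivial 'model' closure."""
    def model(video: str) -> List[Segment]:
        # Dummy proposal standing in for real inference.
        return [Segment(start=0.0, end=5.0, score=0.9)]
    return model

def pseudo_label(model, unlabeled: List[str], thresh: float = 0.8):
    """Keep only confident proposals as pseudo ground truth.
    The threshold value is an assumption, not from the paper."""
    out = {}
    for vid in unlabeled:
        segs = [s for s in model(vid) if s.score >= thresh]
        if segs:
            out[vid] = segs
    return out

# --- Stage 2: retrain on human labels + pseudo-labels combined. ---

labeled_videos = ["anet_001.mp4"]            # human-labeled TAL data
labeled_anns = [[Segment(2.0, 7.5, 1.0)]]
web_videos = ["yt_abc.mp4", "yt_def.mp4"]    # unlabeled YouTube videos

m0 = train_localizer(labeled_videos, labeled_anns)
pseudo = pseudo_label(m0, web_videos)

all_videos = labeled_videos + list(pseudo)
all_anns = labeled_anns + list(pseudo.values())
m1 = train_localizer(all_videos, all_anns)   # final localizer
```

In the paper's OV-TAL setting, the localizer only proposes class-agnostic segments; open-vocabulary action labels are assigned by a VLM such as CLIP, a classification step this sketch omits.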

URL

https://arxiv.org/abs/2407.07024

PDF

https://arxiv.org/pdf/2407.07024.pdf

