Paper Reading AI Learner

Leveraging Temporal Contextualization for Video Action Recognition

2024-04-15 06:24:56
Minji Kim, Dongyoon Han, Taekyung Kim, Bohyung Han

Abstract

Pretrained vision-language models have shown effectiveness in video understanding. However, recent studies have not sufficiently leveraged essential temporal information from videos, simply averaging frame-wise representations or referencing consecutive frames. We introduce Temporally Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding that effectively and efficiently leverages comprehensive video information. We propose Temporal Contextualization (TC), a novel layer-wise temporal information infusion mechanism for video that extracts core information from each frame, interconnects relevant information across the video to summarize it into context tokens, and ultimately leverages the context tokens during the feature encoding process. Furthermore, our Video-conditional Prompting (VP) module uses the context tokens to generate informative prompts in the text modality. We conduct extensive experiments on zero-shot, few-shot, base-to-novel, and fully-supervised action recognition to validate the superiority of TC-CLIP, and ablation studies on TC and VP support our design choices. Code is available at this https URL
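
As a rough illustration of the mechanism described above, the PyTorch sketch below walks through the three TC steps (select informative tokens per frame, summarize them into video-level context tokens, attend to those context tokens during encoding) plus a small VP-style function that conditions text prompts on the same context tokens. The top-k selection by [CLS] attention, the chunked-mean summarization, and all function and argument names (temporal_contextualization, encode_with_context, video_conditional_prompting, k, num_context) are illustrative assumptions, not the paper's exact implementation.

    # Minimal sketch of Temporal Contextualization (TC) and Video-conditional
    # Prompting (VP), assuming a ViT-style backbone with per-frame patch tokens.
    import torch

    def temporal_contextualization(frame_tokens, cls_attn, k=16, num_context=8):
        """frame_tokens: (T, N, D) patch tokens per frame (excluding [CLS]).
        cls_attn:       (T, N) attention weight of [CLS] on each patch token.
        Returns (num_context, D) context tokens shared by all frames."""
        T, N, D = frame_tokens.shape
        # 1) Extract core information: keep the k patch tokens per frame that
        #    the [CLS] token attends to most strongly (an assumed criterion).
        topk = cls_attn.topk(k, dim=1).indices                      # (T, k)
        seeds = torch.gather(frame_tokens, 1,
                             topk.unsqueeze(-1).expand(-1, -1, D))  # (T, k, D)
        # 2) Interconnect relevant information across the video: pool the seed
        #    tokens from all frames and summarize them into a few context
        #    tokens. A plain chunked mean (grouping by index order, assuming
        #    T * k is divisible by num_context) stands in for the paper's
        #    actual summarization rule.
        seeds = seeds.reshape(T * k, D)
        context = seeds.reshape(num_context, -1, D).mean(dim=1)     # (num_context, D)
        return context

    def encode_with_context(frame_tokens, context, attn_layer):
        """3) Leverage context tokens during encoding: each frame attends to
        its own tokens plus the shared video-level context tokens."""
        T = frame_tokens.shape[0]
        ctx = context.unsqueeze(0).expand(T, -1, -1)                 # (T, C, D)
        keys = torch.cat([frame_tokens, ctx], dim=1)                 # (T, N + C, D)
        out, _ = attn_layer(frame_tokens, keys, keys)                # queries stay frame-local
        return out

    def video_conditional_prompting(prompt_tokens, context, cross_attn):
        """VP sketch: learnable text-prompt tokens (P, D) attend to the video's
        context tokens to produce video-conditional prompts."""
        ctx = context.unsqueeze(0)                                   # (1, C, D)
        out, _ = cross_attn(prompt_tokens.unsqueeze(0), ctx, ctx)    # (1, P, D)
        return prompt_tokens + out.squeeze(0)                        # residual prompt update

    # Example with 8 frames of 196 tokens (dim 512) and standard attention layers.
    attn = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
    cross = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
    frames, cls_attn = torch.randn(8, 196, 512), torch.rand(8, 196)
    ctx = temporal_contextualization(frames, cls_attn, k=16, num_context=8)
    frames = encode_with_context(frames, ctx, attn)                  # (8, 196, 512)
    prompts = video_conditional_prompting(torch.randn(16, 512), ctx, cross)

In the actual model this contextualization would be applied layer-wise inside the vision encoder, so each attention block sees both its frame's tokens and the shared context tokens; the paper's token-selection and summarization rules may differ from the simple heuristics assumed here.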

URL

https://arxiv.org/abs/2404.09490

PDF

https://arxiv.org/pdf/2404.09490.pdf

