
Video Action Recognition with Attentive Semantic Units

2023-03-17 03:44:15
Yifei Chen, Dapeng Chen, Ruijin Liu, Hao Li, Wei Peng

Abstract

Visual-Language Models (VLMs) have significantly advanced video action recognition. Supervised by the semantics of action labels, recent works adapt the visual branch of VLMs to learn video representations. Despite the effectiveness demonstrated by these works, we believe the potential of VLMs has yet to be fully harnessed. In light of this, we exploit the semantic units (SUs) hidden behind the action labels and leverage their correlations with fine-grained items in frames for more accurate action recognition. SUs are entities extracted from the language descriptions of the entire action set, including body parts, objects, scenes, and motions. To further enhance the alignment between visual contents and the SUs, we introduce a multi-region module (MRA) into the visual branch of the VLM. The MRA perceives region-aware visual features beyond the original global feature. Our method adaptively attends to and selects the SUs relevant to the visual features of each frame. With a cross-modal decoder, the selected SUs then decode spatiotemporal video representations. In summary, the SUs serve as a medium that boosts both discriminative ability and transferability. Specifically, in fully-supervised learning, our method achieved 87.8% top-1 accuracy on Kinetics-400. In K=2 few-shot experiments, our method surpassed the previous state of the art by +7.1% and +15.0% on HMDB-51 and UCF-101, respectively.

Abstract (translated)

Visual-Language Models (VLMs) have greatly improved the performance of video action recognition. Supervised by the semantics of action labels, recent works adapt the visual branch of VLMs to learn video representations. Although these works demonstrate the effectiveness of VLMs, we believe their potential has not yet been fully harnessed. We therefore exploit the semantic units (SUs) hidden behind the action labels and leverage their correlations with fine-grained items in frames for more accurate action recognition. SUs are entities extracted from the language descriptions of the entire action set, including body parts, objects, scenes, and motions. To further strengthen the alignment between visual contents and the SUs, we introduce a multi-region module (MRA) into the visual branch of the VLM. The MRA perceives region-aware visual features beyond the original global feature. Our method adaptively attends to and selects the SUs relevant to the visual features of each frame. With a cross-modal decoder, the selected SUs decode spatiotemporal video representations. The SUs as a medium can thus enhance both discriminative ability and transferability. Specifically, in fully-supervised learning, our method achieved 87.8% top-1 accuracy on Kinetics-400. In K=2 few-shot experiments, our method surpassed the previous state of the art by +7.1% and +15.0% on HMDB-51 and UCF-101, respectively.
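
To make the attend-and-select idea concrete, below is a minimal PyTorch sketch, not the authors' released code: region-aware frame features cross-attend over SU text embeddings to select relevant units, and the selected units then query the spatiotemporal features through a cross-modal decoder. The module names, tensor shapes, shared embedding dimension, and the final mean-pooling are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentiveSUDecoder(nn.Module):
    """Hypothetical sketch of attentive SU selection + cross-modal decoding."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # cross-attention that lets frame features select relevant SUs
        self.select = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # cross-modal decoder: selected SUs query the spatiotemporal features
        self.decode = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, su_embeds):
        # frame_feats: (B, T*R, D) region-aware features (global + regions)
        # su_embeds:   (B, S, D)   text-encoder embeddings of the semantic units
        selected, attn = self.select(frame_feats, su_embeds, su_embeds)
        video, _ = self.decode(self.norm(selected), frame_feats, frame_feats)
        # pool the decoded tokens into a single video representation
        return video.mean(dim=1), attn

# toy usage: 8 frames x (1 global + 4 regions), 32 SU embeddings
feats = torch.randn(2, 8 * 5, 512)
sus = torch.randn(2, 32, 512)
video_repr, attn = AttentiveSUDecoder()(feats, sus)
print(video_repr.shape)  # torch.Size([2, 512])
```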

URL

https://arxiv.org/abs/2303.09756

PDF

https://arxiv.org/pdf/2303.09756.pdf

