Paper Reading AI Learner

Learning Higher-order Object Interactions for Keypoint-based Video Understanding

2023-05-16 15:30:33
Yi Huang, Asim Kadav, Farley Lai, Deep Patel, Hans Peter Graf

Abstract

Action recognition is an important problem that requires identifying actions in video by learning complex interactions among scene actors and objects. However, modern deep-learning-based networks often require significant computation and may capture scene context using various modalities, further increasing compute costs. Efficient methods, such as those used for AR/VR, often use only human-keypoint information but suffer from a loss of scene context that hurts accuracy. In this paper, we describe an action-localization method, KeyNet, that uses only keypoint data for tracking and action recognition. Specifically, KeyNet introduces object-based keypoint information to capture context in the scene. Our method illustrates how to build a structured intermediate representation that enables modeling higher-order interactions in the scene from object and human keypoints, without using any RGB information. We find that KeyNet is able to track and classify human actions at just 5 FPS. More importantly, we demonstrate that object keypoints can be modeled to recover any loss of context caused by using only keypoint information, on the AVA action and Kinetics datasets.
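The abstract describes building a structured intermediate representation from human and object keypoints alone, with no RGB input. As a minimal illustrative sketch of that idea (not KeyNet's actual architecture; all shapes, names, and the corner-plus-center object-keypoint scheme here are assumptions), one could flatten per-frame human pose keypoints and object-derived keypoints into a fixed-size feature vector:

```python
# Illustrative sketch only: assembling a keypoint-only per-frame
# representation from human pose keypoints and object boxes, as the
# abstract describes (no RGB). Shapes and names are assumptions.
import numpy as np

N_HUMAN_KP = 17   # e.g., COCO-style human pose keypoints (assumption)
N_OBJ_KP = 5      # e.g., 4 box corners + center per object (assumption)

def human_keypoints_to_feature(kps):
    """Flatten (N_HUMAN_KP, 2) arrays of (x, y) keypoints into a 1-D feature."""
    kps = np.asarray(kps, dtype=np.float32)
    assert kps.shape == (N_HUMAN_KP, 2)
    return kps.reshape(-1)

def object_box_to_keypoints(box):
    """Derive 5 keypoints (corners + center) from an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return np.array([[x1, y1], [x2, y1], [x1, y2], [x2, y2],
                     [(x1 + x2) / 2, (y1 + y2) / 2]], dtype=np.float32)

def frame_representation(human_kps, object_boxes, max_objects=4):
    """Concatenate human keypoints with (zero-padded) object keypoints."""
    parts = [human_keypoints_to_feature(human_kps)]
    for i in range(max_objects):
        if i < len(object_boxes):
            obj = object_box_to_keypoints(object_boxes[i])
        else:
            obj = np.zeros((N_OBJ_KP, 2), dtype=np.float32)  # pad missing slot
        parts.append(obj.reshape(-1))
    # Fixed length: 17*2 human coords + max_objects*5*2 object coords
    return np.concatenate(parts)

feat = frame_representation(np.zeros((17, 2)), [(10, 10, 50, 80)])
print(feat.shape)  # (74,)
```

A downstream model (e.g., an attention or interaction network) would then consume a sequence of such per-frame vectors to reason about higher-order actor-object interactions; that modeling step is the paper's contribution and is not reproduced here.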

URL

https://arxiv.org/abs/2305.09539

PDF

https://arxiv.org/pdf/2305.09539.pdf

