Paper Reading AI Learner

Cross-view Action Recognition Understanding From Exocentric to Egocentric Perspective

2023-05-25 04:14:49
Thanh-Dat Truong, Khoa Luu

Abstract

Understanding action recognition in egocentric videos has emerged as a vital research topic with numerous practical applications. Because egocentric data collection is limited in scale, learning robust deep-learning-based action recognition models remains difficult. Transferring knowledge learned from large-scale exocentric data to egocentric data is also challenging, since videos differ substantially across the two views. Our work introduces a novel cross-view learning approach to action recognition (CVAR) that effectively transfers knowledge from the exocentric to the egocentric view. First, we introduce a novel geometric-based constraint into the self-attention mechanism of the Transformer, derived from an analysis of the camera positions of the two views. Then, we propose a new cross-view self-attention loss, learned on unpaired cross-view data, that enforces the self-attention mechanism to transfer knowledge across views. Finally, to further improve the performance of our cross-view learning approach, we present metrics that effectively measure the correlations in videos and attention maps. Experimental results on standard egocentric action recognition benchmarks, i.e., Charades-Ego, EPIC-Kitchens-55, and EPIC-Kitchens-100, demonstrate our approach's effectiveness and state-of-the-art performance.
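The abstract does not spell out the cross-view self-attention loss, but a minimal sketch can make the idea concrete. The Python sketch below assumes a shared video Transformer that exposes self-attention maps of shape (batch, heads, queries, keys) for unpaired exocentric and egocentric clips, and uses a symmetric KL divergence between query-averaged attention distributions as one plausible form of such a loss; the function names and the choice of divergence are hypothetical, not the authors' exact formulation.

import torch

def attention_distribution(attn: torch.Tensor) -> torch.Tensor:
    # Reduce a self-attention map (batch, heads, queries, keys) to a per-head
    # distribution over keys (batch, heads, keys) by averaging over queries,
    # so unpaired clips from the two views stay comparable.
    return attn.mean(dim=2)

def cross_view_attention_loss(attn_exo: torch.Tensor,
                              attn_ego: torch.Tensor,
                              eps: float = 1e-8) -> torch.Tensor:
    # Hypothetical cross-view self-attention loss: a symmetric KL divergence
    # between the averaged attention distributions of the exocentric and
    # egocentric views (one plausible reading of the abstract, not the
    # paper's verified loss).
    p = attention_distribution(attn_exo).clamp_min(eps)
    q = attention_distribution(attn_ego).clamp_min(eps)
    kl_pq = (p * (p / q).log()).sum(dim=-1)
    kl_qp = (q * (q / p).log()).sum(dim=-1)
    return 0.5 * (kl_pq + kl_qp).mean()

In practice, a term like this would presumably be added to the standard action-classification loss with a weighting hyperparameter, with the exocentric and egocentric clips drawn from different datasets in the same batch.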

URL

https://arxiv.org/abs/2305.15699

PDF

https://arxiv.org/pdf/2305.15699.pdf

