Unified Framework with Consistency across Modalities for Human Activity Recognition

2024-09-04 02:25:10
Tuyen Tran, Thao Minh Le, Hung Tran, Truyen Tran

Abstract

Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on single input modalities, such as RGB or skeletal data, limiting their ability to exploit the complementary advantages across modalities. Recent studies focus on combining these two modalities using simple feature fusion techniques. However, due to the inherent disparities in representation between these input modalities, designing a unified neural network architecture to effectively leverage their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is the introduction of a novel compositional query machine, called COMPUTER (COMPositional hUman-cenTric quERy machine), a generic neural architecture that models the interactions between a human of interest and its surroundings in both space and time. Thanks to its versatile design, COMPUTER can be leveraged to distill distinctive representations for various input modalities. Additionally, we introduce a consistency loss that enforces agreement in prediction between modalities, exploiting the complementary information from multimodal inputs for robust human movement recognition. Through extensive experiments on action localization and group activity recognition tasks, our approach demonstrates superior performance compared with state-of-the-art methods. Our code is available at: this https URL.
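The abstract describes the method only at a high level, so the following is a minimal PyTorch sketch of the two named ideas, under explicit assumptions: the human-centric query machine is approximated here by standard cross-attention (the paper's actual COMPUTER design is not specified in this entry), and the consistency loss is rendered as per-modality cross-entropy plus a symmetric KL agreement term between the RGB and skeleton heads. All names (HumanCentricQuery, consistency_loss, alpha) are illustrative assumptions, not taken from the paper.

```python
# Minimal, hypothetical sketch -- NOT the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HumanCentricQuery(nn.Module):
    """Stand-in for the COMPUTER query machine: a per-person query token
    attends to spatio-temporal context tokens via standard cross-attention.
    (Assumed structure; the paper's exact design is not given in this entry.)"""

    def __init__(self, embed_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, human: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # human:   (B, 1, D) feature of the person of interest
        # context: (B, N, D) surrounding spatio-temporal tokens
        attended, _ = self.attn(query=human, key=context, value=context)
        return self.norm(human + attended)  # residual update of the human token


def consistency_loss(logits_rgb, logits_skel, labels, alpha: float = 1.0):
    """Per-modality cross-entropy plus a symmetric-KL agreement term that
    pushes the RGB and skeleton heads toward the same class distribution."""
    ce = F.cross_entropy(logits_rgb, labels) + F.cross_entropy(logits_skel, labels)
    log_p = F.log_softmax(logits_rgb, dim=-1)
    log_q = F.log_softmax(logits_skel, dim=-1)
    sym_kl = 0.5 * (
        F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)    # KL(p || q)
        + F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)  # KL(q || p)
    )
    return ce + alpha * sym_kl


if __name__ == "__main__":
    B, N, D, C = 2, 16, 256, 10  # batch, context tokens, feature dim, classes
    module = HumanCentricQuery(D)
    fused = module(torch.randn(B, 1, D), torch.randn(B, N, D))  # (B, 1, D)
    loss = consistency_loss(
        torch.randn(B, C, requires_grad=True),
        torch.randn(B, C, requires_grad=True),
        torch.randint(0, C, (B,)),
    )
    loss.backward()
```

The symmetric form is one common way to encode "agreement in prediction between modalities"; the paper may weight or formulate the term differently.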

URL

https://arxiv.org/abs/2409.02385

PDF

https://arxiv.org/pdf/2409.02385.pdf

