Abstract
Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on a single input modality, such as RGB frames or skeletal data, limiting their ability to exploit the complementary strengths of multiple modalities. Recent work combines these two modalities with simple feature-fusion techniques. However, because of the inherent disparities between their representations, designing a unified neural network architecture that effectively leverages their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is a novel compositional query machine, called COMPUTER (COMPositional hUman-cenTric quERy machine), a generic neural architecture that models the interactions between a human of interest and their surroundings in both space and time. Thanks to its versatile design, COMPUTER can be leveraged to distill distinctive representations from various input modalities. Additionally, we introduce a consistency loss that enforces agreement between the predictions of different modalities, exploiting complementary multimodal information for robust human movement recognition. Through extensive experiments on action localization and group activity recognition tasks, our approach demonstrates superior performance compared with state-of-the-art methods. Our code is available at: this https URL.
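The abstract does not specify how the consistency loss is defined; as one plausible illustration, a minimal sketch below assumes it is a symmetric KL divergence between the class distributions predicted from the RGB stream and the skeleton stream (the function names `consistency_loss`, `rgb_logits`, and `skeleton_logits` are hypothetical, not from the paper):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(rgb_logits, skeleton_logits):
    """Illustrative symmetric-KL consistency loss: penalizes disagreement
    between the per-modality class predictions (an assumption; the paper's
    exact formulation may differ)."""
    p = softmax(rgb_logits)
    q = softmax(skeleton_logits)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

Under this assumption, the loss is zero when both modalities predict identical distributions and grows as their predictions diverge, which is the agreement-enforcing behavior the abstract describes.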
URL
https://arxiv.org/abs/2409.02385