Abstract
Action recognition is an important problem that requires identifying actions in video by learning complex interactions among scene actors and objects. However, modern deep-learning-based networks often require significant computation, and they may capture scene context through additional modalities that further increase compute costs. Efficient methods, such as those used for AR/VR, often rely only on human-keypoint information, but they suffer a loss of scene context that hurts accuracy. In this paper, we describe an action-localization method, KeyNet, that uses only keypoint data for tracking and action recognition. Specifically, KeyNet introduces object-based keypoint information to capture scene context. Our method illustrates how to build a structured intermediate representation that allows modeling higher-order interactions in the scene from object and human keypoints, without using any RGB information. We find that KeyNet is able to track and classify human actions at just 5 FPS. More importantly, we demonstrate that object keypoints can be modeled to recover the loss of context incurred by using only keypoint information, on the AVA Action and Kinetics datasets.
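The core idea of the abstract — combining human and object keypoints into a single structured, RGB-free input — can be sketched as follows. This is a hypothetical illustration, not KeyNet's actual architecture: the keypoint counts, the (x, y, confidence) layout, and the type-flag encoding are all assumptions made for the example.

```python
import numpy as np

def build_keypoint_representation(human_kps, object_kps):
    """Stack per-frame human and object keypoints into one structured tensor.

    human_kps:  (T, H, 3) array of (x, y, confidence) for H human joints
    object_kps: (T, O, 3) array for O object keypoints (e.g. box corners)

    Returns a (T, H + O, 4) tensor with a type flag appended to each
    keypoint (0 = human, 1 = object), so a downstream model can
    distinguish the two sources without any RGB input.
    """
    T, H, _ = human_kps.shape
    _, O, _ = object_kps.shape
    # Append the source-type flag as a fourth channel.
    human = np.concatenate([human_kps, np.zeros((T, H, 1))], axis=-1)
    obj = np.concatenate([object_kps, np.ones((T, O, 1))], axis=-1)
    # Concatenate along the keypoint axis: humans first, then objects.
    return np.concatenate([human, obj], axis=1)

# Example: 8 frames, 17 human joints (COCO-style), 4 object keypoints.
rep = build_keypoint_representation(
    np.random.rand(8, 17, 3), np.random.rand(8, 4, 3))
print(rep.shape)  # (8, 21, 4)
```

A sequence model or per-frame classifier could then consume this tensor directly; the point of the sketch is only that scene context re-enters through the object keypoints while the input stays far smaller than raw video.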
URL
https://arxiv.org/abs/2305.09539