Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition

2024-04-18 05:06:12
Xunsong Li, Pengzhan Sun, Yangcen Liu, Lixin Duan, Wen Li

Abstract

The interactions between humans and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected with a pretrained detector and then fed into an action recognition model to extract video features and learn object relations for action recognition. However, since the action prior is unknown during object detection, important objects can easily be overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning in one stage. In particular, after extracting video features with a base network, we introduce three modules for concurrent object detection and interaction reasoning. First, a Patch-based Object Decoder generates proposals from video patch tokens. Then, an Interactive Object Refining and Aggregation module identifies the objects important for action recognition, adjusts proposal scores based on position and appearance, and aggregates object-level information into a global video representation. Finally, an Object Relation Modeling module encodes object relations. These three modules, together with the video feature extractor, can be trained jointly in an end-to-end fashion, avoiding the heavy reliance on an off-the-shelf object detector and reducing the burden of multi-stage training. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the performance of our proposed approach on conventional, compositional, and few-shot action recognition tasks. Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and our approach outperforms state-of-the-art methods on both datasets.
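
The abstract describes a one-stage pipeline in which proposal decoding, interactivity-based refinement, and relation modeling are trained jointly on top of the video backbone. The PyTorch sketch below illustrates how such a pipeline could be wired together; all class names, dimensions, head designs (e.g. the DETR-style learnable queries), and the 174-way label space are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PatchObjectDecoder(nn.Module):
    """Generates object proposals from video patch tokens via learnable
    object queries (a DETR-style design assumed here for illustration)."""
    def __init__(self, dim=256, num_queries=8, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.box_head = nn.Linear(dim, 4)    # normalized (cx, cy, w, h)
        self.score_head = nn.Linear(dim, 1)  # proposal score

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) tokens from the video feature extractor
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        feats = self.decoder(q, patch_tokens)       # (B, Q, D) object features
        boxes = self.box_head(feats).sigmoid()      # (B, Q, 4)
        scores = self.score_head(feats).sigmoid()   # (B, Q, 1)
        return feats, boxes, scores

class InteractiveRefineAggregate(nn.Module):
    """Adjusts proposal scores from position (box) and appearance cues, then
    score-weights object features into one global video representation."""
    def __init__(self, dim=256):
        super().__init__()
        self.refine = nn.Sequential(nn.Linear(dim + 4, dim), nn.ReLU(),
                                    nn.Linear(dim, 1))

    def forward(self, feats, boxes, scores):
        delta = self.refine(torch.cat([feats, boxes], dim=-1))  # (B, Q, 1)
        w = (scores + delta).softmax(dim=1)     # interactivity weights over proposals
        return (w * feats).sum(dim=1), w        # (B, D) global vector, weights

class ObjectRelationModel(nn.Module):
    """Encodes pairwise object relations with self-attention over proposals."""
    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, feats):
        return self.encoder(feats).mean(dim=1)  # (B, D) relation summary

class OneStageActionRecognizer(nn.Module):
    """Joins the three modules into an end-to-end trainable head, so the
    detector needs no separate pretraining stage."""
    def __init__(self, dim=256, num_classes=174):  # 174: assumed label space
        super().__init__()
        self.decoder = PatchObjectDecoder(dim)
        self.refine = InteractiveRefineAggregate(dim)
        self.relations = ObjectRelationModel(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, patch_tokens):
        feats, boxes, scores = self.decoder(patch_tokens)
        global_vec, _ = self.refine(feats, boxes, scores)
        relation_vec = self.relations(feats)
        return self.classifier(torch.cat([global_vec, relation_vec], dim=-1))

# Toy usage: 49 patch tokens of width 256 from a hypothetical backbone.
model = OneStageActionRecognizer()
logits = model(torch.randn(2, 49, 256))  # (2, 174) action logits
```

Because the proposal scores flow through the classification loss, gradients can push the decoder toward objects that matter for the action, which is the stated motivation for folding detection into the recognition stage.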

URL

https://arxiv.org/abs/2404.11903

PDF

https://arxiv.org/pdf/2404.11903.pdf

