Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition

2024-04-18 05:06:12
Xunsong Li, Pengzhan Sun, Yangcen Liu, Lixin Duan, Wen Li

Abstract

The interactions between humans and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected with a pretrained detector and then fed into an action recognition model to extract video features and learn object relations for action recognition. However, since the action prior is unknown during object detection, important objects can easily be overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning in one stage. In particular, after extracting video features with a base network, we introduce three modules for concurrent object detection and interaction reasoning. First, a Patch-based Object Decoder generates proposals from video patch tokens. Then, an Interactive Object Refining and Aggregation module identifies the objects important for action recognition, adjusts proposal scores based on position and appearance, and aggregates object-level information into a global video representation. Finally, an Object Relation Modeling module encodes object relations. These three modules, together with the video feature extractor, can be trained jointly in an end-to-end fashion, avoiding the heavy reliance on an off-the-shelf object detector and reducing the burden of multi-stage training. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the performance of our proposed approach on conventional, compositional, and few-shot action recognition tasks. Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and our approach outperforms state-of-the-art methods on both datasets.
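
The abstract describes a one-stage pipeline in which proposal decoding, interactivity-based refinement, and relation modeling are trained jointly on top of the video backbone. The PyTorch sketch below illustrates how such a pipeline could be wired together; all class names, dimensions, head designs (e.g. the DETR-style learnable queries), and the 174-way label space are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PatchObjectDecoder(nn.Module):
    """Generates object proposals from video patch tokens via learnable
    object queries (a DETR-style design assumed here for illustration)."""
    def __init__(self, dim=256, num_queries=8, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.box_head = nn.Linear(dim, 4)    # normalized (cx, cy, w, h)
        self.score_head = nn.Linear(dim, 1)  # proposal score

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) tokens from the video feature extractor
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        feats = self.decoder(q, patch_tokens)       # (B, Q, D) object features
        boxes = self.box_head(feats).sigmoid()      # (B, Q, 4)
        scores = self.score_head(feats).sigmoid()   # (B, Q, 1)
        return feats, boxes, scores

class InteractiveRefineAggregate(nn.Module):
    """Adjusts proposal scores from position (box) and appearance cues, then
    score-weights object features into one global video representation."""
    def __init__(self, dim=256):
        super().__init__()
        self.refine = nn.Sequential(nn.Linear(dim + 4, dim), nn.ReLU(),
                                    nn.Linear(dim, 1))

    def forward(self, feats, boxes, scores):
        delta = self.refine(torch.cat([feats, boxes], dim=-1))  # (B, Q, 1)
        w = (scores + delta).softmax(dim=1)     # interactivity weights over proposals
        return (w * feats).sum(dim=1), w        # (B, D) global vector, weights

class ObjectRelationModel(nn.Module):
    """Encodes pairwise object relations with self-attention over proposals."""
    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, feats):
        return self.encoder(feats).mean(dim=1)  # (B, D) relation summary

class OneStageActionRecognizer(nn.Module):
    """Joins the three modules into an end-to-end trainable head, so the
    detector needs no separate pretraining stage."""
    def __init__(self, dim=256, num_classes=174):  # 174: assumed label space
        super().__init__()
        self.decoder = PatchObjectDecoder(dim)
        self.refine = InteractiveRefineAggregate(dim)
        self.relations = ObjectRelationModel(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, patch_tokens):
        feats, boxes, scores = self.decoder(patch_tokens)
        global_vec, _ = self.refine(feats, boxes, scores)
        relation_vec = self.relations(feats)
        return self.classifier(torch.cat([global_vec, relation_vec], dim=-1))

# Toy usage: 49 patch tokens of width 256 from a hypothetical backbone.
model = OneStageActionRecognizer()
logits = model(torch.randn(2, 49, 256))  # (2, 174) action logits
```

Because the proposal scores flow through the classification loss, gradients can push the decoder toward objects that matter for the action, which is the stated motivation for folding detection into the recognition stage.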

URL

https://arxiv.org/abs/2404.11903

PDF

https://arxiv.org/pdf/2404.11903.pdf

