Paper Reading AI Learner

Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization

2023-12-29 17:08:38
Ioanna Ntinou, Enrique Sanchez, Georgios Tzimiropoulos

Abstract

Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding-box detections pre-computed at high resolution and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. Single-stage methods, on the other hand, target both tasks by devoting part of the network (generally the backbone) to sharing the majority of the workload, trading performance for speed. These methods build on adding a DETR head with learnable queries that, after cross- and self-attention, can be sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur significant complexity. In this paper, we observe that a straightforward bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that can perform both tasks without the need for an extra encoder-decoder head and learnable queries. We show that a single MViT-S architecture trained with bipartite matching to perform both tasks surpasses the same MViT-S trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our MViTv2-S model achieves +3 mAP on AVA2.2 w.r.t. the two-stage counterpart. Code and models will be released after paper revision.
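The core idea in the abstract is to apply a set-prediction (bipartite matching) loss directly to a transformer's output tokens, as in DETR, rather than to learnable decoder queries. A minimal sketch of such a matching step is shown below, assuming token-level box and class predictions; the function name, cost weights, and box parameterization are illustrative assumptions, not the authors' implementation. The Hungarian step uses SciPy's `linear_sum_assignment`.

```python
# Hedged sketch: bipartite matching between output tokens and ground-truth
# boxes in the spirit of a DETR set-prediction loss. Not the paper's code.
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_tokens_to_targets(pred_boxes, pred_logits, gt_boxes, gt_labels,
                            w_box=5.0, w_cls=1.0):
    """Return (token_idx, gt_idx) pairs minimizing a combined matching cost.

    pred_boxes:  (T, 4) predicted box per output token, (cx, cy, w, h)
    pred_logits: (T, C) class logits per token
    gt_boxes:    (G, 4) ground-truth boxes
    gt_labels:   (G,)   ground-truth class indices
    """
    # Classification cost: negative softmax probability of the true class.
    probs = np.exp(pred_logits - pred_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)          # (T, C)
    cost_cls = -probs[:, gt_labels]                    # (T, G)

    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)

    cost = w_cls * cost_cls + w_box * cost_box         # (T, G)
    token_idx, gt_idx = linear_sum_assignment(cost)    # Hungarian matching
    return token_idx, gt_idx
```

Only the matched tokens would then contribute box and action-classification losses, which is what lets a plain backbone + MLP head handle both tasks without an extra encoder-decoder.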

URL

https://arxiv.org/abs/2312.17686

PDF

https://arxiv.org/pdf/2312.17686.pdf

