Paper Reading AI Learner

Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization

2023-12-29 17:08:38
Ioanna Ntinou, Enrique Sanchez, Georgios Tzimiropoulos

Abstract

Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding-box detections pre-computed at high resolution and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. Single-stage methods, on the other hand, target both tasks by devoting part of the network (generally the backbone) to sharing the majority of the workload, trading performance for speed. These methods build on adding a DETR head with learnable queries that, after cross- and self-attention, can be sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur significant complexity. In this paper, we observe that a straightforward bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that can perform both tasks without the need for an extra encoder-decoder head and learnable queries. We show that a single MViT-S architecture trained with bipartite matching to perform both tasks surpasses the same MViT-S trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our MViTv2-S model achieves +3 mAP on AVA2.2 w.r.t. the two-stage counterpart. Code and models will be released after paper revision.
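The core idea in the abstract is to apply a set-prediction (bipartite matching) loss directly to a transformer's output tokens, as in DETR, rather than to learnable decoder queries. A minimal sketch of such a matching step is shown below, assuming token-level box and class predictions; the function name, cost weights, and box parameterization are illustrative assumptions, not the authors' implementation. The Hungarian step uses SciPy's `linear_sum_assignment`.

```python
# Hedged sketch: bipartite matching between output tokens and ground-truth
# boxes in the spirit of a DETR set-prediction loss. Not the paper's code.
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_tokens_to_targets(pred_boxes, pred_logits, gt_boxes, gt_labels,
                            w_box=5.0, w_cls=1.0):
    """Return (token_idx, gt_idx) pairs minimizing a combined matching cost.

    pred_boxes:  (T, 4) predicted box per output token, (cx, cy, w, h)
    pred_logits: (T, C) class logits per token
    gt_boxes:    (G, 4) ground-truth boxes
    gt_labels:   (G,)   ground-truth class indices
    """
    # Classification cost: negative softmax probability of the true class.
    probs = np.exp(pred_logits - pred_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)          # (T, C)
    cost_cls = -probs[:, gt_labels]                    # (T, G)

    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)

    cost = w_cls * cost_cls + w_box * cost_box         # (T, G)
    token_idx, gt_idx = linear_sum_assignment(cost)    # Hungarian matching
    return token_idx, gt_idx
```

Only the matched tokens would then contribute box and action-classification losses, which is what lets a plain backbone + MLP head handle both tasks without an extra encoder-decoder.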

URL

https://arxiv.org/abs/2312.17686

PDF

https://arxiv.org/pdf/2312.17686.pdf

