Abstract
The transformer framework has shown superior performance in visual object tracking owing to its great strength in aggregating information across the template and search image with the well-known attention mechanism. Most recent advances focus on exploring attention-mechanism variants for better information aggregation. We find these schemes are equivalent to, or even merely a subset of, the basic self-attention mechanism. In this paper, we prove that the vanilla self-attention structure is sufficient for information aggregation, and that structural adaptation is unnecessary. The key is not the attention structure, but how to extract the discriminative feature for tracking and enhance the communication between the target and search image. Based on this finding, we adopt the basic vision transformer (ViT) architecture as our main tracker and concatenate the template and search image for feature embedding. To guide the encoder to capture the invariant feature for tracking, we attach a lightweight correlative masked decoder which reconstructs the original template and search image from the corresponding masked tokens. The correlative masked decoder serves as a plugin for the compact transformer tracker and is skipped at inference. Our compact tracker uses the simplest structure, consisting only of a ViT backbone and a box head, and runs at 40 fps. Extensive experiments show that the proposed compact transformer tracker outperforms existing approaches, including advanced attention variants, and demonstrates the sufficiency of self-attention for tracking tasks. Our method achieves state-of-the-art performance on five challenging datasets: the VOT2020, UAV123, LaSOT, TrackingNet, and GOT-10k benchmarks. Our project is available at this https URL.
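The architecture described above (a plain ViT-style encoder over concatenated template and search tokens, followed by a box head) can be sketched in PyTorch as below. This is a minimal illustration under assumed dimensions and layer counts, not the paper's exact configuration; all names (`CompactTrackerSketch`, the pooling choice, the box parameterization) are hypothetical.

```python
import torch
import torch.nn as nn

class CompactTrackerSketch(nn.Module):
    """Illustrative sketch: vanilla self-attention over the concatenation
    of template and search tokens, then a box head on the search tokens.
    Sizes (embed_dim=256, depth=4, 16x16 patches) are assumptions."""

    def __init__(self, embed_dim=256, depth=4, num_heads=8,
                 patch=16, template_size=128, search_size=256):
        super().__init__()
        # Shared patch embedding for both images (ViT-style conv stem).
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        n_t = (template_size // patch) ** 2   # number of template tokens
        self.n_s = (search_size // patch) ** 2  # number of search tokens
        self.pos = nn.Parameter(torch.zeros(1, n_t + self.n_s, embed_dim))
        # Plain transformer encoder: basic self-attention, no variants.
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Box head: predict a normalized (cx, cy, w, h) box.
        self.box_head = nn.Sequential(nn.Linear(embed_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 4),
                                      nn.Sigmoid())

    def forward(self, template_img, search_img):
        # Embed each image into patch tokens and concatenate, so that
        # self-attention aggregates information across both images.
        t = self.patch_embed(template_img).flatten(2).transpose(1, 2)
        s = self.patch_embed(search_img).flatten(2).transpose(1, 2)
        x = torch.cat([t, s], dim=1) + self.pos
        x = self.encoder(x)
        search_tokens = x[:, -self.n_s:]        # keep only search tokens
        return self.box_head(search_tokens.mean(dim=1))
```

At inference only this encoder and box head run; the correlative masked decoder from the paper (omitted here) would attach during training to reconstruct the masked template and search images.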
URL
https://arxiv.org/abs/2301.10938