Abstract
Transformer-based trackers have achieved strong accuracy on standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. In this paper, to overcome this issue, we propose a fully transformer tracking framework, coined \emph{MixFormerV2}, without any dense convolutional operations or a complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search area. Then, we apply a unified transformer backbone to this mixed token sequence. These prediction tokens are able to capture the complex correlation between the target template and search area via mixed attention. Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm comprising dense-to-sparse distillation and deep-to-shallow distillation. The former transfers knowledge from the dense-head-based MixViT to our fully transformer tracker, while the latter prunes some layers of the backbone. We instantiate two versions of MixFormerV2: MixFormerV2-B achieves an AUC of 70.6\% on LaSOT and an AUC of 57.4\% on TNL2K at a high GPU speed of 165 FPS, while MixFormerV2-S surpasses FEAR-L by 2.7\% AUC on LaSOT at a real-time CPU speed.
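The core mechanism described above, prediction tokens concatenated into one sequence with template and search tokens, then read out by MLP heads, can be illustrated with a toy sketch. This is a minimal pure-Python illustration with a single attention layer, identity Q/K/V projections, and random (untrained) head weights; the dimensions, token counts beyond the four prediction tokens, and head shapes are assumptions for illustration, not the paper's actual architecture.

```python
import math
import random

random.seed(0)
DIM = 8      # toy embedding dimension; the real model uses ViT-scale widths
N_PRED = 4   # the four special prediction tokens from the paper

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """One toy self-attention layer with identity Q/K/V projections.

    The prediction tokens sit in the same sequence as the template and
    search tokens, so this pass is where they gather ("mix") information
    from both, standing in for the paper's mixed attention.
    """
    out = []
    for q in tokens:
        scores = softmax([
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(DIM)
            for k in tokens
        ])
        out.append([
            sum(w * v[d] for w, v in zip(scores, tokens))
            for d in range(DIM)
        ])
    return out

# Build the mixed token sequence: prediction + template + search tokens.
pred_tokens = [rand_vec(DIM) for _ in range(N_PRED)]
template_tokens = [rand_vec(DIM) for _ in range(4)]   # from the target template
search_tokens = [rand_vec(DIM) for _ in range(16)]    # from the search area

mixed = self_attention(pred_tokens + template_tokens + search_tokens)
pred_out = mixed[:N_PRED]  # updated prediction tokens after mixing

# Toy heads with random weights (the real MLP heads are learned):
# each prediction token regresses one box value; pooled tokens give a score.
w_box = rand_vec(DIM)
box = [sum(w * t for w, t in zip(w_box, tok)) for tok in pred_out]

pooled = [sum(tok[d] for tok in pred_out) / N_PRED for d in range(DIM)]
w_score = rand_vec(DIM)
score = 1.0 / (1.0 + math.exp(-sum(w * p for w, p in zip(w_score, pooled))))

print("box:", [round(b, 3) for b in box])
print("confidence:", round(score, 3))
```

Because each of the four prediction tokens attends over the whole mixed sequence, a single linear readout per token suffices to produce a four-value box and a confidence score, which is why no dense convolutional head is needed.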
URL
https://arxiv.org/abs/2305.15896