Paper Reading AI Learner

MixFormerV2: Efficient Fully Transformer Tracking

2023-05-25 09:50:54
Yutao Cui, Tianhui Song, Gangshan Wu, Limin Wang

Abstract

Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. In this paper, to overcome this issue, we propose a fully transformer tracking framework, coined as \emph{MixFormerV2}, without any dense convolutional operation or complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search area. Then, we apply the unified transformer backbone on this mixed token sequence. These prediction tokens are able to capture the complex correlation between target template and search area via mixed attentions. Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm, including dense-to-sparse distillation and deep-to-shallow distillation. The former transfers knowledge from the dense-head-based MixViT to our fully transformer tracker, while the latter is used to prune some layers of the backbone. We instantiate two types of MixFormerV2, where MixFormerV2-B achieves an AUC of 70.6\% on LaSOT and an AUC of 57.4\% on TNL2K with a high GPU speed of 165 FPS, and MixFormerV2-S surpasses FEAR-L by 2.7\% AUC on LaSOT while running at real-time CPU speed.
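The core idea — appending a few learnable prediction tokens to the template and search tokens, running self-attention over the mixed sequence, and reading the box and confidence off the prediction tokens with MLP heads — can be sketched minimally. This is not the authors' implementation: the token counts, dimension, single attention head, and the toy heads below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # token dimension (illustrative; the paper uses a ViT-scale width)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head attention over the full mixed sequence: every prediction
    # token attends to template and search tokens (and vice versa),
    # realizing the "mixed attention" described in the abstract.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))
    return attn @ v

# Hypothetical token counts for a 7x7 template and 14x14 search patch grid.
template = rng.standard_normal((49, D))   # target-template tokens
search   = rng.standard_normal((196, D))  # search-area tokens
pred     = rng.standard_normal((4, D))    # the four special prediction tokens

# Concatenate all tokens into one mixed sequence for the unified backbone.
mixed = np.concatenate([template, search, pred], axis=0)

Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
out = self_attention(mixed, Wq, Wk, Wv)

# The prediction tokens are the last four rows after attention.
pred_out = out[-4:]

# Toy MLP heads: one box coordinate per prediction token, plus a
# sigmoid confidence score pooled over the four tokens.
W_box = rng.standard_normal((D, 1)) / np.sqrt(D)
box = (pred_out @ W_box).ravel()                   # shape (4,)
conf = float(1.0 / (1.0 + np.exp(-pred_out.mean())))

print(mixed.shape, box.shape, 0.0 <= conf <= 1.0)
```

In the actual tracker the backbone stacks many such mixed-attention layers and the heads are small MLPs, but the data flow — mixed sequence in, box and score read from the dedicated prediction tokens — is the one shown here, which is what lets the design drop dense convolutional heads and the separate score-prediction module.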


URL

https://arxiv.org/abs/2305.15896

PDF

https://arxiv.org/pdf/2305.15896.pdf

