Paper Reading AI Learner

Compact Transformer Tracker with Correlative Masked Modeling

2023-01-26 04:58:08
Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang

Abstract

The transformer framework has shown superior performance in visual object tracking thanks to its strength in aggregating information across the template and search image with the well-known attention mechanism. Most recent advances focus on exploring attention mechanism variants for better information aggregation. We find these schemes are equivalent to, or even just a subset of, the basic self-attention mechanism. In this paper, we prove that the vanilla self-attention structure is sufficient for information aggregation, and structural adaptation is unnecessary. The key is not the attention structure, but how to extract the discriminative feature for tracking and enhance the communication between the target and search image. Based on this finding, we adopt the basic vision transformer (ViT) architecture as our main tracker and concatenate the template and search image for feature embedding. To guide the encoder to capture the invariant feature for tracking, we attach a lightweight correlative masked decoder which reconstructs the original template and search image from the corresponding masked tokens. The correlative masked decoder serves as a plugin for the compact transformer tracker and is skipped at inference. Our compact tracker uses the simplest structure, consisting only of a ViT backbone and a box head, and can run at 40 fps. Extensive experiments show the proposed compact transformer tracker outperforms existing approaches, including advanced attention variants, and demonstrates the sufficiency of self-attention in tracking tasks. Our method achieves state-of-the-art performance on five challenging datasets: the VOT2020, UAV123, LaSOT, TrackingNet, and GOT-10k benchmarks. Our project is available at this https URL.
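The abstract's central claim is that vanilla self-attention over the concatenated template and search tokens already performs the cross-image information aggregation that attention variants are designed for: once both token sets sit in one sequence, every search token attends to every template token (and vice versa) in a single pass. A minimal NumPy sketch of this idea, with hypothetical token counts and dimensions (an 8×8 template patch grid and a 16×16 search patch grid, not values from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k, seed=0):
    # Vanilla single-head self-attention over the joint sequence.
    # Because template and search tokens are concatenated, the
    # attention matrix mixes information across both images --
    # no specialised cross-attention module is required.
    rng = np.random.default_rng(seed)
    d_in = tokens.shape[-1]
    Wq = rng.standard_normal((d_in, d_k))
    Wk = rng.standard_normal((d_in, d_k))
    Wv = rng.standard_normal((d_in, d_k))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))      # (N, N), N = template + search tokens
    return attn @ V

# Hypothetical patch-token embeddings (counts/dims chosen for illustration).
rng = np.random.default_rng(1)
template = rng.standard_normal((64, 32))        # 8x8 template patches
search   = rng.standard_normal((256, 32))       # 16x16 search patches
joint    = np.concatenate([template, search], axis=0)   # (320, 32)

out = self_attention(joint, d_k=16)
print(out.shape)                                # (320, 16)
```

The slice `out[64:]` then holds search-region tokens that have already aggregated template information, which is the input a box head would consume; the correlative masked decoder described in the abstract is a training-only addition on top of such an encoder.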


URL

https://arxiv.org/abs/2301.10938

PDF

https://arxiv.org/pdf/2301.10938.pdf

