Paper Reading AI Learner

SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking

2023-03-08 05:01:00
Liangliang Yao, Changhong Fu, Sihang Li, Guangze Zheng, Junjie Ye

Abstract

Vision-based object tracking has boosted extensive autonomous applications for unmanned aerial vehicles (UAVs). However, the dynamic changes in flight maneuver and viewpoint encountered in UAV tracking pose significant difficulties, e.g., aspect ratio change and scale variation. The conventional cross-correlation operation, while commonly used, has limitations in effectively capturing perceptual similarity and incorporates extraneous background information. To mitigate these limitations, this work presents a novel saliency-guided dynamic vision Transformer (SGDViT) for UAV tracking. The proposed method designs a new task-specific object saliency mining network to refine the cross-correlation operation and effectively discriminate foreground from background information. Additionally, a saliency adaptation embedding operation dynamically generates tokens based on initial saliency, thereby reducing the computational complexity of the Transformer architecture. Finally, a lightweight saliency filtering Transformer further refines the saliency information and increases the focus on appearance information. The efficacy and robustness of the proposed approach have been thoroughly assessed through experiments on three widely used UAV tracking benchmarks and real-world scenarios, with results demonstrating its superiority. The source code and demo videos are available at this https URL.
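The saliency adaptation embedding described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function name, tensor shapes, and the `keep_ratio` parameter are assumptions. The idea it shows is the one stated in the abstract: tokens are generated dynamically from an initial saliency map, so low-saliency background positions are dropped and the Transformer attends over fewer tokens.

```python
# Hedged sketch (assumed shapes, not the SGDViT code): keep only the
# most salient spatial positions of a feature map as Transformer tokens.
import numpy as np

def saliency_adaptive_tokens(features, saliency, keep_ratio=0.5):
    """Select tokens for the most salient positions.

    features: (C, H, W) feature map
    saliency: (H, W) saliency scores, higher = more salient
    Returns:  (K, C) token sequence with K = keep_ratio * H * W
    """
    c, h, w = features.shape
    tokens = features.reshape(c, -1).T           # (H*W, C), one token per position
    scores = saliency.reshape(-1)                # (H*W,)
    k = max(1, int(keep_ratio * h * w))
    idx = np.argsort(scores)[::-1][:k]           # indices of the top-k salient positions
    return tokens[idx]                           # background tokens are discarded

feats = np.random.rand(8, 4, 4)                  # toy 8-channel 4x4 feature map
sal = np.random.rand(4, 4)                       # toy saliency map
out = saliency_adaptive_tokens(feats, sal, keep_ratio=0.5)
print(out.shape)                                 # (8, 8): 8 tokens kept of 16
```

Since self-attention cost grows quadratically with the token count, halving the tokens roughly quarters the attention cost, which is the complexity reduction the abstract claims.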

Abstract (translated)

Vision-based object tracking has enabled extensive autonomous applications for unmanned aerial vehicles (UAVs). However, the dynamic changes in flight maneuver and viewpoint encountered in UAV tracking pose significant difficulties, such as aspect ratio change and scale variation. The conventional cross-correlation operation, while commonly used, has limitations in effectively capturing perceptual similarity and incorporates extraneous background information. To overcome these limitations, this work proposes a task-specific saliency-guided dynamic vision Transformer (SGDViT) that refines the cross-correlation operation and effectively discriminates foreground from background information. In addition, a saliency adaptation embedding operation dynamically generates tokens based on initial saliency, thereby reducing the computational complexity of the Transformer architecture. Finally, a lightweight saliency filtering Transformer further refines the saliency information and increases the focus on appearance information. Experiments on three widely used UAV tracking benchmarks and real-world scenarios demonstrate the superiority of the proposed approach. The source code and demo videos are available on this website.

URL

https://arxiv.org/abs/2303.04378

PDF

https://arxiv.org/pdf/2303.04378.pdf
