Abstract
Vision-based object tracking has enabled a wide range of autonomous applications for unmanned aerial vehicles (UAVs). However, the dynamic changes in flight maneuvers and viewpoint encountered in UAV tracking pose significant difficulties, e.g., aspect ratio change and scale variation. The conventional cross-correlation operation, while commonly used, has limitations in effectively capturing perceptual similarity and incorporates extraneous background information. To mitigate these limitations, this work presents a novel saliency-guided dynamic vision Transformer (SGDViT) for UAV tracking. The proposed method designs a new task-specific object saliency mining network to refine the cross-correlation operation and effectively discriminate foreground from background information. Additionally, a saliency adaptation embedding operation dynamically generates tokens based on initial saliency, thereby reducing the computational complexity of the Transformer architecture. Finally, a lightweight saliency filtering Transformer further refines the saliency information and increases the focus on appearance information. The efficacy and robustness of the proposed approach have been thoroughly assessed through experiments on three widely used UAV tracking benchmarks and in real-world scenarios, with results demonstrating its superiority. The source code and demo videos are available at this https URL.
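As background for the two mechanisms the abstract mentions, here is a minimal NumPy sketch of (a) the conventional cross-correlation used in Siamese-style trackers, where a template feature map is slid over a search-region feature map, and (b) a hypothetical saliency-gated token selection that keeps only the most salient spatial positions as Transformer tokens. Both function names, shapes, and the `keep_ratio` parameter are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cross_correlation(search, template):
    # Plain cross-correlation as used in Siamese-style trackers:
    # slide the template over the search feature map and sum the
    # elementwise products at each spatial offset.
    # search: (C, Hs, Ws), template: (C, Ht, Wt)
    C, Hs, Ws = search.shape
    _, Ht, Wt = template.shape
    out = np.zeros((Hs - Ht + 1, Ws - Wt + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search[:, i:i + Ht, j:j + Wt] * template)
    return out

def select_salient_tokens(features, saliency, keep_ratio=0.5):
    # Hypothetical saliency-gated token selection: keep only the
    # top-k most salient spatial positions as tokens, shrinking the
    # sequence length the Transformer must attend over.
    # features: (C, H, W), saliency: (H, W)
    C, H, W = features.shape
    tokens = features.reshape(C, H * W).T        # (N, C), one token per position
    scores = saliency.reshape(H * W)
    k = max(1, int(keep_ratio * H * W))
    keep = np.argsort(scores)[-k:]               # indices of the k highest scores
    return tokens[keep], keep
```

The saliency mask in the second step would come from the paper's saliency mining network; the sketch simply shows why pruning low-saliency positions reduces the quadratic attention cost, since attention scales with the square of the token count.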
URL
https://arxiv.org/abs/2303.04378