Paper Reading AI Learner

DASTSiam: Spatio-Temporal Fusion and Discriminative Augmentation for Improved Siamese Tracking

2023-01-22 06:23:53
Yucheng Huang, Eksan Firkat, Ziwang Xiao, Jihong Zhu, Askar Hamdulla

Abstract

Tracking based on deep neural networks has improved greatly with the emergence of Siamese trackers. However, the appearance of a target often changes during tracking, which reduces the tracker's robustness under challenges such as aspect-ratio change, occlusion, and scale variation. In addition, cluttered backgrounds can produce multiple high-response points in the response map, leading to incorrect target localization. In this paper, we introduce two transformer-based modules to improve Siamese tracking, together called DASTSiam: a spatio-temporal (ST) fusion module and a discriminative augmentation (DA) module. The ST module accumulates historical cues via cross-attention to improve robustness against changes in object appearance, while the DA module associates semantic information between the template and the search region to improve target discrimination. Moreover, modifying the label assignment of anchors also improves the reliability of object localization. Our modules can be combined with existing Siamese trackers and show improved performance on several public datasets in comparative and ablation experiments.
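The abstract describes the ST module as accumulating historical cues via cross-attention. As a rough illustration only (the paper's actual architecture is not given here), the core operation can be sketched as scaled dot-product cross-attention in which queries come from the current search-region features and keys/values come from accumulated historical template features; all array shapes and names below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    """Scaled dot-product cross-attention.

    query: (Nq, d) tokens from the current search region
    key/value: (Nk, d) tokens accumulated from historical templates
    Returns (Nq, d) search features fused with historical cues.
    """
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)   # (Nq, Nk) similarity of search tokens to history
    weights = softmax(scores, axis=-1)    # attention distribution over historical tokens
    return weights @ value                # history-weighted feature aggregation

# Hypothetical token counts and feature dimension for illustration.
rng = np.random.default_rng(0)
search_feats = rng.normal(size=(16, 32))    # current-frame search tokens
history_feats = rng.normal(size=(48, 32))   # accumulated template tokens
fused = cross_attention(search_feats, history_feats, history_feats)
print(fused.shape)  # (16, 32)
```

In a real tracker these features would come from the Siamese backbone, and the fused output would feed the correlation/classification heads; this sketch only shows the attention arithmetic.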


URL

https://arxiv.org/abs/2301.09063

PDF

https://arxiv.org/pdf/2301.09063.pdf

