Paper Reading AI Learner

Dynamic Attention Mechanism in Spatiotemporal Memory Networks for Object Tracking

2025-03-21 00:48:31
Meng Zhou, Jiadong Xie, Mingsheng Xu

Abstract

Mainstream visual object tracking frameworks predominantly rely on template matching paradigms. Their performance heavily depends on the quality of template features, which becomes increasingly challenging to maintain in complex scenarios involving target deformation, occlusion, and background clutter. While existing spatiotemporal memory-based trackers emphasize memory capacity expansion, they lack effective mechanisms for dynamic feature selection and adaptive fusion. To address this gap, we propose a Dynamic Attention Mechanism in Spatiotemporal Memory Network (DASTM) with two key innovations: 1) a differentiable dynamic attention mechanism that adaptively adjusts channel-spatial attention weights by analyzing spatiotemporal correlations between the template and memory features; 2) a lightweight gating network that autonomously allocates computational resources based on target motion states, prioritizing high-discriminability features in challenging scenarios. Extensive evaluations on the OTB-2015, VOT2018, LaSOT, and GOT-10k benchmarks demonstrate DASTM's superiority, achieving state-of-the-art performance in success rate, robustness, and real-time efficiency, thereby offering a novel solution for real-time tracking in complex environments.
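The abstract does not specify DASTM's architecture in detail, but the two components it names — channel-spatial attention weights driven by template-memory correlation, and a lightweight gate keyed to target motion — can be sketched in broad strokes. The sketch below is a hypothetical NumPy illustration of that general idea, not the authors' implementation: all function names, the cosine-similarity correlation measure, and the `threshold` parameter are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def channel_spatial_attention(template, memory):
    """Hypothetical sketch of correlation-driven channel-spatial attention.

    template, memory: (C, H, W) feature maps. Each memory channel is
    re-weighted by its cosine similarity to the matching template channel,
    then each spatial position is re-weighted by its response to the
    template's mean channel descriptor.
    """
    C, H, W = template.shape
    t = template.reshape(C, -1)
    m = memory.reshape(C, -1)

    # Channel attention: per-channel cosine similarity between template and memory.
    num = (t * m).sum(axis=1)
    den = np.linalg.norm(t, axis=1) * np.linalg.norm(m, axis=1) + 1e-8
    chan_w = softmax(num / den)                      # (C,)
    fused = memory * chan_w[:, None, None]           # channel-weighted memory

    # Spatial attention: response of each location to the template's mean descriptor.
    t_vec = t.mean(axis=1)                           # (C,)
    spat = (fused.reshape(C, -1) * t_vec[:, None]).sum(axis=0)
    spat_w = softmax(spat).reshape(H, W)             # (H, W)
    return fused * spat_w[None, :, :]


def motion_gate(displacement, threshold=2.0):
    """Hypothetical lightweight gate: a sigmoid on inter-frame motion magnitude,
    so a fast-moving target pushes the gate toward relying on memory features."""
    speed = np.linalg.norm(displacement)
    return 1.0 / (1.0 + np.exp(-(speed - threshold)))
```

In a tracker loop, the gate output could blend the attention-refined memory features with the raw template response; the paper's actual gating network presumably learns this allocation rather than using a fixed threshold.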

URL

https://arxiv.org/abs/2503.16768

PDF

https://arxiv.org/pdf/2503.16768.pdf
