Paper Reading AI Learner

Improving Visual Object Tracking through Visual Prompting

2024-09-27 16:39:50
Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

Abstract

Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT proposes a prompt generation network with the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge for tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once the visual prompt is refined, it can better highlight potential target locations, thereby reducing irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps through the guidance of the visual prompt, thus effectively reducing distractors. The proposed method does not involve CLIP during training, thereby keeping the training complexity unchanged and preserving the generalization capability of the pre-trained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method, can suppress distracting objects and enhance the tracker.
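
The abstract outlines a two-step mechanism: the tracker first compiles a visual prompt over potential target locations, and CLIP then refines that prompt by scoring how similar each candidate is to the reference template, so distractor-like candidates are down-weighted. Below is a minimal, hypothetical Python sketch of that refinement step, assuming PyTorch and the open-source clip package (https://github.com/openai/CLIP); the function names (embed, refine_prompt), the crop-and-score interface, and the min-max rescaling are illustrative assumptions, not the authors' implementation.

import torch
import clip  # open-source CLIP package: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def embed(crops):
    """Encode a list of PIL image crops into L2-normalized CLIP features."""
    batch = torch.stack([preprocess(c) for c in crops]).to(device)
    feats = model.encode_image(batch).float()
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def refine_prompt(template_crop, candidate_crops, prompt_scores):
    """Reweight the tracker's initial prompt scores by CLIP similarity.

    template_crop:   PIL crop of the reference target (e.g., from the first frame).
    candidate_crops: PIL crops around peaks of the initial visual prompt.
    prompt_scores:   1-D tensor of the tracker's initial scores for those candidates.
    Returns refined scores in which CLIP-dissimilar candidates (distractors)
    are suppressed.
    """
    t = embed([template_crop])                     # (1, D) template embedding
    c = embed(candidate_crops)                     # (N, D) candidate embeddings
    sim = (c @ t.T).squeeze(-1)                    # cosine similarity per candidate
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-6)  # rescale to [0, 1]
    return prompt_scores.to(device) * sim

# Hypothetical usage with placeholder file names:
# template = Image.open("template.jpg")
# candidates = [Image.open(f"candidate_{i}.jpg") for i in range(4)]
# refined = refine_prompt(template, candidates, torch.tensor([0.9, 0.8, 0.7, 0.6]))

In the paper's framing, the refined scores would then be rendered back into a spatial prompt map that guides the tracker's instance-aware feature maps; that rendering step, and the tracker itself, are omitted here. Consistent with the abstract, CLIP appears only at inference and is never involved in training.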

URL

https://arxiv.org/abs/2409.18901

PDF

https://arxiv.org/pdf/2409.18901.pdf

