Semantic correspondence, the task of determining relationships between different parts of images, underpins various applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, demonstrating promising results. Building on this progress, current state-of-the-art methods rely on combining multiple large models, resulting in high computational demands and reduced efficiency. In this work, we address this challenge with a novel knowledge distillation technique: we show how to distill the capabilities of two large, complementary vision foundation models into a single smaller model that maintains high accuracy at reduced computational cost. Furthermore, we demonstrate that incorporating 3D data further improves performance, without the need for human-annotated correspondences. Overall, our empirical results show that our distilled model with 3D data augmentation outperforms current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence. Our code and weights are publicly available on our project page.
https://arxiv.org/abs/2412.03512
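A minimal sketch of the dual-teacher feature distillation idea described above: a student feature map is supervised by a simple fusion of two frozen teachers' feature maps. The fusion rule (a weighted average), the cosine objective, and all tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative dual-teacher feature distillation (assumptions: shared channel dim,
# weighted-average teacher fusion, per-pixel cosine objective).
import torch
import torch.nn.functional as F

def dual_teacher_distill_loss(student_feats, teacher_a_feats, teacher_b_feats, alpha=0.5):
    """student_feats: (B, C, H, W); teacher feature maps share the channel dim C."""
    h, w = student_feats.shape[-2:]
    ta = F.interpolate(teacher_a_feats, size=(h, w), mode="bilinear", align_corners=False)
    tb = F.interpolate(teacher_b_feats, size=(h, w), mode="bilinear", align_corners=False)
    target = alpha * ta + (1.0 - alpha) * tb                   # simple fusion of the two frozen teachers
    cos = F.cosine_similarity(student_feats, target, dim=1)    # per-pixel cosine similarity, (B, H, W)
    return (1.0 - cos).mean()

# Toy shapes: a 256-d student map supervised by two 256-d teacher maps.
loss = dual_teacher_distill_loss(torch.randn(2, 256, 32, 32),
                                 torch.randn(2, 256, 64, 64),
                                 torch.randn(2, 256, 64, 64))
```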
3D single object tracking is essential in autonomous driving and robotics. Existing methods often struggle with sparse and incomplete point cloud scenarios. To address these limitations, we propose a Multimodal-guided Virtual Cues Projection (MVCP) scheme that generates virtual cues to enrich sparse point clouds. Additionally, we introduce MVCTrack, an enhanced tracker based on the generated virtual cues. Specifically, the MVCP scheme seamlessly integrates RGB sensors into LiDAR-based systems, leveraging a set of 2D detections to create dense 3D virtual cues that significantly alleviate the sparsity of point clouds. These virtual cues can naturally integrate with existing LiDAR-based 3D trackers, yielding substantial performance gains. Extensive experiments demonstrate that our method achieves competitive performance on the NuScenes dataset.
https://arxiv.org/abs/2412.02734
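A rough sketch of the core idea of lifting 2D detections into dense 3D virtual points: pixels sampled inside a detection box are back-projected at an assumed depth using a pinhole model and known camera-to-LiDAR extrinsics, then appended to the sparse scan. The sampling grid, depth handling, and calibration values are simplifications, not the paper's MVCP scheme.

```python
# Illustrative lifting of a 2D detection box to 3D "virtual" points (assumes a pinhole
# camera with known intrinsics K and camera-to-LiDAR extrinsics; depth is a placeholder).
import numpy as np

def virtual_points_from_box(box_xyxy, depth, K, T_cam_to_lidar, grid=16):
    """Sample a grid of pixels inside a 2D box and back-project them at an assumed depth."""
    x1, y1, x2, y2 = box_xyxy
    us, vs = np.meshgrid(np.linspace(x1, x2, grid), np.linspace(y1, y2, grid))
    pix = np.stack([us.ravel(), vs.ravel(), np.ones(grid * grid)], axis=0)   # (3, N) homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                            # camera rays
    pts_cam = rays * depth                                                   # (3, N) points in camera frame
    pts_h = np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])             # homogeneous coordinates
    pts_lidar = (T_cam_to_lidar @ pts_h)[:3].T                               # (N, 3) points in LiDAR frame
    return pts_lidar

# Example: enrich a sparse scan with virtual points from one detection (toy calibration).
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
sparse_scan = np.random.rand(50, 3)
augmented = np.vstack([sparse_scan,
                       virtual_points_from_box((300, 200, 360, 280), 12.0, K, np.eye(4))])
```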
In this paper, we present a novel benchmark, GSOT3D, that aims at facilitating development of generic 3D single object tracking (SOT) in the wild. Specifically, GSOT3D offers 620 sequences with 123K frames and covers a wide selection of 54 object categories. Each sequence is provided with multiple modalities, including point cloud (PC), RGB image, and depth. This allows GSOT3D to support various 3D tracking tasks, such as single-modal 3D SOT on PC and multi-modal 3D SOT on RGB-PC or RGB-D, and thus greatly broadens research directions for 3D object tracking. To provide high-quality per-frame 3D annotations, all sequences are labeled manually with multiple rounds of meticulous inspection and refinement. To the best of our knowledge, GSOT3D is the largest benchmark dedicated to various generic 3D object tracking tasks. To understand how existing 3D trackers perform and to provide comparisons for future research on GSOT3D, we assess eight representative point cloud-based tracking models. Our evaluation shows that these models degrade heavily on GSOT3D and that more effort is required for robust and generic 3D object tracking. Besides, to encourage future research, we present a simple yet effective generic 3D tracker, named PROT3D, that localizes the target object via a progressive spatial-temporal network and outperforms all current solutions by a large margin. By releasing GSOT3D, we expect to further advance 3D tracking in future research and applications. Our benchmark and model, as well as the evaluation results, will be publicly released at our webpage this https URL.
https://arxiv.org/abs/2412.02129
Efficient and accurate object pose estimation is an essential component of modern vision systems in many applications such as Augmented Reality, autonomous driving, and robotics. While research in model-based 6D object pose estimation has delivered promising results, model-free methods are hindered by the high computational load of rendering and inferring consistent poses of arbitrary objects in a live RGB-D video stream. To address this issue, we present 6DOPE-GS, a novel method for online 6D object pose estimation and tracking with a single RGB-D camera that effectively leverages advances in Gaussian Splatting. Thanks to the fast differentiable rendering capabilities of Gaussian Splatting, 6DOPE-GS can simultaneously optimize for 6D object poses and 3D object reconstruction. To achieve the necessary efficiency and accuracy for live tracking, our method uses incremental 2D Gaussian Splatting with an intelligent dynamic keyframe selection procedure to achieve high spatial object coverage and prevent erroneous pose updates. We also propose an opacity statistic-based pruning mechanism for adaptive Gaussian density control to ensure training stability and efficiency. We evaluate our method on the HO3D and YCBInEOAT datasets and show that 6DOPE-GS matches the performance of state-of-the-art baselines for model-free simultaneous 6D pose tracking and reconstruction while providing a 5x speedup. We also demonstrate the method's suitability for live, dynamic object tracking and reconstruction in a real-world setting.
https://arxiv.org/abs/2412.01543
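A small sketch of what an opacity statistic-based pruning rule for Gaussian density control could look like: Gaussians whose opacity falls below both an absolute floor and a quantile of the current opacity distribution are dropped. The specific statistic and thresholds are assumptions, not the 6DOPE-GS criterion.

```python
# Illustrative opacity-based pruning for Gaussian density control
# (percentile rule and floor value are assumptions, not the paper's mechanism).
import torch

def prune_gaussians(opacities, positions, scales, prune_fraction=0.2, floor=0.05):
    """Remove the lowest-opacity Gaussians: anything below an absolute floor or
    below the `prune_fraction` quantile of the current opacity distribution."""
    thresh = torch.quantile(opacities, prune_fraction)
    keep = opacities > torch.maximum(thresh, torch.tensor(floor))
    return positions[keep], scales[keep], opacities[keep]

ops = torch.rand(1000)          # per-Gaussian opacities
pos = torch.randn(1000, 3)      # centers
scl = torch.rand(1000, 3)       # anisotropic scales
pos, scl, ops = prune_gaussians(ops, pos, scl)
```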
Handling occlusion remains a significant challenge for video instance-level tasks like Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). In this paper, we propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in a video. The key intuition is that awareness of amodal segmentation across the spatiotemporal dimension enables a stable stream of object information. In scenarios where objects are partially or completely hidden from view, amodal segmentation offers more consistency and less dramatic changes along the temporal axis compared to visible segmentation. Hence, both amodal and visible information from all clips can be integrated into one global instance prototype. To effectively address the challenge of video amodal segmentation, we introduce the spatiotemporal-prior Amodal Mask Head, which leverages visible information within clips while extracting amodal characteristics across clips. Through extensive experiments and ablation studies, we show that A2VIS excels in both MOT and VIS tasks in identifying and tracking object instances with a keen understanding of their full shape.
https://arxiv.org/abs/2412.01147
Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video object tracking, where inconsistent mask tracks can disrupt vision-language alignment and lead to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection. In track generation, we leverage a vision foundation model, Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a lightweight yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. This design enables precise motion modeling and vision-language alignment. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-Youtube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur. Our project page is available at this https URL
https://arxiv.org/abs/2412.01136
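A minimal sketch of the track-selection step: each candidate mask track is scored against the text query by cosine similarity of temporally pooled features, and the best-scoring track is selected. The pooling and scoring below are stand-ins; SOLA's selection module additionally models appearance and motion.

```python
# Illustrative track selection by vision-language similarity (feature extractors omitted;
# shapes and pooling are assumptions).
import torch
import torch.nn.functional as F

def select_track(track_features, text_feature):
    """track_features: (num_tracks, T, D) per-frame visual features of each candidate track.
    text_feature: (D,) embedding of the referring expression."""
    pooled = track_features.mean(dim=1)                                      # temporal average pooling -> (num_tracks, D)
    scores = F.cosine_similarity(pooled, text_feature.unsqueeze(0), dim=-1)  # one score per track
    return scores.argmax().item(), scores

tracks = torch.randn(5, 8, 256)     # 5 candidate tracks, 8 frames, 256-d features
text = torch.randn(256)
best_track, scores = select_track(tracks, text)
```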
Recent advances in video question answering (VideoQA) offer promising applications, especially in traffic monitoring, where efficient video interpretation is critical. Within intelligent transportation systems (ITS), answering complex, real-time queries like "How many red cars passed in the last 10 minutes?" or "Was there an incident between 3:00 PM and 3:05 PM?" enhances situational awareness and decision-making. Despite progress in vision-language models, VideoQA remains challenging, especially in dynamic environments involving multiple objects and intricate spatiotemporal relationships. This study evaluates state-of-the-art VideoQA models on non-benchmark synthetic and real-world traffic sequences. The framework leverages GPT-4o to assess accuracy, relevance, and consistency across basic detection, temporal reasoning, and decomposition queries. VideoLLaMA-2 excelled with 57% accuracy, particularly in compositional reasoning and consistent answers. However, all models, including VideoLLaMA-2, faced limitations in multi-object tracking, temporal coherence, and complex scene interpretation, highlighting gaps in current architectures. These findings underscore VideoQA's potential in traffic monitoring but also emphasize the need for improvements in multi-object tracking, temporal reasoning, and compositional capabilities. Enhancing these areas could make VideoQA indispensable for incident detection, traffic flow management, and responsive urban planning. The study's code and framework are open-sourced for further exploration: this https URL
https://arxiv.org/abs/2412.01132
Object tracking is a fundamental tool in modern innovation, with applications in defense systems, autonomous vehicles, and biomedical research. It enables precise identification, monitoring, and spatiotemporal analysis of objects across sequential frames, providing insights into dynamic behaviors. In cell biology, object tracking is vital for uncovering cellular mechanisms, such as migration, interactions, and responses to drugs or pathogens. These insights drive breakthroughs in understanding disease progression and therapeutic interventions. Over time, object tracking methods have evolved from traditional feature-based approaches to advanced machine learning and deep learning frameworks. While classical methods are reliable in controlled settings, they struggle in complex environments with occlusions, variable lighting, and high object density. Deep learning models address these challenges by delivering greater accuracy, adaptability, and robustness. This review categorizes object tracking techniques into traditional, statistical, feature-based, and machine learning paradigms, with a focus on biomedical applications. These methods are essential for tracking cells and subcellular structures, advancing our understanding of health and disease. Key performance metrics, including accuracy, efficiency, and adaptability, are discussed. The paper explores limitations of current methods and highlights emerging trends to guide the development of next-generation tracking systems for biomedical research and broader scientific domains.
https://arxiv.org/abs/2412.01119
In the field of autonomous driving and robotics, simultaneous localization and mapping (SLAM) and multi-object tracking (MOT) are two fundamental problems that are generally applied separately. Solutions to SLAM and MOT usually rely on certain assumptions, such as the static environment assumption for SLAM and the accurate ego-vehicle pose assumption for MOT. However, in complex dynamic environments, it is difficult or even impossible to meet these assumptions. Therefore, SLAMMOT, i.e., simultaneous localization, mapping, and moving object tracking, an integrated system of SLAM and object tracking, has emerged for autonomous vehicles in dynamic environments. However, many conventional SLAMMOT solutions perform data association directly on the predictions and detections for object tracking while ignoring their quality. In practice, inaccurate predictions caused by continuous multi-frame missed detections in temporary occlusion scenarios may degrade tracking performance and thereby affect SLAMMOT. To address this challenge, this paper presents Conf SLAMMOT, a LiDAR SLAMMOT method based on confidence-guided data association, which tightly couples LiDAR SLAM and confidence-guided multi-object tracking into a graph optimization backend for estimating the states of the ego-vehicle and objects simultaneously. The confidences of predictions and detections are used for data association in the factor graph-based multi-object tracking, which not only avoids the performance degradation caused by incorrect initial assignments in some filter-based methods but also handles issues such as continuous missed detections, improving the overall performance of SLAMMOT. Various comparative experiments demonstrate the superior advantages of Conf SLAMMOT, especially in scenes with missed detections.
https://arxiv.org/abs/2412.01041
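A simple sketch of confidence-guided data association: prediction-detection distances are re-weighted by their confidences before Hungarian assignment, so low-confidence predictions (e.g., after several missed detections) are gated less aggressively. The weighting and gating rules are assumptions and stand in for Conf SLAMMOT's factor-graph formulation.

```python
# Illustrative confidence-weighted association (cost weighting and gate are assumptions).
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(pred_xy, pred_conf, det_xy, det_conf, gate=4.0):
    """pred_xy: (P, 2) predicted positions, det_xy: (D, 2) detected positions,
    pred_conf/det_conf: per-item confidences in [0, 1]. Returns matched index pairs."""
    dists = np.linalg.norm(pred_xy[:, None, :] - det_xy[None, :, :], axis=-1)   # (P, D)
    # Confident items pay full distance cost; uncertain ones are discounted.
    cost = dists * (0.5 + 0.5 * pred_conf[:, None]) * (0.5 + 0.5 * det_conf[None, :])
    rows, cols = linear_sum_assignment(cost)
    # Low-confidence predictions tolerate a larger gate before rejection.
    return [(r, c) for r, c in zip(rows, cols) if dists[r, c] < gate / max(pred_conf[r], 1e-3)]

preds = np.array([[0.0, 0.0], [5.0, 5.0]]); pconf = np.array([0.9, 0.3])
dets  = np.array([[0.2, 0.1], [6.5, 5.5]]); dconf = np.array([0.8, 0.6])
print(associate(preds, pconf, dets, dconf))
```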
Object perception from multi-view cameras is crucial for intelligent systems, particularly in indoor environments, e.g., warehouses, retail stores, and hospitals. Most traditional multi-target multi-camera (MTMC) detection and tracking methods rely on 2D object detection, single-view multi-object tracking (MOT), and cross-view re-identification (ReID) techniques, without properly exploiting important 3D information through multi-view image aggregation. In this paper, we propose a 3D object detection and tracking framework, named BEV-SUSHI, which first aggregates multi-view images with the necessary camera calibration parameters to obtain 3D object detections in bird's-eye view (BEV). Then, we introduce hierarchical graph neural networks (GNNs) to track these 3D detections in BEV for MTMC tracking results. Unlike existing methods, BEV-SUSHI has impressive generalizability across different scenes and diverse camera settings, with exceptional capability for long-term association handling. As a result, our proposed BEV-SUSHI establishes a new state-of-the-art with 81.22 HOTA on the AICity'24 dataset and 95.6 IDF1 on the WildTrack dataset.
https://arxiv.org/abs/2412.00692
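A geometric sketch of multi-view aggregation into BEV: the bottom-center of each 2D box is back-projected onto the ground plane using camera calibration, so detections from all views land in one bird's-eye-view frame. This is a classical approximation, not BEV-SUSHI's learned aggregation.

```python
# Illustrative ground-plane back-projection of a 2D detection (pinhole model, z = 0 plane).
import numpy as np

def box_bottom_to_bev(box_xyxy, K, R, t):
    """K: 3x3 intrinsics; R, t: world-to-camera rotation/translation. Returns (x, y) on z = 0."""
    x1, y1, x2, y2 = box_xyxy
    pix = np.array([(x1 + x2) / 2.0, y2, 1.0])    # bottom-center pixel (homogeneous)
    ray_cam = np.linalg.inv(K) @ pix              # viewing ray in camera coordinates
    ray_world = R.T @ ray_cam                     # rotate the ray into the world frame
    cam_center = -R.T @ t                         # camera position in the world frame
    s = -cam_center[2] / ray_world[2]             # intersect the ray with the plane z = 0
    ground_pt = cam_center + s * ray_world
    return ground_pt[:2]

# Toy calibration: identity rotation, camera offset 3 m along its optical axis.
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
print(box_bottom_to_bev((900, 500, 1020, 800), K, np.eye(3), np.array([0.0, 0.0, 3.0])))
```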
Following the successful 2023 edition, we organised the Second Perception Test challenge as a half-day workshop alongside the IEEE/CVF European Conference on Computer Vision (ECCV) 2024, with the goal of benchmarking state-of-the-art video models and measuring the progress since last year using the Perception Test benchmark. This year, the challenge had seven tracks (up from six last year) and covered low-level and high-level tasks, with language and non-language interfaces, across video, audio, and text modalities; the additional track covered hour-long video understanding and introduced a novel video QA benchmark 1h-walk VQA. Overall, the tasks in the different tracks were: object tracking, point tracking, temporal action localisation, temporal sound localisation, multiple-choice video question-answering, grounded video question-answering, and hour-long video question-answering. We summarise in this report the challenge tasks and results, and introduce in detail the novel hour-long video QA benchmark 1h-walk VQA.
https://arxiv.org/abs/2411.19941
We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground-truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment. The dataset is recorded by two head-mounted devices from Meta: Project Aria, a research prototype of light-weight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.
https://arxiv.org/abs/2411.19167
Simultaneous Localization and Mapping (SLAM) and Multi-Object Tracking (MOT) are pivotal tasks in the realm of autonomous driving, attracting considerable research attention. While SLAM endeavors to generate real-time maps and determine the vehicle's pose in unfamiliar settings, MOT focuses on the real-time identification and tracking of multiple dynamic objects. Despite their importance, the prevalent approach treats SLAM and MOT as independent modules within an autonomous vehicle system, leading to inherent limitations. Classical SLAM methodologies often rely on a static environment assumption, suitable for indoor rather than dynamic outdoor scenarios. Conversely, conventional MOT techniques typically rely on the vehicle's known state, constraining the accuracy of object state estimations based on this prior. To address these challenges, previous efforts introduced the unified SLAMMOT paradigm, yet primarily focused on simplistic motion patterns. In our team's previous work, IMM-SLAMMOT, we presented a methodology that incorporates multiple motion models into SLAMMOT, i.e., tightly coupled SLAM and MOT, and demonstrated its efficacy in LiDAR-based systems. This paper studies the feasibility and advantages of instantiating this methodology as visual SLAMMOT, bridging the gap between LiDAR- and vision-based sensing mechanisms. Specifically, we propose a visual SLAMMOT solution that considers multiple motion models and validate the inherent advantages of IMM-SLAMMOT in the visual domain.
https://arxiv.org/abs/2411.19134
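For readers unfamiliar with IMM, a minimal sketch of the interacting multiple model probability update that underlies the multiple-motion-model idea: per-model probabilities are propagated through a Markov transition matrix and re-weighted by each model's measurement likelihood. The per-model Kalman filters and state mixing are omitted for brevity.

```python
# Minimal IMM model-probability update (filter internals omitted; values are toy numbers).
import numpy as np

def imm_probability_update(mu, trans, likelihoods):
    """mu: (M,) prior model probabilities; trans: (M, M) with trans[i, j] = P(model j | model i);
    likelihoods: (M,) measurement likelihoods under each motion model."""
    predicted = trans.T @ mu                 # probability of being in each model at this step
    posterior = predicted * likelihoods      # weight by how well each model explains the data
    return posterior / posterior.sum()

mu = np.array([0.7, 0.3])                          # e.g. constant-velocity vs. constant-turn
trans = np.array([[0.95, 0.05], [0.05, 0.95]])     # sticky model transitions
print(imm_probability_update(mu, trans, likelihoods=np.array([0.2, 1.4])))
```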
The fusion of camera- and LiDAR-based detections offers a promising solution to mitigate tracking failures in 3D multi-object tracking (MOT). However, existing methods predominantly exploit camera detections to correct tracking failures caused by potential LiDAR detection problems, neglecting the reciprocal benefit of refining camera detections using LiDAR data. This limitation is rooted in their single-stage architecture, akin to single-stage object detectors, lacking a dedicated trajectory refinement module to fully exploit the complementary multi-modal information. To this end, we introduce CrossTracker, a novel two-stage paradigm for online multi-modal 3D MOT. CrossTracker operates in a coarse-to-fine manner, initially generating coarse trajectories and subsequently refining them through an independent refinement process. Specifically, CrossTracker incorporates three essential modules: i) a multi-modal modeling (M^3) module that, by fusing multi-modal information (images, point clouds, and even plane geometry extracted from images), provides a robust metric for subsequent trajectory generation; ii) a coarse trajectory generation (C-TG) module that generates initial coarse dual-stream trajectories; and iii) a trajectory refinement (TR) module that refines coarse trajectories through cross-correction between camera and LiDAR streams. Comprehensive experiments demonstrate the superior performance of our CrossTracker over its eighteen competitors, underscoring its effectiveness in harnessing the synergistic benefits of camera and LiDAR sensors for robust multi-modal 3D MOT.
https://arxiv.org/abs/2411.18850
This paper presents a preliminary study of an efficient object tracking approach, comparing the performance of two 3D point cloud sensing sources with significantly different prices: LiDAR and stereo cameras. In this preliminary work, we focus on single object tracking. We first developed a fast heuristic object detector that utilizes prior information about the environment and target. The resulting target points are subsequently fed into an extended object tracking framework, where the target shape is parameterized using a star-convex hypersurface model. Experimental results show that our object tracking method using a stereo camera achieves performance similar to that of a LiDAR sensor, despite a more than tenfold cost difference.
https://arxiv.org/abs/2411.18476
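A small sketch of one common star-convex extent parameterization, where the target contour is a radial function of angle expressed with Fourier coefficients. The exact hypersurface model and coefficient estimation used in the paper are not specified here, so this is only illustrative.

```python
# Illustrative star-convex contour: r(theta) = r0 + sum_k a_k cos(k theta) + b_k sin(k theta).
import numpy as np

def star_convex_contour(coeffs_a, coeffs_b, r0, num_points=100):
    """Sample the radial shape function on a full circle and return 2D contour points."""
    theta = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    r = np.full_like(theta, r0)
    for k, (a, b) in enumerate(zip(coeffs_a, coeffs_b), start=1):
        r += a * np.cos(k * theta) + b * np.sin(k * theta)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)   # (num_points, 2)

# Toy coefficients producing a slightly elongated, asymmetric target extent.
contour = star_convex_contour(coeffs_a=[0.3, 0.1], coeffs_b=[0.0, 0.05], r0=1.5)
```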
We propose a real-time dynamic LiDAR odometry pipeline for mobile robots in Urban Search and Rescue (USAR) scenarios. Existing approaches to dynamic object detection often rely on pretrained learned networks or computationally expensive volumetric maps. To enhance efficiency on computationally limited robots, we reuse data between the odometry and detection module. Utilizing a range image segmentation technique and a novel residual-based heuristic, our method distinguishes dynamic from static objects before integrating them into the point cloud map. The approach demonstrates robust object tracking and improved map accuracy in environments with numerous dynamic objects. Even highly non-rigid objects, such as running humans, are accurately detected at point level without prior downsampling of the point cloud and hence, without loss of information. Evaluation on simulated and real-world data validates its computational efficiency. Compared to a state-of-the-art volumetric method, our approach shows comparable detection performance at a fraction of the processing time, adding only 14 ms to the odometry module for dynamic object detection and tracking. The implementation and a new real-world dataset are available as open-source for further research.
https://arxiv.org/abs/2411.18443
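A minimal sketch of a residual-based dynamic/static split on a range image: points whose measured range deviates from the range predicted by the registered previous scan beyond a threshold are flagged as dynamic. The prediction step and the thresholds are simplifications of the paper's heuristic.

```python
# Illustrative residual-based dynamic-point flagging on range images (thresholds are assumptions).
import numpy as np

def flag_dynamic(range_now, range_predicted, abs_thresh=0.3, rel_thresh=0.05):
    """range_now, range_predicted: (H, W) range images in meters; returns a boolean dynamic mask."""
    residual = np.abs(range_now - range_predicted)
    valid = (range_now > 0) & (range_predicted > 0)                       # ignore empty returns
    return valid & (residual > np.maximum(abs_thresh, rel_thresh * range_predicted))

now = np.random.uniform(1.0, 30.0, size=(64, 1024))
pred = now.copy()
pred[10:20, 100:160] += 2.0            # simulate a region where a moving object shifted its range
dynamic_mask = flag_dynamic(now, pred)
```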
Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While already achieving top performance on many benchmarks, it was the recent release of SAM2 that placed memory-based trackers into the focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly address segmentation accuracy and tracking robustness. The resulting tracker is denoted SAM2.1++. We also propose a new distractor-distilled dataset, DiDi, to better study the distractor problem. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them.
https://arxiv.org/abs/2411.17576
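A plausible sketch, in the spirit of the abstract, of a distractor-aware memory update: a frame is admitted to the memory buffer only when the target is confidently segmented and clearly separated from the strongest distractor. The scores, thresholds, and buffer policy are assumptions, not SAM2.1++'s actual rule.

```python
# Illustrative distractor-aware memory update heuristic (scores and thresholds are assumptions).
from collections import deque

class DistractorAwareMemory:
    def __init__(self, capacity=7, conf_thresh=0.7, margin=0.2):
        self.buffer = deque(maxlen=capacity)   # oldest frames are evicted automatically
        self.conf_thresh = conf_thresh
        self.margin = margin

    def maybe_add(self, frame_features, target_score, best_distractor_score):
        """Add the frame only when the target is confident and clearly beats any distractor."""
        if target_score >= self.conf_thresh and (target_score - best_distractor_score) >= self.margin:
            self.buffer.append(frame_features)
            return True
        return False

mem = DistractorAwareMemory()
mem.maybe_add(frame_features="feat_t0", target_score=0.92, best_distractor_score=0.35)   # accepted
mem.maybe_add(frame_features="feat_t1", target_score=0.55, best_distractor_score=0.50)   # rejected
```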
Transformer-based multi-object tracking (MOT) methods have captured the attention of many researchers in recent years. However, these models often suffer from slow inference speeds due to their structure or other issues. To address this problem, we revisit the Joint Detection and Tracking (JDT) paradigm. By integrating the original JDT approach with more recent advances, this paper employs an efficient method of inter-frame information transfer on top of DETR, constructing a fast and novel JDT-type MOT framework: FastTrackTr. Thanks to this information transfer method, our approach not only reduces the number of queries required during tracking but also avoids excessive additions to the network structure, keeping the model simple. Experimental results indicate that our method has the potential to achieve real-time tracking and exhibits competitive tracking accuracy across multiple datasets.
https://arxiv.org/abs/2411.15811
Vision-language tracking (VLT) extends traditional single object tracking by incorporating textual information, providing semantic guidance to enhance tracking performance under challenging conditions like fast motion and deformations. However, current VLT trackers often underperform compared to single-modality methods on multiple benchmarks, with semantic information sometimes becoming a "distraction." To address this, we propose VLTVerse, the first fine-grained evaluation framework for VLT trackers that comprehensively considers multiple challenge factors and diverse semantic information, hoping to reveal the role of language in VLT. Our contributions include: (1) VLTVerse introduces 10 sequence-level challenge labels and 6 types of multi-granularity semantic information, creating a flexible and multi-dimensional evaluation space for VLT; (2) leveraging 60 subspaces formed by combinations of challenge factors and semantic types, we conduct systematic fine-grained evaluations of three mainstream SOTA VLT trackers, uncovering their performance bottlenecks across complex scenarios and offering a novel perspective on VLT evaluation; (3) through decoupled analysis of experimental results, we examine the impact of various semantic types on specific challenge factors in relation to different algorithms, providing essential guidance for enhancing VLT across data, evaluation, and algorithmic dimensions. The VLTVerse, toolkit, and results will be available at \url{this http URL}.
https://arxiv.org/abs/2411.15600
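A small sketch of the fine-grained evaluation idea: crossing challenge labels with semantic types yields evaluation subspaces, and per-subspace averages expose where a tracker breaks down. The label names, semantic types, and scoring function below are placeholders (the benchmark itself uses 10 labels and 6 types, forming 60 subspaces).

```python
# Illustrative per-subspace evaluation over challenge x semantic-type combinations.
from itertools import product

challenges = ["fast_motion", "deformation", "occlusion"]    # placeholder subset (10 in the benchmark)
semantic_types = ["class_name", "attribute_phrase"]         # placeholder subset (6 in the benchmark)

def evaluate_subspaces(sequences, score_fn):
    """sequences: list of dicts with 'challenges', 'semantic_type', and per-sequence tracker results."""
    results = {}
    for ch, sem in product(challenges, semantic_types):
        subset = [s for s in sequences if ch in s["challenges"] and s["semantic_type"] == sem]
        if subset:
            results[(ch, sem)] = sum(score_fn(s) for s in subset) / len(subset)
    return results

seqs = [{"challenges": ["occlusion"], "semantic_type": "class_name", "auc": 0.61},
        {"challenges": ["fast_motion", "occlusion"], "semantic_type": "attribute_phrase", "auc": 0.48}]
print(evaluate_subspaces(seqs, score_fn=lambda s: s["auc"]))
```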
The vision-language tracking task aims to perform object tracking based on various modality references. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM) known as Mamba has shown impressive capabilities in efficient long-sequence modeling. Notably, its state space evolving process demonstrates promising capabilities for memorizing multimodal temporal information with linear complexity. Motivated by this success, we propose a Mamba-based vision-language tracking model, dubbed MambaVLT, that exploits this state space evolving ability in the temporal domain for robust multimodal tracking. In particular, our approach mainly integrates a time-evolving hybrid state space block and a selective locality enhancement block to capture contextual information for multimodal modeling and adaptive reference feature update. Besides, we introduce a modality-selection module that dynamically adjusts the weighting between visual and language references, mitigating potential ambiguities from either reference type. Extensive experimental results show that our method performs favorably against state-of-the-art trackers across diverse benchmarks.
https://arxiv.org/abs/2411.15459
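A minimal sketch of a modality-selection module: a small gate predicts weights for the visual and language reference features and fuses them accordingly. The layer sizes and gating form are assumptions, not MambaVLT's implementation.

```python
# Illustrative modality-selection gate (dimensions and architecture are assumptions).
import torch
import torch.nn as nn

class ModalitySelector(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Predict two weights (visual, language) from the concatenated references.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, visual_ref, language_ref):
        """visual_ref, language_ref: (B, dim). Returns the fused reference and the gate weights."""
        weights = torch.softmax(self.gate(torch.cat([visual_ref, language_ref], dim=-1)), dim=-1)
        fused = weights[:, :1] * visual_ref + weights[:, 1:] * language_ref
        return fused, weights

selector = ModalitySelector()
fused, w = selector(torch.randn(2, 256), torch.randn(2, 256))
```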