The field of visual object tracking is dominated by methods that combine simple tracking algorithms and ad hoc schemes. Probabilistic tracking algorithms, which are leading in other fields, are surprisingly absent from the leaderboards. We found that accounting for distance in target kinematics, exploiting detector confidence, and modelling non-uniform clutter characteristics are critical for a probabilistic tracker to work in visual tracking. Previous probabilistic methods fail to address most or all of these aspects, which we believe is why they fall so far behind current state-of-the-art (SOTA) methods (there are no probabilistic trackers in the MOT17 top 100). To rekindle progress among probabilistic approaches, we propose a set of pragmatic models addressing these challenges, and demonstrate how they can be incorporated into a probabilistic framework. We present BASE (Bayesian Approximation Single-hypothesis Estimator), a simple, performant and easily extendible visual tracker, achieving SOTA results on MOT17 and MOT20 without using Re-Id. Code will be made available at this https URL
https://arxiv.org/abs/2309.12035
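To make the kinematic ingredient concrete, here is a minimal sketch of a constant-velocity Kalman predict/update step whose measurement noise is scaled by detector confidence; the state layout, the confidence scaling, and all parameter values are illustrative assumptions, not BASE's actual models.

```python
import numpy as np

def kalman_predict(x, P, dt=1.0, q=1.0):
    """Constant-velocity prediction for a 1D state [position, velocity]."""
    F = np.array([[1.0, dt], [0.0, 1.0]])          # state transition
    Q = q * np.array([[dt**4 / 4, dt**3 / 2],      # process noise (CV model)
                      [dt**3 / 2, dt**2]])
    return F @ x, F @ P @ F.T + Q

def kalman_update(x, P, z, det_conf, r0=1.0):
    """Position-only update; measurement noise grows as detector confidence drops (assumption)."""
    H = np.array([[1.0, 0.0]])                     # observe position only
    R = np.array([[r0 / max(det_conf, 1e-3)]])     # confidence-scaled measurement noise
    y = z - H @ x                                  # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    return x + K @ y, (np.eye(2) - K @ H) @ P

# Toy usage: one predict/update cycle on a single coordinate.
x, P = np.array([0.0, 1.0]), np.eye(2)
x, P = kalman_predict(x, P)
x, P = kalman_update(x, P, z=np.array([1.2]), det_conf=0.9)
```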
In the currently available literature, no tracking-by-detection (TBD) paradigm-based tracking method has considered the localization confidence of detection boxes. Most TBD-based methods assume that objects with low detection confidence are highly occluded, so it is normal practice to directly disregard such objects or to reduce their priority in matching. In addition, appearance similarity is not considered when matching these objects. However, since detection confidence fuses classification and localization, objects with low detection confidence may have inaccurate localization but clear appearance; similarly, objects with high detection confidence may have inaccurate localization or unclear appearance; yet these cases are not further distinguished. In view of these issues, we propose Localization-Guided Track (LG-Track). Firstly, localization confidence is applied in MOT for the first time, with the appearance clarity and localization accuracy of detection boxes taken into account, and an effective deep association mechanism is designed; secondly, based on the classification confidence and localization confidence, a more appropriate cost matrix can be selected and used; finally, extensive experiments have been conducted on the MOT17 and MOT20 datasets. The results show that our proposed method outperforms the compared state-of-the-art tracking methods. For the benefit of the community, our code has been made publicly available at this https URL.
https://arxiv.org/abs/2309.09765
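To illustrate how classification and localization confidence could drive the association cost, here is a minimal sketch that blends an IoU cost and an appearance cost per detection using its two confidences; the blending rule, function names, and weighting are assumptions rather than LG-Track's exact mechanism.

```python
import numpy as np

def iou_matrix(tracks, dets):
    """Pairwise IoU between track boxes and detection boxes in (x1, y1, x2, y2) format."""
    ious = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            x1, y1 = max(t[0], d[0]), max(t[1], d[1])
            x2, y2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_t = (t[2] - t[0]) * (t[3] - t[1])
            area_d = (d[2] - d[0]) * (d[3] - d[1])
            ious[i, j] = inter / (area_t + area_d - inter + 1e-9)
    return ious

def fused_cost(tracks, dets, app_cost, loc_conf, cls_conf):
    """Blend IoU and appearance costs per detection using its two confidences.
    app_cost: (T, D) appearance distances; loc_conf, cls_conf: (D,) per-detection scores.
    The blending rule below is an illustrative assumption, not LG-Track's formula."""
    iou_cost = 1.0 - iou_matrix(tracks, dets)      # (T, D)
    w_loc = loc_conf[None, :]                      # trust IoU more when localization is good
    w_app = cls_conf[None, :]                      # trust appearance more when the box looks clear
    return (w_loc * iou_cost + w_app * app_cost) / (w_loc + w_app + 1e-9)
```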
Mobile autonomy relies on the precise perception of dynamic environments. Robustly tracking moving objects in the 3D world thus plays a pivotal role in applications like trajectory prediction, obstacle avoidance, and path planning. While most current methods utilize LiDARs or cameras for Multiple Object Tracking (MOT), the capabilities of 4D imaging radars remain largely unexplored. Recognizing the challenges posed by radar noise and point sparsity in 4D radar data, we introduce RaTrack, an innovative solution tailored for radar-based tracking. Bypassing the typical reliance on specific object types and 3D bounding boxes, our method focuses on motion segmentation and clustering, enriched by a motion estimation module. Evaluated on the View-of-Delft dataset, RaTrack showcases superior tracking precision for moving objects, largely surpassing the performance of the state of the art.
https://arxiv.org/abs/2309.09737
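A minimal sketch of the motion-segmentation-plus-clustering idea, assuming each radar point carries a per-point velocity estimate (e.g., from a motion estimation module): points above a speed threshold are treated as moving and clustered with DBSCAN into object instances. Thresholds and parameter values are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def segment_and_cluster(points, velocities, speed_thresh=0.5, eps=1.5, min_samples=3):
    """points: (N, 3) xyz; velocities: (N, 3) per-point motion estimates.
    Returns cluster labels (-1 = static or noise) for every input point."""
    labels = np.full(len(points), -1, dtype=int)
    moving = np.linalg.norm(velocities, axis=1) > speed_thresh   # motion segmentation
    if moving.any():
        # Cluster only the moving points; sparse radar data favors a permissive eps.
        labels[moving] = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[moving])
    return labels
```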
Supervised trackers trained on labeled data dominate the single object tracking field owing to their superior tracking accuracy. The labeling cost and the huge computational complexity hinder their application on edge devices. Unsupervised learning methods have also been investigated to reduce the labeling cost, but their complexity remains high. Aiming at lightweight high-performance tracking, feasibility without offline pre-training, and algorithmic transparency, we propose a new single object tracking method, called the green object tracker (GOT), in this work. GOT runs an ensemble of three prediction branches for robust box tracking: 1) a global object-based correlator to predict the object location roughly, 2) a local patch-based correlator to build temporal correlations of small spatial units, and 3) a superpixel-based segmentator to exploit the spatial information of the target frame. GOT offers tracking accuracy competitive with state-of-the-art unsupervised trackers, which demand heavy offline pre-training, at a lower computational cost. GOT has a tiny model size (<3k parameters) and low inference complexity (around 58M FLOPs per frame). Since its inference complexity is between 0.1% and 10% of that of DL trackers, it can be easily deployed on mobile and edge devices.
https://arxiv.org/abs/2309.09078
Due to long-distance correlation and powerful pretrained models, transformer-based methods have initiated a breakthrough in visual object tracking performance. Previous works focus on designing effective architectures suited for tracking, but ignore that data augmentation is equally crucial for training a well-performing model. In this paper, we first explore the impact of general data augmentations on transformer-based trackers via systematic experiments, and reveal the limited effectiveness of these common strategies. Motivated by experimental observations, we then propose two data augmentation methods customized for tracking. First, we optimize existing random cropping via a dynamic search radius mechanism and simulation of boundary samples. Second, we propose a token-level feature mixing augmentation strategy, which makes the model more robust to challenges like background interference. Extensive experiments on two transformer-based trackers and six benchmarks demonstrate the effectiveness and data efficiency of our methods, especially under challenging settings, like one-shot tracking and small image resolutions.
https://arxiv.org/abs/2309.08264
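A hedged sketch of one possible token-level feature mixing step: replace a random fraction of each sample's search-region tokens with tokens from another sample in the batch, so the tracker sees injected background clutter during training. The mixing ratio and donor choice are assumptions, not the authors' exact recipe.

```python
import torch

def token_mix(search_tokens: torch.Tensor, mix_ratio: float = 0.2) -> torch.Tensor:
    """search_tokens: (B, N, C) search-region tokens.
    Swaps a random subset of each sample's tokens with tokens from a shuffled sample."""
    B, N, _ = search_tokens.shape
    n_mix = max(1, int(mix_ratio * N))
    perm = torch.randperm(B)                       # donor sample per batch element
    out = search_tokens.clone()
    for b in range(B):
        idx = torch.randperm(N)[:n_mix]            # tokens to replace in this sample
        out[b, idx] = search_tokens[perm[b], idx]  # inject "background" tokens from the donor
    return out

# Usage (training only): mixed = token_mix(features)
```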
Early detection and tracking of ejecta in the vicinity of small solar system bodies is crucial to guarantee spacecraft safety and support scientific observation. During the visit to the active asteroid Bennu, the OSIRIS-REx spacecraft relied on the analysis of images captured by onboard navigation cameras to detect particle ejection events, which ultimately became one of the mission's scientific highlights. To increase the scientific return of similar time-constrained missions, this work proposes an event-based solution dedicated to the detection and tracking of centimetre-sized particles. Unlike a standard frame-based camera, the pixels of an event-based camera independently trigger events indicating whether the scene brightness has increased or decreased at that time and location in the sensor plane. As a result of the sparse and asynchronous spatiotemporal output, event cameras combine very high dynamic range and temporal resolution with low power consumption, which could complement existing onboard imaging techniques. This paper motivates the use of a scientific event camera by reconstructing the particle ejection episodes reported by the OSIRIS-REx mission in a photorealistic scene generator and, in turn, simulating event-based observations. The resulting streams of spatiotemporal data support future work on event-based multi-object tracking.
https://arxiv.org/abs/2309.06819
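A minimal sketch of the generic event-generation rule used by frame-to-event simulators: a pixel emits an event when its log intensity changes by more than a contrast threshold between consecutive rendered frames. This reflects the standard event-camera model, not the authors' specific simulator; the threshold is an assumed sensor parameter.

```python
import numpy as np

def events_from_frames(prev_frame, curr_frame, t, contrast_thresh=0.15, eps=1e-3):
    """Emit (x, y, t, polarity) tuples where |log I_curr - log I_prev| exceeds the threshold.
    Frames are 2D grayscale arrays in [0, 1]."""
    d = np.log(curr_frame + eps) - np.log(prev_frame + eps)
    ys, xs = np.nonzero(np.abs(d) >= contrast_thresh)
    polarity = np.sign(d[ys, xs]).astype(int)       # +1 brightness up, -1 brightness down
    return [(int(x), int(y), t, int(p)) for x, y, p in zip(xs, ys, polarity)]
```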
Accurate tracking of transparent objects, such as glasses, plays a critical role in many robotic tasks such as robot-assisted living. Due to the adaptive and often reflective texture of such objects, traditional tracking algorithms that rely on general-purpose learned features suffer from reduced performance. Recent research has proposed to instill transparency awareness into existing general object trackers by fusing purpose-built features. However, with the existing fusion techniques, the addition of new features changes the latent space, making it impossible to incorporate transparency awareness into trackers with fixed latent spaces. For example, many of today's transformer-based trackers are fully pre-trained and are sensitive to any latent space perturbations. In this paper, we present a new feature fusion technique that integrates transparency information into a fixed feature space, enabling its use in a broader range of trackers. Our proposed fusion module, composed of a transformer encoder and an MLP module, leverages key query-based transformations to embed the transparency information into the tracking pipeline. We also present a new two-step training strategy for our fusion module to effectively merge transparency features. We propose a new tracker architecture that uses our fusion techniques to achieve superior results for transparent object tracking. Our proposed method achieves competitive results with state-of-the-art trackers on TOTB, the largest transparent object tracking benchmark recently released. Our results and code will be made publicly available at this https URL.
https://arxiv.org/abs/2309.06701
The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2-3-7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards is available at this https URL. Baselines and development kits can be found at this https URL.
https://arxiv.org/abs/2309.06006
The introduction of robust backbones, such as Vision Transformers, has improved the performance of object tracking algorithms in recent years. However, these state-of-the-art trackers are computationally expensive since they have a large number of model parameters and rely on specialized hardware (e.g., a GPU) for faster inference. On the other hand, recent lightweight trackers are fast but are less accurate, especially on large-scale datasets. We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time. We also present a novel approach for fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization. The experimental results show that our MobileViT-based Tracker, MVT, surpasses the performance of recent lightweight trackers on the large-scale datasets GOT10k and TrackingNet, with a high inference speed. In addition, our method outperforms the popular DiMP-50 tracker despite having 4.7 times fewer model parameters and running at 2.8 times its speed on a GPU. The tracker code and models are available at this https URL
https://arxiv.org/abs/2309.05829
Multiple object tracking (MOT) tends to become more challenging when severe occlusions occur. In this paper, we analyze the limitations of traditional Convolutional Neural Network-based methods and Transformer-based methods in handling occlusions and propose DNMOT, an end-to-end trainable DeNoising Transformer for MOT. To address the challenge of occlusions, we explicitly simulate the scenarios in which occlusions occur. Specifically, we augment the trajectories with noise during training and make our model learn the denoising process in an encoder-decoder architecture, so that our model exhibits strong robustness and performs well in crowded scenes. Additionally, we propose a Cascaded Mask strategy to better coordinate the interaction between different types of queries in the decoder and prevent mutual suppression between neighboring trajectories in crowded scenes. Notably, the proposed method requires no additional modules such as a matching strategy or motion state estimation at inference time. We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
https://arxiv.org/abs/2309.04682
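A hedged sketch of the trajectory-noising step only, assuming boxes in (cx, cy, w, h) format: centers are shifted by a fraction of the box size and scales are jittered multiplicatively, producing the noised inputs a denoising decoder would learn to clean up. The noise magnitudes are assumptions, and the encoder-decoder denoising objective itself is not shown.

```python
import numpy as np

def jitter_boxes(boxes, center_noise=0.1, scale_noise=0.1, rng=None):
    """boxes: (N, 4) as (cx, cy, w, h). Returns noised copies for denoising training."""
    rng = rng or np.random.default_rng()
    noised = boxes.copy().astype(float)
    noised[:, 0] += rng.normal(0, center_noise, len(boxes)) * boxes[:, 2]   # shift cx by a fraction of w
    noised[:, 1] += rng.normal(0, center_noise, len(boxes)) * boxes[:, 3]   # shift cy by a fraction of h
    noised[:, 2:] *= np.exp(rng.normal(0, scale_noise, (len(boxes), 2)))    # multiplicative w/h jitter
    return noised
```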
The deployment of transformers for visual object tracking has shown state-of-the-art results on several benchmarks. However, transformer-based models are under-utilized for Siamese lightweight tracking due to the computational complexity of their attention blocks. This paper proposes an efficient self and mixed attention transformer-based architecture for lightweight tracking. The proposed backbone utilizes separable mixed attention transformers to fuse the template and search regions during feature extraction and generate superior feature encoding. Our prediction head performs global contextual modeling of the encoded features by leveraging efficient self-attention blocks for robust target state estimation. With these contributions, the proposed lightweight tracker deploys a transformer-based backbone and head module concurrently for the first time. Our ablation study testifies to the effectiveness of the proposed combination of backbone and head modules. Simulations show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses the performance of related lightweight trackers on the GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets, while running at 37 fps on CPU and 158 fps on GPU with 3.8M parameters. For example, it significantly surpasses the closely related trackers E.T.Track and MixFormerV2-S on GOT10k-test by margins of 7.9% and 5.8%, respectively, in the AO metric. The tracker code and model are available at this https URL
https://arxiv.org/abs/2309.03979
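A minimal sketch of the mixed-attention fusion pattern, in which search-region tokens attend over the concatenation of template and search tokens; it uses a plain multi-head attention layer to illustrate the idea and is not SMAT's separable attention design. Dimensions and head counts are assumptions.

```python
import torch
import torch.nn as nn

class MixedAttentionBlock(nn.Module):
    """Search tokens query both template and search tokens (illustrative, not SMAT's exact block)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
        # template: (B, Nt, C), search: (B, Ns, C)
        context = torch.cat([template, search], dim=1)          # keys/values span both regions
        fused, _ = self.attn(query=search, key=context, value=context)
        return self.norm(search + fused)                        # residual + norm

# Usage: block = MixedAttentionBlock(); out = block(torch.randn(2, 64, 256), torch.randn(2, 256, 256))
```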
The tracking of various fish species plays a significant role in understanding the behavior of individual fish and their groups. Existing tracking methods suffer from low accuracy or poor robustness. To address these concerns, this paper proposes a novel tracking approach, named FishMOT (Fish Multiple Object Tracking). This method combines object detection techniques with the IoU matching algorithm, thereby achieving efficient, precise, and robust fish detection and tracking. Diverging from other approaches, this method eliminates the need for multiple feature extractions and identity assignments for each individual, instead directly utilizing the output of the detector for tracking, thereby significantly reducing computation time and storage space. Furthermore, this method imposes minimal requirements on factors such as video quality and variations in individual appearance. As long as the detector can accurately locate and identify fish, effective tracking can be achieved. This approach enhances robustness and generalizability. Moreover, the algorithm employed in this method addresses the issue of missed detections without relying on complex feature matching or graph optimization algorithms, which contributes to improved accuracy and reliability. Experiments were conducted on the open-source video dataset provided by this http URL, and comparisons were made with state-of-the-art detector-based multi-object tracking methods. Additionally, comparisons were made with this http URL and TRex, two tools that demonstrate exceptional performance in the field of animal tracking. The experimental results demonstrate that the proposed method outperforms other approaches on various evaluation metrics, exhibiting faster speed and lower memory requirements. The source codes and pre-trained models are available at: this https URL
https://arxiv.org/abs/2309.02975
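A minimal sketch of the detection-plus-IoU association that FishMOT describes: build a (1 - IoU) cost matrix between track boxes and detector boxes and solve it with the Hungarian algorithm, rejecting pairs below an IoU gate. The gate value is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def associate(track_boxes, det_boxes, iou_gate=0.3):
    """Hungarian assignment on (1 - IoU); matched pairs below the gate are rejected."""
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_gate]
```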
In recent years, Video Object Segmentation (VOS) has emerged as a complementary method to Video Object Tracking (VOT). VOS focuses on classifying all the pixels around the target, allowing for precise shape labeling, while VOT primarily focuses on the approximate region where the target might be. However, traditional segmentation modules usually classify pixels frame by frame, disregarding information between adjacent frames. In this paper, we propose a new algorithm that addresses this limitation by analyzing the motion pattern using the inherent tensor structure. The tensor structure, obtained through Tucker2 tensor decomposition, proves to be effective in describing the target's motion. By incorporating this information, we achieve results competitive with SOTA on four benchmarks: LaSOT\cite{fan2019lasot}, AVisT\cite{noman2022avist}, OTB100\cite{7001050}, and GOT-10k\cite{huang2019got}. Furthermore, the proposed tracker is capable of real-time operation, adding value to its practical application.
https://arxiv.org/abs/2309.03247
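A minimal numpy sketch of a Tucker2 decomposition of a height x width x time tensor via HOSVD: factor matrices come from SVDs of the two spatial mode unfoldings, the temporal mode is left untouched, and the core is obtained by projection. The choice of tensor, the ranks, and this particular HOSVD variant are assumptions about how such a decomposition could be computed, not the paper's implementation.

```python
import numpy as np

def tucker2(tensor, rank1, rank2):
    """Tucker2 (HOSVD-style) of a 3-way tensor (H, W, T): factors on the two spatial
    modes, identity on the temporal mode. Returns (core, U1, U2)."""
    H, W, T = tensor.shape
    U1, _, _ = np.linalg.svd(tensor.reshape(H, -1), full_matrices=False)                     # mode-1 unfolding
    U2, _, _ = np.linalg.svd(np.moveaxis(tensor, 1, 0).reshape(W, -1), full_matrices=False)  # mode-2 unfolding
    U1, U2 = U1[:, :rank1], U2[:, :rank2]
    core = np.einsum('ijk,ia,jb->abk', tensor, U1, U2)    # project onto the spatial factors
    return core, U1, U2

# Usage: core, U1, U2 = tucker2(np.random.rand(64, 64, 16), rank1=8, rank2=8)
```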
Current state-of-the-art (SOTA) methods in visual object tracking often require extensive computational resources and vast amounts of training data, leading to a risk of overfitting. This study introduces a more efficient training strategy to mitigate overfitting and reduce computational requirements. We balance the training process with a mix of negative and positive samples from the outset, named Joint learning with Negative samples (JN). Negative samples refer to scenarios where the object from the template is not present in the search region, which helps to prevent the model from simply memorizing the target and instead encourages it to use the template for object localization. To handle the negative samples effectively, we adopt a distribution-based head, which models the bounding box as a distribution of distances to express uncertainty about the target's location in the presence of negative samples, offering an efficient way to manage mixed-sample training. Furthermore, our approach introduces a target-indicating token that encapsulates the target's precise location within the template image. This provides exact boundary details at negligible computational cost while improving performance. Our model, JN-256, exhibits superior performance on challenging benchmarks, achieving 75.8% AO on GOT-10k and 84.1% AUC on TrackingNet. Notably, JN-256 outperforms previous SOTA trackers that utilize larger models and higher input resolutions, even though it is trained with only half the number of training samples used in those works.
https://arxiv.org/abs/2309.02903
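A hedged sketch of a distribution-based box head in the generalized-focal-loss style: each box side is predicted as a discrete distribution over distance bins, and the regressed distance is the distribution's expectation, so a flat distribution naturally expresses uncertainty on negative samples. The bin count and range are assumptions, not JN's exact head.

```python
import torch
import torch.nn.functional as F

def distances_from_logits(logits: torch.Tensor, max_dist: float = 16.0) -> torch.Tensor:
    """logits: (B, 4, n_bins) per-side distance distributions (left, top, right, bottom).
    Returns expected distances (B, 4); flat distributions signal an uncertain target."""
    n_bins = logits.shape[-1]
    bins = torch.linspace(0, max_dist, n_bins, device=logits.device)   # bin centers
    probs = F.softmax(logits, dim=-1)                                  # distribution per side
    return (probs * bins).sum(dim=-1)                                  # expectation -> distance

# Usage: d = distances_from_logits(torch.randn(2, 4, 17))
```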
Recent Transformer-based visual tracking models have showcased superior performance. Nevertheless, prior works have been resource-intensive, requiring prolonged GPU training hours and incurring high GFLOPs during inference due to inefficient training methods and convolution-based target heads. This intensive resource use renders them unsuitable for real-world applications. In this paper, we present DETRack, a streamlined end-to-end visual object tracking framework. Our framework utilizes an efficient encoder-decoder structure in which the deformable transformer decoder, acting as a target head, achieves higher sparsity than traditional convolution heads, resulting in decreased GFLOPs. For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique, significantly accelerating the model's convergence. Comprehensive experiments affirm the effectiveness and efficiency of our proposed method. For instance, DETRack achieves 72.9% AO on the challenging GOT-10k benchmark using only 20% of the training epochs required by the baseline, and runs with lower GFLOPs than all the transformer-based trackers.
https://arxiv.org/abs/2309.02676
Object tracking is an important functionality of edge video analytic systems and services. Multi-object tracking (MOT) detects moving objects and tracks their locations frame by frame as real scenes are captured into a video. However, it is well known that real-time object tracking on the edge poses critical technical challenges, especially on edge devices with heterogeneous computing resources. This paper examines the performance issues and edge-specific optimization opportunities for object tracking. We show that even a well-trained and optimized MOT model may still suffer from random frame-dropping problems when edge devices have insufficient computation resources. We present several edge-specific performance optimization strategies, collectively coined EMO, to speed up real-time object tracking, ranging from window-based optimization to similarity-based optimization. Extensive experiments on popular MOT benchmarks demonstrate that our EMO approach is competitive with representative on-device object tracking methods in terms of run-time performance and tracking accuracy. EMO is released on GitHub at this https URL.
https://arxiv.org/abs/2309.02666
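A hedged sketch of a similarity-based optimization of the kind the abstract mentions: compare coarsely downsampled consecutive frames and skip the detector (reusing the previous boxes) when the change is small. The downsample size and threshold are assumptions, not EMO's actual criterion.

```python
import numpy as np

def should_skip_detection(prev_frame, curr_frame, size=32, thresh=0.05):
    """Frames are 2D grayscale arrays in [0, 255]. Returns True when the mean absolute
    difference between coarsely downsampled frames stays under the threshold."""
    def downsample(img):
        h, w = (img.shape[0] // size) * size, (img.shape[1] // size) * size
        # Block-average the cropped frame into a size x size grid.
        return img[:h, :w].reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    diff = np.abs(downsample(curr_frame.astype(float)) - downsample(prev_frame.astype(float)))
    return diff.mean() / 255.0 < thresh
```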
3D single object tracking (SOT) in point clouds is still a challenging problem due to appearance variation, distractors, and high sparsity of point clouds. Notably, in autonomous driving scenarios, the target object typically maintains spatial adjacency across consecutive frames, predominantly moving horizontally. This spatial continuity offers valuable prior knowledge for target localization. However, existing trackers, which often employ point-wise representations, struggle to efficiently utilize this knowledge owing to the irregular format of such representations. Consequently, they require elaborate designs and solving multiple subtasks to establish spatial correspondence. In this paper, we introduce BEVTrack, a simple yet strong baseline framework for 3D SOT. After converting consecutive point clouds into the common Bird's-Eye-View representation, BEVTrack inherently encodes spatial proximity and adeptly captures motion cues for tracking via a simple element-wise operation and convolutional layers. Additionally, to better deal with objects having diverse sizes and moving patterns, BEVTrack directly learns the underlying motion distribution rather than making a fixed Laplacian or Gaussian assumption as in previous works. Without bells and whistles, BEVTrack achieves state-of-the-art performance on KITTI and NuScenes datasets while maintaining a high inference speed of 122 FPS. The code will be released at this https URL.
https://arxiv.org/abs/2309.02185
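A minimal sketch of one way to rasterize a point cloud into a Bird's-Eye-View map: bucket points into a fixed x-y grid and keep the maximum height per cell. The grid extent, resolution, and per-cell statistic are assumptions; BEVTrack's own BEV encoding may differ.

```python
import numpy as np

def points_to_bev(points, x_range=(-20.0, 20.0), y_range=(-20.0, 20.0), cell=0.2):
    """points: (N, 3) xyz. Returns a (H, W) BEV map holding the max point height per cell."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.full((ny, nx), -np.inf)
    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    np.maximum.at(bev, (iy[valid], ix[valid]), points[valid, 2])   # max height per occupied cell
    bev[np.isinf(bev)] = 0.0                                       # empty cells -> 0
    return bev
```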
Object detection has long been a topic of high interest in computer vision literature. Motivated by the fact that annotating data for the multi-object tracking (MOT) problem is immensely expensive, recent studies have turned their attention to the unsupervised learning setting. In this paper, we push forward the state-of-the-art performance of unsupervised MOT methods by proposing UnsMOT, a novel framework that explicitly combines the appearance and motion features of objects with geometric information to provide more accurate tracking. Specifically, we first extract the appearance and motion features using CNN and RNN models, respectively. Then, we construct a graph of objects based on their relative distances in a frame, which is fed into a GNN model together with the CNN features to output geometric embeddings of objects optimized using an unsupervised loss function. Finally, associations between objects are found by matching not only similar extracted features but also the geometric embeddings of detections and tracklets. Experimental results show remarkable performance in terms of HOTA, IDF1, and MOTA metrics in comparison with state-of-the-art methods.
https://arxiv.org/abs/2309.01078
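A hedged sketch of constructing a graph over detections from their relative distances, as the abstract describes: connect each detection center to its k nearest neighbors and return a symmetric adjacency matrix for a downstream GNN. Using box centers and the value of k are assumptions.

```python
import numpy as np

def knn_graph(centers, k=3):
    """centers: (N, 2) box centers. Returns a symmetric (N, N) adjacency matrix
    linking each detection to its k nearest neighbors."""
    n = len(centers)
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # never link a node to itself
    adj = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(dists[i])[:min(k, n - 1)]:
            adj[i, j] = adj[j, i] = 1.0
    return adj
```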
Unsupervised learning is a challenging task due to the lack of labels. Multiple Object Tracking (MOT), which inevitably suffers from mutual object interference, occlusion, etc., is even more difficult without label supervision. In this paper, we explore the latent consistency of sample features across video frames and propose an Unsupervised Contrastive Similarity Learning method, named UCSL, including three contrast modules: self-contrast, cross-contrast, and ambiguity contrast. Specifically, i) self-contrast uses intra-frame direct and inter-frame indirect contrast to obtain discriminative representations by maximizing self-similarity. ii) Cross-contrast aligns cross- and continuous-frame matching results, mitigating the persistent negative effect caused by object occlusion. And iii) ambiguity contrast matches ambiguous objects with each other to further increase the certainty of subsequent object association in an implicit manner. On existing benchmarks, our method outperforms existing unsupervised methods using only limited help from the ReID head, and even provides higher accuracy than many fully supervised methods.
https://arxiv.org/abs/2309.00942
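A minimal sketch of a contrastive similarity objective in the spirit of the self-contrast module: embeddings of the same object in two frames form positive pairs, all other pairs are negatives, and a cross-entropy over the cosine-similarity matrix maximizes self-similarity. The temperature and the symmetric formulation are assumptions; the paper's cross-contrast and ambiguity-contrast terms are not shown.

```python
import torch
import torch.nn.functional as F

def contrastive_similarity_loss(feat_a: torch.Tensor, feat_b: torch.Tensor, temp: float = 0.1):
    """feat_a, feat_b: (N, C) embeddings of the same N objects in two frames (row i matches row i).
    Cross-entropy over the cosine-similarity matrix pulls matched pairs together."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.t() / temp                        # (N, N) similarity matrix
    targets = torch.arange(len(a), device=a.device)  # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```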
Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts and are not consistently tracked over time. In fact, state-of-the-art models achieve pixel-level accuracy and temporal consistency by relying on supervised object detection with additional ID labels for the association through time. This paper proposes a video object-centric model for MOT. It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module that builds complete object prototypes to handle occlusions. Benefiting from object-centric learning, we only require sparse detection labels (0%-6.25%) for object localization and feature binding. Relying on our self-supervised Expectation-Maximization-inspired loss for object association, our approach requires no ID labels. Our experiments significantly narrow the gap between existing object-centric models and the fully supervised state of the art, and outperform several unsupervised trackers.
https://arxiv.org/abs/2309.00233