The complex dynamicity of open-world objects presents non-negligible challenges for multi-object tracking (MOT), often manifesting as severe deformation, fast motion, and occlusion. Most methods that depend solely on coarse-grained object cues, such as boxes and the overall appearance of the object, are susceptible to degradation due to the distorted internal relationships of dynamic objects. To address this problem, this work proposes NetTrack, an efficient, generic, and affordable tracking framework that introduces fine-grained learning robust to dynamicity. Specifically, NetTrack constructs a dynamicity-aware association with a fine-grained Net, leveraging point-level visual cues, and incorporates a corresponding fine-grained sampler and matching method. Furthermore, NetTrack learns object-text correspondence for fine-grained localization. To evaluate MOT in extremely dynamic open-world scenarios, a bird flock tracking (BFT) dataset is constructed, which exhibits high dynamicity with diverse species and open-world scenarios. Comprehensive evaluation on BFT validates the effectiveness of fine-grained learning on object dynamicity, and thorough transfer experiments on challenging open-world benchmarks, i.e., TAO, TAO-OW, AnimalTrack, and GMOT-40, validate the strong generalization ability of NetTrack even without finetuning. Project page: this https URL.
https://arxiv.org/abs/2403.11186
In this paper, we address the challenge of multi-object tracking (MOT) in moving Unmanned Aerial Vehicle (UAV) scenarios, where irregular flight trajectories, such as hovering, turning left/right, and moving up/down, lead to significantly greater complexity than fixed-camera MOT. Specifically, changes in the scene background not only render traditional frame-to-frame object IOU association methods ineffective but also introduce significant view shifts in the objects, which complicates tracking. To overcome these issues, we propose a novel universal HomView-MOT framework, which, for the first time, harnesses the view homography inherent in changing scenes to solve MOT challenges in moving environments, incorporating Homographic Matching and View-Centric concepts. We introduce a Fast Homography Estimation (FHE) algorithm for rapid computation of homography matrices between video frames, enabling object View-Centric ID Learning (VCIL) that leverages multi-view homography to learn cross-view ID features. Concurrently, our Homographic Matching Filter (HMF) maps object bounding boxes from different frames onto a common view plane for a more physically realistic IOU association. Extensive experiments show that these innovations allow HomView-MOT to achieve state-of-the-art performance on the prominent UAV MOT datasets VisDrone and UAVDT.
https://arxiv.org/abs/2403.10830
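The Homographic Matching Filter idea above lends itself to a short sketch: estimate the inter-frame homography from feature matches, warp the previous frame's boxes into the current view plane, and only then compute IoU. The sketch below is a minimal illustration under assumed helper names (it is not the paper's FHE algorithm, which targets much faster estimation):

```python
# A minimal sketch of homography-compensated IoU association, in the spirit of
# HomView-MOT's Homographic Matching Filter. All helper names are illustrative
# assumptions, not the paper's code.
import cv2
import numpy as np

def estimate_homography(prev_gray, curr_gray):
    """Estimate the inter-frame homography from ORB feature matches."""
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(prev_gray, None)
    k2, d2 = orb.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H

def warp_box(box, H):
    """Map an (x1, y1, x2, y2) box into the current view plane."""
    x1, y1, x2, y2 = box
    corners = np.float32([[x1, y1], [x2, y1], [x2, y2], [x1, y2]]).reshape(-1, 1, 2)
    w = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    return np.array([w[:, 0].min(), w[:, 1].min(), w[:, 0].max(), w[:, 1].max()])

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)
```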
In the field of multi-object tracking (MOT), traditional methods often rely on the Kalman Filter for motion prediction, leveraging its strengths in linear motion scenarios. However, the inherent limitations of these methods become evident when confronted with the complex, nonlinear motions and occlusions prevalent in dynamic environments like sports and dance. This paper explores the possibility of replacing the Kalman Filter with learning-based motion models that enhance tracking accuracy and adaptability beyond the constraints of Kalman Filter-based systems. We propose MambaTrack, an online motion-based tracker that outperforms all existing motion-based trackers on the challenging DanceTrack and SportsMOT datasets. Moreover, we further exploit the potential of the state-space model in trajectory feature extraction to boost tracking performance and propose MambaTrack+, which achieves state-of-the-art performance on the DanceTrack dataset with 56.1 HOTA and 54.9 IDF1.
https://arxiv.org/abs/2403.10826
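To make the "replace the Kalman Filter with a learned motion model" idea concrete, here is a toy predictor that regresses the next box from a track's recent history. The GRU below is a stand-in chosen for brevity; MambaTrack itself builds on a state-space (Mamba-style) model, and all names and dimensions here are assumptions:

```python
# A toy learning-based motion predictor: consumes a track's recent box history
# and regresses the next-frame box. Illustrates the interface only; the paper
# uses a state-space model, not this GRU.
import torch
import torch.nn as nn

class MotionPredictor(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(input_size=4, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 4)  # predicted (dx, dy, dw, dh)

    def forward(self, history):
        # history: (B, T, 4) normalized (cx, cy, w, h) boxes over T past frames
        _, h = self.encoder(history)
        delta = self.head(h[-1])
        return history[:, -1, :] + delta  # predicted next-frame box

model = MotionPredictor()
past = torch.randn(8, 10, 4)   # 8 tracks, 10-frame histories
next_box = model(past)         # replaces the Kalman predict step
```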
Real-time high-accuracy optical flow estimation is a crucial component in various applications, including localization and mapping in robotics, object tracking, and activity recognition in computer vision. While recent learning-based optical flow methods have achieved high accuracy, they often come with heavy computation costs. In this paper, we propose a highly efficient optical flow architecture, called NeuFlow, that addresses both accuracy and computational cost. The architecture follows a global-to-local scheme. Given features of the input images extracted at different spatial resolutions, global matching is employed to estimate an initial optical flow at 1/16 resolution, capturing large displacements, which is then refined at 1/8 resolution with lightweight CNN layers for better accuracy. We evaluate our approach on a Jetson Orin Nano and an RTX 2080 to demonstrate efficiency improvements across different computing platforms. We achieve a notable 10x-80x speedup compared to several state-of-the-art methods while maintaining comparable accuracy. Our approach runs at around 30 FPS on edge computing platforms, which represents a significant breakthrough in deploying complex computer vision tasks such as SLAM on small robots like drones. The full training and evaluation code is available at this https URL.
https://arxiv.org/abs/2403.10425
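The global-matching step of a global-to-local scheme can be sketched compactly: correlate all feature pairs at 1/16 resolution and take a soft argmax over target coordinates to get an initial flow. This is an illustrative reconstruction of the general technique, not NeuFlow's actual implementation:

```python
# A sketch of global matching for coarse flow: all-pairs correlation followed
# by a soft argmax over target coordinates. Shapes and scaling are assumptions.
import torch

def global_match(f1, f2):
    # f1, f2: (B, C, H, W) feature maps at 1/16 resolution
    B, C, H, W = f1.shape
    q = f1.flatten(2).transpose(1, 2)                 # (B, HW, C)
    k = f2.flatten(2)                                 # (B, C, HW)
    prob = (torch.bmm(q, k) / C ** 0.5).softmax(-1)   # correlation -> matching prob
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], -1).float().view(-1, 2).to(f1.device)  # (HW, 2)
    match = prob @ grid                               # soft-argmax target coords
    flow = (match - grid).view(B, H, W, 2).permute(0, 3, 1, 2)
    return flow  # initial flow, to be upsampled and refined at 1/8 by a light CNN

flow16 = global_match(torch.randn(1, 128, 30, 40), torch.randn(1, 128, 30, 40))
```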
Visual object tracking aims to localize the target object in each frame based on its initial appearance in the first frame. Depending on the input modality, tracking tasks can be divided into RGB tracking and RGB+X (e.g., RGB+N and RGB+D) tracking. Despite the different input modalities, the core aspect of tracking is temporal matching. Based on this common ground, we present a general framework to unify various tracking tasks, termed OneTracker. OneTracker first performs large-scale pre-training on an RGB tracker called Foundation Tracker. This pretraining phase equips the Foundation Tracker with a stable ability to estimate the location of the target object. Then we regard other modality information as a prompt and build Prompt Tracker upon Foundation Tracker. By freezing the Foundation Tracker and only adjusting some additional trainable parameters, Prompt Tracker inherits the strong localization ability of Foundation Tracker and achieves parameter-efficient finetuning on downstream RGB+X tracking tasks. To evaluate the effectiveness of our general framework OneTracker, which consists of Foundation Tracker and Prompt Tracker, we conduct extensive experiments on 6 popular tracking tasks across 11 benchmarks, and our OneTracker outperforms other models and achieves state-of-the-art performance.
https://arxiv.org/abs/2403.09634
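The parameter-efficient recipe described above, freezing the foundation tracker and tuning only a small set of prompt parameters that inject the X modality, can be sketched as follows. The module and token interfaces are assumptions for illustration:

```python
# A minimal sketch of prompt-based parameter-efficient finetuning: the
# pretrained tracker is frozen and only prompt/projection parameters train.
# `foundation`, token shapes, and dims are stand-in assumptions.
import torch
import torch.nn as nn

class PromptTracker(nn.Module):
    def __init__(self, foundation: nn.Module, embed_dim=256, n_prompts=8):
        super().__init__()
        self.foundation = foundation
        for p in self.foundation.parameters():
            p.requires_grad = False          # frozen RGB foundation tracker
        self.prompts = nn.Parameter(torch.zeros(n_prompts, embed_dim))
        self.x_proj = nn.Linear(embed_dim, embed_dim)  # projects X-modality tokens

    def forward(self, rgb_tokens, x_tokens):
        # prepend learnable prompts and projected X-modality tokens
        B = rgb_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([prompts, self.x_proj(x_tokens), rgb_tokens], dim=1)
        return self.foundation(tokens)

# only the prompt and projection parameters receive gradients:
# trainable = [p for p in model.parameters() if p.requires_grad]
```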
We propose a novel Transformer-based module to address the data association problem for multi-object tracking. From detections obtained by a pretrained detector, this module uses only the coordinates of bounding boxes to estimate an affinity score between pairs of tracks extracted from two distinct temporal windows. This module, named TWiX, is trained on sets of tracks with the objective of discriminating pairs of tracks coming from the same object from those which are not. Our module does not use the intersection-over-union measure, nor does it require any motion priors or any camera motion compensation technique. By inserting TWiX within an online cascade matching pipeline, our tracker C-TWiX achieves state-of-the-art performance on the DanceTrack and KITTIMOT datasets, and gets competitive results on the MOT17 dataset. The code will be made available upon publication.
https://arxiv.org/abs/2403.08018
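A rough sketch of a TWiX-style affinity head follows: tracklet box coordinates from two temporal windows are embedded as tokens, contextualized by a Transformer encoder, and scored pairwise. The pooling of each window into a single token and the dot-product scoring are assumptions, not the paper's exact design:

```python
# A coordinate-only affinity head sketch: Transformer over tracklet tokens,
# pairwise dot-product scores. Dimensions and design details are assumptions.
import torch
import torch.nn as nn

class BoxAffinity(nn.Module):
    def __init__(self, d=64, window=5):
        super().__init__()
        self.embed = nn.Linear(4 * window, d)   # one token per tracklet window
        enc = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)

    def forward(self, past, future):
        # past: (N, window, 4), future: (M, window, 4) box coordinates only
        toks = torch.cat([past.flatten(1), future.flatten(1)], dim=0)
        z = self.encoder(self.embed(toks).unsqueeze(0)).squeeze(0)
        zp, zf = z[: past.size(0)], z[past.size(0):]
        return zp @ zf.t()                       # (N, M) affinity matrix

aff = BoxAffinity()(torch.rand(3, 5, 4), torch.rand(4, 5, 4))
# feed `aff` to a matching step (e.g., Hungarian) inside the cascade pipeline
```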
Drones, also known as UAVs, were originally designed for military purposes. With technological advances, they can now be seen in most aspects of life, from filming to logistics. Their increased use has sometimes made collaboration between them essential to perform a task efficiently within a defined process. This paper investigates the use of a combined centralised and decentralised architecture for the collaborative operation of drones in a parts-delivery scenario, to enable and expedite the operation of the factories of the future. The centralised and decentralised approaches were extensively researched, with experimentation undertaken to determine the appropriateness of each approach for this use case. Decentralised control was utilised to remove the need for excessive communication during the operation of the drones, resulting in smoother operations. Initial results suggested that the decentralised approach is more appropriate for this use case. The individual functionalities necessary for the implementation of a decentralised architecture were proven and assessed, determining that a combination of multiple individual functionalities, namely VSLAM, dynamic collision avoidance, and object tracking, would give an appropriate solution for use in an industrial setting. A final architecture for the parts-delivery system was proposed for future work, using a combined centralised and decentralised approach to combat the limitations inherent in each architecture.
https://arxiv.org/abs/2403.07635
Hyperspectral video (HSV) offers valuable spatial, spectral, and temporal information simultaneously, making it highly suitable for handling challenges such as background clutter and visual similarity in object tracking. However, existing methods primarily focus on band regrouping and rely on RGB trackers for feature extraction, resulting in limited exploration of spectral information and difficulties in achieving complementary representations of object features. In this paper, a spatial-spectral fusion network with spectral angle awareness (SSF-Net) is proposed for hyperspectral (HS) object tracking. Firstly, to address the issue of insufficient spectral feature extraction in existing networks, a spatial-spectral feature backbone ($S^2$FB) is designed. With its spatial and spectral extraction branches, a joint representation of texture and spectrum is obtained. Secondly, a spectral attention fusion module (SAFM) is presented to capture the intra- and inter-modality correlation and obtain fused features from the HS and RGB modalities. It incorporates the visual information into the HS spectral context to form a robust representation. Thirdly, to ensure a more accurate response of the tracker to the object position, a spectral angle awareness module (SAAM) investigates the region-level spectral similarity between the template and search images during the prediction stage. Furthermore, we develop a novel spectral angle awareness loss (SAAL) to offer guidance for the SAAM based on similar regions. Finally, to obtain robust tracking results, a weighted prediction method is employed to combine the HS and RGB predicted motions of objects, leveraging the strengths of each modality. Extensive experiments on the HOTC dataset demonstrate the effectiveness of the proposed SSF-Net compared with state-of-the-art trackers.
https://arxiv.org/abs/2403.05852
Current event-/frame-event-based trackers are evaluated on short-term tracking datasets; however, real-world scenarios involve long-term tracking, and the performance of existing tracking algorithms in these scenarios remains unclear. In this paper, we first propose a new long-term and large-scale frame-event single object tracking dataset, termed FELT. It contains 742 videos and 1,594,474 RGB frame and event stream pairs, making it the largest frame-event tracking dataset to date. We re-train and evaluate 15 baseline trackers on our dataset for future works to compare against. More importantly, we find that the RGB frames and event streams are naturally incomplete due to the influence of challenging factors and the spatially sparse event flow. In response, we propose a novel associative memory Transformer network as a unified backbone, introducing modern Hopfield layers into multi-head self-attention blocks to fuse RGB and event data. Extensive experiments on both FELT and the RGB-T tracking dataset LasHeR fully validate the effectiveness of our model. The dataset and source code can be found at this https URL.
https://arxiv.org/abs/2403.05839
Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e., "where") in videos. Yet, knowing merely "where" is insufficient in many crucial applications. In comparison, semantic understanding such as fine-grained behaviors, interactions, and overall summarized captions (i.e., "what") from videos, associated with "where", is highly desired for comprehensive video analysis. Thus motivated, we introduce Semantic Multi-Object Tracking (SMOT), which aims to estimate object trajectories and meanwhile understand the semantic details of associated trajectories, including instance captions, instance interactions, and overall video captions, integrating "where" and "what" for tracking. In order to foster the exploration of SMOT, we propose BenSMOT, a large-scale Benchmark for Semantic MOT. Specifically, BenSMOT comprises 3,292 videos with 151K frames, covering various scenarios for semantic tracking of humans. BenSMOT provides annotations for the trajectories of targets, along with associated instance captions in natural language, instance interactions, and an overall caption for each video sequence. To the best of our knowledge, BenSMOT is the first publicly available benchmark for SMOT. Besides, to encourage future research, we present a novel tracker named SMOTer, which is specially designed and end-to-end trained for SMOT, showing promising performance. By releasing BenSMOT, we expect to go beyond conventional MOT by predicting "where" and "what" for SMOT, opening up a new direction in tracking for video understanding. Our BenSMOT and SMOTer will be released.
https://arxiv.org/abs/2403.05021
Multiple Object Tracking (MOT) is a critical area within computer vision, with a broad spectrum of practical applications. Current research has primarily focused on the development of tracking algorithms and the enhancement of post-processing techniques. Yet, there has been a lack of thorough examination of the nature of the tracking data itself. In this study, we pioneer an exploration into the distribution patterns of tracking data and identify a pronounced long-tail distribution issue within existing MOT datasets. We note a significant imbalance in the distribution of trajectory lengths across different pedestrians, a phenomenon we refer to as the "pedestrian trajectory long-tail distribution". Addressing this challenge, we introduce a bespoke strategy designed to mitigate the effects of this skewed distribution. Specifically, we propose two data augmentation strategies designed for the two viewpoint states, Stationary Camera View Data Augmentation (SVA) and Dynamic Camera View Data Augmentation (DVA), together with a Group Softmax (GS) module for Re-ID. SVA backtracks and predicts the pedestrian trajectories of tail classes, and DVA uses a diffusion model to change the background of the scene. GS divides the pedestrians into unrelated groups and performs the softmax operation on each group individually. Our proposed strategies can be integrated into numerous existing tracking systems, and extensive experimentation validates the efficacy of our method in reducing the influence of the long-tail distribution on multi-object tracking performance. The code is available at this https URL.
https://arxiv.org/abs/2403.04700
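The Group Softmax idea admits a compact sketch: partition identity classes into groups (e.g., head vs. tail by trajectory length) and compute the softmax cross-entropy within each group, so tail identities never compete directly with head identities. The grouping rule and shapes below are illustrative assumptions:

```python
# A minimal Group Softmax sketch for Re-ID classification. The head/tail
# grouping shown here is an illustrative assumption.
import torch
import torch.nn.functional as F

def group_softmax_loss(logits, targets, groups):
    """logits: (B, C) Re-ID scores; groups: list of LongTensors of class ids."""
    loss = 0.0
    for g in groups:
        mask = torch.isin(targets, g)
        if not mask.any():
            continue
        sub_logits = logits[mask][:, g]              # restrict scores to the group
        # remap absolute class ids to positions within the group
        remap = {c.item(): i for i, c in enumerate(g)}
        sub_targets = torch.tensor([remap[t.item()] for t in targets[mask]])
        loss = loss + F.cross_entropy(sub_logits, sub_targets)
    return loss

groups = [torch.arange(0, 50), torch.arange(50, 60)]   # head ids vs. tail ids
loss = group_softmax_loss(torch.randn(16, 60), torch.randint(0, 60, (16,)), groups)
```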
This paper presents a novel multi-modal Multi-Object Tracking (MOT) algorithm for self-driving cars that combines camera and LiDAR data. Camera frames are processed with a state-of-the-art 3D object detector, whereas classical clustering techniques are used to process LiDAR observations. The proposed MOT algorithm comprises a three-step association process, an Extended Kalman Filter (EKF) for estimating the motion of each detected dynamic obstacle, and a track management phase. The EKF motion model requires the current measured relative position and orientation of the observed object and the longitudinal and angular velocities of the ego vehicle as inputs. Unlike most state-of-the-art multi-modal MOT approaches, the proposed algorithm does not rely on maps or knowledge of the ego global pose. Moreover, it uses a 3D detector exclusively for cameras and is agnostic to the type of LiDAR sensor used. The algorithm is validated both in simulation and with real-world data, with satisfactory results.
https://arxiv.org/abs/2403.04112
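A bare-bones per-obstacle EKF in the spirit of the tracker described above: constant-velocity state, position-only updates, and a crude compensation term fed by the ego vehicle's longitudinal and angular velocities. The state layout and the compensation step are simplified assumptions, not the paper's exact motion model:

```python
# A toy per-obstacle EKF with ego-motion compensation. State, noise levels,
# and the compensation rule are simplified assumptions for illustration.
import numpy as np

class EKFTrack:
    def __init__(self, x0, dt=0.1):
        self.x = np.array([x0[0], x0[1], 0.0, 0.0])   # [px, py, vx, vy], ego frame
        self.P = np.eye(4)
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt
        self.Q = 0.01 * np.eye(4)
        self.H = np.eye(2, 4)                          # we observe position only
        self.R = 0.1 * np.eye(2)

    def predict(self, ego_v, ego_yaw_rate, dt=0.1):
        # compensate ego motion: rotate by the yaw increment, shift by travel
        a = -ego_yaw_rate * dt
        R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
        self.x[:2] = R @ self.x[:2] - np.array([ego_v * dt, 0.0])
        self.x = self.F @ self.x                       # constant-velocity step
        self.P = self.F @ self.P @ self.F.T + self.Q

    def update(self, z):
        y = z - self.H @ self.x                        # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```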
In this paper, we introduce a novel benchmark, dubbed VastTrack, to facilitate the development of more general visual tracking by encompassing abundant classes and videos. VastTrack possesses several attractive properties: (1) Vast Object Categories. In particular, it covers target objects from 2,115 classes, largely surpassing the object categories of existing popular benchmarks (e.g., GOT-10k with 563 classes and LaSOT with 70 categories). With such vast object classes, we expect to learn more general object tracking. (2) Larger Scale. Compared with current benchmarks, VastTrack offers 50,610 sequences with 4.2 million frames, which makes it the largest benchmark to date in terms of the number of videos, and thus could benefit training even more powerful visual trackers in the deep learning era. (3) Rich Annotation. Besides conventional bounding box annotations, VastTrack also provides linguistic descriptions for the videos. The rich annotations of VastTrack enable the development of both vision-only and vision-language tracking. To ensure precise annotation, all videos are manually labeled with multiple rounds of careful inspection and refinement. To understand the performance of existing trackers and to provide baselines for future comparison, we extensively assess 25 representative trackers. The results, not surprisingly, show significant drops compared to those on current datasets, due to the lack of abundant categories and videos from diverse scenarios for training, and more effort is required to improve general tracking. Our VastTrack and all the evaluation results will be made publicly available at this https URL.
https://arxiv.org/abs/2403.03493
Accurate data association is crucial in reducing confusion, such as ID switches and assignment errors, in multi-object tracking (MOT). However, existing advanced methods often overlook the diversity among trajectories and the ambiguity and conflicts present in motion and appearance cues, leading to confusion among detections, trajectories, and associations when performing simple global data association. To address this issue, we propose a simple, versatile, and highly interpretable data association approach called Decomposed Data Association (DDA). DDA decomposes the traditional association problem into multiple sub-problems using a series of non-learning-based modules and selectively addresses the confusion in each sub-problem by incorporating targeted exploitation of new cues. Additionally, we introduce Occlusion-aware Non-Maximum Suppression (ONMS) to retain more occluded detections, thereby increasing opportunities for association with trajectories and indirectly reducing the confusion caused by missed detections. Finally, based on DDA and ONMS, we design a powerful multi-object tracker named DeconfuseTrack, specifically focused on resolving confusion in MOT. Extensive experiments conducted on the MOT17 and MOT20 datasets demonstrate that our proposed DDA and ONMS significantly enhance the performance of several popular trackers. Moreover, DeconfuseTrack achieves state-of-the-art performance on the MOT17 and MOT20 test sets and significantly outperforms the baseline tracker ByteTrack in metrics such as HOTA, IDF1, and AssA. This validates that our tracking design effectively reduces the confusion caused by simple global association.
https://arxiv.org/abs/2403.02767
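Occlusion-aware NMS can be sketched in a few lines: rather than discarding a detection suppressed by a higher-scoring box, keep it in a separate pool of occlusion candidates that the association stage may still use. The retention rule below is an illustrative assumption about ONMS, not the paper's exact criterion:

```python
# A sketch of occlusion-aware NMS: suppressed detections are retained as
# occlusion candidates instead of being dropped. The rule is an assumption.
import numpy as np

def _iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    ar = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (ar(a) + ar(b) - inter + 1e-9)

def occlusion_aware_nms(boxes, scores, iou_thr=0.6):
    order = np.argsort(scores)[::-1]
    keep, occluded = [], []
    for i in order:
        if any(_iou(boxes[i], boxes[j]) > iou_thr for j in keep):
            occluded.append(i)   # retained as an occlusion candidate, not dropped
        else:
            keep.append(i)
    return keep, occluded        # associate `keep` first, then try `occluded`
```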
In Multiple Object Tracking, objects often exhibit non-linear motion with acceleration and deceleration and irregular direction changes. Tracking-by-detection (TBD) with Kalman Filter motion prediction works well in pedestrian-dominant scenarios but falls short in complex situations where multiple objects perform non-linear and diverse motion simultaneously. To tackle such complex non-linear motion, we propose a real-time diffusion-based MOT approach named DiffMOT. Specifically, for the motion predictor component, we propose a novel Decoupled Diffusion-based Motion Predictor ($D^2$MP). It models the entire distribution of the various motions presented by the data as a whole, and predicts an individual object's motion conditioned on that object's historical motion information. Furthermore, it optimizes the diffusion process to use far fewer sampling steps. As a MOT tracker, DiffMOT runs in real time at 22.7 FPS and outperforms the state of the art on the DanceTrack and SportsMOT datasets with HOTA scores of 63.4 and 76.2, respectively. To the best of our knowledge, DiffMOT is the first to introduce a diffusion probabilistic model into MOT to tackle non-linear motion prediction.
https://arxiv.org/abs/2403.02075
In the realm of computer vision and graphics, accurately establishing correspondences between geometric 3D shapes is pivotal for applications like object tracking, registration, texture transfer, and statistical shape analysis. Moving beyond traditional hand-crafted and data-driven feature learning methods, we incorporate spectral methods with deep learning, focusing on functional maps (FMs) and optimal transport (OT). Traditional OT-based approaches, often reliant on entropy-regularized OT in a learning-based framework, face computational challenges due to their quadratic cost. Our key contribution is to employ the sliced Wasserstein distance (SWD) for OT, a valid and fast optimal transport metric, in an unsupervised shape matching framework. This unsupervised framework integrates functional map regularizers with a novel OT-based loss derived from SWD, enhancing feature alignment between shapes treated as discrete probability measures. We also introduce an adaptive refinement process utilizing entropy-regularized OT, further refining feature alignments for accurate point-to-point correspondences. Our method demonstrates superior performance in non-rigid shape matching, including near-isometric and non-isometric scenarios, and excels in downstream tasks like segmentation transfer. The empirical results on diverse datasets highlight our framework's effectiveness and generalization capabilities, setting new standards in non-rigid shape matching with efficient OT metrics and an adaptive refinement module.
https://arxiv.org/abs/2403.01781
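The sliced Wasserstein distance at the heart of the method is simple to state and compute: project both point sets onto random unit directions, sort the 1-D projections, and average the resulting 1-D Wasserstein distances. A minimal sketch, assuming equal-size point clouds treated as uniform discrete measures:

```python
# A compact sliced Wasserstein distance between two equal-size point clouds.
import torch

def sliced_wasserstein(x, y, n_proj=128, p=2):
    # x, y: (N, d) points treated as uniform discrete measures
    d = x.size(1)
    theta = torch.randn(d, n_proj)
    theta = theta / theta.norm(dim=0, keepdim=True)   # random unit directions
    xp, _ = torch.sort(x @ theta, dim=0)              # sorted 1-D projections
    yp, _ = torch.sort(y @ theta, dim=0)
    # average p-Wasserstein over all 1-D slices (closed form via sorting)
    return ((xp - yp).abs() ** p).mean() ** (1.0 / p)

swd = sliced_wasserstein(torch.randn(500, 3), torch.randn(500, 3))
```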
In robotics, motion capture systems have been widely used to measure the accuracy of localization algorithms. Moreover, this infrastructure can also be used for other computer vision tasks, such as the evaluation of Visual (-Inertial) SLAM dynamic initialization, multi-object tracking, or automatic annotation. Yet, to work optimally, these functionalities require accurate and reliable spatial-temporal calibration parameters between the camera and the global pose sensor. In this study, we provide two novel solutions to estimate these calibration parameters. Firstly, we design an offline target-based method with high accuracy and consistency, in which the spatial-temporal parameters, camera intrinsics, and trajectory are optimized simultaneously. Then, we propose an online target-less method, eliminating the need for a calibration target and enabling the estimation of time-varying spatial-temporal parameters. Additionally, we perform a detailed observability analysis for the target-less method. Our theoretical findings regarding observability are validated by simulation experiments and provide explainable guidelines for calibration. Finally, the accuracy and consistency of the two proposed methods are evaluated with hand-held real-world datasets where traditional hand-eye calibration methods do not work.
https://arxiv.org/abs/2403.00976
This paper introduces the task of Auditory Referring Multi-Object Tracking (AR-MOT), which dynamically tracks specific objects in a video sequence based on audio expressions, a challenging problem in autonomous driving. Due to the lack of semantic modeling capacity for audio and video, existing works have mainly focused on text-based multi-object tracking, which often comes at the cost of tracking quality, interaction efficiency, and even the safety of assistance systems, limiting the application of such methods in autonomous driving. In this paper, we delve into the problem of AR-MOT from the perspective of audio-video fusion and audio-video tracking. We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers. The dual streams are intertwined with our Bidirectional Frequency-domain Cross-attention Fusion Module (Bi-FCFM), which bidirectionally fuses audio and video features in both the frequency and spatiotemporal domains. Moreover, we propose the Audio-visual Contrastive Tracking Learning (ACTL) regime to extract homogeneous semantic features between expressions and visual objects by effectively learning homogeneous features between different audio and video objects. Aside from the architectural design, we establish the first set of large-scale AR-MOT benchmarks, including Echo-KITTI, Echo-KITTI+, and Echo-BDD. Extensive experiments on the established benchmarks demonstrate the effectiveness of the proposed EchoTrack model and its components. The source code and datasets will be made publicly available at this https URL.
https://arxiv.org/abs/2402.18302
Adversarial attacks in visual object tracking have significantly degraded the performance of advanced trackers by introducing imperceptible perturbations into images. These attack methods have garnered considerable attention from researchers in recent years. However, there is still a lack of research on designing adversarial defense methods specifically for visual object tracking. To address these issues, we propose an effective additional pre-processing network called DuaLossDef that eliminates adversarial perturbations during the tracking process. DuaLossDef is deployed ahead of the search branch or template branch of the tracker to apply defensive transformations to the input images. Moreover, it can be seamlessly integrated with other visual trackers as a plug-and-play module without requiring any parameter adjustments. We train DuaLossDef using adversarial training, specifically employing Dua-Loss to generate adversarial samples that simultaneously attack the classification and regression branches of the tracker. Extensive experiments conducted on the OTB100, LaSOT, and VOT2018 benchmarks demonstrate that DuaLossDef maintains excellent defense robustness against adversarial attack methods in both adaptive and non-adaptive attack scenarios. Moreover, when transferred to other trackers, the defense network exhibits reliable transferability. Finally, DuaLossDef achieves a processing time of up to 5 ms/frame, allowing seamless integration with existing high-speed trackers without introducing significant computational overhead. We will make our code publicly available soon.
https://arxiv.org/abs/2402.17976
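Adversarial training with a dual loss of this kind can be sketched with a PGD-style inner loop that ascends the sum of a classification and a regression loss. The tracker interface and loss choices below are assumptions for illustration, not the paper's exact Dua-Loss:

```python
# A hedged sketch of adversarial sample generation attacking classification
# and regression heads jointly. `tracker` and its outputs are stand-ins.
import torch
import torch.nn.functional as F

def dual_loss_pgd(tracker, images, cls_target, reg_target,
                  eps=8 / 255, alpha=2 / 255, steps=10):
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        cls_out, reg_out = tracker(adv)                    # assumed interface
        loss = (F.cross_entropy(cls_out, cls_target)       # classification term
                + F.l1_loss(reg_out, reg_target))          # regression term
        grad, = torch.autograd.grad(loss, adv)
        adv = adv.detach() + alpha * grad.sign()           # PGD ascent step
        adv = images + (adv - images).clamp(-eps, eps)     # project to eps-ball
        adv = adv.clamp(0, 1)
    return adv.detach()
```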
Modern robotic systems are required to operate in dense dynamic environments, requiring highly accurate real-time track identification and estimation. For 3D multi-object tracking, recent approaches process a single measurement frame recursively with greedy association and are prone to errors in ambiguous association decisions. Our method, Sliding Window Tracker (SWTrack), yields more accurate association and state estimation by batch processing many frames of sensor data while remaining capable of running online in real time. The most probable track associations are identified by evaluating all possible track hypotheses across the temporal sliding window. A novel graph optimization approach is formulated to solve the multidimensional assignment problem, with lifted graph edges introduced to account for missed detections and graph sparsity enforced to retain real-time efficiency. We evaluate our SWTrack implementation on the NuScenes autonomous driving dataset to demonstrate improved tracking performance.
https://arxiv.org/abs/2402.17892
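A much-simplified stand-in for sliding-window association: aggregate track-to-detection costs over the last K frames and solve the resulting 2-D assignment with the Hungarian algorithm. The paper's lifted-edge graph optimization solves a harder multidimensional problem that also models missed detections; this sketch only conveys the batch-over-a-window flavor:

```python
# A toy window-batched association step. Shapes, the cost (mean Euclidean
# distance over the window), and the gate value are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def window_associate(track_preds, det_history, gate=4.0):
    # track_preds: (T, K, 2) predicted track positions over a K-frame window
    # det_history: (D, K, 2) detection positions over the same window
    cost = np.linalg.norm(
        track_preds[:, None] - det_history[None], axis=-1
    ).mean(-1)                                   # (T, D) window-averaged cost
    rows, cols = linear_sum_assignment(cost)     # Hungarian assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]
```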