3D Multi-Object Tracking (MOT) has seen significant performance improvements with the rapid advancement of 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique characteristics of multi-camera detectors, i.e., the unreliability of motion observations and the availability of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the Tracking-By-Detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from the distinct representation spaces of a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. Additionally, we introduce a novel appearance similarity metric that explicitly characterizes object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.
https://arxiv.org/abs/2409.11749
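As a rough illustration of the confidence-guided preprocessing and the fusion of geometric and appearance cues described in the RockTrack abstract above, here is a minimal Python sketch. The score threshold, the cue weights, and the BEV-distance/cosine cost terms are assumptions made for illustration, not the paper's actual formulation.

```python
import numpy as np

def preprocess(dets, score_thr=0.3):
    """Keep only detections whose confidence exceeds a threshold (assumed value)."""
    return [d for d in dets if d["score"] >= score_thr]

def fused_cost(track, det, w_geo=0.6, w_app=0.4):
    """Combine a geometric cue (BEV centre distance) with an appearance cue (cosine)."""
    geo = np.linalg.norm(np.asarray(track["xy"]) - np.asarray(det["xy"]))  # metres in BEV
    app = 1.0 - np.dot(track["feat"], det["feat"]) / (
        np.linalg.norm(track["feat"]) * np.linalg.norm(det["feat"]) + 1e-8)
    return w_geo * geo + w_app * app

# toy example
track = {"xy": (10.0, 5.0), "feat": np.array([0.2, 0.9, 0.1])}
det   = {"xy": (10.4, 5.2), "feat": np.array([0.25, 0.85, 0.05]), "score": 0.7}
if preprocess([det]):
    print("fused cost:", fused_cost(track, det))
```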
Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not in the training set. Currently, the best-performing methods are mainly based on pure appearance matching. Due to the complexity of motion patterns in large-vocabulary scenarios and the unstable classification of novel objects, existing methods either ignore motion and semantic cues or apply them heuristically in the final matching steps. In this paper, we present SLAck, a unified framework that jointly considers semantics, location, and appearance priors in the early steps of association and learns how to integrate all valuable information through a lightweight spatial and temporal object graph. Our method eliminates complex post-processing heuristics for fusing different cues and boosts the association performance significantly for large-scale open-vocabulary tracking. Without bells and whistles, we outperform previous state-of-the-art methods for novel-class tracking on the open-vocabulary MOT and TAO TETA benchmarks. Our code is available at \href{this https URL}{this http URL}.
https://arxiv.org/abs/2409.11235
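SLAck learns to fuse semantics, location, and appearance with a spatio-temporal object graph; that learned component is not reproduced here. The sketch below only illustrates, with hand-picked weights, what "considering all three cues early" means: building one affinity matrix from semantic, location, and appearance similarities before any matching step.

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def affinity(tracks, dets, w=(0.3, 0.3, 0.4)):
    """Early fusion of semantic, location, and appearance similarities (weights assumed)."""
    A = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            sem = float(np.dot(t["cls_prob"], d["cls_prob"]))      # semantic agreement
            loc = iou(t["box"], d["box"])                          # location overlap
            app = float(np.dot(t["emb"], d["emb"]) /
                        (np.linalg.norm(t["emb"]) * np.linalg.norm(d["emb"]) + 1e-8))
            A[i, j] = w[0] * sem + w[1] * loc + w[2] * app
    return A

tracks = [{"cls_prob": np.array([0.9, 0.1]), "box": (0, 0, 10, 20), "emb": np.array([1.0, 0.0])}]
dets   = [{"cls_prob": np.array([0.8, 0.2]), "box": (1, 1, 11, 21), "emb": np.array([0.9, 0.2])}]
print(affinity(tracks, dets))
```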
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise target re-identification (ReID) matching. These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially under challenging tracking conditions such as object deformation and blurring. To address the above-mentioned issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embeddings through adjacent-frame cooperation. The trajectory embedding is then propagated by a temporal detection refinement module to mine salient target locations in the temporal domain. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that STCMOT sets a new state-of-the-art performance in MOTA and IDF1 metrics. The source codes are released at this https URL.
https://arxiv.org/abs/2409.11234
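The temporal embedding boosting module above is a learned component; a common stand-in for letting adjacent frames cooperate is an exponential moving average over a track's past ReID embeddings, sketched below. The momentum value and the smoothing scheme are assumptions for illustration only.

```python
import numpy as np

def boost_embedding(history, new_emb, momentum=0.9):
    """Blend the new frame's embedding with the track's history (momentum assumed)."""
    if history is None:
        smoothed = new_emb.astype(float)
    else:
        smoothed = momentum * history + (1.0 - momentum) * new_emb
    return smoothed / (np.linalg.norm(smoothed) + 1e-8)

emb_t0 = np.array([0.1, 0.7, 0.7])
emb_t1 = np.array([0.2, 0.6, 0.75])   # adjacent frame, slightly deformed/blurred target
track_emb = boost_embedding(None, emb_t0)
track_emb = boost_embedding(track_emb, emb_t1)
print(track_emb)
```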
In recent years, workplaces and educational institutes have widely adopted virtual meeting platforms. This has led to a growing interest in analyzing and extracting insights from these meetings, which requires effective detection and tracking of unique individuals. In practice, there is no standardization in how video meeting recordings are laid out or captured across the different platforms and services. This, in turn, creates a challenge in acquiring this data stream and analyzing it in a uniform fashion. Our approach provides a solution to the most general form of video recording, usually consisting of a grid of participants from a single video source with no metadata on participant locations, while using the fewest constraints and assumptions as to how the data was acquired. Conventional approaches often use YOLO models coupled with tracking algorithms, assuming linear motion trajectories akin to those observed in CCTV footage. However, such assumptions fall short in virtual meetings, where a participant's video feed window can abruptly change location across the grid. In an organic video meeting setting, participants frequently join and leave, leading to sudden, non-linear movements on the video grid. This disrupts optical flow-based tracking methods that depend on linear motion. Consequently, standard object detection and tracking methods might mistakenly assign multiple participants to the same tracker. In this paper, we introduce a novel approach to track and re-identify participants in remote video meetings by utilizing the spatio-temporal priors arising from the data in our domain. This, in turn, improves tracking performance compared to general-purpose object tracking. Our approach reduces the error rate by 95% on average compared to YOLO-based tracking methods as a baseline.
https://arxiv.org/abs/2409.09841
Multi-object tracking (MOT) emerges as a pivotal and highly promising branch in the field of computer vision. Classical closed-vocabulary MOT (CV-MOT) methods aim to track objects of predefined categories. Recently, some open-vocabulary MOT (OV-MOT) methods have successfully addressed the problem of tracking unknown categories. However, we found that the CV-MOT and OV-MOT methods each struggle to excel in the tasks of the other. In this paper, we present a unified framework, Associate Everything Detected (AED), that simultaneously tackles CV-MOT and OV-MOT by integrating with any off-the-shelf detector and supports unknown categories. Different from existing tracking-by-detection MOT methods, AED gets rid of prior knowledge (e.g. motion cues) and relies solely on highly robust feature learning to handle complex trajectories in OV-MOT tasks while keeping excellent performance in CV-MOT tasks. Specifically, we model the association task as a similarity decoding problem and propose a sim-decoder with an association-centric learning mechanism. The sim-decoder calculates similarities in three aspects: spatial, temporal, and cross-clip. Subsequently, association-centric learning leverages these threefold similarities to ensure that the extracted features are appropriate for continuous tracking and robust enough to generalize to unknown categories. Compared with existing powerful OV-MOT and CV-MOT methods, AED achieves superior performance on TAO, SportsMOT, and DanceTrack without any prior knowledge. Our code is available at this https URL.
https://arxiv.org/abs/2409.09293
Multiple object tracking (MOT) involves identifying multiple targets and assigning them corresponding IDs within a video sequence, where occlusions are often encountered. Recent methods address occlusions using appearance cues through online learning techniques to improve adaptivity or offline learning techniques to utilize temporal information from videos. However, most existing online learning-based MOT methods are unable to learn from all past tracking information to improve adaptivity on long-term occlusions while maintaining real-time tracking speed. On the other hand, temporal information-based offline learning methods maintain a long-term memory to store past tracking information, but this approach restricts them to use only local past information during tracking. To address these challenges, we propose a new MOT framework called the Feature Adaptive Continual-learning Tracker (FACT), which enables real-time tracking and feature learning for targets by utilizing all past tracking information. We demonstrate that the framework can be integrated with various state-of-the-art feature-based trackers, thereby improving their tracking ability. Specifically, we develop the feature adaptive continual-learning (FAC) module, a neural network that can be trained online to learn features adaptively using all past tracking information during tracking. Moreover, we also introduce a two-stage association module specifically designed for the proposed continual learning-based tracking. Extensive experiment results demonstrate that the proposed method achieves state-of-the-art online tracking performance on MOT17 and MOT20 benchmarks. The code will be released upon acceptance.
https://arxiv.org/abs/2409.07904
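The FAC module is a neural network trained online; as a loose illustration of "learning from all past tracking information" without storing every frame, the sketch below keeps an incrementally updated mean embedding per track and scores new detections against it. This stand-in is an assumption, not the paper's module.

```python
import numpy as np

class TrackMemory:
    """Incrementally aggregates every past embedding of a track (a stand-in for FAC)."""
    def __init__(self, emb):
        self.mean = emb.astype(float)
        self.count = 1

    def update(self, emb):
        self.count += 1
        self.mean += (emb - self.mean) / self.count   # uses all past observations

    def score(self, emb):
        return float(np.dot(self.mean, emb) /
                     (np.linalg.norm(self.mean) * np.linalg.norm(emb) + 1e-8))

mem = TrackMemory(np.array([0.1, 0.9]))
mem.update(np.array([0.2, 0.8]))        # every frame's embedding refines the same memory
print(mem.score(np.array([0.15, 0.85])))
```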
Extracting and matching Re-Identification (ReID) features is used by many state-of-the-art (SOTA) Multiple Object Tracking (MOT) methods and is particularly effective against frequent and long-term occlusions. While end-to-end object detection and tracking have been the main focus of recent research, they have yet to outperform traditional methods on benchmarks like MOT17 and MOT20. Thus, from an application standpoint, methods with separate detection and embedding remain the best option for accuracy, modularity, and ease of implementation, though they are impractical for edge devices due to the overhead involved. In this paper, we investigate a selective approach that minimizes the overhead of feature extraction while preserving accuracy, modularity, and ease of implementation. This approach can be integrated into various SOTA methods. We demonstrate its effectiveness by applying it to StrongSORT and Deep OC-SORT. Experiments on the MOT17, MOT20, and DanceTrack datasets show that our mechanism retains the advantages of feature extraction during occlusions while significantly reducing runtime. Additionally, it improves accuracy by preventing confusion in the feature-matching stage, particularly in cases of deformation and appearance similarity, which are common in DanceTrack. this https URL, this https URL
https://arxiv.org/abs/2409.06617
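A selective approach of the kind described above can be sketched as a simple gate: pay the cost of ReID feature extraction only when IoU-based association looks ambiguous. The thresholds and the ambiguity test below are illustrative assumptions.

```python
import numpy as np

def needs_reid(iou_row, high=0.6, margin=0.2):
    """Extract ReID features only when IoU matching is ambiguous (thresholds assumed)."""
    best = np.sort(iou_row)[::-1]
    if best[0] < high:                                 # weak overlap: likely occlusion/re-entry
        return True
    if len(best) > 1 and best[0] - best[1] < margin:   # two candidate tracks too close to call
        return True
    return False

iou_row = np.array([0.55, 0.50, 0.10])   # overlaps of one detection with the current tracks
print(needs_reid(iou_row))               # True -> pay the feature-extraction cost here only
```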
Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi-modal sensors to collect comprehensive environmental data. Among these, the radar-camera multi-modal perception system is especially favored for its excellent sensing capabilities and cost-effectiveness. However, the substantial modality differences between radar and camera sensors pose challenges in fusing information. To address this problem, this paper presents RCBEVDet, a radar-camera fusion 3D object detection framework. Specifically, RCBEVDet is developed from an existing camera-based 3D object detector, supplemented by a specially designed radar feature extractor, RadarBEVNet, and a Cross-Attention Multi-layer Fusion (CAMF) module. Firstly, RadarBEVNet encodes sparse radar points into a dense bird's-eye-view (BEV) feature using a dual-stream radar backbone and a Radar Cross Section aware BEV encoder. Secondly, the CAMF module utilizes a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet's capabilities, we introduce RCBEVDet++, which advances the CAMF through sparse fusion, supports query-based multi-view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on the nuScenes show that our method integrates seamlessly with existing camera-based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks. Notably, with ViT-L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test-time augmentation or model ensembling.
https://arxiv.org/abs/2409.04979
The Lightweight Integrated Tracking-Feature Extraction (LITE) paradigm is introduced as a novel multi-object tracking (MOT) approach. It enhances ReID-based trackers by eliminating inference, pre-processing, post-processing, and ReID model training costs. LITE uses real-time appearance features without compromising speed. By integrating appearance feature extraction directly into the tracking pipeline using standard CNN-based detectors such as YOLOv8m, LITE demonstrates significant performance improvements. The simplest implementation of LITE on top of classic DeepSORT achieves a HOTA score of 43.03% at 28.3 FPS on the MOT17 benchmark, making it twice as fast as DeepSORT on MOT17 and four times faster on the more crowded MOT20 dataset, while maintaining similar accuracy. Additionally, a new evaluation framework for tracking-by-detection approaches reveals that conventional trackers like DeepSORT remain competitive with modern state-of-the-art trackers when evaluated under fair conditions. The code will be available post-publication at this https URL.
https://arxiv.org/abs/2409.04187
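LITE's key idea is to read appearance features directly out of the detector's own feature map instead of running a separate ReID network. A minimal sketch of that integration, with an assumed feature-map stride and simple average pooling, might look like this:

```python
import numpy as np

def roi_appearance(feature_map, box, stride=8):
    """Average-pool the detector's feature map inside a box -> appearance feature for free.

    feature_map: (C, H, W) activations already computed by the detector backbone.
    box: (x1, y1, x2, y2) in image pixels; stride maps pixels to feature cells (assumed).
    """
    x1, y1, x2, y2 = [int(round(v / stride)) for v in box]
    c, h, w = feature_map.shape
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(w, max(x1 + 1, x2)), min(h, max(y1 + 1, y2))
    crop = feature_map[:, y1:y2, x1:x2]
    feat = crop.mean(axis=(1, 2))
    return feat / (np.linalg.norm(feat) + 1e-8)

fmap = np.random.rand(64, 80, 144)            # e.g. a YOLO-style stride-8 feature map
print(roi_appearance(fmap, (320, 200, 400, 360)).shape)
```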
We propose a Ground IoU (Gr-IoU) to address the data association problem in multi-object tracking. When tracking objects detected by a camera, it often occurs that the same object is assigned different IDs in consecutive frames, especially when objects are close to each other or overlapping. To address this issue, we introduce Gr-IoU, which takes into account the 3D structure of the scene. Gr-IoU transforms traditional bounding boxes from the image space to the ground plane using the vanishing point geometry. The IoU calculated with these transformed bounding boxes is more sensitive to the front-to-back relationships of objects, thereby improving data association accuracy and reducing ID switches. We evaluated our Gr-IoU method on the MOT17 and MOT20 datasets, which contain diverse tracking scenarios including crowded scenes and sequences with frequent occlusions. Experimental results demonstrated that Gr-IoU outperforms conventional real-time methods without appearance features.
https://arxiv.org/abs/2409.03252
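A simplified sketch of the ground-plane idea: map each box's bottom corners to ground coordinates and compute IoU there, so that front-to-back ordering matters. The homography stand-in and the fixed footprint depth are assumptions; the paper derives the transform from vanishing-point geometry.

```python
import numpy as np

def to_ground(pt, H):
    """Map an image point to ground-plane coordinates with a homography H
    (assumed known, e.g. derived from the vanishing-point / camera geometry)."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]

def gr_iou(box_a, box_b, H, depth=0.5):
    """Toy ground-plane IoU: project each box's bottom corners and give the footprint
    an assumed fixed depth. The paper's exact construction differs in detail."""
    def footprint(box):
        x1, y1, x2, y2 = box
        bl, br = to_ground((x1, y2), H), to_ground((x2, y2), H)
        gx1, gx2 = min(bl[0], br[0]), max(bl[0], br[0])
        gy = (bl[1] + br[1]) / 2.0
        return gx1, gy, gx2, gy + depth
    a, b = footprint(box_a), footprint(box_b)
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

H = np.eye(3)                     # identity stand-in; a real H comes from calibration
print(gr_iou((100, 50, 140, 120), (110, 52, 150, 120), H))   # overlapping footprints
```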
While Multi-Object Tracking (MOT) has made substantial advancements, it relies heavily on prior knowledge and is limited to predefined categories. In contrast, Generic Multiple Object Tracking (GMOT), which tracks multiple objects with similar appearance, requires less prior information about the targets but faces challenges from variations in viewpoint, lighting, occlusion, and resolution. Our contributions commence with the introduction of the \textbf{\text{Refer-GMOT dataset}}, a collection of videos, each accompanied by fine-grained textual descriptions of their attributes. Subsequently, we introduce a novel text prompt-based open-vocabulary GMOT framework, called \textbf{\text{TP-GMOT}}, which can track never-seen object categories with zero training examples. Within the \text{TP-GMOT} framework, we introduce two novel components: (i) \textbf{\text{TP-OD}}, object detection by textual prompt, for accurately detecting unseen objects with specific characteristics; (ii) Motion-Appearance Cost SORT (\textbf{\text{MAC-SORT}}), a novel object association approach that adeptly integrates motion- and appearance-based matching strategies to tackle the complex task of tracking multiple generic objects with high similarity. Our contributions are benchmarked on the \text{Refer-GMOT} dataset for the GMOT task. Additionally, to assess the generalizability of the proposed \text{TP-GMOT} framework and the effectiveness of the \text{MAC-SORT} tracker, we conduct ablation studies on the DanceTrack and MOT20 datasets for the MOT task. Our dataset, code, and models will be publicly available at: this https URL
https://arxiv.org/abs/2409.02490
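As a hedged sketch of a motion-appearance cost of the kind MAC-SORT integrates, the snippet below blends a normalized centre-distance motion term with a cosine appearance term and solves the assignment with the Hungarian algorithm. The 50/50 weighting and the 1080p normalization constant are assumptions, not the paper's values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def centre(box):
    return np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

def cost_matrix(tracks, dets, img_diag=2203.0, w_motion=0.5):
    """Blend a motion cost (distance between the track's predicted box and the detection,
    normalised by an assumed 1080p image diagonal) with an appearance cost (cosine)."""
    C = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            motion = np.linalg.norm(centre(t["pred_box"]) - centre(d["box"])) / img_diag
            app = 1.0 - float(np.dot(t["emb"], d["emb"]) /
                              (np.linalg.norm(t["emb"]) * np.linalg.norm(d["emb"]) + 1e-8))
            C[i, j] = w_motion * motion + (1.0 - w_motion) * app
    return C

tracks = [{"pred_box": (100, 50, 140, 120), "emb": np.array([1.0, 0.0])}]
dets   = [{"box": (104, 52, 144, 122), "emb": np.array([0.9, 0.1])}]
rows, cols = linear_sum_assignment(cost_matrix(tracks, dets))
print(list(zip(rows.tolist(), cols.tolist())))
```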
Multi-modal 3D multi-object tracking (MOT) typically necessitates extensive computational costs of deep neural networks (DNNs) to extract multi-modal representations. In this paper, we propose an intriguing question: May we learn from multiple modalities only during training to avoid multi-modal input in the inference phase? To answer it, we propose \textbf{YOLOO}, a novel multi-modal 3D MOT paradigm: You Only Learn from Others Once. YOLOO empowers the point cloud encoder to learn a unified tri-modal representation (UTR) from point clouds and other modalities, such as images and textual cues, all at once. Leveraging this UTR, YOLOO achieves efficient tracking solely using the point cloud encoder without compromising its performance, fundamentally obviating the need for computationally intensive DNNs. Specifically, YOLOO includes two core components: a unified tri-modal encoder (UTEnc) and a flexible geometric constraint (F-GC) module. UTEnc integrates a point cloud encoder with image and text encoders adapted from pre-trained CLIP. It seamlessly fuses point cloud information with rich visual-textual knowledge from CLIP into the point cloud encoder, yielding highly discriminative UTRs that facilitate the association between trajectories and detections. Additionally, F-GC filters out mismatched associations with similar representations but significant positional discrepancies. It further enhances the robustness of UTRs without requiring any scene-specific tuning, addressing a key limitation of customized geometric constraints (e.g., 3D IoU). Lastly, high-quality 3D trajectories are generated by a traditional data association component. By integrating these advancements into a multi-modal 3D MOT scheme, our YOLOO achieves substantial gains in both robustness and efficiency.
https://arxiv.org/abs/2409.00618
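The flexible geometric constraint can be pictured as a veto on associations whose embeddings look alike but whose positions are implausibly far apart. The distance threshold and masking scheme below are illustrative assumptions rather than the paper's exact F-GC.

```python
import numpy as np

def fgc_mask(sim, traj_pos, det_pos, max_dist=3.0):
    """Veto trajectory/detection pairs whose 3D centres are farther apart than max_dist
    (metres, assumed), even if their representations are similar."""
    mask = np.ones_like(sim, dtype=bool)
    for i, p in enumerate(traj_pos):
        for j, q in enumerate(det_pos):
            if np.linalg.norm(np.asarray(p) - np.asarray(q)) > max_dist:
                mask[i, j] = False          # similar-looking but positionally implausible
    return np.where(mask, sim, -np.inf)

sim = np.array([[0.95, 0.90],
                [0.40, 0.85]])
traj_pos = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
det_pos  = [(0.5, 0.2, 0.0), (9.0, 0.0, 0.0)]
print(fgc_mask(sim, traj_pos, det_pos))     # the 0.90 entry is vetoed despite high similarity
```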
Temporal motion modeling has always been a key component in multiple object tracking (MOT) which can ensure smooth trajectory movement and provide accurate positional information to enhance association precision. However, current motion models struggle to be both efficient and effective across different application scenarios. To this end, we propose TrackSSM inspired by the recently popular state space models (SSM), a unified encoder-decoder motion framework that uses data-dependent state space model to perform temporal motion of trajectories. Specifically, we propose Flow-SSM, a module that utilizes the position and motion information from historical trajectories to guide the temporal state transition of object bounding boxes. Based on Flow-SSM, we design a flow decoder. It is composed of a cascaded motion decoding module employing Flow-SSM, which can use the encoded flow information to complete the temporal position prediction of trajectories. Additionally, we propose a Step-by-Step Linear (S$^2$L) training strategy. By performing linear interpolation between the positions of the object in the previous frame and the current frame, we construct the pseudo labels of step-by-step linear training, ensuring that the trajectory flow information can better guide the object bounding box in completing temporal transitions. TrackSSM utilizes a simple Mamba-Block to build a motion encoder for historical trajectories, forming a temporal motion model with an encoder-decoder structure in conjunction with the flow decoder. TrackSSM is applicable to various tracking scenarios and achieves excellent tracking performance across multiple benchmarks, further extending the potential of SSM-like temporal motion models in multi-object tracking tasks.
https://arxiv.org/abs/2409.00487
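The Step-by-Step Linear (S$^2$L) strategy builds pseudo labels by linearly interpolating between an object's position in the previous and current frames; a direct sketch (the number of interpolation steps is assumed):

```python
import numpy as np

def s2l_pseudo_labels(box_prev, box_curr, steps=4):
    """Step-by-step linear pseudo labels: interpolate boxes between two frames.
    Returns `steps` intermediate targets excluding the two endpoints."""
    box_prev, box_curr = np.asarray(box_prev, float), np.asarray(box_curr, float)
    alphas = np.linspace(0.0, 1.0, steps + 2)[1:-1]
    return [(1 - a) * box_prev + a * box_curr for a in alphas]

print(s2l_pseudo_labels([100, 50, 140, 120], [112, 54, 152, 124], steps=3))
```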
The study of collective animal behavior, especially in aquatic environments, presents unique challenges and opportunities for understanding movement and interaction patterns in the fields of ethology, ecology, and bio-navigation. The Fish Tracking Challenge 2024 (this https URL) introduces a multi-object tracking competition focused on the intricate behaviors of schooling sweetfish. Using the SweetFish dataset, participants are tasked with developing advanced tracking models to accurately monitor the locations of 10 sweetfish simultaneously. This paper introduces the competition's background, objectives, the SweetFish dataset, and the approaches of the 1st to 3rd place winners alongside our baseline. By leveraging video data and bounding box annotations, the competition aims to foster innovation in automatic detection and tracking algorithms, addressing the complexities of aquatic animal movements. The challenge underscores the importance of multi-object tracking for uncovering the dynamics of collective animal behavior, with the potential to significantly advance scientific understanding in the above fields.
https://arxiv.org/abs/2409.00339
The tracking-by-detection paradigm is the mainstream in multi-object tracking, associating tracks to the predictions of an object detector. Although exhibiting uncertainty through a confidence score, these predictions do not capture the entire variability of the inference process. For safety- and security-critical applications like autonomous driving and surveillance, however, knowing this predictive uncertainty is essential. Therefore, we introduce, for the first time, a fast way to obtain the empirical predictive distribution during object detection and incorporate that knowledge in multi-object tracking. Our mechanism can easily be integrated into state-of-the-art trackers, enabling them to fully exploit the uncertainty in the detections. Additionally, novel association methods are introduced that leverage the proposed mechanism. We demonstrate the effectiveness of our contribution on a variety of benchmarks, such as MOT17, MOT20, DanceTrack, and KITTI.
https://arxiv.org/abs/2408.17098
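One way to picture using an empirical predictive distribution in association: given several predictions of the same object (how they are sampled is left open here), estimate the mean and covariance of the box centre and score candidate tracks with a Mahalanobis-style cost. Both the sampling source and the cost choice are assumptions for this sketch, not the paper's mechanism.

```python
import numpy as np

def empirical_box_stats(samples):
    """Mean and covariance of box centres from repeated predictions of the same object."""
    centres = np.stack([[(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0] for b in samples])
    return centres.mean(0), np.cov(centres.T) + 1e-6 * np.eye(2)

def mahalanobis_cost(track_centre, det_mean, det_cov):
    """Association cost that shrinks when the detection's own uncertainty is large."""
    d = np.asarray(track_centre, float) - det_mean
    return float(d @ np.linalg.inv(det_cov) @ d)

samples = [(100, 50, 140, 120), (103, 52, 141, 118), (98, 49, 138, 122)]
mu, cov = empirical_box_stats(samples)
print(mahalanobis_cost((118, 86), mu, cov))
```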
Three-dimensional (3D) reconstruction from two-dimensional images is an active research field in computer vision, with applications ranging from navigation and object tracking to segmentation and three-dimensional modeling. Traditionally, parametric techniques have been employed for this task. However, recent advancements have seen a shift towards learning-based methods. Given the rapid pace of research and the frequent introduction of new image matching methods, it is essential to evaluate them. In this paper, we present a comprehensive evaluation of various image matching methods using a structure-from-motion pipeline. We assess the performance of these methods on both in-domain and out-of-domain datasets, identifying key limitations in both the methods and benchmarks. We also investigate the impact of edge detection as a pre-processing step. Our analysis reveals that image matching for 3D reconstruction remains an open challenge, necessitating careful selection and tuning of models for specific scenarios, while also highlighting mismatches in how metrics currently represent method performance.
https://arxiv.org/abs/2408.16445
Interpreting motion captured in image sequences is crucial for a wide range of computer vision applications. Typical estimation approaches include optical flow (OF), which approximates the apparent motion instantaneously in a scene, and multiple object tracking (MOT), which tracks the motion of subjects over time. Often, the motion of objects in a scene is governed by some underlying dynamical system which could be inferred by analyzing the motion of groups of objects. Standard motion analyses, however, are not designed to intuit flow dynamics from trajectory data, making such measurements difficult in practice. The goal of this work is to extend gradient-based dynamical systems analyses to real-world applications characterized by complex, feature-rich image sequences with imperfect tracers. The tracer trajectories are tracked using deep vision networks and gradients are approximated using Lagrangian gradient regression (LGR), a tool designed to estimate spatial gradients from sparse data. From gradients, dynamical features such as regions of coherent rotation and transport barriers are identified. The proposed approach is affordably implemented and enables advanced studies including the motion analysis of two distinct object classes in a single image sequence. Two examples of the method are presented on data sets for which standard gradient-based analyses do not apply.
https://arxiv.org/abs/2408.16190
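Lagrangian gradient regression estimates spatial gradients from sparse tracer data; in its simplest linear form, the velocity-gradient tensor is the least-squares solution of u_i - u_mean ≈ G (x_i - x_mean) over nearby tracers. Below is a minimal 2D sketch; the neighborhood selection and weighting used in practice are omitted.

```python
import numpy as np

def lagrangian_gradient(positions, velocities):
    """Least-squares velocity-gradient estimate from scattered tracers (simplified LGR).
    Returns the 2x2 Jacobian G with rows [du/dx, du/dy], [dv/dx, dv/dy]."""
    X = np.asarray(positions, float)
    U = np.asarray(velocities, float)
    dX = X - X.mean(axis=0)
    dU = U - U.mean(axis=0)
    G, *_ = np.linalg.lstsq(dX, dU, rcond=None)   # solves dU ≈ dX @ G
    return G.T

pos = [(0, 0), (1, 0), (0, 1), (1, 1)]
vel = [(0, 0), (0, 1), (-1, 0), (-1, 1)]          # samples from a solid-body rotation
G = lagrangian_gradient(pos, vel)
vorticity = G[1, 0] - G[0, 1]                      # dv/dx - du/dy, used to find coherent rotation
print(G, vorticity)
```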
Multi-object tracking (MOT) is a critical technology in computer vision, designed to detect multiple targets in video sequences and assign each target a unique ID per frame. Existing MOT methods excel at accurately tracking multiple objects in real time across various scenarios. However, these methods still face challenges such as poor noise resistance and frequent ID switches. In this research, we propose ConsistencyTrack, a novel joint detection and tracking (JDT) framework that formulates detection and association as a denoising diffusion process on perturbed bounding boxes. This progressive denoising strategy significantly improves the model's noise resistance. During the training phase, paired object boxes within two adjacent frames are diffused from ground-truth boxes to a random distribution, and the model then learns to detect and track by reversing this process. In inference, the model refines randomly generated boxes into detection and tracking results through minimal denoising steps. ConsistencyTrack also introduces an innovative target association strategy to address target occlusion. Experiments on the MOT17 and DanceTrack datasets demonstrate that ConsistencyTrack outperforms other compared methods, especially surpassing DiffusionTrack in inference speed and other performance metrics. Our code is available at this https URL.
https://arxiv.org/abs/2408.15548
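The training-time step of diffusing ground-truth boxes toward a random distribution can be sketched with a standard DDPM-style forward process applied to box coordinates; the noise schedule and box normalization below are assumed, not taken from the paper.

```python
import numpy as np

def diffuse_boxes(gt_boxes, t, T=1000, rng=np.random.default_rng(0)):
    """Forward (noising) step of a DDPM-style schedule applied to box coordinates:
    the signal is shrunk and Gaussian noise added as t -> T (schedule values assumed)."""
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    x0 = 2.0 * np.asarray(gt_boxes, float) - 1.0      # boxes in [0, 1] mapped to [-1, 1]
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise                                   # the model learns to reverse this

gt = np.array([[0.2, 0.3, 0.6, 0.9]])                  # normalised (cx, cy, w, h) box
print(diffuse_boxes(gt, t=500)[0])
```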
Visual tracking has seen remarkable advancements, largely driven by the availability of large-scale training datasets that have enabled the development of highly accurate and robust algorithms. While significant progress has been made in tracking general objects, research on more challenging scenarios, such as tracking camouflaged objects, remains limited. Camouflaged objects, which blend seamlessly with their surroundings or other objects, present unique challenges for detection and tracking in complex environments. This challenge is particularly critical in applications such as military, security, agriculture, and marine monitoring, where precise tracking of camouflaged objects is essential. To address this gap, we introduce the Camouflaged Object Tracking Dataset (COTD), a specialized benchmark designed specifically for evaluating camouflaged object tracking methods. The COTD dataset comprises 200 sequences and approximately 80,000 frames, each annotated with detailed bounding boxes. Our evaluation of 20 existing tracking algorithms reveals significant deficiencies in their performance on camouflaged objects. To address these issues, we propose a novel tracking framework, HiPTrack-MLS, which demonstrates promising results in improving tracking performance for camouflaged objects. COTD and code are available at this https URL.
https://arxiv.org/abs/2408.13877
Multi-camera tracking plays a pivotal role in various real-world applications. While end-to-end methods have gained significant interest in single-camera tracking, multi-camera tracking remains predominantly reliant on heuristic techniques. In response to this gap, this paper introduces the Multi-Camera Tracking tRansformer (MCTR), a novel end-to-end approach tailored for multi-object detection and tracking across multiple cameras with overlapping fields of view. MCTR leverages end-to-end detectors like the DEtection TRansformer (DETR) to produce detections and detection embeddings independently for each camera view. The framework maintains a set of track embeddings that encapsulate global information about the tracked objects and updates them at every frame by integrating the local information from the view-specific detection embeddings. The track embeddings are probabilistically associated with detections in every camera view and frame to generate consistent object tracks. The soft probabilistic association facilitates the design of differentiable losses that enable end-to-end training of the entire system. To validate our approach, we conduct experiments on MMPTrack and the AI City Challenge, two recently introduced large-scale multi-camera multi-object tracking datasets.
https://arxiv.org/abs/2408.13243
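The soft probabilistic association described above can be pictured as a temperature-scaled softmax over track/detection similarities within one camera view, which keeps the assignment differentiable. The similarity choice and temperature are assumptions for this sketch.

```python
import numpy as np

def soft_association(track_embs, det_embs, temperature=0.1):
    """Per-camera soft assignment: each detection distributes probability mass over the
    global track embeddings via a softmax on cosine similarity (temperature assumed)."""
    T = np.asarray(track_embs, float)
    D = np.asarray(det_embs, float)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    logits = (T @ D.T) / temperature                   # (num_tracks, num_dets)
    logits = np.exp(logits - logits.max(axis=0, keepdims=True))
    return logits / logits.sum(axis=0, keepdims=True)  # columns sum to 1 over tracks

P = soft_association([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]])
print(P.round(2))
```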