Multi-object tracking (MOT) is one of the most important problems in computer vision and a key component of any vision-based perception system used in advanced autonomous mobile robotics. Its implementation on low-power, real-time embedded platforms is therefore highly desirable. Modern MOT algorithms should be able to track objects of a given class (e.g. people or vehicles). In addition, the number of objects to be tracked is not known in advance, and they may appear and disappear at any time, as well as become occluded. For these reasons, the most popular and successful recent approaches follow the tracking-by-detection paradigm. A high-quality object detector is therefore essential, and in practice it accounts for the vast majority of the computational and memory complexity of the whole MOT system. In this paper, we propose an FPGA (Field-Programmable Gate Array) implementation of an embedded MOT system based on a quantized YOLOv8 detector and the SORT (Simple Online and Realtime Tracking) tracker. We use a modified version of the FINN framework to utilize external memory for model parameters and to support the operations required by YOLOv8. We evaluate detection and tracking performance on the COCO and MOT15 datasets, achieving 0.21 mAP and 38.9 MOTA, respectively. As the computational platform, we use an MPSoC system (Zynq UltraScale+ device from AMD/Xilinx), where the detector is deployed in reprogrammable logic and the tracking algorithm runs in the processor system.
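To make the detector-tracker split concrete, below is a minimal sketch of the SORT-style association step that would run on the processor side: predicted track boxes are matched to detector boxes by IoU cost with the Hungarian algorithm. The box format (x1, y1, x2, y2) and the `iou_threshold` value are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, det_boxes, iou_threshold=0.3):
    """Match predicted track boxes to detections (SORT-style: IoU cost + Hungarian)."""
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(track_boxes))), list(range(len(det_boxes)))
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - iou_threshold]
    unmatched_tracks = [i for i in range(len(track_boxes)) if i not in {r for r, _ in matches}]
    unmatched_dets = [j for j in range(len(det_boxes)) if j not in {c for _, c in matches}]
    return matches, unmatched_tracks, unmatched_dets
```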
https://arxiv.org/abs/2503.13023
Accurate 3D multi-object tracking (MOT) is crucial for autonomous driving, as it enables robust perception, navigation, and planning in complex environments. While deep learning-based solutions have demonstrated impressive 3D MOT performance, model-based approaches remain appealing for their simplicity, interpretability, and data efficiency. Conventional model-based trackers typically rely on random vector-based Bayesian filters within the tracking-by-detection (TBD) framework but face limitations due to heuristic data association and track management schemes. In contrast, random finite set (RFS)-based Bayesian filtering handles object birth, survival, and death in a theoretically sound manner, facilitating interpretability and parameter tuning. In this paper, we present OptiPMB, a novel RFS-based 3D MOT method that employs an optimized Poisson multi-Bernoulli (PMB) filter while incorporating several key innovative designs within the TBD framework. Specifically, we propose a measurement-driven hybrid adaptive birth model for improved track initialization, employ adaptive detection probability parameters to effectively maintain tracks for occluded objects, and optimize density pruning and track extraction modules to further enhance overall tracking performance. Extensive evaluations on nuScenes and KITTI datasets show that OptiPMB achieves superior tracking accuracy compared with state-of-the-art methods, thereby establishing a new benchmark for model-based 3D MOT and offering valuable insights for future research on RFS-based trackers in autonomous driving.
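As background for readers less familiar with RFS filtering, a PMB density factorizes the multi-object state into an undetected part, modeled as a Poisson point process, and a detected part, modeled as a multi-Bernoulli density over tracked objects. The sketch below states this standard form; it is textbook material rather than a formula quoted from the paper.

```latex
% Standard PMB form (background, not quoted from the paper):
% the object set X splits into undetected X^u and detected X^d parts.
\[
  f(\mathbf{X}) \;=\; \sum_{\mathbf{X}^{u}\,\uplus\,\mathbf{X}^{d}=\mathbf{X}}
      f^{u}(\mathbf{X}^{u})\, f^{d}(\mathbf{X}^{d}),
  \qquad
  f^{u}(\mathbf{X}^{u}) \;=\; e^{-\int \lambda(x)\,dx} \prod_{x \in \mathbf{X}^{u}} \lambda(x),
\]
\[
  f^{d}(\mathbf{X}^{d}) \;=\; \sum_{\uplus_{i} \mathbf{X}_{i}=\mathbf{X}^{d}}
      \prod_{i=1}^{n} f_{i}(\mathbf{X}_{i}),
  \qquad
  f_{i}(\mathbf{X}_{i}) =
  \begin{cases}
    1 - r_{i}, & \mathbf{X}_{i}=\emptyset,\\
    r_{i}\, p_{i}(x), & \mathbf{X}_{i}=\{x\},
  \end{cases}
\]
where $\lambda(x)$ is the Poisson intensity of undetected objects and each Bernoulli
component $i$ has existence probability $r_{i}$ and state density $p_{i}(x)$.
```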
https://arxiv.org/abs/2503.12968
Transformer-based trackers have achieved promising success and become the dominant tracking paradigm due to their accuracy and efficiency. Despite the substantial progress, most of the existing approaches tackle object tracking as a deterministic coordinate regression problem, while the target localization uncertainty has been greatly overlooked, which hampers trackers' ability to maintain reliable target state prediction in challenging scenarios. To address this issue, we propose UncTrack, a novel uncertainty-aware transformer tracker that predicts the target localization uncertainty and incorporates this uncertainty information for accurate target state inference. Specifically, UncTrack utilizes a transformer encoder to perform feature interaction between template and search images. The output features are passed into an uncertainty-aware localization decoder (ULD) to coarsely predict the corner-based localization and the corresponding localization uncertainty. Then the localization uncertainty is sent into a prototype memory network (PMN) to excavate valuable historical information to identify whether the target state prediction is reliable or not. To enhance the template representation, the samples with high confidence are fed back into the prototype memory bank for memory updating, making the tracker more robust to challenging appearance variations. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods. Our code is available at this https URL.
https://arxiv.org/abs/2503.12888
The aim of multiple object tracking (MOT) is to detect all objects in a video and bind them into multiple trajectories. Generally, this process is carried out in two steps: detecting objects and associating them across frames based on various cues and metrics. Many studies and applications adopt object appearance, also known as re-identification (ReID) features, for target matching through straightforward similarity calculation. However, we argue that this practice is overly naive and thus overlooks the unique characteristics of MOT tasks. Unlike regular re-identification tasks that strive to distinguish all potential targets in a general representation, multi-object tracking typically immerses itself in differentiating similar targets within the same video sequence. Therefore, we believe that seeking a more suitable feature representation space based on the different sample distributions of each sequence will enhance tracking performance. In this paper, we propose using history-aware transformations on ReID features to achieve more discriminative appearance representations. Specifically, we treat historical trajectory features as conditions and employ a tailored Fisher Linear Discriminant (FLD) to find a spatial projection matrix that maximizes the differentiation between different trajectories. Our extensive experiments reveal that this training-free projection can significantly boost feature-only trackers to achieve competitive, even superior tracking performance compared to state-of-the-art methods while also demonstrating impressive zero-shot transfer capabilities. This demonstrates the effectiveness of our proposal and further encourages future investigation into the importance and customization of ReID models in multiple object tracking. The code will be released at this https URL.
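As a rough illustration of the history-aware idea (a sketch under our own assumptions, not the authors' implementation), one can fit a Fisher Linear Discriminant on the historical per-trajectory ReID features of a sequence, treating trajectories as classes, and project new detection embeddings into that space before computing similarities:

```python
import numpy as np

def fld_projection(track_feats, eps=1e-3, n_dims=32):
    """Fit an FLD projection from historical trajectory features.

    track_feats: dict mapping track_id -> (n_i, d) array of past ReID features.
    Returns a (d, n_dims) projection matrix maximizing between-trajectory scatter
    relative to within-trajectory scatter. Dimensions and eps are illustrative.
    """
    all_feats = np.concatenate(list(track_feats.values()), axis=0)
    d = all_feats.shape[1]
    mean_total = all_feats.mean(axis=0)
    S_w = np.zeros((d, d))  # within-trajectory scatter
    S_b = np.zeros((d, d))  # between-trajectory scatter
    for feats in track_feats.values():
        mu = feats.mean(axis=0)
        diff = feats - mu
        S_w += diff.T @ diff
        gap = (mu - mean_total)[:, None]
        S_b += feats.shape[0] * (gap @ gap.T)
    # Solve the (regularized) generalized eigenproblem S_w^{-1} S_b w = lambda w.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w + eps * np.eye(d), S_b))
    order = np.argsort(-eigvals.real)[:n_dims]
    return eigvecs[:, order].real

def project(features, W):
    """Map ReID features into the discriminative space before cosine matching."""
    z = features @ W
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-9)
```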
https://arxiv.org/abs/2503.12562
The availability of large-scale remote sensing video data underscores the importance of high-quality interactive segmentation. However, challenges such as small object sizes, ambiguous features, and limited generalization make it difficult for current methods to achieve this goal. In this work, we propose ROS-SAM, a method designed to achieve high-quality interactive segmentation while preserving generalization across diverse remote sensing data. The ROS-SAM is built upon three key innovations: 1) LoRA-based fine-tuning, which enables efficient domain adaptation while maintaining SAM's generalization ability, 2) Enhancement of deep network layers to improve the discriminability of extracted features, thereby reducing misclassifications, and 3) Integration of global context with local boundary details in the mask decoder to generate high-quality segmentation masks. Additionally, we design the data pipeline to ensure the model learns to better handle objects at varying scales during training while focusing on high-quality predictions during inference. Experiments on remote sensing video datasets show that the redesigned data pipeline boosts the IoU by 6%, while ROS-SAM increases the IoU by 13%. Finally, when evaluated on existing remote sensing object tracking datasets, ROS-SAM demonstrates impressive zero-shot capabilities, generating masks that closely resemble manual annotations. These results confirm ROS-SAM as a powerful tool for fine-grained segmentation in remote sensing applications. Code is available at this https URL.
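For reference, the LoRA fine-tuning mentioned in innovation 1) follows the usual low-rank update scheme, sketched below in standard notation (general background, not a formula reproduced from the paper):

```latex
% Standard LoRA update (background): only A and B are trained, W_0 stays frozen.
\[
  W \;=\; W_{0} + \Delta W \;=\; W_{0} + \tfrac{\alpha}{r}\, B A,
  \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k),
\]
% Forward pass: h = W_0 x + (alpha / r) B A x, adding only r(d + k) trainable
% parameters per adapted layer, which preserves the frozen backbone's behavior.
```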
https://arxiv.org/abs/2503.12006
As a significant application of multi-source information fusion in intelligent transportation perception systems, Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language references. However, existing RMOT approaches often treat language descriptions as holistic embeddings and struggle to effectively integrate the rich semantic information contained in language expressions with visual features. This limitation is especially apparent in complex scenes requiring comprehensive understanding of both static object attributes and spatial motion information. In this paper, we propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework that addresses these challenges. It adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks. Specifically, our framework comprises three collaborative components: (1) The Bidirectional Interactive Fusion module first establishes cross-modal connections while preserving modality-specific characteristics; (2) Building upon this foundation, the Progressive Semantic-Decoupled Query Learning mechanism hierarchically injects complementary information into object queries, progressively refining object understanding from coarse to fine-grained semantic levels; (3) Finally, the Structural Consensus Constraint enforces bidirectional semantic consistency between visual features and language descriptions, ensuring that tracked objects faithfully reflect the referring expression. Extensive experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods, with average gains of 6.0% in HOTA score on Refer-KITTI and 3.2% on Refer-KITTI-V2. Our approach advances the state-of-the-art in RMOT while simultaneously providing new insights into multi-source information fusion.
https://arxiv.org/abs/2503.11496
Open-vocabulary multiple object tracking aims to generalize trackers to unseen categories during training, enabling their application across a variety of real-world scenarios. However, the existing open-vocabulary tracker is constrained by its framework structure, isolated frame-level perception, and insufficient modal interactions, which hinder its performance in open-vocabulary classification and tracking. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we design the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. Additionally, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and incorporate protective strategies in the decoder to enhance performance. Experimental results show that our method surpasses previous trackers on the open-vocabulary MOT benchmark while also achieving faster inference speeds and significantly reducing preprocessing requirements. Moreover, the experiment transferring the model to another dataset demonstrates its strong adaptability. Models and code are released at this https URL.
https://arxiv.org/abs/2503.10616
Object tracking is an essential task for autonomous systems. With the advancement of 3D sensors, these systems can better perceive their surroundings using effective 3D Extended Object Tracking (EOT) methods. Based on the observation that common road users are left-right symmetric with respect to the traveling direction, we focus on the side-view profile of the object. To leverage developments in 2D EOT and to balance the number of shape-model parameters in the tracking algorithm, we propose a method for 3D extended object tracking that describes the side-view profile of the object with B-spline curves and extrudes it to obtain a 3D extent. B-spline curves offer flexible representation power by allowing the control points to move freely. The algorithm is developed within an Extended Kalman Filter (EKF). For a thorough evaluation of this method, we use simulated traffic scenarios with different vehicle models and a real-world open dataset containing both radar and lidar data.
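To make the shape parameterization concrete (using our own illustrative notation rather than the paper's), the side-view profile can be written as a B-spline over control points in the vehicle's length-height plane, and the 3D extent obtained by extruding it across the vehicle width:

```latex
% Side-view profile as a B-spline with control points p_i = (x_i, z_i),
% extruded over the half-width w/2 to form the 3D extent (illustrative notation).
\[
  c(u) \;=\; \sum_{i=0}^{n} B_{i,k}(u)\, p_{i}, \qquad u \in [0, 1],
\]
\[
  \mathcal{S} \;=\; \bigl\{\, (c_{x}(u),\; y,\; c_{z}(u)) \;:\; u \in [0,1],\;
      |y| \le w/2 \,\bigr\},
\]
where $B_{i,k}$ are B-spline basis functions of order $k$; the control points
$p_{i}$ (and the width $w$) can enter the EKF state so the profile adapts online.
```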
https://arxiv.org/abs/2503.10730
Trackers based on lightweight neural networks have achieved great success in aerial remote sensing, and most of them aggregate multi-stage deep features to improve tracking quality. However, existing algorithms usually generate only single-stage fusion features for the state decision, ignoring that diverse kinds of features are required for identifying and locating the object, which limits the robustness and precision of tracking. In this paper, we propose a novel target-aware Bidirectional Fusion transformer (BFTrans) for UAV tracking. Specifically, we first present a two-stream fusion network based on linear self- and cross-attention, which combines shallow and deep features in both forward and backward directions, providing adjusted local details for localization and global semantics for recognition. In addition, a target-aware positional encoding strategy is designed for this fusion model, helping it perceive object-related attributes during the fusion phase. Finally, the proposed method is evaluated on several popular UAV benchmarks, including UAV-123, UAV20L and UAVTrack112. Extensive experimental results demonstrate that our approach outperforms other state-of-the-art trackers and runs at an average speed of 30.5 FPS on an embedded platform, which is suitable for practical drone deployments.
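The linear attention mentioned above usually refers to kernelized attention whose cost grows linearly with the number of tokens; one common formulation is sketched below (general background, and the exact variant used by BFTrans may differ):

```latex
% Kernelized linear attention (one common formulation; the paper's exact
% variant may differ). phi is an elementwise feature map, e.g. elu(x) + 1.
\[
  \mathrm{Attn}(Q, K, V)_{i}
  \;=\;
  \frac{\phi(Q_{i})^{\top} \sum_{j} \phi(K_{j})\, V_{j}^{\top}}
       {\phi(Q_{i})^{\top} \sum_{j} \phi(K_{j})},
\]
% The sums over j are computed once and reused for every query i, giving O(N)
% complexity instead of the O(N^2) of softmax attention.
```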
https://arxiv.org/abs/2503.09951
Image-based multi-object detection (MOD) and multi-object tracking (MOT) are advancing at a fast pace. A variety of 2D and 3D MOD and MOT methods have been developed for monocular and stereo cameras. Road safety analysis can benefit from those advancements. As crashes are rare events, surrogate measures of safety (SMoS) have been developed for safety analyses. (Semi-)Automated safety analysis methods extract road user trajectories to compute safety indicators, for example, Time-to-Collision (TTC) and Post-encroachment Time (PET). Inspired by the success of deep learning in MOD and MOT, we investigate three MOT methods, including one based on a stereo-camera, using the annotated KITTI traffic video dataset. Two post-processing steps, IDsplit and SS, are developed to improve the tracking results and investigate the factors influencing the TTC. The experimental results show that, despite some advantages in terms of the numbers of interactions or similarity to the TTC distributions, all the tested methods systematically over-estimate the number of interactions and under-estimate the TTC: they report more interactions and more severe interactions, making the road user interactions appear less safe than they are. Further efforts will be directed towards testing more methods and more data, in particular from roadside sensors, to verify the results and improve the performance.
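For context, TTC is typically computed by extrapolating the two road users' extracted trajectories under a constant-velocity assumption; the sketch below shows a simple point-mass version. The collision distance, horizon, and time step are illustrative assumptions, and practical implementations reason over bounding boxes rather than points.

```python
import numpy as np

def time_to_collision(p1, v1, p2, v2, collision_dist=2.0, horizon=10.0, dt=0.1):
    """Constant-velocity TTC between two point-mass road users.

    p*, v*: 2D position (m) and velocity (m/s). Returns the first time (s) at
    which the predicted distance drops below collision_dist, or None.
    """
    p1, v1, p2, v2 = map(np.asarray, (p1, v1, p2, v2))
    for t in np.arange(0.0, horizon, dt):
        d = np.linalg.norm((p1 + v1 * t) - (p2 + v2 * t))
        if d < collision_dist:
            return t
    return None  # no predicted collision within the horizon

# Example: two vehicles approaching an intersection at right angles.
print(time_to_collision([0, -20], [0, 10], [-20, 0], [10, 0]))  # ~1.9 s
```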
https://arxiv.org/abs/2503.09807
Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks like 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, Camera-based 4D Panoptic Occupancy Tracking, which simultaneously addresses panoptic occupancy segmentation and object tracking from camera-only input. Furthermore, we propose TrackOcc, a cutting-edge approach that processes image inputs in a streaming, end-to-end manner with 4D panoptic queries to address the proposed task. Leveraging the localization-aware loss, TrackOcc enhances the accuracy of 4D panoptic occupancy tracking without bells and whistles. Experimental results demonstrate that our method achieves state-of-the-art performance on the Waymo dataset. The source code will be released at this https URL.
https://arxiv.org/abs/2503.08471
Open-Vocabulary Multi-Object Tracking (OV-MOT) aims to enable approaches to track objects without being limited to a predefined set of categories. Current OV-MOT methods typically rely primarily on instance-level detection and association, often overlooking trajectory information that is unique and essential for object tracking tasks. Utilizing trajectory information can enhance association stability and classification accuracy, especially in cases of occlusion and category ambiguity, thereby improving adaptability to novel classes. Thus motivated, in this paper we propose TRACT, an open-vocabulary tracker that leverages trajectory information to improve both object association and classification in OV-MOT. Specifically, we introduce a Trajectory Consistency Reinforcement (TCR) strategy, which benefits tracking performance by improving target identity and category consistency. In addition, we present TraCLIP, a plug-and-play trajectory classification module. It integrates Trajectory Feature Aggregation (TFA) and Trajectory Semantic Enrichment (TSE) strategies to fully leverage trajectory information from visual and language perspectives for enhancing the classification results. Extensive experiments on OV-TAO show that our TRACT significantly improves tracking performance, highlighting trajectory information as a valuable asset for OV-MOT. Code will be released.
https://arxiv.org/abs/2503.08145
Referring Multi-Object Tracking (RMOT) aims to localize target trajectories specified by natural language expressions in videos. Existing RMOT methods mainly follow two paradigms, namely, one-stage strategies and two-stage ones. The former jointly trains tracking with referring but suffers from substantial computational overhead. Although the latter improves computational efficiency, its CLIP-inspired dual-tower architecture restricts compatibility with other visual/text backbones and is not future-proof. To overcome these limitations, we propose CPAny, a novel encoder-decoder framework for two-stage RMOT, which introduces two core components: (1) a Contextual Visual Semantic Abstractor (CVSA) performs context-aware aggregation on visual backbone features and projects them into a unified semantic space; (2) a Parallel Semantic Summarizer (PSS) decodes the visual and linguistic features at the semantic level in parallel and generates referring scores. By replacing the inherent feature alignment of encoders with a self-constructed unified semantic space, CPAny achieves flexible compatibility with arbitrary emerging visual / text encoders. Meanwhile, CPAny aggregates contextual information by encoding only once and processes multiple expressions in parallel, significantly reducing computational redundancy. Extensive experiments on the Refer-KITTI and Refer-KITTI-V2 datasets show that CPAny outperforms SOTA methods across diverse encoder combinations, with a particular 7.77% HOTA improvement on Refer-KITTI-V2. Code will be available soon.
https://arxiv.org/abs/2503.07516
Panoramic imagery, with its 360° field of view, offers comprehensive information to support Multi-Object Tracking (MOT) in capturing spatial and temporal relationships of surrounding objects. However, most MOT algorithms are tailored for pinhole images with limited views, impairing their effectiveness in panoramic settings. Additionally, panoramic image distortions, such as resolution loss, geometric deformation, and uneven lighting, hinder direct adaptation of existing MOT methods, leading to significant performance degradation. To address these challenges, we propose OmniTrack, an omnidirectional MOT framework that incorporates Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. This integration enables tracking in large field-of-view scenarios, even under rapid sensor motion. To mitigate the lack of panoramic MOT datasets, we introduce the QuadTrack dataset--a comprehensive panoramic dataset collected by a quadruped robot, featuring diverse challenges such as wide fields of view, intense motion, and complex environments. Extensive experiments on the public JRDB dataset and the newly introduced QuadTrack benchmark demonstrate the state-of-the-art performance of the proposed framework. OmniTrack achieves a HOTA score of 26.92% on JRDB, representing an improvement of 3.43%, and further achieves 23.45% on QuadTrack, surpassing the baseline by 6.81%. The dataset and code will be made publicly available at this https URL.
https://arxiv.org/abs/2503.04565
Optical flow is a fundamental technique for motion estimation, widely applied in video stabilization, interpolation, and object tracking. Recent advancements in artificial intelligence (AI) have enabled deep learning models to leverage optical flow as an important feature for motion analysis. However, traditional optical flow methods rely on restrictive assumptions, such as brightness constancy and slow motion constraints, limiting their effectiveness in complex scenes. Deep learning-based approaches require extensive training on large domain-specific datasets, making them computationally demanding. Furthermore, optical flow is typically visualized in the HSV color space, which introduces nonlinear distortions when converted to RGB and is highly sensitive to noise, degrading motion representation accuracy. These limitations inherently constrain the performance of downstream models, potentially hindering object tracking and motion analysis tasks. To address these challenges, we propose Reynolds flow, a novel training-free flow estimation method inspired by the Reynolds transport theorem, offering a principled approach to modeling complex motion dynamics. Beyond the conventional HSV-based visualization, denoted ReynoldsFlow, we introduce an alternative representation, ReynoldsFlow+, designed to improve flow visualization. We evaluate ReynoldsFlow and ReynoldsFlow+ across three video-based benchmarks: tiny object detection on UAVDB, infrared object detection on Anti-UAV, and pose estimation on GolfDB. Experimental results demonstrate that networks trained with ReynoldsFlow+ achieve state-of-the-art (SOTA) performance, exhibiting improved robustness and efficiency across all tasks.
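For reference, the classical Reynolds transport theorem that inspires the method relates the rate of change of a quantity integrated over a moving region to a local change term and a boundary flux term; the standard form is given below (how the paper maps image intensity onto it is not reproduced here):

```latex
% Classical Reynolds transport theorem for a scalar field f over a moving
% region Omega(t) with boundary velocity v and outward normal n.
\[
  \frac{d}{dt} \int_{\Omega(t)} f \, dV
  \;=\;
  \int_{\Omega(t)} \frac{\partial f}{\partial t} \, dV
  \;+\;
  \int_{\partial \Omega(t)} f \, (\mathbf{v} \cdot \mathbf{n}) \, dS .
\]
```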
https://arxiv.org/abs/2503.04500
Object tracking is a key challenge in computer vision, with various applications that each require different architectures. Most tracking systems have limitations, such as constraining all movement to a 2D plane, and they often track only one object. In this paper, we present a new modular pipeline that calculates 3D trajectories of multiple objects. It is adaptable to various settings in which multiple time-synced and stationary cameras record moving objects, using off-the-shelf webcams. Our pipeline was tested on the Table Setting Dataset, in which participants are recorded with various sensors as they set a table with tableware objects. We track these manipulated objects using six RGB webcams. Challenges include: detecting small objects in 9.874.699 camera frames, determining camera poses, discriminating between nearby and overlapping objects, handling temporary occlusions, and finally calculating a 3D trajectory using the right subset of an average of 11.12.456 pixel coordinates per 3-minute trial. We implement a robust pipeline that produces accurate trajectories, with the covariance of the x, y, z position as a confidence metric. It deals dynamically with appearing and disappearing objects by instantiating new Extended Kalman Filters. It scales to hundreds of table-setting trials with very little human annotation input, even when the camera poses of each trial are unknown. The code is available at this https URL.
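As a rough sketch of the dynamic track management described above (illustrative only: the state layout, noise values, gating distance, and helper names are assumptions rather than the authors' code, and the position-only measurement model makes this a plain Kalman filter, whereas the paper's EKF presumably folds in the camera projection), unmatched 3D detections spawn new constant-velocity filters and long-missed tracks are retired:

```python
import numpy as np

class Track3D:
    """Minimal constant-velocity 3D Kalman track (illustrative, not the paper's code)."""
    def __init__(self, xyz, dt=1.0 / 30, q=1e-2, r=5e-3):
        self.x = np.hstack([xyz, np.zeros(3)])              # state: position + velocity
        self.P = np.eye(6)                                   # state covariance
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)                      # constant-velocity motion model
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])    # we observe position only
        self.Q, self.R = q * np.eye(6), r * np.eye(3)
        self.missed = 0

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q

    def update(self, z):
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
        self.missed = 0

def step(tracks, detections_xyz, gate=0.15, max_missed=15):
    """Associate triangulated 3D detections to tracks by nearest neighbour;
    spawn new tracks for unmatched detections, drop long-missed tracks."""
    for t in tracks:
        t.predict()
    unused = list(detections_xyz)
    for t in tracks:
        if not unused:
            t.missed += 1
            continue
        d = [np.linalg.norm(np.asarray(z) - t.x[:3]) for z in unused]
        j = int(np.argmin(d))
        if d[j] < gate:
            t.update(np.asarray(unused.pop(j)))
        else:
            t.missed += 1
    tracks[:] = [t for t in tracks if t.missed <= max_missed]
    tracks.extend(Track3D(np.asarray(z)) for z in unused)    # appearing objects
```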
https://arxiv.org/abs/2503.04322
As smart homes become more prevalent in daily life, the ability to understand dynamic environments is essential, and it increasingly depends on AI systems. This study focuses on developing an intelligent algorithm that can navigate a robot through a kitchen, recognize objects, and track their relocation. The kitchen was chosen as the testing ground due to its dynamic nature: objects are frequently moved, rearranged and replaced. Various techniques, such as SLAM feature-based tracking and deep learning-based object detection (e.g., Faster R-CNN), are commonly used for object tracking. Additionally, methods such as optical flow analysis and 3D reconstruction have also been used to track the relocation of objects. These approaches often face challenges with problems such as lighting variations and partial occlusions, where parts of the object are hidden in some frames but visible in others. The proposed method in this study leverages the YOLOv5 architecture, initialized with pre-trained weights and subsequently fine-tuned on a custom dataset. A novel frame-scoring algorithm was developed, which calculates a score for each object based on its location and features across all frames. This scoring approach helps to identify changes by determining the best-associated frame for each object and comparing the results across scenes, overcoming limitations seen in other methods while maintaining simplicity in design. The experimental results demonstrate an accuracy of 97.72%, a precision of 95.83% and a recall of 96.84% for this algorithm, which highlights the efficacy of the model in detecting spatial changes.
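The abstract does not spell out the scoring formula, so the sketch below is purely hypothetical: it scores each detection of an object by confidence weighted by box size, keeps the best-associated frame per scene, and flags a relocation when the best positions differ between scenes. Every function name and threshold here is an assumption made for illustration.

```python
import numpy as np

def best_frame(detections):
    """Pick the best-associated frame for one object.

    detections: list of dicts {"frame": int, "center": (x, y), "conf": float,
    "area": float}. The score (confidence weighted by sqrt(area)) is a
    hypothetical choice, not the paper's formula.
    """
    scored = [(d["conf"] * np.sqrt(d["area"]), d) for d in detections]
    return max(scored, key=lambda s: s[0])[1]

def relocated(dets_scene_a, dets_scene_b, dist_threshold=50.0):
    """Flag a relocation if the best frames of two scenes place the object
    further apart than dist_threshold pixels (illustrative criterion)."""
    a, b = best_frame(dets_scene_a), best_frame(dets_scene_b)
    shift = np.linalg.norm(np.asarray(a["center"]) - np.asarray(b["center"]))
    return shift > dist_threshold
```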
https://arxiv.org/abs/2503.01547
Multi-view object tracking (MVOT) offers promising solutions to challenges such as occlusion and target loss, which are common in traditional single-view tracking. However, progress has been limited by the lack of comprehensive multi-view datasets and effective cross-view integration methods. To overcome these limitations, we compiled a Multi-View object Tracking (MVTrack) dataset of 234K high-quality annotated frames featuring 27 distinct objects across various scenes. In conjunction with this dataset, we introduce a novel MVOT method, Multi-View Integration Tracker (MITracker), to efficiently integrate multi-view object features and provide stable tracking outcomes. MITracker can track any object in video frames of arbitrary length from arbitrary viewpoints. The key advancements of our method over traditional single-view approaches come from two aspects: (1) MITracker transforms 2D image features into a 3D feature volume and compresses it into a bird's eye view (BEV) plane, facilitating inter-view information fusion; (2) we propose an attention mechanism that leverages geometric information from fused 3D feature volume to refine the tracking results at each view. MITracker outperforms existing methods on the MVTrack and GMTD datasets, achieving state-of-the-art performance. The code and the new dataset will be available at this https URL.
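To give a flavor of advancement (1), lifting image features into a 3D volume and compressing it to a BEV plane, the sketch below collapses a voxel feature volume along the height axis with a small learned projection. This is a generic pattern; the tensor layout, sizes, and the choice of a 1x1 convolution are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class BEVCompressor(nn.Module):
    """Collapse a (B, C, X, Y, Z) voxel feature volume to a (B, C_out, X, Y) BEV map."""
    def __init__(self, channels, z_bins, out_channels):
        super().__init__()
        # Fold the height axis into channels, then mix with a 1x1 convolution.
        self.proj = nn.Conv2d(channels * z_bins, out_channels, kernel_size=1)

    def forward(self, volume):                            # volume: (B, C, X, Y, Z)
        b, c, x, y, z = volume.shape
        flat = volume.permute(0, 1, 4, 2, 3).reshape(b, c * z, x, y)
        return self.proj(flat)                            # (B, C_out, X, Y)

# Example: 64-channel volume on a 128x128 ground grid with 16 height bins.
bev = BEVCompressor(channels=64, z_bins=16, out_channels=128)
print(bev(torch.randn(2, 64, 128, 128, 16)).shape)        # torch.Size([2, 128, 128, 128])
```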
https://arxiv.org/abs/2502.20111
Hyperspectral object tracking using snapshot mosaic cameras is emerging as it provides enhanced spectral information alongside spatial data, contributing to a more comprehensive understanding of material properties. Transformers, which have consistently outperformed convolutional neural networks (CNNs) at learning feature representations, would therefore be expected to be effective for hyperspectral object tracking. However, training large transformers necessitates extensive datasets and prolonged training periods. This is particularly critical for complex tasks like object tracking, and the scarcity of large datasets in the hyperspectral domain acts as a bottleneck in achieving the full potential of powerful transformer models. This paper proposes an effective methodology that adapts large pretrained transformer-based foundation models for hyperspectral object tracking. We propose an adaptive, learnable spatial-spectral token fusion module that can be extended to any transformer-based backbone for learning the inherent spatial-spectral features in hyperspectral data. Furthermore, our model incorporates a cross-modality training pipeline that facilitates effective learning across hyperspectral datasets collected with different sensor modalities. This enables the extraction of complementary knowledge from additional modalities, whether or not they are present during testing. Our proposed model also achieves superior performance with minimal training iterations.
https://arxiv.org/abs/2502.18748
Multi-modal tracking is essential in single-object tracking (SOT), as different sensor types contribute unique capabilities to overcome challenges caused by variations in object appearance. However, existing unified RGB-X trackers (X represents the depth, event, or thermal modality) either rely on a task-specific training strategy for individual RGB-X image pairs or fail to address the critical importance of modality-adaptive perception in real-world applications. In this work, we propose UASTrack, a unified adaptive selection framework that facilitates both model and parameter unification, as well as adaptive modality discrimination, across various multi-modal tracking tasks. To achieve modality-adaptive perception in joint RGB-X pairs, we design a Discriminative Auto-Selector (DAS) capable of identifying modality labels, thereby distinguishing the data distributions of auxiliary modalities. Furthermore, we propose a Task-Customized Optimization Adapter (TCOA) tailored to the various modalities in the latent space. This strategy effectively filters noise redundancy and mitigates background interference based on the specific characteristics of each modality. Extensive comparisons on five benchmarks, including LasHeR, GTOT, RGBT234, VisEvent, and DepthTrack, covering RGB-T, RGB-E, and RGB-D tracking scenarios, demonstrate that our approach achieves competitive performance while introducing only 1.87M additional training parameters and 1.95G FLOPs. The code will be available at this https URL.
https://arxiv.org/abs/2502.18220