Multi-Object Tracking (MOT) plays a crucial role in autonomous driving systems, as it lays the foundation for advanced perception and precise path-planning modules. Nonetheless, single-agent MOT falls short in sensing the surroundings due to occlusions, sensor failures, etc. Hence, integrating multi-agent information is essential for a comprehensive understanding of the environment. This paper proposes a novel Cooperative MOT framework for tracking objects in 3D LiDAR scenes by formulating and solving a graph topology-aware optimization problem so as to fuse information coming from multiple vehicles. Exploiting a fully connected graph topology defined by the detected bounding boxes, we employ the Graph Laplacian processing optimization technique to smooth the position errors of the bounding boxes and effectively combine them. In that manner, we reveal and leverage inherent coherences across the detections of different agents, and associate the refined bounding boxes with tracked objects in two stages, optimizing both localization and tracking accuracy. An extensive evaluation study conducted on the real-world V2V4Real dataset shows that the proposed method significantly outperforms the baseline frameworks, including the state-of-the-art deep-learning DMSTrack and V2V4Real methods, across various testing sequences.
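The abstract does not spell out the exact objective, but the core idea — smoothing noisy box positions over a fully connected graph with a Laplacian regularizer — can be illustrated with a minimal NumPy sketch. The Gaussian edge weights, the smoothing weight `lam`, and the toy box centers below are assumptions, not the authors' formulation.

```python
import numpy as np

def laplacian_smooth_boxes(Y, sigma=2.0, lam=0.5):
    """Smooth noisy detection centers Y (N x 3) over a fully connected graph.

    W_ij is a Gaussian kernel on pairwise distance, L = D - W, and the
    smoothed positions solve (I + lam * L) X = Y (Tikhonov-style smoothing).
    """
    N = Y.shape[0]
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                               # no self-loops
    L = np.diag(W.sum(1)) - W                              # combinatorial graph Laplacian
    return np.linalg.solve(np.eye(N) + lam * L, Y)         # closed-form smoother

# toy example: five noisy 3D box centers reported by two cooperating vehicles
Y = np.array([[10.1, 5.0, 0.9], [10.4, 5.2, 1.0],
              [20.0, -3.1, 0.8], [19.7, -2.9, 0.7], [35.2, 0.0, 1.1]])
print(laplacian_smooth_boxes(Y))
```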
https://arxiv.org/abs/2506.09469
Vision-Language Models (VLMs) lag behind Large Language Models due to the scarcity of annotated datasets, as creating paired visual-textual annotations is labor-intensive and expensive. To address this bottleneck, we introduce SAM2Auto, the first fully automated annotation pipeline for video datasets requiring no human intervention or dataset-specific training. Our approach consists of two key components: SMART-OD, a robust object detection system that combines automatic mask generation with open-world object detection capabilities, and FLASH (Frame-Level Annotation and Segmentation Handler), a multi-object real-time video instance segmentation (VIS) module that maintains consistent object identification across video frames even with intermittent detection gaps. Unlike existing open-world detection methods that require frame-specific hyperparameter tuning and suffer from numerous false positives, our system employs statistical approaches to minimize detection errors while ensuring consistent object tracking throughout entire video sequences. Extensive experimental validation demonstrates that SAM2Auto achieves comparable accuracy to manual annotation while dramatically reducing annotation time and eliminating labor costs. The system successfully handles diverse datasets without requiring retraining or extensive parameter adjustments, making it a practical solution for large-scale dataset creation. Our work establishes a new baseline for automated video annotation and provides a pathway for accelerating VLM development by addressing the fundamental dataset bottleneck that has constrained progress in vision-language understanding.
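The abstract only says that detection errors are minimized "statistically". As a hedged illustration of that idea, the sketch below keeps a detected label only if its score clears a robust (median/MAD-based) threshold and it persists across several frames; the threshold rule, persistence count, and data layout are hypothetical, not SAM2Auto's actual procedure.

```python
import numpy as np

def robust_score_threshold(scores, k=2.0):
    """Hypothetical per-video threshold: median minus k robust deviations (MAD)."""
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-6
    return med - k * 1.4826 * mad

def filter_detections(per_frame_dets, min_persistence=3, k=2.0):
    """per_frame_dets: list of (label, score) lists, one per frame.

    Keep a label only if it clears the robust score threshold in at least
    `min_persistence` frames, suppressing one-off false positives."""
    all_scores = np.array([s for frame in per_frame_dets for _, s in frame])
    thr = robust_score_threshold(all_scores, k)
    counts = {}
    for frame in per_frame_dets:
        for label, score in frame:
            if score >= thr:
                counts[label] = counts.get(label, 0) + 1
    return {label for label, c in counts.items() if c >= min_persistence}

frames = [[("person", 0.82), ("kite", 0.21)], [("person", 0.79)], [("person", 0.85), ("dog", 0.40)]]
print(filter_detections(frames, min_persistence=2))   # only the persistent, confident label survives
```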
https://arxiv.org/abs/2506.07850
Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.
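As a rough sketch of the training signal described here — predicting current and future teacher patch features from past and present RGB frames — the PyTorch snippet below uses a tiny stand-in convolutional "teacher" in place of a frozen DINO ViT so it runs self-contained; the architectures and loss weighting are assumptions, not FRAME's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins: in the paper the teacher would be a frozen DINO ViT producing patch
# features; here a small conv net plays that role so the sketch runs end-to-end.
teacher = nn.Conv2d(3, 64, kernel_size=16, stride=16)      # frozen patch-feature "teacher"
student = nn.Sequential(                                    # trainable video frame encoder
    nn.Conv2d(6, 128, kernel_size=16, stride=16), nn.GELU(),
    nn.Conv2d(128, 2 * 64, kernel_size=1),                  # predicts current + future features
)
for p in teacher.parameters():
    p.requires_grad_(False)

def frame_loss(past_frame, current_frame, future_frame):
    """Predict teacher patch features of the current and future frame
    from the (past, current) RGB pair; cosine-style regression loss."""
    inp = torch.cat([past_frame, current_frame], dim=1)     # B x 6 x H x W
    pred_cur, pred_fut = student(inp).chunk(2, dim=1)
    with torch.no_grad():
        tgt_cur, tgt_fut = teacher(current_frame), teacher(future_frame)
    cos = lambda a, b: 1 - F.cosine_similarity(a, b, dim=1).mean()
    return cos(pred_cur, tgt_cur) + cos(pred_fut, tgt_fut)

frames = [torch.randn(2, 3, 224, 224) for _ in range(3)]    # (past, current, future) toy batch
print(frame_loss(*frames).item())
```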
https://arxiv.org/abs/2506.05543
Finding reliable matches is essential in multi-object tracking to ensure the accuracy and reliability of perception systems in safety-critical applications such as autonomous vehicles. Effective matching mitigates perception errors, enhancing object identification and tracking for improved performance and safety. However, traditional metrics such as Intersection over Union (IoU) and Center Point Distances (CPDs), which are effective in 2D image planes, often fail to find critical matches in complex 3D scenes. To address this limitation, we introduce Contour Errors (CEs), an ego- or object-centric metric for identifying matches of interest in tracking scenarios from a functional perspective. By comparing bounding boxes in the ego vehicle's frame, Contour Errors provide a more functionally relevant assessment of object matches. Extensive experiments on the nuScenes dataset demonstrate that Contour Errors improve the reliability of matches over the state-of-the-art 2D IoU and CPD metrics in tracking-by-detection methods. In 3D car tracking, our results show that Contour Errors reduce functional failures (FPs/FNs) by 80% at close ranges and 60% at far ranges compared to IoU in the evaluation stage.
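The abstract does not define the Contour Error formula; the sketch below is one plausible, heavily simplified instantiation — measuring the gap between the ego-closest points on the predicted and ground-truth box contours in the ego frame — with hypothetical box parameters.

```python
import numpy as np

def box_corners_2d(cx, cy, length, width, yaw):
    """Corner points of a yaw-oriented box footprint, expressed in the ego frame."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s], [s, c]])
    local = np.array([[ length,  width], [ length, -width],
                      [-length, -width], [-length,  width]]) / 2.0
    return local @ R.T + np.array([cx, cy])

def contour_error(pred_box, gt_box, n_samples=100):
    """Hypothetical contour error: distance between the ego-closest points
    on the predicted and ground-truth box contours (ego vehicle at the origin)."""
    def densify(corners):
        pts = []
        for a, b in zip(corners, np.roll(corners, -1, axis=0)):
            t = np.linspace(0.0, 1.0, n_samples, endpoint=False)[:, None]
            pts.append(a + t * (b - a))                     # sample points along each edge
        return np.concatenate(pts)
    p = densify(box_corners_2d(*pred_box))
    g = densify(box_corners_2d(*gt_box))
    closest_p = p[np.argmin(np.linalg.norm(p, axis=1))]     # contour point nearest the ego
    closest_g = g[np.argmin(np.linalg.norm(g, axis=1))]
    return float(np.linalg.norm(closest_p - closest_g))

pred = (10.3, 2.1, 4.5, 1.9, 0.05)   # (cx, cy, length, width, yaw) in the ego frame
gt   = (10.0, 2.0, 4.6, 1.9, 0.00)
print(contour_error(pred, gt))
```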
https://arxiv.org/abs/2506.04122
Multi-object tracking (MOT) in team sports is particularly challenging due to fast-paced motion and frequent occlusions, which result in motion blur and identity switches, respectively. Predicting player positions in such scenarios is particularly difficult given the highly non-linear motion patterns observed. Current methods rely heavily on object detection and appearance-based tracking, which struggle in complex team sports scenarios, where appearance cues are ambiguous and motion patterns are not necessarily linear. To address these challenges, we introduce SportMamba, an adaptive hybrid MOT technique specifically designed for tracking in dynamic team sports. The technical contribution of SportMamba is twofold. First, we introduce a Mamba-attention mechanism that models non-linear motion by implicitly focusing on relevant embedding dependencies. Second, we propose a height-adaptive spatial association metric that reduces ID switches caused by partial occlusions by accounting for scale variations due to depth changes. Additionally, we extend the detection search space with adaptive buffers to improve associations in fast-motion scenarios. Our proposed technique, SportMamba, demonstrates state-of-the-art performance on various metrics in the SportsMOT dataset, which is characterized by complex motion and severe occlusion. Furthermore, we demonstrate its generalization capability through zero-shot transfer to VIP-HTD, an ice hockey dataset.
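One plausible reading of a height-adaptive association metric with adaptive buffers is a buffered IoU whose margin scales with box height, so distant (small) players get proportionally tighter search regions than nearby (large) ones; the sketch below illustrates that reading with assumed ratios, not SportMamba's actual metric.

```python
import numpy as np

def expand_box(box, ratio):
    """Expand an (x1, y1, x2, y2) box by a margin proportional to its height."""
    x1, y1, x2, y2 = box
    m = ratio * (y2 - y1)
    return np.array([x1 - m, y1 - m, x2 + m, y2 + m])

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def height_adaptive_iou(track_box, det_box, base_ratio=0.3):
    """Buffered IoU whose margin scales with box height (a proxy for depth)."""
    return iou(expand_box(track_box, base_ratio), expand_box(det_box, base_ratio))

t = np.array([100, 100, 140, 220.0])   # tracked player box
d = np.array([150, 105, 190, 225.0])   # new detection after fast motion
print(iou(t, d), height_adaptive_iou(t, d))   # 0.0 vs. a usable overlap score
```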
https://arxiv.org/abs/2506.03335
Visual Object Tracking (VOT) is a fundamental task with widespread applications in autonomous navigation, surveillance, and maritime robotics. Despite significant advances in generic object tracking, maritime environments continue to present unique challenges, including specular water reflections, low-contrast targets, dynamically changing backgrounds, and frequent occlusions. These complexities significantly degrade the performance of state-of-the-art tracking algorithms, highlighting the need for domain-specific datasets. To address this gap, we introduce the Maritime Visual Tracking Dataset (MVTD), a comprehensive and publicly available benchmark specifically designed for maritime VOT. MVTD comprises 182 high-resolution video sequences, totaling approximately 150,000 frames, and includes four representative object classes: boat, ship, sailboat, and unmanned surface vehicle (USV). The dataset captures a diverse range of operational conditions and maritime scenarios, reflecting the real-world complexities of maritime environments. We evaluated 14 recent SOTA tracking algorithms on the MVTD benchmark and observed substantial performance degradation compared to their performance on general-purpose datasets. However, when fine-tuned on MVTD, these models demonstrate significant performance gains, underscoring the effectiveness of domain adaptation and the importance of transfer learning in specialized tracking contexts. The MVTD dataset fills a critical gap in the visual tracking community by providing a realistic and challenging benchmark for maritime scenarios. The dataset and source code can be accessed at this https URL.
https://arxiv.org/abs/2506.02866
Multi-object tracking (MOT) is essential for sports analytics, enabling performance evaluation and tactical insights. However, tracking in sports is challenging due to fast movements, occlusions, and camera shifts. Traditional tracking-by-detection methods require extensive tuning, while segmentation-based approaches struggle with track processing. We propose McByte, a tracking-by-detection framework that integrates a temporally propagated segmentation mask as an association cue to improve robustness without per-video tuning. Unlike many existing methods, McByte does not require training, relying solely on pre-trained models and object detectors commonly used in the community. Evaluated on SportsMOT, DanceTrack, SoccerNet-tracking 2022 and MOT17, McByte demonstrates strong performance across sports and general pedestrian tracking. Our results highlight the benefits of mask propagation for a more adaptable and generalizable MOT approach. Code will be made available at this https URL.
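A hedged sketch of using a propagated mask as an association cue: blend box IoU with the overlap between each track's propagated mask and each detection's mask, then solve the assignment with the Hungarian method. The blending weight, acceptance threshold, and toy masks are assumptions, not McByte's actual scoring.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(m1, m2):
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / (union + 1e-9)

def associate(track_masks, det_masks, box_iou, alpha=0.5, thresh=0.3):
    """Blend box IoU with mask overlap, then run Hungarian assignment."""
    score = np.zeros((len(track_masks), len(det_masks)))
    for i, tm in enumerate(track_masks):
        for j, dm in enumerate(det_masks):
            score[i, j] = alpha * box_iou[i, j] + (1 - alpha) * mask_iou(tm, dm)
    rows, cols = linear_sum_assignment(-score)              # maximize the blended score
    return [(r, c) for r, c in zip(rows, cols) if score[r, c] >= thresh]

# toy 8x8 masks: one propagated track mask and two candidate detection masks
tm = np.zeros((8, 8), bool); tm[2:6, 2:6] = True
d1 = np.zeros((8, 8), bool); d1[3:7, 3:7] = True
d2 = np.zeros((8, 8), bool); d2[0:2, 6:8] = True
box_iou = np.array([[0.45, 0.0]])                           # assumed precomputed box IoUs
print(associate([tm], [d1, d2], box_iou))                   # the mask cue confirms match (0, 0)
```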
https://arxiv.org/abs/2506.01373
Current motion-based multiple object tracking (MOT) approaches rely heavily on Intersection-over-Union (IoU) for object association. Without using 3D features, they are ineffective in scenarios with occlusions or visually similar objects. To address this, our paper presents a novel depth-aware framework for MOT. We estimate depth using a zero-shot approach and incorporate it as an independent feature in the association process. Additionally, we introduce a Hierarchical Alignment Score that refines IoU by integrating both coarse bounding-box overlap and fine-grained (pixel-level) alignment to improve association accuracy without requiring additional learnable parameters. To our knowledge, this is the first MOT framework to incorporate 3D features (monocular depth) as an independent decision matrix in the association step. Our framework achieves state-of-the-art results on challenging benchmarks without any training or fine-tuning. The code is available at this https URL.
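As an illustration (not the paper's exact definitions), the sketch below refines box IoU with a pixel-level mask-agreement term and keeps zero-shot depth as a separate consistency cue that must independently agree before a match is accepted; the weights and the depth kernel are assumed.

```python
import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def hierarchical_alignment(track_box, det_box, track_mask, det_mask):
    """Coarse box IoU refined by fine-grained pixel agreement between the two masks."""
    coarse = iou(track_box, det_box)
    fine = np.logical_and(track_mask, det_mask).sum() / (track_mask.sum() + 1e-9)
    return 0.5 * coarse + 0.5 * fine

def depth_consistency(track_depth, det_depth, scale=5.0):
    """Independent depth cue: 1 when depths agree, decaying with the gap (metres)."""
    return float(np.exp(-abs(track_depth - det_depth) / scale))

t_box, d_box = [10, 10, 60, 120], [14, 12, 66, 124]
t_mask = np.zeros((200, 200), bool); t_mask[10:120, 10:60] = True
d_mask = np.zeros((200, 200), bool); d_mask[12:124, 14:66] = True
has = hierarchical_alignment(t_box, d_box, t_mask, d_mask)
dep = depth_consistency(track_depth=12.4, det_depth=13.1)
print(has, dep, has > 0.4 and dep > 0.5)   # both cues must agree before matching
```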
https://arxiv.org/abs/2506.00774
Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples. This requires mechanisms that can both identify semantically related objects across images and accurately produce segmentation masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, offers both strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher-level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes this latent structure explicit and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art performance on few-shot segmentation benchmarks specifically designed to assess generalization, outperforms generalist methods in the popular in-context setting, supports flexible interaction via various prompts such as points, boxes, or scribbles, and remains significantly faster and more compact than prior approaches. Code is available at this https URL.
https://arxiv.org/abs/2505.21795
The integration of image and event streams offers a promising approach for achieving robust visual object tracking in complex environments. However, current fusion methods achieve high performance at the cost of significant computational overhead and struggle to efficiently extract the sparse, asynchronous information from event streams, failing to leverage the energy-efficient advantages of event-driven spiking paradigms. To address this challenge, we propose the first fully Spiking Frame-Event Tracking framework called SpikeFET. This network achieves synergistic integration of convolutional local feature extraction and Transformer-based global modeling within the spiking paradigm, effectively fusing frame and event data. To overcome the degradation of translation invariance caused by convolutional padding, we introduce a Random Patchwork Module (RPM) that eliminates positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures. Furthermore, we propose a Spatial-Temporal Regularization (STR) strategy that overcomes similarity metric degradation from asymmetric features by enforcing spatio-temporal consistency among temporal template features in latent space. Extensive experiments across multiple benchmarks demonstrate that the proposed framework achieves superior tracking accuracy over existing methods while significantly reducing power consumption, attaining an optimal balance between performance and efficiency. The code will be released.
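The sketch below illustrates only the spatial-reorganization idea behind an RPM-style block, in plain PyTorch rather than a spiking implementation: feature-map patches are placed at random positions, a learnable type embedding is added, and a residual path preserves the input. The grid size, embedding use, and residual form are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class RandomPatchwork(nn.Module):
    """Hypothetical sketch of an RPM-style block: cut the feature map into
    patches, place them at random spatial positions, and add a learnable
    type embedding so the network can still tell token types apart after
    the reshuffle. A residual path preserves the original input."""

    def __init__(self, channels, grid=4, num_types=2):
        super().__init__()
        self.grid = grid
        self.type_embed = nn.Embedding(num_types, channels)

    def forward(self, x, type_id=0):
        B, C, H, W = x.shape
        g, ph, pw = self.grid, H // self.grid, W // self.grid   # assumes H, W divisible by grid
        patches = x.unfold(2, ph, ph).unfold(3, pw, pw)          # B, C, g, g, ph, pw
        patches = patches.reshape(B, C, g * g, ph, pw)
        perm = torch.randperm(g * g, device=x.device)            # random spatial reorganization
        patches = patches[:, :, perm]
        y = patches.reshape(B, C, g, g, ph, pw).permute(0, 1, 2, 4, 3, 5).reshape(B, C, H, W)
        y = y + self.type_embed(torch.tensor(type_id, device=x.device)).view(1, C, 1, 1)
        return x + y                                             # keep a residual connection

rpm = RandomPatchwork(channels=32, grid=4)
print(rpm(torch.randn(2, 32, 64, 64)).shape)
```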
https://arxiv.org/abs/2505.20834
Referring Multi-Object Tracking (RMOT) is an important research field in computer vision. Its task is to guide models to track the objects that conform to a language instruction. However, because the RMOT task commonly assumes clear language instructions, such methods often fail when complex language instructions with reasoning characteristics appear. In this work, we propose a new task, called Reasoning-based Multi-Object Tracking (ReaMOT). ReaMOT is a more challenging task that requires accurately reasoning about the objects that match a language instruction with reasoning characteristics and tracking those objects' trajectories. To advance the ReaMOT task and evaluate the reasoning capabilities of tracking models, we construct ReaMOT Challenge, a reasoning-based multi-object tracking benchmark built upon 12 datasets. Specifically, it comprises 1,156 language instructions with reasoning characteristics, 423,359 image-language pairs, and 869 diverse scenes, divided into three levels of reasoning difficulty. In addition, we propose a set of evaluation metrics tailored to the ReaMOT task. Furthermore, we propose ReaTrack, a training-free framework for reasoning-based multi-object tracking based on large vision-language models (LVLMs) and SAM2, as a baseline for the ReaMOT task. Extensive experiments on the ReaMOT Challenge benchmark demonstrate the effectiveness of our ReaTrack framework.
https://arxiv.org/abs/2505.20381
In this work, we propose a progressive scaling training strategy for visual object tracking, systematically analyzing the influence of training data volume, model size, and input resolution on tracking performance. Our empirical study reveals that while scaling each factor leads to significant improvements in tracking accuracy, naive training suffers from suboptimal optimization and limited iterative refinement. To address this issue, we introduce DT-Training, a progressive scaling framework that integrates small teacher transfer and dual-branch alignment to maximize model potential. The resulting scaled tracker consistently outperforms state-of-the-art methods across multiple benchmarks, demonstrating strong generalization and transferability of the proposed method. Furthermore, we validate the broader applicability of our approach to additional tasks, underscoring its versatility beyond tracking.
https://arxiv.org/abs/2505.19990
In this paper, we present a novel distributed expectation propagation algorithm for multi-sensor, multi-object tracking in cluttered environments. The proposed framework enables each sensor to operate locally while collaboratively exchanging moment estimates with other sensors, thus eliminating the need to transmit all data to a central processing node. Specifically, we introduce a fast and parallelisable Rao-Blackwellised Gibbs sampling scheme to approximate the tilted distributions, which enhances the accuracy and efficiency of the expectation propagation updates. Results demonstrate that the proposed algorithm improves both communication and inference efficiency for multi-object tracking tasks with dynamic sensor connectivity and varying clutter levels.
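A minimal illustration of the decentralised idea — sensors exchange Gaussian moment estimates rather than raw measurements, and fusion amounts to summing natural parameters — is sketched below; it is plain Gaussian fusion, not the paper's full expectation propagation with Rao-Blackwellised Gibbs inner loops, and the prior and sensor covariances are assumptions.

```python
import numpy as np

def to_natural(mean, cov):
    """Gaussian (mean, cov) -> natural parameters (precision, precision @ mean)."""
    prec = np.linalg.inv(cov)
    return prec, prec @ mean

def fuse_sensor_messages(prior, sensor_estimates):
    """Each sensor sends only the moments of its local likelihood approximation;
    the fused posterior sums natural parameters, so raw measurements never
    leave the sensor. (Illustrative Gaussian fusion, not the full EP scheme.)"""
    prec, info = to_natural(*prior)
    for mean, cov in sensor_estimates:
        p_i, h_i = to_natural(mean, cov)
        prec, info = prec + p_i, info + h_i
    cov = np.linalg.inv(prec)
    return cov @ info, cov

prior = (np.zeros(2), 10.0 * np.eye(2))                     # vague prior over object position
sensors = [(np.array([2.1, 0.9]), 0.5 * np.eye(2)),         # local moment estimates
           (np.array([1.9, 1.2]), 0.8 * np.eye(2))]
mean, cov = fuse_sensor_messages(prior, sensors)
print(mean, np.diag(cov))
```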
https://arxiv.org/abs/2505.18795
Multi-view multi-object tracking (MVMOT) has found widespread applications in intelligent transportation, surveillance systems, and urban management. However, existing studies rarely address genuinely free-viewpoint MVMOT systems, which could significantly enhance the flexibility and scalability of cooperative tracking systems. To bridge this gap, we first construct the Multi-Drone Multi-Object Tracking (MDMOT) dataset, captured by mobile drone swarms across diverse real-world scenarios, establishing the first benchmark for multi-object tracking in arbitrary multi-view environments. Building upon this foundation, we propose FusionTrack, an end-to-end framework that integrates tracking and re-identification to leverage multi-view information for robust trajectory association. Extensive experiments on our MDMOT and other benchmark datasets demonstrate that FusionTrack achieves state-of-the-art performance in both single-view and multi-view tracking.
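A hedged sketch of the cross-view association step such a system needs: match tracks between two drones by cosine distance between re-identification embeddings plus a Hungarian assignment; the embedding dimension and gating threshold are assumptions, not FusionTrack's design.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cross_view_match(emb_a, emb_b, max_cost=0.4):
    """Associate tracks seen by two drones using cosine distance between
    re-identification embeddings (rows are feature vectors)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                                     # cosine distance matrix
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]

rng = np.random.default_rng(0)
view_a = rng.normal(size=(3, 128))                              # embeddings from drone A
view_b = view_a[[2, 0, 1]] + 0.05 * rng.normal(size=(3, 128))   # same objects, reshuffled, from drone B
print(cross_view_match(view_a, view_b))                         # expect [(0, 1), (1, 2), (2, 0)]
```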
https://arxiv.org/abs/2505.18727
We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first place AUC score of 89.4 on the 2024 ICPR Multi-modal Object Tracking challenge, demonstrating the effectiveness of our approach. This paper details our methodology, the specific enhancements made to SAM2, and a comprehensive analysis of our results in the context of VOT solutions along with the multi-modality aspect of the dataset.
https://arxiv.org/abs/2505.18111
Multi-object tracking (MOT) in computer vision has made significant advancements, yet tracking small fish in underwater environments presents unique challenges due to complex 3D motions and data noise. Traditional single-view MOT models often fall short in these settings. This thesis addresses these challenges by adapting state-of-the-art single-view MOT models, FairMOT and YOLOv8, for underwater fish detection and tracking in ecological studies. The core contribution of this research is the development of a multi-view framework that utilizes stereo video inputs to enhance tracking accuracy and fish behavior pattern recognition. By integrating and evaluating these models on underwater fish video datasets, the study aims to demonstrate significant improvements in precision and reliability compared to single-view approaches. The proposed framework detects fish entities with a relative accuracy of 47% and employs stereo-matching techniques to produce a novel 3D output, providing a more comprehensive understanding of fish movements and interactions.
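Once left/right detections are matched, the stereo step reduces to classic rectified-stereo triangulation, sketched below with assumed calibration values (focal length, baseline, principal point); the thesis's actual rig parameters are not given in the abstract.

```python
import numpy as np

def triangulate_depth(x_left, x_right, focal_px, baseline_m):
    """Classic rectified-stereo depth: Z = f * B / disparity (pixels)."""
    disparity = np.asarray(x_left, float) - np.asarray(x_right, float)
    return focal_px * baseline_m / np.maximum(disparity, 1e-6)

def to_3d(x_left, y_left, x_right, focal_px, baseline_m, cx, cy):
    """Back-project matched left/right detections of a fish into camera coordinates."""
    Z = triangulate_depth(x_left, x_right, focal_px, baseline_m)
    X = (np.asarray(x_left, float) - cx) * Z / focal_px
    Y = (np.asarray(y_left, float) - cy) * Z / focal_px
    return np.stack([X, Y, Z], axis=-1)

# assumed calibration: 1200 px focal length, 10 cm baseline, principal point (640, 360)
pts = to_3d(x_left=[700, 400], y_left=[380, 300], x_right=[640, 370],
            focal_px=1200.0, baseline_m=0.10, cx=640.0, cy=360.0)
print(pts)   # 3D positions of two matched fish detections
```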
https://arxiv.org/abs/2505.17201
Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in tracking objects for grounding over time and in reasoning-based decision-making that better aligns object references with language model outputs, even as newer models improve at both tasks. This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA) that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state of the art in VideoQA and video understanding, showing enhanced performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at this https URL.
https://arxiv.org/abs/2505.15928
Existing tracking algorithms typically rely on low-frame-rate RGB cameras coupled with computationally intensive deep neural network architectures to achieve effective tracking. However, such frame-based methods inherently face challenges in achieving low-latency performance and often fail in resource-constrained environments. Visual object tracking using bio-inspired event cameras has emerged as a promising research direction in recent years, offering distinct advantages for low-latency applications. In this paper, we propose a novel Slow-Fast Tracking paradigm, termed SFTrack, that flexibly adapts to different operational requirements. The proposed framework supports two complementary modes, i.e., a high-precision slow tracker for scenarios with sufficient computational resources, and an efficient fast tracker tailored for latency-aware, resource-constrained environments. Specifically, our framework first performs graph-based representation learning from high-temporal-resolution event streams, and then integrates the learned graph-structured information into two FlashAttention-based vision backbones, yielding the slow and fast trackers, respectively. The fast tracker achieves low latency through a lightweight network design and by producing multiple bounding box outputs in a single forward pass. Finally, we seamlessly combine both trackers via supervised fine-tuning and further enhance the fast tracker's performance through a knowledge distillation strategy. Extensive experiments on public benchmarks, including FE240, COESOT, and EventVOT, demonstrate the effectiveness and efficiency of our proposed method across different real-world scenarios. The source code has been released at this https URL.
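A minimal sketch of the distillation idea, assuming the fast tracker regresses the slow tracker's boxes and mimics its intermediate features; the loss terms, weighting, and tensor shapes are illustrative, not SFTrack's published objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(fast_boxes, slow_boxes, fast_feats, slow_feats, alpha=0.5):
    """Hypothetical KD objective: the fast tracker regresses the slow tracker's
    box outputs (smooth L1) and mimics its features (cosine), so the lightweight
    branch inherits accuracy without the slow branch's latency."""
    box_term = F.smooth_l1_loss(fast_boxes, slow_boxes.detach())
    feat_term = 1.0 - F.cosine_similarity(fast_feats, slow_feats.detach(), dim=-1).mean()
    return alpha * box_term + (1.0 - alpha) * feat_term

fast_boxes, slow_boxes = torch.rand(8, 4, requires_grad=True), torch.rand(8, 4)
fast_feats, slow_feats = torch.randn(8, 256, requires_grad=True), torch.randn(8, 256)
loss = distillation_loss(fast_boxes, slow_boxes, fast_feats, slow_feats)
loss.backward()                      # gradients flow only into the fast-tracker tensors
print(loss.item())
```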
https://arxiv.org/abs/2505.12903
Multi-object tracking from LiDAR point clouds presents unique challenges due to the sparse and irregular nature of the data, compounded by the need for temporal coherence across frames. Traditional tracking systems often rely on hand-crafted features and motion models, which can struggle to maintain consistent object identities in crowded or fast-moving scenes. We present a LiDAR-based, two-stage, DETR-inspired transformer comprising a smoother and a tracker. The smoother stage refines LiDAR object detections from any off-the-shelf detector across a moving temporal window. The tracker stage uses a DETR-based attention block to maintain tracks across time by associating tracked objects with the refined detections, using the point cloud as context. The model is trained on the nuScenes and KITTI datasets in both online and offline (forward-peeking) modes, demonstrating strong performance on metrics such as ID switches and multiple object tracking accuracy (MOTA). The numerical results indicate that the online mode outperforms the LiDAR-only baseline and SOTA models on the nuScenes dataset, with an aMOTA of 0.722 and an aMOTP of 0.475, while the offline mode provides an additional 3 percentage points of aMOTP.
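The online/offline distinction amounts to whether the smoother's temporal window may peek ahead of the current frame; a tiny sketch (the window half-width `k` is an assumption):

```python
def smoothing_window(t, num_frames, k=2, offline=False):
    """Frame indices fed to the smoother at time t: a causal window in online
    mode, or a centred window that also peeks k frames ahead in offline mode."""
    lo = max(0, t - k)
    hi = min(num_frames - 1, t + k) if offline else t
    return list(range(lo, hi + 1))

print(smoothing_window(t=10, num_frames=100, k=2, offline=False))  # [8, 9, 10]
print(smoothing_window(t=10, num_frames=100, k=2, offline=True))   # [8, 9, 10, 11, 12]
```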
https://arxiv.org/abs/2505.12753
Multi-modal object tracking integrates auxiliary modalities such as depth, thermal infrared, event flow, and language to provide additional information beyond RGB images, showing great potential for improving tracking stability in complex scenarios. Existing methods typically start from an RGB-based tracker and learn to understand auxiliary modalities only from training data. Constrained by the limited multi-modal training data, the performance of these methods is unsatisfactory. To alleviate this limitation, this work proposes Diff-MM, a unified multi-modal tracker that exploits the multi-modal understanding capability of a pre-trained text-to-image generation model. Diff-MM leverages the UNet of pre-trained Stable Diffusion as a tracking feature extractor through the proposed parallel feature extraction pipeline, which enables pairwise image inputs for object tracking. We further introduce a multi-modal sub-module tuning method that learns to gain complementary information between different modalities. By harnessing the extensive prior knowledge in the generation model, we achieve a unified tracker with uniform parameters for RGB-N/D/T/E tracking. Experimental results demonstrate the promising performance of our method compared with recently proposed trackers; e.g., its AUC outperforms OneTracker by 8.3% on TNL2K.
https://arxiv.org/abs/2505.12606