Accurate online multiple-camera vehicle tracking is essential for intelligent transportation systems, autonomous driving, and smart city applications. Like single-camera multiple-object tracking, it is commonly formulated as a graph problem of tracking-by-detection. Within this framework, existing online methods usually consist of two-stage procedures that cluster temporally first, then spatially, or vice versa. This is computationally expensive and prone to error accumulation. We introduce a graph representation that allows spatial-temporal clustering in a single, combined step: new detections are spatially and temporally connected with existing clusters. By keeping sparse appearance and positional cues of all detections in a cluster, our method can compare clusters based on the strongest available evidence. The final tracks are obtained online using a simple multicut assignment procedure. Our method does not require any training on the target scene, pre-extraction of single-camera tracks, or additional annotations. Notably, we outperform the online state-of-the-art in terms of IDF1 by more than 14% on the CityFlow dataset and by more than 25% on the Synthehicle dataset. The code is publicly available.
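As a rough illustration of the single-step spatio-temporal clustering idea, the sketch below keeps a sparse set of appearance embeddings and positions per cluster and scores a new detection against a cluster by its strongest available cue; the greedy assignment loop is only a stand-in for the paper's multicut procedure, and all class and function names are hypothetical.

```python
import numpy as np

class Cluster:
    """Toy spatio-temporal cluster keeping sparse cues of its detections.

    Simplification: each detection contributes one appearance embedding and one
    ground-plane position; the paper's graph formulation is not reproduced.
    """

    def __init__(self, max_cues=10):
        self.appearances = []   # list of L2-normalized embeddings
        self.positions = []     # list of (x, y) world coordinates
        self.max_cues = max_cues

    def add(self, embedding, position):
        emb = embedding / (np.linalg.norm(embedding) + 1e-12)
        self.appearances.append(emb)
        self.positions.append(np.asarray(position, dtype=float))
        # keep the cue sets sparse
        self.appearances = self.appearances[-self.max_cues:]
        self.positions = self.positions[-self.max_cues:]

    def affinity(self, embedding, position, pos_scale=5.0):
        """Strongest-evidence affinity: best appearance match plus closest
        positional match, each mapped to [0, 1]."""
        emb = embedding / (np.linalg.norm(embedding) + 1e-12)
        app = max(float(emb @ a) for a in self.appearances)
        dist = min(np.linalg.norm(position - p) for p in self.positions)
        return 0.5 * app + 0.5 * np.exp(-dist / pos_scale)


def assign_online(detections, clusters, threshold=0.6):
    """Greedy online assignment of new detections to existing clusters
    (a stand-in for the multicut-based assignment in the paper)."""
    for emb, pos in detections:
        scores = [c.affinity(emb, np.asarray(pos)) for c in clusters]
        if scores and max(scores) > threshold:
            clusters[int(np.argmax(scores))].add(emb, pos)
        else:
            new = Cluster()
            new.add(emb, pos)
            clusters.append(new)
    return clusters
```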
https://arxiv.org/abs/2410.02638
Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks, including three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We offer four granularity texts in our benchmark, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance and hope the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results and toolkit will be released gradually on this http URL.
https://arxiv.org/abs/2410.02492
Event-based cameras are attracting significant interest as they provide rich edge information, high dynamic range, and high temporal resolution. Many state-of-the-art event-based algorithms rely on splitting the events into fixed groups, resulting in the omission of crucial temporal information, particularly when dealing with diverse motion scenarios (e.g., high/low speed). In this work, we propose SpikeSlicer, a novel plug-and-play event processing method capable of splitting event streams adaptively. SpikeSlicer utilizes a lightweight (0.41M parameters) and low-energy spiking neural network (SNN) to trigger event slicing. To guide the SNN to fire spikes at optimal time steps, we propose the Spiking Position-aware Loss (SPA-Loss) to modulate the neuron's state. Additionally, we develop a Feedback-Update training strategy that refines the slicing decisions using feedback from the downstream artificial neural network (ANN). Extensive experiments demonstrate that our method yields significant performance improvements in event-based object tracking and recognition. Notably, SpikeSlicer provides a brand-new SNN-ANN cooperation paradigm, where the SNN acts as an efficient, low-energy data processor to assist the ANN in improving downstream performance, injecting new perspectives and potential avenues of exploration.
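A minimal sketch of adaptive event slicing driven by a spiking trigger: a single leaky integrate-and-fire neuron (standing in for the trained SNN) accumulates a crude activity measure and closes a slice when it fires. The SPA-Loss and Feedback-Update training are not reproduced, and the drive signal here is a placeholder heuristic, not the paper's learned decision.

```python
import numpy as np

def lif_step(potential, drive, decay=0.9, threshold=2.5):
    """One leaky integrate-and-fire update; returns (new_potential, spiked)."""
    potential = decay * potential + drive
    if potential >= threshold:
        return 0.0, True                                  # reset after firing
    return potential, False

def adaptive_slice(events, bin_size=1000, rate_scale=50_000.0):
    """Split an (N, 4) array of (x, y, t, polarity) events into adaptive slices.

    A real SpikeSlicer feeds event features to a trained SNN; here the drive is
    just the normalized event rate of each bin, so slices end sooner when
    activity is high. This is a placeholder heuristic.
    """
    slices, current, potential = [], [], 0.0
    for start in range(0, len(events), bin_size):
        chunk = events[start:start + bin_size]
        current.append(chunk)
        dt = float(np.ptp(chunk[:, 2])) + 1e-6            # time span of the bin
        drive = min(1.0, (len(chunk) / dt) / rate_scale)  # crude activity measure
        potential, spiked = lif_step(potential, drive)
        if spiked:                                        # a "spike" closes the slice
            slices.append(np.concatenate(current))
            current = []
    if current:
        slices.append(np.concatenate(current))
    return slices

# usage with synthetic events: 10k events over 0.1 s in a 640x480 frame
rng = np.random.default_rng(0)
events = np.column_stack([rng.integers(0, 640, 10_000),
                          rng.integers(0, 480, 10_000),
                          np.sort(rng.uniform(0.0, 0.1, 10_000)),
                          rng.integers(0, 2, 10_000)])
print([len(s) for s in adaptive_slice(events)])
```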
https://arxiv.org/abs/2410.02249
Objects we encounter often change appearance as we interact with them. Changes in illumination (shadows), object pose, or movement of nonrigid objects can drastically alter available image features. How do biological visual systems track objects as they change? It may involve specific attentional mechanisms for reasoning about the locations of objects independently of their appearances -- a capability that prominent neuroscientific theories have associated with computing through neural synchrony. We computationally test the hypothesis that the implementation of visual attention through neural synchrony underlies the ability of biological visual systems to track objects that change in appearance over time. We first introduce a novel deep learning circuit that can learn to precisely control attention to features separately from their location in the world through neural synchrony: the complex-valued recurrent neural network (CV-RNN). Next, we compare object tracking in humans, the CV-RNN, and other deep neural networks (DNNs), using FeatureTracker: a large-scale challenge that asks observers to track objects as their locations and appearances change in precisely controlled ways. While humans effortlessly solved FeatureTracker, state-of-the-art DNNs did not. In contrast, our CV-RNN behaved similarly to humans on the challenge, providing a computational proof-of-concept for the role of phase synchronization as a neural substrate for tracking appearance-morphing objects as they move about.
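To make the synchrony idea concrete, here is a toy complex-valued recurrent cell in numpy: magnitudes can carry feature evidence while phases can carry grouping, so two units with aligned phases can be read out as bound to the same object. This is an illustrative sketch under those assumptions, not the CV-RNN architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def modrelu(z, bias):
    """modReLU: nonlinearity acting on the magnitude, preserving the phase."""
    mag = np.abs(z)
    return np.maximum(mag + bias, 0.0) * np.exp(1j * np.angle(z))

class ComplexRNNCell:
    """Toy complex-valued recurrent cell (not the paper's CV-RNN)."""

    def __init__(self, input_dim, hidden_dim):
        s = 1.0 / np.sqrt(hidden_dim)
        self.W = s * (rng.standard_normal((hidden_dim, hidden_dim))
                      + 1j * rng.standard_normal((hidden_dim, hidden_dim)))
        self.U = s * (rng.standard_normal((hidden_dim, input_dim))
                      + 1j * rng.standard_normal((hidden_dim, input_dim)))
        self.b = np.zeros(hidden_dim)

    def step(self, h, x):
        return modrelu(self.W @ h + self.U @ x, self.b)

# usage: phase alignment between two hidden units as a soft "same object" cue
cell = ComplexRNNCell(input_dim=8, hidden_dim=16)
h = np.zeros(16, dtype=complex)
for _ in range(5):
    h = cell.step(h, rng.standard_normal(8).astype(complex))
same_object = np.cos(np.angle(h[0]) - np.angle(h[1]))  # 1 = perfectly in phase
```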
https://arxiv.org/abs/2410.02094
Multiple object tracking in complex scenarios - such as coordinated dance performances, team sports, or dynamic animal groups - presents unique challenges. In these settings, objects frequently move in coordinated patterns, occlude each other, and exhibit long-term dependencies in their trajectories. However, it remains a key open research question on how to model long-range dependencies within tracklets, interdependencies among tracklets, and the associated temporal occlusions. To this end, we introduce Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the multiple selective state-spaces used to model each tracklet. Samba autoregressively predicts the future track query for each sequence while maintaining synchronized long-term memory representations across tracklets. By integrating Samba into a tracking-by-propagation framework, we propose SambaMOTR, the first tracker effectively addressing the aforementioned issues, including long-range dependencies, tracklet interdependencies, and temporal occlusions. Additionally, we introduce an effective technique for dealing with uncertain observations (MaskObs) and an efficient training recipe to scale SambaMOTR to longer sequences. By modeling long-range dependencies and interactions among tracked objects, SambaMOTR implicitly learns to track objects accurately through occlusions without any hand-crafted heuristics. Our approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT, and SportsMOT datasets.
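A hedged sketch of the core mechanism: each tracklet is advanced by a selective (input-gated) state-space update, and the tracklet memories are then softly mixed to mimic synchronization across the set of sequences. Shapes, the mixing rule, and all names are assumptions for illustration, not Samba's actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_ssm_step(h, x, A, W_b, W_c):
    """One selective state-space step: the input modulates how much is written.

    h: (d,) hidden state of one tracklet, x: (d,) tracklet observation.
    The gate computed from W_b makes the recurrence input-dependent ("selective").
    """
    gate = 1.0 / (1.0 + np.exp(-(W_b @ x)))     # input-dependent write gate
    h = A * h + gate * x                        # diagonal transition A
    y = W_c @ h                                 # readout, e.g. the next track query
    return h, y

def track_step(states, observations, A, W_b, W_c, mix=0.1):
    """Update every tracklet and softly synchronize their memories."""
    new_states, queries = [], []
    for h, x in zip(states, observations):
        h, y = selective_ssm_step(h, x, A, W_b, W_c)
        new_states.append(h)
        queries.append(y)
    mean_state = np.mean(new_states, axis=0)    # shared context across tracklets
    new_states = [(1 - mix) * h + mix * mean_state for h in new_states]
    return new_states, queries

# usage with 3 tracklets and 16-dimensional states (illustrative shapes only)
d = 16
A = np.full(d, 0.9)                              # stable diagonal dynamics
W_b = rng.standard_normal((d, d)) / np.sqrt(d)
W_c = rng.standard_normal((d, d)) / np.sqrt(d)
states = [np.zeros(d) for _ in range(3)]
obs = [rng.standard_normal(d) for _ in range(3)]
states, queries = track_step(states, obs, A, W_b, W_c)
```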
https://arxiv.org/abs/2410.01806
3D multi-object tracking plays a critical role in autonomous driving by enabling the real-time monitoring and prediction of multiple objects' movements. Traditional 3D tracking systems are typically constrained by predefined object categories, limiting their adaptability to novel, unseen objects in dynamic environments. To address this limitation, we introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories. We formulate the problem of open-vocabulary 3D tracking and introduce dataset splits designed to represent various open-vocabulary scenarios. We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes. Our method effectively reduces the performance gap between tracking known and novel objects through strategic adaptation. Experimental results demonstrate the robustness and adaptability of our method in diverse outdoor driving scenarios. To the best of our knowledge, this work is the first to address open-vocabulary 3D tracking, presenting a significant advancement for autonomous systems in real-world settings. Code, trained models, and dataset splits are available publicly.
https://arxiv.org/abs/2410.01678
We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based methods, such as LISA, struggle with video tasks due to the additional temporal dimension, which requires temporal dynamic understanding and consistent segmentation across frames. VideoLISA addresses these challenges by integrating a Sparse Dense Sampling strategy into the video-LLM, which balances temporal context and spatial detail within computational constraints. Additionally, we propose a One-Token-Seg-All approach using a specially designed <TRK> token, enabling the model to segment and track objects across multiple frames. Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking. While optimized for videos, VideoLISA also shows promising generalization to image segmentation, revealing its potential as a unified foundation model for language-instructed object segmentation. Code and model will be available at: this https URL.
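The One-Token-Seg-All idea can be sketched as comparing a single track-token embedding against dense per-frame features to obtain a mask for every frame; the snippet below is a simplified placeholder for VideoLISA's actual <TRK> decoding and mask head.

```python
import torch

def one_token_seg_all(frame_features, trk_token, temperature=0.07):
    """Segment the same object in every frame from one token embedding.

    frame_features: (T, C, H, W) per-frame dense features, trk_token: (C,)
    embedding decoded from a special <TRK> token. Returns (T, H, W) soft masks.
    """
    token = torch.nn.functional.normalize(trk_token, dim=0)
    feats = torch.nn.functional.normalize(frame_features, dim=1)
    logits = torch.einsum("tchw,c->thw", feats, token) / temperature
    return logits.sigmoid()

# usage with random tensors, illustrative shapes only
masks = one_token_seg_all(torch.randn(8, 256, 64, 64), torch.randn(256))
print(masks.shape)  # torch.Size([8, 64, 64])
```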
https://arxiv.org/abs/2409.19603
Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT proposes a prompt generation network with the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge for tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once the visual prompt is refined, it can better highlight potential target locations, thereby reducing irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps through the guidance of the visual prompt, thus effectively reducing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pretrained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method, can suppress distracting objects and enhance the tracker.
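A rough sketch of the prompt-refinement step: candidate regions highlighted by the coarse prompt are re-weighted by their best similarity to the reference template in a foundation-model embedding space. Embedding extraction is assumed to happen elsewhere (e.g., with CLIP's image encoder), and the blending rule below is illustrative, not PiVOT's exact module.

```python
import torch

def refine_prompt(candidate_feats, template_feats, prompt_scores, alpha=0.5):
    """Re-weight a coarse visual prompt with foundation-model similarities.

    candidate_feats: (N, D) embeddings of N candidate regions (assumed given).
    template_feats:  (M, D) embeddings of the reference template(s).
    prompt_scores:   (N,) coarse prompt values highlighting potential targets.
    Returns refined (N,) scores; alpha balances the tracker's own prompt and
    the foundation-model evidence.
    """
    cand = torch.nn.functional.normalize(candidate_feats, dim=-1)
    temp = torch.nn.functional.normalize(template_feats, dim=-1)
    sim = (cand @ temp.T).max(dim=-1).values          # best template match per candidate
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-6)
    return alpha * prompt_scores + (1 - alpha) * sim

# usage with random embeddings (D = 512, as in common CLIP variants)
refined = refine_prompt(torch.randn(16, 512), torch.randn(2, 512), torch.rand(16))
```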
https://arxiv.org/abs/2409.18901
Feature tracking is crucial for structure from motion (SFM), simultaneous localization and mapping (SLAM), object tracking, and various other computer vision tasks. Event cameras, known for their high temporal resolution and ability to capture asynchronous changes, have gained significant attention for their potential in feature tracking, especially in challenging conditions. However, event cameras lack the fine-grained texture information that conventional cameras provide, leading to error accumulation in tracking. To address this, we propose a novel framework, BlinkTrack, which integrates event data with RGB images for high-frequency feature tracking. Our method extends the traditional Kalman filter into a learning-based framework, utilizing differentiable Kalman filters in both the event and image branches. This approach improves single-modality tracking, resolves ambiguities, and supports asynchronous data fusion. We also introduce new synthetic and augmented datasets to better evaluate our model. Experimental results indicate that BlinkTrack significantly outperforms existing event-based methods, exceeding 100 FPS with preprocessed event data and 80 FPS with multi-modality data.
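For reference, the classical building block that BlinkTrack turns into a learning-based, differentiable module is the constant-velocity Kalman filter; a plain numpy version for a single 2D feature point is sketched below (asynchronous fusion then amounts to calling update whenever either the event or the image branch produces a measurement).

```python
import numpy as np

class KalmanPoint2D:
    """Constant-velocity Kalman filter for one feature point (x, y, vx, vy)."""

    def __init__(self, x, y, dt=1.0, q=1e-2, r=1.0):
        self.s = np.array([x, y, 0.0, 0.0])              # state
        self.P = np.eye(4)                               # state covariance
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # motion model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # we observe position only
        self.Q = q * np.eye(4)                           # process noise
        self.R = r * np.eye(2)                           # measurement noise

    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2]

    def update(self, z):
        """Fuse a measurement z = (x, y) from either the event or image branch."""
        y = np.asarray(z) - self.H @ self.s              # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.s = self.s + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.s[:2]

# usage: predict at event rate, update whenever either modality observes the point
kf = KalmanPoint2D(10.0, 20.0)
kf.predict()
print(kf.update([10.5, 20.3]))
```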
https://arxiv.org/abs/2409.17981
Transformer-based trackers have established a dominant role in the field of visual object tracking. While these trackers exhibit promising performance, their deployment on resource-constrained devices remains challenging due to inefficiencies. To improve the inference efficiency and reduce the computation cost, prior approaches have aimed to either design lightweight trackers or distill knowledge from larger teacher models into more compact student trackers. However, these solutions often sacrifice accuracy for speed. Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce the size of a pre-trained tracking model into a lightweight tracker with minimal performance degradation. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages, enabling the student model to emulate each corresponding teacher stage more effectively. Additionally, we design a unique replacement training technique that involves randomly substituting specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior. To further force the student model to emulate the teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during the teacher model's compression process. Our framework, CompressTracker, is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of CompressTracker. Our CompressTracker-4 with 4 transformer layers, which is compressed from OSTrack, retains about 96% performance on LaSOT (66.1% AUC) while achieving a 2.17x speedup.
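Replacement training is easy to sketch: during each forward pass, every student stage is randomly swapped with the corresponding frozen teacher stage, so the student learns to be interchangeable with the teacher stage by stage. The modules below are placeholders with matching shapes, not OSTrack's layers, and the prediction-guidance and feature-mimicking losses are omitted.

```python
import random
import torch
import torch.nn as nn

class StageSwapModel(nn.Module):
    """Student whose stages can be randomly replaced by frozen teacher stages.

    teacher_stages / student_stages are lists of modules with matching
    input/output shapes per stage (a simplifying assumption for this sketch).
    """

    def __init__(self, teacher_stages, student_stages, p_replace=0.5):
        super().__init__()
        self.teacher = nn.ModuleList(teacher_stages).requires_grad_(False)
        self.student = nn.ModuleList(student_stages)
        self.p_replace = p_replace

    def forward(self, x):
        for t_stage, s_stage in zip(self.teacher, self.student):
            if self.training and random.random() < self.p_replace:
                x = t_stage(x)          # teacher fills in this stage
            else:
                x = s_stage(x)          # student must behave like the teacher here
        return x

# toy usage: 4 teacher stages distilled into 4 (smaller) student stages
teacher = [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(4)]
student = [nn.Linear(64, 64) for _ in range(4)]
model = StageSwapModel(teacher, student).train()
out = model(torch.randn(2, 64))
```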
https://arxiv.org/abs/2409.17564
This paper proposes CAMOT, a simple camera angle estimator for multi-object tracking to tackle two problems: 1) occlusion and 2) inaccurate distance estimation in the depth direction. Under the assumption that multiple objects are located on a flat plane in each video frame, CAMOT estimates the camera angle using object detection. In addition, it gives the depth of each object, enabling pseudo-3D MOT. We evaluated its performance by adding it to various 2D MOT methods on the MOT17 and MOT20 datasets and confirmed its effectiveness. Applying CAMOT to ByteTrack, we obtained 63.8% HOTA, 80.6% MOTA, and 78.5% IDF1 in MOT17, which are state-of-the-art results. Its computational cost is significantly lower than the existing deep-learning-based depth estimators for tracking.
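The flat-plane assumption behind CAMOT can be illustrated with basic pinhole geometry: given camera height, tilt, and intrinsics, the image row of a box bottom determines the distance along the ground, and known distances conversely constrain the tilt. The snippet uses made-up intrinsics and shows only this geometric core, not the paper's detection-based angle estimator.

```python
import numpy as np

def ground_distance(v_bottom, cam_height, pitch_deg, fy, cy):
    """Distance along the ground to an object whose box bottom is at image row
    v_bottom, assuming a pinhole camera at height cam_height (meters) tilted
    down by pitch_deg over a flat ground plane."""
    phi = np.arctan2(v_bottom - cy, fy)          # ray angle below the optical axis
    angle = np.deg2rad(pitch_deg) + phi          # total angle below the horizon
    if angle <= 0:
        return np.inf                            # ray never hits the ground
    return cam_height / np.tan(angle)

def estimate_pitch(v_bottoms, distances, cam_height, fy, cy):
    """Average pitch estimate from objects with (roughly) known distances;
    CAMOT instead infers the angle from detections alone."""
    pitches = []
    for v, d in zip(v_bottoms, distances):
        phi = np.arctan2(v - cy, fy)
        pitches.append(np.arctan2(cam_height, d) - phi)
    return np.rad2deg(np.mean(pitches))

# usage with made-up intrinsics: fy = 1000 px, cy = 540 px, camera 6 m high
print(ground_distance(v_bottom=800.0, cam_height=6.0, pitch_deg=10.0,
                      fy=1000.0, cy=540.0))
```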
https://arxiv.org/abs/2409.17533
The supervision of state-of-the-art multiple object tracking (MOT) methods requires enormous annotation efforts to provide bounding boxes for all frames of all videos, and instance IDs to associate them through time. To this end, we introduce Walker, the first self-supervised tracker that learns from videos with sparse bounding box annotations, and no tracking labels. First, we design a quasi-dense temporal object appearance graph, and propose a novel multi-positive contrastive objective to optimize random walks on the graph and learn instance similarities. Then, we introduce an algorithm to enforce mutually-exclusive connective properties across instances in the graph, optimizing the learned topology for MOT. At inference time, we propose to associate detected instances to tracklets based on the max-likelihood transition state under motion-constrained bi-directional walks. Walker is the first self-supervised tracker to achieve competitive performance on MOT17, DanceTrack, and BDD100K. Remarkably, our proposal outperforms the previous self-supervised trackers even when drastically reducing the annotation requirements by up to 400x.
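A minimal sketch of the random-walk view of association: appearance similarities between detections of consecutive frames are turned into row-stochastic transition probabilities, and a forward-backward (cycle-consistency) check stands in for the motion-constrained bi-directional walks. The quasi-dense graph, multi-positive contrastive loss, and mutual-exclusivity optimization are not reproduced.

```python
import torch

def transition_matrix(emb_t, emb_t1, temperature=0.1):
    """Row-stochastic walk probabilities from frame-t detections to frame-t+1.

    emb_t: (N, D), emb_t1: (M, D) appearance embeddings (assumed given).
    """
    a = torch.nn.functional.normalize(emb_t, dim=-1)
    b = torch.nn.functional.normalize(emb_t1, dim=-1)
    return torch.softmax((a @ b.T) / temperature, dim=-1)

def associate(emb_t, emb_t1, min_prob=0.5):
    """Greedy max-likelihood association: i -> j is accepted only if j also
    walks back to i (a cycle-consistency stand-in for bi-directional walks)."""
    fwd = transition_matrix(emb_t, emb_t1)
    bwd = transition_matrix(emb_t1, emb_t)
    matches = []
    for i in range(fwd.shape[0]):
        j = int(fwd[i].argmax())
        if fwd[i, j] > min_prob and int(bwd[j].argmax()) == i:
            matches.append((i, j))
    return matches

# usage with random embeddings (5 and 6 detections, 128-dimensional)
print(associate(torch.randn(5, 128), torch.randn(6, 128)))
```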
https://arxiv.org/abs/2409.17221
Improved surgical skill is generally associated with improved patient outcomes, although assessment is subjective, labour-intensive, and requires domain-specific expertise. Automated, data-driven metrics can alleviate these difficulties, as demonstrated by existing machine learning instrument tracking models in minimally invasive surgery. However, these models have been tested on limited datasets of laparoscopic surgery, with a focus on isolated tasks and robotic surgery. In this paper, a new public dataset is introduced, focusing on simulated surgery, using the nasal phase of endoscopic pituitary surgery as an exemplar. Simulated surgery allows for a realistic yet repeatable environment, meaning the insights gained from automated assessment can be used by novice surgeons to hone their skills on the simulator before moving to real surgery. PRINTNet (Pituitary Real-time INstrument Tracking Network) has been created as a baseline model for this automated assessment. Consisting of DeepLabV3 for classification and segmentation; StrongSORT for tracking; and the NVIDIA Holoscan SDK for real-time performance, PRINTNet achieved 71.9% Multiple Object Tracking Precision running at 22 Frames Per Second. Using this tracking output, a Multilayer Perceptron achieved 87% accuracy in predicting surgical skill level (novice or expert), with the "ratio of total procedure time to instrument visible time" correlated with higher surgical skill. This therefore demonstrates the feasibility of automated surgical skill assessment in simulated endoscopic pituitary surgery. The new publicly available dataset can be found here: this https URL.
https://arxiv.org/abs/2409.17025
Over the past decade, significant progress has been made in visual object tracking, largely due to the availability of large-scale training datasets. However, existing tracking datasets are primarily focused on open-air scenarios, which greatly limits the development of object tracking in underwater environments. To address this issue, we take a step forward by proposing the first large-scale underwater camouflaged object tracking dataset, namely UW-COT. Based on the proposed dataset, this paper presents an experimental evaluation of several advanced visual object tracking methods and the latest advancements in image and video segmentation. Specifically, we compare the performance of the Segment Anything Model (SAM) and its updated version, SAM 2, in challenging underwater environments. Our findings highlight the improvements in SAM 2 over SAM, demonstrating its enhanced capability to handle the complexities of underwater camouflaged objects. Compared to current advanced visual object tracking methods, the latest video segmentation foundation model SAM 2 also exhibits significant advantages, providing valuable insights into the development of more effective tracking technologies for underwater scenarios. The dataset will be accessible at this https URL.
https://arxiv.org/abs/2409.16902
State-of-the-art (SOTA) visual object tracking methods have significantly enhanced the autonomy of unmanned aerial vehicles (UAVs). However, in low-light conditions, the presence of irregular real noise from the environments severely degrades the performance of these SOTA methods. Moreover, existing SOTA denoising techniques often fail to meet the real-time processing requirements when deployed as plug-and-play denoisers for UAV tracking. To address this challenge, this work proposes a novel conditional generative denoiser (CGDenoiser), which breaks free from the limitations of traditional deterministic paradigms and generates the noise conditioned on the input, subsequently removing it. To better align the input dimensions and accelerate inference, a novel nested residual Transformer conditionalizer is developed. Furthermore, an innovative multi-kernel conditional refiner is designed to pertinently refine the denoised output. Extensive experiments show that CGDenoiser improves the tracking precision of the SOTA tracker by 18.18% on DarkTrack2021 while running 5.8 times faster than the second-best denoiser. Real-world tests with complex challenges also prove the effectiveness and practicality of CGDenoiser. Code, video demo, and supplementary proof for CGDenoiser are now available at this https URL.
https://arxiv.org/abs/2409.16834
Visual object tracking has significantly promoted autonomous applications for unmanned aerial vehicles (UAVs). However, learning robust object representations for UAV tracking is especially challenging in complex dynamic environments, when confronted with aspect ratio change and occlusion. These challenges severely alter the original information of the object. To handle the above issues, this work proposes a novel progressive representation learning framework for UAV tracking, i.e., PRL-Track. Specifically, PRL-Track is divided into coarse representation learning and fine representation learning. For coarse representation learning, two innovative regulators, which rely on appearance and semantic information, are designed to mitigate appearance interference and capture semantic information. Furthermore, for fine representation learning, a new hierarchical modeling generator is developed to intertwine coarse object representations. Exhaustive experiments demonstrate that the proposed PRL-Track delivers exceptional performance on three authoritative UAV tracking benchmarks. Real-world tests indicate that the proposed PRL-Track realizes superior tracking performance with 42.6 frames per second on the typical UAV platform equipped with an edge smart camera. The code, model, and demo videos are available at this https URL.
https://arxiv.org/abs/2409.16652
Visual object tracking has boosted extensive intelligent applications for unmanned aerial vehicles (UAVs). However, the state-of-the-art (SOTA) enhancers for nighttime UAV tracking always neglect the uneven light distribution in low-light images, inevitably leading to excessive enhancement in scenarios with complex illumination. To address these issues, this work proposes a novel enhancer, i.e., LDEnhancer, enhancing nighttime UAV tracking with light distribution suppression. Specifically, a novel image content refinement module is developed to decompose the light distribution information and image content information in the feature space, allowing for the targeted enhancement of the image content information. Then this work designs a new light distribution generation module to capture light distribution effectively. The features with light distribution information and image content information are fed into the different parameter estimation modules, respectively, for the parameter map prediction. Finally, leveraging two parameter maps, an innovative interweave iteration adjustment is proposed for the collaborative pixel-wise adjustment of low-light images. Additionally, a challenging nighttime UAV tracking dataset with uneven light distribution, namely NAT2024-2, is constructed to provide a comprehensive evaluation, which contains 40 challenging sequences with over 74K frames in total. Experimental results on the authoritative UAV benchmarks and the proposed NAT2024-2 demonstrate that LDEnhancer outperforms other SOTA low-light enhancers for nighttime UAV tracking. Furthermore, real-world tests on a typical UAV platform with an NVIDIA Orin NX confirm the practicality and efficiency of LDEnhancer. The code is available at this https URL.
https://arxiv.org/abs/2409.16631
Nowadays, unmanned aerial vehicles (UAVs) are commonly used in search and rescue scenarios to gather information in the search area. The automatic identification of the person searched for in aerial footage could increase the autonomy of such systems, reduce the search time, and thus increase the missing person's chances of survival. In this paper, we present a novel approach to perform semantically conditioned open vocabulary object tracking that is specifically designed to cope with the limitations of UAV hardware. Our approach has several advantages. It can run with verbal descriptions of the missing person, e.g., the color of the shirt, it does not require dedicated training to execute the mission, and it can efficiently track a potentially moving person. Our experimental results demonstrate the versatility and efficacy of our approach.
https://arxiv.org/abs/2409.16111
This paper introduces MCTrack, a new 3D multi-object tracking method that achieves state-of-the-art (SOTA) performance across the KITTI, nuScenes, and Waymo datasets. Addressing the gap in existing tracking paradigms, which often perform well on specific datasets but lack generalizability, MCTrack offers a unified solution. Additionally, we have standardized the format of perceptual results across various datasets, termed BaseVersion, facilitating researchers in the field of multi-object tracking (MOT) to concentrate on core algorithmic development without the undue burden of data preprocessing. Finally, recognizing the limitations of current evaluation metrics, we propose a novel set of metrics that assesses motion information output, such as velocity and acceleration, which is crucial for downstream tasks. The source code of the proposed method is available at this https URL.
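To illustrate the kind of motion output the proposed metrics target, velocity and acceleration can be recovered from tracked positions by finite differences; the toy mean-velocity-error below is an assumption for illustration, not the metric set defined in the paper.

```python
import numpy as np

def motion_from_track(positions, timestamps):
    """Finite-difference velocity and acceleration for one track.

    positions: (T, 2) or (T, 3) world coordinates, timestamps: (T,) seconds.
    Returns (velocity, acceleration) aligned with positions[1:] and positions[2:].
    """
    positions = np.asarray(positions, dtype=float)
    t = np.asarray(timestamps, dtype=float)
    dt = np.diff(t)[:, None]
    velocity = np.diff(positions, axis=0) / dt
    acceleration = np.diff(velocity, axis=0) / dt[1:]
    return velocity, acceleration

def mean_velocity_error(pred_tracks, gt_tracks, timestamps):
    """Toy motion metric: mean L2 error of per-step velocities over matched
    tracks (an illustration of evaluating motion output, not the paper's metric)."""
    errors = []
    for pred, gt, ts in zip(pred_tracks, gt_tracks, timestamps):
        v_pred, _ = motion_from_track(pred, ts)
        v_gt, _ = motion_from_track(gt, ts)
        errors.append(np.linalg.norm(v_pred - v_gt, axis=-1).mean())
    return float(np.mean(errors))

# usage: a 1 m/s track along x sampled at 10 Hz
ts = np.arange(0.0, 1.0, 0.1)
track = np.stack([ts, np.zeros_like(ts)], axis=1)
v, a = motion_from_track(track, ts)
print(v[0], a[0])   # ~[1., 0.] and ~[0., 0.]
```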
https://arxiv.org/abs/2409.16149
Accurately detecting and tracking high-speed, small objects, such as balls in sports videos, is challenging due to factors like motion blur and occlusion. Although recent deep learning frameworks like TrackNetV1, V2, and V3 have advanced tennis ball and shuttlecock tracking, they often struggle in scenarios with partial occlusion or low visibility. This is primarily because these models rely heavily on visual features without explicitly incorporating motion information, which is crucial for precise tracking and trajectory prediction. In this paper, we introduce an enhancement to the TrackNet family by fusing high-level visual features with learnable motion attention maps through a motion-aware fusion mechanism, effectively emphasizing the moving ball's location and improving tracking performance. Our approach leverages frame differencing maps, modulated by a motion prompt layer, to highlight key motion regions over time. Experimental results on the tennis ball and shuttlecock datasets show that our method enhances the tracking performance of both TrackNetV2 and V3. We refer to our lightweight, plug-and-play solution, built on top of the existing TrackNet, as TrackNetV4.
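The motion-aware fusion idea can be sketched with plain frame differencing: difference maps between consecutive frames are squashed into attention maps that re-weight the visual features where motion is strong. The gain and offset below are placeholders for the learnable motion prompt layer, and the TrackNet backbone itself is not reproduced.

```python
import torch

def motion_attention(frames, gain=10.0, offset=3.0):
    """Motion attention maps from consecutive grayscale frames in [0, 1].

    frames: (T, H, W). Returns (T-1, H, W) maps in (0, 1) that highlight changed
    regions; gain/offset stand in for the learnable motion prompt layer.
    """
    diff = (frames[1:] - frames[:-1]).abs()
    return torch.sigmoid(gain * diff - offset)   # suppress small differences

def fuse(visual_feats, attention):
    """Motion-aware fusion: emphasize visual features where motion is high.

    visual_feats: (T-1, C, H, W) features for the later frame of each pair.
    """
    return visual_feats * (1.0 + attention.unsqueeze(1))

# usage with random tensors, illustrative shapes only
frames = torch.rand(4, 288, 512)
feats = torch.rand(3, 32, 288, 512)
print(fuse(feats, motion_attention(frames)).shape)   # torch.Size([3, 32, 288, 512])
```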
https://arxiv.org/abs/2409.14543