As drone technology advances, using unmanned aerial vehicles for aerial surveys has become the dominant trend in modern low-altitude remote sensing. The surge in aerial video data necessitates accurate prediction of future scenes and the motion states of the target of interest, particularly in applications like traffic management and disaster response. Existing video prediction methods focus solely on predicting future scenes (video frames) and neglect to explicitly model the target's motion states, which are crucial for aerial video interpretation. To address this issue, we introduce a novel task called Target-Aware Aerial Video Prediction, aiming to simultaneously predict future scenes and motion states of the target. Further, we design a model specifically for this task, named TAFormer, which provides a unified modeling approach for both video and target motion states. Specifically, we introduce Spatiotemporal Attention (STA), which decouples the learning of video dynamics into spatial static attention and temporal dynamic attention, effectively modeling the scene appearance and motion. Additionally, we design an Information Sharing Mechanism (ISM), which elegantly unifies the modeling of video and target motion by facilitating information interaction through two sets of messenger tokens. Moreover, to alleviate the difficulty of distinguishing targets in blurry predictions, we introduce Target-Sensitive Gaussian Loss (TSGL), enhancing the model's sensitivity to both the target's position and content. Extensive experiments on UAV123VP and VisDroneVP (derived from single-object tracking datasets) demonstrate the exceptional performance of TAFormer in target-aware video prediction, showcasing its adaptability to the additional requirements of aerial video interpretation for target awareness.
https://arxiv.org/abs/2403.18238
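To make the idea behind a target-sensitive Gaussian loss concrete, here is a minimal sketch: the per-pixel reconstruction error is re-weighted by a Gaussian bump centred on the target box, so errors near the target count more than background errors. The function names, the sigma-to-box-size ratio, and the additive weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def gaussian_weight_map(h, w, box, sigma_scale=0.5, device="cpu"):
    """Build an H x W weight map with a Gaussian bump centred on the target box.

    box = (cx, cy, bw, bh) in pixel coordinates; sigma is tied to the box size.
    """
    cx, cy, bw, bh = box
    ys = torch.arange(h, device=device).view(h, 1).float()
    xs = torch.arange(w, device=device).view(1, w).float()
    sx, sy = sigma_scale * bw, sigma_scale * bh
    g = torch.exp(-(((xs - cx) ** 2) / (2 * sx ** 2) + ((ys - cy) ** 2) / (2 * sy ** 2)))
    return 1.0 + g  # background keeps weight 1, the target region is up-weighted


def target_sensitive_loss(pred, gt, box, sigma_scale=0.5):
    """Gaussian-weighted MSE between predicted and ground-truth frames (C x H x W)."""
    _, h, w = pred.shape
    weight = gaussian_weight_map(h, w, box, sigma_scale, pred.device)
    return ((pred - gt) ** 2 * weight).mean()


# toy usage: a 3-channel 64x64 frame with a target centred at (32, 20), size 10x16
pred = torch.rand(3, 64, 64)
gt = torch.rand(3, 64, 64)
loss = target_sensitive_loss(pred, gt, box=(32.0, 20.0, 10.0, 16.0))
print(loss.item())
```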
RGB-T tracking, a vital downstream task of object tracking, has made remarkable progress in recent years. Yet, it remains hindered by two major challenges: 1) the trade-off between performance and efficiency; 2) the scarcity of training data. To address the latter challenge, some recent methods employ prompts to fine-tune pre-trained RGB tracking models and leverage upstream knowledge in a parameter-efficient manner. However, these methods inadequately explore modality-independent patterns and disregard the dynamic reliability of different modalities in open scenarios. We propose M3PT, a novel RGB-T prompt tracking method that leverages middle fusion and multi-modal and multi-stage visual prompts to overcome these challenges. We pioneer the use of the middle fusion framework for RGB-T tracking, which achieves a balance between performance and efficiency. Furthermore, we incorporate the pre-trained RGB tracking model into the framework and utilize multiple flexible prompt strategies to adapt the pre-trained model to the comprehensive exploration of uni-modal patterns and the improved modeling of fusion-modal features, harnessing the potential of prompt learning in RGB-T tracking. Our method outperforms the state-of-the-art methods on four challenging benchmarks, while attaining 46.1 fps inference speed.
https://arxiv.org/abs/2403.18193
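As a rough illustration of what a middle-fusion layout can look like, the toy module below encodes each modality separately through the early blocks and only concatenates RGB tokens, thermal tokens, and a few learnable prompt tokens before the later blocks. The shared early blocks, the fusion depth, and the prompt count are assumptions for the sketch, not M3PT's actual design.

```python
import torch
import torch.nn as nn

class MiddleFusionTracker(nn.Module):
    """Toy middle-fusion layout: uni-modal encoding up to `fuse_at`, joint encoding after."""

    def __init__(self, dim=256, depth=8, fuse_at=4, nhead=8):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.early = nn.ModuleList([layer() for _ in range(fuse_at)])          # run per modality
        self.late = nn.ModuleList([layer() for _ in range(depth - fuse_at)])   # run on fused tokens
        self.fuse_prompt = nn.Parameter(torch.zeros(1, 4, dim))                # learnable fusion prompts

    def forward(self, rgb_tokens, tir_tokens):
        for blk in self.early:                  # uni-modal stages share weights in this sketch;
            rgb_tokens = blk(rgb_tokens)        # a real design might keep separate branches
            tir_tokens = blk(tir_tokens)
        prompts = self.fuse_prompt.expand(rgb_tokens.size(0), -1, -1)
        x = torch.cat([prompts, rgb_tokens, tir_tokens], dim=1)   # fusion happens mid-network
        for blk in self.late:
            x = blk(x)
        return x

model = MiddleFusionTracker()
out = model(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
print(out.shape)  # (2, 4 + 196 + 196, 256)
```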
The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In contrast, natural language processing benefits from a unified output space, i.e., text sequences, which simplifies the training of powerful foundational language models, such as GPT-3, with extensive training corpora. Inspired by this, we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. In this way, a variety of video tasks could be formulated as video-grounded token generation. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization tasks (such as visual object tracking) within a fully shared encoder-decoder architecture, following a generative framework. Through comprehensive experiments, we demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results on seven video benchmarks, providing a novel perspective for more universal video understanding. Code is available at this https URL.
https://arxiv.org/abs/2403.17935
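The key mechanical step, turning boxes and timestamps into discrete tokens so that every task shares one output space, can be sketched with simple uniform quantization in the spirit of Pix2Seq-style coordinate binning. The bin counts and token spellings below are made up for illustration, not the paper's vocabulary.

```python
def quantize(value, max_value, num_bins=1000):
    """Map a continuous value in [0, max_value] to an integer bin in [0, num_bins - 1]."""
    value = min(max(value, 0.0), max_value)
    return int(round(value / max_value * (num_bins - 1)))

def box_to_tokens(box, img_w, img_h, num_bins=1000):
    """(x1, y1, x2, y2) in pixels -> four discrete box tokens."""
    x1, y1, x2, y2 = box
    return [f"<box_{quantize(v, m, num_bins)}>"
            for v, m in ((x1, img_w), (y1, img_h), (x2, img_w), (y2, img_h))]

def time_to_tokens(t_start, t_end, duration, num_bins=100):
    """Start/end timestamps in seconds -> two discrete time tokens."""
    return [f"<time_{quantize(t, duration, num_bins)}>" for t in (t_start, t_end)]

# toy usage: a dense-captioning-style target sequence for one event in a 30 s clip
caption = "a person enters the room".split()
print(time_to_tokens(2.4, 7.9, 30.0) + caption)

# toy usage: a tracking-style target sequence for one frame of a 1280x720 video
print(box_to_tokens((400, 180, 520, 340), 1280, 720))
```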
The speed-precision trade-off is a critical problem for visual object tracking which usually requires low latency and deployment on constrained resources. Existing solutions for efficient tracking mainly focus on adopting light-weight backbones or modules, which nevertheless come at the cost of a sacrifice in precision. In this paper, inspired by dynamic network routing, we propose DyTrack, a dynamic transformer framework for efficient tracking. Real-world tracking scenarios exhibit diverse levels of complexity. We argue that a simple network is sufficient for easy frames in video sequences, while more computation could be assigned to difficult ones. DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget. Thus, it can achieve higher performance with the same running speed. We formulate instance-specific tracking as a sequential decision problem and attach terminating branches to intermediate layers of the entire model. Especially, to fully utilize the computations, we introduce the feature recycling mechanism to reuse the outputs of predecessors. Furthermore, a target-aware self-distillation strategy is designed to enhance the discriminating capabilities of early predictions by effectively mimicking the representation pattern of the deep model. Extensive experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model. For instance, DyTrack obtains 64.9% AUC on LaSOT with a speed of 256 fps.
https://arxiv.org/abs/2403.17651
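A minimal sketch of instance-specific early exiting: every backbone block is followed by a tiny head, and inference stops at the first block whose confidence clears a threshold, so easy frames consume fewer blocks while hard frames run the full depth. The pooling, the 5-way head, and the halting rule are placeholders rather than DyTrack's actual terminating branches.

```python
import torch
import torch.nn as nn

class EarlyExitTracker(nn.Module):
    """Backbone blocks with a lightweight exit head after each one."""

    def __init__(self, dim=256, depth=6, nhead=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead, batch_first=True) for _ in range(depth)]
        )
        # each exit head predicts a box (4 values) and a confidence score (1 value)
        self.exit_heads = nn.ModuleList([nn.Linear(dim, 5) for _ in range(depth)])

    @torch.no_grad()
    def forward(self, tokens, conf_thresh=0.8):
        for i, (blk, head) in enumerate(zip(self.blocks, self.exit_heads)):
            tokens = blk(tokens)                    # features computed here are reused downstream
            out = head(tokens.mean(dim=1))          # pool tokens, predict (box, confidence)
            box, conf = out[:, :4], torch.sigmoid(out[:, 4])
            if conf.min() > conf_thresh:            # easy frame: stop early
                return box, i + 1
        return box, len(self.blocks)                # hard frame: ran the full depth

model = EarlyExitTracker().eval()
box, used = model(torch.randn(1, 196, 256))
print(box.shape, "blocks used:", used)
```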
In Multiple Object Tracking (MOT), tracking-by-detection methods have stood the test of time; by definition, they split the process into two parts: object detection and association. They leverage robust single-frame detectors and treat object association as a post-processing step through hand-crafted heuristic algorithms and surrogate tasks. However, the nature of heuristic techniques prevents end-to-end exploitation of training data, leading to increasingly cumbersome and challenging manual modification when facing complicated or novel scenarios. In this paper, we regard this object association task as an end-to-end in-context ID prediction problem and propose a streamlined baseline called MOTIP. Specifically, we form the target embeddings into historical trajectory information while considering the corresponding IDs as in-context prompts, then directly predict the ID labels for the objects in the current frame. Thanks to this end-to-end process, MOTIP can learn tracking capabilities straight from training data, freeing itself from burdensome hand-crafted algorithms. Without bells and whistles, our method achieves impressive state-of-the-art performance in complex scenarios like DanceTrack and SportsMOT, and it performs competitively with other transformer-based methods on MOT17. We believe that MOTIP demonstrates remarkable potential and can serve as a starting point for future research. The code is available at this https URL.
https://arxiv.org/abs/2403.16848
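A toy version of in-context ID prediction: historical track features are tagged with ID embeddings to form the context, current detections attend to that context, and a classifier outputs logits over the ID vocabulary (plus a "new object" label). This is a deliberately simplified stand-in for MOTIP's decoder; dimensions and the single attention layer are assumptions.

```python
import torch
import torch.nn as nn

class InContextIDHead(nn.Module):
    """Predict an ID label for each current detection from (track feature, ID prompt) pairs."""

    def __init__(self, dim=256, num_ids=50, nhead=8):
        super().__init__()
        self.id_embed = nn.Embedding(num_ids + 1, dim)   # +1 for the "new object" label
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.classifier = nn.Linear(dim, num_ids + 1)

    def forward(self, det_feats, track_feats, track_ids):
        # context tokens: historical trajectory features tagged with their ID prompts
        context = track_feats + self.id_embed(track_ids)
        fused, _ = self.attn(det_feats, context, context)     # detections query the context
        return self.classifier(fused)                          # logits over the ID vocabulary

head = InContextIDHead()
det_feats = torch.randn(1, 7, 256)            # 7 detections in the current frame
track_feats = torch.randn(1, 30, 256)         # 30 historical track tokens
track_ids = torch.randint(0, 50, (1, 30))     # their ID labels, used as in-context prompts
logits = head(det_feats, track_feats, track_ids)
print(logits.shape)  # (1, 7, 51)
```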
Due to the complementary nature of visible light and thermal infrared modalities, object tracking based on the fusion of visible light images and thermal images (referred to as RGB-T tracking) has received increasing attention from researchers in recent years. How to achieve more comprehensive fusion of information from the two modalities at a lower cost has been an issue that researchers have been exploring. Inspired by visual prompt learning, we designed a novel two-stream RGB-T tracking architecture based on cross-modal mutual prompt learning, and used this model as a teacher to guide a one-stream student model for rapid learning through knowledge distillation techniques. Extensive experiments have shown that, compared to similar RGB-T trackers, our designed teacher model achieved the highest precision rate, while the student model, with a precision rate comparable to the teacher model, realized an inference speed more than three times faster than the teacher model. (Codes will be available if accepted.)
https://arxiv.org/abs/2403.16834
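The teacher-to-student transfer can be sketched with a standard response-based distillation loss: a temperature-softened KL term from the two-stream teacher combined with the usual hard-label task loss for the one-stream student. The temperature and the weighting are generic choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Soft-label KL term (teacher -> student) combined with the hard-label task loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

# toy usage: a 10-way classification head shared by teacher and student
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, targets).item())
```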
Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. Firstly, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Secondly, processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset paired with novel tasks: Referring Single Object Tracking (RSOT) and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million annotated video frames with corresponding object boxes and descriptions. Leveraging this dataset, we conduct training of MLLMs and propose a token-compression model T-Selector to tackle the second challenge. Our proposed approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM that makes the first attempt to conduct object-level tasks in videos without requiring any additional plug-in or expert models.
https://arxiv.org/abs/2403.16558
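One simple way to picture token compression before the LLM is a learned scorer that keeps only the top-k visual tokens of each frame, as in the hedged sketch below; the linear scoring head and the value of k are assumptions for illustration, not T-Selector's actual mechanism.

```python
import torch
import torch.nn as nn

class TopKTokenSelector(nn.Module):
    """Keep only the k highest-scoring visual tokens of each frame."""

    def __init__(self, dim=1024, k=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.k = k

    def forward(self, frame_tokens):
        # frame_tokens: (batch, frames, tokens_per_frame, dim)
        scores = self.score(frame_tokens).squeeze(-1)                  # (B, F, N)
        idx = scores.topk(self.k, dim=-1).indices                      # (B, F, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, frame_tokens.size(-1))
        return torch.gather(frame_tokens, 2, idx)                      # (B, F, k, dim)

selector = TopKTokenSelector()
video_tokens = torch.randn(1, 32, 256, 1024)      # 32 frames x 256 tokens before compression
compressed = selector(video_tokens)
print(compressed.shape)                            # (1, 32, 16, 1024) -> 16x fewer tokens
```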
As a neuromorphic sensor with high temporal resolution, spike cameras offer notable advantages over traditional cameras in high-speed vision applications such as high-speed optical estimation, depth estimation, and object tracking. Inspired by the success of the spike camera, we propose Spike-NeRF, the first Neural Radiance Field derived from spike data, to achieve 3D reconstruction and novel viewpoint synthesis of high-speed scenes. Instead of the simultaneously captured multi-view images that NeRF relies on, the inputs of Spike-NeRF are continuous spike streams captured by a moving spike camera in a very short time. To reconstruct a correct and stable 3D scene from high-frequency but unstable spike data, we devise spike masks along with a distinctive loss function. We evaluate our method qualitatively and numerically on several challenging synthetic scenes generated by Blender with the spike camera simulator. Our results demonstrate that Spike-NeRF produces more visually appealing results than the existing methods and the baseline we proposed in high-speed scenes. Our code and data will be released soon.
https://arxiv.org/abs/2403.16410
Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at this https URL.
https://arxiv.org/abs/2403.16002
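The "lightweight adaptation" ingredient is typically a bottleneck adapter trained on top of a frozen backbone block's output; the sketch below shows that generic pattern (down-project, non-linearity, up-project, residual) rather than SDSTrack's specific symmetric design, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)        # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# attach an adapter to a frozen RGB backbone block and train only the adapter parameters
frozen_block = nn.TransformerEncoderLayer(768, 12, batch_first=True)
for p in frozen_block.parameters():
    p.requires_grad_(False)

adapter = Adapter()
tokens = torch.randn(2, 196, 768)              # e.g. depth/thermal/event tokens
out = adapter(frozen_block(tokens))            # frozen computation + trainable adaptation
trainable = sum(p.numel() for p in adapter.parameters())
print(out.shape, f"trainable params: {trainable}")
```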
3D single object tracking within LIDAR point clouds is a pivotal task in computer vision, with profound implications for autonomous driving and robotics. However, existing methods, which depend solely on appearance matching via Siamese networks or utilize motion information from successive frames, encounter significant challenges. Issues such as similar objects nearby or occlusions can result in tracker drift. To mitigate these challenges, we design an innovative spatio-temporal bi-directional cross-frame distractor filtering tracker, named STMD-Tracker. Our first step involves the creation of a 4D multi-frame spatio-temporal graph convolution backbone. This design separates KNN graph spatial embedding and incorporates 1D temporal convolution, effectively capturing temporal fluctuations and spatio-temporal information. Subsequently, we devise a novel bi-directional cross-frame memory procedure. This integrates future and synthetic past frame memory to enhance the current memory, thereby improving the accuracy of iteration-based tracking. This iterative memory update mechanism allows our tracker to dynamically compensate for information in the current frame, effectively reducing tracker drift. Lastly, we construct spatially reliable Gaussian masks on the fused features to eliminate distractor points. This is further supplemented by an object-aware sampling strategy, which bolsters the efficiency and precision of object localization, thereby reducing tracking errors caused by distractors. Our extensive experiments on KITTI, NuScenes and Waymo datasets demonstrate that our approach significantly surpasses the current state-of-the-art methods.
https://arxiv.org/abs/2403.15831
Multiple object tracking is a critical task in autonomous driving. Existing works primarily focus on the heuristic design of neural networks to obtain high accuracy. As tracking accuracy improves, however, neural networks become increasingly complex, and their high latency poses challenges for practical application in real driving scenarios. In this paper, we explore the use of neural architecture search (NAS) methods to search for efficient architectures for tracking, aiming for low real-time latency while maintaining relatively high accuracy. Another challenge for object tracking is the unreliability of a single sensor; therefore, we propose a multi-modal framework to improve the robustness. Experiments demonstrate that our algorithm can run on edge devices within low latency constraints, thus greatly reducing the computational requirements for multi-modal object tracking while keeping latency low.
https://arxiv.org/abs/2403.15712
Accurate detection and tracking of surrounding objects is essential to enable self-driving vehicles. While Light Detection and Ranging (LiDAR) sensors have set the benchmark for high performance, the appeal of camera-only solutions lies in their cost-effectiveness. Notably, despite the prevalent use of Radio Detection and Ranging (RADAR) sensors in automotive systems, their potential in 3D detection and tracking has been largely disregarded due to data sparsity and measurement noise. As a recent development, the combination of RADARs and cameras is emerging as a promising solution. This paper presents Camera-RADAR 3D Detection and Tracking (CR3DT), a camera-RADAR fusion model for 3D object detection, and Multi-Object Tracking (MOT). Building upon the foundations of the State-of-the-Art (SotA) camera-only BEVDet architecture, CR3DT demonstrates substantial improvements in both detection and tracking capabilities, by incorporating the spatial and velocity information of the RADAR sensor. Experimental results demonstrate an absolute improvement in detection performance of 5.3% in mean Average Precision (mAP) and a 14.9% increase in Average Multi-Object Tracking Accuracy (AMOTA) on the nuScenes dataset when leveraging both modalities. CR3DT bridges the gap between high-performance and cost-effective perception systems in autonomous driving, by capitalizing on the ubiquitous presence of RADAR in automotive applications.
https://arxiv.org/abs/2403.15313
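A rough sketch of how sparse radar returns can be folded into a camera BEV pipeline: rasterize the points into a small BEV tensor carrying occupancy, velocity, and RCS channels, then concatenate it channel-wise with the camera BEV features. The grid size, ranges, and channel layout are illustrative assumptions, not CR3DT's exact configuration.

```python
import numpy as np

def radar_to_bev(points, grid=(128, 128), x_range=(-51.2, 51.2), y_range=(-51.2, 51.2)):
    """Rasterize radar returns (x, y, vx, vy, rcs) into a (5, H, W) BEV tensor.

    Channels: occupancy, mean vx, mean vy, mean RCS, point count.
    """
    h, w = grid
    bev = np.zeros((5, h, w), dtype=np.float32)
    xs = ((points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * w).astype(int)
    ys = ((points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * h).astype(int)
    valid = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    for x, y, (vx, vy, rcs) in zip(xs[valid], ys[valid], points[valid, 2:5]):
        bev[0, y, x] = 1.0
        bev[1:4, y, x] += (vx, vy, rcs)
        bev[4, y, x] += 1.0
    occupied = bev[4] > 0
    bev[1:4, occupied] /= bev[4, occupied]     # average velocity/RCS per occupied cell
    return bev

radar = np.random.randn(300, 5).astype(np.float32) * 20          # toy sweep of 300 returns
radar_bev = radar_to_bev(radar)
camera_bev = np.random.randn(80, 128, 128).astype(np.float32)     # stand-in for camera BEV features
fused = np.concatenate([camera_bev, radar_bev], axis=0)           # channel-wise fusion
print(fused.shape)  # (85, 128, 128)
```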
Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of an effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, while the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computations and fusion. Our experiment results on various datasets show that STATM can significantly enhance the object-centric learning capabilities of slot-based video models.
https://arxiv.org/abs/2403.15245
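A toy rendering of the memory-buffer idea: past slots are stored in a fixed-length buffer, and the current slots attend over that buffer for temporal reasoning and over themselves for spatial reasoning. The buffer length and the two-attention layout are assumptions for the sketch, not STATM's exact module.

```python
import torch
import torch.nn as nn
from collections import deque

class SlotMemoryReasoner(nn.Module):
    """Toy STATM-style module: a memory buffer of past slots plus attention over it."""

    def __init__(self, slot_dim=64, nhead=4, buffer_len=6):
        super().__init__()
        self.memory = deque(maxlen=buffer_len)                 # stores slots from earlier frames
        self.time_attn = nn.MultiheadAttention(slot_dim, nhead, batch_first=True)
        self.space_attn = nn.MultiheadAttention(slot_dim, nhead, batch_first=True)

    def forward(self, slots):
        # slots: (batch, num_slots, slot_dim) from the upstream slot-based video model
        if self.memory:
            past = torch.cat(list(self.memory), dim=1)          # (B, T * num_slots, D)
            slots, _ = self.time_attn(slots, past, past)        # temporal reasoning over memory
        slots, _ = self.space_attn(slots, slots, slots)         # spatial reasoning within the frame
        self.memory.append(slots.detach())
        return slots

reasoner = SlotMemoryReasoner()
for _ in range(4):                                              # feed 4 consecutive frames
    slots = reasoner(torch.randn(2, 7, 64))
print(slots.shape)  # (2, 7, 64)
```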
3D Multi-Object Tracking (MOT) captures stable and comprehensive motion states of surrounding obstacles, essential for robotic perception. However, current 3D trackers face issues with accuracy and latency consistency. In this paper, we propose Fast-Poly, a fast and effective filter-based method for 3D MOT. Building upon our previous work Poly-MOT, Fast-Poly addresses object rotational anisotropy in 3D space, enhances local computation densification, and leverages parallelization technique, improving inference speed and precision. Fast-Poly is extensively tested on two large-scale tracking benchmarks with Python implementation. On the nuScenes dataset, Fast-Poly achieves new state-of-the-art performance with 75.8% AMOTA among all methods and can run at 34.2 FPS on a personal CPU. On the Waymo dataset, Fast-Poly exhibits competitive accuracy with 63.6% MOTA and impressive inference speed (35.5 FPS). The source code is publicly available at this https URL.
https://arxiv.org/abs/2403.13443
Taking advantage of multi-view aggregation presents a promising solution to tackle challenges such as occlusion and missed detection in multi-object tracking and detection. Recent advancements in multi-view detection and 3D object recognition have significantly improved performance by strategically projecting all views onto the ground plane and conducting detection analysis from a Bird's Eye View. In this paper, we compare modern lifting methods, both parameter-free and parameterized, to multi-view aggregation. Additionally, we present an architecture that aggregates the features of multiple time steps to learn robust detection and combines appearance- and motion-based cues for tracking. Most current tracking approaches either focus on pedestrians or vehicles. In our work, we combine both branches and add new challenges to multi-view detection with cross-scene setups. Our method generalizes to three public datasets across two domains: (1) pedestrian: Wildtrack and MultiviewX, and (2) roadside perception: Synthehicle, achieving state-of-the-art performance in detection and tracking. this https URL
https://arxiv.org/abs/2403.12573
The complex dynamicity of open-world objects presents non-negligible challenges for multi-object tracking (MOT), often manifested as severe deformations, fast motion, and occlusions. Most methods that solely depend on coarse-grained object cues, such as boxes and the overall appearance of the object, are susceptible to degradation due to distorted internal relationships of dynamic objects. To address this problem, this work proposes NetTrack, an efficient, generic, and affordable tracking framework to introduce fine-grained learning that is robust to dynamicity. Specifically, NetTrack constructs a dynamicity-aware association with a fine-grained Net, leveraging point-level visual cues. Correspondingly, a fine-grained sampler and matching method have been incorporated. Furthermore, NetTrack learns object-text correspondence for fine-grained localization. To evaluate MOT in extremely dynamic open-world scenarios, a bird flock tracking (BFT) dataset is constructed, which exhibits high dynamicity with diverse species and open-world scenarios. Comprehensive evaluation on BFT validates the effectiveness of fine-grained learning on object dynamicity, and thorough transfer experiments on challenging open-world benchmarks, i.e., TAO, TAO-OW, AnimalTrack, and GMOT-40, validate the strong generalization ability of NetTrack even without finetuning. Project page: this https URL.
https://arxiv.org/abs/2403.11186
In this paper, we address the challenge of multi-object tracking (MOT) in moving Unmanned Aerial Vehicle (UAV) scenarios, where irregular flight trajectories, such as hovering, turning left/right, and moving up/down, lead to significantly greater complexity compared to fixed-camera MOT. Specifically, changes in the scene background not only render traditional frame-to-frame object IOU association methods ineffective but also introduce significant view shifts in the objects, which complicates tracking. To overcome these issues, we propose a novel universal HomView-MOT framework, which, for the first time, harnesses the view Homography inherent in changing scenes to solve MOT challenges in moving environments, incorporating Homographic Matching and View-Centric concepts. We introduce a Fast Homography Estimation (FHE) algorithm for rapid computation of Homography matrices between video frames, enabling object View-Centric ID Learning (VCIL) and leveraging multi-view Homography to learn cross-view ID features. Concurrently, our Homographic Matching Filter (HMF) maps object bounding boxes from different frames onto a common view plane for a more realistic physical IOU association. Extensive experiments have proven that these innovations allow HomView-MOT to achieve state-of-the-art performance on prominent UAV MOT datasets VisDrone and UAVDT.
https://arxiv.org/abs/2403.10830
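The Homographic Matching step can be illustrated by warping a previous-frame box with the estimated homography before computing IOU, which restores the overlap that camera motion would otherwise destroy. The homography below is a made-up pure translation for the toy example; a real estimate would come from something like an FHE-style frame-to-frame fit.

```python
import numpy as np

def warp_box(box, H):
    """Warp an axis-aligned box (x1, y1, x2, y2) with homography H and re-fit a box."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1, 1], [x2, y1, 1], [x2, y2, 1], [x1, y2, 1]], dtype=float).T
    warped = H @ corners
    warped = warped[:2] / warped[2]                      # perspective divide
    return (warped[0].min(), warped[1].min(), warped[0].max(), warped[1].max())

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

# toy homography: the camera drifted right by 30 px and down by 10 px between frames
H = np.array([[1.0, 0.0, 30.0],
              [0.0, 1.0, 10.0],
              [0.0, 0.0, 1.0]])
prev_box = (100, 100, 160, 180)
curr_box = (128, 108, 190, 190)
print("raw IOU:   ", round(iou(prev_box, curr_box), 3))            # low: association would fail
print("warped IOU:", round(iou(warp_box(prev_box, H), curr_box), 3))  # high after compensation
```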
In the field of multi-object tracking (MOT), traditional methods often rely on the Kalman Filter for motion prediction, leveraging its strengths in linear motion scenarios. However, the inherent limitations of these methods become evident when confronted with the complex, nonlinear motions and occlusions prevalent in dynamic environments like sports and dance. This paper explores the possibility of replacing the Kalman Filter with various learning-based motion models that effectively enhance tracking accuracy and adaptability beyond the constraints of Kalman Filter-based systems. We propose MambaTrack, an online motion-based tracker that outperforms all existing motion-based trackers on the challenging DanceTrack and SportsMOT datasets. Moreover, we further exploit the potential of the state-space model in trajectory feature extraction to boost tracking performance and propose MambaTrack+, which achieves state-of-the-art performance on the DanceTrack dataset with 56.1 HOTA and 54.9 IDF1.
https://arxiv.org/abs/2403.10826
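To see what "replacing the Kalman Filter with a learned motion model" means in code, the sketch below regresses the next box from the last k observed boxes with a small MLP and contrasts it with a constant-velocity prediction of the kind a Kalman-style tracker would make. The MLP is a stand-in for the paper's state-space model; k and the hidden size are assumptions.

```python
import torch
import torch.nn as nn

class LearnedMotionModel(nn.Module):
    """Predict the next box (cx, cy, w, h) from the last k observed boxes of a track."""

    def __init__(self, k=5, hidden=128):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(
            nn.Linear(4 * k, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, history):
        # history: (batch, k, 4); predict a residual on top of the last observation
        delta = self.net(history.flatten(1))
        return history[:, -1] + delta

def constant_velocity_baseline(history):
    """The Kalman-style prediction this replaces: last box plus the last step's displacement."""
    return history[:, -1] + (history[:, -1] - history[:, -2])

model = LearnedMotionModel()
history = torch.rand(16, 5, 4)            # 16 tracks, 5 past boxes each
print(model(history).shape)               # (16, 4)
print(constant_velocity_baseline(history).shape)
```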
Real-time high-accuracy optical flow estimation is a crucial component in various applications, including localization and mapping in robotics, object tracking, and activity recognition in computer vision. While recent learning-based optical flow methods have achieved high accuracy, they often come with heavy computation costs. In this paper, we propose a highly efficient optical flow architecture, called NeuFlow, that addresses both high accuracy and computational cost concerns. The architecture follows a global-to-local scheme. Given the features of the input images extracted at different spatial resolutions, global matching is employed to estimate an initial optical flow on the 1/16 resolution, capturing large displacement, which is then refined on the 1/8 resolution with lightweight CNN layers for better accuracy. We evaluate our approach on Jetson Orin Nano and RTX 2080 to demonstrate efficiency improvements across different computing platforms. We achieve a notable 10x-80x speedup compared to several state-of-the-art methods, while maintaining comparable accuracy. Our approach achieves around 30 FPS on edge computing platforms, which represents a significant breakthrough in deploying complex computer vision tasks such as SLAM on small robots like drones. The full training and evaluation code is available at this https URL.
https://arxiv.org/abs/2403.10425
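The global-matching stage at 1/16 resolution can be sketched as exhaustive feature matching: normalize the two feature maps, take the argmax of the all-pairs correlation for every source location, and read the displacement off the matched index. The lightweight 1/8-resolution CNN refinement is omitted here, and the feature shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def global_match_flow(feat1, feat2):
    """Coarse optical flow by exhaustive feature matching.

    feat1, feat2: (C, H, W) features of two frames at 1/16 resolution.
    Returns a (2, H, W) flow field (dx, dy) in 1/16-resolution pixels.
    """
    c, h, w = feat1.shape
    f1 = F.normalize(feat1.reshape(c, -1), dim=0)            # (C, H*W)
    f2 = F.normalize(feat2.reshape(c, -1), dim=0)
    corr = f1.t() @ f2                                        # (H*W, H*W) all-pairs similarity
    match = corr.argmax(dim=1)                                # best target index per source pixel
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dx = (match % w).reshape(h, w) - xs
    dy = torch.div(match, w, rounding_mode="floor").reshape(h, w) - ys
    return torch.stack([dx, dy]).float()

# toy usage on 1/16-resolution features of a 640x480 frame pair
feat1, feat2 = torch.randn(128, 30, 40), torch.randn(128, 30, 40)
flow = global_match_flow(feat1, feat2)
print(flow.shape)  # (2, 30, 40)
```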
Visual object tracking aims to localize the target object in each frame based on its initial appearance in the first frame. Depending on the input modality, tracking tasks can be divided into RGB tracking and RGB+X (e.g. RGB+N, and RGB+D) tracking. Despite the different input modalities, the core aspect of tracking is the temporal matching. Based on this common ground, we present a general framework to unify various tracking tasks, termed as OneTracker. OneTracker first performs a large-scale pre-training on an RGB tracker called Foundation Tracker. This pretraining phase equips the Foundation Tracker with a stable ability to estimate the location of the target object. Then we regard other modality information as a prompt and build Prompt Tracker upon Foundation Tracker. Through freezing the Foundation Tracker and only adjusting some additional trainable parameters, Prompt Tracker inherits the strong localization ability from Foundation Tracker and achieves parameter-efficient finetuning on downstream RGB+X tracking tasks. To evaluate the effectiveness of our general framework OneTracker, which consists of Foundation Tracker and Prompt Tracker, we conduct extensive experiments on 6 popular tracking tasks across 11 benchmarks, and our OneTracker outperforms other models and achieves state-of-the-art performance.
https://arxiv.org/abs/2403.09634
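A minimal sketch of the Prompt-Tracker-style recipe: freeze the foundation encoder and train only a small prompt branch that injects a handful of tokens derived from the auxiliary modality. The prompt generator here is a toy linear projection, and the depths, dimensions, and prompt count are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class PromptTracker(nn.Module):
    """Frozen foundation encoder plus a small trainable prompt branch for the X modality."""

    def __init__(self, dim=768, depth=6, num_prompts=8):
        super().__init__()
        self.foundation = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 12, batch_first=True), num_layers=depth
        )
        for p in self.foundation.parameters():      # upstream RGB knowledge stays frozen
            p.requires_grad_(False)
        self.prompt_proj = nn.Linear(dim, dim)      # trainable: maps X-modality cues to prompts
        self.num_prompts = num_prompts

    def forward(self, rgb_tokens, x_tokens):
        prompts = self.prompt_proj(x_tokens[:, : self.num_prompts])   # a few prompt tokens
        return self.foundation(torch.cat([prompts, rgb_tokens], dim=1))

model = PromptTracker()
out = model(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(out.shape, f"trainable fraction: {trainable / total:.3%}")
```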