Object detection is one of the most fundamental computer vision tasks and is broadly used in pose estimation, object tracking, and instance segmentation models. To obtain training data for object detection models efficiently, many datasets collect their unannotated data in video format, and annotators then draw a bounding box around each object in the frames. Annotating every frame of a video is costly and inefficient, since many frames contain very similar information for the model to learn from. Selecting the most informative frames of a video to annotate is therefore a highly practical problem, yet it has attracted little attention in research. In this paper, we propose a novel active learning algorithm for object detection models to tackle this problem. The algorithm measures and aggregates both the classification and the localization informativeness of unlabelled data. Utilizing the temporal information in video frames, two novel localization informativeness measurements are proposed. Furthermore, a weight curve is proposed to avoid querying adjacent frames. The proposed active learning algorithm was evaluated with multiple configurations on the MuPoTS and FootballPD datasets.
https://arxiv.org/abs/2303.12760
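To make the frame-selection idea concrete, here is a minimal sketch of a greedy query strategy with a weight curve that suppresses frames adjacent to already-selected ones; the Gaussian curve shape, the `sigma` value, and the random informativeness scores are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def select_frames(scores, k, sigma=15.0):
    """Greedy frame selection: pick the highest-scoring frame, then suppress
    the scores of temporally adjacent frames with a Gaussian weight curve."""
    scores = np.asarray(scores, dtype=float).copy()
    n = len(scores)
    selected = []
    for _ in range(k):
        idx = int(np.argmax(scores))
        selected.append(idx)
        # Down-weight neighbours: weight -> 0 near the picked frame, -> 1 far away.
        dist = np.abs(np.arange(n) - idx)
        scores *= 1.0 - np.exp(-(dist ** 2) / (2 * sigma ** 2))
    return sorted(selected)

# Example: 300 frames with random informativeness scores, query 5 of them.
rng = np.random.default_rng(0)
print(select_frames(rng.random(300), k=5))
```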
Siamese-network-based trackers have developed rapidly in the field of visual object tracking in recent years. Most Siamese trackers in use today treat every channel of the feature maps generated by the backbone network equally, which makes the similarity response map sensitive to background clutter and hence hard to focus on the target region. Additionally, there are no structural links between the classification and regression branches in these trackers, and the two branches are optimized separately during training. This misalignment between the classification and regression branches results in less accurate tracking. In this paper, a Target Highlight Module is proposed to help the generated similarity response maps focus more on the target region. To reduce the misalignment and produce more precise tracking results, we propose a corrective loss to train the model. The two branches are jointly tuned with the corrective loss to produce more reliable predictions. Experiments on 5 challenging benchmark datasets show that the method outperforms current models while running at 38 fps, demonstrating its effectiveness and efficiency.
https://arxiv.org/abs/2303.12304
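A minimal sketch of the target-highlighting idea, assuming channel weights are derived from the template feature by pooling and a softmax before a depth-wise cross-correlation; the real Target Highlight Module is a learned module, so this only illustrates the channel-reweighting principle.

```python
import torch
import torch.nn.functional as F

def highlighted_xcorr(template_feat, search_feat):
    """Hedged sketch of target-aware channel weighting before cross-correlation.
    template_feat: (C, Ht, Wt), search_feat: (C, Hs, Ws)."""
    # Per-channel weights from the template: channels that respond strongly
    # on the target get emphasised, background-dominated channels suppressed.
    weights = torch.softmax(template_feat.mean(dim=(1, 2)), dim=0)  # (C,)
    t = template_feat * weights[:, None, None]
    s = search_feat * weights[:, None, None]
    # Depth-wise cross-correlation: the template acts as the kernel.
    response = F.conv2d(s.unsqueeze(0), t.unsqueeze(1), groups=t.shape[0])
    return response.sum(dim=1)  # aggregate channels into a single response map

resp = highlighted_xcorr(torch.randn(256, 7, 7), torch.randn(256, 31, 31))
print(resp.shape)  # torch.Size([1, 25, 25])
```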
Object tracking (OT) aims to estimate the positions of target objects in a video sequence. Depending on whether the initial states of the target objects are specified by annotations in the first frame or by category names, OT can be classified into instance tracking (e.g., SOT and VOS) and category tracking (e.g., MOT, MOTS, and VIS) tasks. Combining the advantages of the best practices developed in both communities, we propose a novel tracking-with-detection paradigm, where tracking supplements appearance priors for detection and detection provides tracking with candidate bounding boxes for association. Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline. Extensive experiments on 7 tracking datasets, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.
https://arxiv.org/abs/2303.12079
3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial role in autonomous driving. Current approaches all follow the Siamese paradigm based on appearance matching. However, LiDAR point clouds are usually textureless and incomplete, which hinders effective appearance matching. Besides, previous methods greatly overlook the critical motion clues among targets. In this work, beyond 3D Siamese tracking, we introduce a motion-centric paradigm to handle LiDAR SOT from a new perspective. Following this paradigm, we propose a matching-free two-stage tracker M^2-Track. At the first stage, M^2-Track localizes the target within successive frames via motion transformation. It then refines the target box through motion-assisted shape completion at the second stage. Due to its motion-centric nature, our method shows impressive generalizability with limited training labels and provides good differentiability for end-to-end cycle training. This inspires us to explore semi-supervised LiDAR SOT by incorporating a pseudo-label-based motion augmentation and a self-supervised loss term. Under the fully-supervised setting, extensive experiments confirm that M^2-Track significantly outperforms previous state-of-the-art methods on three large-scale datasets while running at 57 FPS (~8%, ~17%, and ~22% precision gains on KITTI, NuScenes, and the Waymo Open Dataset, respectively). Under the semi-supervised setting, our method performs on par with or even surpasses its fully-supervised counterpart using fewer than half of the labels from KITTI. Further analysis verifies each component's effectiveness and shows the motion-centric paradigm's promising potential for auto-labeling and unsupervised domain adaptation.
https://arxiv.org/abs/2303.12535
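A toy sketch of the motion-centric first stage, assuming a 4-DoF box (x, y, z, yaw) and a regressed inter-frame rigid motion; M^2-Track's actual networks and the second-stage shape completion are omitted.

```python
import numpy as np

def apply_relative_motion(prev_box, motion):
    """Hedged sketch of the motion-centric first stage: instead of matching
    appearance, a network regresses the target's inter-frame rigid motion
    (dx, dy, dz, dyaw) and the previous box is simply transported by it.
    prev_box: (x, y, z, yaw); motion: (dx, dy, dz, dyaw) in the box's local frame."""
    x, y, z, yaw = prev_box
    dx, dy, dz, dyaw = motion
    # Translate in the previous box's local frame, then update the heading.
    new_x = x + dx * np.cos(yaw) - dy * np.sin(yaw)
    new_y = y + dx * np.sin(yaw) + dy * np.cos(yaw)
    return (new_x, new_y, z + dz, yaw + dyaw)

print(apply_relative_motion((10.0, 5.0, 0.0, np.pi / 2), (1.0, 0.0, 0.0, 0.1)))
```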
Most previous progress in object tracking has been realized in daytime scenes with favorable illumination. State-of-the-art trackers can hardly maintain their superiority at night, which considerably hinders the broadening of visual-tracking-based unmanned aerial vehicle (UAV) applications. To realize reliable UAV tracking at night, a spatial-channel Transformer-based low-light enhancer (namely SCT), which is trained in a novel task-inspired manner, is proposed and plugged in ahead of tracking approaches. To achieve semantic-level low-light enhancement targeting the high-level task, a novel spatial-channel attention module is proposed to model global information while preserving local context. In the enhancement process, SCT denoises and illuminates nighttime images simultaneously through a robust non-linear curve projection. Moreover, to provide a comprehensive evaluation, we construct a challenging nighttime tracking benchmark, namely DarkTrack2021, which contains 110 challenging sequences with over 100K frames in total. Evaluations on both the public UAVDark135 benchmark and the newly constructed DarkTrack2021 benchmark show that the task-inspired design gives SCT significant performance gains for nighttime UAV tracking compared with other top-ranked low-light enhancers. Real-world tests on a typical UAV platform further verify the practicability of the proposed approach. The DarkTrack2021 benchmark and the code of the proposed approach are publicly available at this https URL.
https://arxiv.org/abs/2303.10951
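The abstract does not spell out the curve, so the sketch below uses a generic iterative quadratic curve of the kind common in learned low-light enhancement to illustrate what a non-linear curve projection does to a dark image; in SCT the curve parameters would be predicted by the network rather than fixed.

```python
import numpy as np

def curve_project(img, alphas):
    """Hedged sketch of a non-linear curve projection for low-light enhancement
    (the exact curve in SCT is not specified here): iteratively apply a
    quadratic adjustment x <- x + a * x * (1 - x), with pixel values in [0, 1]."""
    x = np.clip(img, 0.0, 1.0)
    for a in alphas:  # one curve parameter (or parameter map) per iteration
        x = x + a * x * (1.0 - x)
    return np.clip(x, 0.0, 1.0)

dark = np.full((4, 4), 0.1)          # a uniformly dark toy image
print(curve_project(dark, alphas=[0.8] * 4))
```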
Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning of the RGB-based parameters. Albeit effective, this approach is not optimal due to the scarcity of downstream data, poor transferability, and other issues. In this paper, inspired by the recent success of prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multi-modal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, while introducing only a few trainable parameters (less than 1% of the model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT achieves state-of-the-art performance while remaining parameter-efficient. Code and models are available at this https URL.
https://arxiv.org/abs/2303.10826
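A hedged sketch of the prompt-tuning recipe: freeze the pre-trained backbone and train only a few prompt parameters plus a small projection for the auxiliary modality. The class, token sizes, and the toy Transformer standing in for the foundation model are assumptions, so the printed parameter counts only illustrate the idea, not the <1% figure.

```python
import torch
import torch.nn as nn

class PromptedBackbone(nn.Module):
    """Hedged sketch of prompt tuning for multi-modal tracking: the RGB
    foundation model stays frozen and only a handful of modality prompts
    (and a small projection) are trained. Names and sizes are illustrative."""
    def __init__(self, frozen_backbone, embed_dim=768, num_prompts=8):
        super().__init__()
        self.backbone = frozen_backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                      # keep the foundation model frozen
        self.prompts = nn.Parameter(torch.zeros(num_prompts, embed_dim))
        self.aux_proj = nn.Linear(embed_dim, embed_dim)  # maps auxiliary-modal tokens

    def forward(self, rgb_tokens, aux_tokens):
        # Prepend learned prompts and projected auxiliary-modal tokens.
        extra = self.prompts.unsqueeze(0).expand(rgb_tokens.size(0), -1, -1)
        x = torch.cat([extra, self.aux_proj(aux_tokens), rgb_tokens], dim=1)
        return self.backbone(x)

backbone = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 8, batch_first=True), 2)
model = PromptedBackbone(backbone)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")  # tiny add-on vs. the (toy) frozen backbone
```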
The main challenge of Multi-Object Tracking (MOT) lies in maintaining a continuous trajectory for each target. Existing methods often learn reliable motion patterns to match the same target between adjacent frames and discriminative appearance features to re-identify lost targets after a long period. However, the reliability of motion prediction and the discriminability of appearance features can be easily hurt by dense crowds and extreme occlusions during tracking. In this paper, we propose a simple yet effective multi-object tracker, i.e., MotionTrack, which learns robust short-term and long-term motions in a unified framework to associate trajectories from short to long range. For dense crowds, we design a novel Interaction Module to learn interaction-aware motions from short-term trajectories, which can estimate the complex movement of each target. For extreme occlusions, we build a novel Refind Module to learn reliable long-term motions from the target's history trajectory, which can link an interrupted trajectory with its corresponding detection. Our Interaction Module and Refind Module are embedded in the well-known tracking-by-detection paradigm and work in tandem to maintain superior performance. Extensive experimental results on the MOT17 and MOT20 datasets demonstrate the superiority of our approach in challenging scenarios, and it achieves state-of-the-art performance on various MOT metrics.
https://arxiv.org/abs/2303.10404
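A simplified sketch of the Refind idea: estimate long-term motion from the stored history of a lost track, extrapolate across the occlusion gap, and re-link the nearest detection. The constant-velocity fit and the distance threshold are stand-ins for the learned module.

```python
import numpy as np

def refind(history, gap, detections, dist_thresh=50.0):
    """Hedged sketch of re-linking a lost track from its history trajectory:
    fit a constant velocity to the stored centers, extrapolate across the
    occlusion gap, and claim the nearest detection within a threshold.
    history: (T, 2) past centers; detections: (N, 2) current centers."""
    history = np.asarray(history, float)
    t = np.arange(len(history))
    # Least-squares linear fit per coordinate ~ a long-term motion estimate.
    vx, x0 = np.polyfit(t, history[:, 0], 1)
    vy, y0 = np.polyfit(t, history[:, 1], 1)
    t_now = len(history) - 1 + gap
    pred = np.array([x0 + vx * t_now, y0 + vy * t_now])
    d = np.linalg.norm(np.asarray(detections, float) - pred, axis=1)
    j = int(np.argmin(d))
    return (j if d[j] < dist_thresh else None), pred

match, pred = refind([(0, 0), (5, 2), (10, 4), (15, 6)], gap=10,
                     detections=[(66, 26), (200, 80)])
print(match, pred)  # detection 0 is re-linked to the lost track
```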
Object tracking is divided into single-object tracking (SOT) and multi-object tracking (MOT). MOT aims to maintain the identities of multiple objects across a series of continuous video sequences. In recent years, MOT has made rapid progress. However, modeling the motion and appearance models of objects in complex scenes still faces various challenging issues. In this paper, we design a novel direction-consistency method for smooth trajectory prediction (STP-DC) to strengthen the modeling of motion information and overcome the lack of robustness of previous methods in complex scenes. Existing methods use pedestrian re-identification (Re-ID) to model appearance; however, they extract more background information, which lacks discriminability in occluded and crowded scenes. We propose a hyper-grain feature embedding network (HG-FEN) to enhance the modeling of appearance, thus generating robust appearance descriptors. We also propose other robustness techniques, including CF-ECM for storing robust appearance information and SK-AS for improving association accuracy. To achieve state-of-the-art performance in MOT, we propose a robust tracker named Rt-track, incorporating various tricks and techniques. It achieves 79.5 MOTA, 76.0 IDF1, and 62.1 HOTA on the test set of MOT17. Rt-track also achieves 77.9 MOTA, 78.4 IDF1, and 63.3 HOTA on MOT20, surpassing all published methods.
https://arxiv.org/abs/2303.09668
3D single object tracking (SOT) is an indispensable part of automated driving. Existing approaches rely heavily on large, densely labeled datasets. However, annotating point clouds is both costly and time-consuming. Inspired by the great success of cycle tracking in unsupervised 2D SOT, we introduce the first semi-supervised approach to 3D SOT. Specifically, we introduce two cycle-consistency strategies for supervision: 1) Self tracking cycles, which leverage labels to help the model converge better in the early stages of training; 2) forward-backward cycles, which strengthen the tracker's robustness to motion variations and the template noise caused by the template update strategy. Furthermore, we propose a data augmentation strategy named SOTMixup to improve the tracker's robustness to point cloud diversity. SOTMixup generates training samples by sampling points in two point clouds with a mixing rate and assigns a reasonable loss weight for training according to the mixing rate. The resulting MixCycle approach generalizes to appearance matching-based trackers. On the KITTI benchmark, based on the P2B tracker, MixCycle trained with 10% labels outperforms P2B trained with 100% labels, and achieves a 28.4% precision improvement when using 1% labels. Our code will be publicly released.
https://arxiv.org/abs/2303.09219
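A minimal sketch of SOTMixup as described: sample points from two point clouds according to a mixing rate and reuse that rate as the loss weight; the point counts and sampling details are assumptions.

```python
import numpy as np

def sot_mixup(pc_a, pc_b, lam, num_points=1024, rng=None):
    """Hedged sketch of SOTMixup-style point-cloud mixing (details assumed):
    draw a fraction `lam` of the output points from cloud A and the rest from
    cloud B; the same `lam` can then weight the two samples' losses."""
    rng = rng or np.random.default_rng()
    n_a = int(round(lam * num_points))
    idx_a = rng.choice(len(pc_a), n_a, replace=True)
    idx_b = rng.choice(len(pc_b), num_points - n_a, replace=True)
    mixed = np.concatenate([pc_a[idx_a], pc_b[idx_b]], axis=0)
    return mixed, lam  # lam doubles as the loss weight for cloud A's target

pc_a, pc_b = np.random.rand(2048, 3), np.random.rand(512, 3)
mixed, w = sot_mixup(pc_a, pc_b, lam=0.7)
print(mixed.shape, w)  # (1024, 3) 0.7
```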
In recent years, anchor-free object detection models combined with matching algorithms have been used to achieve real-time multi-object tracking while ensuring high tracking accuracy. However, great challenges still remain in multi-object tracking. For example, when most of a target is occluded or the target temporarily disappears from the images, most existing tracking algorithms suffer tracking interruptions. Therefore, this study offers a bi-directional matching algorithm for multi-object tracking that takes advantage of bi-directional motion prediction information to improve occlusion handling. A stranded area is used in the matching algorithm to temporarily store the objects that fail to be tracked. When objects recover from occlusions, our method first tries to match them with objects in the stranded area to avoid erroneously generating new identities, thus forming more continuous trajectories. Experiments show that our approach can improve multi-object tracking performance in the presence of occlusions. In addition, this study provides an attentional up-sampling module that not only preserves tracking accuracy but also accelerates training. In the MOT17 challenge, the proposed algorithm achieves 63.4% MOTA, 55.3% IDF1, and a tracking speed of 20.1 FPS.
https://arxiv.org/abs/2303.08444
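A toy sketch of the stranded-area logic: detections are matched to active tracks first, and the leftovers are checked against temporarily lost tracks before new identities are created. Greedy nearest-neighbour matching on box centers stands in for the paper's bi-directional matching.

```python
import numpy as np

def match_with_stranded(detections, active, stranded, dist_thresh=40.0):
    """Hedged sketch of the 'stranded area' idea: unmatched detections are
    checked against temporarily lost (stranded) tracks before any new
    identity is created."""
    def greedy(dets, tracks):
        pairs, used = [], set()
        for di, d in enumerate(dets):
            dists = {ti: np.linalg.norm(np.subtract(d, p))
                     for ti, p in tracks.items() if ti not in used}
            if not dists:
                break
            ti = min(dists, key=dists.get)
            if dists[ti] < dist_thresh:
                pairs.append((di, ti))
                used.add(ti)
        return pairs

    first = greedy(detections, active)                 # match against active tracks
    leftover = [i for i in range(len(detections)) if i not in {d for d, _ in first}]
    second = greedy([detections[i] for i in leftover], stranded)
    recovered = [(leftover[d], ti) for d, ti in second]  # revived from the stranded area
    new_ids = [i for i in leftover if i not in {d for d, _ in recovered}]
    return first, recovered, new_ids

active = {1: (100.0, 100.0)}
stranded = {7: (300.0, 220.0)}          # track 7 was occluded a few frames ago
dets = [(102.0, 98.0), (305.0, 225.0)]
print(match_with_stranded(dets, active, stranded))  # ([(0, 1)], [(1, 7)], [])
```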
Planar object tracking is a critical computer vision problem and has drawn increasing interest owing to its key roles in robotics, augmented reality, etc. Despite rapid progress, its further development, especially in the deep learning era, is largely hindered by the lack of large-scale challenging benchmarks. Addressing this, we introduce PlanarTrack, a large-scale challenging planar tracking benchmark. Specifically, PlanarTrack consists of 1,000 videos with more than 490K images. All these videos are collected in complex unconstrained scenarios in the wild, which makes PlanarTrack, compared with existing benchmarks, more challenging yet realistic for real-world applications. To ensure high-quality annotation, each frame in PlanarTrack is manually labeled with four corners, with multiple rounds of careful inspection and refinement. To the best of our knowledge, PlanarTrack is, to date, the largest and most challenging dataset dedicated to planar object tracking. To analyze the proposed PlanarTrack, we evaluate 10 planar trackers and conduct comprehensive comparisons and in-depth analysis. Our results, not surprisingly, demonstrate that current top-performing planar trackers degrade significantly on the challenging PlanarTrack and that more effort is needed to improve planar tracking in the future. In addition, we further derive a variant named PlanarTrack_BB for generic object tracking from PlanarTrack. Our evaluation of 10 excellent generic trackers on PlanarTrack_BB shows that, surprisingly, PlanarTrack_BB is even more challenging than several popular generic tracking benchmarks and that more attention should be paid to handling such planar objects, even though they are rigid. All benchmarks and evaluations will be released at the project webpage.
https://arxiv.org/abs/2303.07625
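Since each frame is annotated with four corners and the PlanarTrack_BB variant is used for generic (box-based) tracking, a plausible corner-to-box conversion looks like the following; the exact protocol used to derive PlanarTrack_BB is an assumption here.

```python
def corners_to_bbox(corners):
    """Hedged sketch: derive an axis-aligned bounding box from the four
    annotated corners by taking the min/max over the corner coordinates.
    corners: [(x1, y1), ..., (x4, y4)] -> (x, y, w, h)."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    x0, y0 = min(xs), min(ys)
    return (x0, y0, max(xs) - x0, max(ys) - y0)

print(corners_to_bbox([(120, 80), (260, 95), (250, 210), (110, 190)]))  # (110, 80, 150, 130)
```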
The task of 3D single object tracking (SOT) with LiDAR point clouds is crucial for various applications, such as autonomous driving and robotics. However, existing approaches have primarily relied on appearance matching or motion modeling within only two successive frames, thereby overlooking the long-range continuous motion property of objects in 3D space. To address this issue, this paper presents a novel approach that views each tracklet as a continuous stream: at each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank, enabling efficient exploitation of sequential information. To achieve effective cross-frame message passing, a hybrid attention mechanism is designed to account for both long-range relation modeling and local geometric feature extraction. Furthermore, to enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is designed, which uses ground truth tracklets to augment training sequences and promote discrimination against false positives in a contrastive manner. Extensive experiments demonstrate that the proposed method outperforms the state-of-the-art method by significant margins (approximately 8%, 6%, and 12% improvements in the success performance on KITTI, nuScenes, and Waymo, respectively).
https://arxiv.org/abs/2303.07605
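A rough sketch of the memory-bank idea, assuming a FIFO bank of per-frame features and a single dot-product attention read standing in for the paper's hybrid attention; the contrastive sequence enhancement is not shown.

```python
import torch
import torch.nn.functional as F

class FeatureMemory:
    """Hedged sketch of a fixed-size memory bank of per-frame features: the
    current frame's features attend over stored historical features, and the
    bank is updated FIFO-style as the tracklet streams in."""
    def __init__(self, capacity=4):
        self.capacity, self.bank = capacity, []

    def read(self, query):                       # query: (N, C) current-frame features
        if not self.bank:
            return query
        mem = torch.cat(self.bank, dim=0)        # (M, C) historical features
        attn = F.softmax(query @ mem.t() / query.shape[1] ** 0.5, dim=-1)
        return query + attn @ mem                # cross-frame message passing

    def write(self, feats):
        self.bank.append(feats.detach())
        if len(self.bank) > self.capacity:
            self.bank.pop(0)                     # drop the oldest frame

memory = FeatureMemory()
for t in range(6):                               # stream frames one by one
    cur = torch.randn(128, 64)
    fused = memory.read(cur)
    memory.write(cur)
print(fused.shape, len(memory.bank))             # torch.Size([128, 64]) 4
```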
Modern perception systems of autonomous vehicles are known to be sensitive to occlusions and to lack long-range perception capability, which has been one of the key bottlenecks preventing Level 5 autonomy. Recent research has demonstrated that Vehicle-to-Vehicle (V2V) cooperative perception systems have great potential to revolutionize the autonomous driving industry. However, the lack of a real-world dataset hinders the progress of this field. To facilitate the development of cooperative perception, we present V2V4Real, the first large-scale real-world multi-modal dataset for V2V perception. The data is collected by two vehicles equipped with multi-modal sensors driving together through diverse scenarios. Our V2V4Real dataset covers a driving area of 410 km and comprises 20K LiDAR frames, 40K RGB frames, 240K annotated 3D bounding boxes for 5 classes, and HD maps covering all the driving routes. V2V4Real introduces three perception tasks, including cooperative 3D object detection, cooperative 3D object tracking, and Sim2Real domain adaptation for cooperative perception. We provide comprehensive benchmarks of recent cooperative perception algorithms on the three tasks. The V2V4Real dataset and codebase can be found at this https URL.
https://arxiv.org/abs/2303.07601
The existence of symmetric objects, whose observations from different viewpoints can be identical, can degrade the performance of simultaneous localization and mapping (SLAM). This work proposes a system for robustly optimizing the poses of cameras and objects even in the presence of symmetric objects. We classify objects into three categories depending on their symmetry characteristics, which is efficient and effective in that it can handle general objects, and objects in the same category are associated with the same type of ambiguity. We then extract only the unambiguous parameters corresponding to each category and use them in data association and in the joint optimization of camera and object poses. The proposed approach provides significant robustness to SLAM performance by removing the ambiguous parameters and utilizing as much useful geometric information as possible. Comparison with baseline algorithms confirms the superior performance of the proposed system in terms of object tracking and pose estimation, even in challenging scenarios where the baselines fail.
https://arxiv.org/abs/2303.07872
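A hedged sketch of keeping only the unambiguous pose parameters per symmetry category; the three categories and the (x, y, z, yaw) parameterisation below are illustrative assumptions rather than the paper's exact taxonomy.

```python
import numpy as np

def unambiguous_params(pose_xyz_yaw, symmetry):
    """Hedged sketch of using only unambiguous pose parameters per symmetry
    category (categories and parameterisation assumed for illustration):
    - 'none'       : all parameters are usable
    - 'rotational' : yaw is ambiguous (e.g. a bottle), keep translation only
    - 'reflective' : yaw is only known up to a half-turn, so fold it
    """
    x, y, z, yaw = pose_xyz_yaw
    if symmetry == "none":
        return {"t": (x, y, z), "yaw": yaw}
    if symmetry == "rotational":
        return {"t": (x, y, z)}                      # drop yaw from the optimization
    if symmetry == "reflective":
        return {"t": (x, y, z), "yaw": yaw % np.pi}  # fold the ambiguous half-turn
    raise ValueError(symmetry)

print(unambiguous_params((1.0, 2.0, 0.5, 3.5), "reflective"))
```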
All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field has been split into multiple independent subtasks. In this work, we present a universal instance perception model of the next generation, termed UNINEXT. UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts. This unified formulation brings the following benefits: (1) enormous data from different tasks and label vocabularies can be exploited for jointly training general instance-level representations, which is especially beneficial for tasks lacking in training data. (2) the unified model is parameter-efficient and can save redundant computation when handling multiple tasks simultaneously. UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks including classical image-level tasks (object detection and instance segmentation), vision-and-language tasks (referring expression comprehension and segmentation), and six video-level object tracking tasks. Code is available at this https URL.
https://arxiv.org/abs/2303.06674
3D single object tracking has been a crucial problem for decades with numerous applications such as autonomous driving. Despite its wide-ranging use, this task remains challenging due to the significant appearance variation caused by occlusion and size differences among tracked targets. To address these issues, we present MBPTrack, which adopts a Memory mechanism to utilize past information and formulates localization in a coarse-to-fine scheme using Box Priors given in the first frame. Specifically, past frames with targetness masks serve as an external memory, and a transformer-based module propagates tracked target cues from the memory to the current frame. To precisely localize objects of all sizes, MBPTrack first predicts the target center via Hough voting. By leveraging box priors given in the first frame, we adaptively sample reference points around the target center that roughly cover the target of different sizes. Then, we obtain dense feature maps by aggregating point features into the reference points, where localization can be performed more effectively. Extensive experiments demonstrate that MBPTrack achieves state-of-the-art performance on KITTI, nuScenes and Waymo Open Dataset, while running at 50 FPS on a single RTX3090 GPU.
https://arxiv.org/abs/2303.05071
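A small sketch of box-prior reference sampling: once Hough voting yields a coarse center, reference points are spread on a grid scaled by the first-frame box size; the grid layout is an assumption.

```python
import numpy as np

def sample_reference_points(center, first_frame_box, grid=3):
    """Hedged sketch of box-prior reference sampling: spread reference points
    around the voted target center on a grid scaled by the box size given in
    the first frame, so small and large targets are both roughly covered.
    center: (x, y, z); first_frame_box: (w, l, h)."""
    w, l, h = first_frame_box
    offs = np.linspace(-0.5, 0.5, grid)
    pts = [(center[0] + ox * w, center[1] + oy * l, center[2])
           for ox in offs for oy in offs]
    return np.array(pts)

print(sample_reference_points((2.0, 1.0, 0.0), first_frame_box=(4.0, 2.0, 1.5)).shape)  # (9, 3)
```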
Event cameras capture visual information with a high temporal resolution and a wide dynamic range. This enables capturing visual information at fine time granularities (e.g., microseconds) in rapidly changing environments, which makes event cameras highly useful for high-speed robotics tasks involving rapid motion, such as high-speed perception, object tracking, and control. However, convolutional neural network (CNN) inference on event camera streams currently cannot run in real time at the high rates at which event cameras operate - current CNN inference times are typically closer in order of magnitude to the frame rates of regular frame-based cameras. Real-time inference at event camera rates is necessary to fully leverage the high frequency and high temporal resolution that event cameras offer. This paper presents EvConv, a new approach to enable fast CNN inference for inputs from event cameras. We observe that consecutive inputs to the CNN from an event camera have only small differences between them. Thus, we propose to perform inference on the difference between consecutive input tensors, i.e., the increment. This enables a significant reduction in the number of floating-point operations required (and thus the inference latency) because increments are very sparse. We design EvConv to leverage the irregular sparsity in increments from event cameras and to retain the sparsity of these increments across all layers of the network. We demonstrate a reduction of up to 98% in the number of floating-point operations required in the forward pass. We also demonstrate a speedup of up to 1.6X for CNN inference on tasks such as depth estimation, object recognition, and optical flow estimation, with almost no loss in accuracy.
https://arxiv.org/abs/2303.04670
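The core of the incremental idea is that convolution (ignoring bias and nonlinearity) is linear in its input, so the output for the current frame can be obtained by adding the convolution of the sparse increment to the previous output. A minimal check of that identity, with a hand-made sparse increment standing in for real event data:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
conv_w = torch.randn(8, 4, 3, 3)                    # a convolution layer's weights (no bias)

x_prev = torch.randn(1, 4, 64, 64)                  # previous event-tensor input
delta = torch.zeros_like(x_prev)
delta[0, :, 10:12, 20:22] = torch.randn(4, 2, 2)    # sparse increment: only a few changed pixels
x_curr = x_prev + delta

# Dense inference on the new frame vs. incremental update with the sparse delta.
y_dense = F.conv2d(x_curr, conv_w, padding=1)
y_incr = F.conv2d(x_prev, conv_w, padding=1) + F.conv2d(delta, conv_w, padding=1)

print(torch.allclose(y_dense, y_incr, atol=1e-4))   # True: convolution is linear in its input
print((delta != 0).float().mean().item())           # fraction of nonzero entries in the increment
```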
Vision-based object tracking has enabled extensive autonomous applications for unmanned aerial vehicles (UAVs). However, the dynamic changes in flight maneuver and viewpoint encountered in UAV tracking pose significant difficulties, e.g., aspect ratio change and scale variation. The conventional cross-correlation operation, while commonly used, has limitations in effectively capturing perceptual similarity and incorporates extraneous background information. To mitigate these limitations, this work presents a novel saliency-guided dynamic vision Transformer (SGDViT) for UAV tracking. The proposed method designs a new task-specific object saliency mining network to refine the cross-correlation operation and effectively discriminate foreground from background information. Additionally, a saliency adaptation embedding operation dynamically generates tokens based on initial saliency, thereby reducing the computational complexity of the Transformer architecture. Finally, a lightweight saliency filtering Transformer further refines the saliency information and increases the focus on appearance information. The efficacy and robustness of the proposed approach have been thoroughly assessed through experiments on three widely used UAV tracking benchmarks and in real-world scenarios, with results demonstrating its superiority. The source code and demo videos are available at this https URL.
https://arxiv.org/abs/2303.04378
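A minimal sketch of saliency-adaptive token generation: rank patch tokens by a saliency score and keep only the top fraction before the Transformer; the keep ratio and the hard top-k selection are assumptions standing in for the learned embedding operation.

```python
import torch

def saliency_tokens(patch_feats, saliency, keep_ratio=0.25):
    """Hedged sketch of saliency-adaptive token generation: keep only the
    patch tokens with the highest saliency scores before running the
    Transformer, which shrinks the attention cost roughly quadratically.
    patch_feats: (N, C); saliency: (N,)."""
    k = max(1, int(keep_ratio * patch_feats.shape[0]))
    idx = torch.topk(saliency, k).indices
    return patch_feats[idx], idx

feats, sal = torch.randn(196, 256), torch.rand(196)
kept, idx = saliency_tokens(feats, sal)
print(kept.shape)   # torch.Size([49, 256])
```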
This paper introduces a novel approach to video object detection and tracking on Unmanned Aerial Vehicles (UAVs). By incorporating metadata, the proposed approach creates a memory map of object locations in real-world coordinates, providing a more robust and interpretable representation of object locations in both image space and the real world. We use this representation to boost confidences, resulting in improved performance for several temporal computer vision tasks, such as video object detection, short- and long-term single- and multi-object tracking, and video anomaly detection. These findings confirm the benefits of metadata in enhancing the capabilities of UAVs in the field of temporal computer vision and pave the way for further advancements in this area.
https://arxiv.org/abs/2303.03508
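A toy sketch of the memory-map idea: with the metadata-based image-to-world projection omitted, a detection whose world position falls near a previously confirmed object location gets a confidence boost. The radius and boost values are arbitrary illustrations, not the paper's settings.

```python
import numpy as np

def boost_confidence(det_world_xy, conf, memory, radius=2.0, boost=0.1):
    """Hedged sketch of the memory-map idea: detections are projected into
    world coordinates using UAV metadata (GPS/altitude/camera pose, projection
    omitted here), and a detection near a previously confirmed world location
    gets its confidence boosted.
    memory: list of (x, y) world positions of past confirmed objects."""
    if memory:
        d = np.linalg.norm(np.asarray(memory) - np.asarray(det_world_xy), axis=1)
        if d.min() < radius:
            conf = min(1.0, conf + boost)
    return conf

memory = [(120.3, 45.1), (98.7, 60.2)]
print(boost_confidence((120.0, 45.5), conf=0.42, memory=memory))  # boosted to 0.52
```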
Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos. To push forward RMOT, we construct one benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture TransRMOT to tackle the new task in an online manner, which achieves impressive detection performance and outperforms other counterparts.
https://arxiv.org/abs/2303.03366
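A hedged sketch of referring tracking viewed as a filtering step: keep the tracks whose visual embeddings are similar to the language embedding of the expression. How TransRMOT actually fuses language and vision inside the Transformer is not shown; the embeddings and the cosine-similarity threshold are assumptions.

```python
import numpy as np

def refer_filter(track_feats, text_feat, threshold=0.5):
    """Hedged sketch of referring multi-object tracking as a filtering step:
    keep every track whose visual embedding is similar enough to the language
    embedding of the expression (how the embeddings are produced is assumed).
    track_feats: dict track_id -> feature vector; text_feat: feature vector."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return [tid for tid, f in track_feats.items() if cos(f, text_feat) > threshold]

rng = np.random.default_rng(0)
tracks = {i: rng.normal(size=64) for i in range(5)}
text = tracks[2] + 0.1 * rng.normal(size=64)   # toy setup: the expression matches track 2
print(refer_filter(tracks, text))              # track 2 is kept
```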