Abstract
The task of 3D single object tracking (SOT) with LiDAR point clouds is crucial for various applications, such as autonomous driving and robotics. However, existing approaches have primarily relied on appearance matching or motion modeling within only two successive frames, thereby overlooking the long-range continuous motion property of objects in 3D space. To address this issue, this paper presents a novel approach that views each tracklet as a continuous stream: at each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank, enabling efficient exploitation of sequential information. To achieve effective cross-frame message passing, a hybrid attention mechanism is designed to account for both long-range relation modeling and local geometric feature extraction. Furthermore, to enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is designed, which uses ground truth tracklets to augment training sequences and promote discrimination against false positives in a contrastive manner. Extensive experiments demonstrate that the proposed method outperforms the state-of-the-art method by significant margins (approximately 8%, 6%, and 12% improvements in the success performance on KITTI, nuScenes, and Waymo, respectively).
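The streaming idea described above (feed only the current frame, and let it interact with stored features from earlier frames) can be illustrated with a small sketch. This is not the paper's implementation. It assumes each frame is summarized by a single feature vector, and it uses plain scaled dot-product attention in place of the hybrid attention mechanism. The names `FeatureMemoryBank`, `track_stream`, and `max_frames` are hypothetical.

```python
import math
from collections import deque


class FeatureMemoryBank:
    """FIFO store of per-frame feature vectors (hypothetical helper, not from the paper)."""

    def __init__(self, max_frames):
        # deque with maxlen evicts the oldest frame automatically when full
        self.frames = deque(maxlen=max_frames)

    def write(self, feature):
        self.frames.append(list(feature))

    def read(self, query):
        """Attend from the current-frame query over all stored historical features."""
        if not self.frames:
            return list(query)  # no history yet: pass the query through unchanged
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in self.frames]
        peak = max(scores)  # subtract the max before exp for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # weighted sum of the stored features, one output value per dimension
        return [sum(w * f[d] for w, f in zip(weights, self.frames))
                for d in range(len(query))]


def track_stream(frame_features, max_frames=3):
    """Process frames one at a time; each frame sees only itself plus the memory."""
    bank = FeatureMemoryBank(max_frames)
    fused = []
    for feature in frame_features:
        fused.append(bank.read(feature))  # interact with historical features
        bank.write(feature)               # then store the current frame
    return fused
```

In a real tracker the stored items would be per-point feature maps rather than single vectors, and the fused output would feed a localization head. The sketch only shows why per-step cost depends on the memory size rather than on the full tracklet length.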
URL
https://arxiv.org/abs/2303.07605