Abstract
Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art single object tracking (SOT) trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules, a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM module integrates multi-view features from both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM module introduces the acoustic-response-equivalent pixel property and proposes a normalized pixel brightness response score, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model's feature representation, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show that the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at this https URL.
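As a rough illustration of the trajectory-correction idea described above (the abstract does not give OTCM's exact formulation), the sketch below re-ranks candidate boxes by combining an appearance match score with a normalized pixel brightness response, gated by overlap with a Kalman-filter prediction box. All names here (brightness_response, select_candidate, candidate_boxes, kalman_box, alpha) are hypothetical, and the scoring rule is an assumption for illustration, not the paper's method.

```python
# Illustrative sketch only; not the OTCM implementation from the paper.
import numpy as np

def brightness_response(image: np.ndarray, box: np.ndarray) -> float:
    """Mean pixel brightness inside a (x1, y1, x2, y2) box on a grayscale sonar image."""
    x1, y1, x2, y2 = box.astype(int)
    patch = image[y1:y2, x1:x2]
    return float(patch.mean()) if patch.size else 0.0

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def select_candidate(image, candidate_boxes, match_scores, kalman_box, alpha=0.5):
    """Re-rank tracker candidates: appearance match score plus a normalized
    brightness-response term, weighted by overlap with the Kalman prediction."""
    resp = np.array([brightness_response(image, b) for b in candidate_boxes])
    resp = resp / (resp.max() + 1e-6)              # normalized brightness response
    motion = np.array([iou(b, kalman_box) for b in candidate_boxes])
    scores = match_scores + alpha * resp * motion  # suppress bright but off-track matches
    return candidate_boxes[int(np.argmax(scores))]
```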
URL
https://arxiv.org/abs/2504.15609