SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object Tracking

2025-04-22 06:02:32
Yunfeng Li, Bo Wang, Jiahao Wan, Xueyi Wu, Ye Li

Abstract

Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art single object tracking (SOT) trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules: a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM integrates multi-view features of both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model's features, STFTrack introduces an acoustic image enhancement method and a frequency enhancement module (FEM) into its tracking pipeline. Comprehensive experiments show that the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at this https URL.
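
The abstract does not give the formula behind the OTCM's normalized pixel brightness response scores, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a grayscale sonar frame stored as a list of rows, candidate boxes given as (x, y, width, height), and a score defined as the mean brightness inside a box divided by the frame's peak brightness. The names brightness_score and pick_candidate are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels; assumed layout


def brightness_score(image: Sequence[Sequence[float]], box: Box) -> float:
    """Mean brightness inside `box`, normalized by the frame's peak brightness.

    This definition is an assumption made for illustration; the paper's
    actual score may be computed differently.
    """
    x, y, w, h = box
    pixels = [p for row in image[y:y + h] for p in row[x:x + w]]
    peak = max(max(row) for row in image)
    if not pixels or peak <= 0:
        return 0.0
    return (sum(pixels) / len(pixels)) / peak


def pick_candidate(image: Sequence[Sequence[float]], candidates: List[Box]) -> Box:
    """Return the candidate box with the highest normalized brightness score."""
    return max(candidates, key=lambda box: brightness_score(image, box))


# Toy 4x4 frame with a bright 2x2 target whose top-left corner is at (1, 1).
frame = [
    [20, 20, 20, 20],
    [20, 200, 200, 20],
    [20, 200, 200, 20],
    [20, 20, 20, 20],
]
# The first box stands in for a drifted prediction, the second covers the target.
best = pick_candidate(frame, [(0, 0, 2, 2), (1, 1, 2, 2)])
```

In this toy frame the box covering the bright region scores higher than the drifted one, which is the kind of preference the abstract attributes to the OTCM when a Kalman filter prediction box is inaccurate.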

URL

https://arxiv.org/abs/2504.15609

PDF

https://arxiv.org/pdf/2504.15609.pdf

