Abstract
Accurately detecting and tracking high-speed, small objects, such as balls in sports videos, is challenging due to factors like motion blur and occlusion. Although recent deep learning frameworks like TrackNetV1, V2, and V3 have advanced tennis ball and shuttlecock tracking, they often struggle in scenarios with partial occlusion or low visibility. This is primarily because these models rely heavily on visual features without explicitly incorporating motion information, which is crucial for precise tracking and trajectory prediction. In this paper, we introduce an enhancement to the TrackNet family by fusing high-level visual features with learnable motion attention maps through a motion-aware fusion mechanism, effectively emphasizing the moving ball's location and improving tracking performance. Our approach leverages frame differencing maps, modulated by a motion prompt layer, to highlight key motion regions over time. Experimental results on the tennis ball and shuttlecock datasets show that our method enhances the tracking performance of both TrackNetV2 and V3. We refer to our lightweight, plug-and-play solution, built on top of the existing TrackNet, as TrackNetV4.
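The motion-aware fusion the abstract describes can be sketched in a few lines: frame differencing produces raw motion maps, a motion prompt layer squashes them into attention maps, and those maps modulate the visual features. The sketch below is a minimal NumPy illustration of that idea, not the paper's implementation; the `slope` and `shift` parameters stand in for the motion prompt layer's learnable parameters, and the residual-style fusion `feats * (1 + attention)` is one plausible choice among several.

```python
import numpy as np

def frame_difference_maps(frames):
    """Absolute differences between consecutive grayscale frames.

    frames: array of shape (T, H, W) -> returns (T-1, H, W).
    """
    return np.abs(np.diff(frames, axis=0))

def motion_prompt(diff_maps, slope=5.0, shift=0.1):
    """Map raw frame differences to attention values in (0, 1).

    slope/shift are hypothetical fixed stand-ins for the learnable
    parameters of the paper's motion prompt layer.
    """
    return 1.0 / (1.0 + np.exp(-slope * (diff_maps - shift)))

def motion_aware_fusion(visual_feats, attention):
    """Emphasize features at moving-object locations.

    Element-wise residual modulation: regions with high motion
    attention are amplified, static regions pass through unchanged.
    """
    return visual_feats * (1.0 + attention)

# Tiny demo: a single bright pixel "moves" in frame 1.
frames = np.zeros((4, 8, 8))
frames[1, 3, 3] = 1.0
diffs = frame_difference_maps(frames)        # shape (3, 8, 8)
attn = motion_prompt(diffs)                  # values in (0, 1)
fused = motion_aware_fusion(np.ones_like(diffs), attn)
```

Because the fusion is element-wise and the attention maps are computed per frame pair, the mechanism adds little overhead, consistent with the abstract's claim that the solution is lightweight and plug-and-play on top of an existing TrackNet backbone.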
URL
https://arxiv.org/abs/2409.14543