Spatio-temporal Graph Learning on Adaptive Mined Key Frames for High-performance Multi-Object Tracking

2025-01-17 11:36:38
Futian Wang, Fengxiang Liu, Xiao Wang

Abstract

Multi-object tracking must accurately capture the spatial and temporal relationships between objects in a video sequence, and this remains difficult. Frequent mutual occlusion among objects makes it harder still, causing tracking errors and degrading the performance of existing methods. Motivated by these challenges, we propose a novel adaptive key frame mining strategy that addresses the limitations of current tracking approaches. Specifically, we introduce a Key Frame Extraction (KFE) module that uses reinforcement learning to adaptively segment videos, guiding the tracker to exploit the intrinsic logic of the video content. This lets the tracker capture both the structured spatial relationships between different objects and the temporal relationships of objects across frames. To handle occlusion, we develop an Intra-Frame Feature Fusion (IFF) module. Unlike traditional graph-based methods, which focus primarily on inter-frame feature fusion, the IFF module uses a Graph Convolutional Network (GCN) to exchange information between the target and surrounding objects within a frame. This significantly improves target distinguishability and reduces the track loss and appearance confusion caused by occlusion. By combining the strengths of long and short trajectories and accounting for the spatial relationships between objects, the proposed tracker achieves 68.6 HOTA, 81.0 IDF1, 66.6 AssA, and 893 IDS on the MOT17 dataset, demonstrating its effectiveness and accuracy.
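
To make the intra-frame fusion idea concrete, here is a minimal sketch of one graph-convolution step over the detections in a single frame. It is not the authors' IFF module: the abstract does not say how the graph is built, so the center-distance rule, the `radius` parameter, and the function names `build_adjacency` and `gcn_layer` are all assumptions made for illustration. The layer itself is the standard symmetric-normalized form, ReLU(D^-1/2 A D^-1/2 X W), written in plain Python.

```python
import math

def build_adjacency(boxes, radius):
    """Hypothetical graph rule: link two detections in the same frame when
    their box centers lie within `radius` pixels. Self-loops are included."""
    centers = [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for x1, y1, x2, y2 in boxes]
    n = len(centers)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or math.dist(centers[i], centers[j]) <= radius:
                adj[i][j] = 1.0
    return adj

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 A D^-1/2 X W)."""
    n = len(adj)
    d_in, d_out = len(weight), len(weight[0])
    # Node degrees; self-loops guarantee every degree is at least 1.
    deg = [sum(row) for row in adj]
    # Symmetrically normalized adjacency.
    norm = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate each node's neighborhood features.
    agg = [[sum(norm[i][k] * feats[k][c] for k in range(n)) for c in range(d_in)]
           for i in range(n)]
    # Linear projection followed by ReLU.
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(d_in)))
             for o in range(d_out)]
            for i in range(n)]

# Toy frame: two overlapping detections and one far away (x1, y1, x2, y2).
boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110)]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # made-up appearance features
weight = [[1.0, 0.0], [0.0, 1.0]]              # identity, so only mixing is visible
fused = gcn_layer(build_adjacency(boxes, radius=20.0), feats, weight)
```

In this toy frame the two nearby detections end up with mixed features, while the isolated detection keeps its own feature unchanged because its only edge is its self-loop. A trained model would learn `weight` and would likely use a richer edge definition than center distance.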

URL

https://arxiv.org/abs/2501.10129

PDF

https://arxiv.org/pdf/2501.10129.pdf

