MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking

2023-05-23 17:40:13
En Yu, Tiancai Wang, Zhuoling Li, Yuang Zhang, Xiangyu Zhang, Wenbing Tao

Abstract

Although end-to-end multi-object trackers like MOTR enjoy the merit of simplicity, they suffer seriously from the conflict between detection and association, resulting in unsatisfactory convergence dynamics. While MOTRv2 partly addresses this problem, it requires an additional detection network for assistance. In this work, we are the first to reveal that this conflict arises from unfair label assignment between detect queries and track queries during training, where detect queries recognize targets and track queries associate them. Based on this observation, we propose MOTRv3, which balances the label assignment process using a release-fetch supervision strategy. In this strategy, labels are first released for detection and then gradually fetched back for association. In addition, two further strategies, pseudo label distillation and track group denoising, are designed to improve the supervision for detection and association. Without the assistance of an extra detection network during inference, MOTRv3 achieves impressive performance across diverse benchmarks, such as MOT17 and DanceTrack.

URL

https://arxiv.org/abs/2305.14298

PDF

https://arxiv.org/pdf/2305.14298.pdf

