Paper Reading AI Learner

Dense Distinct Query for End-to-End Object Detection

2023-03-22 17:42:22
Shilong Zhang, Xinjiang Wang, Jiaqi Wang, Jiangmiao Pang, Chengqi Lyu, Wenwei Zhang, Ping Luo, Kai Chen

Abstract

One-to-one label assignment in object detection has successfully obviated the need for non-maximum suppression (NMS) as postprocessing and made the pipeline end-to-end. However, it triggers a new dilemma: the widely used sparse queries cannot guarantee a high recall, while dense queries inevitably introduce more similar queries and encounter optimization difficulties. As both sparse and dense queries are problematic, what, then, are the expected queries in end-to-end object detection? This paper shows that the solution is Dense Distinct Queries (DDQ). Concretely, we first lay out dense queries as in traditional detectors and then select distinct ones for one-to-one assignment. DDQ blends the advantages of traditional and recent end-to-end detectors and significantly improves the performance of various detectors, including FCN, R-CNN, and DETRs. Most impressively, DDQ-DETR achieves 52.1 AP on the MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting. DDQ also shares the benefit of end-to-end detectors in crowded scenes and achieves 93.8 AP on CrowdHuman. We hope DDQ can inspire researchers to consider the complementarity between traditional methods and end-to-end detectors. The source code can be found at \url{this https URL}.
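The core idea in the abstract — lay out dense candidate queries, then keep only distinct ones before one-to-one assignment — can be illustrated with a greedy IoU-based selection. The sketch below is a simplified, hypothetical rendering of that "distinct queries" step, not the paper's actual implementation; the function names, threshold, and top-k value are illustrative assumptions.

```python
# Hypothetical sketch of selecting distinct queries from dense candidates:
# high-scoring boxes are kept greedily, and any candidate that overlaps an
# already-kept box above an IoU threshold is suppressed, so the surviving
# queries are distinct before one-to-one label assignment.

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_distinct_queries(boxes, scores, iou_thr=0.7, topk=300):
    """Greedily keep high-scoring boxes whose overlap with every
    already-kept box is below iou_thr (illustrative values)."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in kept):
            kept.append(i)
        if len(kept) == topk:
            break
    return kept
```

For example, two near-duplicate boxes at the same location collapse to one kept query, while a box elsewhere in the image survives, which is how dense coverage and distinctness can coexist.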

URL

https://arxiv.org/abs/2303.12776

PDF

https://arxiv.org/pdf/2303.12776.pdf
