The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models

2024-04-18 07:22:38
Cheng Shi, Sibei Yang

Abstract

Foundation models, pre-trained on large amounts of data, have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks that rely heavily on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, i.e., these foundation models fail to discern boundaries between individual objects. For the first time, we show that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior in the clustering results of a particular intermediate layer. Following this surprising observation, we propose Zip, which Zips up CLIP and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on the COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves performance comparable to the best-performing open-vocabulary object detectors that use base annotations. Code is released at this https URL.
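The clustering step the abstract alludes to can be made concrete with a short sketch. The snippet below is not the authors' implementation; the checkpoint name, layer index, and cluster count are illustrative assumptions. It clusters patch tokens from an intermediate CLIP layer into a region map whose region edges serve as an instance-level boundary prior:

```python
# Hedged sketch: mine an instance-boundary prior from an intermediate CLIP layer.
import torch
import numpy as np
from sklearn.cluster import KMeans
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16").eval()

@torch.no_grad()
def boundary_prior(image, layer: int = 8, n_clusters: int = 6) -> np.ndarray:
    """Cluster patch tokens of one intermediate layer into a region map.

    `layer` and `n_clusters` are assumptions for illustration; the paper
    identifies a particular intermediate layer, which is not reproduced here.
    """
    inputs = processor(images=image, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    patches = out.hidden_states[layer][0, 1:]   # (196, 768): drop the [CLS] token
    side = int(patches.shape[0] ** 0.5)         # 14x14 patch grid for ViT-B/16 @ 224px
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(patches.numpy())
    # Edges between differently labeled patches approximate instance boundaries.
    return labels.reshape(side, side)
```

In the classification-first-then-discovery pipeline, each classified region would then be decomposed into individual instances; one plausible (assumed) realization is to sample points inside each cluster's connected components and feed them to SAM's point-prompt interface to recover per-instance masks.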

Abstract (translated)

Foundation models, pre-trained on large amounts of data, have demonstrated impressive zero-shot capabilities across a variety of downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks that depend heavily on human annotations, foundation models such as SAM and DINO struggle to deliver satisfactory results. In this study, we reveal that the devil is in the object boundary: these foundation models cannot distinguish the boundaries between individual objects. For the first time, we observe that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial, strong instance-level boundary prior in the clustering results of a particular intermediate layer. Building on this observation, we propose Zip, which combines CLIP and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on the COCO dataset and establishes state-of-the-art performance across various settings, including training-free, self-training, and label-efficient finetuning. Moreover, annotation-free Zip even performs comparably to the best open-vocabulary object detectors that use base annotations. Code is released at this https URL.

URL

https://arxiv.org/abs/2404.11957

PDF

https://arxiv.org/pdf/2404.11957.pdf

