Realistic Model Selection for Weakly Supervised Object Localization

2024-04-15 17:25:21
Shakeeb Murtaza, Soufiane Belharbi, Marco Pedersoli, Eric Granger

Abstract

Weakly Supervised Object Localization (WSOL) allows training deep learning models for classification and localization using only global class-level labels. The lack of bounding box (bbox) supervision during training makes hyper-parameter search and model selection considerably harder. Earlier WSOL works implicitly observed localization performance over the test set, which leads to biased performance evaluation. More recently, a better WSOL protocol has been proposed, in which a validation set with bbox annotations is held out for model selection. Although it does not rely on the test set, this protocol is unrealistic: bboxes are not available in real-world applications, and when they are available, it is better to use them directly to fit model weights. Our initial empirical analysis shows that the localization performance of a model declines significantly when only image-class labels are used for model selection, compared to using bounding-box annotations. This suggests that a localization signal such as bounding-box labels is preferable for selecting the best model for localization. In this paper, we introduce a new WSOL validation protocol that provides a localization signal without the need for manual bbox annotations. In particular, we leverage noisy pseudo boxes from an off-the-shelf ROI proposal generator, such as Selective-Search, CLIP, and RPN pretrained models, for model selection. Our experimental results with several WSOL methods on the ILSVRC and CUB-200-2011 datasets show that our noisy boxes allow selecting models with performance close to that of models selected using ground truth boxes, and better than that of models selected using only image-class labels.
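
The selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`iou`, `localization_accuracy`, `select_checkpoint`), the one-box-per-image setup, and the IoU threshold of 0.5 are assumptions made for illustration. Each candidate checkpoint is scored by how often its predicted box overlaps the noisy pseudo box for the same validation image, and the checkpoint with the highest score is kept.

```python
from typing import Dict, List, Tuple

# A box is (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(
    predicted: List[Box], reference: List[Box], threshold: float = 0.5
) -> float:
    """Fraction of images whose predicted box matches the reference box.

    `reference` holds pseudo boxes here, not manual annotations.
    The 0.5 threshold is an assumed default, not taken from the paper.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        return 0.0
    hits = sum(iou(p, r) >= threshold for p, r in zip(predicted, reference))
    return hits / len(predicted)


def select_checkpoint(
    predictions: Dict[str, List[Box]],
    pseudo_boxes: List[Box],
    threshold: float = 0.5,
) -> str:
    """Return the checkpoint name with the best accuracy against pseudo boxes."""
    return max(
        predictions,
        key=lambda name: localization_accuracy(
            predictions[name], pseudo_boxes, threshold
        ),
    )


# Hypothetical validation data: two images, two candidate checkpoints.
pseudo_boxes = [(10, 10, 50, 50), (20, 20, 80, 80)]
predictions = {
    "epoch_1": [(0, 0, 20, 20), (60, 60, 90, 90)],
    "epoch_2": [(12, 12, 52, 52), (18, 22, 78, 82)],
}
best = select_checkpoint(predictions, pseudo_boxes)
```

In this toy data `epoch_2` overlaps both pseudo boxes well, so it is the checkpoint selected. In practice the pseudo boxes would come from a proposal generator such as those named in the abstract, and selecting by classification accuracy alone would ignore this localization signal entirely.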

URL

https://arxiv.org/abs/2404.10034

PDF

https://arxiv.org/pdf/2404.10034.pdf

