Abstract
Weakly Supervised Object Localization (WSOL) trains deep learning models for classification and localization using only global class-level labels. The lack of bounding box (bbox) supervision during training poses a considerable challenge for hyper-parameter search and model selection. Earlier WSOL works implicitly observed localization performance over the test set, which leads to biased performance evaluation. More recently, a better WSOL protocol has been proposed, in which a validation set with bbox annotations is held out for model selection. Although it does not rely on the test set, this protocol is unrealistic: bboxes are not available in real-world applications, and when they are available, it is better to use them directly to fit model weights. Our initial empirical analysis shows that the localization performance of a model declines significantly when only image-class labels are used for model selection, compared to using bbox annotations. This suggests that bbox labels are preferable to image-class labels for selecting the best model for localization. In this paper, we introduce a new WSOL validation protocol that provides a localization signal without the need for manual bbox annotations. In particular, we leverage noisy pseudo boxes from an off-the-shelf ROI proposal generator, such as Selective-Search, CLIP, or RPN pretrained models, for model selection. Our experimental results with several WSOL methods on the ILSVRC and CUB-200-2011 datasets show that our noisy boxes allow selecting models with performance close to that of models selected using ground-truth boxes, and better than that of models selected using only image-class labels.
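The selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the localization metric, so the IoU threshold of 0.5, the one-box-per-image assumption, and every name below (`iou`, `box_accuracy`, `select_checkpoint`, the checkpoint labels, the toy boxes) are hypothetical choices made for illustration. The pseudo boxes stand in for the output of an off-the-shelf proposal generator and are hard-coded here.

```python
from typing import Dict, List, Tuple

# A box is (x_min, y_min, x_max, y_max) in pixels. Assumption: one box per image.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_accuracy(predicted: List[Box], reference: List[Box],
                 threshold: float = 0.5) -> float:
    """Fraction of images whose predicted box overlaps the reference box
    with IoU >= threshold (the threshold value is an assumption)."""
    if not predicted:
        return 0.0
    hits = sum(iou(p, r) >= threshold for p, r in zip(predicted, reference))
    return hits / len(predicted)


def select_checkpoint(predictions: Dict[str, List[Box]],
                      pseudo_boxes: List[Box]) -> str:
    """Return the checkpoint whose predictions best match the noisy pseudo boxes."""
    return max(predictions,
               key=lambda name: box_accuracy(predictions[name], pseudo_boxes))


# Toy validation set of two images; pseudo boxes would come from a proposal
# generator in practice (hard-coded here for illustration).
pseudo_boxes = [(10, 10, 50, 50), (20, 20, 80, 80)]
candidates = {
    "epoch_1": [(0, 0, 20, 20), (0, 0, 30, 30)],
    "epoch_2": [(12, 12, 50, 50), (25, 20, 80, 80)],
}
best = select_checkpoint(candidates, pseudo_boxes)
print(best)
```

In this toy example, neither `epoch_1` box reaches an IoU of 0.5 with its pseudo box, while both `epoch_2` boxes do, so `epoch_2` is selected. Swapping `pseudo_boxes` for ground-truth boxes or for a classification-accuracy criterion would give the other two selection strategies that the abstract compares.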
URL
https://arxiv.org/abs/2404.10034