Diverse, Difficult, and Odd Instances: A New Test Set for Object Classification

Abstract

Test sets are an integral part of evaluating models and gauging progress in object recognition, and more broadly in computer vision and AI. Existing test sets for object recognition, however, suffer from shortcomings such as bias towards ImageNet's characteristics and idiosyncrasies (e.g., ImageNet-V2), being limited to certain types of stimuli (e.g., indoor scenes in ObjectNet), and underestimating model performance (e.g., ImageNet-A). To mitigate these problems, we introduce a new test set, called D2O, which is sufficiently different from existing test sets. Its images are a mix of generated images and images crawled from the web. They are diverse, unmodified, and representative of real-world scenarios, and they cause state-of-the-art models to misclassify them with high confidence. To emphasize generalization, our dataset by design does not come paired with a training set. It contains 8,060 images spread across 36 categories, of which 29 appear in ImageNet. The best Top-1 accuracy on our dataset is around 60%, much lower than the best Top-1 accuracy of 91% on ImageNet. We find that popular vision APIs perform very poorly in detecting objects over D2O categories such as "faces", "cars", and "cats". Our dataset also comes with a "miscellaneous" category, over which we test image tagging models. Overall, our investigations demonstrate that the D2O test set contains a mix of images with varied levels of difficulty and is predictive of the average-case performance of models. It can challenge object recognition models for years to come and can spur more research in this fundamental area.

URL

https://arxiv.org/abs/2301.12527

PDF

https://arxiv.org/pdf/2301.12527.pdf