Abstract
Over the past few years, advances in deep learning models for computer vision have led to dramatic improvements in image classification accuracy. However, models with higher accuracy on the task they were trained on do not necessarily develop better image representations that also let them perform better on tasks they were not trained on. To probe the representation learning capabilities of prominent high-performing computer vision models, we examined how well they capture various indices of perceptual similarity derived from large-scale behavioral datasets. We find that higher image classification accuracy is not associated with better performance on these datasets, and in fact we observe no improvement in performance since GoogLeNet (released 2015) and VGG-M (released 2014). We speculate that more accurate classification may result from hyper-engineering toward very fine-grained distinctions between highly similar classes, which does not incentivize the models to capture overall perceptual similarities.
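The kind of evaluation the abstract describes can be approximated in a few lines: extract embeddings from a pretrained ImageNet classifier and rank-correlate their pairwise cosine similarities with human similarity judgments. The following is a minimal sketch, not the paper's actual pipeline; the choice of ResNet-50, the `similarity_judgments.csv` file, and its column names are illustrative assumptions.

```python
# Sketch: correlate a pretrained model's embedding similarities with
# human perceptual similarity ratings. Dataset file and columns are
# hypothetical placeholders, not the paper's data.
import csv

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from scipy.stats import spearmanr

# Pretrained ImageNet classifier (example model, not from the paper);
# we use its penultimate-layer features as the image representation.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Identity()  # drop the classification head
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return model(img).squeeze(0)

model_sims, human_sims = [], []
with open("similarity_judgments.csv") as f:  # hypothetical behavioral data
    for row in csv.DictReader(f):
        a, b = embed(row["image_a"]), embed(row["image_b"])
        model_sims.append(torch.cosine_similarity(a, b, dim=0).item())
        human_sims.append(float(row["human_rating"]))

# Rank correlation between model and human similarity: higher means the
# representation better captures perceived similarity.
rho, _ = spearmanr(model_sims, human_sims)
print(f"Spearman rho = {rho:.3f}")
```

Under this setup, the abstract's claim corresponds to the correlation plateauing across model generations even as ImageNet accuracy keeps rising.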
URL
https://arxiv.org/abs/2303.07084