
The challenge of representation learning: Improved accuracy in deep vision models does not come with better predictions of perceptual similarity

2023-03-13 13:08:20
Fritz Günther, Marco Marelli, Marco Alessandro Petilli

Abstract

Over the last years, advancements in deep learning models for computer vision have led to a dramatic improvement in their image classification accuracy. However, models with a higher accuracy in the task they were trained on do not necessarily develop better image representations that allow them to also perform better in other tasks they were not trained on. In order to investigate the representation learning capabilities of prominent high-performing computer vision models, we investigated how well they capture various indices of perceptual similarity from large-scale behavioral datasets. We find that higher image classification accuracy rates are not associated with a better performance on these datasets, and in fact we observe no improvement in performance since GoogLeNet (released 2015) and VGG-M (released 2014). We speculate that more accurate classification may result from hyper-engineering towards very fine-grained distinctions between highly similar classes, which does not incentivize the models to capture overall perceptual similarities.
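As an illustration of the kind of evaluation the abstract describes, the following is a minimal sketch of one common way to test whether model representations track perceptual similarity: compute the cosine similarity between image embeddings and rank-correlate it with human similarity ratings. The abstract does not specify the authors' exact procedure, so the similarity measure, the Spearman correlation, and all names and numbers here (`embeddings`, `human_ratings`, the toy vectors) are assumptions for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy data: in practice the vectors would come from a
# layer of a vision model and the ratings from a behavioral dataset.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
human_ratings = {
    ("cat", "dog"): 6.1,
    ("cat", "car"): 1.3,
    ("dog", "car"): 1.8,
}

pairs = list(human_ratings)
model_sims = [cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs]
human_sims = [human_ratings[p] for p in pairs]
rho = spearman(model_sims, human_sims)
print(f"Spearman correlation between model and human similarity: {rho:.3f}")
```

Under this setup, a model whose representations capture perceptual similarity better would yield a higher correlation; repeating the computation across models of differing classification accuracy would give the type of comparison the abstract reports.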

URL

https://arxiv.org/abs/2303.07084

PDF

https://arxiv.org/pdf/2303.07084.pdf

