Abstract
This paper introduces a novel approach that leverages features learned from both supervised and self-supervised paradigms to improve image classification, specifically vehicle classification. Two state-of-the-art self-supervised learning methods, DINO and data2vec, were evaluated and compared for their representation learning of vehicle images: the former contrasts local and global views, while the latter uses masked prediction over multi-layered representations. Separately, supervised learning was employed to fine-tune a pretrained YOLOR object detector to detect vehicle wheels, from which definitive wheel positional features were extracted. The representations learned by the self-supervised methods were then combined with these wheel positional features for the vehicle classification task. In particular, a random wheel-masking strategy was used during classifier training to fine-tune the previously learned representations in harmony with the wheel positional features. Our experiments show that the data2vec-distilled representations, which are consistent with our wheel-masking strategy, outperformed their DINO counterparts, yielding a Top-1 classification accuracy of 97.2% on the 13 vehicle classes defined by the Federal Highway Administration.
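The fusion described above can be illustrated with a minimal sketch: an image embedding from the self-supervised backbone is concatenated with wheel positional features, and during training each wheel's positional entry is randomly zeroed out to mimic the random wheel-masking strategy. The function name, the masking rate, and the zero-fill convention are illustrative assumptions, not the paper's exact configuration.

```python
import random

def fuse_features(ssl_embedding, wheel_positions, mask_prob=0.5,
                  training=True, rng=None):
    """Concatenate an SSL image embedding with wheel positional features.

    During training, each wheel's positional entry is zeroed ("masked")
    with probability mask_prob, a sketch of the random wheel-masking
    strategy; at inference time all positions are kept. All names and
    the default mask_prob are hypothetical.
    """
    rng = rng or random.Random()
    masked = []
    for pos in wheel_positions:
        if training and rng.random() < mask_prob:
            masked.append(0.0)  # mask this wheel's positional feature
        else:
            masked.append(pos)  # keep the detected wheel position
    # The classifier would consume this concatenated feature vector.
    return list(ssl_embedding) + masked
```

At inference (`training=False`) the fused vector is simply the embedding followed by all wheel positions, so the classifier sees the same feature layout it was trained on.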
URL
https://arxiv.org/abs/2302.00648