Abstract
This paper introduces a novel framework for zero-shot learning (ZSL), i.e., recognizing categories unseen during training, based on a multi-model and multi-alignment integration method. Specifically, we propose three strategies to improve the model's ZSL performance: 1) using the extensive knowledge of ChatGPT and the powerful image generation capabilities of DALL-E to create reference images that precisely describe unseen categories and classification boundaries, thereby alleviating the information-bottleneck issue; 2) integrating the text-image and image-image alignment results from CLIP with the image-image alignment results from DINO to achieve more accurate predictions; 3) introducing an adaptive, confidence-based weighting mechanism to aggregate the outcomes of the different prediction methods. Experimental results on multiple datasets, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our model significantly improves classification accuracy over single-model approaches, achieving AUROC scores above 96% on all test datasets and surpassing 99% on CIFAR-10.
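The confidence-based adaptive weighting in strategy 3 can be illustrated with a minimal sketch. The sketch below assumes each alignment head (CLIP text-image, CLIP image-image against DALL-E reference images, DINO image-image) yields a per-class similarity vector, and uses the maximum softmax probability as the confidence measure; both the function names and the choice of confidence measure are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def confidence_weighted_ensemble(score_vectors):
    """Aggregate per-class similarity vectors from several alignment heads.

    Each head's scores are converted to probabilities; its weight is its
    confidence (max probability, an illustrative choice), normalized so
    the weights sum to 1. Returns the combined class distribution.
    """
    probs = [softmax(s) for s in score_vectors]
    confidences = np.array([p.max() for p in probs])
    weights = confidences / confidences.sum()
    return sum(w * p for w, p in zip(weights, probs))

# Toy example: three alignment heads scoring 4 candidate classes.
clip_text_image  = np.array([2.0, 0.1, 0.1, 0.1])  # CLIP text-image alignment
clip_image_image = np.array([1.5, 0.2, 0.2, 0.2])  # CLIP vs. DALL-E reference images
dino_image_image = np.array([0.5, 0.4, 0.4, 0.4])  # DINO image-image alignment

combined = confidence_weighted_ensemble(
    [clip_text_image, clip_image_image, dino_image_image]
)
print(combined.argmax())  # predicted class index
```

Heads that produce a peaked (high-confidence) distribution contribute more to the final prediction, while near-uniform heads are down-weighted.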
URL
https://arxiv.org/abs/2405.02155