Paper Reading AI Learner

Multi-method Integration with Confidence-based Weighting for Zero-shot Image Classification

2024-05-03 15:02:41
Siqi Yin, Lifan Jiang

Abstract

This paper introduces a novel framework for zero-shot learning (ZSL), i.e., recognizing categories unseen during training, based on a multi-model and multi-alignment integration method. Specifically, we propose three strategies to enhance the model's ZSL performance: 1) utilizing the extensive knowledge of ChatGPT and the powerful image generation capabilities of DALL-E to create reference images that precisely describe unseen categories and classification boundaries, thereby alleviating the information bottleneck issue; 2) integrating the text-image alignment and image-image alignment results from CLIP with the image-image alignment results from DINO to achieve more accurate predictions; 3) introducing an adaptive, confidence-based weighting mechanism to aggregate the outcomes of the different prediction methods. Experimental results on multiple datasets, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our model significantly improves classification accuracy over single-model approaches, achieving AUROC scores above 96% on all test datasets and surpassing 99% on CIFAR-10.
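To make the third strategy concrete, here is a minimal sketch of confidence-weighted fusion of several per-class score vectors (e.g., CLIP text-image, CLIP image-image, and DINO image-image similarities computed elsewhere). The softmax-margin "confidence" and the normalized weighting used here are illustrative assumptions, not necessarily the paper's exact formulation.

```python
# Minimal sketch: aggregate class scores from several prediction methods
# using confidence-based adaptive weights (illustrative, not the paper's code).
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def confidence(probs):
    """Illustrative confidence: gap between the top-2 class probabilities."""
    top2 = np.sort(probs)[-2:]
    return top2[1] - top2[0]

def fuse_predictions(score_vectors):
    """Fuse per-method class scores; more confident methods get larger weights."""
    probs = [softmax(s) for s in score_vectors]
    conf = np.array([confidence(p) for p in probs])
    weights = conf / conf.sum() if conf.sum() > 0 else np.full(len(probs), 1.0 / len(probs))
    fused = sum(w * p for w, p in zip(weights, probs))
    return int(fused.argmax()), fused

if __name__ == "__main__":
    # Hypothetical per-class similarity scores from three alignment methods.
    clip_text_image  = np.array([0.31, 0.28, 0.41])
    clip_image_image = np.array([0.25, 0.30, 0.45])
    dino_image_image = np.array([0.33, 0.33, 0.34])
    label, fused = fuse_predictions([clip_text_image, clip_image_image, dino_image_image])
    print(label, np.round(fused, 3))
```

In this sketch, a method whose probability distribution is sharply peaked contributes more to the final decision, which mirrors the adaptive confidence-based weighting described in the abstract.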


URL

https://arxiv.org/abs/2405.02155

PDF

https://arxiv.org/pdf/2405.02155.pdf

